Anomaly detection is the task of identifying observations whose characteristics are significantly different from the rest of the data. Such observations are known as anomalies or outliers. The goal of an anomaly detection algorithm is to discover the real anomalies and avoid falsely labeling normal objects as anomalous. In other words, a good anomaly detector must have a high detection rate and a low false alarm rate. Applications of anomaly detection include the detection of fraud, network intrusions, unusual patterns of disease, and ecosystem disturbances.
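The two rates can be made concrete with a small sketch. This is only an illustration: the function name and the labels below are hypothetical, not data or terminology from the text.

```python
def detection_rates(actual, predicted):
    """Compute detection rate (fraction of true anomalies caught) and
    false alarm rate (fraction of normal objects wrongly flagged).

    actual/predicted are parallel lists of booleans: True = anomaly.
    """
    tp = sum(a and p for a, p in zip(actual, predicted))          # true anomalies flagged
    fp = sum((not a) and p for a, p in zip(actual, predicted))    # normal objects flagged
    anomalies = sum(actual)
    normals = len(actual) - anomalies
    return tp / anomalies, fp / normals

# Hypothetical ground truth and detector output for six objects.
actual    = [True, True, False, False, False, False]
predicted = [True, False, True, False, False, False]
dr, far = detection_rates(actual, predicted)
print(dr, far)  # 0.5 0.25
```

A good detector pushes the first number toward 1 and the second toward 0; the tension between the two is what makes the problem hard when anomalies are rare.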

Example 1.4 (Credit Card Fraud Detection). A credit card company records the transactions made by every credit card holder, along with personal information such as credit limit, age, annual income, and address. Since the number of fraudulent cases is relatively small compared to the number of legitimate transactions, anomaly detection techniques can be applied to build a profile of legitimate transactions for the users. When a new transaction arrives, it is compared against the profile of the user. If the characteristics of the transaction are very different from the previously created profile, then the transaction is flagged as potentially fraudulent.
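The profile-based check described in this example can be sketched as follows. This is a minimal illustration that reduces a user's profile to the mean and standard deviation of past transaction amounts; the function names and the three-standard-deviation threshold are illustrative choices, not the book's method.

```python
import math

def build_profile(amounts):
    """Summarize a user's legitimate transactions by mean and std of amount."""
    n = len(amounts)
    mean = sum(amounts) / n
    var = sum((a - mean) ** 2 for a in amounts) / n
    return mean, math.sqrt(var)

def is_anomalous(amount, profile, threshold=3.0):
    """Flag a new transaction whose amount deviates too far from the profile."""
    mean, std = profile
    if std == 0:
        return amount != mean
    return abs(amount - mean) / std > threshold

history = [25.0, 40.0, 32.0, 28.0, 35.0]   # past legitimate transactions
profile = build_profile(history)
print(is_anomalous(30.0, profile))   # typical amount  -> False
print(is_anomalous(900.0, profile))  # far outside the profile -> True
```

A real profile would, as the example notes, combine many attributes of the transaction, but the structure is the same: summarize legitimate behavior once, then score each new transaction against that summary.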

1.5 Scope and Organization of the Book

This book introduces the major principles and techniques used in data mining from an algorithmic perspective. A study of these principles and techniques is essential for developing a better understanding of how data mining technology can be applied to various kinds of data. This book also serves as a starting point for readers who are interested in doing research in this field.

We begin the technical discussion of this book with a chapter on data (Chapter 2), which discusses the basic types of data, data quality, preprocessing techniques, and measures of similarity and dissimilarity. Although this material can be covered quickly, it provides an essential foundation for data analysis. Chapter 3, on data exploration, discusses summary statistics, visualization techniques, and On-Line Analytical Processing (OLAP). These techniques provide the means for quickly gaining insight into a data set.

Chapters 4 and 5 cover classification. Chapter 4 provides a foundation by discussing decision tree classifiers and several issues that are important to all classification: overfitting, performance evaluation, and the comparison of different classification models. Using this foundation, Chapter 5 describes a number of other important classification techniques: rule-based systems, nearest-neighbor classifiers, Bayesian classifiers, artificial neural networks, support vector machines, and ensemble classifiers, which are collections of classifiers. The multiclass and imbalanced class problems are also discussed. These topics can be covered independently.

Association analysis is explored in Chapters 6 and 7. Chapter 6 describes the basics of association analysis: frequent itemsets, association rules, and some of the algorithms used to generate them. Specific types of frequent itemsets (maximal, closed, and hyperclique) that are important for data mining are also discussed, and the chapter concludes with a discussion of evaluation measures for association analysis. Chapter 7 considers a variety of more advanced topics, including how association analysis can be applied to categorical and continuous data or to data that has a concept hierarchy. (A concept hierarchy is a hierarchical categorization of objects, e.g., store items, clothing, shoes, sneakers.) This chapter also describes how association analysis can be extended to find sequential patterns (patterns involving order), patterns in graphs, and negative relationships (if one item is present, then the other is not).

Cluster analysis is discussed in Chapters 8 and 9. Chapter 8 first describes the different types of clusters and then presents three specific clustering techniques: K-means, agglomerative hierarchical clustering, and DBSCAN. This is followed by a discussion of techniques for validating the results of a clustering algorithm. Additional clustering concepts and techniques are explored in Chapter 9, including fuzzy and probabilistic clustering, Self-Organizing Maps (SOM), graph-based clustering, and density-based clustering. There is also a discussion of scalability issues and factors to consider when selecting a clustering algorithm.

The last chapter, Chapter 10, is on anomaly detection. After some basic definitions, several different types of anomaly detection are considered: statistical, distance-based, density-based, and clustering-based. Appendices A through E give a brief review of important topics that are used in portions of the book: linear algebra, dimensionality reduction, statistics, regression, and optimization.

The subject of data mining, while relatively young compared to statistics or machine learning, is already too large to cover in a single book. Selected references to topics that are only briefly covered, such as data quality, are provided in the bibliographic notes of the appropriate chapter. References to topics not covered in this book, such as data mining for streams and privacy-preserving data mining, are provided in the bibliographic notes of this chapter.


1.6 Bibliographic Notes

The topic of data mining has inspired many textbooks. Introductory textbooks include those by Dunham [10], Han and Kamber [21], Hand et al. [23], and Roiger and Geatz [36]. Data mining books with a stronger emphasis on business applications include the works by Berry and Linoff [2], Pyle [34], and Parr Rud [33]. Books with an emphasis on statistical learning include those by Cherkassky and Mulier [6], and Hastie et al. [24]. Some books with an emphasis on machine learning or pattern recognition are those by Duda et al. [9], Kantardzic [25], Mitchell [31], Webb [41], and Witten and Frank [42]. There are also some more specialized books: Chakrabarti [4] (web mining), Fayyad et al. [13] (collection of early articles on data mining), Fayyad et al. [11] (visualization), Grossman et al. [18] (science and engineering), Kargupta and Chan [26] (distributed data mining), Wang et al. [40] (bioinformatics), and Zaki and Ho [44] (parallel data mining).

There are several conferences related to data mining. Some of the main conferences dedicated to this field include the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), the IEEE International Conference on Data Mining (ICDM), the SIAM International Conference on Data Mining (SDM), the European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), and the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD). Data mining papers can also be found in other major conferences such as the ACM SIGMOD/PODS conference, the International Conference on Very Large Data Bases (VLDB), the Conference on Information and Knowledge Management (CIKM), the International Conference on Data Engineering (ICDE), the International Conference on Machine Learning (ICML), and the National Conference on Artificial Intelligence (AAAI).

Journal publications on data mining include IEEE Transactions on Knowledge and Data Engineering, Data Mining and Knowledge Discovery, Knowledge and Information Systems, Intelligent Data Analysis, Information Systems, and the Journal of Intelligent Information Systems.

There have been a number of general articles on data mining that define the field or its relationship to other fields, particularly statistics. Fayyad et al. [12] describe data mining and how it fits into the total knowledge discovery process. Chen et al. [5] give a database perspective on data mining. Ramakrishnan and Grama [35] provide a general discussion of data mining and present several viewpoints. Hand [22] describes how data mining differs from statistics, as does Friedman [14]. Lambert [29] explores the use of statistics for large data sets and provides some comments on the respective roles of data mining and statistics.



Glymour et al. [16] consider the lessons that statistics may have for data mining. Smyth et al. [38] describe how the evolution of data mining is being driven by new types of data and applications, such as those involving streams, graphs, and text. Emerging applications in data mining are considered by Han et al. [20], and Smyth [37] describes some research challenges in data mining. A discussion of how developments in data mining research can be turned into practical tools is given by Wu et al. [43]. Data mining standards are the subject of a paper by Grossman et al. [17]. Bradley [3] discusses how data mining algorithms can be scaled to large data sets.

With the emergence of new data mining applications have come new challenges that need to be addressed. For instance, concerns about privacy breaches as a result of data mining have escalated in recent years, particularly in application domains such as Web commerce and health care. As a result, there is growing interest in developing data mining algorithms that maintain user privacy. Developing techniques for mining encrypted or randomized data is known as privacy-preserving data mining. Some general references in this area include papers by Agrawal and Srikant [1], Clifton et al. [7], and Kargupta et al. [27]. Vassilios et al. [39] provide a survey.

Recent years have witnessed a growing number of applications that rapidly generate continuous streams of data. Examples of stream data include network traffic, multimedia streams, and stock prices. Several issues must be considered when mining data streams, such as the limited amount of memory available, the need for online analysis, and the change of the data over time. Data mining for stream data has become an important area in data mining. Some selected publications are Domingos and Hulten [8] (classification), Giannella et al. [15] (association analysis), Guha et al. [19] (clustering), Kifer et al. [28] (change detection), Papadimitriou et al. [32] (time series), and Law et al. [30] (dimensionality reduction).

Bibliography

[1] R. Agrawal and R. Srikant. Privacy-preserving data mining. In Proc. of 2000 ACM-SIGMOD Intl. Conf. on Management of Data, pages 439-450, Dallas, Texas, 2000. ACM Press.

[2] M. J. A. Berry and G. Linoff. Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management. Wiley Computer Publishing, 2nd edition, 2004.

[3] P. S. Bradley, J. Gehrke, R. Ramakrishnan, and R. Srikant. Scaling mining algorithms to large databases. Communications of the ACM, 45(8):38-43, 2002.

[4] S. Chakrabarti. Mining the Web: Discovering Knowledge from Hypertext Data. Morgan Kaufmann, San Francisco, CA, 2003.

[5] M.-S. Chen, J. Han, and P. S. Yu. Data Mining: An Overview from a Database Perspective. IEEE Transactions on Knowledge and Data Engineering, 8(6):866-883, 1996.

[6] V. Cherkassky and F. Mulier. Learning from Data: Concepts, Theory, and Methods. Wiley Interscience, 1998.

[7] C. Clifton, M. Kantarcioglu, and J. Vaidya. Defining privacy for data mining. In National Science Foundation Workshop on Next Generation Data Mining, pages 126-133, Baltimore, MD, November 2002.

[8] P. Domingos and G. Hulten. Mining high-speed data streams. In Proc. of the 6th Intl. Conf. on Knowledge Discovery and Data Mining, pages 71-80, Boston, Massachusetts, 2000. ACM Press.

[9] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley & Sons, Inc., New York, 2nd edition, 2001.