2.2 Data Quality
    2.2.1 Measurement and Data Collection Issues
    2.2.2 Issues Related to Applications
2.3 Data Preprocessing
    2.3.1 Aggregation
    2.3.2 Sampling
    2.3.3 Dimensionality Reduction
    2.3.4 Feature Subset Selection
    2.3.5 Feature Creation
    2.3.6 Discretization and Binarization
    2.3.7 Variable Transformation
2.4 Measures of Similarity and Dissimilarity
    2.4.1 Basics
    2.4.2 Similarity and Dissimilarity between Simple Attributes
    2.4.3 Dissimilarities between Data Objects
    2.4.4 Similarities between Data Objects
    2.4.5 Examples of Proximity Measures
    2.4.6 Issues in Proximity Calculation
    2.4.7 Selecting the Right Proximity Measure
2.5 Bibliographic Notes
2.6 Exercises
3 Exploring Data
3.1 The Iris Data Set
3.2 Summary Statistics
    3.2.1 Frequencies and the Mode
    3.2.2 Percentiles
    3.2.3 Measures of Location: Mean and Median
    3.2.4 Measures of Spread: Range and Variance
    3.2.5 Multivariate Summary Statistics
    3.2.6 Other Ways to Summarize the Data
3.3 Visualization
    3.3.1 Motivations for Visualization
    3.3.2 General Concepts
    3.3.3 Techniques
    3.3.4 Visualizing Higher-Dimensional Data
    3.3.5 Do’s and Don’ts
3.4 OLAP and Multidimensional Data Analysis
    3.4.1 Representing Iris Data as a Multidimensional Array
    3.4.2 Multidimensional Data: The General Case
    3.4.3 Analyzing Multidimensional Data
    3.4.4 Final Comments on Multidimensional Data Analysis
3.5 Bibliographic Notes
3.6 Exercises
4 Classification: Basic Concepts, Decision Trees, and Model Evaluation
4.1 Preliminaries
4.2 General Approach to Solving a Classification Problem
4.3 Decision Tree Induction
    4.3.1 How a Decision Tree Works
    4.3.2 How to Build a Decision Tree
    4.3.3 Methods for Expressing Attribute Test Conditions
    4.3.4 Measures for Selecting the Best Split
    4.3.5 Algorithm for Decision Tree Induction
    4.3.6 An Example: Web Robot Detection
    4.3.7 Characteristics of Decision Tree Induction
4.4 Model Overfitting
    4.4.1 Overfitting Due to Presence of Noise
    4.4.2 Overfitting Due to Lack of Representative Samples
    4.4.3 Overfitting and the Multiple Comparison Procedure
    4.4.4 Estimation of Generalization Errors
    4.4.5 Handling Overfitting in Decision Tree Induction
4.5 Evaluating the Performance of a Classifier
    4.5.1 Holdout Method
    4.5.2 Random Subsampling
    4.5.3 Cross-Validation
    4.5.4 Bootstrap
4.6 Methods for Comparing Classifiers
    4.6.1 Estimating a Confidence Interval for Accuracy
    4.6.2 Comparing the Performance of Two Models
    4.6.3 Comparing the Performance of Two Classifiers
4.7 Bibliographic Notes
4.8 Exercises
5 Classification: Alternative Techniques
5.1 Rule-Based Classifier
    5.1.1 How a Rule-Based Classifier Works
    5.1.2 Rule-Ordering Schemes
    5.1.3 How to Build a Rule-Based Classifier
    5.1.4 Direct Methods for Rule Extraction
    5.1.5 Indirect Methods for Rule Extraction
    5.1.6 Characteristics of Rule-Based Classifiers
5.2 Nearest-Neighbor Classifiers
    5.2.1 Algorithm
    5.2.2 Characteristics of Nearest-Neighbor Classifiers
5.3 Bayesian Classifiers
    5.3.1 Bayes Theorem
    5.3.2 Using the Bayes Theorem for Classification
    5.3.3 Naive Bayes Classifier
    5.3.4 Bayes Error Rate
    5.3.5 Bayesian Belief Networks
5.4 Artificial Neural Network (ANN)
    5.4.1 Perceptron
    5.4.2 Multilayer Artificial Neural Network
    5.4.3 Characteristics of ANN
5.5 Support Vector Machine (SVM)
    5.5.1 Maximum Margin Hyperplanes
    5.5.2 Linear SVM: Separable Case
    5.5.3 Linear SVM: Nonseparable Case
    5.5.4 Nonlinear SVM
    5.5.5 Characteristics of SVM
5.6 Ensemble Methods
    5.6.1 Rationale for Ensemble Method
    5.6.2 Methods for Constructing an Ensemble Classifier
    5.6.3 Bias-Variance Decomposition
    5.6.4 Bagging
    5.6.5 Boosting
    5.6.6 Random Forests
    5.6.7 Empirical Comparison among Ensemble Methods
5.7 Class Imbalance Problem
    5.7.1 Alternative Metrics
    5.7.2 The Receiver Operating Characteristic Curve
    5.7.3 Cost-Sensitive Learning
    5.7.4 Sampling-Based Approaches
5.8 Multiclass Problem
5.9 Bibliographic Notes
5.10 Exercises
6 Association Analysis: Basic Concepts and Algorithms
6.1 Problem Definition
6.2 Frequent Itemset Generation
    6.2.1 The Apriori Principle
    6.2.2 Frequent Itemset Generation in the Apriori Algorithm
    6.2.3 Candidate Generation and Pruning
    6.2.4 Support Counting
    6.2.5 Computational Complexity
6.3 Rule Generation
    6.3.1 Confidence-Based Pruning
    6.3.2 Rule Generation in Apriori Algorithm
    6.3.3 An Example: Congressional Voting Records
6.4 Compact Representation of Frequent Itemsets
    6.4.1 Maximal Frequent Itemsets
    6.4.2 Closed Frequent Itemsets
6.5 Alternative Methods for Generating Frequent Itemsets
6.6 FP-Growth Algorithm
    6.6.1 FP-Tree Representation
    6.6.2 Frequent Itemset Generation in FP-Growth Algorithm
6.7 Evaluation of Association Patterns
    6.7.1 Objective Measures of Interestingness
    6.7.2 Measures beyond Pairs of Binary Variables
    6.7.3 Simpson’s Paradox
6.8 Effect of Skewed Support Distribution
6.9 Bibliographic Notes
6.10 Exercises
7 Association Analysis: Advanced Concepts
7.1 Handling Categorical Attributes
7.2 Handling Continuous Attributes
    7.2.1 Discretization-Based Methods
    7.2.2 Statistics-Based Methods
    7.2.3 Non-discretization Methods
7.3 Handling a Concept Hierarchy
7.4 Sequential Patterns
    7.4.1 Problem Formulation
    7.4.2 Sequential Pattern Discovery
    7.4.3 Timing Constraints
    7.4.4 Alternative Counting Schemes
7.5 Subgraph Patterns
    7.5.1 Graphs and Subgraphs
    7.5.2 Frequent Subgraph Mining
    7.5.3 Apriori-like Method
    7.5.4 Candidate Generation
    7.5.5 Candidate Pruning
    7.5.6 Support Counting
7.6 Infrequent Patterns
    7.6.1 Negative Patterns
    7.6.2 Negatively Correlated Patterns
    7.6.3 Comparisons among Infrequent Patterns, Negative Patterns, and Negatively Correlated Patterns
    7.6.4 Techniques for Mining Interesting Infrequent Patterns
    7.6.5 Techniques Based on Mining Negative Patterns
    7.6.6 Techniques Based on Support Expectation
7.7 Bibliographic Notes
7.8 Exercises
8 Cluster Analysis: Basic Concepts and Algorithms
8.1 Overview
    8.1.1 What Is Cluster Analysis?
    8.1.2 Different Types of Clusterings
    8.1.3 Different Types of Clusters
8.2 K-means
    8.2.1 The Basic K-means Algorithm
    8.2.2 K-means: Additional Issues
    8.2.3 Bisecting K-means
    8.2.4 K-means and Different Types of Clusters
    8.2.5 Strengths and Weaknesses
    8.2.6 K-means as an Optimization Problem
8.3 Agglomerative Hierarchical Clustering
    8.3.1 Basic Agglomerative Hierarchical Clustering Algorithm
    8.3.2 Specific Techniques
    8.3.3 The Lance-Williams Formula for Cluster Proximity
    8.3.4 Key Issues in Hierarchical Clustering
    8.3.5 Strengths and Weaknesses
8.4 DBSCAN
    8.4.1 Traditional Density: Center-Based Approach
    8.4.2 The DBSCAN Algorithm
    8.4.3 Strengths and Weaknesses
8.5 Cluster Evaluation
    8.5.1 Overview
    8.5.2 Unsupervised Cluster Evaluation Using Cohesion and Separation
    8.5.3 Unsupervised Cluster Evaluation Using the Proximity Matrix
    8.5.4 Unsupervised Evaluation of Hierarchical Clustering
    8.5.5 Determining the Correct Number of Clusters
    8.5.6 Clustering Tendency
    8.5.7 Supervised Measures of Cluster Validity
    8.5.8 Assessing the Significance of Cluster Validity Measures
8.6 Bibliographic Notes
8.7 Exercises
9 Cluster Analysis: Additional Issues and Algorithms
9.1 Characteristics of Data, Clusters, and Clustering Algorithms
    9.1.1 Example: Comparing K-means and DBSCAN
    9.1.2 Data Characteristics
    9.1.3 Cluster Characteristics
    9.1.4 General Characteristics of Clustering Algorithms
9.2 Prototype-Based Clustering
    9.2.1 Fuzzy Clustering
    9.2.2 Clustering Using Mixture Models
    9.2.3 Self-Organizing Maps (SOM)
9.3 Density-Based Clustering
    9.3.1 Grid-Based Clustering
    9.3.2 Subspace Clustering
    9.3.3 DENCLUE: A Kernel-Based Scheme for Density-Based Clustering
9.4 Graph-Based Clustering
    9.4.1 Sparsification
    9.4.2 Minimum Spanning Tree (MST) Clustering
    9.4.3 OPOSSUM: Optimal Partitioning of Sparse Similarities Using METIS
    9.4.4 Chameleon: Hierarchical Clustering with Dynamic Modeling
    9.4.5 Shared Nearest Neighbor Similarity
    9.4.6 The Jarvis-Patrick Clustering Algorithm
    9.4.7 SNN Density
    9.4.8 SNN Density-Based Clustering
9.5 Scalable Clustering Algorithms
    9.5.1 Scalability: General Issues and Approaches
    9.5.2 BIRCH
    9.5.3 CURE
9.6 Which Clustering Algorithm?
9.7 Bibliographic Notes
9.8 Exercises
10 Anomaly Detection
10.1 Preliminaries
    10.1.1 Causes of Anomalies
    10.1.2 Approaches to Anomaly Detection
    10.1.3 The Use of Class Labels
    10.1.4 Issues
10.2 Statistical Approaches
    10.2.1 Detecting Outliers in a Univariate Normal Distribution
    10.2.2 Outliers in a Multivariate Normal Distribution
    10.2.3 A Mixture Model Approach for Anomaly Detection
    10.2.4 Strengths and Weaknesses
10.3 Proximity-Based Outlier Detection
    10.3.1 Strengths and Weaknesses
10.4 Density-Based Outlier Detection
    10.4.1 Detection of Outliers Using Relative Density
    10.4.2 Strengths and Weaknesses
10.5 Clustering-Based Techniques
    10.5.1 Assessing the Extent to Which an Object Belongs to a Cluster
    10.5.2 Impact of Outliers on the Initial Clustering
    10.5.3 The Number of Clusters to Use
    10.5.4 Strengths and Weaknesses
10.6 Bibliographic Notes
10.7 Exercises
Appendix A Linear Algebra
A.1 Vectors
    A.1.1 Definition
    A.1.2 Vector Addition and Multiplication by a Scalar
    A.1.3 Vector Spaces
    A.1.4 The Dot Product, Orthogonality, and Orthogonal Projections
    A.1.5 Vectors and Data Analysis
A.2 Matrices
    A.2.1 Matrices: Definitions
    A.2.2 Matrices: Addition and Multiplication by a Scalar
    A.2.3 Matrices: Multiplication
    A.2.4 Linear Transformations and Inverse Matrices
    A.2.5 Eigenvalue and Singular Value Decomposition
    A.2.6 Matrices and Data Analysis
A.3 Bibliographic Notes
Appendix B Dimensionality Reduction
B.1 PCA and SVD
    B.1.1 Principal Components Analysis (PCA)
    B.1.2 SVD
B.2 Other Dimensionality Reduction Techniques
    B.2.1 Factor Analysis
    B.2.2 Locally Linear Embedding (LLE)
    B.2.3 Multidimensional Scaling, FastMap, and ISOMAP
    B.2.4 Common Issues
B.3 Bibliographic Notes
Appendix C Probability and Statistics
C.1 Probability
    C.1.1 Expected Values
C.2 Statistics
    C.2.1 Point Estimation
    C.2.2 Central Limit Theorem
    C.2.3 Interval Estimation
C.3 Hypothesis Testing
Appendix D Regression
D.1 Preliminaries
D.2 Simple Linear Regression
    D.2.1 Least Square Method
    D.2.2 Analyzing Regression Errors
    D.2.3 Analyzing Goodness of Fit
D.3 Multivariate Linear Regression
D.4 Alternative Least-Square Regression Methods
Appendix E Optimization
E.1 Unconstrained Optimization
    E.1.1 Numerical Methods
E.2 Constrained Optimization
    E.2.1 Equality Constraints
    E.2.2 Inequality Constraints
Author Index
Subject Index
Copyright Permissions
1 Introduction
Rapid advances in data collection and storage technology have enabled organizations to accumulate vast amounts of data. However, extracting useful information has proven extremely challenging. Often, traditional data analysis tools and techniques cannot be used because of the massive size of a data set. Sometimes, the non-traditional nature of the data means that traditional approaches cannot be applied even if the data set is relatively small. In other situations, the questions that need to be answered cannot be addressed using existing data analysis techniques, and thus, new methods need to be developed.
Data mining is a technology that blends traditional data analysis methods with sophisticated algorithms for processing large volumes of data. It has also opened up exciting opportunities for exploring and analyzing new types of data and for analyzing old types of data in new ways. In this introductory chapter, we present an overview of data mining and outline the key topics to be covered in this book. We start with a description of some well-known applications that require new techniques for data analysis.