discussion:
What’s simple random sampling? Is it possible to sample data instances using a distribution different from the uniform distribution? If so, give an example of a probability distribution of the data instances that is different from uniform (i.e., equal probability).
Chapter 2 Q&A: See the attachments
Chapter 2 video: https://www.screencast.com/t/swJSFMPlDU (Part 1)
https://www.screencast.com/t/TZZrWRgXu6f (Part 2)
Chapter 3 Q&A: See the attachments
Textbook: Read Tan, Steinbach, & Kumar – Chapter 3 – Exploring Data
Intro to Data Mining
Chapter 2 Assignment
1. What’s an attribute? What’s a data instance?
2. What’s noise? How can noise be reduced in a dataset?
3. Define outlier. Describe 2 different approaches to detect outliers in a dataset.
4. Describe 3 different techniques to deal with missing values in a dataset. Explain when each of these techniques would be most appropriate.
5. Given a sample dataset with missing values, apply an appropriate technique to deal with them.
6. Give 2 examples in which aggregation is useful.
7. Given a sample dataset, apply aggregation of data values.
8. What’s sampling?
9. What’s simple random sampling? Is it possible to sample data instances using a distribution different from the uniform distribution? If so, give an example of a probability distribution of the data instances that is different from uniform (i.e., equal probability).
10. What’s stratified sampling?
11. What’s “the curse of dimensionality”?
12. Provide a brief description of what Principal Components Analysis (PCA) does. [Hint: See Appendix A and your lecture notes.] State what’s the input and what the output of PCA is.
13. What’s the difference between dimensionality reduction and feature selection?
14. Describe in detail 2 different techniques for feature selection.
15. Given a sample dataset (represented by a set of attributes, a correlation matrix, a covariance matrix, …), apply feature selection techniques to select the best attributes to keep (or equivalently, the best attributes to remove).
16. What’s the difference between feature selection and feature extraction?
17. Give two examples of data in which feature extraction would be useful.
18. Given a sample dataset, apply feature extraction.
19. What’s data discretization and when is it needed?
20. What’s the difference between supervised and unsupervised discretization?
21. Given a sample dataset, apply unsupervised (e.g., equal width, equal frequency) discretization, or supervised discretization (e.g., using entropy).
22. Describe 2 approaches to handle nominal attributes with too many values.
23. Given a dataset, apply variable transformation: Either a simple given function, normalization, or standardization.
24. Define correlation and covariance, and explain how to use them in data preprocessing (see pp. 76-78).
Intro to Data Mining
Chapter 3 Assignment
[Your Name Here]
1. Obtain one of the data sets available at the UCI Machine Learning Repository and apply as many of the different visualization techniques described in the chapter as possible. The bibliographic notes and book Web site provide pointers to visualization software.
2. Identify at least two advantages and two disadvantages of using color to visually represent information.
3. What are the arrangement issues that arise with respect to three-dimensional plots?
4. Discuss the advantages and disadvantages of using sampling to reduce the number of data objects that need to be displayed. Would simple random sampling (without replacement) be a good approach to sampling? Why or why not?
5. Describe how you would create visualizations to display information that describes the following types of systems.
a) Computer networks. Be sure to include both the static aspects of the network, such as connectivity, and the dynamic aspects, such as traﬃc.
b) The distribution of specific plant and animal species around the world for a specific moment in time.
c) The use of computer resources, such as processor time, main memory, and disk, for a set of benchmark database programs.
d) The change in occupation of workers in a particular country over the last thirty years. Assume that you have yearly information about each person that also includes gender and level of education.
Be sure to address the following issues:
· Representation. How will you map objects, attributes, and relationships to visual elements?
· Arrangement. Are there any special considerations that need to be taken into account with respect to how visual elements are displayed? Speciﬁc examples might be the choice of viewpoint, the use of transparency, or the separation of certain groups of objects.
· Selection. How will you handle a large number of attributes and data objects?
6. Describe one advantage and one disadvantage of a stem and leaf plot with respect to a standard histogram.
7. How might you address the problem that a histogram depends on the number and location of the bins?
8. Describe how a box plot can give information about whether the value of an attribute is symmetrically distributed. What can you say about the symmetry of the distributions of the attributes shown in Figure 3.11?
9. Compare sepal length, sepal width, petal length, and petal width, using Figure 3.12.
10. Comment on the use of a box plot to explore a data set with four attributes: age, weight, height, and income.
11. Give a possible explanation as to why most of the values of petal length and width fall in the buckets along the diagonal in Figure 3.9.
12. Use Figures 3.14 and 3.15 to identify a characteristic shared by the petal width and petal length attributes.
13. Simple line plots, such as that displayed in Figure 2.12 on page 56, which shows two time series, can be used to effectively display high-dimensional data. For example, in Figure 2.12 it is easy to tell that the frequencies of the two time series are different. What characteristic of time series allows the effective visualization of high-dimensional data?
14. Describe the types of situations that produce sparse or dense data cubes. Illustrate with examples other than those used in the book.
15. How might you extend the notion of multidimensional data analysis so that the target variable is a qualitative variable? In other words, what sorts of summary statistics or data visualizations would be of interest?
16. Construct a data cube from Table 3.14. Is this a dense or sparse data cube? If it is sparse, identify the cells that are empty.
17. Discuss the diﬀerences between dimensionality reduction based on aggregation and dimensionality reduction based on techniques such as PCA and SVD.
Data Mining: Data
Lecture Notes for Chapter 2
Introduction to Data Mining
by
Tan, Steinbach, Kumar
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002
What is Data?
 Collection of data objects and their attributes
 An attribute is a property or characteristic of an object
 Examples: eye color of a person, temperature, etc.
 Attribute is also known as variable, field, characteristic, or feature
 A collection of attributes describe an object
 Object is also known as record, point, case, sample, entity, or instance
Attribute Values
 Attribute values are numbers or symbols assigned to an attribute
 Distinction between attributes and attribute values
 Same attribute can be mapped to different attribute values
 Example: height can be measured in feet or meters
 Different attributes can be mapped to the same set of values
 Example: Attribute values for ID and age are integers
 But properties of attribute values can be different
ID has no limit but age has a maximum and minimum value
Measurement of Length
 The way you measure an attribute may not match the attribute's properties.
Types of Attributes
 There are different types of attributes
 Nominal
 Examples: ID numbers, eye color, zip codes
 Ordinal
 Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
 Interval
 Examples: calendar dates, temperatures in Celsius or Fahrenheit.
 Ratio
 Examples: temperature in Kelvin, length, time, counts
Properties of Attribute Values
 The type of an attribute depends on which of the following properties it possesses:
 Distinctness: =
 Order: < >
 Addition: + –
 Multiplication: * /
 Nominal attribute: distinctness
 Ordinal attribute: distinctness & order
 Interval attribute: distinctness, order & addition
 Ratio attribute: all 4 properties
Discrete and Continuous Attributes
 Discrete Attribute
 Has only a finite or countably infinite set of values
 Examples: zip codes, counts, or the set of words in a collection of documents
 Often represented as integer variables.
 Note: binary attributes are a special case of discrete attributes
 Continuous Attribute
 Has real numbers as attribute values
 Examples: temperature, height, or weight.
 Practically, real values can only be measured and represented using a finite number of digits.
 Continuous attributes are typically represented as floating-point variables.
Types of data sets
 Record
 Data Matrix
 Document Data
 Transaction Data
 Graph
 World Wide Web
 Molecular Structures
 Ordered
 Spatial Data
 Temporal Data
 Sequential Data
 Genetic Sequence Data
Important Characteristics of Structured Data
 Dimensionality
 Curse of Dimensionality
 Sparsity
 Only presence counts
 Resolution
 Patterns depend on the scale
Record Data
 Data that consists of a collection of records, each of which consists of a fixed set of attributes
Tid  Refund  Marital Status  Taxable Income  Cheat 
1  Yes  Single  125K  No 
2  No  Married  100K  No 
3  No  Single  70K  No 
4  Yes  Married  120K  No 
5  No  Divorced  95K  Yes 
6  No  Married  60K  No 
7  Yes  Divorced  220K  No 
8  No  Single  85K  Yes 
9  No  Married  75K  No 
10  No  Single  90K  Yes 
Data Matrix
 If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multidimensional space, where each dimension represents a distinct attribute
 Such data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute
Document Data
 Each document becomes a 'term' vector,
 each term is a component (attribute) of the vector,
 the value of each component is the number of times the corresponding term occurs in the document.
 season  timeout  lost  win  game  score  ball  play  coach  team 
Document 1  3  0  5  0  2  6  0  2  0  2 
Document 2  0  7  0  2  1  0  0  3  0  0 
Document 3  0  1  0  0  1  2  2  0  3  0 
Transaction Data
 A special type of record data, where
 each record (transaction) involves a set of items.
 For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitute a transaction, while the individual products that were purchased are the items.
TID  Items 
1  Bread, Coke, Milk 
2  Beer, Bread 
3  Beer, Coke, Diaper, Milk 
4  Beer, Bread, Diaper, Milk 
5  Coke, Diaper, Milk 
Graph Data
 Examples: Generic graph and HTML Links
Chemical Data
 Benzene Molecule: C6H6
Ordered Data
 Sequences of transactions
An element of the sequence
Items/Events
Ordered Data
 Genomic sequence data
Ordered Data
 SpatioTemporal Data
Average Monthly Temperature of land and ocean
Data Quality
 What kinds of data quality problems?
 How can we detect problems with the data?
 What can we do about these problems?
 Examples of data quality problems:
 Noise and outliers
 missing values
 duplicate data
Noise
 Noise refers to modification of original values
 Examples: distortion of a person’s voice when talking on a poor phone and “snow” on television screen
Two Sine Waves
Two Sine Waves + Noise
Outliers
 Outliers are data objects with characteristics that are considerably different than most of the other data objects in the data set
Missing Values
 Reasons for missing values
 Information is not collected
(e.g., people decline to give their age and weight)
 Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)
 Handling missing values
 Eliminate Data Objects
 Estimate Missing Values
 Ignore the Missing Value During Analysis
 Replace with all possible values (weighted by their probabilities)
Duplicate Data
 Data set may include data objects that are duplicates, or almost duplicates of one another
 Major issue when merging data from heterogeneous sources
 Examples:
 Same person with multiple email addresses
 Data cleaning
 Process of dealing with duplicate data issues
Data Preprocessing
 Aggregation
 Sampling
 Dimensionality Reduction
 Feature subset selection
 Feature creation
 Discretization and Binarization
 Attribute Transformation
Aggregation
 Combining two or more attributes (or objects) into a single attribute (or object)
 Purpose
 Data reduction
 Reduce the number of attributes or objects
 Change of scale
 Cities aggregated into regions, states, countries, etc
 More “stable” data
 Aggregated data tends to have less variability
Aggregation
Standard Deviation of Average Monthly Precipitation
Standard Deviation of Average Yearly Precipitation
Variation of Precipitation in Australia
Sampling
 Sampling is the main technique employed for data selection.
 It is often used for both the preliminary investigation of the data and the final data analysis.
 Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming.
 Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming.
Sampling …
 The key principle for effective sampling is the following:
 using a sample will work almost as well as using the entire data set, if the sample is representative
 A sample is representative if it has approximately the same property (of interest) as the original set of data
Types of Sampling
 Simple Random Sampling
 There is an equal probability of selecting any particular item
 Sampling without replacement
 As each item is selected, it is removed from the population
 Sampling with replacement
 Objects are not removed from the population as they are selected for the sample.
 In sampling with replacement, the same object can be picked up more than once
 Stratified sampling
 Split the data into several partitions; then draw random samples from each partition
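The three sampling schemes above can be sketched in a few lines of Python; the population and its group labels below are made up for illustration.

```python
import random
from collections import defaultdict

random.seed(42)
population = [{"id": i, "group": "B" if i % 4 == 0 else "A"} for i in range(100)]

# Simple random sampling without replacement: each object selected at most once
without_repl = random.sample(population, k=10)

# Simple random sampling with replacement: the same object can be picked more than once
with_repl = [random.choice(population) for _ in range(10)]

# Stratified sampling: split the data into partitions (strata),
# then draw a random sample from each partition
def stratified_sample(items, key, per_stratum):
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        sample.extend(random.sample(members, min(per_stratum, len(members))))
    return sample

strat = stratified_sample(population, key=lambda r: r["group"], per_stratum=5)
```

Stratified sampling guarantees every group is represented even when one group (here "B") is a small minority.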
Sample Size
8000 points 2000 Points 500 Points
Sample Size
 What sample size is necessary to get at least one object from each of 10 groups?
Curse of Dimensionality
 When dimensionality increases, data becomes increasingly sparse in the space that it occupies
 Definitions of density and distance between points, which is critical for clustering and outlier detection, become less meaningful
 Randomly generate 500 points
 Compute difference between max and min distance between any pair of points
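The experiment described above can be simulated with a small sketch (point counts and dimensions are illustrative; this does not reproduce the slide's exact figure):

```python
import math
import random

random.seed(0)

def relative_spread(dim, n_points=500):
    """log10((max - min) / min) over all pairwise distances of random points."""
    # Randomly generate points in the unit hypercube
    pts = [[random.random() for _ in range(dim)] for _ in range(n_points)]
    dists = []
    for i in range(n_points):
        for j in range(i + 1, n_points):
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(pts[i], pts[j])))
            dists.append(d)
    return math.log10((max(dists) - min(dists)) / min(dists))
```

As the dimensionality grows, distances concentrate and the relative spread shrinks, which is exactly why density and distance become less meaningful in high dimensions.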
Dimensionality Reduction
 Purpose:
 Avoid curse of dimensionality
 Reduce amount of time and memory required by data mining algorithms
 Allow data to be more easily visualized
 May help to eliminate irrelevant features or reduce noise
 Techniques
 Principal Component Analysis
 Singular Value Decomposition
 Others: supervised and nonlinear techniques
Dimensionality Reduction: PCA
 Goal is to find a projection that captures the largest amount of variation in data
Dimensionality Reduction: PCA
 Find the eigenvectors of the covariance matrix
 The eigenvectors define the new space
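A minimal NumPy sketch of this procedure, on a synthetic correlated 2-D data set (the data and seed are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = 0.8 * x1 + 0.2 * rng.normal(size=200)   # correlated with x1
X = np.column_stack([x1, x2])

# Center the data, then find the eigenvectors of the covariance matrix
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order]               # columns = principal components

# Project onto the first principal component: the direction capturing
# the largest amount of variation in the data
projected = Xc @ components[:, :1]
```

The variance of the projection equals the largest eigenvalue of the covariance matrix, which is why keeping the top components preserves most of the variation.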
Dimensionality Reduction: ISOMAP
 Construct a neighbourhood graph
 For each pair of points in the graph, compute the shortest path distances – geodesic distances
By: Tenenbaum, de Silva, Langford (2000)
Feature Subset Selection
 Another way to reduce dimensionality of data
 Redundant features
 duplicate much or all of the information contained in one or more other attributes
 Example: purchase price of a product and the amount of sales tax paid
 Irrelevant features
 contain no information that is useful for the data mining task at hand
 Example: students’ ID is often irrelevant to the task of predicting students’ GPA
Feature Subset Selection
 Techniques:
 Brute-force approach:
 Try all possible feature subsets as input to data mining algorithm
 Embedded approaches:
 Feature selection occurs naturally as part of the data mining algorithm
 Filter approaches:
 Features are selected before data mining algorithm is run
 Wrapper approaches:
 Use the data mining algorithm as a black box to find best subset of attributes
Feature Creation
 Create new attributes that can capture the important information in a data set much more efficiently than the original attributes
 Three general methodologies:
 Feature Extraction
 domainspecific
 Mapping Data to New Space
 Feature Construction
 combining features
Mapping Data to a New Space
Two Sine Waves
Two Sine Waves + Noise
Frequency
 Fourier transform
 Wavelet transform
Discretization Using Class Labels
 Entropy based approach
3 categories for both x and y
5 categories for both x and y
Discretization Without Using Class Labels
Data
Equal interval width
Equal frequency
K-means
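Equal-width and equal-frequency binning can be sketched as follows (the data and the number of bins are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
values = rng.normal(loc=10, scale=3, size=1000)
k = 4  # number of bins

# Equal interval width: bins of identical width spanning the data range
width_edges = np.linspace(values.min(), values.max(), k + 1)
width_bins = np.digitize(values, width_edges[1:-1])

# Equal frequency: bin edges at quantiles, so each bin gets roughly the same count
freq_edges = np.quantile(values, np.linspace(0, 1, k + 1))
freq_bins = np.digitize(values, freq_edges[1:-1])
```

With skewed or clustered data, equal-width bins can leave some bins nearly empty, while equal-frequency bins balance the counts by construction.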
Attribute Transformation
 A function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values
 Simple functions: x^k, log(x), e^x, |x|
 Standardization and Normalization
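A sketch of min-max normalization and standardization on a made-up attribute:

```python
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Min-max normalization: map the values onto [0, 1]
lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]

# Standardization (z-scores): zero mean, unit standard deviation
mean = sum(values) / len(values)
std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
standardized = [(v - mean) / std for v in values]
```

Both are attribute transformations in the sense above: each old value maps to exactly one new value, but the scale (and hence distance computations) changes.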
Similarity and Dissimilarity
 Similarity
 Numerical measure of how alike two data objects are.
 Is higher when objects are more alike.
 Often falls in the range [0,1]
 Dissimilarity
 Numerical measure of how different two data objects are
 Lower when objects are more alike
 Minimum dissimilarity is often 0
 Upper limit varies
 Proximity refers to a similarity or dissimilarity
Similarity/Dissimilarity for Simple Attributes
p and q are the attribute values for two data objects.
Euclidean Distance
 Euclidean Distance
Where n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.
 Standardization is necessary if scales differ.
Euclidean Distance
Distance Matrix
point  x  y 
p1  0  2 
p2  2  0 
p3  3  1 
p4  5  1 

 p1  p2  p3  p4 
p1  0  2.828  3.162  5.099 
p2  2.828  0  1.414  3.162 
p3  3.162  1.414  0  2 
p4  5.099  3.162  2  0 
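The Euclidean distance matrix for p1-p4 can be reproduced with a short sketch:

```python
import math

points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

def euclidean(p, q):
    # sqrt of the sum of squared component differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

names = list(points)
matrix = [[round(euclidean(points[a], points[b]), 3) for b in names] for a in names]
```

Each entry matches the distance matrix shown for these points, e.g. dist(p1, p2) = sqrt(4 + 4) = 2.828.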
Minkowski Distance
 Minkowski Distance is a generalization of Euclidean Distance
Where r is a parameter, n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.
Minkowski Distance: Examples
 r = 1. City block (Manhattan, taxicab, L1 norm) distance.
 A common example of this is the Hamming distance, which is just the number of bits that are different between two binary vectors
 r = 2. Euclidean distance
 r → ∞. “supremum” (L_max norm, L_∞ norm) distance.
 This is the maximum difference between any component of the vectors
 Do not confuse r with n, i.e., all these distances are defined for all numbers of dimensions.
Minkowski Distance
Distance Matrix
point  x  y 
p1  0  2 
p2  2  0 
p3  3  1 
p4  5  1 

L1  p1  p2  p3  p4 
p1  0  4  4  6 
p2  4  0  2  4 
p3  4  2  0  2 
p4  6  4  2  0 

L2  p1  p2  p3  p4 
p1  0  2.828  3.162  5.099 
p2  2.828  0  1.414  3.162 
p3  3.162  1.414  0  2 
p4  5.099  3.162  2  0 

L∞  p1  p2  p3  p4 
p1  0  2  3  5 
p2  2  0  1  3 
p3  3  1  0  2 
p4  5  3  2  0 
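The L1, L2, and L∞ matrices for p1-p4 can be reproduced as follows:

```python
points = [("p1", (0, 2)), ("p2", (2, 0)), ("p3", (3, 1)), ("p4", (5, 1))]

def minkowski(p, q, r):
    if r == float("inf"):
        # supremum (L_max) distance: the maximum component difference
        return max(abs(a - b) for a, b in zip(p, q))
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1 / r)

def matrix(r):
    return [[round(minkowski(p, q, r), 3) for _, q in points] for _, p in points]

l1, l2, lmax = matrix(1), matrix(2), matrix(float("inf"))
```

Setting r = 1 gives the city-block distances, r = 2 the Euclidean distances, and r → ∞ the supremum distances shown in the matrices.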
Mahalanobis Distance
For red points, the Euclidean distance is 14.7, Mahalanobis distance is 6.
Σ is the covariance matrix of the input data X
Mahalanobis Distance
Covariance Matrix:
B
A
C
A: (0.5, 0.5)
B: (0, 1)
C: (1.5, 1.5)
Mahal(A,B) = 5
Mahal(A,C) = 4
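The example can be checked numerically, using the squared form of the Mahalanobis distance as defined on the slide:

```python
import numpy as np

# Covariance matrix from the slide
cov = np.array([[0.3, 0.2],
                [0.2, 0.3]])
cov_inv = np.linalg.inv(cov)

def mahalanobis_sq(p, q):
    # (p - q) Σ^{-1} (p - q)^T
    d = np.asarray(p) - np.asarray(q)
    return float(d @ cov_inv @ d)

A, B, C = (0.5, 0.5), (0.0, 1.0), (1.5, 1.5)
```

Even though C is farther from A in Euclidean terms, Mahal(A, C) < Mahal(A, B) because the displacement A→C lies along the direction of high covariance.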
Common Properties of a Distance
 Distances, such as the Euclidean distance, have some well known properties.
d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if p = q. (Positive definiteness)
d(p, q) = d(q, p) for all p and q. (Symmetry)
d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r. (Triangle Inequality)
where d(p, q) is the distance (dissimilarity) between points (data objects), p and q.
 A distance that satisfies these properties is a metric
Common Properties of a Similarity
 Similarities also have some well-known properties.
s(p, q) = 1 (or maximum similarity) only if p = q.
s(p, q) = s(q, p) for all p and q. (Symmetry)
where s(p, q) is the similarity between points (data objects), p and q.
Similarity Between Binary Vectors
 Common situation is that objects, p and q, have only binary attributes
 Compute similarities using the following quantities
M01 = the number of attributes where p was 0 and q was 1
M10 = the number of attributes where p was 1 and q was 0
M00 = the number of attributes where p was 0 and q was 0
M11 = the number of attributes where p was 1 and q was 1
 Simple Matching and Jaccard Coefficients
SMC = number of matches / number of attributes
= (M11 + M00) / (M01 + M10 + M11 + M00)
J = number of 11 matches / number of not-both-zero attribute values
= (M11) / (M01 + M10 + M11)
SMC versus Jaccard: Example
p = 1 0 0 0 0 0 0 0 0 0
q = 0 0 0 0 0 0 1 0 0 1
M01 = 2 (the number of attributes where p was 0 and q was 1)
M10 = 1 (the number of attributes where p was 1 and q was 0)
M00 = 7 (the number of attributes where p was 0 and q was 0)
M11 = 0 (the number of attributes where p was 1 and q was 1)
SMC = (M11 + M00)/(M01 + M10 + M11 + M00) = (0+7) / (2+1+0+7) = 0.7
J = (M11) / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
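The example can be verified with a short sketch:

```python
p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]

# Count the four combinations of binary values
m01 = sum(1 for a, b in zip(p, q) if a == 0 and b == 1)
m10 = sum(1 for a, b in zip(p, q) if a == 1 and b == 0)
m00 = sum(1 for a, b in zip(p, q) if a == 0 and b == 0)
m11 = sum(1 for a, b in zip(p, q) if a == 1 and b == 1)

smc = (m11 + m00) / (m01 + m10 + m11 + m00)  # simple matching coefficient
jaccard = m11 / (m01 + m10 + m11)            # ignores 0-0 matches
```

The contrast illustrates why Jaccard is preferred for sparse binary data: SMC is inflated by the many shared zeros, while Jaccard reports that p and q share no 1s at all.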
Cosine Similarity
 If d1 and d2 are two document vectors, then
cos( d1, d2 ) = (d1 • d2) / (||d1|| ||d2||),
where • indicates the vector dot product and ||d|| is the length of vector d.
 Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
d1 • d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449
cos( d1, d2 ) = 5 / (6.481 * 2.449) = 0.3150
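The same computation in Python (note that ||d2|| = sqrt(6) ≈ 2.449):

```python
import math

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]

dot = sum(a * b for a, b in zip(d1, d2))        # vector dot product
len1 = math.sqrt(sum(a * a for a in d1))        # ||d1||
len2 = math.sqrt(sum(b * b for b in d2))        # ||d2||
cosine = dot / (len1 * len2)
```

Because cosine similarity depends only on the angle between the vectors, it ignores document length, which is why it is the standard choice for term-vector data.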
Extended Jaccard Coefficient (Tanimoto)
 Variation of Jaccard for continuous or count attributes
 Reduces to Jaccard for binary attributes
Correlation
 Correlation measures the linear relationship between objects
 To compute correlation, we standardize data objects, p and q, and then take their dot product
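A sketch of this computation, using the sample standard deviation (divisor n - 1) and made-up vectors:

```python
import math

p = [1, 2, 3, 4, 5]
q = [2, 4, 6, 8, 10]   # perfectly linearly related to p (illustrative values)

def standardize(x):
    n = len(x)
    mean = sum(x) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / (n - 1))
    return [(v - mean) / std for v in x]

ps, qs = standardize(p), standardize(q)
# Correlation = dot product of the standardized objects, scaled by n - 1
corr = sum(a * b for a, b in zip(ps, qs)) / (len(p) - 1)
```

For these vectors the correlation is 1, reflecting a perfect positive linear relationship.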
Visually Evaluating Correlation
Scatter plots showing the similarity from –1 to 1.
General Approach for Combining Similarities
 Sometimes attributes are of many different types, but an overall similarity is needed.
Using Weights to Combine Similarities
 May not want to treat all attributes the same.
 Use weights wk which are between 0 and 1 and sum to 1.
Density
 Density-based clustering requires a notion of density
 Examples:
 Euclidean density
 Euclidean density = number of points per unit volume
 Probability density
 Graphbased density
Euclidean Density – Cellbased
 Simplest approach is to divide region into a number of rectangular cells of equal volume and define density as # of points the cell contains
Euclidean Density – Centerbased
 Euclidean density is the number of points within a specified radius of the point
[Figure: feature extraction example showing thickness and load attributes, distance, and the projections of the x and y loads]
<a href="papers/papers.html#bbbb">
Data Mining </a>
<li>
<a href="papers/papers.html#aaaa">
Graph Partitioning </a>
<li>
<a href="papers/papers.html#aaaa">
Parallel Solution of Sparse Linear System of Equations </a>
<li>
<a href="papers/papers.html#ffff">
N-Body Computation and Dense Linear System Solvers
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
[Figure: graph data with numerically weighted nodes and a second graph with nodes A, B, C, D, E.]
[Figure: plots for Dimensions = 10, 40, 80, 120, 160, 206.]
Euclidean distance:

\mathrm{dist}(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}
[Figure: the four points p1–p4 plotted in the plane.]

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

Euclidean distance matrix:

      p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0
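The four points and their Euclidean distance matrix can be reproduced with a few lines; this is a minimal sketch using NumPy, not part of the course's R code:

```python
import numpy as np

# The four example points p1..p4
points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

def euclidean(p, q):
    """Euclidean (L2) distance between two points."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sqrt(np.sum((p - q) ** 2))

# Print the pairwise distance matrix, rounded to 3 decimals
for a in points:
    row = [round(euclidean(points[a], points[b]), 3) for b in points]
    print(a, row)
```

Each printed row matches one row of the distance matrix (e.g., d(p1, p2) = 2.828).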
Minkowski distance:

\mathrm{dist}(p, q) = \left( \sum_{k=1}^{n} |p_k - q_k|^r \right)^{1/r}
point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

L1 norm distance matrix:

     p1  p2  p3  p4
p1    0   4   4   6
p2    4   0   2   4
p3    4   2   0   2
p4    6   4   2   0

L2 norm distance matrix:

      p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0

L∞ norm distance matrix:

     p1  p2  p3  p4
p1    0   2   3   5
p2    2   0   1   3
p3    3   1   0   2
p4    5   3   2   0
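As a sketch, the Minkowski family can be implemented directly from the formula; r = 1 gives the L1 matrix, r = 2 the L2 matrix, and the max of coordinate differences gives L∞ (the limit as r grows):

```python
import numpy as np

def minkowski(p, q, r):
    """Minkowski distance; r=1 gives L1 (Manhattan), r=2 gives L2 (Euclidean)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(np.abs(p - q) ** r) ** (1.0 / r)

def chebyshev(p, q):
    """L-infinity (supremum) distance: the largest coordinate difference."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.max(np.abs(p - q))

p1, p4 = (0, 2), (5, 1)
print(minkowski(p1, p4, 1))            # 6.0  (L1 entry for p1, p4)
print(round(minkowski(p1, p4, 2), 3))  # 5.099 (L2 entry)
print(chebyshev(p1, p4))               # 5.0  (L-infinity entry)
```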
Mahalanobis distance:

\mathrm{mahalanobis}(p, q) = (p - q)^T \, \Sigma^{-1} \, (p - q)

where \Sigma is the covariance matrix of the input data X:

\Sigma_{jk} = \frac{1}{n-1} \sum_{i=1}^{n} (X_{ij} - \bar{X}_j)(X_{ik} - \bar{X}_k)

Example:

\Sigma = \begin{bmatrix} 0.3 & 0.2 \\ 0.2 & 0.3 \end{bmatrix}
Correlation via standardization — first standardize each vector, then take the dot product:

p'_k = (p_k - \mathrm{mean}(p)) / \mathrm{std}(p)

q'_k = (q_k - \mathrm{mean}(q)) / \mathrm{std}(q)

\mathrm{correlation}(p, q) = p' \cdot q'
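The standardized-vector form can be checked against NumPy's built-in Pearson correlation; note that when std is the sample standard deviation (n − 1 in the denominator), the dot product of the standardized vectors must be divided by n − 1 to land on the usual correlation coefficient. The data is made up:

```python
import numpy as np

def correlation(p, q):
    """Pearson correlation via standardized vectors."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    ps = (p - p.mean()) / p.std(ddof=1)   # p'
    qs = (q - q.mean()) / q.std(ddof=1)   # q'
    return ps @ qs / (len(p) - 1)

p = [1, 2, 3, 4, 5]
q = [2, 4, 5, 4, 10]
print(round(correlation(p, q), 6))

# Agrees with NumPy's corrcoef
assert np.isclose(correlation(p, q), np.corrcoef(p, q)[0, 1])
```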
Data Mining: Exploring Data
Lecture Notes for Chapter 3 Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler
Look for accompanying R code on the course web site.
© Tan, Steinbach, Kumar, Introduction to Data Mining, 8/05/2005
Techniques Used In Data Exploration
• In EDA, as originally defined by Tukey
– The focus was on visualization
– Clustering and anomaly detection were viewed as exploratory techniques
• In data mining, clustering and anomaly detection are major areas of interest, and not thought of as just exploratory
• In our discussion of data exploration, we focus on
– Summary statistics
– Visualization
– Online Analytical Processing (OLAP)
What is data exploration?
A preliminary exploration of the data to better understand its characteristics.
• Key motivations of data exploration include
– Helping to select the right tool for preprocessing or analysis
– Making use of humans' abilities to recognize patterns
• People can recognize patterns not captured by data analysis tools
• Related to the area of Exploratory Data Analysis (EDA)
– Created by statistician John Tukey
– Seminal book is "Exploratory Data Analysis" by Tukey
– A nice online introduction can be found in Chapter 1 of the NIST Engineering Statistics Handbook: http://www.itl.nist.gov/div898/handbook/index.htm
Iris Sample Data Set
• Many of the exploratory data techniques are illustrated with the Iris Plant data set.
– Can be obtained from the UCI Machine Learning Repository
– From the statistician R.A. Fisher
– Three flower types (classes):
• Setosa
• Virginica
• Versicolour
– Four (non-class) attributes:
• Sepal width and length
• Petal width and length
[Image credit: Virginica. Robert H. Mohlenbrock. USDA NRCS. 1995. Northeast wetland flora: Field office guide to plant species. Northeast National Technical Center, Chester, PA. Courtesy of USDA NRCS Wetland Science Institute.]
Summary Statistics
• Summary statistics are numbers that summarize properties of the data
– Summarized properties include location and spread for continuous data
• Examples: location – mean; spread – standard deviation
Most summary statistics can be calculated in a single pass through the data
Frequency and Mode
• The frequency of an attribute value is the percentage of time the value occurs in the data set – For example, given the attribute ‘gender’ and a
representative population of people, the gender ‘female’ occurs about 50% of the time.
• The mode of an attribute is the most frequent attribute value
• The notions of frequency and mode are typically used with categorical data
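As a quick sketch (the gender data below is made up for illustration), frequency and mode can be computed by counting attribute values:

```python
from collections import Counter

def mode(values):
    """Return the most frequent value of a categorical attribute."""
    return Counter(values).most_common(1)[0][0]

def frequency(values, v):
    """Fraction of instances that take the value v."""
    return values.count(v) / len(values)

gender = ["female", "male", "female", "female", "male"]
print(mode(gender))                 # "female"
print(frequency(gender, "female"))  # 0.6
```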
Measures of Location: Mean and Median
• The mean is the most common measure of the location of a set of points.
• However, the mean is very sensitive to outliers. • Thus, the median or a trimmed mean is also
commonly used.
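The outlier sensitivity is easy to demonstrate; this sketch uses a made-up sample with one outlier, and a simple trimmed mean that discards a fixed proportion of values at each end:

```python
import numpy as np

def trimmed_mean(x, proportion=0.2):
    """Mean after discarding the given proportion of values at each end."""
    x = np.sort(np.asarray(x, dtype=float))
    k = int(len(x) * proportion)
    return x[k:len(x) - k].mean()

data = [1, 2, 3, 4, 100]           # 100 is an outlier
print(np.mean(data))               # 22.0 -- pulled up by the outlier
print(np.median(data))             # 3.0  -- robust
print(trimmed_mean(data, 0.2))     # 3.0  -- drops 1 and 100, then averages
```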
Measures of Spread: Range and Variance
• Range is the difference between the max and min • The variance or standard deviation is the most
common measure of the spread of a set of points.
• However, this is also sensitive to outliers, so that other measures are often used.
Percentiles
• Given an ordinal or continuous attribute x and a number p between 0 and 100, the p-th percentile x_p is a value of x such that p% of the observed values of x are less than x_p.
• Median: 50% of the cases have a smaller value and 50% are larger.
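Percentiles can be read off with NumPy (which, by default, linearly interpolates between observations); the data here is just the integers 1 through 10:

```python
import numpy as np

x = list(range(1, 11))  # 1, 2, ..., 10
q25, q50, q75 = np.percentile(x, [25, 50, 75])
print(q25, q50, q75)    # 3.25 5.5 7.75
```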
Multivariate Summary Statistics

Object  x1  x2
1       12  15
2        2   4
…        …   …
m       18   4

• Covariance between features i and j:

s_{ij} = \frac{1}{m-1} \sum_{k=1}^{m} (x_{ki} - \bar{x}_i)(x_{kj} - \bar{x}_j)

• Correlation:

r_{ij} = \frac{s_{ij}}{s_i s_j}

where s_i is the standard deviation of feature i.
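The covariance and correlation matrices can be computed directly from the formulas and checked against NumPy's built-ins; the small data matrix here is made up:

```python
import numpy as np

# m objects (rows) x 2 features (columns), made-up values
X = np.array([[12.0, 15.0],
              [ 2.0,  4.0],
              [18.0,  4.0]])
m = X.shape[0]

xbar = X.mean(axis=0)
S = (X - xbar).T @ (X - xbar) / (m - 1)     # covariance matrix s_ij
assert np.allclose(S, np.cov(X, rowvar=False))

s = np.sqrt(np.diag(S))                     # standard deviations s_i
R = S / np.outer(s, s)                      # correlation matrix r_ij
assert np.allclose(R, np.corrcoef(X, rowvar=False))
print(R)
```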
Topics
• Exploratory Data Analysis • Summary Statistics • Visualization
Visualization
Visualization is the conversion of data into a visual or tabular format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported.
• Visualization of data is one of the most powerful and appealing techniques for data exploration. – Humans have a well developed ability to analyze
large amounts of information that is presented visually – Can detect general patterns and trends – Can detect outliers and unusual patterns
Example: Sea Surface Temperature • The following shows the Sea Surface Temperature
(SST) for July 1982 – Tens of thousands of data points are
summarized in a single figure
Representation
• Is the mapping of information to a visual format • Data objects, their attributes, and the relationships
among data objects are translated into graphical elements such as points, lines, shapes, and colors.
• Example: – Objects are often represented as points – Their attribute values can be represented as the
position of the points or the characteristics of the points, e.g., color, size, and shape – If position is used, then the relationships of points,
i.e., whether they form groups or a point is an outlier, is easily perceived.
Arrangement
• Is the placement of visual elements within a display
• Can make a large difference in how easy it is to understand the data
• Example:
Selection
• Is the elimination or the deemphasis of certain objects and attributes
• Selection may involve choosing a subset of attributes – Dimensionality reduction is often used to reduce the
number of dimensions to two or three – Alternatively, pairs of attributes can be considered
• Selection may also involve choosing a subset of objects – A region of the screen can only show so many points – Can sample, but want to preserve points in sparse areas
Histograms
• Usually shows the distribution of values of a single variable • Divide the values into bins and show a bar plot of the number of objects in each bin. • The height of each bar indicates the number of objects • Shape of histogram depends on the number of bins
• Example: Petal Width (10 and 20 bins, respectively)
Empirical Cumulative Distribution Function (ECDF)
• Probability Density Function (PDF): describes the relative likelihood for this random variable to take on a given value
• Cumulative Distribution Function (CDF): Shows the distribution of data as the fraction of points that are less than this value.
CDF
Example: ECDF
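An ECDF is simple to compute by hand: sort the values and, at each one, record the fraction of points less than or equal to it. A sketch with made-up data:

```python
import numpy as np

def ecdf(data):
    """Return (sorted values, fraction of points <= each value)."""
    x = np.sort(np.asarray(data, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

x, y = ecdf([3, 1, 4, 1, 5])
print(x)  # [1. 1. 3. 4. 5.]
print(y)  # [0.2 0.4 0.6 0.8 1. ]
```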
TwoDimensional Histograms • Show the joint distribution of the values of two attributes • Example: petal width and petal length
– What does this tell us?
Box Plots • Invented by J. Tukey • Another way of displaying the distribution of data • Following figure shows the basic part of a box plot
[Figure: anatomy of a box plot. The box spans the 25th percentile (Q1) to the 75th percentile (Q3), with the median (50th percentile) inside; IQR = Q3 − Q1. Whiskers extend to Q1 − 1.5 × IQR and Q3 + 1.5 × IQR, and points beyond the whiskers are drawn as outliers. For normally distributed data, Q1 ≈ −0.6745σ and Q3 ≈ 0.6745σ, the whisker ends fall near ±2.698σ, the box covers 50% of the data, each whisker region covers about 24.65%, and about 0.7% of points fall outside the whiskers.]
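The box-plot quantities (quartiles, IQR, whisker fences, outliers) can be sketched directly; the sample data is made up and contains one obvious outlier:

```python
import numpy as np

def boxplot_stats(data):
    """Quartiles, IQR, and the 1.5*IQR whisker fences used in a box plot."""
    q1, med, q3 = np.percentile(data, [25, 50, 75])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in data if v < lo or v > hi]
    return q1, med, q3, iqr, (lo, hi), outliers

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]   # 30 is an outlier
q1, med, q3, iqr, fences, out = boxplot_stats(data)
print(out)  # [30]
```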
Example of Box Plots • Box plots can be used to compare attributes
Scatter Plots
– Attribute values determine the position
– Two-dimensional scatter plots are most common, but three-dimensional scatter plots are also possible
– Additional attributes can often be displayed using the size, shape, and color of the markers that represent the objects
– Arrays of scatter plots can compactly summarize the relationships of several pairs of attributes
• See example on the next slide
Scatter Plot Array of Iris Attributes
Contour Plots – Useful when a continuous attribute is measured
on a spatial grid – They partition the plane into regions of similar
values – The contour lines that form the boundaries of
these regions connect points with equal values – The most common example is contour maps of
elevation – Can also display temperature, rainfall, air
pressure, etc. An example for Sea Surface Temperature (SST) is
provided on the next slide
Contour Plot Example: SST Dec, 1998
[Color scale in degrees Celsius]
Matrix Plots
– Can plot a data matrix – Can be useful when objects are sorted
according to class – Typically, the attributes are normalized to
prevent one attribute from dominating the plot – Plots of similarity or distance matrices can also
be useful for visualizing the relationships between objects
Visualization of the Iris Data Matrix
[Color scale: deviation from feature mean, in standard deviations]
Visualization of the Iris Correlation Matrix
Parallel Coordinates – Used to plot the attribute values of high
dimensional data – Instead of using perpendicular axes, use a set of
parallel axes – The attribute values of each object are plotted as a
point on each corresponding coordinate axis and the points are connected by a line
– Thus, each object is represented as a line – Often, the lines representing a distinct class of
objects group together, at least for some attributes – Ordering of attributes is important in seeing such
groupings
Parallel Coordinates Plots for Iris Data
Reordered features
Other Visualization Techniques
Star Plots – Similar approach to parallel coordinates, but axes radiate
from a central point – The line connecting the values of an object is a polygon
Chernoff Faces – Approach created by Herman Chernoff – This approach associates each attribute with a characteristic
of a face – The values of each attribute determine the appearance of the
corresponding facial characteristic – Each object becomes a separate face – Relies on humans' ability to distinguish faces
Star Plots for Iris Data
Setosa
Versicolor
Virginica
Chernoff Faces for Iris Data Setosa
Versicolor
Virginica
OLAP • On-Line Analytical Processing (OLAP) was
proposed by E. F. Codd, the father of the relational database.
• Relational databases put data into tables, while OLAP uses a multidimensional array representation. – Such representations of data previously existed in
statistics and other fields
• There are a number of data analysis and data exploration operations that are easier with such a data representation.
Creating a Multidimensional Array • Two key steps in converting tabular data into a
multidimensional array. – First, identify which attributes are to be the dimensions
and which attribute is to be the target attribute whose values appear as entries in the multidimensional array. • The attributes used as dimensions must have discrete values • The target value is typically a count or continuous value, e.g.,
the cost of an item • Can have no target variable at all except the count of objects
that have the same set of attribute values
– Second, find the value of each entry in the multidimensional array by summing the values (of the target attribute) or count of all objects that have the attribute values corresponding to that entry.
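The two steps can be sketched in a few lines: each tuple of dimension values indexes one cell of the array, and the cell accumulates the count (or sum) of matching objects. The tiny discretized table below is made up for illustration:

```python
from collections import defaultdict

# Made-up discretized data: (petal_width, petal_length, species)
rows = [
    ("low", "low", "Setosa"),
    ("low", "low", "Setosa"),
    ("medium", "medium", "Versicolour"),
    ("high", "high", "Virginica"),
    ("high", "high", "Virginica"),
    ("high", "high", "Virginica"),
]

# Step 1: the three categorical attributes are the dimensions; the
#         target is simply the count of objects per combination.
# Step 2: fill each cell by counting the matching objects.
cube = defaultdict(int)
for width, length, species in rows:
    cube[(width, length, species)] += 1

print(cube[("high", "high", "Virginica")])  # 3
print(cube[("low", "high", "Setosa")])      # 0 -- unspecified tuples are 0
```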
Example: Iris data • We show how the attributes, petal length, petal
width, and species type can be converted to a multidimensional array – First, we discretized the petal width and length to
have categorical values: low, medium, and high
– We get the following table – note the count attribute
Example: Iris data (continued) • Each unique tuple of petal width, petal length,
and species type identifies one element of the array.
• This element is assigned the corresponding count value.
• The figure illustrates the result.
• All nonspecified tuples are 0.
Example: Iris data (continued) • Slices of the multidimensional array are shown
by the following crosstabulations
• What do these tables tell us?
OLAP Operations: Data Cube • The key operation of OLAP is the formation of a
data cube • A data cube is a multidimensional representation
of data, together with all possible aggregates. • By all possible aggregates, we mean the aggregates
that result by selecting a proper subset of the dimensions and summing over all remaining dimensions.
• For example, if we choose the species type dimension of the Iris data and sum over all other dimensions, the result will be a one-dimensional array with three entries, each of which gives the number of flowers of each type.
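Summing a multidimensional array over subsets of its dimensions produces every aggregate in the cube; a sketch with a made-up 3-D sales array (dimensions: product, store, date):

```python
import numpy as np

# Made-up sales array with shape (product, store, date) = (2, 3, 4)
sales = np.arange(24).reshape(2, 3, 4)

by_product_store = sales.sum(axis=2)   # a 2-D aggregate (sum over dates)
by_product = sales.sum(axis=(1, 2))    # a 1-D aggregate (sum over stores, dates)
total = sales.sum()                    # the 0-D aggregate (overall total)

# Every aggregate sums to the same overall total
assert total == by_product.sum() == by_product_store.sum()
print(total)  # 276
```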
Data Cube Example • Consider a data set that records the sales of
products at a number of company stores at various dates.
• This data can be represented as a 3 dimensional array
• There are 3 two-dimensional aggregates (3 choose 2), 3 one-dimensional aggregates, and 1 zero-dimensional aggregate (the overall total)
Data Cube Example (continued) • The following table shows one of the two-dimensional aggregates, along with two of the one-dimensional aggregates, and the overall total
OLAP Operations: Slicing and Dicing • Slicing is selecting a group of cells from the
entire multidimensional array by specifying a specific value for one or more dimensions.
• Dicing involves selecting a subset of cells by specifying a range of attribute values.
– This is equivalent to defining a subarray from the complete array.
• In practice, both operations can also be accompanied by aggregation over some dimensions.
OLAP Operations: Roll-up and Drill-down
• Attribute values often have a hierarchical structure. – Each date is associated with a year, month, and week.
– A location is associated with a continent, country, state (province, etc.), and city.
– Products can be divided into various categories, such as clothing, electronics, and furniture.
• Note that these categories often nest and form a tree or lattice – A year contains months, which contain days
– A country contains a state which contains a city
OLAP Operations: Roll-up and Drill-down
• This hierarchical structure gives rise to the roll up and drilldown operations.
– For sales data, we can aggregate (roll up) the sales across all the dates in a month.
– Conversely, given a view of the data where the time dimension is broken into months, we could split the monthly sales totals (drill down) into daily sales totals.
– Likewise, we can drill down or roll up on the location or product ID attributes.
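A roll-up along the date hierarchy can be sketched as grouping daily totals into monthly totals; drilling down is the inverse view. The daily figures below are made up:

```python
from collections import defaultdict

# Made-up daily sales, keyed by (year, month, day)
daily = {
    (2004, 1, 1): 10, (2004, 1, 2): 12, (2004, 1, 3): 8,
    (2004, 2, 1): 20, (2004, 2, 2): 5,
}

# Roll up: aggregate daily sales into monthly totals
monthly = defaultdict(int)
for (year, month, day), amount in daily.items():
    monthly[(year, month)] += amount

print(monthly[(2004, 1)])  # 30
print(monthly[(2004, 2)])  # 25

# Drill down: go back from a monthly total to the per-day values
jan_days = {d: v for d, v in daily.items() if d[:2] == (2004, 1)}
print(len(jan_days))  # 3
```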