
Discussion:

What’s simple random sampling? Is it possible to sample data instances using a distribution different from the uniform distribution? If so, give an example of a probability distribution of the data instances that is different from uniform (i.e., equal probability).

Chapter 2 Q&A: See the attachments 

Chapter 2 video: https://www.screencast.com/t/swJSFMPlDU (Part 1)

                 https://www.screencast.com/t/TZZrWRgXu6f (Part 2)

Chapter 3 Q&A: See the attachments

Textbook: Read Tan, Steinbach, & Kumar – Chapter 3 – Exploring Data

Intro to Data Mining

Chapter 2 Assignment

1. What’s an attribute? What’s a data instance?

2. What’s noise? How can noise be reduced in a dataset?

3. Define outlier. Describe 2 different approaches to detect outliers in a dataset.

4. Describe 3 different techniques to deal with missing values in a dataset. Explain when each of these techniques would be most appropriate.

5. Given a sample dataset with missing values, apply an appropriate technique to deal with them.

6. Give 2 examples in which aggregation is useful.

7. Given a sample dataset, apply aggregation of data values.

8. What’s sampling?

9. What’s simple random sampling? Is it possible to sample data instances using a distribution different from the uniform distribution? If so, give an example of a probability distribution of the data instances that is different from uniform (i.e., equal probability).

10. What’s stratified sampling?

11. What’s “the curse of dimensionality”?

12. Provide a brief description of what Principal Components Analysis (PCA) does. [Hint: See Appendix A and your lecture notes.] State what the input and the output of PCA are.

13. What’s the difference between dimensionality reduction and feature selection?

14. Describe in detail 2 different techniques for feature selection.

15. Given a sample dataset (represented by a set of attributes, a correlation matrix, a covariance matrix, …), apply feature selection techniques to select the best attributes to keep (or equivalently, the best attributes to remove).

16. What’s the difference between feature selection and feature extraction?

17. Give two examples of data in which feature extraction would be useful.

18. Given a sample dataset, apply feature extraction.

19. What’s data discretization and when is it needed?

20. What’s the difference between supervised and unsupervised discretization?

21. Given a sample dataset, apply unsupervised (e.g., equal width, equal frequency) discretization, or supervised discretization (e.g., using entropy).

22. Describe 2 approaches to handle nominal attributes with too many values.

23. Given a dataset, apply variable transformation: Either a simple given function, normalization, or standardization.

24. Define correlation and covariance, and explain how to use them in data pre-processing (see pp. 76-78).

Intro to Data Mining

Chapter 3 Assignment

[Your Name Here]

1. Obtain one of the data sets available at the UCI Machine Learning Repository and apply as many of the different visualization techniques described in the chapter as possible. The bibliographic notes and book Web site provide pointers to visualization software.

2. Identify at least two advantages and two disadvantages of using color to visually represent information.

3. What are the arrangement issues that arise with respect to three-dimensional plots?

4. Discuss the advantages and disadvantages of using sampling to reduce the number of data objects that need to be displayed. Would simple random sampling (without replacement) be a good approach to sampling? Why or why not?

5. Describe how you would create visualizations to display information that describes the following types of systems.

a) Computer networks. Be sure to include both the static aspects of the network, such as connectivity, and the dynamic aspects, such as traffic.

b) The distribution of specific plant and animal species around the world for a specific moment in time.

c) The use of computer resources, such as processor time, main memory, and disk, for a set of benchmark database programs.

d) The change in occupation of workers in a particular country over the last thirty years. Assume that you have yearly information about each person that also includes gender and level of education.

Be sure to address the following issues:

· Representation. How will you map objects, attributes, and relationships to visual elements?

· Arrangement. Are there any special considerations that need to be taken into account with respect to how visual elements are displayed? Specific examples might be the choice of viewpoint, the use of transparency, or the separation of certain groups of objects.

· Selection. How will you handle a large number of attributes and data objects?

6. Describe one advantage and one disadvantage of a stem and leaf plot with respect to a standard histogram.

7. How might you address the problem that a histogram depends on the number and location of the bins?

8. Describe how a box plot can give information about whether the value of an attribute is symmetrically distributed. What can you say about the symmetry of the distributions of the attributes shown in Figure 3.11?

9. Compare sepal length, sepal width, petal length, and petal width, using Figure 3.12.

10. Comment on the use of a box plot to explore a data set with four attributes: age, weight, height, and income.

11. Give a possible explanation as to why most of the values of petal length and width fall in the buckets along the diagonal in Figure 3.9.

12. Use Figures 3.14 and 3.15 to identify a characteristic shared by the petal width and petal length attributes.

13. Simple line plots, such as that displayed in Figure 2.12 on page 56, which shows two time series, can be used to effectively display high-dimensional data. For example, in Figure 2.12 it is easy to tell that the frequencies of the two time series are different. What characteristic of time series allows the effective visualization of high-dimensional data?

14. Describe the types of situations that produce sparse or dense data cubes. Illustrate with examples other than those used in the book.

15. How might you extend the notion of multidimensional data analysis so that the target variable is a qualitative variable? In other words, what sorts of summary statistics or data visualizations would be of interest?

16. Construct a data cube from Table 3.14. Is this a dense or sparse data cube? If it is sparse, identify the cells that are empty. 

17. Discuss the differences between dimensionality reduction based on aggregation and dimensionality reduction based on techniques such as PCA and SVD.

Data Mining: Data

Lecture Notes for Chapter 2

Introduction to Data Mining

by

Tan, Steinbach, Kumar

(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002

What is Data?

  • Collection of data objects and their attributes
  • An attribute is a property or characteristic of an object
  • Examples: eye color of a person, temperature, etc.
  • Attribute is also known as variable, field, characteristic, or feature
  • A collection of attributes describes an object
  • Object is also known as record, point, case, sample, entity, or instance

(Figure: a tabular data set; columns are attributes, rows are objects)


Attribute Values

  • Attribute values are numbers or symbols assigned to an attribute
  • Distinction between attributes and attribute values
  • Same attribute can be mapped to different attribute values
  • Example: height can be measured in feet or meters
  • Different attributes can be mapped to the same set of values
  • Example: Attribute values for ID and age are integers
  • But properties of attribute values can be different

ID has no limit but age has a maximum and minimum value


Measurement of Length

  • The way you measure an attribute may not match the attribute’s properties.


Types of Attributes

  • There are different types of attributes
  • Nominal
  • Examples: ID numbers, eye color, zip codes
  • Ordinal
  • Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
  • Interval
  • Examples: calendar dates, temperatures in Celsius or Fahrenheit.
  • Ratio
  • Examples: temperature in Kelvin, length, time, counts


Properties of Attribute Values

  • The type of an attribute depends on which of the following properties it possesses:
  • Distinctness: = and ≠
  • Order: < >
  • Addition: + –
  • Multiplication: * /
  • Nominal attribute: distinctness
  • Ordinal attribute: distinctness & order
  • Interval attribute: distinctness, order & addition
  • Ratio attribute: all 4 properties


Discrete and Continuous Attributes

  • Discrete Attribute
  • Has only a finite or countably infinite set of values
  • Examples: zip codes, counts, or the set of words in a collection of documents
  • Often represented as integer variables.
  • Note: binary attributes are a special case of discrete attributes
  • Continuous Attribute
  • Has real numbers as attribute values
  • Examples: temperature, height, or weight.
  • Practically, real values can only be measured and represented using a finite number of digits.
  • Continuous attributes are typically represented as floating-point variables.


Types of data sets

  • Record
  • Data Matrix
  • Document Data
  • Transaction Data
  • Graph
  • World Wide Web
  • Molecular Structures
  • Ordered
  • Spatial Data
  • Temporal Data
  • Sequential Data
  • Genetic Sequence Data


Important Characteristics of Structured Data

  • Dimensionality
  • Curse of Dimensionality
  • Sparsity
  • Only presence counts
  • Resolution
  • Patterns depend on the scale


Record Data

  • Data that consists of a collection of records, each of which consists of a fixed set of attributes


Tid   Refund   Marital Status   Taxable Income   Cheat
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes

Data Matrix

  • If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute
  • Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute

Projection of x Load   Projection of y Load   Distance   Load   Thickness
10.23                  5.27                   15.22      2.7    1.2
12.65                  6.25                   16.22      2.2    1.1


Document Data

  • Each document becomes a ‘term’ vector,
  • each term is a component (attribute) of the vector,
  • the value of each component is the number of times the corresponding term occurs in the document.


Term frequencies for three documents:

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1    3     0      5     0     2      6     0     2      0        2
Document 2    0     7      0     2     1      0     0     3      0        0
Document 3    0     1      0     0     1      2     2     0      3        0

Transaction Data

  • A special type of record data, where
  • each record (transaction) involves a set of items.
  • For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitute a transaction, while the individual products that were purchased are the items.


TID   Items
1     Bread, Coke, Milk
2     Beer, Bread
3     Beer, Coke, Diaper, Milk
4     Beer, Bread, Diaper, Milk
5     Coke, Diaper, Milk

Graph Data

  • Examples: Generic graph and HTML Links


Chemical Data

  • Benzene Molecule: C6H6


Ordered Data

  • Sequences of transactions

(Figure: a sequence of transactions; each element of the sequence contains a set of items/events)


Ordered Data

  • Genomic sequence data


Ordered Data

  • Spatio-Temporal Data

Average Monthly Temperature of land and ocean


Data Quality

  • What kinds of data quality problems?
  • How can we detect problems with the data?
  • What can we do about these problems?
  • Examples of data quality problems:
  • Noise and outliers
  • missing values
  • duplicate data


Noise

  • Noise refers to modification of original values
  • Examples: distortion of a person’s voice when talking on a poor phone and “snow” on television screen

(Figure: two sine waves, and the same two sine waves with noise added)


Outliers

  • Outliers are data objects with characteristics that are considerably different than most of the other data objects in the data set


Missing Values

  • Reasons for missing values
  • Information is not collected
    (e.g., people decline to give their age and weight)
  • Attributes may not be applicable to all cases
    (e.g., annual income is not applicable to children)
  • Handling missing values
  • Eliminate Data Objects
  • Estimate Missing Values
  • Ignore the Missing Value During Analysis
  • Replace with all possible values (weighted by their probabilities)


Duplicate Data

  • Data set may include data objects that are duplicates, or almost duplicates of one another
  • Major issue when merging data from heterogeneous sources
  • Examples:
  • Same person with multiple email addresses
  • Data cleaning
  • Process of dealing with duplicate data issues


Data Preprocessing

  • Aggregation
  • Sampling
  • Dimensionality Reduction
  • Feature subset selection
  • Feature creation
  • Discretization and Binarization
  • Attribute Transformation


Aggregation

  • Combining two or more attributes (or objects) into a single attribute (or object)
  • Purpose
  • Data reduction
  • Reduce the number of attributes or objects
  • Change of scale
  • Cities aggregated into regions, states, countries, etc.
  • More “stable” data
  • Aggregated data tends to have less variability (a sketch follows below)
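A minimal sketch of aggregation in Python with pandas (the data set and column names are hypothetical, for illustration only): daily values are rolled up into monthly and yearly totals, and the relative variability of the more aggregated series is smaller.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    # Hypothetical daily precipitation records (about ten years).
    df = pd.DataFrame({
        "date": pd.date_range("2010-01-01", periods=3650, freq="D"),
        "precip": rng.gamma(2.0, 3.0, size=3650),
    })

    # Change of scale: aggregate days into months and years.
    monthly = df.groupby(df["date"].dt.to_period("M"))["precip"].sum()
    yearly = df.groupby(df["date"].dt.to_period("Y"))["precip"].sum()

    # Aggregated data tends to have less relative variability.
    print(monthly.std() / monthly.mean())
    print(yearly.std() / yearly.mean())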


Aggregation

(Figure: Variation of Precipitation in Australia – standard deviation of average monthly precipitation vs. standard deviation of average yearly precipitation; the yearly aggregates vary less)


Sampling

  • Sampling is the main technique employed for data selection.
  • It is often used for both the preliminary investigation of the data and the final data analysis.
  • Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming.
  • Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming.


Sampling …

  • The key principle for effective sampling is the following:
  • using a sample will work almost as well as using the entire data set, if the sample is representative
  • A sample is representative if it has approximately the same property (of interest) as the original set of data


Types of Sampling

  • Simple Random Sampling
  • There is an equal probability of selecting any particular item
  • Sampling without replacement
  • As each item is selected, it is removed from the population
  • Sampling with replacement
  • Objects are not removed from the population as they are selected for the sample.
  • In sampling with replacement, the same object can be picked up more than once
  • Stratified sampling
  • Split the data into several partitions; then draw random samples from each partition (a sketch of all of these schemes follows below)
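A minimal sketch of these sampling schemes with numpy (the population array is hypothetical). The p= argument of random choice also answers the discussion question earlier in this packet: instances can be sampled under any probability distribution, not just the uniform one.

    import numpy as np

    rng = np.random.default_rng(42)
    population = np.arange(1000)      # hypothetical data instance IDs
    strata = population % 4           # hypothetical partition (stratum) labels

    # Simple random sampling: equal probability of selecting any item.
    without_repl = rng.choice(population, size=100, replace=False)
    with_repl = rng.choice(population, size=100, replace=True)

    # Non-uniform sampling: selection probability proportional to a weight.
    weights = (population + 1) / (population + 1).sum()
    biased = rng.choice(population, size=100, replace=True, p=weights)

    # Stratified sampling: draw a random sample from each partition.
    stratified = np.concatenate([
        rng.choice(population[strata == s], size=25, replace=False)
        for s in np.unique(strata)
    ])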


Sample Size

(Figure: the same data set drawn with 8000, 2000, and 500 sampled points)


Sample Size

  • What sample size is necessary to get at least one object from each of 10 groups?


Curse of Dimensionality

  • When dimensionality increases, data becomes increasingly sparse in the space that it occupies
  • Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful
  • Experiment: randomly generate 500 points and compute the difference between the max and min distance between any pair of points (a sketch follows below)
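A minimal sketch of this experiment (assuming points drawn uniformly from the unit hypercube), showing how log10((max − min) / min) of the pairwise distances shrinks as dimensionality grows:

    import numpy as np
    from scipy.spatial.distance import pdist

    rng = np.random.default_rng(0)
    for d in (2, 10, 50, 200):
        points = rng.random((500, d))   # 500 random points in d dimensions
        dists = pdist(points)           # all pairwise Euclidean distances
        spread = np.log10((dists.max() - dists.min()) / dists.min())
        print(d, round(spread, 2))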


Dimensionality Reduction

  • Purpose:
  • Avoid curse of dimensionality
  • Reduce amount of time and memory required by data mining algorithms
  • Allow data to be more easily visualized
  • May help to eliminate irrelevant features or reduce noise
  • Techniques
  • Principal Components Analysis (PCA)
  • Singular Value Decomposition
  • Others: supervised and non-linear techniques


Dimensionality Reduction: PCA

  • Goal is to find a projection that captures the largest amount of variation in data

(Figure: two-dimensional data in the x1–x2 plane; the vector e points in the direction of largest variation)


Dimensionality Reduction: PCA

  • Find the eigenvectors of the covariance matrix
  • The eigenvectors define the new space (a sketch follows below)

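A minimal sketch of PCA exactly as described above, eigenvectors of the covariance matrix defining the new space (the input data here is hypothetical):

    import numpy as np

    rng = np.random.default_rng(0)
    # Hypothetical correlated two-dimensional data.
    X = rng.multivariate_normal([0, 0], [[3.0, 2.0], [2.0, 2.0]], size=200)

    cov = np.cov(X, rowvar=False)            # covariance matrix of the attributes
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
    components = eigvecs[:, np.argsort(eigvals)[::-1]]  # largest variance first

    # Project the centered data onto the new space; keeping only the first
    # column reduces dimensionality while capturing the most variation.
    projected = (X - X.mean(axis=0)) @ components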


Dimensionality Reduction: ISOMAP

  • Construct a neighbourhood graph
  • For each pair of points in the graph, compute the shortest path distances – geodesic distances

By: Tenenbaum, de Silva, Langford (2000)


Dimensionality Reduction: PCA

(Figure: images reconstructed from 10, 40, 80, 120, 160, and 206 dimensions)

Feature Subset Selection

  • Another way to reduce dimensionality of data
  • Redundant features
  • duplicate much or all of the information contained in one or more other attributes
  • Example: purchase price of a product and the amount of sales tax paid
  • Irrelevant features
  • contain no information that is useful for the data mining task at hand
  • Example: students’ ID is often irrelevant to the task of predicting students’ GPA


Feature Subset Selection

  • Techniques:
  • Brute-force approach:
  • Try all possible feature subsets as input to data mining algorithm
  • Embedded approaches:
  • Feature selection occurs naturally as part of the data mining algorithm
  • Filter approaches:
  • Features are selected before data mining algorithm is run (a sketch of one such filter follows below)
  • Wrapper approaches:
  • Use the data mining algorithm as a black box to find best subset of attributes
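A minimal sketch of one possible filter approach (the threshold and data are arbitrary assumptions, not from the slides): drop any feature that is almost perfectly correlated with a feature already kept, since it is then largely redundant.

    import numpy as np

    def filter_redundant(X, threshold=0.95):
        # Return column indices of X, dropping any feature that is highly
        # correlated with an already-kept feature (a simple redundancy filter).
        corr = np.abs(np.corrcoef(X, rowvar=False))
        kept = []
        for j in range(X.shape[1]):
            if all(corr[j, k] < threshold for k in kept):
                kept.append(j)
        return kept

    # Column 1 duplicates column 0 (like purchase price vs. sales tax paid).
    rng = np.random.default_rng(0)
    a = rng.random(100)
    X = np.column_stack([a, 0.08 * a, rng.random(100)])
    print(filter_redundant(X))   # [0, 2]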


Feature Creation

  • Create new attributes that can capture the important information in a data set much more efficiently than the original attributes
  • Three general methodologies:
  • Feature Extraction
  • domain-specific
  • Mapping Data to New Space
  • Feature Construction
  • combining features


Mapping Data to a New Space

(Figure: two sine waves, the same waves with noise added, and their frequency-domain representation)

  • Fourier transform
  • Wavelet transform


Discretization Using Class Labels

  • Entropy based approach

(Figure: entropy-based discretization with 3 categories for both x and y, and with 5 categories for both x and y)


Discretization Without Using Class Labels

(Figure: the original data, then discretized using equal interval width, equal frequency, and K-means; a sketch of the first two follows below)

Attribute Transformation

  • A function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values
  • Simple functions: x^k, log(x), e^x, |x|
  • Standardization and normalization (a sketch follows below)
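A minimal sketch of these transformations (the attribute values are hypothetical):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(loc=50, scale=10, size=1000)       # hypothetical attribute

    simple = np.log(np.abs(x))                        # a simple function, log|x|
    normalized = (x - x.min()) / (x.max() - x.min())  # normalization to [0, 1]
    standardized = (x - x.mean()) / x.std()           # standardization: mean 0, std 1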


Similarity and Dissimilarity

  • Similarity
  • Numerical measure of how alike two data objects are.
  • Is higher when objects are more alike.
  • Often falls in the range [0,1]
  • Dissimilarity
  • Numerical measure of how different two data objects are
  • Lower when objects are more alike
  • Minimum dissimilarity is often 0
  • Upper limit varies
  • Proximity refers to a similarity or dissimilarity


Similarity/Dissimilarity for Simple Attributes

p and q are the attribute values for two data objects.


Euclidean Distance

  • Euclidean Distance:

    $\mathrm{dist}(p,q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}$

where n is the number of dimensions (attributes) and $p_k$ and $q_k$ are, respectively, the kth attributes (components) of data objects p and q.

  • Standardization is necessary, if scales differ.


Euclidean Distance

Distance Matrix

point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

        p1      p2      p3      p4
p1      0       2.828   3.162   5.099
p2      2.828   0       1.414   3.162
p3      3.162   1.414   0       2
p4      5.099   3.162   2       0

Minkowski Distance

  • Minkowski Distance is a generalization of Euclidean Distance:

    $\mathrm{dist}(p,q) = \left( \sum_{k=1}^{n} |p_k - q_k|^r \right)^{1/r}$

where r is a parameter, n is the number of dimensions (attributes) and $p_k$ and $q_k$ are, respectively, the kth attributes (components) of data objects p and q.

Minkowski Distance: Examples

  • r = 1. City block (Manhattan, taxicab, L1 norm) distance.
  • A common example of this is the Hamming distance, which is just the number of bits that are different between two binary vectors
  • r = 2. Euclidean distance
  • r → ∞. “supremum” (Lmax norm, L∞ norm) distance.
  • This is the maximum difference between any component of the vectors
  • Do not confuse r with n, i.e., all these distances are defined for all numbers of dimensions.


Minkowski Distance

Distance Matrix

point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

L1      p1    p2    p3    p4
p1      0     4     4     6
p2      4     0     2     4
p3      4     2     0     2
p4      6     4     2     0

L2      p1      p2      p3      p4
p1      0       2.828   3.162   5.099
p2      2.828   0       1.414   3.162
p3      3.162   1.414   0       2
p4      5.099   3.162   2       0

L∞      p1    p2    p3    p4
p1      0     2     3     5
p2      2     0     1     3
p3      3     1     0     2
p4      5     3     2     0
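A minimal sketch that recomputes the three distance matrices above for p1–p4:

    import numpy as np

    points = np.array([[0, 2], [2, 0], [3, 1], [5, 1]])   # p1..p4

    # Component-wise |p_i - p_j| for every pair, shape (4, 4, 2).
    diff = np.abs(points[:, None, :] - points[None, :, :])

    L1 = diff.sum(axis=2)                    # r = 1: city block
    L2 = np.sqrt((diff ** 2).sum(axis=2))    # r = 2: Euclidean
    Linf = diff.max(axis=2)                  # r -> infinity: supremum

    print(L1)
    print(np.round(L2, 3))   # matches the L2 matrix shown above
    print(Linf)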

Mahalanobis Distance

    $\mathrm{mahalanobis}(p,q) = (p-q)^{T}\, \Sigma^{-1}\, (p-q)$

where Σ is the covariance matrix of the input data X:

    $\Sigma_{j,k} = \frac{1}{n-1} \sum_{i=1}^{n} (X_{ij}-\bar{X}_{j})(X_{ik}-\bar{X}_{k})$

For the red points in the figure, the Euclidean distance is 14.7 and the Mahalanobis distance is 6.

Mahalanobis Distance

Covariance Matrix:

    $\Sigma = \begin{bmatrix} 0.3 & 0.2 \\ 0.2 & 0.3 \end{bmatrix}$

A: (0.5, 0.5)
B: (0, 1)
C: (1.5, 1.5)

Mahal(A,B) = 5
Mahal(A,C) = 4
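A minimal sketch that reproduces this example:

    import numpy as np

    def mahalanobis(p, q, cov):
        # (p - q)^T * inv(cov) * (p - q), as defined above
        d = np.asarray(p, dtype=float) - np.asarray(q, dtype=float)
        return d @ np.linalg.inv(cov) @ d

    cov = np.array([[0.3, 0.2], [0.2, 0.3]])
    A, B, C = (0.5, 0.5), (0.0, 1.0), (1.5, 1.5)

    print(mahalanobis(A, B, cov))   # 5.0
    print(mahalanobis(A, C, cov))   # 4.0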


Common Properties of a Distance

  • Distances, such as the Euclidean distance, have some well known properties.

d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if
p = q. (Positive definiteness)

d(p, q) = d(q, p) for all p and q. (Symmetry)

d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r.
(Triangle Inequality)

where d(p, q) is the distance (dissimilarity) between points (data objects), p and q.

  • A distance that satisfies these properties is a metric


Common Properties of a Similarity

  • Similarities also have some well-known properties.

s(p, q) = 1 (or maximum similarity) only if p = q.

s(p, q) = s(q, p) for all p and q. (Symmetry)

where s(p, q) is the similarity between points (data objects), p and q.


Similarity Between Binary Vectors

  • A common situation is that objects, p and q, have only binary attributes
  • Compute similarities using the following quantities

M01 = the number of attributes where p was 0 and q was 1

M10 = the number of attributes where p was 1 and q was 0

M00 = the number of attributes where p was 0 and q was 0

M11 = the number of attributes where p was 1 and q was 1

  • Simple Matching and Jaccard Coefficients

SMC = number of matches / number of attributes

= (M11 + M00) / (M01 + M10 + M11 + M00)

J = number of 11 matches / number of not-both-zero attributes values

= (M11) / (M01 + M10 + M11)


SMC versus Jaccard: Example

p = 1 0 0 0 0 0 0 0 0 0

q = 0 0 0 0 0 0 1 0 0 1

M01 = 2 (the number of attributes where p was 0 and q was 1)

M10 = 1 (the number of attributes where p was 1 and q was 0)

M00 = 7 (the number of attributes where p was 0 and q was 0)

M11 = 0 (the number of attributes where p was 1 and q was 1)

SMC = (M11 + M00)/(M01 + M10 + M11 + M00) = (0+7) / (2+1+0+7) = 0.7

J = (M11) / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
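A minimal sketch that reproduces this example:

    import numpy as np

    p = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
    q = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])

    m01 = np.sum((p == 0) & (q == 1))
    m10 = np.sum((p == 1) & (q == 0))
    m00 = np.sum((p == 0) & (q == 0))
    m11 = np.sum((p == 1) & (q == 1))

    smc = (m11 + m00) / (m01 + m10 + m11 + m00)   # 0.7
    jaccard = m11 / (m01 + m10 + m11)             # 0.0
    print(smc, jaccard)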


Cosine Similarity

  • If d1 and d2 are two document vectors, then

cos( d1, d2 ) = (d1 • d2) / (||d1|| ||d2||),

where • indicates the vector dot product and ||d|| is the length of vector d.

  • Example:

d1 = 3 2 0 5 0 0 0 2 0 0

d2 = 1 0 0 0 0 0 0 1 0 2

d1 • d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5

||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481

||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449

cos( d1, d2 ) = 5 / (6.481 * 2.449) = 0.3150
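A minimal sketch that reproduces this computation:

    import numpy as np

    d1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0])
    d2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2])

    cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
    print(round(cos, 4))   # 0.315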


Extended Jaccard Coefficient (Tanimoto)

  • Variation of Jaccard for continuous or count attributes:

    $EJ(p, q) = \frac{p \cdot q}{\|p\|^2 + \|q\|^2 - p \cdot q}$

  • Reduces to Jaccard for binary attributes


Correlation

  • Correlation measures the linear relationship between objects
  • To compute correlation, we standardize the data objects, p and q, and then take their dot product:

    $p'_k = (p_k - \mathrm{mean}(p)) / \mathrm{std}(p)$

    $q'_k = (q_k - \mathrm{mean}(q)) / \mathrm{std}(q)$

    $\mathrm{correlation}(p, q) = p' \cdot q'$
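A minimal sketch of this recipe (the two vectors are hypothetical). Note that the dot product of the standardized vectors must be divided by the number of attributes n for the result to fall in [−1, 1]; the slide leaves this scaling implicit.

    import numpy as np

    rng = np.random.default_rng(0)
    p = rng.random(50)
    q = 2 * p + rng.normal(0, 0.1, size=50)   # hypothetical, strongly related

    ps = (p - p.mean()) / p.std()   # standardize p
    qs = (q - q.mean()) / q.std()   # standardize q

    corr = ps @ qs / len(p)         # dot product scaled by n
    print(corr, np.corrcoef(p, q)[0, 1])   # the two values agree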


Visually Evaluating Correlation

Scatter plots showing the similarity from –1 to 1.


General Approach for Combining Similarities

  • Sometimes attributes are of many different types, but an overall similarity is needed.


Using Weights to Combine Similarities

  • May not want to treat all attributes the same.
  • Use weights wk which are between 0 and 1 and sum to 1.


Density

  • Density-based clustering requires a notion of density
  • Examples:
  • Euclidean density
  • Euclidean density = number of points per unit volume
  • Probability density
  • Graph-based density


Euclidean Density – Cell-based

  • Simplest approach is to divide the region into a number of rectangular cells of equal volume and define density as the number of points each cell contains


Euclidean Density – Center-based

  • Euclidean density is the number of points within a specified radius of the point




Data Mining: Exploring Data

Lecture Notes for Chapter 3 Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler

Look for accompanying R code on the course web site.


Techniques Used In Data Exploration

 In EDA, as originally defined by Tukey:
– The focus was on visualization
– Clustering and anomaly detection were viewed as exploratory techniques
– In data mining, clustering and anomaly detection are major areas of interest, and not thought of as just exploratory

 In our discussion of data exploration, we focus on:
– Summary statistics
– Visualization
– Online Analytical Processing (OLAP)

What is data exploration?

A preliminary exploration of the data to better understand its characteristics.

• Key motivations of data exploration include:
– Helping to select the right tool for preprocessing or analysis
– Making use of humans’ abilities to recognize patterns
• People can recognize patterns not captured by data analysis tools

• Related to the area of Exploratory Data Analysis (EDA)
– Created by statistician John Tukey
– Seminal book is “Exploratory Data Analysis” by Tukey
– A nice online introduction can be found in Chapter 1 of the NIST Engineering Statistics Handbook: http://www.itl.nist.gov/div898/handbook/index.htm

Iris Sample Data Set

• Many of the exploratory data techniques are illustrated with the Iris Plant data set.
– Can be obtained from the UCI Machine Learning Repository http://www.ics.uci.edu/~mlearn/MLRepository.html
– From the statistician R.A. Fisher
– Three flower types (classes):
• Setosa
• Virginica
• Versicolour
– Four (non-class) attributes:
• Sepal width and length
• Petal width and length

(Image credit: Virginica. Robert H. Mohlenbrock. USDA NRCS. 1995. Northeast wetland flora: Field office guide to plant species. Northeast National Technical Center, Chester, PA. Courtesy of USDA NRCS Wetland Science Institute.)

Summary Statistics

• Summary statistics are numbers that summarize properties of the data
– Summarized properties include location and spread for continuous data
• Examples: location – mean; spread – standard deviation
– Most summary statistics can be calculated in a single pass through the data

Frequency and Mode

• The frequency of an attribute value is the percentage of time the value occurs in the data set
– For example, given the attribute ‘gender’ and a representative population of people, the gender ‘female’ occurs about 50% of the time.

• The mode of an attribute is the most frequent attribute value

• The notions of frequency and mode are typically used with categorical data

Measures of Location: Mean and Median

• The mean is the most common measure of the location of a set of points.
• However, the mean is very sensitive to outliers.
• Thus, the median or a trimmed mean is also commonly used.

Measures of Spread: Range and Variance

• Range is the difference between the max and min.
• The variance or standard deviation is the most common measure of the spread of a set of points.
• However, these are also sensitive to outliers, so other measures are often used.

Percentiles

• For an ordinal or continuous attribute x and a number p between 0 and 100, the pth percentile $x_p$ is a value of x such that p% of the observed values of x are less than $x_p$.
• Median – 50% of the cases have a smaller value and 50% are larger.

Multivariate Summary Statistics

Object   x1   x2
1        12   15
2        2    4
…        …    …
m        18   4

• Covariance between features i and j:

    $s_{ij} = \frac{1}{m-1} \sum_{k=1}^{m} (x_{ki} - \bar{x}_i)(x_{kj} - \bar{x}_j)$

• Correlation:

    $r_{ij} = \frac{s_{ij}}{s_i s_j}$

where $s_i$ is the standard deviation of feature i (a sketch follows below).
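A minimal sketch of both statistics for a small data matrix (the third row is a hypothetical filler for the "…" row above):

    import numpy as np

    X = np.array([[12.0, 15.0],
                  [2.0, 4.0],
                  [7.0, 9.0],    # hypothetical row
                  [18.0, 4.0]])

    m = X.shape[0]
    xbar = X.mean(axis=0)
    S = (X - xbar).T @ (X - xbar) / (m - 1)   # covariance matrix s_ij
    s = np.sqrt(np.diag(S))                   # per-feature standard deviations
    R = S / np.outer(s, s)                    # correlation matrix r_ij

    # Sanity check against numpy's built-ins.
    assert np.allclose(S, np.cov(X, rowvar=False))
    assert np.allclose(R, np.corrcoef(X, rowvar=False))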

Topics

• Exploratory Data Analysis
• Summary Statistics
• Visualization

Visualization

Visualization is the conversion of data into a visual or tabular format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported.

• Visualization of data is one of the most powerful and appealing techniques for data exploration.
– Humans have a well-developed ability to analyze large amounts of information that is presented visually
– Can detect general patterns and trends
– Can detect outliers and unusual patterns

Example: Sea Surface Temperature

• The following shows the Sea Surface Temperature (SST) for July 1982
– Tens of thousands of data points are summarized in a single figure

Representation

• Is the mapping of information to a visual format
• Data objects, their attributes, and the relationships among data objects are translated into graphical elements such as points, lines, shapes, and colors.
• Example:
– Objects are often represented as points
– Their attribute values can be represented as the position of the points or the characteristics of the points, e.g., color, size, and shape
– If position is used, then the relationships of points, i.e., whether they form groups or a point is an outlier, are easily perceived.

Arrangement

• Is the placement of visual elements within a display
• Can make a large difference in how easy it is to understand the data

Selection

• Is the elimination or the de-emphasis of certain objects and attributes
• Selection may involve choosing a subset of attributes
– Dimensionality reduction is often used to reduce the number of dimensions to two or three
– Alternatively, pairs of attributes can be considered
• Selection may also involve choosing a subset of objects
– A region of the screen can only show so many points
– Can sample, but want to preserve points in sparse areas

Histograms

• Usually shows the distribution of values of a single variable
• Divide the values into bins and show a bar plot of the number of objects in each bin.
• The height of each bar indicates the number of objects
• The shape of the histogram depends on the number of bins
• Example: Petal Width (10 and 20 bins, respectively)

Empirical Cumulative Distribution Function (ECDF)

• Probability Density Function (PDF): describes the relative likelihood for this random variable to take on a given value
• Cumulative Distribution Function (CDF): shows the distribution of data as the fraction of points that are less than this value.

(Figures: a PDF, the corresponding CDF, and an example ECDF; a sketch follows below)
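A minimal sketch of computing an ECDF by hand (the sample is hypothetical): sort the values; the ECDF at the ith sorted value is i/n, the fraction of points at or below it.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=200)                     # hypothetical sample

    xs = np.sort(x)
    ecdf = np.arange(1, len(xs) + 1) / len(xs)   # fraction of points <= xs[i]

    # The fraction of values below 0 should be near 0.5 for this sample.
    print(ecdf[np.searchsorted(xs, 0.0)])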

Two-Dimensional Histograms

• Show the joint distribution of the values of two attributes
• Example: petal width and petal length
– What does this tell us?

Box Plots

• Invented by J. Tukey
• Another way of displaying the distribution of data
• The following figure shows the basic parts of a box plot

(Figure: anatomy of a box plot – the box spans the 25th percentile (Q1) to the 75th percentile (Q3) with the median (50th percentile) inside; whiskers extend to Q1 − 1.5 × IQR and Q3 + 1.5 × IQR, and points beyond the whiskers are outliers. For normally distributed data, Q1 and Q3 fall at about ±0.6745σ and the whiskers at about ±2.698σ, so the box covers 50% of the data, about 24.65% lies between the box and each whisker, and roughly 0.7% falls beyond the whiskers.)

Example of Box Plots

• Box plots can be used to compare attributes

Scatter Plots

– Attribute values determine the position
– Two-dimensional scatter plots are most common, but there can also be three-dimensional scatter plots
– Often, additional attributes can be displayed by using the size, shape, and color of the markers that represent the objects
– Arrays of scatter plots can compactly summarize the relationships of several pairs of attributes
• See example on the next slide

Scatter Plot Array of Iris Attributes

Contour Plots

– Useful when a continuous attribute is measured on a spatial grid
– They partition the plane into regions of similar values
– The contour lines that form the boundaries of these regions connect points with equal values
– The most common example is contour maps of elevation
– Can also display temperature, rainfall, air pressure, etc.
 An example for Sea Surface Temperature (SST) is provided on the next slide

Contour Plot Example: SST Dec, 1998

(Figure: SST contour map; the color bar gives temperature in Celsius)

Matrix Plots

– Can plot a data matrix
– Can be useful when objects are sorted according to class
– Typically, the attributes are normalized to prevent one attribute from dominating the plot
– Plots of similarity or distance matrices can also be useful for visualizing the relationships between objects

Visualization of the Iris Data Matrix

(Figure: matrix plot of the Iris data; cell values are shown as the deviation from the feature mean, measured in standard deviations)

Visualization of the Iris Correlation Matrix

Parallel Coordinates

– Used to plot the attribute values of high-dimensional data
– Instead of using perpendicular axes, use a set of parallel axes
– The attribute values of each object are plotted as a point on each corresponding coordinate axis, and the points are connected by a line
– Thus, each object is represented as a line
– Often, the lines representing a distinct class of objects group together, at least for some attributes
– Ordering of attributes is important in seeing such groupings

Parallel Coordinates Plots for Iris Data

Reordered features

Other Visualization Techniques

Star Plots
– Similar approach to parallel coordinates, but axes radiate from a central point
– The line connecting the values of an object is a polygon

Chernoff Faces
– Approach created by Herman Chernoff
– This approach associates each attribute with a characteristic of a face
– The values of each attribute determine the appearance of the corresponding facial characteristic
– Each object becomes a separate face
– Relies on humans’ ability to distinguish faces

Star Plots for Iris Data

(Figure: star plots for Setosa, Versicolor, and Virginica)

Chernoff Faces for Iris Data

(Figure: Chernoff faces for Setosa, Versicolor, and Virginica)

OLAP

• On-Line Analytical Processing (OLAP) was proposed by E. F. Codd, the father of the relational database.
• Relational databases put data into tables, while OLAP uses a multidimensional array representation.
– Such representations of data previously existed in statistics and other fields
• There are a number of data analysis and data exploration operations that are easier with such a data representation.

Creating a Multidimensional Array

• Two key steps in converting tabular data into a multidimensional array (a pivot-table sketch follows below):
– First, identify which attributes are to be the dimensions and which attribute is to be the target attribute whose values appear as entries in the multidimensional array.
• The attributes used as dimensions must have discrete values
• The target value is typically a count or continuous value, e.g., the cost of an item
• Can have no target variable at all except the count of objects that have the same set of attribute values
– Second, find the value of each entry in the multidimensional array by summing the values (of the target attribute) or count of all objects that have the attribute values corresponding to that entry.
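A minimal sketch of the two steps using a pandas pivot table (the sales records are hypothetical): the discrete attributes become the dimensions, and the target attribute is summed within each cell.

    import pandas as pd

    sales = pd.DataFrame({
        "date": ["Jan", "Jan", "Feb", "Feb", "Feb"],
        "store": ["S1", "S2", "S1", "S1", "S2"],
        "product": ["P1", "P1", "P2", "P1", "P2"],
        "amount": [10, 7, 3, 5, 8],
    })

    # Step 1: choose dimensions (date, store, product) and target (amount).
    # Step 2: sum the target over all records falling into each cell.
    cube = sales.pivot_table(index="date", columns=["store", "product"],
                             values="amount", aggfunc="sum", fill_value=0)

    # Aggregates (roll-ups) are sums over one or more dimensions:
    print(cube.sum(axis=1))        # total sales per date
    print(sales["amount"].sum())   # the zero-dimensional overall total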

Example: Iris data

• We show how the attributes petal length, petal width, and species type can be converted to a multidimensional array
– First, we discretized the petal width and length to have categorical values: low, medium, and high
– We get the following table – note the count attribute

Example: Iris data (continued)

• Each unique tuple of petal width, petal length, and species type identifies one element of the array.
• This element is assigned the corresponding count value.
• The figure illustrates the result.
• All non-specified tuples are 0.

Example: Iris data (continued)

• Slices of the multidimensional array are shown by the following cross-tabulations
• What do these tables tell us?

OLAP Operations: Data Cube

• The key operation of OLAP is the formation of a data cube
• A data cube is a multidimensional representation of data, together with all possible aggregates.
• By all possible aggregates, we mean the aggregates that result by selecting a proper subset of the dimensions and summing over all remaining dimensions.
• For example, if we choose the species type dimension of the Iris data and sum over all other dimensions, the result will be a one-dimensional array with three entries, each of which gives the number of flowers of each type.

Data Cube Example

• Consider a data set that records the sales of products at a number of company stores at various dates.
• This data can be represented as a 3-dimensional array
• There are 3 two-dimensional aggregates (3 choose 2), 3 one-dimensional aggregates, and 1 zero-dimensional aggregate (the overall total)

Data Cube Example (continued)

• The following figure shows one of the two-dimensional aggregates, along with two of the one-dimensional aggregates, and the overall total

OLAP Operations: Slicing and Dicing

• Slicing is selecting a group of cells from the entire multidimensional array by specifying a specific value for one or more dimensions.
• Dicing involves selecting a subset of cells by specifying a range of attribute values.
– This is equivalent to defining a subarray from the complete array.
• In practice, both operations can also be accompanied by aggregation over some dimensions.

OLAP Operations: Roll-up and Drill-down

• Attribute values often have a hierarchical structure.
– Each date is associated with a year, month, and week.
– A location is associated with a continent, country, state (province, etc.), and city.
– Products can be divided into various categories, such as clothing, electronics, and furniture.
• Note that these categories often nest and form a tree or lattice
– A year contains months, which contain days
– A country contains states, which contain cities

OLAP Operations: Roll-up and Drill-down

• This hierarchical structure gives rise to the roll-up and drill-down operations.
– For sales data, we can aggregate (roll up) the sales across all the dates in a month.
– Conversely, given a view of the data where the time dimension is broken into months, we could split the monthly sales totals (drill down) into daily sales totals.
– Likewise, we can drill down or roll up on the location or product ID attributes.