Cosine Similarity
94 Chapter 2 Data
(a) Relationship between Euclidean distance and the cosine measure.
(b) Relationship between Euclidean distance and correlation.
Figure 2.20. Graphs for Exercise 20.
Show that the set difference metric given by
d(A, B) : size(A – B) + si,ze(B – A) (2.19)
satisfies the metric axioms given on page 70. ,4. and B are sets and A – B is the set difference.
Discuss how you might map correlation values from the interval l-1,1] to the interval [0,1]. Note that the type of transformation that you use might depend on the application that you have in mind. Thus, consider two applications: clustering time series and predicting the behavior of one time series given an- other.
Given a similarity measure with values in the interval [0,1] describe two ways to transform this similarity value into a dissimilarity value in the interval l0,oo].
Proximity is typically defined between a pair of objects.
(a) Define two ways in which you might define the proximity among a group of objects.
(b) How might you define the distance between two sets of points in Euclidean space?
(c) How might you define the proximity between two sets of data objects? (Make no assumption about the data objects, except that a proximity measure is defined between any pair of objects.)
You are given a set of points ,9 in Euclidean space, as well as the distance of each point in ,S to a point x. (It does not matter if x e S.)
o o
q
o G o
l,rJ
o
o o c (5 o
o
I.JJ
2T,
22.
23.
, A
1 . 4
1 . 2
1
0.8
0.6
0.4
o.2
Correlation
25.
26.
27.
,9,
2.6, Exercises 95
(a) If the goal is to find all points within a specified distance e of point y, y * x, explain how you could use the triangle inequality and the already calculated distances to x to potentially reduce the number of distance calculations necessary? Hint: The triangle inequality, d(x,z) < d(x,y)* d(y,*), can be rewritten as d(x,y) 2 d(x, z) – d(y,z).
(b) In general, how would the distance between x and y affect the number of distance calculations?
(c) Suppose that you can find a small subset of points ,5′, from the original data set, such that every point in the data set is within a specified distance e of at least one of the points in ^91 and that you also have the pairwise
distance matrix for 51 Describe a technique that uses this information to compute, with a minimum of distance calculations, the set of all points
within a distance of B of a specified point from the data set.
Show that 1 minus the Jaccard similarity is a distance measure between two data objects, x and y, that satisfies the metric axioms given on page 70. Specifically, d ( * , y ) : 1 – J ( x , y ) .
Show that the distance measure defined as the angle between two data vectors, x and y, satisfies the metric axioms given on page 70. Specifically, d(*, y) :
arccos(cos(x, y)).
Explain why computing the proximity between two attributes is often simpler than computing the similarity between two objects.
Exploring Data
The previous chapter addressed high-level data issues that are important in the knowledge discovery process. This chapter provides an introduction to data exploration, which is a preliminary investigation of the data in order to better understand its specific characteristics. Data exploration can aid in selecting the appropriate preprocessing and data analysis techniques. It can even address some of the questions typically answered by data mining. For example, patterns can sometimes be found by visually inspecting the data. Also, some of the techniques used in data exploration, such as visualization, can be used to understand and interpret data mining results.
This chapter covers three major topics: summary statistics, visualization, and On-Line Analytical Processing (OLAP). Summary statistics, such as the mean and standard deviation of a set of values, and visualization techniques, such as histograms and scatter plots, are standard methods that are widely employed for data exploration. OLAP, which is a more recent development, consists of a set of techniques for exploring multidimensional arrays of values. OlAP-related analysis functions focus on various ways to create summary data tables from a multidimensional data array. These techniques include aggregating data either across various dimensions or across various attribute values. For instance, if we are given sales information reported according to product, location, and date, OLAP techniques can be used to create a summary that describes the sales activity at a particular location by month and product category.
The topics covered in this chapter have considerable overlap with the area known as Exploratory Data Analysis (EDA), which was created in the 1970s by the prominent statistician, John Tirkey. This chapter, like EDA, places a heavy emphasis on visualization. Unlike EDA, this chapter does not include topics such as cluster analysis or anomaly detection. There are two
98 Chapter 3 Exploring Data
reasons for this. First, data mining views descriptive data analysis techniques as an end in themselves, whereas statistics, from which EDA originated, tends to view hypothesis-based testing as the final goal. Second, cluster analysis and anomaly detection are large areas and require full chapters for an in- depth discussion. Hence, cluster analysis is covered in Chapters 8 and 9, while anomaly detection is discussed in Chapter 10.
3.1- The Iris Data Set
In the following discussion, we will often refer to the Iris data set that is available from the University of California at Irvine (UCI) Machine Learn- ing Repository. It consists of information on 150 Iris flowers, 50 each from one of three Iris species: Setosa, Versicolour, and Virginica. Each flower is characterized by five attributes:
1. sepal length in centimeters
2. sepal width in centimeters
3. petal length in centimeters
4. petal width in centimeters
5. class (Setosa, Versicolour, Virginica)
The sepals of a flower are the outer structures that protect the more fragile parts of the flower, such as the petals. In many flowers, the sepals are green, and only the petals are colorful. For Irises, however, the sepals are also colorful. As illustrated by the picture of a Virginica Iris in Figure 3.1, the sepals of an Iris are larger than the petals and are drooping, while the petals are upright.
3.2 Summary Statistics
Summary statistics are quantities, such as the mean and standard deviation, that capture various characteristics of a potentially large set of values with a single number or a small set of numbers. Everyday examples of summary statistics are the average household income or the fraction of college students who complete an undergraduate degree in four years. Indeed, for many people, summary statistics are the most visible manifestation of statistics. We will concentrate on summary statistics for the values of a single attribute, but will provide a brief description of some multivariate summary statistics.
3.2 Summary Statistics 99
Figure 3.1. Picture of lris Virginica. Robert H. Mohlenbrock @ USDA-NRCS PLANTS Database/ USDA NRCS. 1995. Northeast wetland flora: Field office guide to plant species. Northeast National Technical Center, Chester, PA. Background removed.
This section considers only the descriptive nature of summary statistics. However, as described in Appendix C, statistics views data as arising from an underlying statistical process that is characterized by various parameters, and some of the summary statistics discussed here can be viewed as estimates of statistical parameters of the underlying distribution that generated the data.
3.2.L Flequencies and the Mode
Given a set of unordered categorical values, there is not much that can be done to further characterize the values except to compute the frequency with which each value occurs for a particular set of data. Given a categorical attribute r, which can take values {rt,. . . ,1ri,. .. u7r} and a set of rn objects, the frequency of a value u; is defined as
frequency(u5) : number of objects with attribute value u;
(3 .1)
The mode of a categorical attribute is the value that has the highest frequency.
100 Chapter 3 Exploring Data
Example 3.1. Consider a set of students who have an attribute, class, which can take values from the set {f reshman, sophomore, junior, senior}. Table 3.1 shows the number of students for each value of the class attribute. The
mode of the class attribute is f reshman, with a frequency of 0.33. This may indicate dropouts due to attrition or a larger than usual freshman class.
Table 3.1. Class size for students in a hypothetical college.
Class Size FYequency freshman sophomore junior
senior
t40 160 130 170
0.33 0.27 0.22 0.18
T
Categorical attributes often, but not always, have a small number of values, and consequently, the mode and frequencies of these values can be interesting and useful. Notice, though, that for the Iris data set and the class attribute, the three types of flower all have the same frequency, and therefore, the notion of a mode is not interesting.
For continuous data, the mode, as currently defined, is often not useful because a single value may not occur more than once. Nonetheless, in some cases, the mode may indicate important information about the nature of the values or the presence of missing values. For example, the heights of 20 people
measured to the nearest millimeter will typically not repeat, but if the heights are measured to the nearest tenth of a meter, then some people may have the same height. Also, if a unique value is used to indicate a missing value, then this value will often show up as the mode.
3.2.2 Percent i les
For ordered data, it is more useful to consider the percentiles of a set of values. In particular, given an ordinal or continuous attribute r and a number p between 0 and 100, the pth percentile ro is a value of z such that pTo of the observed values of r are less than ro. For instance, the 50th percentile is the valte r5so/o such that 50% of all values of r are less than r5sv. Table 3.2 shows the percentiles for the four quantitative attributes of the Iris data set.
3.2 Summary Statistics 101-
Table 3.2. Percentiles for sepal length, sepal width, petal length, and petal width. (All values are in centimeters.)
Percentile Sepal Length Sepal Width Petal Length Petal Width 0 10 20 30 40 50 60 70 80 90 100
4.3 4.8 5 .0 5 .2 D.r) 5 .8 6 . 1 6 .3 6 .6 6 .9 7.9
2 .0 2 . 5 2 . 7 2 .8 3 .0 3 .0 3 . 1 3 .2 3.4 3 .6 4 .4
1.0 t .4 r . o t . 7 3.9 4 .4 4 .6 5 .0 o . 4
c . 6
6.9
0 . 1 0.2 0.2 0.4 7 .2 1 .3 -t . l )
1.8 1 .9 2 .2 2 .5
Example 3.2. The percentiles, ryyo,rro4o,.. . ,frgyyo,rno% of the integers from 1to 10 are , in o rder , the fo l low ing : 1 .0 , 7 .5 ,2 .5 ,3 .5 ,4 .5 ,5 .5 , 6 .5 , 7 .5 ,8 .5 , 9 .5 , 10.0. By tradition, min(r) : troyo and max(r) : *too%. r
3.2.3 Measures of Location: Mean and Median
For continuous data, two of the most widely used summary statistics are the mean and median, which are measures of the locati,on of a set of values. Consider a set of nl objects and an attr ibute r . Let {r1, . . . , r^) be the attribute values of r for these zn objects. As a concrete example, these values might be the heights of rn children. Let {rg1,…,:xOd} represent the values of z after they have been sorted in non-decreasing order. Thus, ro): min(z) and r1*1: max(r). Then, the mean and median are defined as follows: