Each circle in a plot represents one of the 30 attributes; its $x$ coordinate is the value of one of the attributes for x, while its $y$ coordinate is the value of the same attribute for y.

If we transform x and y by subtracting off their means and then normalizing them so that their lengths are 1, then their correlation can be calculated by


taking the dot product. Notice that this is not the same as the standardization used in other contexts, where we make the transformations $x'_k = (x_k - \bar{x})/s_x$ and $y'_k = (y_k - \bar{y})/s_y$.
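As a quick sketch of this computation (our own illustration; the function name and the sample vectors are not from the text), centering two vectors and scaling them to unit length makes their dot product equal to their correlation:

```python
import math

def correlation_via_dot_product(x, y):
    """Center each vector, scale it to unit length, then take the
    dot product; the result equals the Pearson correlation of x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    xc = [v - mx for v in x]
    yc = [v - my for v in y]
    lx = math.sqrt(sum(v * v for v in xc))
    ly = math.sqrt(sum(v * v for v in yc))
    xn = [v / lx for v in xc]
    yn = [v / ly for v in yc]
    return sum(a * b for a, b in zip(xn, yn))

# y is a negative multiple of x, so the correlation is -1.
x = [-3, 6, 0, 3, -6]
y = [1, -2, 0, -1, 2]
print(correlation_via_dot_product(x, y))  # ≈ -1.0
```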

Bregman Divergence* This section provides a brief description of Bregman divergences, which are a family of proximity functions that share some common properties. As a result, it is possible to construct general data mining algorithms, such as clustering algorithms, that work with any Bregman divergence. A concrete example is the K-means clustering algorithm (Section 8.2). Note that this section requires knowledge of vector calculus.

Bregman divergences are loss or distortion functions. To understand the idea of a loss function, consider the following. Let x and y be two points, where y is regarded as the original point and x is some distortion or approximation of it. For example, x may be a point that was generated by adding random noise to y. The goal is to measure the resulting distortion or loss if y is approximated by x. Of course, the more similar x and y are, the smaller the loss or distortion. Thus, Bregman divergences can be used as dissimilarity functions.

More formally, we have the following definition.

Definition 2.6 (Bregman Divergence). Given a strictly convex function $\phi$ (with a few modest restrictions that are generally satisfied), the Bregman divergence (loss function) $D(\mathbf{x}, \mathbf{y})$ generated by that function is given by the following equation:

$$D(\mathbf{x}, \mathbf{y}) = \phi(\mathbf{x}) - \phi(\mathbf{y}) - \langle \nabla \phi(\mathbf{y}), (\mathbf{x} - \mathbf{y}) \rangle \qquad (2.12)$$

where $\nabla \phi(\mathbf{y})$ is the gradient of $\phi$ evaluated at $\mathbf{y}$, $\mathbf{x} - \mathbf{y}$ is the vector difference between x and y, and $\langle \nabla \phi(\mathbf{y}), (\mathbf{x} - \mathbf{y}) \rangle$ is the inner product between $\nabla \phi(\mathbf{y})$ and $(\mathbf{x} - \mathbf{y})$. For points in Euclidean space, the inner product is just the dot product.

$D(\mathbf{x}, \mathbf{y})$ can be written as $D(\mathbf{x}, \mathbf{y}) = \phi(\mathbf{x}) - L(\mathbf{x})$, where $L(\mathbf{x}) = \phi(\mathbf{y}) + \langle \nabla \phi(\mathbf{y}), (\mathbf{x} - \mathbf{y}) \rangle$ represents the equation of a plane that is tangent to the function $\phi$ at $\mathbf{y}$. Using calculus terminology, $L(\mathbf{x})$ is the linearization of $\phi$ around the point $\mathbf{y}$, and the Bregman divergence is just the difference between a function and a linear approximation to that function. Different Bregman divergences are obtained by using different choices for $\phi$.
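The definition can be sketched in code (our own illustration, not from the text), with the convex function and its gradient supplied as callables; choosing phi(x) = ||x||^2 recovers squared Euclidean distance:

```python
def bregman_divergence(phi, grad_phi, x, y):
    """D(x, y) = phi(x) - phi(y) - <grad_phi(y), x - y>,
    where x and y are given as lists of coordinates."""
    diff = [xi - yi for xi, yi in zip(x, y)]
    g = grad_phi(y)
    inner = sum(gi * di for gi, di in zip(g, diff))
    return phi(x) - phi(y) - inner

# phi(v) = ||v||^2 generates squared Euclidean distance.
phi = lambda v: sum(vi * vi for vi in v)
grad_phi = lambda v: [2.0 * vi for vi in v]

print(bregman_divergence(phi, grad_phi, [2.0, 0.0], [1.0, 0.0]))  # 1.0
```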

Example 2.22. We provide a concrete example using squared Euclidean distance, but restrict ourselves to one dimension to simplify the mathematics. Let $x$ and $y$ be real numbers and $\phi(t)$ be the real-valued function $\phi(t) = t^2$. In that case, the gradient reduces to the derivative and the inner product reduces to multiplication. Specifically, Equation 2.12 becomes Equation 2.13.

$$D(x, y) = x^2 - y^2 - 2y(x - y) = (x - y)^2 \qquad (2.13)$$

The graph for this example, with $y = 1$, is shown in Figure 2.18. The Bregman divergence is shown for two values of $x$: $x = 2$ and $x = 3$.

Figure 2.18. Illustration of Bregman divergence.
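The two divergence values shown in the figure can be checked directly with Equation 2.13 (a small sketch of our own):

```python
def d(x, y):
    """One-dimensional Bregman divergence for phi(t) = t^2 (Equation 2.13)."""
    return x ** 2 - y ** 2 - 2 * y * (x - y)

# With y = 1, as in Figure 2.18:
print(d(2, 1))  # 1, i.e., (2 - 1)^2
print(d(3, 1))  # 4, i.e., (3 - 1)^2
```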

2.4.6 Issues in Proximity Calculation

This section discusses several important issues related to proximity measures: (1) how to handle the case in which attributes have different scales and/or are correlated, (2) how to calculate proximity between objects that are composed of different types of attributes, e.g., quantitative and qualitative, and (3) how to handle proximity calculation when attributes have different weights; i.e., when not all attributes contribute equally to the proximity of objects.



Standardization and Correlation for Distance Measures

An important issue with distance measures is how to handle the situation when attributes do not have the same range of values. (This situation is often described by saying that “the variables have different scales.”) Earlier, Euclidean distance was used to measure the distance between people based on two attributes: age and income. Unless these two attributes are standardized, the distance between two people will be dominated by income.
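The following sketch (our own; the ages and incomes are made-up illustrative values) shows the effect: without standardization, the income attribute dominates the Euclidean distance, while z-score standardization puts both attributes on a comparable footing:

```python
import math
import statistics

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# (age, income) for a few hypothetical people.
people = [(25, 40000), (45, 41000), (26, 90000)]

# Raw distances: income dominates, so the 25- and 45-year-olds look close.
print(euclidean(people[0], people[1]))  # ≈ 1000.2
print(euclidean(people[0], people[2]))  # ≈ 50000.0

def standardize(data):
    """Rescale each attribute to mean 0 and standard deviation 1."""
    cols = list(zip(*data))
    means = [statistics.mean(c) for c in cols]
    sds = [statistics.stdev(c) for c in cols]
    return [tuple((v - m) / s for v, m, s in zip(row, means, sds))
            for row in data]

z = standardize(people)
# After standardization, the large age gap is no longer drowned out.
print(euclidean(z[0], z[1]) > euclidean(z[0], z[2]))  # True
```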

A related issue is how to compute distance when there is correlation between some of the attributes, perhaps in addition to differences in the ranges of values. A generalization of Euclidean distance, the Mahalanobis distance, is useful when attributes are correlated, have different ranges of values (different variances), and the distribution of the data is approximately Gaussian (normal). Specifically, the Mahalanobis distance between two objects (vectors) x and y is defined as

$$\text{mahalanobis}(\mathbf{x}, \mathbf{y}) = (\mathbf{x} - \mathbf{y}) \, \Sigma^{-1} (\mathbf{x} - \mathbf{y})^T, \qquad (2.14)$$

where $\Sigma^{-1}$ is the inverse of the covariance matrix of the data. Note that the covariance matrix $\Sigma$ is the matrix whose $ij$th entry is the covariance of the $i$th and $j$th attributes as defined by Equation 2.11.

Example 2.23. In Figure 2.19, there are 1000 points, whose $x$ and $y$ attributes have a correlation of 0.6. The distance between the two large points at the opposite ends of the long axis of the ellipse is 14.7 in terms of Euclidean distance, but only 6 with respect to Mahalanobis distance. In practice, computing the Mahalanobis distance is expensive, but can be worthwhile for data whose attributes are correlated. If the attributes are relatively uncorrelated, but have different ranges, then standardizing the variables is sufficient.
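Equation 2.14 can be sketched with NumPy as follows (our own code; the covariance matrix is an illustrative stand-in, not the one estimated from the 1000 points of Figure 2.19):

```python
import numpy as np

def mahalanobis(x, y, sigma):
    """Equation 2.14: (x - y) Sigma^{-1} (x - y)^T."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(diff @ np.linalg.inv(sigma) @ diff)

# Illustrative covariance matrix for two positively correlated attributes.
sigma = np.array([[1.0, 0.6],
                  [0.6, 1.0]])

p, q = (0.0, 0.0), (4.0, 4.0)
print(float(np.sum(np.subtract(p, q) ** 2)))  # squared Euclidean distance: 32.0
print(mahalanobis(p, q, sigma))               # smaller along the correlated direction: ≈ 20.0
```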


Combining Similarities for Heterogeneous Attributes

The previous definitions of similarity were based on approaches that assumed all the attributes were of the same type. A general approach is needed when the attributes are of different types. One straightforward approach is to compute the similarity between each attribute separately using Table 2.7, and then combine these similarities using a method that results in a similarity between 0 and 1. Typically, the overall similarity is defined as the average of all the individual attribute similarities.


Figure 2.19. Set of two-dimensional points. The Mahalanobis distance between the two points represented by large dots is 6; their Euclidean distance is 14.7.

Unfortunately, this approach does not work well if some of the attributes are asymmetric attributes. For example, if all the attributes are asymmetric binary attributes, then the similarity measure suggested previously reduces to the simple matching coefficient, a measure that is not appropriate for asymmetric binary attributes. The easiest way to fix this problem is to omit asymmetric attributes from the similarity calculation when their values are 0 for both of the objects whose similarity is being computed. A similar approach also works well for handling missing values.

In summary, Algorithm 2.1 is effective for computing an overall similarity between two objects, x and y, with different types of attributes. This procedure can be easily modified to work with dissimilarities.

Using Weights

In much of the previous discussion, all attributes were treated equally when computing proximity. This is not desirable when some attributes are more important to the definition of proximity than others. To address these situations,


Algorithm 2.1 Similarities of heterogeneous objects.

1: For the kth attribute, compute a similarity, $s_k(\mathbf{x}, \mathbf{y})$, in the range [0, 1].
2: Define an indicator variable, $\delta_k$, for the kth attribute as follows: $\delta_k = 0$ if the kth attribute is an asymmetric attribute and both objects have a value of 0, or if one of the objects has a missing value for the kth attribute; otherwise, $\delta_k = 1$.
3: Compute the overall similarity between the two objects using the following formula:

$$\text{similarity}(\mathbf{x}, \mathbf{y}) = \frac{\sum_{k=1}^{n} \delta_k \, s_k(\mathbf{x}, \mathbf{y})}{\sum_{k=1}^{n} \delta_k} \qquad (2.15)$$
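A rough sketch of Algorithm 2.1 follows (our own code; the per-attribute similarity functions and the use of None for missing values are our simplifying assumptions):

```python
def overall_similarity(x, y, sim_funcs, asymmetric):
    """Algorithm 2.1 sketch: sim_funcs[k] maps (x_k, y_k) to [0, 1];
    asymmetric[k] is True for asymmetric binary attributes;
    None denotes a missing value."""
    num, den = 0.0, 0.0
    for k, sim in enumerate(sim_funcs):
        missing = x[k] is None or y[k] is None
        both_zero = asymmetric[k] and x[k] == 0 and y[k] == 0
        delta = 0 if (missing or both_zero) else 1
        if delta:
            num += sim(x[k], y[k])
            den += 1
    return num / den if den else 0.0

# Two objects with a nominal, an asymmetric binary, and a continuous attribute.
sims = [
    lambda a, b: 1.0 if a == b else 0.0,    # nominal: exact match
    lambda a, b: 1.0 if a == b else 0.0,    # asymmetric binary
    lambda a, b: 1.0 - abs(a - b) / 100.0,  # continuous, range assumed [0, 100]
]
x = ("red", 0, 30.0)
y = ("red", 0, 50.0)
print(overall_similarity(x, y, sims, [False, True, False]))  # ≈ 0.9: 0-0 match skipped
```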

the formulas for proximity can be modified by weighting the contribution of each attribute. If the weights $w_k$ sum to 1, then (2.15) becomes

$$\text{similarity}(\mathbf{x}, \mathbf{y}) = \frac{\sum_{k=1}^{n} w_k \, \delta_k \, s_k(\mathbf{x}, \mathbf{y})}{\sum_{k=1}^{n} \delta_k} \qquad (2.16)$$

The definition of the Minkowski distance can also be modified as follows:

$$d(\mathbf{x}, \mathbf{y}) = \left( \sum_{k=1}^{n} w_k \, |x_k - y_k|^r \right)^{1/r} \qquad (2.17)$$
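Equation 2.17 can be sketched as follows (our own code; the weights and the value of r are illustrative):

```python
def weighted_minkowski(x, y, w, r):
    """Equation 2.17: (sum_k w_k * |x_k - y_k|^r)^(1/r)."""
    return sum(wk * abs(a - b) ** r for wk, a, b in zip(w, x, y)) ** (1.0 / r)

# Equal weights summing to 1, with r = 2 (a weighted Euclidean distance).
x, y = (0.0, 0.0), (3.0, 4.0)
print(weighted_minkowski(x, y, (0.5, 0.5), 2))  # sqrt(0.5*9 + 0.5*16) ≈ 3.536
```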

2.4.7 Selecting the Right Proximity Measure

The following are a few general observations that may be helpful. First, the type of proximity measure should fit the type of data. For many types of dense, continuous data, metric distance measures such as Euclidean distance are often used. Proximity between continuous attributes is most often expressed in terms of differences, and distance measures provide a well-defined way of combining these differences into an overall proximity measure. Although attributes can have different scales and be of differing importance, these issues can often be dealt with as described earlier.

For sparse data, which often consists of asymmetric attributes, we typically employ similarity measures that ignore 0-0 matches. Conceptually, this reflects the fact that, for a pair of complex objects, similarity depends on the number of characteristics they both share, rather than the number of characteristics they both lack. More specifically, for sparse, asymmetric data, most



objects have only a few of the characteristics described by the attributes, and thus, are highly similar in terms of the characteristics they do not have. The cosine, Jaccard, and extended Jaccard measures are appropriate for such data.
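As one illustration (our own sketch, not from the text), a cosine similarity over sparse vectors stored as dictionaries ignores 0-0 matches automatically, since features absent from both objects contribute nothing:

```python
import math

def cosine_sparse(a, b):
    """Cosine similarity for sparse vectors given as {feature: value} dicts.
    Features absent from both vectors (0-0 matches) never enter the sums."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

d1 = {"data": 3, "mining": 2}
d2 = {"data": 1, "analysis": 4}
print(cosine_sparse(d1, d2))  # 3 / (sqrt(13) * sqrt(17)) ≈ 0.202
```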

There are other characteristics of data vectors that may need to be considered. Suppose, for example, that we are interested in comparing time series. If the magnitude of the time series is important (for example, each time series represents total sales of the same organization for a different year), then we could use Euclidean distance. If the time series represent different quantities (for example, blood pressure and oxygen consumption), then we usually want to determine if the time series have the same shape, not the same magnitude. Correlation, which uses a built-in normalization that accounts for differences in magnitude and level, would be more appropriate.
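A brief sketch of this point (our own illustration): correlation is unchanged when a series is rescaled and shifted, while Euclidean distance is dominated by exactly those differences in magnitude and level:

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    ca = [v - ma for v in a]
    cb = [v - mb for v in b]
    num = sum(p * q for p, q in zip(ca, cb))
    den = math.sqrt(sum(v * v for v in ca)) * math.sqrt(sum(v * v for v in cb))
    return num / den

s1 = [2.0, 4.0, 3.0, 5.0, 6.0]
s2 = [10 * v + 100 for v in s1]  # same shape: scaled by 10, shifted by 100

euclid = math.sqrt(sum((p - q) ** 2 for p, q in zip(s1, s2)))
print(euclid)           # large: dominated by the difference in level
print(pearson(s1, s2))  # ≈ 1.0: the two series have identical shape
```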

In some cases, transformation or normalization of the data is important for obtaining a proper similarity measure since such transformations are not always present in proximity measures. For instance, time series may have trends or periodic patterns that significantly impact similarity. Also, a proper computation of similarity may require that time lags be taken into account. Finally, two time series may only be similar over specific periods of time. For example, there is a strong relationship between temperature and the use of natural gas, but only during the heating season.