proximity between objects having a single attribute. Consider objects de-

scribed by one nominal attribute. What would it mean for two such objects

to be similar? Since nominal attributes only convey information about the

distinctness of objects, all we can say is that two objects either have the same value or they do not. Hence, in this case similarity is traditionally defined as 1

if attribute values match, and as 0 otherwise. A dissimilarity would be defined

in the opposite way: 0 if the attribute values match, and 1 if they do not. For objects with a single ordinal attribute, the situation is more compli-

cated because information about order should be taken into account. Consider an attribute that measures the quality of a product, â‚¬.8., a candy bar, on the

scale {poor, fai,r, OK, good, wonderful).It would seem reasonable that a prod-

uct, P1, which is rated wonderful, would be closer to a produclP2, which is

rated good, than it would be to a product P3, which is rated OK. To make this observation quantitative, the values of the ordinal attribute are often mapped

to successive integers, beginning at 0 or 1, e.g., {poor:O, fair:|, OK:2, good:3, wonderful:4). Then, d(Pl,P2) – 3 – 2 : 7 or, if we want the dis-

similarity to fall between 0 and 1, d(P1, P2) : TZ :0.25. A similarity for ordinal attributes can then be defined as s : 7 – d.

This definition of similarity (dissimilarity) for an ordinal attribute should make the reader a bit uneasy since this assumes equal intervals, and this is not

so. Otherwise, we would have an interval or ratio attribute. Is the difference between the values fair and good really the same as that between the values

OK and wonderful? Probably not, but in practice, our options are limited, and in the absence of more information, this is the standard approach for defining proximity between ordinal attributes.

For interval or ratio attributes, the natural measure of dissimilarity be- tween two objects is the absolute difference of their values. For example, we might compare our current weight and our weight a year ago by saying “I am

ten pounds heavier.” In cases such as these, the dissimilarities typically range from 0 to oo, rather than from 0 to 1. The similarity of interval or ratio at- tributes is typically expressed by transforming a similarity into a dissimilarity, as previously described.

Table 2.7 summarizes this discussion. In this table, r and g are two objects that have one attribute of the indicated type. AIso, d(r,a) and s(r,gr) are the dissimilarity and similarity between r and g, respectively. Other approaches are possible; these are the most common ones.

The following two sections consider more complicated measures of prox-

imity between objects that involve multiple attributes: (1) dissimilarities be- tween data objects and (2) similarities between data objects. This division

2.4 Measures of Similaritv and Dissimilaritv 69

allows us to more naturally display the underlying motivations for employing various proximity measures. We emphasize, however, that similarities can be transformed into dissimilarities and vice versa using the approaches described earlier.

2.4.3 Dissimilarities between Data Objects

In this section, we discuss various kinds of dissimilarities. We begin with a discussion of distances, which are dissimilarities with certain properties, and then provide examples of more general kinds of dissimilarities.

Distances

We first present some examples, and then offer a more formal description of distances in terms of the properties common to all distances. The Euclidean distance, d, between two points, x and y, in one-, two-, three-, or higher- dimensional space, is given by the following familiar formula:

d(*, y) : n

\ – r \ t ) \ tn – ak ) ‘ , K = l

(2 .1 )

where n is the number of dimensions and rp and.Ak are) respectively,the kth attributes (components) of r and g. We illustrate this formula with Figure 2.15 and Tables 2.8 and 2.9, which show a set of points, the e and gr coordinates of these points, and the distance matrix containing the pairwise distances of these points.

Table 2.7 . Similarity and dissimilarity for simple attributes

Attribute T’ype

Dissimilarity Similaritv

Nominal ) _ 11, –

0 1 f r : A I i f r l y s :

I \ f r : y 0 i f n l y

Ordinal d: l r -a l l ( ” – t ) (values mapped to integers 0 to n-1 where n is the number of values)

s : I – d

Interval or Ratio d : l r – a l s : – d , s : ; i , s : e – o ,’ L t a

^ – 1 d -min-d-

70 Chapter 2 Data

The Euclidean distance measure given the Minkowski distance metric shown in

in Equation 2.1 is generalized Equation 2.2,

by

d(x, y) : ( , , \

where r is a parameter. The following are the three most common examples of Minkowski distances.

. r :1. City block (Manhattan, taxicab, L1 norm) distance. A common example is the Hamming distance, which is the number of bits that are different between two objects that have only binary attributes, i.e., between two binary vectors.

o r :2. Euclidean distance (L2 norm).

. r : oo. Supremum (L*o, or L- norm) distance. This is the maximum difference between any attribute of the objects. More formally, the L- distance is defined by Equation 2.3

(2 .3)

(Lor-orr)””

J* (U’wr-rrt’)”‘d(*, Y) :

The r parameter should not be confused with the number of dimensions (at- tributes) n. The Euclidean, Manhattan, and supremum distances are defined for all values of n: I,2,3,…, and specify different ways of combining the differences in each dimension (attribute) into an overall distance.

Tables 2.10 and 2.11, respectively, give the proximity matrices for the L1 and Loo distances using data from Table 2.8. Notice that all these distance matrices are symmetric; i.e., the ijth entry is the same as the jith entry. In Table 2.9, for instance, the fourth row of the first column and the fourth column of the first row both contain the value 5.1.

Distances, such as the Euclidean distance, have some well-known proper- ties. If d(*, y) is the distance between two points, x and y, then the following properties hold.

1. Positivity

(a) d(x, x) > 0 for all x and y,

(b) d(x, Y) : 0 onlY if x : Y.

Measures of Similaritv and Dissimilaritv 7L

Figure 2.15. Four two-dimensional points.

Tabfe 2.8. r and y coordinates of four points. Table 2.9. Euclidean distance matrix for Table 2.g. point z coordinate y coordinate

p1 0 2 p2 2 0 p3 .) 1 p4 1

Table 2.10. L1 distance matrix for Table 2.8. Table 2,11. L- distance matrix for Table 2.8.

2. Symmetry d(*,Y) : d(Y,x) for al l x and Y.

3. T[iangle Inequality d(x,z) < d(*, y) + d(y, z) for all points x, y,, and z.

Measures that satisfy all three properties are known as metrics. Some people only use the term distance for dissimilarity measures that satisfy these properties, but that practice is often violated. The three properties described here are useful, as well as mathematically pleasing. AIso, if the triangle in- equality holds, then this property can be used to increase the efficiency of tech- niques (including clustering) that depend on distances possessing this property. (See Exercise 25.) Nonetheless, many dissimilarities do not satisfy one or more of the metric properties. We give two examples of such measures.

2.4

1 2 3 4 5 6 X

p l p2 p3 p4 pl 0.0 2 .8 3 .2 5 . 1 p2 2.8 0.0 t .4 3 .2 p3 3.2 L .4 0.0 2.0 p4 c . r 3.2 2 .0 0 .0

L1 p1 p2 p3 p4 p l 0.0 4 .0 4.0 6.0 p2 4.0 0.0 2.0 4.0 p3 4.0 2 .0 0.0 2 .0 p4 6.0 4.0 2 .0 0.0

L- p l p2 p3 p4 p1 0.0 2 .0 3.0 5.0 p2 2.0 0 .0 1 .0 3.0 p3 3.0 1 .0 0.0 2.0 p4 5.0 3 .0 2.0 0 .0

72 Chapter 2 Data

Example 2.L4 (Non-metric Dissimilarities: Set Differences). This ex- ample is based on the notion of the difference of two sets, as defined in set theory. Given two sets -4 and B, A – B is the set of elements of A that are not in B . For example , i f A : { I ,2 ,3 ,4 } and B : {2 ,3 ,4 } , then A- B : {1 } and B – A – 0, the empty set. We can define the distance d between two sets A and B as d(A,B): size(A- B), where s’ize ts a function returning the number of elements in a set. This distance measure, which is an integer value greater than or equal to 0, does not satisfy the second part of the pos- itivity property the symmetry property, or the triangle inequality. However, these properties can be made to hold if the dissimilarity measure is modified as fol lows: d(A,B): s ize(A- B) + si ,ze(B – A). See Exercise 21 on page 94. r

Example 2.15 (Non-metric Dissimilarities: Time). This example gives

a more everyday example of a dissimilarity measure that is not a metric, but that is still useful. Define a measure of the distance between times of the day as follows:

d,(t1,t2) : {7^ *t,az _ tr) l i l l i I:}

To illustrate, d(lPM, 2PM) : t hour, while d(2PM, 1PM) : 23 hours. Such a definition would make sense, for example, when answering the question:

“ff an event occurs at lPM every day, and it is now 2PM, how Iong do I have to wait for that event to occur again?”

2.4.4 Similarities between Data Objects

For similarities, the triangle inequality (or the analogous property) typically does not hold, but symmetry and positivity typically do. To be explicit, if s(x, y) is the similarity between points x and y, then the typical properties of similarities are the following:

1 . s (x ,y ) : 1 on ly i f x : y . (0 < s S 1)

2. s(x,y) : s(y, x) for all x and y. (Symmetry)

There is no general analog of the triangle inequality for similarity mea- sures. It is sometimes possible, however, to show that a similarity measure can easily be converted to a metric distance. The cosine and Jaccard similarity measures, which are discussed shortly, are two examples. Also, for specific sim- ilarity measures, it is possible to derive mathematical bounds on the similarity between two objects that are similar in spirit to the triangle inequality.

(2.4)

Measures of Similarity and 73

Example 2.L6 (A Non-symmetric Similarity Measure). Consider an experiment in which people are asked to classify a small set of characters as they flash on a screen. The confusion matrix for this experiment records how often each character is classified as itself, and how often each is classified as another character. For instance, suppose that “0” appeared 200 times and was classified as a “0” 160 times, but as an “o” 40 times. Likewise, suppose that ‘o’ appeared 200 times and was classified as an “o” 170 times, but as “0” only 30 times. If we take these counts as a measure of the similarity between two characters, then we have a similarity measure, but one that is not symmetric. In such situations, the similarity measure is often made symmetric by setting s'(x, y) : s'(y, x) : (s(x, y)+ s(y,