This section provides specific examples of some similarity and dissimilarity measures.

Similarity Measures for Binary Data

Similarity measures between objects that contain only binary attributes are called similarity coefficients, and typically have values between 0 and 1. A value of 1 indicates that the two objects are completely similar, while a value of 0 indicates that the objects are not at all similar. There are many rationales for why one coefificient is better than another in specific instances.

Let x and y be two objects that consist of n binary attributes. The com- parison of two such objects, i.e., two binary vectors, Ieads to the following four quantities (frequencies) :

.foo : the number of attributes where x is 0 and y is 0

.for : the number of attributes where x is 0 and y is 1

,fio : the number of attributes where x is 1 and y is 0

“fir : the number of attributes where x is 1 and y is 1

Simple Matching Coefficient One commonly used similarity coefficient is the simple matching coefficient (SMC), which is defined as

number of matching attribute values ft * fss

2.4

Examples of Proximity Measures

S M C : number of attributes for -l fn * ,frr * ,foo’

(2.5)

74 Chapter 2 Data

This measure counts both presences and absences equally. Consequently, the SMC could be used to find students who had answered questions similarly on a test that consisted only of true/false questions.

Jaccard Coefficient Suppose that x and y are data objects that represent two rows (two transactions) of a transaction matrix (see Section 2.1.2). If each asymmetric binary attribute corresponds to an item in a store, then a 1 indi- cates that the item was purchased, while a 0 indicates that the product was not purchased. Sincb the number of products not purchased by any customer far outnumbers the number of products that were purchased, a similarity measure such as SMC would say that all transactions are very similar. As a result, the Jaccard coefficient is frequently used to handle objects consisting of asymmet- ric binary attributes. The Jaccard coefficient, which is often symbolized by ,./,’is given by the following equation:

J : number of matching presences

(2.6) number of attributes not involved in 00 matches fot * fn -t ft

Example 2.17 (The SMC and Jaccard Similarity Coefficients). To illustrate the difference between these two similarity measures, we calculate SMC and -I for the following two binary vectors.

x : (1, 0, 0, 0, 0, 0, 0, 0, 0, 0) y : ( 0 , 0 , 0 , 0 , 0 , 0 , 1 , 0 , 0 , 1 )

for :2 the number of attributes where x was 0 and y was 1

frc: I the number of attributes where x was 1 and y was 0

foo :7 the number of attributes where x was 0 and y was 0

/rr : 0 the number of attributes where x was 1 and y was 1

S M C : – ; – , = – – ; t ‘ – u . l

J : -T – ,+ – – – – ; – : – -4 – :g J O I T J I O T J I I Z T I T U

Cosine Similarity

Documents are often represented as vectors, where each attribute represents the frequency with which a particular term (word) occurs in the document. It is more complicated than this, of course, since certain common words are ig-

J 7 I

Measures of Similaritv and Dissimilaritv 75

nored and various processing techniques are used to account for different forms of the same word, differing document lengths, and different word frequencies.

Even though documents have thousands or tens of thousands of attributes (terms), each document is sparse since it has relatively few non-zero attributes. (The normalizations used for documents do not create a non-zero entry where there was azero entry; i.e., they preserve sparsity.) Thus, as with transaction data, similarity should not depend on the number of shared 0 values since any two documents are likely to “not contain” many of the same words, and therefore, if 0-0 matches are counted, most documents will be highly similar to most other documents. Therefore, a similarity measure for documents needs to ignores 0-0 matches like the Jaccard measure, but also must be able to handle non-binary vectors. The cosine similarity, defined next, is one of the most common measure of document similarity. If x and y are two document vectors, then

cos(x,y) : ffi,

2.4

( , 7 \

where . indicates the vector dot product, x .y : D[:trplp, and llxll is the

length of vector x, ll*ll : 1f D|:rr2r: 1/x4.

Example 2.18 (Cosine Similarity of Two Document Vectors). This example calculates the cosine similarity for the following two data objects, which might represent document vectors:

* : ( 3 , 2 , 0 , 5 , 0 , 0 , 0 , 2 , 0 , 0 ) y : ( 1 , 0 , 0 , 0 , 0 , 0 , 0 , 1 , 0 , 2 )

x . y : 3 i . 1 * 2 x 0 * 0 * 0 * 5 x 0 * 0 x 0 f 0 * 0 * 0 * 0 * 2 * l * 0 x 0 * 0 x 2 : 5

l l x l l : roxo *oxo *2×2*oxo *oxo :6 .48 l l y l l : :2 .24 cos(x, y) : 0.31

I

As indicated by Figure 2.16, cosine similarity really is a measure of the (cosine of the) angle between x and y. Thus, if the cosine similarity is 1, the angle between x and y is 0o, and x and y are the same except for magnitude (length). If the cosine similarity is 0, then the angle between x and y is 90o, and they do not share any terms (words).

76 Chapter 2 Data

Figule 2.16. Geometric illustration of the cosine measure.

Equation 2.7 can be written as Equation 2.8.

cos(x ,y ) : + . , ,Y , , : x ‘ . y ‘ , l l x l l l l v l l

(2.8)

where x’ : x/llxll and y/ : V lllyll. Dividing x and y by their lengths normal- izes them to have a length of 1. This means that cosine similarity does not take the magnitude ofthe two data objects into account when computing similarity. (Euclidean distance might be a better choice when magnitude is important.) For vectors with a length of 1, the cosine measure can be calculated by taking a simple dot product. Consequently, when many cosine similarities between objects are being computed, normalizing the objects to have unit length can reduce the time required.

Extended Jaccard Coefficient (Tanimoto Coefficient)

The extended Jaccard coefficient can be used for document data and that re- duces to the Jaccard coefficient in the case of binary attributes. The extended Jaccard coefficient is also known as the Tanimoto coefficient. (However, there is another coefficient that is also known as the Tanimoto coefficient.) This co- efficient, which we shall represent as E J , is defined by the following equation:

EJ(x , y ) : x . y (2.e)

l l ” l l ‘+ l lv l l ‘ – * .v

Correlation

The correlation between two data objects that have binary or continuous vari- ables is a measure of the linear relationship between the attributes of the objects. (The calculation of correlation between attributes, which is more common, can be defined similarly.) More precisely, Pearson’s correlation

Measures of Similaritv and Dissimilaritv 77

coefficient between two data objects, x and y, is defined by the following equation:

corr(x, y) : covariance(x, y) StA (2 .10)

standard-deviation(x) xstandard-deviation(y) tr ta’

where we are using the following standard statistical notation and definitions:

2.4

1 n

covariance(x,y) : s,s: :+ I(“u -z)(yx -9) n – r – .

E : L

(2 .11)

standard-deviation(x) : su :

standard-deviation(y) : su:

1 f l

i : I ) – r * i s t h e m e a n o f x n 4

k :1

1 f l

g : 1 f y * i s t h e m e a n o f y h : 1

Example 2.19 (Perfect Correlation). Correlation is always in the range -1 to 1. A correlation of 1 (-1) means that x and y have a perfect positive (negative) linear relationship; that is, 16 – aAx -f b, where a, and b are con- stants. The following two sets of values for x and y indicate cases where the correlation is -1 and *1, respectively. In the first case, the means of x and y were chosen to be 0, for simplicity.

x : ( -3 , 6 , 0 , 3 , -6 )

y : ( 1 , – 2 , 0 , – 7 , 2 )

x : ( 3 , 6 , 0 , 3 , 6 ) y : ( 1 , 2 , 0 , L , 2 )

I

fil@n-n)’

; \ l@*-vt ‘

78 Chapter 2 Data

-1.00 -.0.90 4.70 –0.60

0.300.200.10

1.000.80

-o.10

Figure 2.17. Scatter plots illustrating correlations from -1 to 1.

Example 2.20 (Non-linear Relationships). If the correlation is 0, then there is no linear relationship between the attributes of the two data objects. However, non-linear relationships may still exist. In the following example, n*: A7, but their correlation is 0.

* : ( – 3 , – 2 , – 1 , 0 , I , 2 , 3 ) Y : (9 , 4 ,1 ,0 ,1 ,4 ,9 )

I

Example 2.21 (Visualizing Correlation). It is also easy to judge the cor- relation between two data objects x and y by plotting pairs of corresponding attribute values. Figure 2.17 shows a number of these plots when x and y have 30 attributes and the values of these attributes are randomly generated (with a normal distribution) so that the correlation of x and y ranges from -1