376 Chapter 6 Association Analysis

Table 6.10. Example of a contingency table for items p and q.

consider A and B as a pair of bit vectors, A o B : s(A,B) the dot product between the vectors, and lAl : \f(A) the magnitude of vector A. Therefore:

(6 .10)

The 1,S measure can also be expressed as the geometric mean between the confidence of association rules extracted from a pair of binary variables:

IS(4, B) : (6 .11)

Because the geometric mean between any two numbers is always closer to the smaller number, the 1^9 value of an itemset {p, q} is low whenever one of its rules, p —+ Q or e + p, has low confidence.

Limitations of rs Measure The 1,9 value for a pair of independent item- sets, A and B. is

IS1″a”o(,4, B) : s(A, B) s(A) x s(B)

since the value depends on s(A) and s(B), IS shares a similar problem as the confidence measure-that the value of the measure can be quite large, even for uncorrelated and negatively correlated patterns. For example, despite the large 1^9 value between items p and q given in Table 6.10 (0.889), it is still less than the expected value when the items are statistically independent (ISi’auo : 0.9).

IS(A, as : -$ : , l ‘ ” * , : cos i ,ne(A,B) .- \ – – r – ‘ J4A) AB) la l x In l

c(A– B) x c(B – – A) .

s(.4) x s(A) x s(B)

Evaluation of Association Patterns 377

Alternative Objective Interestingness Measures

Besides the measures we have described so far, there are other alternative mea- sures proposed for analyzing relationships between pairs of binary variables. These measures can be divided into two categories, symmetric and asym- metric measures. A measure M is symmetric if M(A –‘- B): M(B —— A). For example, interest factor is a symmetric measure because its value is iden- tical for the rules A ——+ B and B —- A. In contrast, confidence is an asymmetric measure since the confidence for A —. B and B -‘—-+,4. may not be the same. Symmetric measures are generally used for evaluating itemsets, while asymmetric measures are more suitable for analyzing association rules. Tables 6.11 and 6.12 provide the definitions for some of these measures in terms of the frequency counts of a 2 x 2 contingency table.

Consistency among Objective Measures

Given the wide variety of measures available, it is reasonable to question

whether the measures can produce similar ordering results when applied to a set of association patterns. If the measures are consistent, then we can choose any one of them as our evaluation metric. Otherwise, it is important to understand what their differences are in order to determine which measure is more suitable for analyzing certain types of patterns.

Tabfe 6.11. Examples of symmetric objective measures for the itemset {A, B}.

Measure (Symbol) Definition

Correlation (@)

Odds ratio (a)

Kappa (rc)

Interest (1)

Cosine (1S)

Piatetsky-Shapiro (P,9)

Collective strength (S)

Jaccard (o

All-confidence (h)

N”frr – ” f r+f+r t;_-i–

v J r + J + 1 J 0 + J + o / . ? \ l l o a \

\ I t t Ioo) / ( ,1 ‘ to ” ro t /

N”fr r +Nloq lrr.ffl:/ot_lt_q–Nt=1+ f-+r=j;+j+”

/ ^ r r \ l l r r \ \ 1 \ J t l ) / U 1 + J + 1 / / r \ l/ /-.-.a’– r-\

\ I r t ) / \ t / I t+J+r)

. f l t J L + J + 1 M

– –T-

fuf( f r++ f+t

– i”[#,#]

6.7

Tabfe 6.12. Examples of asymmetric objective measures for the rule A ——+ B.

Measure (Symbol) Definition

Goodman-Kruskal (.\)

Mutual Information (M)

J-Measure (J)

Gini index (G)

Laplace (.L)

Conviction (V)

Certainty factor (F)

Added Value (AV)

(Di *** f 1x – rnarxf +*)/(N – max;, fup)

(D,; D, # to* #tl t t- D, + t”g #) #t”g#+ffrog# # * (#)’+ (#)’l – (+)’

+#’t(#)’+(#)’ l -(#)’ ( / ” + t ) 1 f fp +z) (fr+f +o) l(w f,o)

(# – #)t0- +) J 7 1 _ J + 1

378 Chapter 6 Association Analysis

Table 6.13. Example of contingency tables.

Example . t 17 . t 10 J 0 1 foo E1 E2 Es Ea E5 E6 R.

Es Es Erc

8123 8330 3954 2886 1500 4000 9481 4000 7450 61

83 2

3080 1363 2000 2000 298

2000 2483 2483

424 622 5

1320 500 1000 \27

2000 4 /l

1370 r046 296r 4431 6000 3000 94

2000 63

7452

Suppose the symmetric and asymmetric measures are applied to rank the ten contingency tables shown in Table 6.13. These contingency tables are cho- sen to illustrate the differences among the existing measures. The ordering produced by these measures are shown in Tables 6.14 and 6.15, respectively (with 1 as the most interesting and 10 as the least interesting table). Although some of the measures appear to be consistent with each other, there are certain measures that produce quite different ordering results. For example, the rank- ings given by the fcoefficient agree with those provided by rc and collective strength, but are somewhat different than the rankings produced by interest

o a K I IS PS ^9 C h

n 1 E2 E3 Ea E5 E6 E7 E8 Es Erc

1 2 3 i

5 6 ,7

8 9 10

3 1 2 8 n I

9 6 . 10 4 5

I 2 A

3 0

5 ,7

8 I 10

tt 7 ^

3 2 5 I 8 10 1

2 3 5 F7 I

q

o 1 8 4

10

2 r

1 3 6 4 8 F7

I 10

1 2 3 4 6 E J

7 8 q

10

2 3 o

7 q

l)

1 8 4 10

2 3 8 5 9 7

1 1 I

4 10

6.7 Evaluation of Association Patterns 379

Table 6.14. Rankings of contingency tables using the symmetric measures given in Table 6.11.

Table 6.15. Rankings of contingency tables using the asymmetric measures given in Table 6.12.

factor and odds ratio. Furthermore, a contingency table such as -E1s is ranked lowest according to the @-coefficient, but highest according to interest factor.

Properties of Objective Measures

The results shown in Table 6.14 suggest that a significant number of the mea-

sures provide conflicting information about the quality of a pattern. To under- stand their differences, we need to examine the properties of these measures.

Inversion Property Consider the bit vectors shown in Figure 6.28. The

0/1 bit in each column vector indicates whether a transaction (row) contains a particular item (column). For example, the vector A indicates that item a

) M J G L V F AV E1 E2 E3 Ea E5 E6 E7 Eg Es Eto

I 2 tr

4 q

3 7 8 6 10

1r

2 3 o F7

8 5 9 4 10

1 2 r

3 4 6 9 7

10 8

1 .) 2 4 6 r

8 ,7

I 10

4 tr J

2 q

8 F7

3 10 ‘|

h

2 1 6 3 D

4 .7

8 I 10

2 I o 3 tr d

n

7 8 q

10

r J

6 4 I 2 3 q

7 10 8

380 Chapter 6 Association

Figure 6,28. Effect of the inversion operation. The vectors C and E are inversions of vector A, while the vector D is an inversion of vectors B and F.

belongs to the first and last transactions, whereas the vector B indicates that item b is contained only in the fifth transaction. The vectors C and E are in fact related to the vector A-their bits have been inverted from 0’s (absence) to l’s (presence), and vice versa. Similarly, D is related to vectors B and F by inverting their bits. The process of flipping a bit vector is called inversion. If a measure is invariant under the inversion operation, then its value for the vector pair (C, D) should be identical to its value for (A, B). The inversion property of a measure can be tested as follows.

Definition 6.6 (Inversion Property). An objective measure M is invariant under the inversion operation if its value remains the same when exchanging the frequency counts fi1 with /66 and fis with /61.