The measures that remain invariant under this operation include the φ-coefficient, odds ratio, κ, and collective strength. These measures may not be suitable for analyzing asymmetric binary data. For example, the φ-coefficient between C and D is identical to the φ-coefficient between A and B, even though items c and d appear together more frequently than a and b. Furthermore, the φ-coefficient between C and D is less than that between E and F even though items e and f appear together only once! We had previously raised this issue when discussing the limitations of the φ-coefficient on page 375. For asymmetric binary data, measures that do not remain invariant under the inversion operation are preferred. Some of the non-invariant measures include interest factor, IS, and the Jaccard coefficient.
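The effect of inversion can be checked numerically. The following is a minimal sketch, not the book's own code: the two 0/1 vectors are hypothetical, and the helper names (`counts`, `phi`, `jaccard`, `invert`) are introduced here only for illustration. It assumes that no row or column of the contingency table is empty, since the φ-coefficient is undefined in that case.

```python
from math import sqrt

def counts(x, y):
    """Return (f11, f10, f01, f00) for two equal-length 0/1 vectors."""
    pairs = list(zip(x, y))
    return (pairs.count((1, 1)), pairs.count((1, 0)),
            pairs.count((0, 1)), pairs.count((0, 0)))

def phi(x, y):
    """phi-coefficient computed from the 2x2 contingency counts."""
    f11, f10, f01, f00 = counts(x, y)
    num = f11 * f00 - f10 * f01
    den = sqrt((f11 + f10) * (f01 + f00) * (f11 + f01) * (f10 + f00))
    return num / den

def jaccard(x, y):
    """Jaccard coefficient: co-occurrences over transactions with either item."""
    f11, f10, f01, _ = counts(x, y)
    return f11 / (f11 + f10 + f01)

def invert(x):
    """Inversion operation: flip every 0 to 1 and every 1 to 0."""
    return [1 - v for v in x]

# Hypothetical vectors (assumed for illustration, not taken from the figure).
a = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
b = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]

print("phi before/after inversion:", phi(a, b), phi(invert(a), invert(b)))
print("Jaccard before/after inversion:", jaccard(a, b), jaccard(invert(a), invert(b)))
```

In this sketch the φ-coefficient is the same before and after inversion, while the Jaccard coefficient changes because inversion turns the shared absences into shared presences.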

[Figure with three panels: (a) A and B; (b) C and D; (c) E and F.]

6.7 Evaluation of Association Patterns 381

Null Addition Property. Suppose we are interested in analyzing the relationship between a pair of words, such as data and mining, in a set of documents. If a collection of articles about ice fishing is added to the data set, should the association between data and mining be affected? This process of adding unrelated data (in this case, documents) to a given data set is known as the null addition operation.

Definition 6.7 (Null Addition Property). An objective measure M is invariant under the null addition operation if it is not affected by increasing $f_{00}$, while all other frequencies in the contingency table stay the same.

For applications such as document analysis or market basket analysis, the measure is expected to remain invariant under the null addition operation. Otherwise, the relationship between words may disappear simply by adding enough documents that contain neither word! Examples of measures that satisfy this property include the cosine (IS) and Jaccard (ζ) measures, while those that violate this property include interest factor, PS, odds ratio, and the φ-coefficient.
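A small numerical check makes the null addition property concrete. This is an illustrative sketch only: the contingency counts are hypothetical, and the function names are chosen here for readability. Each function takes the four counts in the order f11, f10, f01, f00.

```python
from math import sqrt

def cosine(f11, f10, f01, f00):
    """IS (cosine) measure; note that f00 does not enter the formula."""
    return f11 / sqrt((f11 + f10) * (f11 + f01))

def jaccard(f11, f10, f01, f00):
    """Jaccard coefficient; f00 does not enter the formula either."""
    return f11 / (f11 + f10 + f01)

def interest(f11, f10, f01, f00):
    """Interest factor N * f11 / (f1+ * f+1); depends on f00 through N."""
    n = f11 + f10 + f01 + f00
    return n * f11 / ((f11 + f10) * (f11 + f01))

# Hypothetical counts: the same table before and after adding
# many transactions that contain neither item (only f00 grows).
base = (20, 5, 5, 70)
padded = (20, 5, 5, 70_000)

for name, measure in [("cosine", cosine), ("Jaccard", jaccard), ("interest", interest)]:
    print(name, measure(*base), measure(*padded))
```

Only the interest factor changes between the two tables in this sketch, because it is the only one of the three whose formula involves the total count N.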

Scaling Property. Table 6.16 shows the contingency tables for gender and the grades achieved by students enrolled in a particular course in 1993 and 2004. The data in these tables show that the number of male students has doubled since 1993, while the number of female students has increased by a factor of 3. However, the male students in 2004 are not performing any better than those in 1993 because the ratio of male students who achieve a high grade to those who achieve a low grade is still the same, i.e., 3:4. Similarly, the female students in 2004 are performing no better than those in 1993. The association between grade and gender is expected to remain unchanged despite changes in the sampling distribution.

Table 6.16. The grade-gender example.

(a) Sample data from 1993.

          Male   Female   Total
  High      30       20      50
  Low       40       10      50
  Total     70       30     100

(b) Sample data from 2004.

          Male   Female   Total
  High      60       60     120
  Low       80       30     110
  Total    140       90     230

382 Chapter 6 Association Analysis

Table 6.17. Properties of symmetric measures.

  Symbol   Measure               Inversion   Null Addition   Scaling
  φ        φ-coefficient         Yes         No              No
  α        odds ratio            Yes         No              Yes
  κ        Cohen's               Yes         No              No
  I        Interest              No          No              No
  IS       Cosine                No          Yes             No
  PS       Piatetsky-Shapiro's   Yes         No              No
  S        Collective strength   Yes         No              No
  ζ        Jaccard               No          Yes             No
  h        All-confidence        No          No              No
  s        Support               No          No              No

Definition 6.8 (Scaling Invariance Property). An objective measure M is invariant under the row/column scaling operation if $M(T) = M(T')$, where $T$ is a contingency table with frequency counts $[f_{11}; f_{10}; f_{01}; f_{00}]$, $T'$ is a contingency table with scaled frequency counts $[k_1 k_3 f_{11}; k_2 k_3 f_{10}; k_1 k_4 f_{01}; k_2 k_4 f_{00}]$, and $k_1$, $k_2$, $k_3$, $k_4$ are positive constants.

From Table 6.17, notice that only the odds ratio (α) is invariant under the row and column scaling operations. All other measures such as the φ-coefficient, κ, IS, interest factor, and collective strength (S) change their values when the rows and columns of the contingency table are rescaled. Although we do not discuss the properties of asymmetric measures (such as confidence, J-measure, Gini index, and conviction), it is clear that such measures do not preserve their values under inversion and row/column scaling operations, but are invariant under the null addition operation.
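The scaling operation can be sketched in a few lines. This is an illustration under stated assumptions rather than a definitive implementation: the table is stored as (f11, f10, f01, f00) with rows High/Low and columns Male/Female, the 1993 counts are obtained by dividing the 2004 male counts by 2 and the female counts by 3 as the text describes, and the helper names are hypothetical.

```python
from math import sqrt

def scale(table, k1, k2, k3, k4):
    """Row/column scaling: k1, k2 scale the columns and k3, k4 scale the rows."""
    f11, f10, f01, f00 = table
    return (k1 * k3 * f11, k2 * k3 * f10, k1 * k4 * f01, k2 * k4 * f00)

def odds_ratio(table):
    """Odds ratio f11 * f00 / (f10 * f01)."""
    f11, f10, f01, f00 = table
    return (f11 * f00) / (f10 * f01)

def phi(table):
    """phi-coefficient of a 2x2 contingency table."""
    f11, f10, f01, f00 = table
    num = f11 * f00 - f10 * f01
    den = sqrt((f11 + f10) * (f01 + f00) * (f11 + f01) * (f10 + f00))
    return num / den

# (High & Male, High & Female, Low & Male, Low & Female); assumed 1993 counts.
grades_1993 = (30, 20, 40, 10)
# Males doubled (k1 = 2), females tripled (k2 = 3), rows left unscaled.
grades_2004 = scale(grades_1993, k1=2, k2=3, k3=1, k4=1)

print("2004 table:", grades_2004)
print("odds ratio:", odds_ratio(grades_1993), odds_ratio(grades_2004))
print("phi:", phi(grades_1993), phi(grades_2004))
```

The constants cancel in the odds ratio because each of k1, k2, k3, and k4 appears exactly once in the numerator and once in the denominator, which is why it is unchanged here while the φ-coefficient is not.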

6.7.2 Measures beyond Pairs of Binary Variables

The measures shown in Tables 6.11 and 6.12 are defined for pairs of binary variables (e.g., 2-itemsets or association rules). However, many of them, such as support and all-confidence, are also applicable to larger-sized itemsets. Other measures, such as interest factor, IS, PS, and Jaccard coefficient, can be extended to more than two variables using the frequencies tabulated in a multidimensional contingency table. An example of a three-dimensional contingency table for a, b, and c is shown in Table 6.18. Each entry $f_{ijk}$ in this table represents the number of transactions that contain a particular combination of items a, b, and c. For example, $f_{101}$ is the number of transactions that contain a and c, but not b. On the other hand, a marginal frequency such as $f_{1+1}$ is the number of transactions that contain a and c, irrespective of whether b is present in the transaction.

Table 6.18. Example of a three-dimensional contingency table.

Given a k-itemset $\{i_1, i_2, \ldots, i_k\}$, the condition for statistical independence can be stated as follows:

$$f_{i_1 i_2 \ldots i_k} = \frac{f_{i_1 + \ldots +} \times f_{+ i_2 \ldots +} \times \cdots \times f_{+ + \ldots i_k}}{N^{k-1}} \tag{6.12}$$

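With this definition, we can extend objective measures such as interest factor and PS, which are based on deviations from statistical independence, to more than two variables:

$$I = \frac{N^{k-1} \times f_{i_1 i_2 \ldots i_k}}{f_{i_1 + \ldots +} \times f_{+ i_2 \ldots +} \times \cdots \times f_{+ + \ldots i_k}}$$

$$PS = \frac{f_{i_1 i_2 \ldots i_k}}{N} - \frac{f_{i_1 + \ldots +} \times f_{+ i_2 \ldots +} \times \cdots \times f_{+ + \ldots i_k}}{N^{k}}$$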
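The extended measures can be computed directly from a list of transactions. The sketch below is illustrative and rests on two assumptions: the interest factor is taken as N^(k-1) times the joint frequency divided by the product of the single-item marginal frequencies, and PS as the joint frequency over N minus that product over N^k. The function name and the toy transactions are hypothetical.

```python
from math import prod

def k_way_interest_and_ps(transactions, itemset):
    """Return (interest factor, PS) for a k-itemset over a list of item sets."""
    n = len(transactions)
    items = list(itemset)
    k = len(items)
    # Joint frequency: transactions that contain every item in the itemset.
    joint = sum(1 for t in transactions if all(i in t for i in items))
    # Marginal frequency of each item, irrespective of the other items.
    marginals = [sum(1 for t in transactions if i in t) for i in items]
    product = prod(marginals)
    interest = (n ** (k - 1)) * joint / product
    ps = joint / n - product / n ** k
    return interest, ps

# Hypothetical transactions, assumed only for illustration.
transactions = [
    {"a", "b", "c"}, {"a", "b", "c"}, {"a", "b"}, {"a", "c"},
    {"b"}, {"c"}, {"d"}, {"d"},
]

print(k_way_interest_and_ps(transactions, {"a", "b", "c"}))
print(k_way_interest_and_ps(transactions, {"a", "b"}))
```

Under independence the interest factor would equal 1 and PS would equal 0, so values away from these indicate a deviation from Equation 6.12. The sketch assumes every item in the itemset occurs at least once; otherwise the interest factor is undefined.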

Another approach is to define the objective measure as the maximum, minimum, or average value for the associations between pairs of items in a pattern. For example, given a k-itemset $X = \{i_1, i_2, \ldots, i_k\}$, we may define the φ-coefficient for X as the average φ-coefficient between every pair of items $(i_p, i_q)$ in X. However, because the measure considers only pairwise associations, it may not capture all the underlying relationships within a pattern.
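The pairwise approach can be sketched as follows. This is one possible reading of the averaging scheme, with hypothetical function names and toy transactions. The φ-coefficient is written in the equivalent form (N f11 - f1+ f+1) / sqrt(f1+ f+1 (N - f1+)(N - f+1)), and every item is assumed to appear in some but not all transactions.

```python
from itertools import combinations
from math import sqrt

def phi(transactions, x, y):
    """phi-coefficient between items x and y over a list of item sets."""
    n = len(transactions)
    f11 = sum(1 for t in transactions if x in t and y in t)
    fx = sum(1 for t in transactions if x in t)
    fy = sum(1 for t in transactions if y in t)
    return (n * f11 - fx * fy) / sqrt(fx * fy * (n - fx) * (n - fy))

def average_pairwise_phi(transactions, itemset):
    """Average the phi-coefficient over every unordered pair of items."""
    pairs = list(combinations(sorted(itemset), 2))
    return sum(phi(transactions, x, y) for x, y in pairs) / len(pairs)

# Hypothetical transactions, assumed only for illustration.
transactions = [
    {"a", "b", "c"}, {"a", "b", "c"}, {"a", "b"}, {"a", "c"},
    {"b"}, {"c"}, {"d"}, {"d"},
]

print(average_pairwise_phi(transactions, {"a", "b", "c"}))
```

Replacing the average with `max` or `min` over the same pairs gives the other two variants mentioned in the text.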

Analysis of multidimensional contingency tables is more complicated because of the presence of partial associations in the data. For example, some associations may appear or disappear when conditioned upon the value of certain variables. This problem is known as Simpson's paradox and is described in the next section. More sophisticated statistical techniques are available to analyze such relationships, e.g., loglinear models, but these techniques are beyond the scope of this book.
