Measures that remain invariant under this operation include the φ-coefficient, odds ratio, κ, and collective strength. These measures may not be suitable for analyzing asymmetric binary data. For example, the φ-coefficient between C and D is identical to the φ-coefficient between A and B, even though items c and d appear together more frequently than a and b. Furthermore, the φ-coefficient between C and D is less than that between E and F even though items e and f appear together only once! We had previously raised this issue when discussing the limitations of the φ-coefficient on page 375. For asymmetric binary data, measures that do not remain invariant under the inversion operation are preferred. Some of the non-invariant measures include interest factor, IS, and the Jaccard coefficient.
[Figure with three panels: (a) A and B, (b) C and D, (c) E and F.]
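The effect of inversion can be checked numerically on a 2x2 contingency table. The following is a minimal sketch, assuming the usual cell order (f11, f10, f01, f00); the function names and the counts are hypothetical and are not taken from the figure.

```python
from math import sqrt


def phi(f11, f10, f01, f00):
    """phi-coefficient of a 2x2 contingency table."""
    num = f11 * f00 - f10 * f01
    den = sqrt((f11 + f10) * (f01 + f00) * (f11 + f01) * (f10 + f00))
    return num / den


def jaccard(f11, f10, f01, f00):
    """Jaccard coefficient: co-occurrences over transactions with either item."""
    return f11 / (f11 + f10 + f01)


def invert(f11, f10, f01, f00):
    """Inversion flips every bit, so f11 <-> f00 and f10 <-> f01."""
    return f00, f01, f10, f11


# Hypothetical counts: the two items rarely occur, and co-occur once.
table = (1, 1, 1, 7)
inverted = invert(*table)

print(phi(*table), phi(*inverted))          # identical values
print(jaccard(*table), jaccard(*inverted))  # different values
```

In this sketch the φ-coefficient cannot tell the sparse table from its inverted, dense counterpart, whereas the Jaccard coefficient can.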
6.7 Evaluation of Association Patterns 381
Null Addition Property. Suppose we are interested in analyzing the relationship between a pair of words, such as data and mining, in a set of documents. If a collection of articles about ice fishing is added to the data set, should the association between data and mining be affected? This process of adding unrelated data (in this case, documents) to a given data set is known as the null addition operation.
Definition 6.7 (Null Addition Property). An objective measure M is invariant under the null addition operation if it is not affected by increasing $f_{00}$, while all other frequencies in the contingency table stay the same.
For applications such as document analysis or market basket analysis, the measure is expected to remain invariant under the null addition operation. Otherwise, the relationship between words may disappear simply by adding enough documents that do not contain both words! Examples of measures that satisfy this property include the cosine (IS) and Jaccard (ζ) measures, while those that violate this property include interest factor, PS, odds ratio, and the φ-coefficient.
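The null addition property can be illustrated by increasing only f00 and recomputing a few measures. This is a minimal sketch with hypothetical counts; the helper names are assumptions introduced for illustration, and the formulas are the standard two-variable definitions of cosine (IS), Jaccard, and interest factor.

```python
from math import sqrt


def cosine(f11, f10, f01, f00):
    """Cosine (IS) measure; f00 does not appear in the formula."""
    return f11 / sqrt((f11 + f10) * (f11 + f01))


def jaccard(f11, f10, f01, f00):
    """Jaccard coefficient; f00 does not appear in the formula."""
    return f11 / (f11 + f10 + f01)


def interest(f11, f10, f01, f00):
    """Interest factor N * f11 / (f1+ * f+1); depends on f00 through N."""
    n = f11 + f10 + f01 + f00
    return n * f11 / ((f11 + f10) * (f11 + f01))


def add_nulls(table, extra):
    """Null addition: add `extra` transactions that contain neither item."""
    f11, f10, f01, f00 = table
    return f11, f10, f01, f00 + extra


before = (20, 5, 5, 70)          # hypothetical document counts
after = add_nulls(before, 900)   # add 900 unrelated documents

for measure in (cosine, jaccard, interest):
    print(measure.__name__, measure(*before), measure(*after))
```

Under these assumptions cosine and Jaccard are unchanged, while the interest factor grows with the number of unrelated documents.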
Scaling Property. Table 6.16 shows the contingency tables for gender and the grades achieved by students enrolled in a particular course in 1993 and 2004. The data in these tables show that the number of male students has doubled since 1993, while the number of female students has increased by a factor of 3. However, the male students in 2004 are not performing any better than those in 1993 because the ratio of male students who achieve a high grade to those who achieve a low grade is still the same, i.e., 3:4. Similarly, the female students in 2004 are performing no better than those in 1993. The association between grade and gender is expected to remain unchanged despite changes in the sampling distribution.
Table 6.16. The grade-gender example.

(a) Sample data from 1993.
        Male  Female  Total
High      30      20     50
Low       40      10     50
Total     70      30    100

(b) Sample data from 2004.
        Male  Female  Total
High      60      60    120
Low       80      30    110
Total    140      90    230
382 Chapter 6 Association Analysis
Table 6.17. Properties of symmetric measures.

Symbol  Measure               Inversion  Null Addition  Scaling
φ       φ-coefficient         Yes        No             No
α       odds ratio            Yes        No             Yes
κ       Cohen’s               Yes        No             No
I       Interest              No         No             No
IS      Cosine                No         Yes            No
PS      Piatetsky-Shapiro’s   Yes        No             No
S       Collective strength   Yes        No             No
ζ       Jaccard               No         Yes            No
h       All-confidence        No         No             No
s       Support               No         No             No
Definition 6.8 (Scaling Invariance Property). An objective measure M is invariant under the row/column scaling operation if $M(T) = M(T')$, where $T$ is a contingency table with frequency counts $[f_{11}; f_{10}; f_{01}; f_{00}]$, $T'$ is a contingency table with scaled frequency counts $[k_1 k_3 f_{11}; k_2 k_3 f_{10}; k_1 k_4 f_{01}; k_2 k_4 f_{00}]$, and $k_1$, $k_2$, $k_3$, $k_4$ are positive constants.
From Table 6.17, notice that only the odds ratio (α) is invariant under the row and column scaling operations. All other measures such as the φ-coefficient, κ, IS, interest factor, and collective strength (S) change their values when the rows and columns of the contingency table are rescaled. Although we do not discuss the properties of asymmetric measures (such as confidence, J-measure, Gini index, and conviction), it is clear that such measures do not preserve their values under inversion and row/column scaling operations, but are invariant under the null addition operation.
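The scaling operation of Definition 6.8 can be sketched in a few lines. The example assumes rows are grades (High, Low) and columns are genders (Male, Female), so the cell order is (High-Male, High-Female, Low-Male, Low-Female). The 1993 counts used here are the ones implied by the text, namely the 2004 male column halved and the 2004 female column divided by three; the function names are hypothetical.

```python
from math import isclose, sqrt


def odds_ratio(f11, f10, f01, f00):
    """Odds ratio f11 * f00 / (f10 * f01)."""
    return (f11 * f00) / (f10 * f01)


def phi(f11, f10, f01, f00):
    """phi-coefficient of a 2x2 contingency table."""
    num = f11 * f00 - f10 * f01
    den = sqrt((f11 + f10) * (f01 + f00) * (f11 + f01) * (f10 + f00))
    return num / den


def scale(table, k1, k2, k3, k4):
    """Scale columns by k1, k2 and rows by k3, k4, as in Definition 6.8."""
    f11, f10, f01, f00 = table
    return (k1 * k3 * f11, k2 * k3 * f10, k1 * k4 * f01, k2 * k4 * f00)


t1993 = (30, 20, 40, 10)                       # counts implied by the text
t2004 = scale(t1993, k1=2, k2=3, k3=1, k4=1)   # males x2, females x3

print(t2004)                                   # (60, 60, 80, 30)
print(odds_ratio(*t1993), odds_ratio(*t2004))  # same value for both years
print(phi(*t1993), phi(*t2004))                # values differ between years
print(isclose(odds_ratio(*t1993), odds_ratio(*t2004)))
```

The scaling constants cancel in the odds ratio because each of k1, k2, k3, and k4 appears once in the numerator and once in the denominator, which is not the case for the φ-coefficient.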
6.7.2 Measures beyond Pairs of Binary Variables
The measures shown in Tables 6.11 and 6.12 are defined for pairs of binary variables (e.g., 2-itemsets or association rules). However, many of them, such as support and all-confidence, are also applicable to larger-sized itemsets. Other measures, such as interest factor, IS, PS, and Jaccard coefficient, can be extended to more than two variables using the frequencies tabulated in a multidimensional contingency table. An example of a three-dimensional contingency table for a, b, and c is shown in Table 6.18. Each entry $f_{ijk}$ in this table represents the number of transactions that contain a particular combination of items a, b, and c. For example, $f_{101}$ is the number of transactions that contain a and c, but not b. On the other hand, a marginal frequency such as $f_{1+1}$ is the number of transactions that contain a and c, irrespective of whether b is present in the transaction.

Table 6.18. Example of a three-dimensional contingency table.
Given a k-itemset $\{i_1, i_2, \ldots, i_k\}$, the condition for statistical independence can be stated as follows:

$$f_{i_1 i_2 \ldots i_k} = \frac{f_{i_1 + \ldots +} \times f_{+ i_2 \ldots +} \times \cdots \times f_{+ + \ldots i_k}}{N^{k-1}} \qquad (6.12)$$
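With this definition, we can extend objective measures such as interest factor and PS, which are based on deviations from statistical independence, to more than two variables:

$$I = \frac{N^{k-1} \times f_{i_1 i_2 \ldots i_k}}{f_{i_1 + \ldots +} \times f_{+ i_2 \ldots +} \times \cdots \times f_{+ + \ldots i_k}}$$

$$PS = \frac{f_{i_1 i_2 \ldots i_k}}{N} - \frac{f_{i_1 + \ldots +} \times f_{+ i_2 \ldots +} \times \cdots \times f_{+ + \ldots i_k}}{N^k}$$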
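The extended measures can be computed directly from a list of transactions. The sketch below assumes each transaction is a Python set of item labels and evaluates the formulas for the cell in which every item of the itemset is present; the function name `kway_measures` and the sample data are hypothetical.

```python
from math import prod


def kway_measures(transactions, items):
    """Return (interest factor, PS) for a k-itemset.

    Assumes every item in `items` occurs in at least one transaction,
    otherwise the interest factor is undefined (division by zero).
    """
    n = len(transactions)
    k = len(items)
    # Joint frequency: transactions that contain all k items.
    joint = sum(1 for t in transactions if all(i in t for i in items))
    # One-dimensional marginals: transactions that contain each single item.
    marginals = [sum(1 for t in transactions if i in t) for i in items]
    interest = n ** (k - 1) * joint / prod(marginals)
    ps = joint / n - prod(marginals) / n ** k
    return interest, ps


# Hypothetical data: a, b, and c co-occur more often than chance predicts.
transactions = [
    {"a", "b", "c"}, {"a", "b", "c"}, {"a"}, {"b"},
    {"c"}, set(), set(), set(),
]
interest, ps = kway_measures(transactions, ["a", "b", "c"])
print(interest, ps)  # interest > 1 and PS > 0 for this data
```

When the items are statistically independent in the sense of Equation 6.12, the interest factor equals 1 and PS equals 0.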
Another approach is to define the objective measure as the maximum, minimum, or average value for the associations between pairs of items in a pattern. For example, given a k-itemset $X = \{i_1, i_2, \ldots, i_k\}$, we may define the φ-coefficient for X as the average φ-coefficient between every pair of items $(i_p, i_q)$ in X. However, because the measure considers only pairwise associations, it may not capture all the underlying relationships within a pattern.
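The pairwise approach can be sketched as follows, assuming transactions are Python sets and using the average as the aggregate (the maximum or minimum would be a one-line change). The function names and the sample data are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pair_phi(transactions, x, y):
    """phi-coefficient between items x and y over a list of transactions."""
    f11 = f10 = f01 = f00 = 0
    for t in transactions:
        if x in t and y in t:
            f11 += 1
        elif x in t:
            f10 += 1
        elif y in t:
            f01 += 1
        else:
            f00 += 1
    # Assumes no marginal is zero, otherwise phi is undefined.
    den = sqrt((f11 + f10) * (f01 + f00) * (f11 + f01) * (f10 + f00))
    return (f11 * f00 - f10 * f01) / den


def average_phi(transactions, itemset):
    """Average phi-coefficient over every pair of items in the itemset."""
    pairs = list(combinations(sorted(itemset), 2))
    return sum(pair_phi(transactions, x, y) for x, y in pairs) / len(pairs)


# Hypothetical data: a and b always co-occur, c never appears with them.
transactions = [{"a", "b"}, {"a", "b"}, {"c"}, {"c"}]
print(average_phi(transactions, {"a", "b", "c"}))
```

In this example the perfect association between a and b is averaged with two perfectly negative associations, which illustrates how a pairwise summary can hide structure within the pattern.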
Analysis of multidimensional contingency tables is more complicated because of the presence of partial associations in the data. For example, some associations may appear or disappear when conditioned upon the value of certain variables. This problem is known as Simpson’s paradox and is described in the next section. More sophisticated statistical techniques are available to analyze such relationships, e.g., loglinear models, but these techniques are beyond the scope of this book.