An objective measure is a data-driven approach for evaluating the quality of association patterns. It is domain-independent and requires minimal input from the users, other than to specify a threshold for filtering low-quality patterns. An objective measure is usually computed based on the frequency


372 Chapter 6 Association Analysis

Table 6.7. A 2-way contingency table for variables A and B.

                  B          $\overline{B}$
A                 $f_{11}$   $f_{10}$         $f_{1+}$
$\overline{A}$    $f_{01}$   $f_{00}$         $f_{0+}$
                  $f_{+1}$   $f_{+0}$         $N$

counts tabulated in a contingency table. Table 6.7 shows an example of a contingency table for a pair of binary variables, A and B. We use the notation $\overline{A}$ ($\overline{B}$) to indicate that A (B) is absent from a transaction. Each entry $f_{ij}$ in this 2 x 2 table denotes a frequency count. For example, $f_{11}$ is the number of times A and B appear together in the same transaction, while $f_{01}$ is the number of transactions that contain B but not A. The row sum $f_{1+}$ represents the support count for A, while the column sum $f_{+1}$ represents the support count for B. Finally, even though our discussion focuses mainly on asymmetric binary variables, note that contingency tables are also applicable to other attribute types such as symmetric binary, nominal, and ordinal variables.
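As a rough illustration of how the four cell counts are tabulated, the sketch below counts co-occurrences of two items over a list of transactions. The function name `contingency_table` and the toy transactions are hypothetical and are not taken from the text.

```python
def contingency_table(transactions, a, b):
    """Return the cell counts (f11, f10, f01, f00) for items a and b.

    f11: both present, f10: only a, f01: only b, f00: neither.
    """
    f11 = f10 = f01 = f00 = 0
    for t in transactions:
        has_a, has_b = a in t, b in t
        if has_a and has_b:
            f11 += 1
        elif has_a:
            f10 += 1
        elif has_b:
            f01 += 1
        else:
            f00 += 1
    return f11, f10, f01, f00


# Hypothetical toy data, used only to exercise the function.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers"},
    {"bread", "milk", "diapers"},
    {"beer"},
]

f11, f10, f01, f00 = contingency_table(transactions, "bread", "milk")
support_count_a = f11 + f10   # row sum f1+
support_count_b = f11 + f01   # column sum f+1
n = f11 + f10 + f01 + f00     # total number of transactions N
```

The row and column sums are derived from the cells rather than stored, so the table always stays internally consistent.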

Limitations of the Support-Confidence Framework The existing association rule mining formulation relies on the support and confidence measures to eliminate uninteresting patterns. The drawback of support was previously described in Section 6.8, in which many potentially interesting patterns involving low support items might be eliminated by the support threshold. The drawback of confidence is more subtle and is best demonstrated with the following example.

Example 6.3. Suppose we are interested in analyzing the relationship between people who drink tea and coffee. We may gather information about the beverage preferences among a group of people and summarize their responses into a table such as the one shown in Table 6.8.

Table 6.8. Beverage preferences among a group of 1000 people.

                    Coffee   $\overline{Coffee}$
Tea                 150      50                     200
$\overline{Tea}$    650      150                    800
                    800      200                    1000

Evaluation of Association Patterns 373

The information given in this table can be used to evaluate the association rule {Tea} → {Coffee}. At first glance, it may appear that people who drink tea also tend to drink coffee because the rule's support (15%) and confidence (75%) values are reasonably high. This argument would have been acceptable except that the fraction of people who drink coffee, regardless of whether they drink tea, is 80%, while the fraction of tea drinkers who drink coffee is only 75%. Thus knowing that a person is a tea drinker actually decreases her probability of being a coffee drinker from 80% to 75%. The rule {Tea} → {Coffee} is therefore misleading despite its high confidence value.
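The figures quoted in this example can be reproduced from the counts in Table 6.8. The following is a minimal sketch; the variable names are illustrative only.

```python
# Cell counts from Table 6.8 (rows: Tea / no Tea, columns: Coffee / no Coffee).
f11, f10 = 150, 50    # tea drinkers: with coffee, without coffee
f01, f00 = 650, 150   # non-tea drinkers: with coffee, without coffee
n = f11 + f10 + f01 + f00

support = f11 / n                  # s(Tea, Coffee)
confidence = f11 / (f11 + f10)     # c(Tea -> Coffee)
coffee_support = (f11 + f01) / n   # s(Coffee), ignoring tea altogether

print(f"support={support:.2f} confidence={confidence:.2f} "
      f"s(Coffee)={coffee_support:.2f}")

# The rule looks strong, yet its confidence is below the base rate of coffee.
rule_is_misleading = confidence < coffee_support
```

Comparing the confidence against the support of the consequent is exactly the check that the confidence measure alone omits.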

The pitfall of confidence can be traced to the fact that the measure ignores the support of the itemset in the rule consequent. Indeed, if the support of coffee drinkers is taken into account, we would not be surprised to find that many of the people who drink tea also drink coffee. What is more surprising is that the fraction of tea drinkers who drink coffee is actually less than the overall fraction of people who drink coffee, which points to an inverse relationship between tea drinkers and coffee drinkers.

Because of the limitations in the support-confidence framework, various objective measures have been used to evaluate the quality of association patterns. Below, we provide a brief description of these measures and explain some of their strengths and limitations.

Interest Factor The tea-coffee example shows that high-confidence rules can sometimes be misleading because the confidence measure ignores the support of the itemset appearing in the rule consequent. One way to address this problem is by applying a metric known as lift:

$$\text{Lift} = \frac{c(A \rightarrow B)}{s(B)}, \tag{6.4}$$

which computes the ratio between the rule's confidence and the support of the itemset in the rule consequent. For binary variables, lift is equivalent to another objective measure called interest factor, which is defined as follows:

$$I(A, B) = \frac{s(A, B)}{s(A) \times s(B)} = \frac{N f_{11}}{f_{1+} f_{+1}}. \tag{6.5}$$

Interest factor compares the frequency of a pattern against a baseline frequency computed under the statistical independence assumption. The baseline frequency for a pair of mutually independent variables is

$$\frac{f_{11}}{N} = \frac{f_{1+}}{N} \times \frac{f_{+1}}{N}, \quad \text{or equivalently,} \quad f_{11} = \frac{f_{1+} f_{+1}}{N}. \tag{6.6}$$

Table 6.9. Contingency tables for the word pairs {p,q} and {r,s}.

Word pair {p,q}:

                  p      $\overline{p}$
q                 880    50                930
$\overline{q}$    50     20                70
                  930    70                1000

This equation follows from the standard approach of using simple fractions as estimates for probabilities. The fraction $f_{11}/N$ is an estimate for the joint probability P(A,B), while $f_{1+}/N$ and $f_{+1}/N$ are the estimates for P(A) and P(B), respectively. If A and B are statistically independent, then P(A,B) = P(A) x P(B), thus leading to the formula shown in Equation 6.6. Using Equations 6.5 and 6.6, we can interpret the measure as follows:

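$$I(A, B) \begin{cases} = 1, & \text{if } A \text{ and } B \text{ are independent;} \\ > 1, & \text{if } A \text{ and } B \text{ are positively correlated;} \\ < 1, & \text{if } A \text{ and } B \text{ are negatively correlated.} \end{cases} \tag{6.7}$$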

For the tea-coffee example shown in Table 6.8, $I = \frac{0.15}{0.2 \times 0.8} = 0.9375$, thus suggesting a slight negative correlation between tea drinkers and coffee drinkers.
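The interest factor for the tea-coffee data can be checked with a few lines of code. This is a sketch only: the function names `interest_factor` and `interpret` are hypothetical, and the tolerance used to decide "equal to 1" is an arbitrary assumption.

```python
def interest_factor(f11, f10, f01, f00):
    """Interest factor I(A, B) = N * f11 / (f1+ * f+1) for a 2x2 table."""
    n = f11 + f10 + f01 + f00
    row_a = f11 + f10   # f1+, support count of A
    col_b = f11 + f01   # f+1, support count of B
    return n * f11 / (row_a * col_b)


def interpret(value, tol=1e-9):
    """Map an interest factor to the three cases of the interpretation rule."""
    if abs(value - 1.0) <= tol:
        return "independent"
    return "positively correlated" if value > 1.0 else "negatively correlated"


# Counts from Table 6.8: A = Tea, B = Coffee.
tea_coffee = interest_factor(150, 50, 650, 150)
verdict = interpret(tea_coffee)
```

Working from raw counts avoids rounding the supports before dividing.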

Limitations of Interest Factor We illustrate the limitation of interest factor with an example from the text mining domain. In the text domain, it is reasonable to assume that the association between a pair of words depends on the number of documents that contain both words. For example, because of their stronger association, we expect the words data and mining to appear together more frequently than the words compiler and mining in a collection of computer science articles.

Table 6.9 shows the frequency of occurrences between two pairs of words, {p,q} and {r,s}. Using the formula given in Equation 6.5, the interest factor for {p,q} is 1.02 and for {r,s} is 4.08. These results are somewhat troubling for the following reasons. Although p and q appear together in 88% of the documents, their interest factor is close to 1, which is the value when p and q are statistically independent. On the other hand, the interest factor for {r,s} is higher than {p,q} even though r and s seldom appear together in the same document. Confidence is perhaps the better choice in this situation because it considers the association between p and q (94.6%) to be much stronger than that between r and s (28.6%).
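The contrast between the two measures can be reproduced numerically. The sketch below assumes the cell counts of Table 6.9 as (f11, f10, f01, f00), with f11 = 880 for {p,q} and f11 = 20 for {r,s}; the dictionary layout and helper names are illustrative.

```python
# Cell counts (f11, f10, f01, f00) assumed from Table 6.9.
word_pairs = {
    "p,q": (880, 50, 50, 20),
    "r,s": (20, 50, 50, 880),
}


def interest_factor(f11, f10, f01, f00):
    """I = N * f11 / (f1+ * f+1)."""
    n = f11 + f10 + f01 + f00
    return n * f11 / ((f11 + f10) * (f11 + f01))


def confidence(f11, f10, f01, f00):
    """Confidence of the rule A -> B, i.e. f11 / f1+."""
    return f11 / (f11 + f10)


results = {
    pair: (interest_factor(*cells), confidence(*cells))
    for pair, cells in word_pairs.items()
}

for pair, (i_value, c_value) in results.items():
    print(f"{{{pair}}}: interest factor={i_value:.2f}, confidence={c_value:.1%}")
```

Interest factor ranks {r,s} above {p,q}, while confidence ranks them the other way round, which is the discrepancy described above.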

Table 6.9 (continued). Word pair {r,s}:

                  r      $\overline{r}$
s                 20     50                70
$\overline{s}$    50     880               930
                  70     930               1000


Correlation Analysis Correlation analysis is a statistical-based technique for analyzing relationships between a pair of variables. For continuous variables, correlation is defined using Pearson's correlation coefficient (see Equation 2.10 on page 77). For binary variables, correlation can be measured using the $\phi$-coefficient, which is defined as

$$\phi = \frac{f_{11} f_{00} - f_{01} f_{10}}{\sqrt{f_{1+} f_{+1} f_{0+} f_{+0}}}. \tag{6.8}$$

The value of correlation ranges from -1 (perfect negative correlation) to +1 (perfect positive correlation). If the variables are statistically independent, then $\phi = 0$. For example, the correlation between the tea and coffee drinkers given in Table 6.8 is -0.0625.
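The tea-coffee value can be verified with a short computation. The sketch assumes the usual form of the φ-coefficient for a 2 x 2 table, (f11 f00 - f01 f10) / sqrt(f1+ f+1 f0+ f+0); the function name `phi_coefficient` is hypothetical.

```python
import math


def phi_coefficient(f11, f10, f01, f00):
    """Phi-coefficient of a 2x2 contingency table.

    Assumes every row and column sum is nonzero; otherwise the
    denominator is zero and the measure is undefined.
    """
    row_a, row_not_a = f11 + f10, f01 + f00   # f1+, f0+
    col_b, col_not_b = f11 + f01, f10 + f00   # f+1, f+0
    numerator = f11 * f00 - f01 * f10
    return numerator / math.sqrt(row_a * col_b * row_not_a * col_not_b)


# Counts from Table 6.8: A = Tea, B = Coffee.
tea_coffee_phi = phi_coefficient(150, 50, 650, 150)
```

The small negative value is consistent with the interest factor being slightly below 1 for the same data.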

Limitations of Correlation Analysis The drawback of using correlation can be seen from the word association example given in Table 6.9. Although the words p and q appear together more often than r and s, their $\phi$-coefficients are identical, i.e., $\phi(p,q) = \phi(r,s) = 0.232$. This is because the $\phi$-coefficient gives equal importance to both co-presence and co-absence of items in a transaction. It is therefore more suitable for analyzing symmetric binary variables. Another limitation of this measure is that it does not remain invariant when there are proportional changes to the sample size. This issue will be discussed in greater detail when we describe the properties of objective measures on page 377.
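A worked calculation shows why the two coefficients coincide. It assumes the usual φ-coefficient formula for a 2 x 2 table and the cell counts of Table 6.9 (f11 = 880, f10 = f01 = 50, f00 = 20 for {p,q}; f11 = 20, f10 = f01 = 50, f00 = 880 for {r,s}). The two tables differ only in that the co-presence and co-absence counts are swapped, which leaves both the numerator and the denominator unchanged:

```latex
\begin{align*}
\phi(p,q) &= \frac{(880)(20) - (50)(50)}{\sqrt{930 \cdot 70 \cdot 930 \cdot 70}}
           = \frac{15100}{65100} \approx 0.232, \\
\phi(r,s) &= \frac{(20)(880) - (50)(50)}{\sqrt{70 \cdot 930 \cdot 70 \cdot 930}}
           = \frac{15100}{65100} \approx 0.232.
\end{align*}
```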

IS Measure IS is an alternative measure that has been proposed for handling asymmetric binary variables. The measure is defined as follows:

$$IS(A, B) = \sqrt{I(A, B) \times s(A, B)} = \frac{s(A, B)}{\sqrt{s(A)\, s(B)}}. \tag{6.9}$$
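As a quick numerical check, the sketch below evaluates IS for the two word pairs of Table 6.9. It assumes the form IS = s(A,B) / sqrt(s(A) s(B)), which equals the square root of interest factor times support; the function name `is_measure` and the cell ordering (f11, f10, f01, f00) are illustrative choices.

```python
import math


def is_measure(f11, f10, f01, f00):
    """IS(A, B) = s(A, B) / sqrt(s(A) * s(B)) for a 2x2 table."""
    n = f11 + f10 + f01 + f00
    s_ab = f11 / n
    s_a = (f11 + f10) / n
    s_b = (f11 + f01) / n
    return s_ab / math.sqrt(s_a * s_b)


# Cell counts assumed from Table 6.9.
is_pq = is_measure(880, 50, 50, 20)
is_rs = is_measure(20, 50, 50, 880)

# Equivalent form: sqrt(interest factor * support), shown for {p,q}.
interest_pq = 1000 * 880 / (930 * 930)
is_pq_alt = math.sqrt(interest_pq * 0.88)
```

Unlike the φ-coefficient, the count f00 enters only through N, so co-absence does not inflate the score.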

Note that IS is large when the interest factor and support of the pattern are large. For example, the values of IS for the word pairs {p, q} and {r, s} shown in Table 6.9 are 0.946 and 0.286, respectively. Contrary to the results given by interest factor and the $\phi$-coefficient, the IS measure suggests that the association between {p, q} is stronger than {r, s}, which agrees with what we expect from word associations in documents. It is possible to show that IS is mathematically equivalent to the cosine