Rules: {b} * {“}, {r} – {d,}, {b} – {d}, {“} – {“}, {c}—— ia}.
(b) Use the contingency tables in part (a) to compute and rank the rules in decreasing order according to the following measures.
i. Support.
ii. Confidence.
iii. Interest(X —-+Y): t##PV).
iv. IS(X —— Y) —
v. Klosgen( X ——+ Y) : \IF@fi x (P (Y lX) – P (Y)), where P(Y lX) :
P(X \
vi. Odds ratio(X—— Y) :1+++EF).t – e6.Y1e1X.v1′
Given the rankings you had obtained in Exercise 12, compute the correlation between the rankings of confidence and the other five measures. Which measure is most highly correlated with confidence? Which measure is least correlated with confidence?
Answer the following questions using the data sets shown in Figure 6.34. Note that each data set contains 1000 items and 10,000 transactions. Dark cells indicate the presence of items and white cells indicate the absence of items. We will apply l}ae Apriori, algorithm to extract frequent itemsets with m’insup :
I0% (i.e., itemsets must be contained in at least 1000 transactions)?
(a) Which data set(s) will produce the most number of frequent itemsets?
15. (u)
(b)
(“)
6.L0 Exercises 4LL
(b) Which data set(s) will produce the fewest number of frequent itemsets?
(c) Which data set(s) will produce the longest frequent itemset?
(d) Which data set(s) will produce frequent itemsets with highest maximum support?
(e) Which data set(s) will produce frequent itemsets containing items with wide-va,rying support levels (i.e., items with mixed support, ranging from less than 20% to more than 70%).
Prove that the / coefficient is equal to 1 if and only if fn : fr+ : f +r.
Show that if A and B are independent, then P(A, B)xP(A, B) : P(A, B)x P(A, B).
Show that Yule’s Q and Y coefficients
n f/rrloo – /ro/otl
\4 : L/,r/00T /./rl
r , l r f i t r f * – t f i t ; I ‘ 1 t’ff ‘J* + ‘If.of”)
are normalized versions of the odds ratio.
(d) Write a simplified expression for the value of each measure shown in Tables 6.11 and 6.12 when the variables are statistically independent.
16. Consider the interestingness mea,sure, m : !-(W#4, for an association
rule ,4, —–+ B.
(a) What is the range of this measure? When does the measure attain its maximum and minimum values?
(b) How does M behave when P(,4, B) is increased while P(,a) and P(B) remain unchanged?
(c) How does M behave when P(A) is increased while P(,4, B) and P(B) remain unchanged?
(d) How does M behave when P(B) is increased while P(A, B) and P(A) remain unchanged?
(e) Is the measure symmetric under variable permutation?
(f) What is the value of the measure when A and B are statistically indepen- dent?
(g) Is the measure null-invariant?
(h) Does the measure remain inva.riant under row or column scaling opera- tions?
(i) How does the measure behave under the inversion operation?
412 Chapter 6 Association Analysis
2000
4000
6000
8000
o
()(U o $u F
o
U’
tr
o o o(U at, (U F
200 400 600 800 (b)
Items
2000
4000
6000
8000
2000
4000
6000
8000
2000
4000
6000
8000
o
oqt
C’ F
I
l r I –
I . I
2000
4000
6000
8000
200 400 600 800 (c)
Items
200 400 600 800 (e)
2000
4000
6000
8000
200 400 600 800 (d)
Items
10% are 1s 90% are 0s
(uniformly distributed)
200 400 600 800 (f)
o
o(U at, (I’
F
Items
l.
I
r r
I- t
T
I
I t
I
l –
I
I
– I I
I
r r
r l –
I I
Figute 6.34. Figures for Exercise 14.
t7.
6.10 Exercises 4Lg
Suppose we have market basket data consisting of 100 transactions and 20 items. If the support for item ais25To, the support for item bis90% and the support for itemset {a, b} is 20%. Let the support and confidence thresholds be L0% and 60%, respectively.
(a) Compute the confidence of the association rule {o} * {b}. Is the rule interesting according to the confidence measure?
(b) Compute the interest measure for the association pattern {4, b}. Describe the nature of the relationship between item a and item b in terms of the interest measure.
(c) What conclusions can you draw from the results of parts (a) and (b)?
(d) Prove that if the confidence of the rule {o} * {b} is less than the support of {b}, then:
i. c({a}——‘ {bi) > c({di—— {bi), ii. c({a} —‘ {b}) > .s({b}),
where c(.) denote the rule confidence and s(‘) denote the support of an itemset.
Table 6.26 shows a 2 x 2 x 2 contingency table for the binary variables A and B at different values of the control variable C.
Table 6.26. A Contingency Table.
A
1 0
C = 0 B 1 0 1 5
0 1 5 30
C = 1 B 1 5 0
0 0 1 5
(a) Compute the @ coefficient for A and B when C : O, C : I, and C : 0 or 1 . Note tha t6( {A .8} ) : Wr \ ( t ) /
\ / P ( A ) P ( B ) ( 1 – P ( A ) ) ( I – P ( B ) )
(b) What conclusions can you draw from the above result?
19. Consider the contingency tables shown in Table 6.27.
(a) For table I, compute support, the interest measure, and the / correla- tion coeffi.cient for the association pattern {A, B}. Also, compute the
confidence of rules A — B and B — A.
18.
4L4 Chapter 6 Association Analysis
Table 6.27. Contingency tables for Exercise 19,
(b) Table II.
(b) For table II, compute support, the interest measure, and the @ correla- tion coefficient for the association pattern {A, B}. Also, compute the confidence of rules A —, B and, B – A.
(c) What conclusions can you draw from the results of (a) and (b)?
20. Consider the relationship between customers who buy high-definition televisions and exercise machines as shown in Tables 6.19 and 6.20.
(a) Compute the odds ratios for both tables.
(b) Compute the fcoefficient for both tables.
(c) Compute the interest factor for both tables.
For each of the measures given above, describe how the direction of association changes when data is pooled together instead of being stratified.
A
A
A
A
Association Analysis: Advanced Concepts
The association rule mining formulation described in the previous chapter assumes that the input data consists of binary attributes called items. The presence of an item in a transaction is also assumed to be more important than its absence. As a result, an item is treated as an asymmetric binary attribute and only frequent patterns are considered interesting.
This chapter extends the formulation to data sets with symmetric binary, categorical, and continuous attributes. The formulation will also be extended to incorporate more complex entities such as sequences and graphs. Although the overall structure of association analysis algorithms remains unchanged, cer- tain aspects of the algorithms must be modified to handle the non-traditional entities.
7.1 Handling Categorical Attributes
There are many applications that contain symmetric binary and nominal at- tributes. The Internet survey data shown in Table 7.1 contains symmetric binary attributes such as Gender, Computer at Home, Chat Online, Shop On1ine, and Privacy Concerns; as well as nominal attributes such as Leve1 of Education and State. Using association analysis, we may uncover inter- esting information about the characteristics of Internet users such as:
{Suop On1ine = Yes} ——+ {Privacy Concerns = Yes}.
This rule suggests that most Internet users who shop online are concerned about their personal privacy.
4LG Chapter 7 Association Analysis: Advanced Concepts
Table 7.1. Internet survey data with categorical attributes.
Gender l o tLeve Education
State Computer at l{ome
atUh Online
Shop Online
Privacy Concerns
Female Male Male
Female Female Male Male Male
Female
Graduate College
Graduate College
Graduate College College
High School Graduate
Illinois California Michigan Virginia
California Minnesota
Alaska Oregon Texas
Yes No Yes No Yes Yes Yes Yes *:
Yes No Yes No No Yes Yes No
l::
Yes No Yes Yes No Yes Yes No No
Yes No Yes Yes Yes Yes No No
):
To extract such patterns, the categorical and symmetric binary attributes are transformed into “items” first, so that existing association rule mining algorithms can be applied. This type of transformation can be performed by creating a new item for each distinct attribute-value pair. For example, the nominal attribute Level of Education can be replaced by three binary items: Education = Co1lege, Education = Graduate, and Education = High School. Similarly, symmetric binary attributes such as Gender can be con- verted into a pair of binary items, MaIe and Fenale. Table 7.2 shows the result of binarizing the Internet survey data.
Table7.2. Internet survey data after binarizing categorical and symmetric binary attributes.
Male Female Education : Graduate
Education : College
Privacy : Yes
Privacy : N O
0 I I
0 0 I 1 1 0
t
0 0 I 1 0 0 0
i
1 0 I 0 L 0 0 0 1
0 1 0 1 I
0 I I 0 0
I
0 1 I 1 1 0 0 0
0 1 0 0 0 0 1 1 I
7.L HandlingCategoricalAttributes 41,7
There are several issues to consider when applying association analysis to the binarized data:
1. Some attribute values may not be frequent enough to be part of a fre- quent pattern. This problem is more evident for nominal attributes that have many possible values, e.g., state names. Lowering the support threshold does not help because it exponentially increases the number of frequent patterns found (many of which may be spurious) and makes the computation more expensive. A more practical solution is to group related attribute values into a small number of categories. For exam- ple, each state name can be replaced by its corresponding geographi- cal region, such as Midwest, Pacific Northwest, Southwest, and East Coast. Another possibility is to aggregate the less frequent attribute values into a single category called Others, as shown in Figure 7.1.