6.10 Exercises
1. For each of the following questions, provide an example of an association rule from the market basket domain that satisfies the following conditions. Also, describe whether such rules are subjectively interesting.
(a) A rule that has high support and high confidence.
(b) A rule that has reasonably high support but low confidence.
(c) A rule that has low support and low confidence.
(d) A rule that has low support and high confidence.
2. Consider the data set shown in Table 6.22.
Table 6.22. Example of market basket transactions.
Customer ID   Transaction ID   Items Bought
1             0001             {a, d, e}
1             0024             {a, b, c, e}
2             0012             {a, b, d, e}
2             0031             {a, c, d, e}
3             0015             {b, c, e}
3             0022             {b, d, e}
4             0029             {c, d}
4             0040             {a, b, c}
5             0033             {a, d, e}
5             0038             {a, b, e}
(a) Compute the support for itemsets {e}, {b,d}, and {b,d,e} by treating each transaction ID as a market basket.
(b) Use the results in part (a) to compute the confidence for the association rules {b, d} → {e} and {e} → {b, d}. Is confidence a symmetric measure?
(c) Repeat part (a) by treating each customer ID as a market basket. Each item should be treated as a binary variable (1 if an item appears in at least one transaction bought by the customer, and 0 otherwise).
(d) Use the results in part (c) to compute the confidence for the association rules {b, d} → {e} and {e} → {b, d}.
(e) Suppose s1 and c1 are the support and confidence values of an association rule r when treating each transaction ID as a market basket. Also, let s2 and c2 be the support and confidence values of r when treating each customer ID as a market basket. Discuss whether there are any relationships between s1 and s2 or c1 and c2. (A code sketch of these computations follows the exercise.)
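The counting described in parts (a)-(e) can be illustrated with a short Python sketch. It is not part of the original exercise, the function names are ad hoc, and the record list simply transcribes Table 6.22. Treating transaction IDs as baskets uses the item lists directly, while treating customer IDs as baskets takes the union of each customer's transactions.

```python
# Illustrative sketch for Exercise 2 (not from the text); records transcribe Table 6.22.
records = [
    (1, "0001", {"a", "d", "e"}), (1, "0024", {"a", "b", "c", "e"}),
    (2, "0012", {"a", "b", "d", "e"}), (2, "0031", {"a", "c", "d", "e"}),
    (3, "0015", {"b", "c", "e"}), (3, "0022", {"b", "d", "e"}),
    (4, "0029", {"c", "d"}), (4, "0040", {"a", "b", "c"}),
    (5, "0033", {"a", "d", "e"}), (5, "0038", {"a", "b", "e"}),
]

def support(baskets, itemset):
    """Fraction of baskets containing every item of `itemset`."""
    return sum(set(itemset) <= b for b in baskets) / len(baskets)

def confidence(baskets, lhs, rhs):
    """c(lhs -> rhs) = s(lhs U rhs) / s(lhs)."""
    return support(baskets, set(lhs) | set(rhs)) / support(baskets, lhs)

# Parts (a), (b): each transaction ID is a basket.
tx_baskets = [items for _, _, items in records]
print(support(tx_baskets, {"e"}), support(tx_baskets, {"b", "d"}),
      support(tx_baskets, {"b", "d", "e"}))
print(confidence(tx_baskets, {"b", "d"}, {"e"}), confidence(tx_baskets, {"e"}, {"b", "d"}))

# Parts (c), (d): each customer ID is a basket (union of that customer's transactions).
by_customer = {}
for cust, _, items in records:
    by_customer.setdefault(cust, set()).update(items)
cust_baskets = list(by_customer.values())
print(confidence(cust_baskets, {"b", "d"}, {"e"}), confidence(cust_baskets, {"e"}, {"b", "d"}))
```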
3. (a) What is the confidence for the rules ∅ → A and A → ∅?
(b) Let c1, c2, and c3 be the confidence values of the rules {p} → {q}, {p} → {q, r}, and {p, r} → {q}, respectively. If we assume that c1, c2, and c3 have different values, what are the possible relationships that may exist among c1, c2, and c3? Which rule has the lowest confidence?
(c) Repeat the analysis in part (b) assuming that the rules have identical support. Which rule has the highest confidence?
(d) Transitivity: Suppose the confidence of the rules A → B and B → C are larger than some threshold, minconf. Is it possible that A → C has a confidence less than minconf?
4. For each of the following measures, determine whether it is monotone, anti-monotone, or non-monotone (i.e., neither monotone nor anti-monotone).
Example: Support, s = σ(X)/|T|, is anti-monotone because s(X) ≥ s(Y) whenever X ⊂ Y.
(a) A characteristic rule is a rule of the form {p} → {q1, q2, ..., qn}, where the rule antecedent contains only a single item. An itemset of size k can produce up to k characteristic rules. Let ζ be the minimum confidence of all characteristic rules generated from a given itemset:

ζ({p1, p2, ..., pk}) = min[ c({p1} → {p2, p3, ..., pk}), ..., c({pk} → {p1, p2, ..., pk-1}) ]

Is ζ monotone, anti-monotone, or non-monotone?
(b) A discriminant rule is a rule of the form {p1, p2, ..., pn} → {q}, where the rule consequent contains only a single item. An itemset of size k can produce up to k discriminant rules. Let η be the minimum confidence of all discriminant rules generated from a given itemset:

η({p1, p2, ..., pk}) = min[ c({p2, p3, ..., pk} → {p1}), ..., c({p1, p2, ..., pk-1} → {pk}) ]

Is η monotone, anti-monotone, or non-monotone?
(c) Repeat the analysis in parts (a) and (b) by replacing the min function with a max function. (A code sketch of the ζ and η measures follows this exercise.)
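The ζ and η measures defined in parts (a) and (b) can be made concrete with a minimal sketch. The helper below is hypothetical (not from the text); it computes both measures for one itemset by enumerating its characteristic and discriminant rules over a given list of baskets.

```python
def zeta_eta(itemset, baskets):
    """Minimum confidence over all characteristic rules {p} -> rest (zeta)
    and over all discriminant rules rest -> {p} (eta) of `itemset`."""
    def count(x):                            # support count of itemset x
        return sum(set(x) <= b for b in baskets)

    items = list(itemset)
    whole = count(items)
    # c({p} -> rest) = s(itemset) / s({p});   c(rest -> {p}) = s(itemset) / s(rest)
    zeta = min(whole / count([p]) for p in items)
    eta = min(whole / count([q for q in items if q != p]) for p in items)
    return zeta, eta

# Example with the transactions of Table 6.22 (transaction IDs as baskets):
baskets = [{"a","d","e"}, {"a","b","c","e"}, {"a","b","d","e"}, {"a","c","d","e"},
           {"b","c","e"}, {"b","d","e"}, {"c","d"}, {"a","b","c"},
           {"a","d","e"}, {"a","b","e"}]
print(zeta_eta({"b", "d", "e"}, baskets))
```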
5. Prove Equation 6.3. (Hint: First, count the number of ways to create an itemset that forms the left-hand side of the rule. Next, for each size-k itemset selected for the left-hand side, count the number of ways to choose the remaining d − k items to form the right-hand side of the rule.) A brute-force check of the formula is sketched below.
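Equation 6.3 gives the total number of rules extractable from d items as R = 3^d − 2^(d+1) + 1. The brute-force check below is an illustration of the hint, not the requested proof: pick a nonempty left-hand side of size k, then any nonempty subset of the remaining d − k items as the right-hand side.

```python
from itertools import combinations

def count_rules(d):
    """Count rules X -> Y with X and Y nonempty, disjoint subsets of d items."""
    items = range(d)
    total = 0
    for k in range(1, d):                      # size of the left-hand side
        for lhs in combinations(items, k):
            remaining = d - k                  # items left for the right-hand side
            total += 2 ** remaining - 1        # any nonempty subset of them
    return total

for d in range(2, 8):
    assert count_rules(d) == 3 ** d - 2 ** (d + 1) + 1
    print(d, count_rules(d))
```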
Table 6.23. Market basket transactions.
Transaction ID   Items Bought
1                {Milk, Beer, Diapers}
2                {Bread, Butter, Milk}
3                {Milk, Diapers, Cookies}
4                {Bread, Butter, Cookies}
5                {Beer, Cookies, Diapers}
6                {Milk, Diapers, Bread, Butter}
7                {Bread, Butter, Diapers}
8                {Beer, Diapers}
9                {Milk, Diapers, Bread, Butter}
10               {Beer, Cookies}
6. Consider the market basket transactions shown in Table 6.23.
(a) What is the maximum number of association rules that can be extracted from this data (including rules that have zero support)?
(b) What is the maximum size of frequent itemsets that can be extracted (assuming minsup > 0)?
(c) Write an expression for the maximum number of size-3 itemsets that can be derived from this data set.
(d) Find an itemset (of size 2 or larger) that has the largest support.
(e) Find a pair of items, a and b, such that the rules {a} → {b} and {b} → {a} have the same confidence.
7. Consider the following set of frequent 3-itemsets:
{1, 2, 3}, {1, 2, 4}, {1, 2, 5}, {1, 3, 4}, {1, 3, 5}, {2, 3, 4}, {2, 3, 5}, {3, 4, 5}.
Assume that there are only five items in the data set.
(a) List all candidate 4-itemsets obtained by a candidate generation procedure using the F_{k-1} × F_1 merging strategy.
(b) List all candidate 4-itemsets obtained by the candidate generation procedure in Apriori (the F_{k-1} × F_{k-1} merging strategy).
(c) List all candidate 4-itemsets that survive the candidate pruning step of the Apriori algorithm. (A code sketch of the Apriori merge-and-prune procedure follows this exercise.)
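The merge-and-prune steps referred to in parts (b) and (c) can be sketched as follows. This is a rough illustration of the Apriori F_{k-1} × F_{k-1} candidate generation and subset-based pruning, not the book's pseudocode; the function name is made up.

```python
from itertools import combinations

def apriori_gen(frequent_k):
    """Join frequent k-itemsets that share their first k-1 items, then prune
    any candidate (k+1)-itemset that has an infrequent k-subset."""
    freq = sorted(tuple(sorted(s)) for s in frequent_k)
    k = len(freq[0])
    candidates = set()
    for i in range(len(freq)):
        for j in range(i + 1, len(freq)):
            a, b = freq[i], freq[j]
            if a[:k - 1] == b[:k - 1]:                  # identical first k-1 items
                candidates.add(tuple(sorted(set(a) | set(b))))
    freq_set = set(freq)                                # candidate pruning step
    return [c for c in sorted(candidates)
            if all(tuple(sorted(sub)) in freq_set for sub in combinations(c, k))]

# Frequent 3-itemsets from Exercise 7:
f3 = [{1, 2, 3}, {1, 2, 4}, {1, 2, 5}, {1, 3, 4}, {1, 3, 5}, {2, 3, 4}, {2, 3, 5}, {3, 4, 5}]
print(apriori_gen(f3))    # candidate 4-itemsets that survive pruning
```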
8. The Apriori algorithm uses a generate-and-count strategy for deriving frequent itemsets. Candidate itemsets of size k + 1 are created by joining a pair of frequent itemsets of size k (this is known as the candidate generation step). A candidate is discarded if any one of its subsets is found to be infrequent during the candidate pruning step. Suppose the Apriori algorithm is applied to the
Table 6.24. Example of market basket transactions.
Transaction ID   Items Bought
1                {a, b, d, e}
2                {b, c, d}
3                {a, b, d, e}
4                {a, c, d, e}
5                {b, c, d, e}
6                {b, d, e}
7                {c, d}
8                {a, b, c}
9                {a, d, e}
10               {b, d}
data set shown in Table 6.24 with minsup = 30%, i.e., any itemset occurring in less than 3 transactions is considered to be infrequent.
(a) Draw an itemset lattice representing the data set given in Table 6.24. Label each node in the lattice with the following letter(s):
o N: If the itemset is not considered to be a candidate itemset by the Apriori algorithm. There are two reasons for an itemset not to be considered as a candidate itemset: (1) it is not generated at all during the candidate generation step, or (2) it is generated during the candidate generation step but is subsequently removed during the candidate pruning step because one of its subsets is found to be infrequent.
o F: If the candidate itemset is found to be frequent by the Apriori algorithm.
o I: If the candidate itemset is found to be infrequent after support counting.
(b) What is the percentage of frequent itemsets (with respect to all itemsets in the lattice)?
(c) What is the pruning ratio of the Apriori algorithm on this data set? (Pruning ratio is defined as the percentage of itemsets not considered to be a candidate because (1) they are not generated during candidate generation or (2) they are pruned during the candidate pruning step.)
(d) What is the false alarm rate (i.e., the percentage of candidate itemsets that are found to be infrequent after performing support counting)?
9. The Apriori algorithm uses a hash tree data structure to efficiently count the support of candidate itemsets. Consider the hash tree for candidate 3-itemsets shown in Figure 6.32.
Figure 6.32. An example of a hash tree structure.
(a) Given a transaction that contains items {1, 3, 4, 5, 8}, which of the hash tree leaf nodes will be visited when finding the candidates of the transaction?
(b) Use the visited leaf nodes in part (a) to determine the candidate itemsets that are contained in the transaction {1, 3, 4, 5, 8}.
10. Consider the following set of candidate 3-itemsets:
{1, 2, 3}, {1, 2, 6}, {1, 3, 4}, {2, 3, 4}, {2, 4, 5}, {3, 4, 6}, {4, 5, 6}
(a) Construct a hash tree for the above candidate 3-itemsets. Assume the tree uses a hash function where all odd-numbered items are hashed to the left child of a node, while the even-numbered items are hashed to the right child. A candidate k-itemset is inserted into the tree by hashing on each successive item in the candidate and then following the appropriate branch of the tree according to the hash value. Once a leaf node is reached, the candidate is inserted based on one of the following conditions:
Condition 1: If the depth of the leaf node is equal to k (the root is assumed to be at depth 0), then the candidate is inserted regardless of the number of itemsets already stored at the node.
Condition 2: If the depth of the leaf node is less than k, then the candidate can be inserted as long as the number of itemsets stored at the node is less than maxsize. Assume maxsize = 2 for this question.
Condition 3: If the depth of the leaf node is less than k and the number of itemsets stored at the node is equal to maxsize, then the leaf node is converted into an internal node. New leaf nodes are created as children of the old leaf node. Candidate itemsets previously stored in the old leaf node are distributed to the children based on their hash values. The new candidate is also hashed to its appropriate leaf node. (A code sketch of this insertion procedure follows the exercise.)
(b) How many leaf nodes are there in the candidate hash tree? How many internal nodes are there?
(c) Consider a transaction that contains the following items: {1, 2, 3, 5, 6}. Using the hash tree constructed in part (a), which leaf nodes will be checked against the transaction? What are the candidate 3-itemsets contained in the transaction?
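A minimal sketch of the insertion rules stated in part (a) is given below, assuming the odd/even hash function and maxsize = 2 described above. The class and function names are invented for illustration; this is not the book's reference implementation.

```python
class Node:
    """Hash tree node: a leaf stores itemsets, an internal node has children."""
    def __init__(self, depth=0):
        self.depth = depth      # root is at depth 0
        self.children = {}      # "left"/"right" -> Node, for internal nodes
        self.itemsets = []      # candidates stored at a leaf

def bucket(item):
    return "left" if item % 2 == 1 else "right"     # odd -> left, even -> right

def insert(node, itemset, k, maxsize=2):
    if node.children:                                # internal node: hash on the next item
        insert(node.children[bucket(itemset[node.depth])], itemset, k, maxsize)
    elif node.depth == k or len(node.itemsets) < maxsize:
        node.itemsets.append(itemset)                # Conditions 1 and 2
    else:                                            # Condition 3: split the leaf
        node.children = {"left": Node(node.depth + 1), "right": Node(node.depth + 1)}
        for c in node.itemsets + [itemset]:
            insert(node.children[bucket(c[node.depth])], c, k, maxsize)
        node.itemsets = []

root = Node()
for cand in [(1, 2, 3), (1, 2, 6), (1, 3, 4), (2, 3, 4), (2, 4, 5), (3, 4, 6), (4, 5, 6)]:
    insert(root, cand, k=3)
```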
Figure 6.33. An itemset lattice.

11. Given the lattice structure shown in Figure 6.33 and the transactions given in Table 6.24, label each node with the following letter(s):
M if the node is a maximal frequent itemset,
C if it is a closed frequent itemset,
,A’/ if it is frequent but neither maximal nor closed, and
1 if it is infreouent.
Assume that the support threshold is equal to 30T0.
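For reference, a frequent itemset is maximal if none of its immediate supersets is frequent, and closed if none of its immediate supersets has the same support. The sketch below applies those definitions to the transactions of Table 6.24 with a support threshold of 30% (3 transactions); the helper names are made up, and it is meant as an illustration of the labeling rule rather than an answer key for Figure 6.33.

```python
from itertools import combinations

transactions = [{"a","b","d","e"}, {"b","c","d"}, {"a","b","d","e"}, {"a","c","d","e"},
                {"b","c","d","e"}, {"b","d","e"}, {"c","d"}, {"a","b","c"},
                {"a","d","e"}, {"b","d"}]
items = sorted(set().union(*transactions))
minsup_count = 3                                    # 30% of 10 transactions

def count(itemset):
    return sum(set(itemset) <= t for t in transactions)

def labels(itemset):
    """Return {'I'}, or the applicable subset of {'M', 'C'}, or {'N'}."""
    s = count(itemset)
    if s < minsup_count:
        return {"I"}
    supersets = [set(itemset) | {x} for x in items if x not in itemset]
    out = set()
    if all(count(sup) < minsup_count for sup in supersets):
        out.add("M")                                # maximal: no frequent superset
    if all(count(sup) < s for sup in supersets):
        out.add("C")                                # closed: no superset with equal support
    return out or {"N"}

for size in (1, 2):
    for itemset in combinations(items, size):
        print(set(itemset), labels(itemset))
```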
12. The original association rule mining formulation uses the support and confidence measures to prune uninteresting rules.
(a) Draw a contingency table for each of the following rules using the transactions shown in Table 6.25. (A sketch of how such a table is tabulated follows the table.)
Table 6.25. Example of market basket transactions.
Transaction ID   Items Bought
1                {a, b, d, e}
2                {b, c, d}
3                {a, b, d, e}
4                {a, c, d, e}
5                {b, c, d, e}
6                {b, d, e}
7                {c, d}
8                {a, b, c}
9                {a, d, e}
10               {b, d}
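As a reminder of how a 2×2 contingency table for a rule is tabulated from transactions, here is a hedged sketch (not part of the exercise; the example rule {b} → {c} is only a placeholder, since the rule list for part (a) is not reproduced in this copy):

```python
def contingency(transactions, a, b):
    """2x2 contingency table for the rule {a} -> {b}:
    row 1 = a present, row 2 = a absent; column 1 = b present, column 2 = b absent."""
    f11 = sum(a in t and b in t for t in transactions)
    f10 = sum(a in t and b not in t for t in transactions)
    f01 = sum(a not in t and b in t for t in transactions)
    f00 = sum(a not in t and b not in t for t in transactions)
    return [[f11, f10], [f01, f00]]

transactions = [{"a","b","d","e"}, {"b","c","d"}, {"a","b","d","e"}, {"a","c","d","e"},
                {"b","c","d","e"}, {"b","d","e"}, {"c","d"}, {"a","b","c"},
                {"a","d","e"}, {"b","d"}]
print(contingency(transactions, "b", "c"))
```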
13.