This section describes how to extract association rules efficiently from a given frequent itemset. Each frequent k-itemse t, Y , can produce up to 2k – 2 associa- tion rules, ignoring rules that have empty antecedents or consequents (0 —- Y orY ——+ 0). An association rule can be extracted by partitioning the itemset Y into two non-empty subsets, X and Y -X, suchthat X ——+Y -X satisfies the confidence threshold. Note that all such rules must have already met the support threshold because they are generated from a frequent itemset.

350 Chapter 6 Association Analysis

Example 6.2. Let X : {I,2,3} be a frequent itemset. There are six candi- date association rules that can be generated from X: {1,2} – i3}, {1,3} –

{2}, {2,3} .—-* {1}, {U —– {2,3}, {2} ._–> {1,3}, and {3} —-* {1,2}. As each of their support is identical to the support for X, the rules must satisfy the support threshold. r

Computing the confidence of an association rule does not require additional scans of the transaction data set. Consider the rule {1,2} – {3}, which is generated from the frequent itemset X : {1,2,3}. The confi.dence for this rule is o({1, 2,3}) lo({1, 2}). Because {1, 2, 3} is frequent, the anti-monotone prop-

erty of support ensures that {1,2} must be frequent, too. Since the support counts for both itemsets were already found during frequent itemset genera- tion, there is no need to read the entire data set again.

6.3.1 Confidence-Based Pruning

Unlike the support measure) confidence does not have any monotone property. For example, the confidence for X ——. Y can be larger, smaller, or equal to the conf idenceforanother ru le* , t ,where *gX andf e Y (seeExerc ise

3 on page 405). Nevertheless, if we compare rules generated from the same frequent itemset Y, the following theorem holds for the confidence measure.

Theorem 6.2. If a rule X ——+ Y – X does not sati,sfy the confidence threshold, then any rule Xt –+ Y – Xt , where X’ ,is a subset of X, must not sati,sfy the confidence threshold as well.

To prove this theorem, consider the following two rules: Xt —‘, Y – Xt and X -+Y-X,where XtcX. Theconf idenceof theru les a reo(Y) lo (X/ ) and o(V) lo(X), respectively. Since X/ is a subset of X , o(Xt) >

“(X). Therefore,

the former rule cannot have a higher confidence than the latter rule.

6.3.2 Rule Generation in Apriori Algorithm

The Apri,orz algorithm uses a level-wise approach for generating association rules, where each level corresponds to the number of items that belong to the rule consequent. Initially, all the high-confidence rules that have only one item in the rule consequent are extracted. These rules are then used to generate new candidate rules. For example, if {acd}——- {b} and {abd} —— {c} are high-confidence rules, then the candidate rule {a,d} ——, {bc} is generated by merging the consequents of both rules. Figure 6.15 shows a lattice structure for the association rules generated from the frequent itemset {a,b,c,d}. If any node in the lattice has low confidence, then according to Theorem 6.2, the

Rule Generation 35L

aa Pruned –. ” . . – Rules

Figure 6.15, Pruning of association rules using the confidence measure.

entire subgraph spanned by the node can be pruned immediately. Suppose the confidence lbr {bcd} ——‘ {o} is low. All the rules containing item a in its consequent, including {cd} —– {ab}, {bd} —–+ {ac}, {bc) —— {od}, and

{d} – {abc} can be discarded. A pseudocode for the rule generation step is shown in Algorithms 6.2 and

6.3. Note the similarity between the ap-genrules procedure given in Algo- rithm 6.3 and the frequent itemset generation procedure given in Algorithm 6.1. The only difference is that, in rule generation, we do not have to make additional passes over the data set to compute the confidence ofthe candidate rules. Instead, we determine the confidence of each rule by using the support counts computed during frequent itemset generation.

Algorithm 6.2 Rule generation of the Apri,ori algorithm.

6.3

I I \ \ \ \

1: for each frequent k-itemset f p, k > 2 do 2 : H 1 : { i l i e f n } {l-item consequents of the rule.} 3: call ap-genrules(/6,fI1.) 4: end for

352 Chapter 6 Association Analysis

Algorithm 6.3 Procedure ap-genrules(fp, H*).

9: 10:

t: k : lfpl {size of frequent itemset.} Z, m: lH^l {size of rule consequent.} 3 : i f k > m l l t h e n 4: Hm+t : aPriori-gen(f1-). 5: for each h^+r € H*a1 do 6: conf :

” ( fn) l “Un – h^+t) .

7: if conf ) m’inconf tl:.en 8: output the rule (f* – h*+t) ——+ hm*r.

else delete h*q1 from Hm+r.

11: end i f 12: end for 13: call ap-genrules(/p, H-+r.) 14: end if

6.3.3 An Example: Congressional Voting Records

This section demonstrates the results of applying association analysis to the voting records of members of the United States House of Representatives. The data is obtained from the 1984 Congressional Voting Records Database, which is available at the UCI machine learning data repository. Each transaction contains information about the party affiIiation for a representative along with his or her voting record on 16 key issues. There are 435 transactions and 34 items in the data set. The set of items are listed in Table 6.3.

The Apri,ore algorithm is then applied to the data set with mi,nsup : 30To and minconf : 90T0. Some of the high-confidence rules extracted by the algorithm are shown in Table 6.4. The first two rules suggest that most of the members who voted yes for aid to El Salvador and no for budget resolution and MX missile are Republicans; while those who voted no for aid to EI Salvador and yes for budget resolution and MX missile are Democrats. These high- confidence rules show the key issues that divide members from both political parties. If mi,nconf is reduced, we may find rules that contain issues that cut across the party lines. For example, with mi.nconf : 40Vo, the rules suggest that corporation cutbacks is an issue that receives almost equal number of votes from both parties-52.3% of the members who voted no are Republicans, while the remaining 47.7% of them who voted no are Democrats.

6.4 Compact Representation of Frequent Itemsets 353

Table 6.3. List of binary attributes from the 1984 United States Congressional Voting Records. Source: The UCI machine learning repository.

1. Republican 2. Democrat 3. handicapped-infants — yes 4. handicapped-infants : no 5. water project cost sharing : yes 6. water project cost sharing : pe

7. budget-resolution – yes 8. budget-resolution : no 9. physician L,” 1r”s2s – yes 10. physician fee freeze : no 11. aid to EI Salvador : yes 12. aid to EI Salvador : no 13. religious groups in schools : yes 14. religious groups in schools : no 15. anti-satellite test f61 – 1les 16. anti-satellite test barr: no 17. aid to Nicaragua : yes

aid to Nicaragua : no MX-missile : yes MX-missile : no immigration : yes immigration : no synfuel corporation cutback : yes synfuel corporation cutback : no education spending : yes education spending : 1e right-to-sue : yes right-to-sue : no crrme – yes crime : no duty-free-exports : yes duty-free-exports : no export administration act : yes export administration act : no

18. 19 . 20. 21. 22. 23. ,L

25. 26. 27. 28. 29. 30. 31 . 32. 33. 34.

Table 6.4. Association rules extracted from the 1984 United States Congressional Voting Records.

Association Rule Confidence

{budget resolution : no, Mx-missile:no, aid to El Salvador : yes } ——+ {Republican}

9r.0%

{budget resolution : y€s, MX-missile:yes, aid to El Salvador : no } —–+ {Democrat}

97.5%

{crime: y€s, right-to-sue : y€s, physician fee freeze : yes} ——+ {Republican}

93.5To

{crime : no, right-to-sue : no, physician fee freeze : no} ——+ {Democrat}

l00Yo

6.4 Compact Representation of Flequent Itemsets

In practice, the number of frequent itemsets produced from a transaction data set can be very large. It is useful to identify a small representative set of itemsets from which all other frequent itemsets can be derived. Two such representations are presented in this section in the form of maximal and closed frequent itemset;s.

354 Chapter 6 Association Analysis

Figure 6.16, Maximal frequent itemset.

6.4.L Maximal Flequent Itemsets