After counting their supports, the algorithm eliminates all candidate itemsets whose support counts are less than minsup (step 12).
The algorithm terminates when there are no new frequent itemsets generated, i.e., $F_k = \emptyset$ (step 13).
The frequent itemset generation part of the Apriori algorithm has two important characteristics. First, it is a level-wise algorithm; i.e., it traverses the itemset lattice one level at a time, from frequent 1-itemsets to the maximum size of frequent itemsets. Second, it employs a generate-and-test strategy for finding frequent itemsets. At each iteration, new candidate itemsets are generated from the frequent itemsets found in the previous iteration. The support for each candidate is then counted and tested against the minsup threshold. The total number of iterations needed by the algorithm is $k_{\max} + 1$, where $k_{\max}$ is the maximum size of the frequent itemsets.
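To make the level-wise, generate-and-test structure concrete, here is a minimal Python sketch of the main loop. It assumes transactions are sets of items, minsup is an absolute count, and candidate_gen and prune are illustrative helpers sketched later in this section; the names are ours, not taken from Algorithm 6.1.

```python
def apriori(transactions, minsup):
    """Level-wise sketch of Apriori (not the book's pseudocode verbatim)."""
    # Find frequent 1-itemsets with a single pass over the data.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    F = {1: {c for c, n in counts.items() if n >= minsup}}

    k = 1
    while F[k]:                                   # stop when F_k is empty
        k += 1
        C = candidate_gen(F[k - 1], k)            # generate from F_{k-1}
        support = {c: sum(1 for t in transactions if c <= t) for c in C}
        F[k] = {c for c in C if support[c] >= minsup}  # eliminate below minsup
    return F
```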
6.2.3 Candidate Generation and Pruning
The apriori-gen function shown in Step 5 of Algorithm 6.1 generates candidate itemsets by performing the following two operations:
1. Candidate Generation. This operation generates new candidate k-itemsets based on the frequent (k-1)-itemsets found in the previous iteration.
2. Candidate Pruning. This operation eliminates some of the candidate k-itemsets using the support-based pruning strategy.
To illustrate the candidate pruning operation, consider a candidate k-itemset, $X = \{i_1, i_2, \ldots, i_k\}$. The algorithm must determine whether all of its proper subsets, $X - \{i_j\}$ ($\forall j = 1, 2, \ldots, k$), are frequent. If one of them is infrequent, then X is immediately pruned. This approach can effectively reduce the number of candidate itemsets considered during support counting. The complexity of this operation is O(k) for each candidate k-itemset. However, as will be shown later, we do not have to examine all k subsets of a given candidate itemset. If m of the k subsets were used to generate a candidate, we only need to check the remaining $k - m$ subsets during candidate pruning.
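This subset check is compact to express when the frequent (k-1)-itemsets are stored as a set of frozensets; the following sketch (our helper, not the book's pseudocode) performs the full O(k) version of the test:

```python
def prune(candidate, F_prev):
    """Keep a candidate k-itemset only if every (k-1)-subset is frequent.
    F_prev is the set of frequent (k-1)-itemsets, stored as frozensets.
    (apriori-gen can skip the subsets already used to form the candidate.)"""
    return all(candidate - {item} in F_prev for item in candidate)
```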
In principle, there are many ways to generate candidate itemsets. The following is a list of requirements for an effective candidate generation procedure:
1. It should avoid generating too many unnecessary candidates. A candidate itemset is unnecessary if at least one of its subsets is infrequent. Such a candidate is guaranteed to be infrequent according to the anti-monotone property of support.
2. It must ensure that the candidate set is complete, i.e., no frequent itemsets are left out by the candidate generation procedure. To ensure completeness, the set of candidate itemsets must subsume the set of all frequent itemsets, i.e., $\forall k: F_k \subseteq C_k$.
3. It should not generate the same candidate itemset more than once. For example, the candidate itemset {a,b,c,d} can be generated in many ways: by merging {a,b,c} with {d}, {b,d} with {a,c}, {c} with {a,b,d}, etc. Generation of duplicate candidates leads to wasted computations and thus should be avoided for efficiency reasons.
Next, we will briefly describe several candidate generation procedures, including the one used by the apriori-gen function.
Brute-Force Method The brute-force method considers every k-itemset as a potential candidate and then applies the candidate pruning step to remove any unnecessary candidates (see Figure 6.6). The number of candidate itemsets generated at level k is equal to $\binom{d}{k}$, where d is the total number of items. Although candidate generation is rather trivial, candidate pruning becomes extremely expensive because a large number of itemsets must be examined. Given that the amount of computation needed for each candidate is O(k), the overall complexity of this method is $O\left(\sum_{k=1}^{d} k \binom{d}{k}\right) = O(d \cdot 2^{d-1})$.
$F_{k-1} \times F_1$ Method An alternative method for candidate generation is to extend each frequent (k-1)-itemset with other frequent items. Figure 6.7 illustrates how a frequent 2-itemset such as {Beer, Diapers} can be augmented with a frequent item such as Bread to produce a candidate 3-itemset {Beer, Diapers, Bread}. This method will produce $O(|F_{k-1}| \times |F_1|)$ candidate k-itemsets, where $|F_j|$ is the number of frequent j-itemsets. The overall complexity of this step is $O(\sum_k k\,|F_{k-1}|\,|F_1|)$.
The procedure is complete because every frequent k-itemset is composed of a frequent (k – 1)-itemset and a frequent 1-itemset. Therefore, all frequent k-itemsets are part of the candidate k-itemsets generated by this procedure.
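An unconstrained version of this merge takes only a few lines. The sketch below (assuming F1 holds single-item frozensets) deliberately omits any ordering rule, so it reproduces the duplicate-generation problem discussed next.

```python
def fk1_f1_candidates(F_prev, F1):
    """F_{k-1} x F_1 sketch: extend each frequent (k-1)-itemset with each
    frequent item not already in it. Without an ordering constraint the
    same k-itemset can be produced up to k times."""
    candidates = set()
    for itemset in F_prev:
        for single in F1:               # F1 holds frozensets of one item
            if not single <= itemset:
                candidates.add(itemset | single)
    return candidates
```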
Figure 6.6. A brute-force method for generating candidate 3-itemsets.
Figure 6.7. Generating and pruning candidate k-itemsets by merging a frequent (k-1)-itemset with a frequent item. Note that some of the candidates are unnecessary because their subsets are infrequent.
This approach, however, does not prevent the same candidate itemset from being generated more than once. For instance, {Bread, Diapers, Milk} can be generated by merging {Bread, Diapers} with {Milk}, {Bread, Milk} with {Diapers}, or {Diapers, Milk} with {Bread}. One way to avoid generating
duplicate candidates is by ensuring that the items in each frequent itemset are kept sorted in their lexicographic order. Each frequent (k-1)-itemset X is then extended with frequent items that are lexicographically larger than the items in X. For example, the itemset {Bread, Diapers} can be augmented with {Milk} since Milk is lexicographically larger than Bread and Diapers. However, we should not augment {Diapers, Milk} with {Bread} nor {Bread, Milk} with {Diapers} because they violate the lexicographic ordering condition.

While this procedure is a substantial improvement over the brute-force method, it can still produce a large number of unnecessary candidates. For example, the candidate itemset obtained by merging {Beer, Diapers} with {Milk} is unnecessary because one of its subsets, {Beer, Milk}, is infrequent. There are several heuristics available to reduce the number of unnecessary candidates. For example, note that, for every candidate k-itemset that survives the pruning step, every item in the candidate must be contained in at least k-1 of the frequent (k-1)-itemsets. Otherwise, the candidate is guaranteed to be infrequent. For example, {Beer, Diapers, Milk} is a viable candidate 3-itemset only if every item in the candidate, including Beer, is contained in at least two frequent 2-itemsets. Since there is only one frequent 2-itemset containing Beer, all candidate itemsets involving Beer must be infrequent.
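The lexicographic constraint is a one-line change to the earlier $F_{k-1} \times F_1$ sketch: extend X only with frequent items larger than every item already in X (helper names remain ours, for illustration).

```python
def fk1_f1_lex_candidates(F_prev, F1):
    """Duplicate-free variant: each itemset X is extended only with
    frequent items lexicographically larger than all items in X."""
    candidates = set()
    for itemset in F_prev:
        largest = max(itemset)          # items are kept in lexicographic order
        for single in F1:
            (item,) = single            # unpack the one-item frozenset
            if item > largest:
                candidates.add(itemset | single)
    return candidates
```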
$F_{k-1} \times F_{k-1}$ Method The candidate generation procedure in the apriori-gen function merges a pair of frequent (k-1)-itemsets only if their first k-2 items are identical. Let $A = \{a_1, a_2, \ldots, a_{k-1}\}$ and $B = \{b_1, b_2, \ldots, b_{k-1}\}$ be a pair of frequent (k-1)-itemsets. A and B are merged if they satisfy the following conditions:

$a_i = b_i$ (for $i = 1, 2, \ldots, k-2$) and $a_{k-1} \neq b_{k-1}$.
In Figure 6.8, the frequent itemsets {Bread, Diapers} and {Bread, Milk} are merged to form a candidate 3-itemset {Bread, Diapers, Milk}. The algorithm does not have to merge {Beer, Diapers} with {Diapers, Milk} because the first item in both itemsets is different. Indeed, if {Beer, Diapers, Milk} is a viable candidate, it would have been obtained by merging {Beer, Diapers} with {Beer, Milk} instead. This example illustrates both the completeness of the candidate generation procedure and the advantages of using lexicographic ordering to prevent duplicate candidates. However, because each candidate is obtained by merging a pair of frequent (k-1)-itemsets, an additional candidate pruning step is needed to ensure that the remaining k-2 subsets of the candidate are frequent.
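A sketch of this merge strategy, reusing the prune helper from earlier (our rendering of the apriori-gen idea, not its official implementation), exploits the fact that sorting the (k-1)-itemsets groups equal (k-2)-prefixes together:

```python
def candidate_gen(F_prev, k):
    """F_{k-1} x F_{k-1} sketch: merge two frequent (k-1)-itemsets whose
    first k-2 items agree, then prune the result."""
    ordered = sorted(tuple(sorted(s)) for s in F_prev)
    candidates = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if a[:-1] != b[:-1]:        # prefixes differ; later ones differ too
                break
            c = frozenset(a) | frozenset(b)   # merged k-itemset
            if prune(c, F_prev):              # check the remaining subsets
                candidates.append(c)
    return candidates
```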
Figure 6.8. Generating and pruning candidate k-itemsets by merging pairs of frequent (k-1)-itemsets.
6.2.4 Support Counting
Support counting is the process of determining the frequency of occurrence for every candidate itemset that survives the candidate pruning step of the apriori-gen function. Support counting is implemented in steps 6 through 11 of Algorithm 6.1. One approach for doing this is to compare each transaction against every candidate itemset (see Figure 6.2) and to update the support counts of candidates contained in the transaction. This approach is computationally expensive, especially when the numbers of transactions and candidate itemsets are large.
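The expense is easy to see in a direct sketch, which performs one containment test per (transaction, candidate) pair:

```python
def count_support_naive(transactions, candidates):
    """Naive support counting: O(N * |C_k|) subset tests for N transactions."""
    support = {c: 0 for c in candidates}
    for t in transactions:
        for c in candidates:
            if c <= t:                  # candidate contained in transaction
                support[c] += 1
    return support
```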
An alternative approach is to enumerate the itemsets contained in each transaction and use them to update the support counts of their respective candidate itemsets. To illustrate, consider a transaction t that contains five items, {1, 2, 3, 5, 6}. There are $\binom{5}{3} = 10$ itemsets of size 3 contained in this transaction. Some of the itemsets may correspond to the candidate 3-itemsets under investigation, in which case their support counts are incremented. Other subsets of t that do not correspond to any candidates can be ignored.
Figure 6.9 shows a systematic way for enumerating the 3-itemsets contained in t. Assuming that each itemset keeps its items in increasing lexicographic order, an itemset can be enumerated by specifying the smallest item first, followed by the larger items. For instance, given $t = \{1, 2, 3, 5, 6\}$, all the 3-itemsets contained in t must begin with item 1, 2, or 3. It is not possible to construct a 3-itemset that begins with items 5 or 6 because there are only two
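In Python, this enumeration is exactly what itertools.combinations produces when the transaction's items are sorted first. A sketch of the resulting counting scheme follows; a plain hash-set lookup stands in here for more sophisticated candidate indexing.

```python
from itertools import combinations

def count_support_by_enumeration(transactions, candidates, k):
    """Enumerate every k-subset of each transaction in lexicographic order
    (as in Figure 6.9) and increment the count of any subset that matches
    a candidate."""
    candidate_set = set(candidates)
    support = {c: 0 for c in candidates}
    for t in transactions:
        for subset in combinations(sorted(t), k):   # C(|t|, k) subsets
            fs = frozenset(subset)
            if fs in candidate_set:
                support[fs] += 1
    return support
```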