Level 3 Subsets of 3 items

Figure 6.9. Enumerating subsets of three items from a transaction r.

items in f whose labels are greater than or equal to 5. The number of ways to specify the first item of a 3-itemset contained in t is illustrated by the Level 1 prefix structures depicted in Figure 6.9. For instance, 1 [2J 5 6 I represents a 3-itemset that begins with item 1, followed by two more items chosen from t h e s e t { 2 , 3 , 5 , 6 } .

After fixing the first item, the prefix structures at Level 2 represent the number of ways to select the second item. For example, 1 2 F 5 6l corresponds to itemsets that begin with prefix (1 2) and are followed by items 3, 5, or 6. Finally, the prefix structures at Level 3 represent the complete set of 3-itemsets contained in t. For example, the 3-itemsets that begin with prefix {1 2} are

{1,2,3}, {7,2,5}, and {1,2,6}, whi le those that begin with pref ix {2 3} are

{ 2 , 3 , 5 } a n d { 2 , 3 , 6 } . The prefix structures shown in Figure 6.9 demonstrate how itemsets con-

tained in a transaction can be systematically enumerated, i.e., by specifying their items one by one, from the leftmost item to the rightmost item. We still have to determine whether each enumerated 3-itemset corresponds to an existing candidate itemset. If it matches one of the candidates, then the sup- port count of the corresponding candidate is incremented. In the next section, we illustrate how this matching operation can be performed efficiently using a hash tree structure.

6.2

1 2

Transaction, t

1 2 3 5 6

Bread, Milk, Diapers, Cola

344 Chapter 6 Association Analysis

Leaf nodes containing candidate 2-itemsets

Transactions

Figure 6.10. Counting the support of itemsets using hash structure.

Support Counting Using a Hash Tlee

In the Apriori, algorithm, candidate itemsets are partitioned into different buckets and stored in a hash tree. During support counting, itemsets contained in each transaction are also hashed into their appropriate buckets. That way, instead of comparing each itemset in the transaction with every candidate itemset, it is matched only against candidate itemsets that belong to the same bucket, as shown in Figure 6.10.

Figure 6.11 shows an example of a hash tree structure. Each internal node of the tree uses the following hash function, h(p) : p mod 3, to determine which branch of the current node should be followed next. For example, items 1, 4, and 7 are hashed to the same branch (i.e., the leftmost branch) because they have the same remainder after dividing the number by 3. All candidate itemsets are stored at the leaf nodes of the hash tree. The hash tree shown in Figure 6.11 contains 15 candidate 3-itemsets, distributed across 9 leaf nodes.

Consider a transaction, f, : {1,2,3,5,6}. To update the support counts of the candidate itemsets, the hash tree must be traversed in such a way that all the leaf nodes containing candidate 3-itemsets belonging to f must be visited at least once. Recall that the 3-itemsets contained in t must begin with items 1, 2,or 3, as indicated by the Level 1prefix structures shown in Figure 6.9. Therefore, at the root node of the hash tree, the items 1, 2, and 3 of the transaction are hashed separately. Item 1 is hashed to the left child ofthe root node, item 2 is hashed to the middle child, and item 3 is hashed to the right child. At the next level of the tree, the transaction is hashed on the second

6.2 FYeouent Itemset Generation 345

Hash Function

-‘-:,@ ,-.w

Figure 6.1 1. Hashing a transaction at the root node of a hash tree.

item listed in the Level 2 structures shown in Figure 6.9. For example, after hashing on item 1 at the root node, items 2, 3, and 5 of the transaction are hashed. Items 2 and 5 are hashed to the middle child, while item 3 is hashed to the right child, as shown in Figure 6.12. This process continues until the leaf nodes of the hash tree are reached. The candidate itemsets stored at the visited leaf nodes are compared against the transaction. If a candidate is a subset of the transaction, its support count is incremented. In this example, 5 out of the 9 leaf nodes are visited and 9 out of the 15 itemsets are compared against the transaction.

6.2.5 Computational Complexity

The computational complexity of the Apri,ori, algorithm can be affected by the following factors.

Support Threshold Lowering the support threshold often results in more itemsets being declared as frequent. This has an adverse effect on the com-

Transaction

F,*;]

346 Chapter 6 Association Analysis

1 + l 2 3 s 6 l—

z +

3 +

Figure 6.12. Subset operation on the leftmost subtree of the root of a candidate hash tree.

putational complexity of the algorithm because more candidate itemsets must be generated and counted, as shown in Figure 6.13. The maximum size of frequent itemsets also tends to increase with lower support thresholds. As the maximum size of the frequent itemsets increases, the algorithm will need to make more passes over the data set.

Number of Items (Dimensionality) As the number of items increases, more space will be needed to store the support counts of items. If the number of frequent items also grows with the dimensionality of the data, the computation and I/O costs will increase because of the larger number of candidate itemsets generated by the algorithm.

Number of Tbansactions Since the Apri,ori, algorithm makes repeated passes over the data set, its run time increases with a larger number of trans- actions.

Average TYansaction Width For dense data sets, the average transaction width can be very large. This affects the complexity of the Apriori algorithm in two ways. First, the maximum size of frequent itemsets tends to increase as the

Transaction t;;;-:-:r l r z r c o l

-@

x105

i i l

\:, Y il

I t=–**-* –

6.2 Freouent Itemset Generation 347

4

3.5

f ; 3 E I z s a 6

d

E r.s 3 E z 1

0.5

0 1 0

Size of ltemsel

(a) Number of candidate itemsets.

, x105

Eo

f

E ; d E z

0 – + J – f – + – + – +

0 1 0 Size of ltemsel

(b) Number of frequent itemsets.

Figure 6.13. Effect of support threshold on the number of candidate and frequent itemsets.

average transaction width increases. As a result, more candidate itemsets must be examined during candidate generation and support counting, as illustrated in Figure 6.14. Second, as the transaction width increases, more itemsets

v t i l i l i l i l i l i l i l i l

Y I i l l t : A\y

‘ + # t * .

.F.vv

v

348 Chapter 6 Association Analysis

1 0 1 5

Size ol ltemset

(a) Number of candidate itemsets.

1 0 1 5 Size of ltemset

(b) Number of Frequent ltemsets.

Figure 6.14. Effect of average transaction width on the number of candidate and frequent itemsets.

are contained in the transaction. This will increase the number of hash tree traversals performed during support counting.

A detailed analysis of the time complexity for the Apri,ori, algorithm is presented next.

9

8

o _

o

o 6 6

() 6

E 3

z

1

0

1 0

8

Eo

o

I

o 4

z

2

1

0

x1 05

I

4*+- .

v

\b

; v

6.3 Rule Generation 349

Generation of frequent l-itemsets For each transaction, we need to up- date the support count for every item present in the transaction. Assuming that tr is the average transaction width, this operation requires O(l/tr) time, where .A[ is the total number of transactions.

Candidate generation To generate candidate k-itemsets, pairs of frequent (k – l)-itemsets are merged to determine whether they have at least k – 2 items in common. Each merging operation requires at most k – 2 equality comparisons. In the best-case scenario, every merging step produces a viable candidate k-itemset. In the worst-case scenario, the algorithm must merge ev- ery pair of frequent (k – 1)-itemsets found in the previous iteration. Therefore, the overall cost of merging frequent itemsets is

iU – z)lcnl ( cost or merging .irr – 2)lLn-r12. k:2 k:2

A hash tree is also constructed during candidate generation to store the can- didate itemsets. Because the maximum depth of the tree is k, the cost for populating the hash tree with candidate itemsets rs O(l[:2klcrl). During candidate pruning, we need to verify that the ,k – 2 subsets of every candidate k-itemset are frequent. Since the cost for looking up a candidate in a hash tree is O(k), the candidate pruning step require” O(D|:rk(k – z)lckl) time.

Support counting Each transaction of length ltl produces (lll) itemsets of size k. This is also the effective number of hash tree traversals performed for each transaction. The cost for support counting is O(,n/Dr(T)”r), where tr is the maximum transaction width and a7, is the cost for updating the support count of a candidate k-itemset in the hash tree.

6.3 Rule Generation