Definition 6.3 (Maximal Frequent Itemset). A maximal frequent itemset is defined as a frequent itemset for which none of its immediate supersets are frequent.

To illustrate this concept, consider the itemset lattice shown in Figure 6.16. The itemsets in the lattice are divided into two groups: those that are frequent and those that are infrequent. A frequent itemset border, which is represented by a dashed line, is also illustrated in the diagram. Every itemset located above the border is frequent, while those located below the border (the shaded nodes) are infrequent. Among the itemsets residing near the border,

{a, d}, {a, c, e}, and {b, c, d, e} are considered to be maximal frequent itemsets because their immediate supersets are infrequent. An itemset such as {a, d} is maximal frequent because all of its immediate supersets, {a, b, d}, {a, c, d}, and {a, d, e}, are infrequent. In contrast, {a, c} is non-maximal because one of its immediate supersets, {a, c, e}, is frequent.
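This check is mechanical: an itemset is maximal frequent exactly when adding any single remaining item produces an itemset outside the frequent collection. A minimal Python sketch (the frequent collection below is assembled from the subsets of {a, c, e}, {a, d}, and {b, c, d, e} described in the text; function names are illustrative):

```python
from itertools import combinations

def nonempty_subsets(itemset):
    """All nonempty subsets of an itemset, as frozensets."""
    items = sorted(itemset)
    return {frozenset(c) for k in range(1, len(items) + 1)
            for c in combinations(items, k)}

def maximal_frequent(frequent):
    """Keep only the itemsets that have no frequent immediate superset."""
    items = set().union(*frequent)
    fset = set(frequent)
    return {f for f in fset
            if all(f | {i} not in fset for i in items - f)}

# Frequent itemsets of Figure 6.16, per the text's two groups
frequent = (nonempty_subsets({'a', 'c', 'e'})
            | nonempty_subsets({'a', 'd'})
            | nonempty_subsets({'b', 'c', 'd', 'e'}))

maximal = maximal_frequent(frequent)
# three maximal itemsets remain: {a, d}, {a, c, e}, {b, c, d, e}
```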

Maximal frequent itemsets effectively provide a compact representation of frequent itemsets. In other words, they form the smallest set of itemsets from which all frequent itemsets can be derived. For example, the frequent itemsets shown in Figure 6.16 can be divided into two groups:

• Frequent itemsets that begin with item a and that may contain items c, d, or e. This group includes itemsets such as {a}, {a, c}, {a, d}, {a, e}, and {a, c, e}.

• Frequent itemsets that begin with items b, c, d, or e. This group includes itemsets such as {b}, {b, c}, {c, d}, {b, c, d, e}, etc.

Frequent itemsets that belong in the first group are subsets of either {a, c, e} or {a, d}, while those that belong in the second group are subsets of {b, c, d, e}. Hence, the maximal frequent itemsets {a, c, e}, {a, d}, and {b, c, d, e} provide a compact representation of the frequent itemsets shown in Figure 6.16.

Maximal frequent itemsets provide a valuable representation for data sets that can produce very long, frequent itemsets, as there are exponentially many frequent itemsets in such data. Nevertheless, this approach is practical only if an efficient algorithm exists to explicitly find the maximal frequent itemsets without having to enumerate all their subsets. We briefly describe one such approach in Section 6.5.

Despite providing a compact representation, maximal frequent itemsets do not contain the support information of their subsets. For example, the supports of the maximal frequent itemsets {a, c, e}, {a, d}, and {b, c, d, e} do not provide any hint about the supports of their subsets. An additional pass over the data set is therefore needed to determine the support counts of the non-maximal frequent itemsets. In some cases, it might be desirable to have a minimal representation of frequent itemsets that preserves the support information. We illustrate such a representation in the next section.

6.4.2 Closed Frequent Itemsets

Closed itemsets provide a minimal representation of itemsets without losing their support information. A formal definition of a closed itemset is presented below.

Definition 6.4 (Closed Itemset). An itemset X is closed if none of its immediate supersets has exactly the same support count as X.

Put another way, X is not closed if at least one of its immediate supersets has the same support count as X. Examples of closed itemsets are shown in Figure 6.17. To better illustrate the support count of each itemset, we have associated each node (itemset) in the lattice with a list of its corresponding

[Figure 6.17 shows the itemset lattice for five transactions (1: {a, b, c}; 2: {a, b, c, d}; 3: {b, c, e}; 4: {a, c, d, e}; 5: {d, e}), with each node annotated by the IDs of the transactions that contain it.]

Figure 6.17. An example of the closed frequent itemsets (with minimum support equal to 40%).

transaction IDs. For example, since the node {b, c} is associated with transaction IDs 1, 2, and 3, its support count is equal to three. From the transactions given in this diagram, notice that every transaction that contains b also contains c. Consequently, the support for {b} is identical to {b, c} and {b} should not be considered a closed itemset. Similarly, since c occurs in every transaction that contains both a and d, the itemset {a, d} is not closed. On the other hand, {b, c} is a closed itemset because it does not have the same support count as any of its supersets.

Definition 6.5 (Closed Frequent Itemset). An itemset is a closed frequent itemset if it is closed and its support is greater than or equal to minsup.

In the previous example, assuming that the support threshold is 40%, {b,c} is a closed frequent itemset because its support is 60%. The rest of the closed frequent itemsets are indicated by the shaded nodes.
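Definition 6.4 can be tested directly from transaction data: compute the itemset's support and compare it against each immediate superset. The sketch below uses five transactions consistent with the Figure 6.17 example (every transaction containing b also contains c, and a and d co-occur only together with c); treat the exact transaction list as an assumption:

```python
# Assumed reading of the Figure 6.17 transactions
transactions = [{'a', 'b', 'c'}, {'a', 'b', 'c', 'd'}, {'b', 'c', 'e'},
                {'a', 'c', 'd', 'e'}, {'d', 'e'}]
items = set().union(*transactions)

def support(itemset):
    """Number of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def is_closed(itemset):
    """Closed: no immediate superset has exactly the same support count."""
    s = support(itemset)
    return all(support(itemset | {i}) < s for i in items - itemset)

# {b} and {a, d} are not closed; {b, c} is closed with support 3 (60%)
```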

Algorithms are available to explicitly extract closed frequent itemsets from a given data set. Interested readers may refer to the bibliographic notes at the end of this chapter for further discussions of these algorithms. We can use the closed frequent itemsets to determine the support counts for the non-closed


Algorithm 6.4 Support counting using closed frequent itemsets.
1: Let C denote the set of closed frequent itemsets
2: Let k_max denote the maximum size of closed frequent itemsets
3: F_{k_max} = {f | f ∈ C, |f| = k_max}  {Find all frequent itemsets of size k_max.}
4: for k = k_max - 1 downto 1 do
5:   F_k = {f | f ⊂ f', f' ∈ F_{k+1}, |f| = k}  {Find all frequent itemsets of size k.}
6:   for each f ∈ F_k do
7:     if f ∉ C then
8:       f.support = max{f'.support | f' ∈ F_{k+1}, f ⊂ f'}
9:     end if
10:   end for
11: end for

frequent itemsets. For example, consider the frequent itemset {a, d} shown in Figure 6.17. Because the itemset is not closed, its support count must be identical to one of its immediate supersets. The key is to determine which superset (among {a, b, d}, {a, c, d}, or {a, d, e}) has exactly the same support count as {a, d}. The Apriori principle states that any transaction that contains a superset of {a, d} must also contain {a, d}. However, any transaction that contains {a, d} does not have to contain the supersets of {a, d}. For this reason, the support for {a, d} must be equal to the largest support among its supersets. Since {a, c, d} has a larger support than both {a, b, d} and {a, d, e}, the support for {a, d} must be identical to the support for {a, c, d}. Using this methodology, an algorithm can be developed to compute the support for the non-closed frequent itemsets. The pseudocode for this algorithm is shown in Algorithm 6.4. The algorithm proceeds in a specific-to-general fashion, i.e., from the largest to the smallest frequent itemsets. This is because, in order to find the support for a non-closed frequent itemset, the support for all of its supersets must be known.
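Algorithm 6.4 translates directly into a few lines of Python. The sketch below assumes supports are known only for the closed frequent itemsets and propagates them downward, from the largest itemsets to the smallest (data-structure and function names are illustrative):

```python
def support_from_closed(closed_support, all_frequent):
    """Assign supports to every frequent itemset given only the closed ones.

    closed_support: dict mapping frozenset -> support count (closed itemsets).
    all_frequent:   iterable of frozensets, the full set of frequent itemsets.
    """
    support = dict(closed_support)           # closed itemsets keep their counts
    by_size = {}
    for f in all_frequent:
        by_size.setdefault(len(f), []).append(frozenset(f))
    k_max = max(by_size)                     # largest frequent itemsets are closed
    for k in range(k_max - 1, 0, -1):        # specific-to-general pass
        for f in by_size.get(k, []):
            if f not in support:             # non-closed: inherit the largest
                support[f] = max(support[g]  # support among immediate supersets
                                 for g in by_size.get(k + 1, []) if f < g)
    return support

# Tiny example: frequent itemsets over {a, b, c}, where only {a, b, c}
# and {a, b} are closed (e.g., two transactions abc plus one transaction ab)
freq = [frozenset(s) for s in ('a', 'b', 'c', 'ab', 'ac', 'bc', 'abc')]
closed = {frozenset('abc'): 2, frozenset('ab'): 3}
supp = support_from_closed(closed, freq)
# supp[{a}] = 3 (via {a, b}); supp[{c}] = 2 (via {a, c} and {b, c})
```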

To illustrate the advantage of using closed frequent itemsets, consider the data set shown in Table 6.5, which contains ten transactions and fifteen items. The items can be divided into three groups: (1) Group A, which contains items a1 through a5; (2) Group B, which contains items b1 through b5; and (3) Group C, which contains items c1 through c5. Note that items within each group are perfectly associated with each other and they do not appear with items from another group. Assuming the support threshold is 20%, the total number of frequent itemsets is 3 × (2^5 - 1) = 93. However, there are only three closed frequent itemsets in the data: {a1, a2, a3, a4, a5}, {b1, b2, b3, b4, b5}, and {c1, c2, c3, c4, c5}. It is often sufficient to present only the closed frequent itemsets to the analysts instead of the entire set of frequent itemsets.
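The 93-versus-3 gap can be checked by brute force. The sketch below rebuilds the Table 6.5 data (assuming a 3/3/4 split of the ten transactions across the three groups), enumerates every frequent itemset at the 20% threshold, and then filters for the closed ones:

```python
from itertools import combinations

A = {f'a{i}' for i in range(1, 6)}           # group A: a1..a5
B = {f'b{i}' for i in range(1, 6)}           # group B: b1..b5
C = {f'c{i}' for i in range(1, 6)}           # group C: c1..c5
transactions = [A] * 3 + [B] * 3 + [C] * 4   # assumed 3/3/4 split, ten transactions
items = sorted(A | B | C)
minsup = 2                                   # 20% of ten transactions

def support(itemset):
    return sum(1 for t in transactions if itemset <= t)

frequent = [frozenset(c)
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if support(frozenset(c)) >= minsup]

closed = [f for f in frequent
          if all(support(f | {i}) < support(f) for i in set(items) - f)]

print(len(frequent), len(closed))            # 93 frequent, only 3 closed
```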

Table 6.5. A transaction data set for mining closed itemsets.

TID  a1 a2 a3 a4 a5 b1 b2 b3 b4 b5 c1 c2 c3 c4 c5
 1    1  1  1  1  1  0  0  0  0  0  0  0  0  0  0
 2    1  1  1  1  1  0  0  0  0  0  0  0  0  0  0
 3    1  1  1  1  1  0  0  0  0  0  0  0  0  0  0
 4    0  0  0  0  0  1  1  1  1  1  0  0  0  0  0
 5    0  0  0  0  0  1  1  1  1  1  0  0  0  0  0
 6    0  0  0  0  0  1  1  1  1  1  0  0  0  0  0
 7    0  0  0  0  0  0  0  0  0  0  1  1  1  1  1
 8    0  0  0  0  0  0  0  0  0  0  1  1  1  1  1
 9    0  0  0  0  0  0  0  0  0  0  1  1  1  1  1
10    0  0  0  0  0  0  0  0  0  0  1  1  1  1  1


Figure 6.18. Relationships among frequent, maximal frequent, and closed frequent itemsets.

Closed frequent itemsets are useful for removing some of the redundant association rules. An association rule X → Y is redundant if there exists another rule X′ → Y′, where X is a subset of X′ and Y is a subset of Y′, such that the support and confidence for both rules are identical. In the example shown in Figure 6.17, {b} is not a closed frequent itemset while {b, c} is closed. The association rule {b} → {d, e} is therefore redundant because it has the same support and confidence as {b, c} → {d, e}. Such redundant rules are not generated if closed frequent itemsets are used for rule generation.
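A rule's redundancy can be checked by comparing (support, confidence) pairs. The sketch below reuses the assumed Figure 6.17 transactions; because every transaction containing b also contains c, the rule {b} → {e} duplicates the statistics of {b, c} → {e} (the {d, e} consequent discussed in the text behaves the same way):

```python
# Assumed reading of the Figure 6.17 transactions
transactions = [{'a', 'b', 'c'}, {'a', 'b', 'c', 'd'}, {'b', 'c', 'e'},
                {'a', 'c', 'd', 'e'}, {'d', 'e'}]

def supp(itemset):
    return sum(1 for t in transactions if itemset <= t)

def stats(X, Y):
    """(support count, confidence) of the rule X -> Y."""
    return supp(X | Y), supp(X | Y) / supp(X)

def redundant(rule, other):
    """rule is redundant w.r.t. other if its sides are subsets and stats match."""
    (X, Y), (Xp, Yp) = rule, other
    return (X <= Xp and Y <= Yp and (X, Y) != (Xp, Yp)
            and stats(X, Y) == stats(Xp, Yp))

# {b} -> {e} carries no information beyond {b, c} -> {e}
```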

Finally, note that all maximal frequent itemsets are closed because none of the maximal frequent itemsets can have the same support count as their immediate supersets. The relationships among frequent, maximal frequent, and closed frequent itemsets are shown in Figure 6.18.


6.5 Alternative Methods for Generating Frequent Itemsets

Apriori is one of the earliest algorithms to have successfully addressed the combinatorial explosion of frequent itemset generation. It achieves this by applying the Apriori principle to prune the exponential search space. Despite its significant performance improvement, the algorithm still incurs considerable I/O overhead since it requires making several passes over the transaction data set. In addition, as noted in Section 6.2.5, the performance of the Apriori algorithm may degrade significantly for dense data sets because of the increasing width of transactions. Several alternative methods have been developed to overcome these limitations and improve upon the efficiency of the Apriori algorithm. The following is a high-level description of these methods.

Traversal of Itemset Lattice  A search for frequent itemsets can be conceptually viewed as a traversal on the itemset lattice shown in Figure 6.1. The search strategy employed by an algorithm dictates how the lattice structure is traversed during the frequent itemset generation process. Some search strategies are better than others, depending on the configuration of frequent itemsets in the lattice. An overview of these strategies is presented next.

• General-to-Specific versus Specific-to-General: The Apriori algorithm uses a general-to-specific search strategy, where pairs of frequent (k-1)-itemsets are merged to obtain candidate k-itemsets. This general-to-specific search strategy is effective, provided the maximum length of a frequent itemset is not too long. The configuration of frequent itemsets that works best with this strategy is shown in Figure 6.19(a), where the darker nodes represent infrequent itemsets. Alternatively, a specific-to-general search strategy looks for more specific frequent itemsets first, before finding the more general frequent itemsets. This strategy is useful to discover maximal frequent itemsets in dense transactions, where the frequent itemset border is located near the bottom of the lattice, as shown in Figure 6.19(b). The Apriori principle can be applied to prune all subsets of maximal frequent itemsets. Specifically, if a candidate k-itemset is maximal frequent, we do not have to examine any of its subsets of size k-1. However, if the candidate k-itemset is infrequent, we need to check all of its k-1 subsets in the next iteration. Another approach is to combine both general-to-specific and specific-to-general search strategies. This bidirectional approach requires more space to



Figure 6.19. General-to-specific, specific-to-general, and bidirectional search.

store the candidate itemsets, but it can help to rapidly identify the frequent itemset border, given the configuration shown in Figure 6.19(c).

• Equivalence Classes: Another way to envision the traversal is to first partition the lattice into disjoint groups of nodes (or equivalence classes). A frequent itemset generation algorithm searches for frequent itemsets within a particular equivalence class first before moving to another equivalence class. As an example, the level-wise strategy used in the Apriori algorithm can be considered to be partitioning the lattice on the basis of itemset sizes; i.e., the algorithm discovers all frequent 1-itemsets first before proceeding to larger-sized itemsets. Equivalence classes can also be defined according to the prefix or suffix labels of an itemset. In this case, two itemsets belong to the same equivalence class if they share a common prefix or suffix of length k. In the prefix-based approach, the algorithm can search for frequent itemsets starting with the prefix a before looking for those starting with prefixes b, c, and so on. Both prefix-based and suffix-based equivalence classes can be demonstrated using the tree-like structure shown in Figure 6.20.
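Prefix-based partitioning is easy to make concrete: order each itemset's items lexicographically and group by the first k items. A small sketch (the function name is illustrative):

```python
from itertools import combinations

def prefix_classes(itemsets, k=1):
    """Partition itemsets into equivalence classes by their length-k prefix."""
    classes = {}
    for s in itemsets:
        key = tuple(sorted(s))[:k]
        if len(key) == k:            # itemsets shorter than k have no such prefix
            classes.setdefault(key, []).append(frozenset(s))
    return classes

# All 15 nonempty subsets of {a, b, c, d}
lattice = [set(c) for n in range(1, 5) for c in combinations('abcd', n)]
classes = prefix_classes(lattice)
# prefix 'a': 8 itemsets ({a} together with any subset of {b, c, d});
# prefix 'b': 4; prefix 'c': 2; prefix 'd': 1
```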