Breadth-First versus Depth-First: The Apriori algorithm traverses the lattice in a breadth-first manner, as shown in Figure 6.21(a). It first discovers all the frequent 1-itemsets, followed by the frequent 2-itemsets, and so on, until no new frequent itemsets are generated. The itemset
6.5 Alternative Methods for Generating Frequent Itemsets 361
Figure 6.20. Equivalence classes based on the prefix and suffix labels of itemsets. (a) Prefix tree.

Figure 6.21. Breadth-first and depth-first traversals. (a) Breadth first. (b) Depth first.
lattice can also be traversed in a depth-first manner, as shown in Figures 6.21(b) and 6.22. The algorithm can start from, say, node a in Figure 6.22, and count its support to determine whether it is frequent. If so, the algorithm progressively expands the next level of nodes, i.e., ab, abc, and so on, until an infrequent node is reached, say, abcd. It then backtracks to another branch, say, abce, and continues the search from there.
The depth-first approach is often used by algorithms designed to find maximal frequent itemsets. This approach allows the frequent itemset border to be detected more quickly than using a breadth-first approach. Once a maximal frequent itemset is found, substantial pruning can be
362 Chapter 6 Association Analysis
Figure 6.22. Generating candidate itemsets using the depth-first approach.
performed on its subsets. For example, if the node bcde shown in Figure 6.22 is maximal frequent, then the algorithm does not have to visit the subtrees rooted at bd, be, c, d, and e because they will not contain any maximal frequent itemsets. However, if abc is maximal frequent, only the nodes such as ac and bc are not maximal frequent (but the subtrees of ac and bc may still contain maximal frequent itemsets). The depth-first approach also allows a different kind of pruning based on the support of itemsets. For example, suppose the support for {a,b,c} is identical to the support for {a,b}. The subtrees rooted at abd and abe can be skipped because they are guaranteed not to have any maximal frequent itemsets. The proof of this is left as an exercise to the readers.
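The depth-first search for maximal frequent itemsets, together with the subset pruning just described, might be sketched as follows. The toy transactions, support threshold, and helper names here are hypothetical illustrations, not the book's pseudocode:

```python
# Hypothetical toy data set and threshold; support() is an illustrative helper.
TRANSACTIONS = [
    {'a', 'b'}, {'b', 'c', 'd'}, {'a', 'c', 'd', 'e'}, {'a', 'd', 'e'},
    {'a', 'b', 'c'}, {'a', 'b', 'c', 'd'}, {'a'}, {'a', 'b', 'c'},
    {'a', 'b', 'd'}, {'b', 'c', 'e'},
]
MINSUP = 2
ITEMS = ['a', 'b', 'c', 'd', 'e']

def support(itemset):
    return sum(1 for t in TRANSACTIONS if itemset <= t)

def dfs_maximal(prefix, remaining, maximal):
    """Expand prefix depth-first with each remaining item; record prefix as
    maximal frequent only if no frequent extension exists and no previously
    discovered maximal itemset already contains it (subset pruning)."""
    extended = False
    for i, item in enumerate(remaining):
        candidate = prefix | {item}
        # Anti-monotonicity: an infrequent candidate's whole subtree is skipped.
        if support(candidate) >= MINSUP:
            extended = True
            dfs_maximal(candidate, remaining[i + 1:], maximal)
    if prefix and not extended and not any(prefix < m for m in maximal):
        maximal.append(prefix)

maximal = []
dfs_maximal(frozenset(), ITEMS, maximal)
# maximal now holds mutually non-contained frequent itemsets, e.g. {a, b, c}
```

Because each branch explores supersets before its prefix is finalized, a single containment check against previously found maximal itemsets suffices to discard non-maximal candidates.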
Representation of Transaction Data Set There are many ways to represent a transaction data set. The choice of representation can affect the I/O costs incurred when computing the support of candidate itemsets. Figure 6.23 shows two different ways of representing market basket transactions. The representation on the left is called a horizontal data layout, which is adopted by many association rule mining algorithms, including Apriori. Another possibility is to store the list of transaction identifiers (TID-list) associated with each item. Such a representation is known as the vertical data layout. The support for each candidate itemset is obtained by intersecting the TID-lists of its subset items. The length of the TID-lists shrinks as we progress to larger
Figure 6.23. Horizontal and vertical data format (horizontal data layout vs. vertical data layout).
sized itemsets. However, one problem with this approach is that the initial set of TID-lists may be too large to fit into main memory, thus requiring more sophisticated techniques to compress the TID-lists. We describe another effective approach to represent the data in the next section.
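The support computation via TID-list intersection described above can be sketched as follows. The toy transactions and helper names are illustrative, not the contents of Figure 6.23:

```python
# Toy transactions in horizontal layout (TID -> items); illustrative data only.
horizontal = {
    1: {'a', 'b'},
    2: {'b', 'c', 'd'},
    3: {'a', 'c', 'd', 'e'},
    4: {'a', 'd', 'e'},
    5: {'a', 'b', 'c'},
}

# Convert to vertical layout: each item maps to its TID-list (a set of TIDs).
vertical = {}
for tid, items in horizontal.items():
    for item in items:
        vertical.setdefault(item, set()).add(tid)

def support(itemset):
    # Support of a candidate = size of the intersection of the TID-lists
    # of its member items.
    return len(set.intersection(*(vertical[i] for i in itemset)))

print(support({'a'}))       # → 4 (TID-list {1, 3, 4, 5})
print(support({'a', 'd'}))  # → 2 (intersection {3, 4})
```

Note how the intersection for {a, d} is shorter than either item's own TID-list, matching the text's observation that TID-lists shrink for larger-sized itemsets.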
6.6 FP-Growth Algorithm
This section presents an alternative algorithm called FP-growth that takes
a radically different approach to discovering frequent itemsets. The algorithm
does not subscribe to the generate-and-test paradigm of Apriori. Instead, it
encodes the data set using a compact data structure called an FP-tree and
extracts frequent itemsets directly from this structure. The details of this
approach are presented next.
6.6.1 FP-Tree Representation
An FP-tree is a compressed representation of the input data. It is constructed
by reading the data set one transaction at a time and mapping each transaction
onto a path in the FP-tree. As different transactions can have several items
in common, their paths may overlap. The more the paths overlap with one
another, the more compression we can achieve using the FP-tree structure. If
the size of the FP-tree is small enough to fit into main memory, this will allow
us to extract frequent itemsets directly from the structure in memory instead
of making repeated passes over the data stored on disk.
364 Chapter 6 Association Analysis
Figure 6.24. Construction of an FP-tree. (Panels: transaction data set; (i) after reading TID=1; (ii) after reading TID=2; (iv) after reading TID=10.)
Figure 6.24 shows a data set that contains ten transactions and five items. The structures of the FP-tree after reading the first three transactions are also depicted in the diagram. Each node in the tree contains the label of an item along with a counter that shows the number of transactions mapped onto the given path. Initially, the FP-tree contains only the root node represented by the null symbol. The FP-tree is subsequently extended in the following way:
1. The data set is scanned once to determine the support count of each item. Infrequent items are discarded, while the frequent items are sorted in decreasing support counts. For the data set shown in Figure 6.24, a is the most frequent item, followed by b, c, d, and e.
2. The algorithm makes a second pass over the data to construct the FP-tree. After reading the first transaction, {a,b}, the nodes labeled as a and b are created. A path is then formed from null → a → b to encode the transaction. Every node along the path has a frequency count of 1.
3. After reading the second transaction, {b,c,d}, a new set of nodes is created for items b, c, and d. A path is then formed to represent the transaction by connecting the nodes null → b → c → d. Every node along this path also has a frequency count equal to one. Although the first two transactions have an item in common, which is b, their paths are disjoint because the transactions do not share a common prefix.
4. The third transaction, {a,c,d,e}, shares a common prefix item (which is a) with the first transaction. As a result, the path for the third transaction, null → a → c → d → e, overlaps with the path for the first transaction, null → a → b. Because of their overlapping path, the frequency count for node a is incremented to two, while the frequency counts for the newly created nodes, c, d, and e, are equal to one.
5. This process continues until every transaction has been mapped onto one
of the paths given in the FP-tree. The resulting FP-tree after reading all the transactions is shown at the bottom of Figure 6.24.
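The two-pass construction in steps 1 through 5 can be sketched as follows. The Node layout and function names are illustrative assumptions (a full implementation would also maintain per-item node-links for the mining phase); the three transactions used at the end are the first three from the text:

```python
from collections import Counter

# Minimal two-pass FP-tree construction sketch; Node layout is illustrative.
class Node:
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

def build_fp_tree(transactions, minsup):
    # Pass 1: count supports; keep frequent items ranked by decreasing
    # support (step 1 in the text). Ties are broken alphabetically here
    # only to make the example deterministic.
    counts = Counter(i for t in transactions for i in t)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    rank = {item: r for r, (item, c) in enumerate(ranked) if c >= minsup}

    # Pass 2: map each transaction onto a path from the root, reusing
    # common prefixes and incrementing node counts (steps 2-5).
    root = Node(None)
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            child = node.children.setdefault(item, Node(item))
            child.count += 1
            node = child
    return root

# First three transactions from the text: {a,b}, {b,c,d}, {a,c,d,e}.
tree = build_fp_tree([{'a', 'b'}, {'b', 'c', 'd'}, {'a', 'c', 'd', 'e'}],
                     minsup=1)
```

After these three transactions the root has the two branches described in the text: transactions 1 and 3 share the prefix a (so that node's count is two), while transaction 2 starts its own disjoint path at b.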
The size of an FP-tree is typically smaller than the size of the uncompressed data because many transactions in market basket data often share a few items in common. In the best-case scenario, where all the transactions have the same set of items, the FP-tree contains only a single branch of nodes. The
worst-case scenario happens when every transaction has a unique set of items.
As none of the transactions have any items in common, the size of the FP-tree is effectively the same as the size of the original data. However, the physical
storage requirement for the FP-tree is higher because it requires additional space to store pointers between nodes and counters for each item.
The size of an FP-tree also depends on how the items are ordered. If
the ordering scheme in the preceding example is reversed, i.e., from lowest
to highest support item, the resulting FP-tree is shown in Figure 6.25. The
tree appears to be denser because the branching factor at the root node has
increased from 2 to 5 and the number of nodes containing the high support items such as a and b has increased from 3 to 12. Nevertheless, ordering
by decreasing support counts does not always lead to the smallest tree. For
example, suppose we augment the data set given in Figure 6.24 with 100
transactions that contain {e}, 80 transactions that contain {d}, 60 transactions