+1 (208) 254-6996 [email protected]

Breadth-First versus Depth-First: The Apriori, algorithm traverses the lattice in a breadth-first manner) as shown in Figure 6.2L(a). It first discovers all the frequent 1-itemsets, followed by the frequent 2-itemsets, and so on, until no new frequent itemsets are generated. The itemset

Frequent Itemset null

Don't use plagiarized sources. Get Your Custom Essay on
Breadth-First versus Depth-First: The Apriori, algorithm traverses the lattice in a breadth-first manner) as shown in Figure 6.2
Just from $13/Page
Order Essay

{a1,a2,…,an} \ ltemsel Border

(b) Specific-to-general



6.5 Alternative Methods for Generating Flequent Itemsets 361

(a) Prefix tree.

Figure 6,20″ Equivalence classes based on the prefix and suffix labels of itemsets.

(a) Breadth first (b) Depth first

Figure 6,21. Breadth{irst and depth-first traversals.

lattice can also be traversed in a depth-first manner, as shown in Figures 6.21(b) and 6.22. The algorithm can start from, say, node a, in Figure 6.22, and count its support to determine whether it is frequent. If so, the algorithm progressively expands the next level of nodes, i.e., ab, abc, and so on, until an infrequent node is reached, say, abcd. It then backtracks to another branch, say, abce, and continues the search from there.

The deprth-first approach is often used by algorithms designed to find maximal frequent itemsets. This approach allows the frequent itemset border to be detected more quickly than using a breadth-first approach. Once a merximal frequent itemset is found, substantial pruning can be

\.– /



362 Chapter 6 Association Analysis

bce bde cde r/// – – – – – – – – t ‘

abce abde

Figure 6.22. Generating candidate itemsets using the depth{irst approach.

performed on its subsets. For example, if the node bcde shown in Figure 6.22 is maximal frequent, then the algorithm does not have to visit the subtrees rooted at bd,, be, c, d, and e because they will not contain any maximal frequent itemsets. However, if abcis maximal frequent, only the nodes such as ac and bc arc not maximal frequent (but the subtrees of ac and bc may still contain maximal frequent itemsets). The depth-first approach also allows a different kind of pruning based on the support of itemsets. For example, suppose the support for {a,b,c} is identical to the support for {a, b}. The subtrees rooted at abd and abe can be skipped because they are guaranteed not to have any maximal frequent itemsets. The oroof of this is left as an exercise to the readers.

Representation of Transaction Data Set There are many ways to rep- resent a transaction data set. The choice of representation can affect the I/O costs incurred when computing the support of candidate itemsets. Figure 6.23 shows two different ways of representing market basket transactions. The rep- resentation on the left is called a horizontal data layout, which is adopted by many association rule mining algorithms, including Apri,ori,. Another pos- sibility is to store the list of transaction identifiers (TID-list) associated with each item. Such a representation is known as the vertical data layout. The support for each candidate itemset is obtained by intersecting the TlD-lists of its subset items. The length of the TlD-lists shrinks as we progress to larger

I I I I I qc

I \

)r tt


\ ) l I t- ,



de I






6.6 FP-Growth Algorithm 363

Horizontal Data Layout Vertical Data Layout

‘l 1 2 2 1 4 2 3 4 3 5 5 4 5 o

6 7 8 I 7 I I 8 10 9

Figure 6.23, Horizontal and vertical data format.

sized itemsets. [fowever, one problem with this approach is that the initial

set of TlD-lists may be too large to fit into main memory, thus requiring

more sophisticated techniques to compress the TlD-lists. We describe another

effective approach to represent the data in the next section.

6.6 FP-Growth Algorithm

This section presents an alternative algorithm called FP-growth that takes

a radically different approach to discovering frequent itemsets. The algorithm

does not subscribe to the generate-and-test paradigm of Apriori^ Instead, it

encodes the data set using a compact data structure called an FP-tree and

extracts frequent itemsets directly from this structure. The details of this

approach are presented next.

6.6.1 FP-Tbee Representation

An FP-tree is a compressed representation of the input data. It is constructed

by reading the data set one transaction at a time and mapping each transaction

onto a path in the FP-tree. As different transactions can have several items

in common, their paths may overlap. The more the paths overlap with one

another, the more compression we can achieve using the FP-tree structure. If

the size of the F.P-tree is small enough to fit into main memory, this will allow

us to extract frequent itemsets directly from the structure in memory instead

of making repeated passes over the data stored on disk.



364 Chapter 6 Association Analysis

Transaction Data Set

(i) After reading TID=1 (ii) After reading TID=2

(iv) After reading TID=10

Figure 6.24, Construction of an FP-tree.

Figure 6.24 shows a data set that contains ten transactions and five items. The structures ofthe FP-tree after reading the first three transactions are also depicted in the diagram. Each node in the tree contains the label of an item along with a counter that shows the number of transactions mapped onto the given path. Initially, the FP-tree contains only the root node represented by the nulf symbol. The FP-tree is subsequently extended in the following way:

1. The data set is scanned once to determine the support count of each item. Infrequent items are discarded, while the frequent items are sorted in decreasing support counts. For the data set shown in Figure 6.24, a is the most frequent item, followed by b, c, d, and e.

b : 1



FP-GrowthAlgorithm 365

2. The algoril;hm makes a second pass over the data to construct the FP-

tree. After reading the first transaction, {o,b), the nodes labeled as a

and b are created. A path is then formed from nulI –+ a ‘–+ b to encode

the transaction. Every node along the path has a frequency count of 1.

3. After reading the second transaction, {b,cd}, a new set of nodes is cre-

ated for items b, c, arrd d. A path is then formed to represent the

transaction by connecting the nodes null —+ b —+ c -* d. Every node

along this path also has a frequency count equal to one. Although the first two transactions have an item in common, which is b, their paths

are disjoint because the transactions do not share a common prefix.

4. The third transaction, {a,cd,e}, shares a common prefix item (which

is a) with the first transaction. As a result, the path for the third

transaction, null ) a —+ c —+ d –+ e, overlaps with the path for the

first transaction, nuII —+ a — b. Because of their overlapping path, the frequency count for node o is incremented to two, while the frequency

counts for the newly created nodes, c, d, and e) are equal to one.

5. This process continues until every transaction has been mapped onto one

of the paths given in the FP-tree. The resulting FP-tree after reading all the transactions is shown at the bottom of Figure 6.24.

The size of an FP-tree is typically smaller than the size of the uncompressed data because ma,ny transactions in market basket data often share a few items

in common. In the best-case scenario, where all the transactions have the

same set of iterns, the FP-tree contains only a single branch of nodes. The

worst-case scenario happens when every transaction has a unique set of items.

As none of the transactions have any items in common, the size of the FP-tree is effectively the same as the size of the original data. However, the physical

storage requirement for the FP-tree is higher because it requires additional space to store pointers between nodes and counters for each item.

The size of an FP-tree also depends on how the items are ordered. If

the ordering scheme in the preceding example is reversed, i.e., from lowest

to highest support item, the resulting FP-tree is shown in Figure 6.25. The

tree appears to be denser because the branching factor at the root node has

increased from 2 to 5 and the number of nodes containing the high support items such as a and b has increased from 3 to 12. Nevertheless, ordering

by decreasing support counts does not always lead to the smallest tree. For

exanlple, suppose we augment the data set given in Figure 6.24 with 100

transactions that contain {e}, 80 transactions that contain {d}, 60 transactions




366 Chapter 6 Association Analysis

Order your essay today and save 10% with the discount code ESSAYHELP