Figure 6.25. An FP-tree representation for the data set shown in Figure 6.24 with a different item ordering scheme.
that contain {c}, and 40 transactions that contain {b}. Item e is now the most frequent item, followed by d, c, b, and a. With the augmented transactions, ordering by decreasing support counts will result in an FP-tree similar to Figure 6.25, while a scheme based on increasing support counts produces a smaller FP-tree similar to Figure 6.24(iv).
An FP-tree also contains a list of pointers connecting nodes that have the same items. These pointers, represented as dashed lines in Figures 6.24 and 6.25, help to facilitate the rapid access of individual items in the tree. We explain how to use the FP-tree and its corresponding pointers for frequent itemset generation in the next section.
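These node-link pointers can be sketched with a small node structure and a header table that threads together all nodes holding the same item. The class and function names below are illustrative, not from the text:

```python
class FPNode:
    """One node of an FP-tree: an item label, a count, child links,
    a parent link, and a node-link to the next node with the same item."""
    def __init__(self, item, parent=None):
        self.item = item          # item label (None for the root)
        self.count = 0            # number of transactions through this node
        self.parent = parent      # link toward the root
        self.children = {}        # item -> FPNode
        self.next_link = None     # dashed pointer to next node with same item

def insert_transaction(root, items, header):
    """Insert one (already sorted) transaction into the tree, maintaining
    the header table of node-link chains."""
    node = root
    for item in items:
        child = node.children.get(item)
        if child is None:
            child = FPNode(item, parent=node)
            node.children[item] = child
            # thread the new node onto the front of the item's link chain
            child.next_link = header.get(item)
            header[item] = child
        child.count += 1
        node = child
```

Following an item's chain from the header table visits every occurrence of that item in the tree, which is exactly the "rapid access" the dashed pointers provide.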
6.6.2 Frequent Itemset Generation in FP-Growth Algorithm
FP-growth is an algorithm that generates frequent itemsets from an FP-tree by exploring the tree in a bottom-up fashion. Given the example tree shown in Figure 6.24, the algorithm looks for frequent itemsets ending in e first, followed by d, c, b, and finally, a. This bottom-up strategy for finding frequent itemsets ending with a particular item is equivalent to the suffix-based approach described in Section 6.5. Since every transaction is mapped onto a path in the FP-tree, we can derive the frequent itemsets ending with a particular item, say e, by examining only the paths containing node e. These paths can be accessed rapidly using the pointers associated with node e. The extracted paths are shown in Figure 6.26(a). The details on how to process the paths to obtain frequent itemsets will be explained later.
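The path extraction works by following an item's node-link chain and climbing parent pointers from each occurrence back to the root. A minimal sketch (node representation and names are ours, not the book's):

```python
class Node:
    """Minimal tree node: item label, count, parent link, node-link."""
    def __init__(self, item, count, parent):
        self.item, self.count, self.parent = item, count, parent
        self.next_link = None  # chains nodes holding the same item

def paths_containing(header, item):
    """Follow the item's node-link chain and, for each occurrence, climb
    parent pointers to recover the prefix path from the root."""
    paths = []
    node = header.get(item)
    while node is not None:
        path, p = [], node.parent
        while p is not None and p.item is not None:
            path.append(p.item)
            p = p.parent
        paths.append((list(reversed(path)), node.count))
        node = node.next_link
    return paths
```

Each returned pair holds the items above one occurrence of the node and that occurrence's count, which is all that is needed for the processing described below.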
6.6 FP-Growth Algorithm 367
[Figure 6.26 panels: (a) Paths containing node e; (b) Paths containing node d; (c) Paths containing node c; (d) Paths containing node b; (e) Paths containing node a]
Figure 6.26. Decomposing the frequent itemset generation problem into multiple subproblems, where each subproblem involves finding frequent itemsets ending in e, d, c, b, and a.
Table 6.6. The list of frequent itemsets ordered by their corresponding suffixes.
Suffix   Frequent Itemsets
e        {e}, {d,e}, {a,d,e}, {c,e}, {a,e}
d        {d}, {c,d}, {b,c,d}, {a,c,d}, {b,d}, {a,b,d}, {a,d}
c        {c}, {b,c}, {a,b,c}, {a,c}
b        {b}, {a,b}
a        {a}
After finding the frequent itemsets ending in e, the algorithm proceeds to look for frequent itemsets ending in d by processing the paths associated with node d. The corresponding paths are shown in Figure 6.26(b). This process continues until all the paths associated with nodes c, b, and finally a, are processed. The paths for these items are shown in Figures 6.26(c), (d), and (e), while their corresponding frequent itemsets are summarized in Table 6.6.
FP-growth finds all the frequent itemsets ending with a particular suffix by employing a divide-and-conquer strategy to split the problem into smaller subproblems. For example, suppose we are interested in finding all frequent
[Figure 6.27 panels: (a) Prefix paths containing node e; (b) Conditional FP-tree for e; (c) Prefix paths ending in de; (d) Conditional FP-tree for de; (e) Prefix paths ending in ce; (f) Prefix paths ending in ae]
Figure 6.27. Example of applying the FP-growth algorithm to find frequent itemsets ending in e.
itemsets ending in e. To do this, we must first check whether the itemset
{e} itself is frequent. If it is frequent, we consider the subproblem of finding frequent itemsets ending in de, followed by ce, be, and ae. In turn, each of these subproblems is further decomposed into smaller subproblems. By merging the solutions obtained from the subproblems, all the frequent itemsets ending in e can be found. This divide-and-conquer approach is the key strategy employed by the FP-growth algorithm.
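The recursive decomposition can be sketched as a simple enumerator over suffixes (illustrative only): with items ordered by decreasing support, each suffix is extended by prepending every more-frequent item in turn, exactly as e spawns de, ce, be, and ae.

```python
def subproblems(items, suffix=()):
    """List every suffix-ending subproblem depth-first, mirroring the
    divide-and-conquer decomposition. `items` is ordered by decreasing
    support; suffixes grow by prepending a more-frequent item."""
    out = []
    for i in range(len(items) - 1, -1, -1):
        new_suffix = (items[i],) + suffix
        out.append(new_suffix)
        # recurse: only items more frequent than items[i] may extend it
        out.extend(subproblems(items[:i], new_suffix))
    return out
```

Because each candidate itemset is generated from exactly one suffix, the subproblems are disjoint and no itemset is enumerated twice.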
For a more concrete example on how to solve the subproblems, consider the task of finding frequent itemsets ending with e.
1. The first step is to gather all the paths containing node e. These initial paths are called prefix paths and are shown in Figure 6.27(a).
2. From the prefix paths shown in Figure 6.27(a), the support count for e is obtained by adding the support counts associated with node e. Assuming that the minimum support count is 2, {e} is declared a frequent itemset because its support count is 3.
3. Because {e} is frequent, the algorithm has to solve the subproblems of finding frequent itemsets ending in de, ce, be, and ae. Before solving these subproblems, it must first convert the prefix paths into a conditional FP-tree, which is structurally similar to an FP-tree, except it is used to find frequent itemsets ending with a particular suffix. A conditional FP-tree is obtained in the following way:
(a) First, the support counts along the prefix paths must be updated because some of the counts include transactions that do not contain item e. For example, the rightmost path shown in Figure 6.27(a), null → b:2 → c:2 → e:1, includes a transaction {b,c} that does not contain item e. The counts along the prefix path must therefore be adjusted to 1 to reflect the actual number of transactions containing {b, c, e}.
(b) The prefix paths are truncated by removing the nodes for e. These nodes can be removed because the support counts along the prefix paths have been updated to reflect only transactions that contain e, and the subproblems of finding frequent itemsets ending in de, ce, be, and ae no longer need information about node e.
(c) After updating the support counts along the prefix paths, some of the items may no longer be frequent. For example, the node b appears only once and has a support count equal to 1, which means that there is only one transaction that contains both b and e. Item b can be safely ignored from subsequent analysis because all itemsets ending in be must be infrequent.
The conditional FP-tree for e is shown in Figure 6.27(b). The tree looks different than the original prefix paths because the frequency counts have been updated and the nodes b and e have been eliminated.
4. FP-growth uses the conditional FP-tree for e to solve the subproblems of finding frequent itemsets ending in de, ce, and ae. To find the frequent itemsets ending in de, the prefix paths for d are gathered from the conditional FP-tree for e (Figure 6.27(c)). By adding the frequency counts associated with node d, we obtain the support count for {d,e}. Since the support count is equal to 2, {d,e} is declared a frequent itemset. Next, the algorithm constructs the conditional FP-tree for de using the approach described in step 3. After updating the support counts and removing the infrequent item c, the conditional FP-tree for de is shown in Figure 6.27(d). Since the conditional FP-tree contains only one item, a, whose support is equal to minsup, the algorithm extracts the frequent itemset {a,d,e} and moves on to the next subproblem, which is to generate frequent itemsets ending in ce. After processing the prefix paths for c, only {c,e} is found to be frequent. The algorithm proceeds to solve the next subproblem and finds {a,e} to be the only remaining frequent itemset.
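Steps (a)–(c) can be sketched on a list of (prefix path, adjusted count) pairs, where each count is the count of the suffix node at the end of the path, i.e., step (a)'s adjustment has already been applied. The representation and names are ours, not the book's:

```python
def conditional_tree_input(prefix_paths, minsup):
    """Drop the suffix node (step b, implicit: paths already exclude it)
    and prune items whose total adjusted count falls below minsup
    (step c). Returns the pruned paths from which the conditional
    FP-tree would be built."""
    # total adjusted count of each item along the prefix paths
    totals = {}
    for path, count in prefix_paths:
        for item in path:
            totals[item] = totals.get(item, 0) + count
    frequent = {i for i, c in totals.items() if c >= minsup}
    return [([i for i in path if i in frequent], count)
            for path, count in prefix_paths]
```

On the running example, the prefix paths for e are a→c→d, a→d, and b→c, each with adjusted count 1; with minsup 2, item b (total count 1) is pruned, matching the elimination of node b in Figure 6.27(b).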
This example illustrates the divide-and-conquer approach used in the FP- growth algorithm. At each recursive step, a conditional FP-tree is constructed by updating the frequency counts along the prefix paths and removing all infrequent items. Because the subproblems are disjoint, FP-growth will not generate any duplicate itemsets. In addition, the counts associated with the nodes allow the algorithm to perform support counting while generating the common suffix itemsets.
FP-growth is an interesting algorithm because it illustrates how a compact representation of the transaction data set helps to efficiently generate frequent itemsets. In addition, for certain transaction data sets, FP-growth outperforms the standard Apriori algorithm by several orders of magnitude. The run-time performance of FP-growth depends on the compaction factor of the data set. If the resulting conditional FP-trees are very bushy (in the worst case, a full prefix tree), then the performance of the algorithm degrades significantly because it has to generate a large number of subproblems and merge the results returned by each subproblem.
6.7 Evaluation of Association Patterns
Association analysis algorithms have the potential to generate a large number of patterns. For example, although the data set shown in Table 6.1 contains only six items, it can produce up to hundreds of association rules at certain support and confidence thresholds. As the size and dimensionality of real commercial databases can be very large, we could easily end up with thousands or even millions of patterns, many of which might not be interesting. Sifting through the patterns to identify the most interesting ones is not a trivial task because “one person’s trash might be another person’s treasure.” It is therefore important to establish a set of well-accepted criteria for evaluating the quality of association patterns.
The first set of criteria can be established through statistical arguments. Patterns that involve a set of mutually independent items or cover very few transactions are considered uninteresting because they may capture spurious relationships in the data. Such patterns can be eliminated by applying an
objective interestingness measure that uses statistics derived from data to determine whether a pattern is interesting. Examples of objective interestingness measures include support, confidence, and correlation.
The second set of criteria can be established through subjective arguments. A pattern is considered subjectively uninteresting unless it reveals unexpected information about the data or provides useful knowledge that can lead to profitable actions. For example, the rule {Butter} → {Bread} may not be interesting, despite having high support and confidence values, because the relationship represented by the rule may seem rather obvious. On the other hand, the rule {Diapers} → {Beer} is interesting because the relationship is quite unexpected and may suggest a new cross-selling opportunity for retailers. Incorporating subjective knowledge into pattern evaluation is a difficult task because it requires a considerable amount of prior information from the domain experts.
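For reference, the support and confidence of such a rule can be computed directly from transaction data; the toy data set below is illustrative, not from the text.

```python
def support_and_confidence(transactions, antecedent, consequent):
    """Support = fraction of transactions containing both sides of the
    rule; confidence = fraction of transactions containing the
    antecedent that also contain the consequent."""
    n = len(transactions)
    both = sum(1 for t in transactions if antecedent | consequent <= t)
    ante = sum(1 for t in transactions if antecedent <= t)
    return both / n, (both / ante if ante else 0.0)
```

High values of either measure say nothing about whether a pattern is subjectively interesting, which is exactly why the approaches below are needed.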
The following are some of the approaches for incorporating subjective knowledge into the pattern discovery task.
Visualization This approach requires a user-friendly environment to keep the human user in the loop. It also allows the domain experts to interact with the data mining system by interpreting and verifying the discovered patterns.
Template-based approach This approach allows the users to constrain the type of patterns extracted by the mining algorithm. Instead of reporting all the extracted rules, only rules that satisfy a user-specified template are returned to the users.
Subjective interestingness measure A subjective measure can be defined based on domain information such as concept hierarchy (to be discussed in Section 7.3) or profit margin of items. The measure can then be used to filter patterns that are obvious and non-actionable.
Readers interested in subjective interestingness measures may refer to re- sources listed in the bibliography at the end of this chapter.
6.7.1 Objective Measures of Interestingness