For example, the second transaction shown in Table 6.2 contains the item-
set {Bread, Diapers} but not {Bread, Milk}. An important property of an
itemset is its support count, which refers to the number of transactions that
contain a particular itemset. Mathematically, the support count, σ(X), for an itemset X can be stated as follows:

σ(X) = |{ t_i | X ⊆ t_i, t_i ∈ T }|,

where the symbol | · | denotes the number of elements in a set. In the data set
shown in Table 6.2, the support count for {Beer, Diapers, Milk} is equal to
two because there are only two transactions that contain all three items.
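The support count has a direct computational reading: scan the transactions and count how many contain the itemset. The sketch below is a minimal illustration in Python; the transaction list and the helper name `support_count` are assumptions, reconstructed to be consistent with the counts quoted in this section rather than copied from Table 6.1/6.2.

```python
# A minimal sketch of the support count: sigma(X) is the number of
# transactions that contain every item of the itemset X. The five
# transactions below are a hypothetical reconstruction consistent with
# the counts quoted in the text, not a copy of the book's table.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]

def support_count(itemset, transactions):
    """sigma(X): number of transactions t such that X is a subset of t."""
    return sum(1 for t in transactions if itemset <= t)

print(support_count({"Beer", "Diapers", "Milk"}, transactions))  # 2
```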
Association Rule An association rule is an implication expression of the form X → Y, where X and Y are disjoint itemsets, i.e., X ∩ Y = ∅. The
strength of an association rule can be measured in terms of its support and
confidence. Support determines how often a rule is applicable to a given
data set, while confidence determines how frequently items in Y appear in transactions that contain X. The formal definitions of these metrics are
Support, s(X → Y) = σ(X ∪ Y) / N,                (6.1)

Confidence, c(X → Y) = σ(X ∪ Y) / σ(X).          (6.2)
Example 6.1. Consider the rule {Milk, Diapers} → {Beer}. Since the support count for {Milk, Diapers, Beer} is 2 and the total number of transactions is 5, the rule's support is 2/5 = 0.4. The rule's confidence is obtained by dividing the support count for {Milk, Diapers, Beer} by the support count for {Milk, Diapers}. Since there are 3 transactions that contain milk and diapers, the confidence for this rule is 2/3 = 0.67. ∎
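The two metrics translate directly into code. The snippet below is a minimal illustration that reproduces the numbers in Example 6.1; the `sigma` helper and the transaction list are assumptions (the same hypothetical reconstruction used in the earlier sketch).

```python
# Evaluating Equations 6.1 and 6.2 for the rule {Milk, Diapers} -> {Beer}
# on the same five hypothetical transactions used earlier.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]
sigma = lambda X: sum(1 for t in transactions if X <= t)  # support count

X, Y = {"Milk", "Diapers"}, {"Beer"}
support = sigma(X | Y) / len(transactions)  # 2/5 = 0.4
confidence = sigma(X | Y) / sigma(X)        # 2/3 ~ 0.67
print(support, confidence)
```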
Why Use Support and Confidence? Support is an important measure because a rule that has very low support may occur simply by chance. A low support rule is also likely to be uninteresting from a business perspective because it may not be profitable to promote items that customers seldom buy together (with the exception of the situation described in Section 6.8). For these reasons, support is often used to eliminate uninteresting rules. As will be shown in Section 6.2.1, support also has a desirable property that can be exploited for the efficient discovery of association rules.
Confidence, on the other hand, measures the reliability of the inference made by a rule. For a given rule X → Y, the higher the confidence, the more likely it is for Y to be present in transactions that contain X. Confidence also provides an estimate of the conditional probability of Y given X.
Association analysis results should be interpreted with caution. The infer- ence made by an association rule does not necessarily imply causality. Instead, it suggests a strong co-occurrence relationship between items in the antecedent and consequent of the rule. Causality, on the other hand, requires knowledge about the causal and effect attributes in the data and typically involves rela- tionships occurring over time (e.g., ozone depletion leads to global warming).
Formulation of Association Rule Mining Problem The association rule mining problem can be formally stated as follows:
Definition 6.1 (Association Rule Discovery). Given a set of transactions T, find all the rules having support ≥ minsup and confidence ≥ minconf, where minsup and minconf are the corresponding support and confidence thresholds.
A brute-force approach for mining association rules is to compute the sup- port and confidence for every possible rule. This approach is prohibitively expensive because there are exponentially many rules that can be extracted from a data set. More specifically, the total number of possible rules extracted from a data set that contains d items is
R = 3^d − 2^(d+1) + 1.                (6.3)
The proof for this equation is left as an exercise to the readers (see Exercise 5 on page 405). Even for the small data set shown in Table 6.1, this approach requires us to compute the support and confidence for 3^6 − 2^7 + 1 = 602 rules. More than 80% of the rules are discarded after applying minsup = 20% and minconf = 50%, thus wasting most of the computations. To avoid performing needless computations, it would be useful to prune the rules early without having to compute their support and confidence values.
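Equation 6.3 can also be checked by brute force for small d: count every rule X → Y with nonempty, disjoint X and Y drawn from d items and compare against the closed form. The helper below is an illustrative sketch (its name and structure are assumptions, not the book's code).

```python
from itertools import combinations

def count_rules(d):
    """Count rules X -> Y with X, Y nonempty and disjoint, over d items."""
    items = range(d)
    total = 0
    for size_x in range(1, d + 1):
        for X in combinations(items, size_x):
            rest = [i for i in items if i not in X]
            # any nonempty subset of the remaining items can serve as Y
            total += sum(1 for size_y in range(1, len(rest) + 1)
                         for _ in combinations(rest, size_y))
    return total

for d in range(1, 8):
    assert count_rules(d) == 3**d - 2**(d + 1) + 1
print(count_rules(6))  # 602, as quoted above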
An initial step toward improving the performance of association rule mining algorithms is to decouple the support and confidence requirements. From Equation 6.2, notice that the support of a rule X → Y depends only on the support of its corresponding itemset, X ∪ Y. For example, the following rules have identical support because they involve items from the same itemset, {Beer, Diapers, Milk}:

{Beer, Diapers} → {Milk},    {Beer, Milk} → {Diapers},
{Diapers, Milk} → {Beer},    {Beer} → {Diapers, Milk},
{Milk} → {Beer, Diapers},    {Diapers} → {Beer, Milk}.
If the itemset is infrequent, then all six candidate rules can be pruned imme- diately without our having to compute their confidence values.
Therefore, a common strategy adopted by many association rule mining algorithms is to decompose the problem into two major subtasks:
1. Frequent Itemset Generation, whose objective is to find all the itemsets that satisfy the minsup threshold. These itemsets are called frequent itemsets.
2. Rule Generation, whose objective is to extract all the high-confidence rules from the frequent itemsets found in the previous step. These rules are called strong rules.
The computational requirements for frequent itemset generation are gen- erally more expensive than those of rule generation. Efficient techniques for generating frequent itemsets and association rules are discussed in Sections 6.2 and 6.3, respectively.
Figure 6.1. An itemset lattice.
6.2 Frequent Itemset Generation
A lattice structure can be used to enumerate the list of all possible itemsets. Figure 6.1 shows an itemset lattice for I = {a, b, c, d, e}. In general, a data set that contains k items can potentially generate up to 2^k − 1 frequent itemsets, excluding the null set. Because k can be very large in many practical applications, the search space of itemsets that need to be explored is exponentially large.
A brute-force approach for finding frequent itemsets is to determine the support count for every candidate itemset in the lattice structure. To do this, we need to compare each candidate against every transaction, an operation that is shown in Figure 6.2. If the candidate is contained in a transaction, its support count will be incremented. For example, the support for {Bread, Milk} is incremented three times because the itemset is contained in transactions 1, 4, and 5. Such an approach can be very expensive because it requires O(NMw) comparisons, where N is the number of transactions, M = 2^k − 1 is the number of candidate itemsets, and w is the maximum transaction width.
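The brute-force procedure just described maps onto two nested loops: one over the candidate itemsets in the lattice, one over the transactions. The sketch below is illustrative (the function name is an assumption); applied to the five-transaction reconstruction used earlier, it yields a count of 3 for {Bread, Milk}, matching the text.

```python
from itertools import combinations

def brute_force_support(transactions):
    """Count the support of every candidate itemset in the lattice:
    O(N*M*w) comparisons for N transactions, M = 2^k - 1 candidates,
    and maximum transaction width w."""
    items = sorted(set().union(*transactions))
    support = {}
    for k in range(1, len(items) + 1):           # every itemset size
        for candidate in combinations(items, k):  # every candidate of that size
            c = frozenset(candidate)
            support[c] = sum(1 for t in transactions if c <= t)
    return support

# e.g. with the five transactions defined in the earlier sketch:
# brute_force_support(transactions)[frozenset({"Bread", "Milk"})] == 3
```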
Figure 6.2. Counting the support of candidate itemsets.
There are several ways to reduce the computational complexity of frequent itemset generation.

1. Reduce the number of candidate itemsets (M). The Apriori principle, described in the next section, is an effective way to eliminate some of the candidate itemsets without counting their support values.

2. Reduce the number of comparisons. Instead of matching each candidate itemset against every transaction, we can reduce the number of comparisons by using more advanced data structures, either to store the candidate itemsets or to compress the data set. We will discuss these strategies in Sections 6.2.4 and 6.6.
6.2.1 The Apriori Principle
This section describes how the support measure helps to reduce the number
of candidate itemsets explored during frequent itemset generation. The use of
support for pruning candidate itemsets is guided by the following principle.
Theorem 6.1 (Apriori Principle). If an itemset is frequent, then all of its subsets must also be frequent.
To illustrate the idea behind the Apriori principle, consider the itemset lattice shown in Figure 6.3. Suppose {c, d, e} is a frequent itemset. Clearly, any transaction that contains {c, d, e} must also contain its subsets, {c, d}, {c, e}, {d, e}, {c}, {d}, and {e}. As a result, if {c, d, e} is frequent, then all subsets of {c, d, e} (i.e., the shaded itemsets in this figure) must also be frequent.
Figure 6.3. An illustration of the Apriori principle. If {c, d, e} is frequent, then all subsets of this itemset are frequent.
Conversely, if an itemset such as {a, b} is infrequent, then all of its supersets must be infrequent too. As illustrated in Figure 6.4, the entire subgraph containing the supersets of {a, b} can be pruned immediately once {a, b} is found to be infrequent. This strategy of trimming the exponential search space based on the support measure is known as support-based pruning. Such a pruning strategy is made possible by a key property of the support measure, namely, that the support for an itemset never exceeds the support for its subsets. This property is also known as the anti-monotone property of the support measure.
Definition 6.2 (Monotonicity Property). Let I be a set of items, and J = 2^I be the power set of I. A measure f is monotone (or upward closed) if

∀X, Y ∈ J : (X ⊆ Y) → f(X) ≤ f(Y),
Figure 6.4. An illustration of support-based pruning. If {a, b} is infrequent, then all supersets of {a, b} are infrequent.
which means that if X is a subset of Y, then f(X) must not exceed f(Y). On the other hand, f is anti-monotone (or downward closed) if

∀X, Y ∈ J : (X ⊆ Y) → f(Y) ≤ f(X),

which means that if X is a subset of Y, then f(Y) must not exceed f(X).
Any measure that possesses an anti-monotone property can be incorpo- rated directly into the mining algorithm to effectively prune the exponential search space of candidate itemsets, as will be shown in the next section.
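In code, support-based pruning amounts to a simple filter: any candidate with a known-infrequent subset is discarded before its support is ever counted. The sketch below is illustrative (the function names are assumptions), using the {a, b} example from the figure.

```python
from itertools import combinations

def prune_candidates(candidates, infrequent):
    """Discard any candidate itemset that contains an infrequent subset."""
    def has_infrequent_subset(candidate):
        return any(frozenset(sub) in infrequent
                   for r in range(1, len(candidate))
                   for sub in combinations(candidate, r))
    return [c for c in candidates if not has_infrequent_subset(c)]

infrequent = {frozenset({"a", "b"})}      # itemset already found infrequent
candidates = [frozenset("abc"), frozenset("abd"), frozenset("cde")]
print(prune_candidates(candidates, infrequent))  # only {c, d, e} survives
```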
6.2.2 Frequent Itemset Generation in the Apriori Algorithm
Apriori is the first association rule mining algorithm that pioneered the use
of support-based pruning to systematically control the exponential growth of candidate itemsets. Figure 6.5 provides a high-level illustration of the frequent itemset generation part of the Apriori algorithm for the transactions shown in
Figure 6.5. Illustration of frequent itemset generation using the Apriori algorithm.
Table 6.1. We assume that the support threshold is 60%, which is equivalent to a minimum support count equal to 3.
Initially, every item is considered as a candidate 1-itemset. After counting their supports, the candidate itemsets {Cola} and {Eggs} are discarded because they appear in fewer than three transactions. In the next iteration, candidate 2-itemsets are generated using only the frequent 1-itemsets because the Apriori principle ensures that all supersets of the infrequent 1-itemsets must be infrequent. Because there are only four frequent 1-itemsets, the number of candidate 2-itemsets generated by the algorithm is (4 choose 2) = 6. Two of these six candidates, {Beer, Bread} and {Beer, Milk}, are subsequently found to be infrequent after computing their support values. The remaining four candidates are frequent, and thus will be used to generate candidate 3-itemsets. Without support-based pruning, there are (6 choose 3) = 20 candidate 3-itemsets that can be formed using the six items given in this example. With the Apriori principle, we only need to keep candidate 3-itemsets whose subsets are frequent. The only candidate that has this property is {Bread, Diapers, Milk}.
The effectiveness of the Apriori pruning strategy can be shown by counting the number of candidate itemsets generated. A brute-force strategy of
enumerating all itemsets (up to size 3) as candidates will produce

(6 choose 1) + (6 choose 2) + (6 choose 3) = 6 + 15 + 20 = 41

candidates. With the Apriori principle, this number decreases to

(6 choose 1) + (4 choose 2) + 1 = 6 + 6 + 1 = 13

candidates, which represents a 68% reduction in the number of candidate itemsets even in this simple example.
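As a quick check of this arithmetic, the two candidate counts and the reduction can be reproduced with the standard library; the snippet below is a small illustration, not part of the book's algorithm.

```python
from math import comb  # binomial coefficient, Python 3.8+

brute_force = comb(6, 1) + comb(6, 2) + comb(6, 3)  # 6 + 15 + 20 = 41
with_apriori = comb(6, 1) + comb(4, 2) + 1           # 6 + 6 + 1 = 13
print(round(1 - with_apriori / brute_force, 2))      # ~0.68 reduction
```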
The pseudocode for the frequent itemset generation part of the Apriori algorithm is shown in Algorithm 6.1. Let C_k denote the set of candidate k-itemsets and F_k denote the set of frequent k-itemsets:
• The algorithm initially makes a single pass over the data set to determine the support of each item. Upon completion of this step, the set of all frequent 1-itemsets, F_1, will be known (steps 1 and 2).
• Next, the algorithm will iteratively generate new candidate k-itemsets using the frequent (k − 1)-itemsets found in the previous iteration (step 5). Candidate generation is implemented using a function called apriori-gen, which is described in Section 6.2.3.
Algorithm 6.1 Frequent itemset generation of the Apriori algorithm.
1: k = 1.
2: F_k = { i | i ∈ I ∧ σ({i}) ≥ N × minsup }.  {Find all frequent 1-itemsets}
3: repeat
4:   k = k + 1.
5:   C_k = apriori-gen(F_{k−1}).  {Generate candidate itemsets}
6:   for each transaction t ∈ T do
7:     C_t = subset(C_k, t).  {Identify all candidates that belong to t}
8:     for each candidate itemset c ∈ C_t do
9:       σ(c) = σ(c) + 1.  {Increment support count}
10:    end for
11:  end for
12:  F_k = { c | c ∈ C_k ∧ σ(c) ≥ N × minsup }.  {Extract the frequent k-itemsets}
13: until F_k = ∅
14: Result = ⋃ F_k.
• To count the support of the candidates, the algorithm needs to make an additional pass over the data set (steps 6–10). The subset function is used to determine all the candidate itemsets in C_k that are contained in each transaction t. The implementation of this function is described in Section 6.2.4.
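Putting the pieces together, the following is a compact Python sketch of Algorithm 6.1. It is an illustration rather than the book's implementation: `apriori_gen` is simplified to "join the frequent items and keep candidates whose (k − 1)-subsets are all frequent", and the subset function of step 7 is a plain containment test; the fuller versions are the subject of Sections 6.2.3 and 6.2.4. The transaction list is the same hypothetical reconstruction used in the earlier sketches, and the function names are assumptions.

```python
from itertools import combinations

def apriori_gen(prev_frequent, k):
    """Candidate k-itemsets whose (k-1)-subsets are all frequent (simplified)."""
    items = sorted(set().union(*prev_frequent))
    return [frozenset(c) for c in combinations(items, k)
            if all(frozenset(s) in prev_frequent for s in combinations(c, k - 1))]

def apriori_frequent_itemsets(transactions, minsup):
    N = len(transactions)
    items = set().union(*transactions)
    # Steps 1-2: one pass over the data to find the frequent 1-itemsets.
    F = {1: {frozenset({i}) for i in items
             if sum(1 for t in transactions if i in t) >= N * minsup}}
    k = 1
    while F[k]:                                  # repeat ... until F_k is empty
        k += 1
        candidates = apriori_gen(F[k - 1], k)    # step 5
        counts = {c: 0 for c in candidates}
        for t in transactions:                   # steps 6-11: count candidate support
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        F[k] = {c for c, n in counts.items()     # step 12: keep frequent k-itemsets
                if n >= N * minsup}
    return set().union(*F.values())              # step 14: union of all F_k

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]
for itemset in sorted(apriori_frequent_itemsets(transactions, 0.6), key=len):
    print(sorted(itemset))
```

With a 60% support threshold this reproduces the pruning behavior described in the walkthrough above: only four items survive as frequent 1-itemsets, and only candidate itemsets whose subsets are all frequent are ever counted in later passes.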