Finally, because the lower the support of X U Y, the more negatively cor- related is the pattern, negatively correlated patterns that are infrequent tend to be more interesting than negatively correlated patterns that are frequent. The infrequent, negatively correlated patterns are illustrated by the overlap- ping region in Figure 7.22 between both types of patterns.
7.6.4 Techniques for Mining Interesting Infrequent Patterns
In principle, infrequent itemsets are given by all itemsets that are not extracted by standard frequent itemset generation algorithms such as Apri,ori, and FP-
462 Chapter 7 Association Analysis: Advanced Concepts
Maximal Frequent Itemset
Infrequent
Frequent Itemset Border
Figure 7.23, Frequent and infrequent itemsets.
growth. These itemsets correspond to those located below the frequent itemset
border shown in Figure 7.23. Since the number of infrequent patterns can be exponentially large, es-
pecially for sparse, high-dimensional data, techniques developed for mining infrequent patterns focus on finding only interesting infrequent patterns. An
example of such patterns includes the negatively correlated patterns discussed in Section 7.6.2. These patterns are obtained by eliminating all infrequent itemsets that fail the negative correlation condition provided in Inequality
7.3. This approach can be computationally intensive because the supports for all infrequent itemsets must be computed in order to determine whether
they are negatively correlated. Unlike the support measure used for mining frequent itemsets, correlation-based measures used for mining negatively corre- lated itemsets do not possess an anti-monotone property that can be exploited for pruning the exponential search space. Although an efficient solution re-
mains elusive, several innovative methods have been developed, as mentioned in the bibliographic notes provided at the end of this chapter.
The remainder of this chapter presents two classes of techniques for mining interesting infrequent patterns. Section 7.6.5 describes methods for mining
‘/’
InfrequentPatterns 463
TID Items
1 {A,B} 2 {A,B,C}
3 tc) 4 {B,C} 5 {B,D}
Original Transactions Transactions with Negative ltems
Figure7.24. Augmenting a data set with negative items.
negative patterns in data, while Section 7.6.6 describes methods for finding interesting infrequent patterns based on support expectation.
7.6.5 Techniques Based on Mining Negative Patterns
The first class of techniques developed for mining infrequent patterns treats every item as a symmetric binary variable. Using the approach described in Section 7.1, the transaction data can be binarized by augmenting it with neg- ative items. Figure 7.24 shows an example of transforming the original data into transactions having both positive and negative items. By applying exist- ing frequent itemset generation algorithms such as Apriori on the augmented transactions, all the negative itemsets can be derived.
Such an approach is feasible only if a few variables are treated as symmetric binary (i.e., we look for negative patterns involving the negation of only a small number of items). If every item must be treated as symmetric binary, the problem becomes computationally intractable due to the following reasons.
1. The number of items doubles when every item is augmented with its corresponding negative item. Instead of exploring an itemset lattice of size 2d, where d is the number of items in the original data set, the lattice becomes considerably larger, as shown in Exercise 21 on page 485.
2. Support-based pruning is no longer effective when negative items are augmented. For each variable r, either r or r has support greater than or equal to 50%. Hence, even if the support threshold is as high as 50Vo,half of the items will remain frequent. For lower thresholds, many more items and possibly itemsets containing them will be frequent. The support-based pruning strategy employed by Apriori, is effective only
7.6
TID A A B B c c D D 1 1 0 1 0 0 1 0 1
2 1 0 1 0 1 0 0 1
3 0 1 0 1 1 0 0 1
4 0 1 1 0 1 0 0 1
5 0 1 1 0 0 1 1 0
464 Chapter 7 Association Analysis: Advanced Concepts
when the support for most itemsets is low; otherwise, the number of
frequent itemsets grows exponentially.
3. The width of each transaction increases when negative items are aug- mented. Suppose there are d items available in the original data set. For
sparse data sets such as market basket transactions, the width of each
transaction tends to be much smaller than d. As a result, the maximum size of a frequent itemset, which is bounded by the maximum transac- tion width, rumax, tends to be relatively small. When negative items are included, the width of the transactions increases to d because an item is
either present in the transaction or absent from the transaction, but not
both. Since the maximum transaction width has grown from’u.r,r.ur. to d, this will increase the number of frequent itemsets exponentially. As
a result, many existing algorithms tend to break down when they are applied to the extended data set.
The previous brute-force approach is computationally expensive because it
forces us to determine the support for a large number of positive and negative patterns. Instead of augmenting the data set with negative items, another approach is to determine the support of the negative itemsets based on the support of their corresponding positive items. For example, the support for
{p,Q,r} can be computed in the following way:
s({p,Q,r } ) : ” ( {p} )
– ” ( {p ,
q} ) – ‘ ( {p , r } ) + s( {p, q , r } ) .
More generally, the support for any itemset X UY can be obtained as follows:
n
s(XuY):s(X)+t { ( – t )uxs (Xoz) } . (7 .7) i : r Zcy, lz l : i
To apply Equation 7.7, s(X U Z) mtxt be determined for every Z that is a subset of Y. The support for any combination of X and Z that exceeds the m’insup threshold can be found using the Apri.ori, algorithm. For all other combinations, the supports must be determined explicitly, e.8., by scanning the entire set of transactions. Another possible approach is to either ignore the support for any infrequent itemset X U Z or to approximate it with the m’insup threshold.
Several optimization strategies are available to further improve the perfor-
mance of the mining algorithms. First, the number of variables considered as
7.6 Infrequent Patterns 465
symmetric binary can be restricted. More specifically, a negative item E is con- sidered interesting only if y is a frequent item. The rationale for this strategy is that rare items tend to produce a large number of infrequent patterns and many of which are uninteresting. By restricting the set Y given in Equation 7.7 to variables whose positive items are frequent, the number of candidate nega- tive itemsets considered by the mining algorithm can be substantially reduced. Another strategy is to restrict the type of negative patterns. For example, the algorithm may consider only a negative pattern X U Y if it contains at least one positive item (i.e., lxl > 1). The rationale for this strategy is that if the data set contains very few positive items with support greater than 50%, then most of the negative patterns of the form X U Y will become frequent, thus degrading the performance of the mining algorithm.
7.6.6 Techniques Based on Support Expectation
Another class of techniques considers an infrequent pattern to be interesting only if its actual support is considerably smaller than its expected support. For negatively correlated patterns, the expected support is computed based on the statistical independence assumption. This section describes two alternative approaches for determining the expected support of a pattern using (1) a concept hierarchy and (2) a neighborhood-based approach known as indirect association.
Support Expectation Based on Concept Hierarchy
Objective measures alone may not be sufficient to eliminate uninteresting in- frequent patterns. For example, suppose bread and laptop computer are frequent items. Even though the itemset {bread, Iaptop conputer} is in- frequent and perhaps negatively correlated, it is not interesting because their lack of support seems obvious to domain experts. Therefore, a subjective ap- proach for determining expected support is needed to avoid generating such infrequent patterns.
In the preceding example, bread and laptop computer belong to two completely different product categories, which is why it is not surprising to find that their support is low. This example also illustrates the advantage of using domain knowledge to prune uninteresting patterns. For market basket data, the domain knowledge can be inferred from a concept hierarchy such as the one shown in Figure 7.25. The basic assumption of this approach is that items from the same product family are expected to have similar types of interaction with other items. For example, since ham and bacon belong to the
466 Chapter 7 Association Analysis: Advanced Concepts
Taco Oatmeal Chocolate Ham Bacon Boneless Chio
Figure 7.25. Example of a concept hierarchy.
same product family, we expect the association between ham and chips to be
somewhat similar to the association between bacon and chips. If the actual support for any one ofthese pairs is less than their expected support, then the infrequent pattern is interesting.
To illustrate how to compute the expected support, consider the diagram shown in Figure 7.26. Suppose the itemset {C,G} is frequent. Let s(‘) denote the actual support of a pattern and e(.) denote its expected support. The expected support for any children or siblings of C and G can be computed using the formula shown below.
e(s(E,J)): s(C,G)rffi”8 (7.8)
(7.e)
(7.10)
e ( s ( C , J ) ) :
e(s(C, H)) :
s(C,G)>< #
s(C,G)” ffi For example, if soda and snack food are frequent, then the expected
support between diet soda and chips can be computed using Equation 7.8 because these items are children of soda and snack food, respectively. If
the actual support for diet soda and chips is considerably lower than their expected value, then diet soda and chips form an interesting infrequent pattern.
InfrequentPatterns 467
DEJK
Figure 7.26. Mining interesting negative patterns using a concept hierarchy.
Support Expectation Based on Indirect Association
Consider a pair of items, (a, b), that are rarely bought together by customers. Ifa and b are unrelated items, such as bread and DVO player, then their support is expected to be low. On the other hand, if a and b are related items, then their support is expected to be high. The expected support was previously computed using a concept hierarchy. This section presents an approach for determining the expected support between a pair of items by looking at other items commonly purchased together with these two items.
For example, suppose customers who buy a sleeping bag also tend to buy other camping equipment, whereas those who buy a desktop computer also tend to buy other computer accessories such as an optical mouse or a printer. Assuming there is no other item frequently bought together with both a sleeping bag and a desktop computer, the support for these unrelated items is expected to be low. On the other hand, suppose diet and regular soda are often bought together with chips and cookies. Even without using a concept hierarchy, both items are expected to be somewhat related and their support should be high. Because their actual support is low, diet and regular soda form an interesting infrequent pattern. Such patterns are known as indirect association patterns.
A high-level illustration of indirect association is shown in Figure 7.27. Items a and b correspond to diet soda and regular soda, while Y, which is known as the mediator set, contains items such as chips and cookies. A formal definition of indirect association is presented next.
7.6