Returning to the previous example, notice that the rule {p} —–+ {q} has very low confi.dence because most of the transactions that contain p do not contain q. In contrast, the rule {r} – {q}, which is derived from the pattern

{q,r}, has very high confidence. This observation suggests that cross-support patterns can be detected by examining the lowest confidence rule that can be extracted from a given itemset. The proof of this statement can be understood as follows.

6.8 Effect of Skewed Support Distribution 389

Figure 6.30. A transaction data set containing three items, p, q, ?fidr, where p is a high support item and q and r are low support items.

1. Recall the following anti-monotone property of confidence:

conf ( { i j4 } – { i2 , , i4 , . . . , in } ) < conf ( { i , j ,2 is } – { in , i s , . . . , i * } ) .

This property suggests that confi.dence never increases as we shift more items from the left- to the right-hand side of an association rule. Because of this property, the lowest confidence rule extracted from a frequent itemset contains only one item on its left-hand side. We denote the set of all rules with only one item on its left-hand side as -Rr.

2. Given a frequent itemset {it,,ir,.. . ,in}, the rule

{ i i } – { i r , i , r , . . . , i i – r , i i + r , . . . , i n }

has the lowest conf idence in l?1 i f s( i ) : max l”( i r ) , s( iz), . . . ,s( i r) ] . This follows directly from the definition of confidence as the ratio be, tween the rule’s support and the support of the rule antecedent.

390 Chapter 6 Association Analysis

3. Summarizing the previous points, the lowest confidence attainable from a f requent i temset {h , i2 , . . . ,26} i s

s ( { i , 1 , i . 2 , . . . , i , n ) )

max [s ( i1 ) , s ( i z ) , . . . , s ( i r ) ] ‘

This expression is also known as the h-confidence or all-confidence measure. Because of the anti-monotone property of support, the numer- ator of the h-confidence measure is bounded by the minimum support of any item that appears in the frequent itemset. In other words, the h-confidence of an itemset X : {h,,i2,.. . ,i7r} must not exceed the fol- lowing expression:

h-confidence(x) < gt” l,-t!’.tl’ ty),” ‘t!’:ll max is (e1 ) , s \zz ) , . . . , s \Lk) )

Note the equivalence between the upper bound of h-confidence and the support ratio (r) given in Equation 6.13. Because the support ratio for a cross-support pattern is always less than h”, the h-confidence of the pattern is also guaranteed to be less than h..

Therefore, cross-support patterns can be eliminated by ensuring that the h-confidence values for the patterns exceed h”. As a final note, it is worth mentioning that the advantages of using h-confidence go beyond eliminating cross-support patterns. The measure is also anti-monotone, i.e.,

h-conf idenc e({ i ,1, i2, . . . , i ,n}) ) h-conf idence({21, i2, . . . , in+r}) ,

and thus can be incorporated directly into the mining algorithm. Furthermore, h-confidence ensures that the items contained in an itemset are strongly asso- ciated with each other. For example, suppose the h-confidence of an itemset X is 80%. If one of the items in X is present in a transaction, there is at least an 80% chance that the rest of the items in X also belong to the same trans- action. Such strongly associated patterns are called hyperclique patterns.

6.9 Bibliographic Notes

The association rule mining task was first introduced by Agrawal et al. in

1228, 2291to discover interesting relationships among items in market basket

Bibliographic Notes 391

transactions. Since its inception, extensive studies have been conducted to address the various conceptual, implementation, and application issues per- taining to the association analysis task. A summary of the various research activities in this area is shown in Figure 6.31.

Conceptual Issues

Research in conceptual issues is focused primarily on (1) developing a frame- work to describe the theoretical underpinnings of association analysis, (2) ex- tending the formulation to handle new types of patterns, and (3) extending the formulation to incorporate attribute types beyond asymmetric binary data.

Following the pioneering work by Agrawal et al., there has been a vast amount of research on developing a theory for the association analysis problem. In 1254], Gunopoulos et al. showed a relation between the problem of finding maximal frequent itemsets and the hypergraph transversal problem. An upper bound on the complexity of association analysis task was also derived. Zaki et al. [334, 336] and Pasquier et al. [294] have applied formal concept analysis to study the frequent itemset generation problem. The work by Zaki et al. have subsequently led them to introduce the notion of closed frequent itemsets 1336]. Friedman et al. have studied the association analysis problem in the context of bump hunting in multidimensional space [252]. More specifically, they consider frequent itemset generation as the task of finding high probability density regions in multidimensional space.

Over the years, new types of patterns have been defined, such as profile association rules [225], cyclic association rules [290], finzy association rules

[273], exception rules [316], negative association rules 1238, 304], weighted association rules [240, 300], dependence rules [308], peculiar rules[340], inter- transaction association rules [250, 323], and partial classification rules 1231, 285]. Other types of patterns include closed itemsets [294, 336], maximal itemsets 1234], hyperclique patterns 1330], support envelopes [314], emerging patterns [246], and contrast sets [233]. Association analysis has also been successfully applied to sequential 1230, 312], spatial [266], and graph-based

1268,274,293, 331, 335] data. The concept of cross-support pattern was first introduced by Hui et al. in [330]. An efficient algorithm (called Hyperclique Miner) that automatically eliminates cross-support patterns was also proposed by the authors.

Substantial research has been conducted to extend the original association rule formulation to nominal [311], ordinal [281], interval [2S+], and ratio [253, 255, 31L, 325, 339] attributes. One of the key issues is how to define the support measure for these attributes. A methodology was proposed by Steinbach et

6 .9

392 Chapter 6 Association Analysis

al. [315] to extend the traditional notion of support to more general patterns and attribute types.

o (t)

(tt c(u

o (E C) o U’ .Jt (E

U) <D

(J C’

.c C) CE (D aJ> <D

.l):: o (5

o -c

o

(o E E 5 C”

– a? (o o 5 .9 IJ-

6.9 Bibliographic Notes 393

ct) c’ = o

=g’ – G f / ‘ tLo – 6 .9 ,

– ( E

EB FE 6<o E

E o

9 ” 0 o <

c – v = x

Y d

.= – =:’ = K E e . g i d. : . . ; , + – F F < = v x A E = = ? , * F s t – + A ” x ^ i ? f ? i ? – i Y

o o

– o

o o

, 9 . 9 a

S :EB :

‘Ee gE€ cEgE

394 Chapter 6 Association Analysis

Implementation Issues

Research activities in this area revolve around (1) integrating the mining ca- pability into existing database technology, (2) developing efficient and scalable mining algorithms, (3) handling user-specified or domain-specific constraints, and ( ) post-processing the extracted patterns.

There are several advantages to integrating association analysis into ex- isting database technology. First, it can make use of the indexing and query processing capabilities of the database system. Second, it can also exploit the DBMS support for scalability, check-pointing, and parallelization [301]. The SETM algorithm developed by Houtsma et al. [265] was one of the earliest algorithms to support association rule discovery via SQL queries. Since then, numerous methods have been developed to provide capabilities for mining as- sociation rules in database systems. For example, the DMQL [258] and M-SQL

[267] query languages extend the basic SQL with new operators for mining as- sociation rules. The Mine Rule operator [283] is an expressive SQL operator that can handle both clustered attributes and item hierarchies. Tsur et al.

[322] developed a generate-and-test approach called query flocks for mining association rules. A distributed OLAP-based infrastructure was developed by Chen et al. l2aL] for mining multilevel association rules.

Dunkel and Soparkar l2a8] investigated the time and storage complexity of the Apri,ori algorithm. The FP-growth algorithm was developed by Han et al. in [259]. Other algorithms for mining frequent itemsets include the DHP (dynamic hashing and pruning) algorithm proposed by Park et al. [292] and the Partition algorithm developed by Savasere et al [303]. A sampling-based frequent itemset generation algorithm was proposed by Toivonen [320]. The algorithm requires only a single pass over the data, but it can produce more candidate itemsets than necessary. The Dynamic Itemset Counting (DIC) algorithm [239] makes only 1.5 passes over the data and generates less candi- date itemsets than the sampling-based algorithm. Other notable algorithms include the tree-projection algorithm [223] and H-Mine [295]. Survey articles on frequent itemset generation algorithms can be found in 1226, 262]. A repos- itory of data sets and algorithms is available at the Frequent Itemset Mining Implementations (FIMI) repository (http:llfirni.cs.helsinki.fi). Parallel algo- rithms for mining association patterns have been developed by various authors

[224,256,287,306,337]. A survey of such algorithms can be found in [333]. Online and incremental versions of association rule mining algorithms had also been proposed by Hidber [260] and Cheung et aL. 12421.

Srikant et al. [313] have considered the problem of mining association rules in the presence of boolean constraints such as the following:

6.9 Bibliographic Notes 395

(Cookies A Milk) V (descendents(Cookies) A -ancestors(Wheat Bread))

Given such a constraint, the algorithm looks for rules that contain both cook- ies and milk, or rules that contain the descendent items of cookies but not ancestor items of wheat bread. Singh et al. [310] and Ng et al. [288] had also developed alternative techniques for constrained-based association rule min- ing. Constraints can also be imposed on the support for different itemsets. This problem was investigated by Wang et al. [324], Liu et al. in [279], and Seno et al. [305].

One potential problem with association analysis is the large number of patterns that can be generated by current algorithms. To overcome this prob- lem, methods to rank, summarize, and filter patterns have been developed. Toivonen et al. [321] proposed the idea of eliminating redundant rules using structural rule covers and to group the remaining rules using clustering. Liu et al. [280] applied the statistical chi-square test to prune spurious patterns and summarized the remaining patterns using a subset of the patterns called direction setting rules. The use of objective measures to filter patterns has been investigated by many authors, including Brin et al. [238], Bayardo and Agrawal [235], Aggarwal and Yu 12271, and DuMouchel and Pregibon[2a7]. The properties for many of these measures were analyzed by Piatetsky-Shapiro

[297], Kamber and Singhal [270], Hilderman and Hamilton [261], and Tan et al. [318]. The grade-gender example used to highlight the importance of the row and column scaling invariance property was heavily influenced by the discussion given in [286] by Mosteller. Meanwhile, the tea-coffee example il- lustrating the limitation of confidence was motivated by an example given in