Figwe 7 .27 . An indirect association between a pair of items.

Definition 7.L2 (Indirect Association). A pair of items a,b is indirectly

associated via a mediator set Y if the following conditions hold:

1. s({a,b}) < t” (Itempair support condition).

2. 2Y I 0 such that:

(a) s({a} U y) > t7 and s({b} u Y) 2 tt (Mediator support condition).

(b) d({a},Y) > ta,d({b},Y) > ta, where d(X,Z) is an object ive mea-

sure of the association between X and Z (Mediator dependence condition).

Note that the mediator support and dependence conditions are used to

ensure that items in Y form a close neighborhood to both a and b. Some of the dependence measures that can be used include interest, cosine or IS,

Jaccard, and other measures previously described in Section 6.7.1 on page 371. Indirect association has many potential applications. In the market basket

domain, a and b may refer to competing items such as desktop and laptop

computers. In text mining, indirect association can be used to identify syn-

onyms, antonyms, or words that are used in different contexts. For example, given a collection of documents, the word data may be indirectly associated with gold via the mediator mining. This pattern suggests that the word mining can be used in two different contexts-data mining versus gold min- itrg.

Indirect associations can be generated in the following way. First, the set

of frequent itemsets is generated using standard algorithms such as Apri’ori’

or FP-growth. Each pair of frequent k-itemsets are then merged to obtain a candidate indirect association (a,b,Y), where a and b are a pair of items

and Y is their common mediator. For example, if {p,q,r} and {p,q,s} are

7.7 Bibliographic Notes 469

Algorithm 7.2 Algorithm for mining indirect associations. 1: Generate Fa,the set of frequent itemsets. 2: fot k :2 to k-ur. do 3 : C n : { ( a , b , Y ) l { a } U y â‚¬ F n , { b } U y â‚¬ F p , a l b } 4: for each candidate (a,b,Y) â‚¬ Cp do b: i f s({o,, b}) < r” A d({a},y) Z ta A d({b}, y) > ta then 6: In : In U {(o, b,Y)} 7: end if 8: end for 9: end for

1o: Result : UIr.

frequent 3-itemsets, then the candidate indirect association (r,t,{p,q}) is ob- tained by merging the pair of frequent itemsets. Once the candidates have been generated, it is necessary to verify that they satisfy the itempair support and mediator dependence conditions provided in Definition 7.12. However, the mediator support condition does not have to be verified because the can- didate indirect association is obtained by merging a pair of frequent itemsets. A summary of the algorithm is shown in Algorithm 7.2.

7.7 Bibliographic Notes

The problem of mining association rules from categorical and continuous data was introduced by Srikant and Agrawal in 1363]. Their strategy was to binarize the categorical attributes and to apply equal-frequency discretization to the continuous attributes. A partial completeness measure was also proposed to determine the amount of information loss as a result of discretization. This measure was then used to determine the number of discrete intervals needed to ensure that the amount of information loss can be kept at a certain desired level. Following this work, numerous other formulations have been proposed for mining quantitative association rules. The statistics-based approach was developed by Aumann and Lindell [343] to identify segments of the population who exhibit interesting behavior characterized by some quantitative attributes. This formulation was later extended by other authors including Webb [363] and Zhang et al. [372]. The min-Apri,ori algorithm was developed by Han et al.

[349] for finding association rules in continuous data without discretization. The problem of mining association rules in continuous data has also been

Chapter 7 Association Analysis: Advanced Concepts

investigated by numerous other researchers including Fukuda et al’ 1347)’ Lent et al. [355], Wang et al. [367], and Miller and Yang [357].

The method described in Section 7.3 for handling concept hierarchy using

extended transactions was developed by Srikant and Agrawal 1362]. An alter-

native algorithm was proposed by Han and Ib [350], where frequent itemsets

are generated one level at a time. More specifically, their algorithm initially generates all the frequent l-itemsets at the top level of the concept hierarchy.

The set of frequent 1-itemsets is denoted as .L(1,1). Using the frequent 1-

itemsets in L(7,1), the algorithm proceeds to generate all frequent 2-itemsets

at level 7, L(I,2). This procedure is repeated until all the frequent itemsets

involving items from the highest level of the hierarchy, ,L(1, k) (k > 1), are

extracted. The algorithm then continues to extract frequent itemsets at the

next level of the hierarchy, L(2,I), based on the frequent itemsets in.L(1,1).

The procedure is repeated until it terminates at the lowest level of the concept

hierarchy requested by the user. The sequential pattern formulation and algorithm described in Section 7.4

was proposed by Agrawal and Srikant in [341, 364]. Similarly, Mannila et

al. [356] introduced the concept of frequent episode, which is useful for min-

ing sequential patterns from a long stream of events. Another formulation of

sequential pattern mining based on regular expressions was proposed by Garo-

falakis et al. in [348]. Joshi et al. have attempted to reconcile the differences between various sequential pattern formulations [352]. The result was a uni-

versal formulation of sequential pattern with the different counting schemes

described in Section 7.4.4. Alternative algorithms for mining sequential pat-

terns were also proposed by Pei et aI. [359], Ayres et al. [344], Cheng et al.

1346], and Seno et al. [361]. The frequent subgraph mining problem was initially introduced by Inokuchi

et al. in [351]. They used a vertex-growing approach for generating frequent

induced subgraphs from a graph data set. The edge-growing strategy was

developed by Kuramochi and Karypis in 1353], where they also presented an

Apri,ori,-Iike algorithm called FSG that addresses issues such as multiplicity

of candidates, canonical labeling, and vertex invariant schemes. Another fre- quent subgraph mining algorithm known as gSpan was developed by Yan and

Han in [370]. The authors proposed using a minimum DFS code for encoding

the various subgraphs. Other variants of the frequent subgraph mining prob-

Iems were proposed by Zaki in 1371], Parthasarathy and Coatney in 1358], and

Kuramochi and Karypis in [354]. The problem of mining infrequent patterns has been investigated by many

authors. Savasere et al. [360] examined the problem of mining negative asso-

Bibliography 47L

ciation rules using a concept hierarchy. Tan et al. [365] proposed the idea of mining indirect associations for sequential and non-sequential data. Efficient algorithms for mining negative patterns have also been proposed by Boulicaut et al. [345], Teng et al. [366], Wu et al. [369], and Antonie and Za’iane 13421.

Bibliography f341] R. Agrawal and R. Srikant. Mining Sequential Patterns. In Proc. of IntI. Conf. on

Data Engi,neeri:ng, pages 3 14, Taipei, Taiwan, 1995.

1342] M.-L. Antonie and O. R. Zaiane. Mining Positive and Negative Association Rules: An Approach for Confined Rules. In Proc. of the 8th European Conf of Princi,ples and Practice of Knowledge D,i,scouery i,n Databases, pages 27-38, Pisa, Italy, September 2004.

[343] Y. Aumann and Y. Lindell. A Statistical Theory for Quantitative Association Rules. In KDD99, pages 261-270, San Diego, CA, August 1999.

13441 J . Ayres, J. Flannick, J. Gehrke, and T. Yiu. Sequential Pattern mining using a bitmap representation. In Proc. of the 9th IntI. Conf. on Knowledge D,iscouerg and Data Mining, pages 429 435, trdmonton, Canada, JuJy 2002.

[345] J.-F. Boulicaut, A. Bykowski, and B. Jeudy. Towards the Tlactable Discovery of Association Rules with Negations. In Proc. of the lth Intl. Conf on Flerible Query Answering Sgstems FQAS’1}, pages 425-434, Warsaw, Poland, October 2000.

[346] H. Cheng, X. Yan, and J. Han. IncSpan: incremental mining of sequential patterns in large database. In Proc. of the 10th Intl. Conf. on Knowledge Discouerg and, Data M’in’ing, pages 527 532, Seattle, WA, August 2004.

[347] T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Mining Optimized Associa- tion Rules for Numeric Attributes. In Proc. of the 15th SErnp. on Principles of Database Sgstems, pages 182-191, Montreal, Canada, June 1996.

1348] M. N. Garofalakis, R. Rastogi, and K. Shim. SPIRIT: Sequential Pattern Mining with Regular Expression Constraints. In Proc. of the 25th VLDB ConJ., pages 223-234, Edinburgh, Scotland, 1999.

1349] E.-H. Han, G. Karypis, and V. Kumar. Min-Apriori: An Algorithm for Finding As- sociation Rules in Data with Continuous Attributes. http://www.cs.umn.edu/-han, 1997.

[350] J. Han and Y. Fu. Mining Multiple-Level Association Rules in Large Databases. IEEE Trans. on Knowledge and Data Eng’ineering, 11(5):798-804, 1999.

[351] A. Inokuchi, T. Washio, and H. Motoda. An Apriori-based Algorithm for Mining F]e- quent Substructures from Graph Data. In Proc. of the /1th European Conf. of Principles and, Practice of Knowledge D’iscouery in Databases, pages 13-23, Lyon, France, 2000.

[352] M. V. Joshi, G. Karypis, and V. Kumar. A Universal Formulation of Sequential Patterns. In Proc. of the KDD’2001 workshop on Temporal Data Mi,ning, San Ftancisco, CA, August 2001.

[353] M. Kuramochi and G. Karypis. Frequent Subgraph Discovery. In Proc. of the 2001 IEEE Intl. Conf. on Data Mi,ning, pages 313-320, San Jose, CA, November 2001.

[354] M. Kuramochi and G. Karypis. Discovering Frequent Geometric Subgraphs. In Proc. of the 2002 IEEE Intl. Conf. on Data Mini,ng, pages 258-265, Maebashi City, Japan, December 2002.

472 Chapter 7 Association Analysis: Advanced Concepts

[355] B. Lent, A. Swami, and J. Widom. Clustering Association Rules. In Proc. of the 13th

IntI. Conf. on Data Engineering, pages 220 231, Birmingham, U.K, April 1997.

[356] H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of F]equent Episodes in Event

Sequences. Data Mi,ni.ng and Knowledge Di,scouery, 1(3):259-289, November 1997.

[357] R. J. Miller and Y. Yang. Association Rules over Interval Data. In Proc. of 1997

ACM-SIGMOD IntL Conf. on Management of Data, pages 452 461, T\rcson, AZ, May

1997.

1358] S. Parthasarathy and M. Coatney. Efficient Discovery of Common Substructures in

Macromolecules. In Proc. of the 2002 IEEE Intl. Conf. on Data M’ining, pages 362 369,

Maebashi City, Japan, December 2002.

1359] J. Pei, J. Han, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M. Hsu. PrefixSpan: Mining

Sequential Patterns efficiently by prefix-projected pattern growth. In Proc of the 17th

IntI. Conf. on Data Engi,neering, Heidelberg, Germany, April 2001.

[360] A. Savasere, E. Omiecinski, and S. Navathe. Mining for Strong Negative Associations in a Large Database of Customer TYansactions. In Proc. of the llth IntI. Conf. on Data

Eng’ineering, pages 494-502, Orlando, Florida, February 1998.

[361] M. Seno and G. Karypis. SLPMiner: An Algorithm for Finding Frequent Sequential Patterns Using Length-Decreasing Support Constraint. In Proc. of the 2002 IEEE Intl.

Conf. on Data Mzni,ng, pages 418-425, Maebashi City, Japan, December 2002.

f362] R. Srikant and R. Agrawal. Mining Generalized Association Rules. In Proc. of the

21st VLDB Conf., pages 407-419, Ztxich, Switzerland, 1995.

1363] R. Srikant and R. Agrawal. Mining Quantitative Association Rules in Large Relational Tables. In Proc. of 1996 ACM-SIGMOD Intl. Conf. on Managementof Data, pages 1

12, Montreal, Canada, 1996.