Note that multiplying M from the right with P13 exchanges the first and third columns of. M, while multiplying M from the left with Pfi exchanges the first and third rows of M. If all three matrices are multiplied, this will produce a matrix M’whose first and third rows and columns have been swapped. r

456 Chapter 7 Association Analysis: Advanced Concepts

A(2)

A(4)

A (1 )

A(3) A(4)

A(3)

Figure

code = 1 01 1 01 001 01 0000010011 0001 11 0

7.21. String representation of adjacency matrices.

The second step is to determine the string representation for each adjacency matrix. Since the adjacency matrix is symmetric, it is suffrcient to construct the string representation based on the upper triangular part of the matrix. In the example shown in Figure 7.2I,the code is obtained by concatenating the entries of the upper triangular matrix in a column-wise fashion. The final step is to compare all the string representations of the graph and choose the one that has the lowest (or highest) lexicographic value.

The preceding approach seems expensive because it requires us to examine all possible adjacency matrices of a graph and to compute each of their string representation in order to find the canonical label. More specifically, there are kl permutations that must be considered for every graph that contains k vertices. Some of the methods developed to reduce the complexity of this task include caching the previously computed canonical label (so that we do not have to recompute it again when performing an isomorphism test on the same graph) and reducing the number of permutations needed to determine the canonical label by incorporating additional information such as vertex labels and the degree of a vertex. The latter approach is beyond the scope of this

code = 1 1 0011 1 00001 001 001 01 00001 011

InfreouentPatterns 457

book, but interested readers may consult the bibliographic notes at the end of this chapter.

7.5.6 Support Counting

Support counting is also a potentially costly operation because all the can- didate subgraphs contained in each graph G e g must be determined. One way to speed up this operation is to maintain a Iist of graph IDs associated with each frequent (k – l)-subgraph. Whenever a new candidate k-subgraph is generated by merging a pair of frequent (k – 1)-subgraphs, their correspond- ing lists of graph IDs are intersected. Finally, the subgraph isomorphism tests are performed on the graphs in the intersected list to determine whether they contain a particular candidate subgraph.

7.6 Infrequent Patterns

The association analysis formulation described so far is based on the premise that the presence of an item in a transaction is more important than its ab- sence. As a consequence, patterns that are rarely found in a database are often considered to be uninteresting and are eliminated using the support measure. Such patterns are known as infrequent patterns.

Definition 7.7 (Infrequent Pattern). An infrequent pattern is an itemset or a rule whose support is less than the minsup threshold.

Although a vast majority of infrequent patterns are uninteresting, some of them might be useful to the analysts, particularly those that correspond to negative correlations in the data. For example, the sale of DVDs and VCRs together is low because any customer who buys a DVD will most likely not buy a VCR, and vice versa. Such negative-correlated patterns are useful to help identify competing items, which are items that can be substituted for one another. Examples of competing items include tea versus coffee, butter versus margarine, regular versus diet soda, and desktop versus laptop computers.

Some infrequent patterns may also suggest the occurrence of interesting rare events or exceptional situations in the data. For example, if {f :-re = Yes} is frequent but {Fire = Yes, Alarm = 0″} is infrequent, then the latter is an interesting infrequent pattern because it may indicate faulty alarm systems. To detect such unusual situations, the expected support of a pattern must be determined, so that, if a pattern turns out to have a considerably lower support than expected, it is declared as an interesting infrequent pattern.

7.6

458 Chapter 7 Association Analysis: Advanced Concepts

Mining infrequent patterns is a challenging endeavor because there is an enormous number of such patterns that can be derived from a given data set. More specifically, the key issues in mining infrequent patterns are: (1) how to identify interesting infrequent patterns, and (2) how to efficiently discover them in large data sets. To get a different perspective on various types of interesting infrequent patterns, two related concepts-negative patterns and negatively correlated patterns-are introduced in Sections 7.6.1 and 7.6.2, re- spectively. The relationships among these patterns are elucidated in Section 7.6.3. Finally, two classes of techniques developed for mining interesting in- frequent patterns are presented in Sections 7.6.5 and 7.6.6.

7.6.L Negative Patterns

Let 1 : {h, iz, . . . , ia} be a set of i tems. A negat ive i tern, t i4, denotes the absence of item i7, from a given transaction. For example, cof f ee is a negative item whose value is 1 if a transaction does not contain cof f ee.

Definition 7.8 (Negative Itemset). A negative itemset X is an itemset that has the following properties: (1) X : AU B, where .4 is a set of positive items, B is a set of negative items, lBl > 1, and (2) s(X) > minsup.

Definition 7.9 (Negative Association Rule). A negative association rule is an association rule that has the following properties: (1) the rule is extracted from a negative itemset, (2) the support of the rule is greater than or equal to minsup, and (3) the confidence of the rule is greater than or equal to minconf.

The negative itemsets and negative association rules are collectively known as negative patterns throughout this chapter. An example of a negative association rule is tea ——+ E6f f ee, which may suggest that people who drink tea tend to not drink coffee.

7.6.2 Negatively Correlated Patterns

Section 6.7.1 on page 371 described how correlation analysis can be used to analyze the relationship between a pair of categorical variables. Measures such as interest factor (Equation 6.5) and the /-coefficient (Equation 6.8) were shown to be useful for discovering itemsets that are positively correlated. This section extends the discussion to negatively correlated patterns.

Let X : {r t , r2t . . . , r7,} denote a k- i temset and P(X) denote the proba- bility that a transaction contains X. In association analysis, the probability is often estimated using the itemset support, s(X).

7.6 Infrequent Patterns 459

Definition 7.10 (Negatively Correlated Itemset). Atr itemset X is neg- ativelv correlated if

k

” (x ) < f l ” ( ” i ) : s ( r r ) x s ( r2 )x . . . x s ( rk ) ,

j : r

where s(ri) is the support of an item ri.

The right-hand side of the preceding expression, lI!:tt@), represents an estimate of the probability that all the items in X are statistically independent. Definition 7.10 suggests that an itemset is negatively correlated if its support is below the expected support computed using the statistical independence assumption. The smaller s(X), the more negatively correlated is the pattern.

Definition 7.1L (Negatively Correlated Association Rule). An asso- ciation rule X —+ Y is negatively correlated if

s (XuY) < s (X )s ( ) ‘ ) , (7.4)

where X and Y are disjoint i temsets; i .e. , XUY :4.

The preceding definition provides only a partial condition for negative cor- relation between items in X and items in Y. A full condition for negative correlation can be stated as follows:

(7 .3)

(7.5)s(Xuv) . I s@;)fs(s),

where ri e X and gi e Y. Because the items in X (and in Y) are often positively correlated, it is more practical to use the partial condition to de- fine a negatively correlated association rule instead of the full condition. For example, although the rule

{eyeg}ass, lens cleaner} —–r {contact lens, saline solution}

is negatively correlated according to Inequality 7.4, eyeglass is positively correlated with lens cleaner and contact l-ens is positively correlated with saline solution. If Inequality 7.5 is applied instead, such a rule could be missed because it may not satisfy the full condition for negative correlation.

460 Chapter 7 Association Analysis: Advanced Concepts

The condition for negative correlation can also be expressed in terms of

the support for positive and negative itemsets. Let X and Y denote the

corresponding negative itemsets for X and Y, respectively. Since

s(xur ) -s (x)s(Y)

: s ( X U ) ‘ ) –

: s(X u Y)s(X u 7) – s(X u Y)s(X r-tY),

the condition for negative correlation can be stated as follows:

s(x u v)s(X u 7) < s(X u Y)s(X uv). (7.6)

The negatively correlated itemsets and association rules are known as nega- tively correlated patterns throughout this chapter.

7.6.3 Comparisons among Infrequent Patterns, Negative Pat-

terns, and Negatively Correlated Patterns

Infrequent patterns, negative patterns, and negatively correlated patterns are three closely related concepts. Although infrequent patterns and negatively correlated patterns refer only to itemsets or rules that contain positive items, while negative patterns refer to itemsets or rules that contain both positive

and negative items, there are certain commonalities among these concepts, as illustrated in Figure 7.22.

First, note that many infrequent patterns have corresponding negative pat-

terns. To understand why this is the case, consider the contingency table shown in Table 7.9. If XUY is infrequent, then it is likely to have a corre- sponding negative itemset unless rn’insup is too high. For example, assuming that mi,nsup < 0.25, if XUY is infrequent, then the support for at least one of

the following itemsets, X UY , X UY , or X U Y, must be higher than mi,nsup since the sum of the supports in a contingency table is 1.

Second, note that many negatively correlated patterns also have corre- sponding negative patterns. Consider the contingency table shown in Table

7.9 and the condition for negative correlation stated in Inequality 7.6. If X and Y have strong negative correlation, then

r – _ ‘ t f I

fs(x u Y) + s(x u Y)l

f(x u Y) + s(x u r)l

f . l : s(xu”) l t s(XuY) -s(xuv)-s(xuv) l s(XuY)s(xuY)

s(X uZ) x s(x u Y) >> s(X u r) x s(XuY).

7.6 InfrequentPatterns 461

Frequent Patterns

Figwe7.22. Comparisons among infrequent patterns, negative patterns, and negatively correlated patterns.

Tabfe 7.9. A two-way contingency table for the association rule X –.-+ Y.

Y Y

X

X

s(X u Y)

s(X u Y)

s (X uY)

s (X uY)

s(x)

s(x)

s(v) s(v) 1

Therefore, either X UY or X U Y, or both, must have relatively high support when X and Y are negatively correlated. These itemsets correspond to the negative patterns.