As with marspan) these constraints will affect the support counting step of sequential pattern discovery algorithms because some data sequences no longer support a candidate pattern when mingap and rnargap constraints are present. These algorithms must be modified to ensure that the timing con- straints are not violated when counting the support of a pattern. Otherwise, some infrequent sequences may mistakenly be declared as frequent patterns.

A side effect of using the margap constraint is that the Apri,ori principle might be violated. To illustrate this, consider the data set shown in Figure 7.5. Without mingap or rnargap constraints, the support for ({Z}{S}) and ({2}{3}{5}) are both equal to 60%. However, if mi,ngap: 0 and margap: L, then the support for ({Z}{5}) reduces to 40To, while the support for ({Z}{a}{S}) is still 60%. In other words, support has increased when the number of events in a sequence increases-which contradicts the Apri,ori, principle. The viola- tion occurs because the object D does not support the pattern ({2}{5}) since the time gap between events 2 and 5 is greater than rnargap. This problem can be avoided by using the concept of a contiguous subsequence.

Definition 7.2 (Corftiguous Subsequence). A sequence s is a contiguous subsequence of w – \e1e2…ek) if any one of the following conditions hold:

1. s is obtained from u-r after deleting an event from either €1 or ep,

2. s is obtained from tr after deleting an event from any element ei e w that contains at least two events, or

3. s is a contiguous subsequence of f and t is a contiguous subsequence of w .

The following examples illustrate the concept of a contiguous subsequence:

Data Sequence, s Sequential Pattern, I Is f a contiguous subsequence of s?

1) {2 ,3 }> r t 12 Yes 1,2) {2} {3} > 7 t 1 2 YES

<t3,4] t1 ,2) {2 ,3} t4} > 1 l 1 2 fes < { 1 } { 3 } { 2 } > r \ {2 No < { 1 , 2 } { 1 } { 3 } { 2 } > r l {2 No

Using the concept of contiguous subsequences) the Apri.ori, principle can be modified to handle n’Largap constraints in the following way.

Definition 7.3 (Modified Apri.orz Principle). If a k-sequence is frequent, then all of its contiguous k – l-subsequences must also be frequent.

7.4 Sequential Patterns 439

The modified Apri,ori, principle can be applied to the sequential pattern

discovery algorithm with minor modifications. During candidate pruning, not all k-sequences need to be verified since some of them may violate the margap constraint. For example,if margap:1, it is not necessary to check whether the subsequence ({1}{2,3}{5}) of the candidate ({1}{2,3}{4}{5}) is frequent since the time difference between elements {2,3} and {5} is greater than one time unit. Instead, only the contiguous subsequences of ({1}{2,3}{a}{5}) need to be examined. These subsequences include ({1}{2,3}{4}), ({2,3}{4}{5}), ( {1 } {2 } {4 } {5 } ) , and ( t1 } {3 } {4 } {5 } ) .

The Window Size Constraint

Finally, events within an element s7 do not have to occur at the same time. A window size threshold (tr.’s) can be defined to specify the maximum allowed time difference between the latest and earliest occurrences of events in any element of a sequential pattern. A window size of 0 means all events in the same element of a pattern must occur simultaneously.

The following example uses u)s : 2 to determine whether a data se- quence supports a given sequence (assuming mingap : 0, margap : 3, and marspan: -) .

Data Sequence, s Sequential Pattern, f Does s support f’l

1,3 3,4 4 l { 5 6 ,7 8 3,4 o Yes 1,3 3,4 4 I l 5 6 ,7 8 4,6 8 Yes 1,3 3,4 4 f l o 6,7 8 3 , 4 , 6 l 8 t > No 1,3 3,4 4 1 . { 5 6 ,7 8 1,3,4) 6,7,8) > No

In the last example, although the pattern ({1,3,4} {6,7,8}) satisfies the win- dow size constraint, it violates the margap constraint because the maximum time difference between events in the two elements is 5 units. The window size constraint also affects the support counting step of sequential pattern dis- covery algorithms. If Algorithm 7.I is applied without imposing the window size constraint, the support counts for some of the candidate patterns might be underestimated, and thus some interesting patterns may be lost.

7.4.4 Alternative Counting Schemes

There are several methods available for counting the support of a candidate k-sequence from a database of sequences. For illustrative purposes, consider the problem of counting the support for sequenc” ({p}{q}), as shown in Figure 7.8. Assume that ?rs : 0, mingap : 0, margap: 1, and marspan :2.

44O Chapter 7 Association Analysis: Advanced Concepts

Object’s Timeline p p p

p p q q q q q t_-_t-_______.1__ ]_, —‘i———‘t–

i1 i2 i3 i4 i5 i6 i7

# i i i i

Sequence: (p) (q)

(Method, Count)

COBJ

J cwlN

| “r’**’* Hi i t H

CDIST O 8

CDIST 5

Figure 7.8. Comparing different support counting methods.

COBJ: One occurrence per object. This method looks for at least one occurrence of a given sequence in an object’s timeline. In Figure 7.8, even though the sequence ((p)(q)) appears several times in the object’s timeline, it is counted only once- with p occurring at t:1 and q occuring at f : 3.

CWIN: One occurrence per sliding window. In this approach, a sliding time window of fixed length (marspan) is moved across an object’s timeline, one unit at a time. The support count is incremented each time the sequence is encountered in the sliding window. In Figure 7.8, the sequence ({p}{q}) is observed six times using this method.

CMINWIN: Number of minimal windows of occurrence. A minimal window of occurrence is the smallest window in which the sequence occurs given the timing constraints. In other words, a minimal

7.4 Sequential Patterns 44’l’

window is the time interval such that the sequence occurs in that time

interval, but it does not occur in any of the proper subintervals of it. This

definition can be considered as a restrictive version of CWIN, because

its effect is to shrink and collapse some of the windows that are counted

by CWIN. For example, sequenc” ({p}{q}) has four minimal window

occurrences: (1) the pair (p: t :2, q’. t :3)’ (2) the pair (p’. t — 3, q:

t : 4 ) , ( 3 ) t h e p a i r ( p : t : 5 , Q : t : 6 ) , a n d ( a ) t h e p a i r ( p : t : 6 , q :

t :7). The occurrence of event p at t :1 and event q at t :3 is not a

minimal window occurrence because it contains a smaller window with

(p: t:2, g: t: 3), which is indeed a minimal window of occurrence’

o CDIST-O: Distinct occurrences with possibility of event-timestamp

overlap. A distinct occurrence of a sequence is defined to be the set of event-

timestamp pairs such that there has to be at least one new event-

timestamp pair that is different from a previously counted occurrence.

Counting all such distinct occurrences results in the CDIST-O method.

If the occurrence time of events p and q is denoted as a tuple (t(p),t(q)),

then this method yields eight distinct occurrences of sequence ({p}{q})

at t imes (1,3), (2,3), (2,4), (3,4), (3,5), (5,6), (5,7), and (6,7).

o CDIST: Distinct occurrences with no event-timestamp overlap allowed.

In CDIST-O above, two occurrences of a sequence were allowed to have

overlapping event-timestamp pairs, e.g., the overlap between (1,3) and

(2,3). In the CDIST method, no overlap is allowed. Effectively, when an

event-timestamp pair is considered for counting, it is marked as used and

is never used again for subsequent counting of the same sequence. As

an example, there are five distinct, non-overlapping occurrences of the

sequence ({p}tq}) in the diagram shown in Figure 7.8. These occurrences

happen at times (7,3), (2,4), (3,5), (5,6), and (6,7). Observe that these

occurrences are subsets of the occurrences observed in CDIST-O.

One final point regarding the counting methods is the need to determine the

baseline for computing the support measufe. For frequent itemset mining, the

baseline is given by the total number of transactions. For sequential pattern

mining, the baseline depends on the counting method used. For the COBJ

method, the total number of objects in the input data can be used as the

baseline. For the CWIN and CMINWIN methods, the baseline is given by the

sum of the number of time windows possible in all objects. For methods such

as CDIST and CDIST_O, the baseline is given by the sum of the number of

distinct timestamps present in the input data of each object.

442 Chapter 7 Association Analysis: Advanced Concepts

7.5 Subgraph Patterns

This section describes the application of association analysis methods to more complex entities beyond itemsets and sequences. Examples include chemical compounds, 3-D protein structures, network topologies, and tree structured XML documents. These entities can be modeled using a graph representation, as shown in Table 7.7.

Table 7.7. Graph representation of entities in various application domains.

A useful data mining task to perform on this type of data is to derive a set of common substructures among the collection of graphs. Such a task is known as frequent subgraph mining. A potential application of frequent subgraph mining can be seen in the context of computational chemistry. Each year, new chemical compounds are designed for the development of pharmaceu- tical drugs, pesticides, fertilizers, etc. Although the structure of a compound is known to play a major role in determining its chemical properties, it is dif- ficult to establish their exact relationship. Flequent subgraph mining can aid this undertaking by identifying the substructures commonly associated with certain properties of known compounds. Such information can help scientists to develop new chemical compounds that have certain desired properties.

This section presents a methodology for applying association analysis to graph-based data. The section begins with a review of some of the basic graph-related concepts and definitions. The frequent subgraph mining problem is then introduced, followed by a description of how the traditional Apri,ori, algorithm can be extended to discover such patterns.

Application Graphs Vertices trdges Web mining Web browsing patterns Web paees Hyperlink between pages Computational chemistry

Structure of chemical compounds

Atoms or tons

Bond between atoms or rons

Network computing Computer networks Computers and servers

Interconnection between machines

Semantic Web Collection of XML documents

XML elements Parent-child relationship between elements

Bioinformatics Protein structures Amino acids Contact residue

7.5 Subgraph Patterns 443

7.5J Graphs and Subgraphs

A graph is a data structure that can be used to represent the relationships among a set of entities. Mathematically, a graph is composed of a vertex set V and a set of edges ,E connecting between pairs of vertices. Each edge is denoted by a vertex pair (r’i, u7), where ui,ui € I/. A label l(u,;) can be assigned to each vertex u, representing the name of an entity. Similarly each edge (ut,ui) can also be associated with a label l(ua, u7) describing the relationship between a pair of entities. Table 7.7 shows the vertices and edges associated with different types of graphs. For example, in a Web graph, the vertices correspond to Web pages and the edges represent the hyperlinks between Web pages.

Definition 7.4 (Subgraph). A graph G’ : (V’,,E/) is a subgraph of another graph G : (V,E) if its vertex set Vt is a subset of V and its edge set .E’ is a subset of ,8. The subgraph relationship is denoted as Gt es G.

Figure 7.9 shows a graph that contains 6 vertices and 11 edges along with one of its possible subgraphs. The subgraph, which is shown in Figure 7.9(b), contains only 4 of the 6 vertices and 4 of the 11 edges in the original graph.