Definition 7.1 (Sequential Pattern Discovery). Given a sequence data set D and a user-specified minimum support threshold minsup, the task of sequential pattern discovery is to find all sequences with support ≥ minsup.

Figure 7.5 illustrates an example of a data set that contains five data sequences. The support for the sequence <{1}{2}> is equal to 80% because it occurs in four of the five data sequences (every object except for D). Assuming that the minimum support threshold is 50%, any sequence that appears in at least three data sequences is considered to be a sequential pattern. Examples of sequential patterns extracted from the given data set include <{1}{2}>, <{1,2}>, <{2,3}>, <{1,2}{2,3}>, etc.
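The containment test behind Definition 7.1 can be sketched in Python. The data set below is hypothetical (Figure 7.5 itself is not reproduced here), but it is constructed so that <{1}{2}> is likewise supported by four of the five sequences; `contains` and `support` are illustrative helper names.

```python
# Minimal sketch of sequence support. A sequence is a list of elements;
# each element is a set of events.

def contains(data_seq, pattern):
    """Check whether data_seq contains pattern: the elements of pattern
    must map, in order, to distinct elements of data_seq as subsets."""
    i = 0  # index into pattern
    for element in data_seq:
        if i < len(pattern) and pattern[i] <= element:  # subset test
            i += 1
    return i == len(pattern)

def support(dataset, pattern):
    """Fraction of data sequences that contain the pattern."""
    return sum(contains(s, pattern) for s in dataset) / len(dataset)

# Hypothetical five-sequence data set (not the one in Figure 7.5),
# built so that <{1}{2}> appears in four of the five sequences.
dataset = [
    [{1}, {2}],
    [{1, 2}, {2, 3}],
    [{1}, {2, 3}],
    [{2}, {1}],          # does not contain <{1}{2}>
    [{1}, {3}, {2}],
]
print(support(dataset, [{1}, {2}]))  # -> 0.8
```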

Sequential pattern discovery is a computationally challenging task because there are exponentially many sequences contained in a given data sequence. For example, the data sequence <{a,b}{c,d,e}{f}{g,h,i}> contains sequences such as <{a}{c,d}{f}{g}>, <{c,d,e}>, <{b}{g}>, etc. It can be easily shown that the total number of k-sequences present in a data sequence with n events is C(n, k). A data sequence with nine events therefore contains

C(9,1) + C(9,2) + ... + C(9,9) = 2^9 − 1 = 511

distinct sequences.
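As a quick sanity check of the count above, this snippet sums the binomial coefficients for nine events:

```python
# Verify that the number of distinct k-sequences in a data sequence with
# n = 9 events, summed over k = 1..9, equals 2^9 - 1.
from math import comb

n = 9
total = sum(comb(n, k) for k in range(1, n + 1))
print(total)  # -> 511 (= 2**9 - 1)
```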

7.4 Sequential Patterns 433

A brute-force approach for generating sequential patterns is to enumerate all possible sequences and count their respective supports. Given a collection of n events, candidate 1-sequences are generated first, followed by candidate 2-sequences, candidate 3-sequences, and so on:

1-sequences: <{i1}>, <{i2}>, ..., <{in}>
2-sequences: <{i1, i2}>, <{i1, i3}>, ..., <{in-1, in}>,
             <{i1}{i1}>, <{i1}{i2}>, ..., <{in-1}{in}>
3-sequences: <{i1, i2, i3}>, <{i1, i2, i4}>, ..., <{i1, i2}{i1}>, ...,
             <{i1}{i1, i2}>, ..., <{i1}{i1}{i1}>, ..., <{in}{in}{in}>

Notice that the number of candidate sequences is substantially larger than the number of candidate itemsets. There are two reasons for the additional number of candidates:

1. An item can appear at most once in an itemset, but an event can appear more than once in a sequence. Given a pair of items, i1 and i2, only one candidate 2-itemset, {i1, i2}, can be generated. On the other hand, there are many candidate 2-sequences, such as <{i1, i2}>, <{i1}{i2}>, <{i2}{i1}>, and <{i1}{i1}>, that can be generated.

2. Order matters in sequences, but not for itemsets. For example, {1, 2} and {2, 1} refer to the same itemset, whereas <{i1}{i2}> and <{i2}{i1}> correspond to different sequences, and thus must be generated separately.
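The gap between the two candidate spaces can be made concrete with a small enumeration (illustrative code; the item names and variable names are hypothetical):

```python
# Compare the number of candidate 2-itemsets with the number of candidate
# 2-sequences over a small set of items: order matters for sequences, and
# an event may repeat across elements.
from itertools import combinations, product

items = ["i1", "i2", "i3"]
n = len(items)

# Candidate 2-itemsets: unordered pairs of distinct items.
itemsets = list(combinations(items, 2))

# Candidate 2-sequences: either one element with two events, or two
# one-event elements (ordered, repetition allowed across elements).
seqs = [[set(p)] for p in combinations(items, 2)]          # e.g. <{i1,i2}>
seqs += [[{a}, {b}] for a, b in product(items, repeat=2)]  # e.g. <{i1}{i2}>

print(len(itemsets), len(seqs))  # -> 3 12  (n*(n-1)/2 vs n*(n-1)/2 + n**2)
```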

The Apriori principle holds for sequential data because any data sequence that contains a particular k-sequence must also contain all of its (k-1)-subsequences. An Apriori-like algorithm can be developed to extract sequential patterns from a sequence data set. The basic structure of the algorithm is shown in Algorithm 7.1.

Notice that the structure of the algorithm is almost identical to Algorithm 6.1 presented in the previous chapter. The algorithm iteratively generates new candidate k-sequences, prunes candidates whose (k-1)-sequences are infrequent, and then counts the supports of the remaining candidates to identify the sequential patterns. The detailed aspects of these steps are given next.

Candidate Generation A pair of frequent (k-1)-sequences are merged to produce a candidate k-sequence. To avoid generating duplicate candidates, recall that the traditional Apriori algorithm merges a pair of frequent k-itemsets only if their first k-1 items are identical. A similar approach can be used

434 Chapter 7 Association Analysis: Advanced Concepts

Algorithm 7.1 Apriori-like algorithm for sequential pattern discovery.
1: k = 1.
2: F_1 = { i | i ∈ I ∧ σ({i})/N ≥ minsup }. {Find all frequent 1-subsequences.}
3: repeat
4:   k = k + 1.
5:   C_k = apriori-gen(F_{k-1}). {Generate candidate k-subsequences.}
6:   for each data sequence t ∈ T do
7:     C_t = subsequence(C_k, t). {Identify all candidates contained in t.}
8:     for each candidate k-subsequence c ∈ C_t do
9:       σ(c) = σ(c) + 1. {Increment the support count.}
10:    end for
11:  end for
12:  F_k = { c | c ∈ C_k ∧ σ(c)/N ≥ minsup }. {Extract the frequent k-subsequences.}
13: until F_k = ∅
14: Answer = ⋃ F_k.
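Algorithm 7.1 can be transcribed into runnable form. The sketch below (the function names and toy data set are illustrative) simplifies candidate generation: instead of the pairwise merge of (k-1)-sequences, it extends each frequent (k-1)-sequence on the right with each frequent event, which still generates a superset of all frequent k-sequences, since the prefix of any frequent k-sequence is itself frequent.

```python
def contains(data_seq, pattern):
    """Greedy test of whether pattern is contained in data_seq."""
    i = 0
    for element in data_seq:
        if i < len(pattern) and pattern[i] <= element:
            i += 1
    return i == len(pattern)

def mine_sequential_patterns(dataset, minsup):
    """Level-wise mining in the style of Algorithm 7.1. Sequences are
    represented as tuples of frozensets so they can be kept in a set."""
    n = len(dataset)

    def sup(p):
        return sum(contains(s, p) for s in dataset) / n

    events = sorted({e for s in dataset for element in s for e in element})
    freq = {(frozenset([e]),) for e in events if sup((frozenset([e]),)) >= minsup}
    answer = set(freq)
    while freq:  # generate candidates, count supports, keep the frequent ones
        candidates = set()
        for s in freq:
            for e in events:
                candidates.add(s + (frozenset([e]),))        # e as a new element
                if e not in s[-1]:
                    candidates.add(s[:-1] + (s[-1] | {e},))  # e joins the last element
        freq = {c for c in candidates if sup(c) >= minsup}
        answer |= freq
    return answer

# Hypothetical data set, not the one from Figure 7.5.
dataset = [[{1}, {2}], [{1, 2}, {2, 3}], [{1}, {2, 3}], [{2}, {1}], [{1}, {3}, {2}]]
patterns = mine_sequential_patterns(dataset, 0.5)
print((frozenset([1]), frozenset([2])) in patterns)  # -> True
```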

for sequences. The criteria for merging sequences are stated in the form of the

following procedure.

A sequence s(1) is merged with another sequence s(2) only if the subsequence obtained by dropping the first event in s(1) is identical to the subsequence obtained by dropping the last event in s(2). The resulting candidate is the sequence s(1) concatenated with the last event from s(2). The last event from s(2) can either be merged into the same element as the last event in s(1) or into a different element, depending on the following conditions:

1. If the last two events in s(2) belong to the same element, then the last event in s(2) is part of the last element in s(1) in the merged sequence.

2. If the last two events in s(2) belong to different elements, then the last event in s(2) becomes a separate element appended to the end of s(1) in the merged sequence.
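The merging criterion above can be sketched as follows. An assumption of this sketch: a sequence is a list of elements, each element a tuple of events in a fixed order, so that the dropped "first event" and "last event" are well defined.

```python
# Sketch of the sequence merging procedure for candidate generation.

def drop_first(seq):
    """Drop the first event (the first event of the first element)."""
    head, *rest = seq
    return ([head[1:]] if len(head) > 1 else []) + rest

def drop_last(seq):
    """Drop the last event (the last event of the last element)."""
    *rest, tail = seq
    return rest + ([tail[:-1]] if len(tail) > 1 else [])

def merge(s1, s2):
    """Return the candidate obtained by merging s1 with s2, or None
    if the two sequences do not satisfy the merging criterion."""
    if drop_first(s1) != drop_last(s2):
        return None
    last = s2[-1][-1]                  # the last event of s2
    if len(s2[-1]) > 1:                # last two events of s2 share an element
        return s1[:-1] + [s1[-1] + (last,)]
    return s1 + [(last,)]              # last event forms its own element

# <{1}{2}{3}> merged with <{2}{3}{4}> gives <{1}{2}{3}{4}>:
print(merge([(1,), (2,), (3,)], [(2,), (3,), (4,)]))  # -> [(1,), (2,), (3,), (4,)]
# <{1}{5}{3}> merged with <{5}{3,4}> gives <{1}{5}{3,4}>:
print(merge([(1,), (5,), (3,)], [(5,), (3, 4)]))      # -> [(1,), (5,), (3, 4)]
```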

Figure 7.6 illustrates examples of candidate 4-sequences obtained by merging pairs of frequent 3-sequences. The first candidate, <{1}{2}{3}{4}>, is obtained by merging <{1}{2}{3}> with <{2}{3}{4}>. Since events 3 and 4 belong to different elements of the second sequence, they also belong to separate elements in the merged sequence. On the other hand, merging <{1}{5}{3}> with <{5}{3,4}> produces the candidate 4-sequence <{1}{5}{3,4}>. In this case,

Figure 7.6. Example of the candidate generation and pruning steps of a sequential pattern mining algorithm.
Frequent 3-sequences: <{1}{2}{3}>, <{1}{2,5}>, <{1}{5}{3}>, <{2}{3}{4}>, <{2,5}{3}>, <{3}{4}{5}>, <{5}{3,4}>.
After candidate generation: <{1}{2}{3}{4}>, <{1}{2,5}{3}>, <{1}{5}{3,4}>, <{2}{3}{4}{5}>, <{2,5}{3,4}>.
After candidate pruning: <{1}{2,5}{3}>.

since events 3 and 4 belong to the same element of the second sequence, they are combined into the same element in the merged sequence. Finally, the sequences <{1}{2}{3}> and <{1}{2,5}> do not have to be merged because removing the first event from the first sequence does not give the same subsequence as removing the last event from the second sequence. Although <{1}{2,5}{3}> is a viable candidate, it is generated by merging a different pair of sequences, <{1}{2,5}> and <{2,5}{3}>. This example shows that the sequence merging procedure is complete; i.e., it will not miss any viable candidate, while at the same time, it avoids generating duplicate candidate sequences.

Candidate Pruning A candidate k-sequence is pruned if at least one of its (k-1)-sequences is infrequent. For example, suppose <{1}{2}{3}{4}> is a candidate 4-sequence. We need to check whether <{1}{2}{4}> and <{1}{3}{4}> are frequent 3-sequences. Since both are infrequent, the candidate <{1}{2}{3}{4}> can be eliminated. Readers should be able to verify that the only candidate 4-sequence that survives the candidate pruning step in Figure 7.6 is <{1}{2,5}{3}>.
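The pruning test can be sketched by dropping each event in turn and looking the result up in the frequent set. Assumptions of this sketch: sequences are lists of event tuples, and `frequent_3` is the set of frequent 3-sequences from Figure 7.6.

```python
# Sketch of candidate pruning: a candidate survives only if every
# (k-1)-subsequence obtained by dropping one event is frequent.

def subsequences(seq):
    """Yield all (k-1)-subsequences obtained by dropping a single event."""
    for i, element in enumerate(seq):
        for j in range(len(element)):
            smaller = element[:j] + element[j + 1:]
            yield seq[:i] + ([smaller] if smaller else []) + seq[i + 1:]

def survives_pruning(candidate, frequent):
    return all(tuple(s) in frequent for s in subsequences(candidate))

# The frequent 3-sequences of Figure 7.6 (3 events each):
frequent_3 = {((1,), (2,), (3,)), ((1,), (2, 5)), ((1,), (5,), (3,)),
              ((2,), (3,), (4,)), ((2, 5), (3,)), ((3,), (4,), (5,)),
              ((5,), (3, 4))}

# <{1}{2}{3}{4}> is pruned because <{1}{2}{4}>, among others, is infrequent:
print(survives_pruning([(1,), (2,), (3,), (4,)], frequent_3))  # -> False
# <{1}{2,5}{3}> survives: all four of its 3-subsequences are frequent:
print(survives_pruning([(1,), (2, 5), (3,)], frequent_3))      # -> True
```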

Support Counting During support counting, the algorithm enumerates all candidate k-sequences belonging to a particular data sequence and increments their support counts. After counting, the algorithm identifies the frequent k-sequences and discards all candidates whose support counts are less than the minsup threshold.

Figure 7.7. Timing constraints of a sequential pattern. The time window w of each element is characterized by [l, u], where l is the earliest and u the latest time of occurrence of an event within the window. The constraints shown are u(s_{i+1}) - l(s_i) <= maxgap, l(s_{i+1}) - u(s_i) > mingap, u(s_n) - l(s_1) <= maxspan, and the window size ws.

7.4.3 Timing Constraints

This section presents a sequential pattern formulation where timing constraints are imposed on the events and elements of a pattern. To motivate the need for timing constraints, consider the following sequences of courses taken by two students who enrolled in a data mining class:

Student A: <{Statistics}{Database Systems}{Data Mining}>.
Student B: <{Database Systems}{Statistics}{Data Mining}>.

The sequential pattern of interest is <{Statistics, Database Systems}{Data Mining}>, which means that students who are enrolled in the data mining class must have previously taken a course in statistics and database systems. Clearly, the pattern is supported by both students even though they do not take statistics and database systems at the same time. In contrast, a student who took a statistics course ten years earlier should not be considered as supporting the pattern because the time gap between the courses is too long. Because the formulation presented in the previous section does not incorporate these timing constraints, a new sequential pattern definition is needed.

Figure 7.7 illustrates some of the timing constraints that can be imposed on a pattern. The definition of these constraints and the impact they have on sequential pattern discovery algorithms will be discussed in the next sections. Note that each element of the sequential pattern is associated with a time window [l, u], where l is the earliest occurrence of an event within the time window and u is the latest occurrence of an event within the time window.


The maxspan Constraint

The maxspan constraint specifies the maximum allowed time difference between the latest and the earliest occurrences of events in the entire sequence. For example, suppose the following data sequences contain events that occur at consecutive time stamps (1, 2, 3, ...). Assuming that maxspan = 3, the following table contains sequential patterns that are supported and not supported by a given data sequence.

Data Sequence, s              Sequential Pattern, t   Does s support t?
<{1,3}{3,4}{4}{5}{6,7}{8}>    <{3}{4}>                Yes
<{1,3}{3,4}{4}{5}{6,7}{8}>    <{3}{6}>                Yes
<{1,3}{3,4}{4}{5}{6,7}{8}>    <{1,3}{6}>              No

In general, the longer the maxspan, the more likely it is to detect a pattern in a data sequence. However, a longer maxspan can also capture spurious patterns because it increases the chance for two unrelated events to be temporally related. In addition, the pattern may involve events that are already obsolete.

The maxspan constraint affects the support counting step of sequential pattern discovery algorithms. As shown in the preceding examples, some data sequences no longer support a candidate pattern when the maxspan constraint is imposed. If we simply apply Algorithm 7.1, the support counts for some patterns may be overestimated. To avoid this problem, the algorithm must be modified to ignore cases where the interval between the first and last occurrences of events in a given pattern is greater than maxspan.
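The modified support check can be sketched as a timestamp-aware containment test. Assumptions of this sketch: a timed data sequence is a list of (timestamp, element) pairs, each pattern element matches a single timed element (no time windows), and `supports` is an illustrative name.

```python
# Sketch of support checking under the maxspan constraint: some occurrence
# of the pattern must span at most maxspan time units.

def supports(timed_seq, pattern, maxspan):
    def match(i, j, first_time):
        # try to match pattern[j:] within timed_seq[i:]
        if j == len(pattern):
            return True
        for k in range(i, len(timed_seq)):
            t, element = timed_seq[k]
            if pattern[j] <= element:
                start = t if first_time is None else first_time
                if t - start <= maxspan and match(k + 1, j + 1, start):
                    return True
        return False
    return match(0, 0, None)

# Events at consecutive timestamps, as in the table above:
s = [(1, {1, 3}), (2, {3, 4}), (3, {4}), (4, {5}), (5, {6, 7}), (6, {8})]
print(supports(s, [{3}, {4}], 3))     # -> True
print(supports(s, [{3}, {6}], 3))     # -> True  (using the 3 at time 2)
print(supports(s, [{1, 3}, {6}], 3))  # -> False (span of 4 exceeds maxspan)
```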

The mingap and maxgap Constraints

Timing constraints can also be specified to restrict the time difference between two consecutive elements of a sequence. If the maximum time difference (maxgap) is one week, then events in one element must occur within a week's time of the events occurring in the previous element. If the minimum time difference (mingap) is zero, then events in one element must occur immediately after the events occurring in the previous element. The following table shows examples of patterns that pass or fail the maxgap and mingap constraints, assuming that maxgap = 3 and mingap = 1.

Data Sequence, s              Sequential Pattern, t   maxgap   mingap
<{1,3}{3,4}{4}{5}{6,7}{8}>    <{3}{6}>                Pass     Pass
<{1,3}{3,4}{4}{5}{6,7}{8}>    <{6}{8}>                Pass     Fail
<{1,3}{3,4}{4}{5}{6,7}{8}>    <{1,3}{6}>              Fail     Pass
<{1,3}{3,4}{4}{5}{6,7}{8}>    <{1}{3}{8}>             Fail     Fail
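The gap checks can be sketched the same way. Assumptions of this sketch: timed data sequences are lists of (timestamp, element) pairs, there are no time windows, and the mingap test is strict (gap > mingap), matching the constraint in Figure 7.7; `supports` is an illustrative name.

```python
# Sketch of support checking under the mingap/maxgap constraints:
# consecutive pattern elements must occur more than mingap and at most
# maxgap time units apart.

def supports(timed_seq, pattern, mingap, maxgap):
    def match(i, j, prev_time):
        if j == len(pattern):
            return True
        for k in range(i, len(timed_seq)):
            t, element = timed_seq[k]
            if pattern[j] <= element:
                if prev_time is None or mingap < t - prev_time <= maxgap:
                    if match(k + 1, j + 1, t):
                        return True
        return False
    return match(0, 0, None)

s = [(1, {1, 3}), (2, {3, 4}), (3, {4}), (4, {5}), (5, {6, 7}), (6, {8})]
print(supports(s, [{3}, {6}], 1, 3))     # -> True  (gap of 3 from time 2 to 5)
print(supports(s, [{6}, {8}], 1, 3))     # -> False (gap of 1 fails mingap)
print(supports(s, [{1, 3}, {6}], 1, 3))  # -> False (gap of 4 exceeds maxgap)
```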
