Figure 7.3. Example of a sequence database,
7.4 Sequential Patterns
Market basket data often contains temporal information about when an item was purchased by customers. Such information can be used to piece together the sequence of transactions made by a customer over a certain period of time. Similarly, event-based data collected from scientific experiments or the mon- itoring of physical systems such as telecommunications networks, computer networks, and wireless sensor networks, have an inherent sequential nature to them. This means that an ordinal relation, usually based on temporal or spatial precedence, exists among events occurring in such data. However, the concepts of association patterns discussed so far emphasize only co-occurrence relationships and disregard the sequential information of the data. The latter information may be valuable for identifying recurring features of a dynamic system or predicting future occurrences of certain events. This section presents the basic concept of sequential patterns and the algorithms developed to dis- cover them.
7.4.L Problem Formulation
The input to the problem of discovering sequential patterns is a sequence data set, which is shown on the left-hand side of Figure 7.3. Each row records the occurrences of events associated with a particular object at a given time. For example, the first row contains the set of events occurring at timestamp t : 10
43O Chapter 7 Association Analysis: Advanced Concepts
for object A. By sorting all the events associated with object A in increasing order of their timestamps, a sequence for object A is obtained, as shown on
the right-hand side of Figure 7.3. Generally speaking, a sequence is an ordered list of elements. A sequence
can be denoted as s : (ep2es . . .en),, where each element e3 is a collection of one or more events, i.e., ej : {h,’i2, . . . ,26}. The following is a list of examples of sequences:
o Sequence of Web pages viewed by a Web site visitor:
( {Homepage} {Electronics} {Cameras and Camcorders} {Digital Cam- eras) {Shopping Cart} {Order Confirmation} {Return to Shoppi”e} )
o Sequence of events leading to the nuclear accident at Three-Mile Island:
( {clogged resin} {outlet valve closure} {loss of feedwater} {condenser polisher outlet valve shut) {booster pumps trip} {main waterpump trips}
{main turbine trips} {reactor pressure increases} )
o Sequence of classes taken by a computer science major:
( {Algorithms and Data Structures, Introduction to Operating Systems}
{Database Systems, Computer Architecture} {Computer Networks, Sofb- ware Engineering) {Computer Graphics, Parallel Programming} )
A sequence can be characterized by its length and the number of occur- ring events. The length of a sequence corresponds to the number of elements present in the sequence, while a k-sequence is a sequence that contains k events. The Web sequence in the previous example contains 7 elements and 7 eventsl the event sequence at Three-Mile Island contains 8 elements and 8 events; and the class sequence contains 4 elements and 8 events.
Figure 7.4 provides examples of sequences, elements, and events defined for a variety of application domains. Except for the last row, the ordinal attribute associated with each of the first three domains corresponds to calendar time. For the last row, the ordinal attribute corresponds to the location of the bases (A, C, G, T) in the gene sequence. Although the discussion on sequential patterns is primarily focused on temporal events, it can be extended to the case where the events have spatial ordering.
Subsequences
A sequence f is a subsequence of another sequence s if each ordered element in f is a subset of an ordered element in s. Formally, the sequence 1 : (tfi2 . . .t^)
Sequence Database
Sequence Element (Transaction)
Event (ltem)
Customer Purchase history of a given customer
A set of items bought by a customer at time t
Books, diary products, CDs. etc
Web Data Browsing activity of a particular Web visitor
The collection of files viewed by a Web visitor after a single mouse click
Home page, index page, contact info, etc
Event data History of events generated by a given sensor
Events triggered by a sensor at time t
Types of alarms generated by sensors
Genome sequences
DNA sequence of a particular species
An element of the DNA sequence
Bases A,T,G,C
7.4 Sequential Patterns 431
Ordinal Attribute
Figure 7.4. Examples of elements and events in sequence data sets,
is a subsequence of s : (“r”2.. .s,) i f there exist integers L a j t < jz <.. . a jrn 1n such that h e sj , tz e sj” , . . . , t^ e sj* . I f t is a subsequence of s, then we say that t is contained in s. The following table gives examples illustrating the idea of subsequences for various sequences.
Sequence, s Sequence, f Is f a subsequence of s? <12,413,5,6 8 2 | { 3 ,61 {8 } > Yes <12,413’5,6 8 2 [ . { 8 Yes < { 1 , 2 } 3,41 > t t 12 No <{2 ,412,412,5 2 1 1 4 Yes
7.4.2 Sequential Pattern Discovery
Let D be a data set that contains one or more data sequences. The term data sequence refers to an ordered list of events associated with a single data object. For example, the data set shown in Figure 7.3 contains three data sequences, one for each object A, B, and C.
The support of a sequence s is the fraction of all data sequences that contain s. If the support for s is greater than or equal to a user-specified
432 Chapter 7 Association Analysis: Advanced Concepts
Obiect Timestamp Events A 1 1 . 2 , 4 A 2 2 , 3 A 3 5 B 1 1 , 2 B 2 2 , 3 , 4 c 1 1 , 2 c 2 2 , 3 , 4 c 3 2 , 4 , 5 D 1 2 D 2 3 , 4 D 3 4 , 5 E 1 1 , 3 E 2 2 , 4 , 5
Minsup = 50o/o
Examples of Sequential Patterns:
<{1,2}> s=607o <{2,3}> s=607o <{2,4\> s=807o <{3}{5}> s=807″ <{1}{2}> s=807″ <{2){2}> s=607″ <{1i{2,3}> s=607o <{2) {2,3)> s=607o <{1,2} {2,3}> s=607o
Figure 7.5. Sequential pafterns derived from a data set that contains five data sequences.
threshold m’insup, then s is declared to be a sequential pattern (or frequent
sequence).