To handle the continuous attribute, we apply the equal-frequency approach with 3, 4, and 6 intervals. Categorical attributes are handled by introducing as many new asymmetric binary attributes as the number of categorical values. Assume that the support threshold is 10% and the confidence threshold is 70%.

(a) Suppose we discretize the Age attribute into 3 equal-frequency intervals. Find a pair of values for a1 and a2 that satisfy the minimum support and minimum confidence requirements.

(b) Repeat part (a) by discretizing the Age attribute into 4 equal-frequency intervals. Compare the extracted rules against the ones you had obtained in part (a).

(c) Repeat part (a) by discretizing the Age attribute into 6 equal-frequency intervals. Compare the extracted rules against the ones you had obtained in part (a).

(d) From the results in part (u), (b), and (c), discuss how the choice of dis- cretization intervals will affect the rules extracted by association rule min- ing algorithms.

Consider the transactions shown in Table 7.15, with an item taxonomy given in Figure 7.25.

7.

478 Chapter 7 Association Analysis: Advanced Concepts

Table 7.15. Example ol market basket transactions.

Tbansaction ID Items tsought

1 z

2

4 r

tl

7

Chips, Cookies, Regular Soda, Ham Chips, Ham, Boneless Chicken, Diet Soda Ham, Bacon, Whole Chicken, Regular Soda Chips, Ham, Boneless Chicken, Diet Soda Chips, Bacon, Boneless Chicken Chips, Ham, Bacon, Whole Chicken, Regular Soda Chips, Cookies, Boneless Chicken, Diet Soda

(a) What are the main challenges of mining association rules with item tax- onomy?

(b) Consider the approach where each transaction t is replaced by an extended transaction tt that contains all the items in f as well as their respective ancestors. For example, the transaction t : { Chips, Cookies} will be replaced by tt : {Ctrips, Cookies, Snack Food, Food}. Use this approach to derive all frequent itemsets (up to size 4) with support > 70%.

(c) Consider an alternative approach where the frequent itemsets are gener- ated one level at a time. Initially, all the frequent itemsets involving items at the highest level of the hierarchy are generated. Next, we use the fre- quent itemsets discovered at the higher level of the hierarchy to generate candidate itemsets involving items at the lower levels of the hierarchy. For example, we generate the candidate itemset {Cnips, Diet Soda} only if

{Snack Food, Soda} is frequent. Use this approach to derive all frequent itemsets (up to size 4) with support > 70%.

(d) Compare the frequent itemsets found in parts (b) and (c). Comment on the efficiency and completeness of the algorithms.

The following questions examine how the support and confi.dence of an associ- ation rule may vary in the presence of a concept hierarchy.

(a) Consider an item r in a given concept hierarchy. Let 11, fr2, . . .,7r denote

the k children of r in the concept hierarchy. Show that “(r)

< Li:r t(T’), where s(.) is the support of an item. Under what conditions will the inequality become an equality?

(b) Let p and q denote a pair of items, while p and Q are their corresponding parents in the concept hierarchy. If s({p, q}) > mi,nsup, which of the fol- Iowing itemsets are guaranteed to be frequent? (i)

“({p, q}), (ii) s({p,4}),

and (i i i) s({p,,?})

(c) Consider the association rule {p} ——+ {q}. Suppose the confidence of the rule exceeds mi,nconf . Which of the following rules are guaranteed to

8.

9.

7.8 Exercises 479

have confidence higher than m’inconfZ (i) {p} —— {,?}, (ii) {p} ——+ {q}, and (iii) {p} —-.+ {4}.

(a) List all the 4-subsequences contained in the following data sequence:

< {1 ,3 } {2 } {2 ,3 } {4 } > ,

assuming no timing constraints.

(b) List all the 3-element subsequences contained in the data sequence for part (a) assuming that no timing constraints are imposed.

(c) List all the 4-subsequences contained in the data sequence for part (a) (assuming the timing constraints are flexible).

(d) List all the 3-element subsequences contained in the data sequence for part (a) (assuming the timing constraints are flexible).

Find all the frequent subsequences with support > 50% given the sequence database shown in Table 7.16. Assume that there are no timing constraints imposed on the sequences.

Table 7.1 6, Example of event sequences generated by various sensors.

Sensor Timestamp Events

S1 1 2 3 ^

A ,B C D,E C

S2 1 2 .)

A,B C,D E

S3 1 2 3 4

B A B D,E

S4 1 2 3 4

C D,E C E

S5 1 2 .) 4

B A B,C A,D

10.

480 Chapter 7 Association Analysis: Advanced Concepts

11 . (a ) Fo r each o f t he sequences r r :< â‚¬ rez . . . e t . . . e t+ t . . . â‚¬ tas t ) g i ven be low , determine whether they are subsequences of the sequence

< {7 , 2 , 3 } {2 , 4 } {2 ,4 , 5 } {3 , 5 } i 6 } >

subjected to the following timing constraints:

mingap : 6 (interval between last event in ea and first event in e6’r1 is > 0)

maxgap : 3 (interval between first event in e4 and last event in eial is < 3)

maxspan : 5 (interval between first event in e1 and last event in e1o”1 is < 5) (time between first and last events in ei is ( 1)

o , u ) : 1 { 1 } { 2 } { 3 } > o w : 1 { t , 2 , J , 4 } { 5 , 6 } > o r. t ) :1 {2, 4}{2,4}{Gi > o , u ) : 1 { 1 } { 2 , 4 } { 6 } > e u : ( { 1 , 2 } { 3 , 4 } { 5 , 6 } >

(b) Determine whether each of the subsequences to given in the previous ques- tion are contiguous subsequences of the following sequences s.

o s : { { I , 2 , 3 , 4 , 5 , 6 } { 1 , 2 , 3 , 4 , 5 , 6 } { 1 , 2 , 3 , 4 , 5 , 6 } > . s : < { 1 , 2 , 3 , 4 } { I , 2 , 3 , 4 , 5 , 6 } { 3 , 4 , 5 , 6 } > o s : ( { 1 , 2 } { 1 , 2 , 3 , 4 } { 3 , 4 , 5 , 6 } { 5 , 6 } > o s : ( { 1 , 2 , 3 } { 2 , 3 , 4 , 5 } { 4 , 5 , 6 } >

12. For each of the sequence w : \”t,. . .,eh”t) below, determine whether they are subsequences of the following data sequence:

({A, B}{C, D}{A, B}{C, D}{A, B}{C, D}>

subjected to the following timing constraints:

mingap : Q (interval between last event in ea and first event in e,a1 is > 0)

maxgap : 2 (interval between first event in ei and last event in e;a1 is S 2)

maxspan : 6 (interval between first event in er and last event in e1o”1 is < 6) (time between first and last events in ei is < 1)

(a) u, : ({A){B}{C}{D})

(b ) . : ( {A} {B ,C,D} {A) )

(c ) t r : ( {A} {8 , C,D} {A} )

(d ) tn : ( {B ,C} {A, D} {B,Ch

w s : 1 –

UJ.9 : 1

Exercises 48L

( ” ) , : ( { A , B , C , D } { A , B , C , D } )

13. Consider the following frequent 3-sequences:

< { r ,2 ,3 } > , < {1 ,2 } {3 } > , < {1 } {2 ,3 } > , < { r , 2 } {4 } > , < {1 ,3 } {4 } > , < {1 , 2 ,4 } > , < {2 ,3 } {3 } > , < {2 ,3 } {4 } > , < {2}{3}{3} ), and < {2}{3}{4} >.

List all the candidate 4-sequences produced by the candidate generation step of the GSP algorithm.

List all the candidate 4-sequences pruned during the candidate pruning step of the GSP algorithm (assuming no timing constraints).

List all the candidate 4-sequences pruned during the candidate pruning step of the GSP algorithm (assuming margap : l).

14. Consider the data sequence shown in Table 7.17 for a given object. Count the number of occurrences for the sequence ({p}{q}{“}) according to the following counting methods:

7.8

(a)

(b)

(c)

(a)

(b)

(“)

(d)

(e)

COBJ (one occurrence per object).

CWIN (one occurrence per sliding window).

CMINWIN (number of minimal windows of occurrence).

CDIST-O (distinct occurrences with possibility of event-timestamp over- Iap).

CDIST (distinct occurrences with no event timestamp overlap allowed).

Table 7.17. Example of event sequence data for Exercise 14.

Timestamp Events 1 2 t)

4 tr

6 7 8 9 10

P ‘ Q r S

P , Q f r s

p

Q t r

Q r S

p

Q r r t S

482 Chapter 7 Association Analysis: Advanced Concepts

15. Describe the types of modifications necessary to adapt the frequent subgraph mining algorithm to handle:

(a) Directed graphs

(b) Unlabeled graphs

(c) Acyclic graphs

(d) Disconnected graphs

For each type of graph given above, describe which step of the algorithm will be affected (candidate generation, candidate pruning, and support counting), and any further optimization that can help improve the efEciency of the algorithm.

16. Draw all candidate subgraphs obtained from joining the pair of graphs shown in Figure 7.28. Assume the edge-growing method is used to expand the subgraphs.

(a)

17. by joining the pair of graphs shown method is used to expand the sub-

(b)

Figure 7.28. Graphs for Exercise 16.

Draw all the candidate subgraphs obtained in Figure 7.29. Assume the edge-growing graphs.

7.8 Exercises 483

(a)

18. (a)

(b)

(c)

le. (a)

(d)

(b)

Figure 7.29, Graphs for Exercise 17.

If support is defined in terms of induced subgraph relationship, show that the confidence ofthe rule 91 – gz can be greater than 1 ifgl and !2 are allowed to have overlapping vertex sets.

What is the time complexity needed to determine the canonical label of a graph that contains lVl vertices?

The core of a subgraph can have multiple automorphisms. This will in- crease the number of candidate subgraphs obtained after merging two frequent subgraphs that share the same core. Determine the maximum number of candidate subgraphs obtained due to automorphism of a core of size k.

Two frequent subgraphs of size k may share multiple cores. Determine the maximum number of cores that can be shared by the two frequent subgraphs.

Consider a graph mining algorithm that uses the edge-growing method to join the two undirected and unweighted subgraphs shown in Figure 19a.

+

484 Chapter 7 Association Analysis: Advanced Concepts

Draw all the distinct cores obtained when meiging the two subgraphs.

How many candidates are generated using the following core?

The original association rule mining framework considers only presence of items together in the same transaction. There are situations in which itemsets that are infrequent may also be informative. For instance, the itemset TV, DVD, – VCR suggests that many customers who buy TVs and DVDs do not buy VCRs.

In this problem, you are asked to extend the association rule framework to neg- ative itemsets (i.e., itemsets that contain both presence and absence of items). We will use the negation symbol (-) to refer to absence of items.

(a) A naive way for deriving negative itemsets is to extend each transaction to include absence of items as shown in Table 7.18.

Table 7.18, Example of numeric data set.

TID TV -TV DVD -DVD VCR -VCR 1 2

I 1

0 0

0 0

1 I

0 0

I I

i. Suppose the transaction database contains 1000 distinct items. What is the total number of positive itemsets that can be generated from these items? (Note: A positive itemset does not contain any negated items).

ii. What is the maximum number of frequent itemsets that can be gen- erated from these transactions? (Assume that a frequent itemset may contain positive, negative, or both types of items)

iii. Explain why such a naive method of extending each transaction with negative items is not practical for deriving negative itemsets.