



7.8 Exercises

1. Consider the traffic accident data set shown in Table 7.10.

Table 7.10. Traffic accident data set.

Weather    Driver's          Traffic                 Seat   Crash
Condition  Condition         Violation               Belt   Severity
---------  ----------------  ----------------------  ----   --------
Good       Alcohol-impaired  Exceed speed limit      No     Major
Bad        Sober             None                    Yes    Minor
Good       Sober             Disobey stop sign       Yes    Minor
Good       Sober             Exceed speed limit      Yes    Major
Bad        Sober             Disobey traffic signal  No     Major
Good       Alcohol-impaired  Disobey stop sign       Yes    Minor
Bad        Alcohol-impaired  None                    Yes    Major
Good       Sober             Disobey traffic signal  Yes    Major
Good       Alcohol-impaired  None                    No     Major
Bad        Sober             Disobey traffic signal  No     Major
Good       Alcohol-impaired  Exceed speed limit      Yes    Major
Bad        Sober             Disobey stop sign       Yes    Minor

(a) Show a binarized version of the data set.

(b) What is the maximum width of each transaction in the binarized data?

(c) Assuming that the support threshold is 30%, how many candidate and frequent itemsets will be generated?

(d) Create a data set that contains only the following asymmetric binary attributes: (Weather = Bad, Driver's Condition = Alcohol-impaired, Traffic Violation = Yes, Seat Belt = No, Crash Severity = Major). For Traffic Violation, only None has a value of 0; the rest of the attribute values are assigned to 1. Assuming that the support threshold is 30%, how many candidate and frequent itemsets will be generated?

(e) Compare the number of candidate and frequent itemsets generated in parts (c) and (d).
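The binarization step asked for in part (a) can be sketched programmatically. The snippet below is an illustrative sketch only (the shortened column names are mine, not part of the exercise): each (attribute, value) pair of Table 7.10 becomes one binary column, so every transaction has width 5, one item per original attribute.

```python
# Sketch: binarize the categorical attributes of Table 7.10.
# Each (attribute, value) pair becomes one binary column.

records = [
    # (Weather, Driver's Condition, Traffic Violation, Seat Belt, Severity)
    ("Good", "Alcohol-impaired", "Exceed speed limit",     "No",  "Major"),
    ("Bad",  "Sober",            "None",                   "Yes", "Minor"),
    ("Good", "Sober",            "Disobey stop sign",      "Yes", "Minor"),
    ("Good", "Sober",            "Exceed speed limit",     "Yes", "Major"),
    ("Bad",  "Sober",            "Disobey traffic signal", "No",  "Major"),
    ("Good", "Alcohol-impaired", "Disobey stop sign",      "Yes", "Minor"),
    ("Bad",  "Alcohol-impaired", "None",                   "Yes", "Major"),
    ("Good", "Sober",            "Disobey traffic signal", "Yes", "Major"),
    ("Good", "Alcohol-impaired", "None",                   "No",  "Major"),
    ("Bad",  "Sober",            "Disobey traffic signal", "No",  "Major"),
    ("Good", "Alcohol-impaired", "Exceed speed limit",     "Yes", "Major"),
    ("Bad",  "Sober",            "Disobey stop sign",      "Yes", "Minor"),
]
names = ["Weather", "Driver", "Violation", "SeatBelt", "Severity"]

# Collect every distinct (attribute, value) pair as one binary column.
columns = sorted({(n, v) for row in records for n, v in zip(names, row)})
binarized = [[int(row[names.index(n)] == v) for (n, v) in columns]
             for row in records]
```

Since each transaction sets exactly one binary attribute per original attribute, each row of `binarized` sums to 5, which answers the "maximum width" question in part (b) directly.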

2. (a) Consider the data set shown in Table 7.11. Suppose we apply the following discretization strategies to the continuous attributes of the data set.

D1: Partition the range of each continuous attribute into 3 equal-sized bins.

D2: Partition the range of each continuous attribute into 3 bins, where each bin contains an equal number of transactions.
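The two strategies can be sketched as follows. This is an illustrative helper, not part of the exercise; the function names and the clamping choice for the maximum value are mine. `equal_width` splits an attribute's range into bins of equal length (D1), while `equal_depth` puts roughly the same number of transactions in each bin (D2).

```python
# Sketch of the two discretization strategies D1 and D2.

def equal_width(values, k=3):
    """Assign each value to one of k equal-length intervals over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # Clamp so the maximum value falls in the last bin rather than bin k.
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_depth(values, k=3):
    """Assign bins so every bin holds ~len(values)/k transactions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins

temps = [95, 85, 103, 97, 80, 100, 83, 86, 101]  # Temperature column, Table 7.11
print(equal_width(temps))  # -> [1, 0, 2, 2, 0, 2, 0, 0, 2]
print(equal_depth(temps))  # -> [1, 0, 2, 1, 0, 2, 0, 1, 2]
```

Note how the same value can land in different bins under the two strategies, which is why parts i and ii must be answered separately for D1 and D2.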

 

 

474 Chapter 7 Association Analysis: Advanced Concepts

Table 7.11. Data set for Exercise 2.

TID   Temperature   Pressure   Alarm 1   Alarm 2   Alarm 3
 1        95          1105        0         0         1
 2        85          1040        1         1         0
 3       103          1090        1         1         1
 4        97          1084        1         0         0
 5        80          1038        0         1         1
 6       100          1080        1         1         0
 7        83          1025        1         0         1
 8        86          1030        1         0         0
 9       101          1100        1         1         1

For each strategy, answer the following questions:

i. Construct a binarized version of the data set.

ii. Derive all the frequent itemsets having support > 30%.

(b) The continuous attributes can also be discretized using a clustering approach.

i. Plot a graph of temperature versus pressure for the data points shown in Table 7.11.

ii. How many natural clusters do you observe from the graph? Assign a label (C1, C2, etc.) to each cluster in the graph.

iii. What type of clustering algorithm do you think can be used to identify the clusters? State your reasons clearly.

iv. Replace the temperature and pressure attributes in Table 7.11 with asymmetric binary attributes C1, C2, etc. Construct a transaction matrix using the new attributes (along with attributes Alarm1, Alarm2, and Alarm3).

v. Derive all the frequent itemsets having support > 30% from the binarized data.

3. Consider the data set shown in Table 7.12. The first attribute is continuous, while the remaining two attributes are asymmetric binary. A rule is considered to be strong if its support exceeds 15% and its confidence exceeds 60%. The data given in Table 7.12 supports the following two strong rules:

(i)  {(1 ≤ A ≤ 2), B = 1} → {C = 1}

(ii) {(5 ≤ A ≤ 8), B = 1} → {C = 1}

(a) Compute the support and confidence for both rules.

(b) To find the rules using the traditional Apriori algorithm, we need to discretize the continuous attribute A. Suppose we apply the equal width


 

 


Table 7.12. Data set for Exercise 3.

 A    B    C
 1    1    1
 2    1    1
 3    1    0
 4    1    0
 5    1    1
 6    0    1
 7    0    0
 8    1    1
 9    0    0
10    0    0
11    0    0
12    0    1

binning approach to discretize the data, with bin-width = 2, 3, 4. For each bin-width, state whether the above two rules are discovered by the Apriori algorithm. (Note that the rules may not be in the same exact form as before because they may contain wider or narrower intervals for A.) For each rule that corresponds to one of the above two rules, compute its support and confidence.

(c) Comment on the effectiveness of using the equal width approach for classifying the above data set. Is there a bin-width that allows you to find both rules satisfactorily? If not, what alternative approach can you take to ensure that you will find both rules?
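As a sanity check for rules of this form, support and confidence can be computed directly from the table. The sketch below is illustrative only (the helper name is mine); it counts, over the data of Table 7.12, the transactions matching the rule body {lo ≤ A ≤ hi, B = 1} and those also matching the head {C = 1}.

```python
# Sketch: support and confidence of {lo <= A <= hi, B = 1} -> {C = 1}
# over the data of Table 7.12.

rows = list(zip(range(1, 13),                            # A = 1..12
                [1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0],    # B
                [1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1]))   # C

def rule_stats(lo, hi):
    body = [r for r in rows if lo <= r[0] <= hi and r[1] == 1]
    full = [r for r in body if r[2] == 1]
    support = len(full) / len(rows)       # fraction of all transactions
    confidence = len(full) / len(body)    # fraction of body transactions
    return support, confidence

print(rule_stats(1, 2))   # rule (i):  support 2/12 ~ 16.7%, confidence 100%
print(rule_stats(5, 8))   # rule (ii): support 2/12 ~ 16.7%, confidence 100%
```

Both rules clear the 15% support and 60% confidence thresholds, consistent with the exercise's claim that they are strong.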

4. Consider the data set shown in Table 7.13.

Table 7.13. Data set for Exercise 4.

            Number of Hours Online per Week (B)
Age (A)     0-5    5-10   10-20   20-30   30-40
10-15        2      3       5       3       2
15-25        2      5      10      10       3
25-35       10     15       5       3       2
35-50        4      6       5       3       2

(a) For each combination of rules given below, specify the rule that has the highest confidence.

i. 15 < A ≤ 25 → 10 < B ≤ 20, 10 < A ≤ 25 → 10 < B ≤ 20, and 15 < A ≤ 35 → 10 < B ≤ 20.

 

 


ii. 15 < A ≤ 25 → 10 < B ≤ 20, 15 < A ≤ 25 → 5 < B ≤ 20, and 15 < A ≤ 25 → 5 < B ≤ 30.

iii. 15 < A ≤ 25 → 10 < B ≤ 20 and 10 < A ≤ 35 → 5 < B ≤ 30.

(b) Suppose we are interested in finding the average number of hours spent online per week by Internet users between the age of 15 and 35. Write the corresponding statistics-based association rule to characterize the segment of users. To compute the average number of hours spent online, approximate each interval by its midpoint value (e.g., use B = 7.5 to represent the interval 5 < B ≤ 10).

(c) Test whether the quantitative association rule given in part (b) is statistically significant by comparing its mean against the average number of hours spent online by other users who do not belong to the age group.
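The comparison of means called for above is a standard two-sample test. A minimal sketch of the usual Z statistic (difference of group means over its standard error) is shown below; the sample summaries are hypothetical placeholders, not values derived from Table 7.13.

```python
import math

def z_statistic(mean1, var1, n1, mean2, var2, n2):
    """Two-sample Z statistic: difference of means over its standard error."""
    return (mean1 - mean2) / math.sqrt(var1 / n1 + var2 / n2)

# Hypothetical summaries: users inside the age range vs. everyone else.
z = z_statistic(mean1=20.0, var1=25.0, n1=50, mean2=18.0, var2=36.0, n2=40)
# Reject equality of the two means at the 5% level (two-sided) if |z| > 1.96.
print(round(z, 3))  # -> 1.69
```

Here |z| = 1.69 < 1.96, so under these hypothetical numbers the difference would not be significant at the 5% level.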

5. For the data set with the attributes given below, describe how you would convert it into a binary transaction data set appropriate for association analysis. Specifically, indicate for each attribute in the original data set

(a) how many binary attributes it would correspond to in the transaction data set,

(b) how the values of the original attribute would be mapped to values of the binary attributes, and

(c) if there is any hierarchical structure in the data values of an attribute that could be useful for grouping the data into fewer binary attributes.

The following is a list of attributes for the data set along with their possible values. Assume that all attributes are collected on a per-student basis:

o Year : Freshman, Sophomore, Junior, Senior, Graduate:Masters, Graduate:PhD, Professional

o Zip code : zip code for the home address of a U.S. student, zip code for the local address of a non-U.S. student

o College : Agriculture, Architecture, Continuing Education, Education, Liberal Arts, Engineering, Natural Sciences, Business, Law, Medical, Dentistry, Pharmacy, Nursing, Veterinary Medicine

o On Campus : 1 if the student lives on campus, 0 otherwise

o Each of the following is a separate attribute that has a value of 1 if the person speaks the language and a value of 0 otherwise.

– Arabic
– Bengali
– Chinese Mandarin
– English
– Portuguese



 

 


– Russian
– Spanish

6. Consider the data set shown in Table 7.14. Suppose we are interested in extracting the following association rule:

{α1 ≤ Age ≤ α2, Play Piano = Yes} → {Enjoy Classical Music = Yes}

Table 7.14. Data set for Exercise 6.

Age   Play Piano   Enjoy Classical Music

