17. Table 5.14 shows the posterior probabilities assigned to ten test instances by two classification models, M1 and M2.

(a) Plot the ROC curves for both M1 and M2 on the same graph. Which model do you think is better? Explain your reasons.

(b) For model M1, suppose you choose the cutoff threshold to be t = 0.5. In other words, any test instance whose posterior probability is greater than t will be classified as a positive example. Compute the precision, recall, and F-measure for the model at this threshold value.
Table 5.14. Posterior probabilities for Exercise 17.

Instance   True Class   P(+|A,...,Z,M1)   P(+|A,...,Z,M2)
    1          +             0.73              0.61
    2          +             0.69              0.03
    3          -             0.44              0.68
    4          -             0.55              0.31
    5          +             0.67              0.45
    6          +             0.47              0.09
    7          -             0.08              0.38
    8          -             0.15              0.05
    9          +             0.45              0.01
   10          -             0.35              0.04
(c) Repeat the analysis for part (b) using the same cutoff threshold on model M2. Compare the F-measure results for both models. Which model is better? Are the results consistent with what you expect from the ROC curve?

(d) Repeat part (c) for model M1 using the threshold t = 0.1. Which threshold do you prefer, t = 0.5 or t = 0.1? Are the results consistent with what you expect from the ROC curve?
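A minimal Python sketch (an illustration, not part of the original exercise) of the computations parts (b) through (d) call for. The helper name evaluate_at_threshold is ours; the data arrays reproduce the true classes and posterior probabilities of Table 5.14.

def evaluate_at_threshold(probs, labels, t):
    """Classify an instance as '+' when its posterior probability exceeds t,
    then compute precision, recall, and F-measure with respect to '+'."""
    preds = ['+' if p > t else '-' for p in probs]
    tp = sum(1 for p, y in zip(preds, labels) if p == '+' and y == '+')
    fp = sum(1 for p, y in zip(preds, labels) if p == '+' and y == '-')
    fn = sum(1 for p, y in zip(preds, labels) if p == '-' and y == '+')
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure

# True classes and posterior probabilities from Table 5.14.
labels = ['+', '+', '-', '-', '+', '+', '-', '-', '+', '-']
m1 = [0.73, 0.69, 0.44, 0.55, 0.67, 0.47, 0.08, 0.15, 0.45, 0.35]
m2 = [0.61, 0.03, 0.68, 0.31, 0.45, 0.09, 0.38, 0.05, 0.01, 0.04]

print(evaluate_at_threshold(m1, labels, 0.5))  # part (b)
print(evaluate_at_threshold(m2, labels, 0.5))  # part (c)
print(evaluate_at_threshold(m1, labels, 0.1))  # part (d)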
18. Following is a data set that contains two attributes, X and Y, and two class labels, "+" and "-". Each attribute can take three different values: 0, 1, or 2. The concept for the "+" class is Y = 1 and the concept for the "-" class is X = 0 ∨ X = 2.
(a) Build a decision tree on the data set. Does the tree capture the "+" and "-" concepts?
X   Y   Number of + Instances   Number of - Instances
0   0             0                      100
1   0             0                        0
2   0             0                      100
0   1            10                      100
1   1            10                        0
2   1            10                      100
0   2             0                      100
1   2             0                        0
2   2             0                      100
(b) What are the accuracy, precision, recall, and F1-measure of the decision tree? (Note that precision, recall, and F1-measure are defined with respect to the "+" class.)
(c) Build a new decision tree with the following cost function:
$$C(i,j) = \begin{cases} 0, & \text{if } i = j \\ 1, & \text{if } i = +,\ j = - \\ \dfrac{\text{Number of } -\ \text{instances}}{\text{Number of } +\ \text{instances}}, & \text{if } i = -,\ j = + \end{cases}$$

(Hint: only the leaves of the old decision tree need to be changed.) Does the decision tree capture the "+" concept?
(d) What are the accuracy, precision, recall, and F1-measure of the new decision tree?
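A sketch of the hint in part (c), assuming leaf relabeling under a cost matrix in which correct classifications cost 0. The names relabel_leaf, c_fn, and c_fp are ours (the costs of classifying a "+" instance as "-" and a "-" instance as "+", respectively), and the leaf counts in the usage lines are illustrative.

def relabel_leaf(n_pos, n_neg, c_fn, c_fp):
    """Label a leaf with whichever class gives the lower total cost.

    n_pos, n_neg: number of '+' and '-' training instances in the leaf
    c_fn: cost of classifying a '+' instance as '-'
    c_fp: cost of classifying a '-' instance as '+'
    """
    cost_if_pos = n_neg * c_fp  # predict '+': every '-' instance is misclassified
    cost_if_neg = n_pos * c_fn  # predict '-': every '+' instance is misclassified
    return '+' if cost_if_pos < cost_if_neg else '-'

# With uniform costs, a leaf holding 10 '+' and 100 '-' instances stays '-'.
print(relabel_leaf(10, 100, c_fn=1, c_fp=1))   # -> '-'
# When misclassifying '+' costs the '-'-to-'+' ratio of this data set
# (600/30 = 20), the comparison becomes 100*1 versus 10*20, and the
# same leaf flips to '+'.
print(relabel_leaf(10, 100, c_fn=20, c_fp=1))  # -> '+'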
19. (a) Consider the cost matrix for a two-class problem. Let C(+,+) = C(-,-) = p, C(+,-) = C(-,+) = q, and q > p. Show that minimizing the cost function is equivalent to maximizing the classifier's accuracy.
(b) Show that a cost matrix is scale-invariant. For example, if the cost matrix is rescaled to C'(i,j) = βC(i,j), where β is the scaling factor, the decision threshold (Equation 5.82) will remain unchanged.
(c) Show that a cost matrix is translation-invariant. In other words, adding a constant factor to all entries in the cost matrix will not affect the decision threshold (Equation 5.82).
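A short derivation sketch for parts (b) and (c), assuming Equation 5.82 (which lies outside this excerpt) is the usual cost-based decision threshold. With $p = P(+\mid \mathbf{x})$ and $C(i,j)$ the cost of classifying a class-$i$ record as class $j$, a record is assigned to $+$ when that choice has the lower expected cost:
$$p\,C(+,+) + (1-p)\,C(-,+) \;\le\; p\,C(+,-) + (1-p)\,C(-,-),$$
which rearranges to the threshold
$$p \;\ge\; \frac{C(-,+) - C(-,-)}{\bigl(C(-,+) - C(-,-)\bigr) + \bigl(C(+,-) - C(+,+)\bigr)}.$$
Rescaling every entry by $\beta > 0$ multiplies the numerator and denominator by $\beta$, and adding a constant to every entry cancels inside each difference, so the threshold is unchanged in both cases.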
20. Consider the task of building a classifier from random data, where the attribute values are generated randomly irrespective of the class labels. Assume the data set contains records from two classes, "+" and "-". Half of the data set is used for training while the remaining half is used for testing.
(a) Suppose there are an equal number of positive and negative records in the data and the decision tree classifier predicts every test record to be positive. What is the expected error rate of the classifier on the test data?
(b) Repeat the previous analysis assuming that the classifier predicts each test record to be the positive class with probability 0.8 and the negative class with probability 0.2.
(c) Suppose two-thirds of the data belong to the positive class and the re- maining one-third belong to the negative class. What is the expected error of a classifier that predicts every test record to be positive?
(d) Repeat the previous analysis assuming that the classifier predicts each test record to be the positive class with probability 2/3 and the negative class with probability 1/3.
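The reasoning needed here can be checked with a small Python simulation (an illustration, not part of the exercise). A classifier that ignores the attributes errs exactly when its random prediction disagrees with the independently drawn true class, so the expected error is P(predict +) P(class -) + P(predict -) P(class +); the function names below are ours.

import random

def expected_error(p_class_pos, p_pred_pos):
    """Analytic expected error of a label-blind classifier."""
    return p_pred_pos * (1 - p_class_pos) + (1 - p_pred_pos) * p_class_pos

def simulate(p_class_pos, p_pred_pos, n=100_000, seed=0):
    """Empirical error rate over n independently drawn test records."""
    rng = random.Random(seed)
    errors = sum(
        (rng.random() < p_class_pos) != (rng.random() < p_pred_pos)
        for _ in range(n)
    )
    return errors / n

# The four scenarios of parts (a)-(d).
for p_cls, p_pred in [(0.5, 1.0), (0.5, 0.8), (2 / 3, 1.0), (2 / 3, 2 / 3)]:
    print(p_cls, p_pred, expected_error(p_cls, p_pred), simulate(p_cls, p_pred))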
21. Derive the dual Lagrangian for the linear SVM with nonseparable data where the objective function is
$$f(\mathbf{w}) = \frac{\|\mathbf{w}\|^2}{2} + C\left(\sum_{i=1}^{N} \xi_i\right)^2.$$
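As a starting point (a sketch, not part of the exercise), assume the standard soft-margin constraints $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$. The primal Lagrangian is then
$$L = \frac{\|\mathbf{w}\|^2}{2} + C\Bigl(\sum_{i=1}^{N} \xi_i\Bigr)^2 - \sum_{i=1}^{N} \lambda_i \bigl[y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1 + \xi_i\bigr] - \sum_{i=1}^{N} \mu_i \xi_i,$$
and setting its derivatives with respect to $\mathbf{w}$, $b$, and $\xi_i$ to zero gives $\mathbf{w} = \sum_i \lambda_i y_i \mathbf{x}_i$, $\sum_i \lambda_i y_i = 0$, and $2C \sum_j \xi_j = \lambda_i + \mu_i$; substituting these back yields the requested dual.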
22. Consider the XOR problem where there are four training points:
(1, 1, -), (1, 0, +), (0, 1, +), (0, 0, -).
Transform the data into the following feature space:
$$\Phi(x_1, x_2) = \left(1,\ \sqrt{2}\,x_1,\ \sqrt{2}\,x_2,\ \sqrt{2}\,x_1 x_2,\ x_1^2,\ x_2^2\right).$$
Find the maximum margin linear decision boundary in the transformed space.
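A small Python sketch (ours, not from the text) that applies the map Φ to the four training points. In the transformed space the classes become linearly separable; for instance, x1 + x2 - 2 x1 x2, which is linear in the Φ coordinates, equals 1 on the "+" points and 0 on the "-" points.

from math import sqrt

def phi(x1, x2):
    """The feature map Phi from the exercise."""
    return (1, sqrt(2) * x1, sqrt(2) * x2, sqrt(2) * x1 * x2, x1 ** 2, x2 ** 2)

points = [((1, 1), '-'), ((1, 0), '+'), ((0, 1), '+'), ((0, 0), '-')]
for (x1, x2), label in points:
    print(label, phi(x1, x2))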
23. Given the data sets shown in Figure 5.49, explain how the decision tree, naive Bayes, and k-nearest neighbor classifiers would perform on these data sets.
[Figure 5.49, panels (a)-(f): Synthetic data sets 1 through 6 (figure not reproduced). Each panel shows records from Class A and Class B described by distinguishing attributes and noise attributes; in the later panels the records are described by two distinguishing attribute sets (one 60% filled with 1, the other 40% filled with 1) plus noise attributes, and the Class A and Class B regions alternate.]
Figure 5.49. Data set for Exercise 23.
Association Analysis: Basic Concepts and Algorithms
Many business enterprises accumulate large quantities of data from their day-to-day operations. For example, huge amounts of customer purchase data are collected daily at the checkout counters of grocery stores. Table 6.1 illustrates an example of such data, commonly known as market basket transactions. Each row in this table corresponds to a transaction, which contains a unique identifier labeled TID and a set of items bought by a given customer. Retailers are interested in analyzing the data to learn about the purchasing behavior of their customers. Such valuable information can be used to support a variety of business-related applications such as marketing promotions, inventory management, and customer relationship management.
This chapter presents a methodology known as association analysis, which is useful for discovering interesting relationships hidden in large data sets. The uncovered relationships can be represented in the form of association rules or sets of frequent items.
Table 6.1. An example of market basket transactions.

TID   Items
1     {Bread, Milk}
2     {Bread, Diapers, Beer, Eggs}
3     {Milk, Diapers, Beer, Cola}
4     {Bread, Milk, Diapers, Beer}
5     {Bread, Milk, Diapers, Cola}
For example, the following rule can be extracted from the data set shown in Table 6.1:
{Diapers} → {Beer}.
The rule suggests that a strong relationship exists between the sale of diapers and beer because many customers who buy diapers also buy beer. Retailers can use this type of rule to help them identify new opportunities for cross-selling their products to customers.
Besides market basket data, association analysis is also applicable to other application domains such as bioinformatics, medical diagnosis, Web mining, and scientific data analysis. In the analysis of Earth science data, for example, the association patterns may reveal interesting connections among the ocean, land, and atmospheric processes. Such information may help Earth scientists develop a better understanding of how the different elements of the Earth system interact with each other. Even though the techniques presented here are generally applicable to a wider variety of data sets, for illustrative purposes, our discussion will focus mainly on market basket data.
There are two key issues that need to be addressed when applying association analysis to market basket data. First, discovering patterns from a large transaction data set can be computationally expensive. Second, some of the discovered patterns are potentially spurious because they may happen simply by chance. The remainder of this chapter is organized around these two issues. The first part of the chapter is devoted to explaining the basic concepts of association analysis and the algorithms used to efficiently mine such patterns. The second part of the chapter deals with the issue of evaluating the discovered patterns in order to prevent the generation of spurious results.
6.1 Problem Definition
This section reviews the basic terminology used in association analysis and presents a formal description of the task.
Binary Representation Market basket data can be represented in a binary format as shown in Table 6.2, where each row corresponds to a transaction and each column corresponds to an item. An item can be treated as a binary variable whose value is one if the item is present in a transaction and zero otherwise. Because the presence of an item in a transaction is often considered more important than its absence, an item is an asymmetric binary variable.
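As a minimal sketch of this conversion (the code and variable names are ours), the transactions of Table 6.1 can be turned into the 0/1 representation of Table 6.2 by encoding the presence of an item as 1 and its absence as 0:

transactions = {
    1: {'Bread', 'Milk'},
    2: {'Bread', 'Diapers', 'Beer', 'Eggs'},
    3: {'Milk', 'Diapers', 'Beer', 'Cola'},
    4: {'Bread', 'Milk', 'Diapers', 'Beer'},
    5: {'Bread', 'Milk', 'Diapers', 'Cola'},
}
items = ['Bread', 'Milk', 'Diapers', 'Beer', 'Eggs', 'Cola']

# Each row of Table 6.2: 1 if the item occurs in the transaction, else 0.
for tid, basket in transactions.items():
    print(tid, [1 if item in basket else 0 for item in items])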
Table 6.2. A binary 0/1 representation of market basket data.

TID   Bread   Milk   Diapers   Beer   Eggs   Cola
1       1      1        0       0      0      0
2       1      0        1       1      1      0
3       0      1        1       1      0      1
4       1      1        1       1      0      0
5       1      1        1       0      0      1
This representation is perhaps a very simplistic view of real market basket data because it ignores certain important aspects of the data such as the quantity of items sold or the price paid to purchase them. Methods for handling such non-binary data will be explained in Chapter 7.
Itemset and Support Count Let I = {i1, i2, ..., id} be the set of all items in a market basket data and T = {t1, t2, ..., tN} be the set of all transactions. Each transaction ti contains a subset of items chosen from I. In association analysis, a collection of zero or more items is termed an itemset. If an itemset contains k items, it is called a k-itemset. For instance, {Beer, Diapers, Milk} is an example of a 3-itemset. The null (or empty) set is an itemset that does not contain any items.

The transaction width is defined as the number of items present in a transaction.
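A short Python sketch of these notions (ours, not from the text), taking the support count named in the heading above to be, per standard usage, the number of transactions that contain the itemset:

transactions = [
    {'Bread', 'Milk'},
    {'Bread', 'Diapers', 'Beer', 'Eggs'},
    {'Milk', 'Diapers', 'Beer', 'Cola'},
    {'Bread', 'Milk', 'Diapers', 'Beer'},
    {'Bread', 'Milk', 'Diapers', 'Cola'},
]

def support_count(itemset, transactions):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)  # subset test

def width(transaction):
    """Transaction width: the number of items the transaction contains."""
    return len(transaction)

print(support_count({'Beer', 'Diapers', 'Milk'}, transactions))  # -> 2
print([width(t) for t in transactions])                          # -> [2, 4, 4, 4, 4]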