Figure 5.42. Constructing an ROC curve.
Figure 5.43. ROC curve for the data shown in Figure 5.42.
Labels of the test records are shown in the first row of the table. The second row corresponds to the sorted output values for each record. For example, they may correspond to the posterior probabilities P(+|x) generated by a naive Bayes classifier. The next six rows contain the counts of TP, FP, TN, and FN, along with their corresponding TPR and FPR. The table is then filled from left to right. Initially, all the records are predicted to be positive. Thus, TP = FP = 5 and TPR = FPR = 1. Next, we assign the test record with the lowest output value to the negative class. Because the selected record is actually a positive example, the TP count reduces from 5 to 4 while the FP count stays the same as before. The FPR and TPR are updated accordingly. This process is repeated until we reach the end of the list, where TPR = 0 and FPR = 0. The ROC curve for this example is shown in Figure 5.43.
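To make the left-to-right construction concrete, the following sketch traces the same procedure in Python. The function name, as well as the labels and scores at the bottom, are illustrative inventions rather than the data of Figure 5.42, and ties among scores are not handled.

def roc_points(labels, scores):
    """labels: '+'/'-' per record; scores: classifier outputs, e.g., P(+|x)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])  # lowest score first
    P = labels.count('+')
    N = labels.count('-')
    tp, fp = P, N                       # initially every record is predicted positive
    points = [(fp / N, tp / P)]         # (FPR, TPR) = (1, 1)
    for i in order:                     # reassign the record with the next lowest score to '-'
        if labels[i] == '+':
            tp -= 1                     # a positive record becomes a false negative
        else:
            fp -= 1                     # a negative record becomes a true negative
        points.append((fp / N, tp / P))
    return points                       # ends at (FPR, TPR) = (0, 0)

# Hypothetical labels and scores, for illustration only:
labels = ['+', '-', '+', '-', '+']
scores = [0.85, 0.40, 0.95, 0.60, 0.70]
print(roc_points(labels, scores))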
5.7.3 Cost-Sensitive Learning
A cost matrix encodes the penalty of classifying records from one class as another. Let C(i, j) denote the cost of predicting a record from class i as class j. With this notation, C(+, -) is the cost of committing a false negative error, while C(-, +) is the cost of generating a false alarm. A negative entry in the cost matrix represents the reward for making a correct classification. Given a collection of N test records, the overall cost of a model M is
C_t(M) = TP × C(+, +) + FP × C(-, +) + FN × C(+, -) + TN × C(-, -).   (5.79)
Under the 0/1 cost matrix, i.e., C(+, +) = C(-, -) = 0 and C(+, -) = C(-, +) = 1, it can be shown that the overall cost is equivalent to the number of misclassification errors:

C_t(M) = 0 × (TP + TN) + 1 × (FP + FN) = N × Err,   (5.80)

where Err is the error rate of the classifier.
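Equation 5.79 translates directly into a small helper function. The sketch below is illustrative only: the function name, the dictionary encoding of the cost matrix, and the numbers in the 0/1 check are assumptions of this example, not part of the text.

def total_cost(tp, fp, fn, tn, cost):
    """Overall cost of a model (Equation 5.79).

    cost is a dict keyed by (actual, predicted) pairs, e.g.,
    cost[('+', '-')] is the cost of a false negative.
    """
    return (tp * cost[('+', '+')] + fp * cost[('-', '+')]
            + fn * cost[('+', '-')] + tn * cost[('-', '-')])

# Under the 0/1 cost matrix the total cost reduces to the number of errors
# (Equation 5.80); here 10 false positives + 5 false negatives give cost 15.
zero_one = {('+', '+'): 0, ('-', '-'): 0, ('+', '-'): 1, ('-', '+'): 1}
print(total_cost(tp=40, fp=10, fn=5, tn=45, cost=zero_one))  # 15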
Example 5.9. Consider the cost matrix shown in Table 5.7: the cost of committing a false negative error is a hundred times larger than the cost of committing a false alarm. In other words, failing to detect any positive example is just as bad as committing a hundred false alarms. Given the classification models with the confusion matrices shown in Table 5.8, the total cost for each model is
C_t(M1) = 150 × (-1) + 60 × 1 + 40 × 100 = 3910,
C_t(M2) = 250 × (-1) + 5 × 1 + 45 × 100 = 4255.
Table 5.7. Cost matrix for Example 5.9.

                              Predicted Class
                              Class = +     Class = -
  Actual Class   Class = +       -1            100
                 Class = -        1              0
Table 5.8. Confusion matrices for two classification models.

  Model M1                     Predicted Class
                               Class +    Class -
  Actual Class    Class +        150         40
                  Class -         60        250

  Model M2                     Predicted Class
                               Class +    Class -
  Actual Class    Class +        250         45
                  Class -          5        200
Notice that despite improving both its true positive and false positive counts, model M2 is still inferior because the improvement comes at the expense of increasing the more costly false negative errors. A standard accuracy measure would have preferred model M2 over M1.
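Using the total_cost sketch introduced after Equation 5.80 (again, an illustrative helper rather than the book's code), the two costs above can be reproduced from Tables 5.7 and 5.8:

cost = {('+', '+'): -1, ('+', '-'): 100, ('-', '+'): 1, ('-', '-'): 0}

# Confusion counts from Table 5.8
m1 = dict(tp=150, fn=40, fp=60, tn=250)
m2 = dict(tp=250, fn=45, fp=5, tn=200)

print(total_cost(cost=cost, **m1))  # 3910
print(total_cost(cost=cost, **m2))  # 4255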
A cost-sensitive classification technique takes the cost matrix into consideration during model building and generates a model that has the lowest cost. For example, if false negative errors are the most costly, the learning algorithm will try to reduce these errors by extending its decision boundary toward the negative class, as shown in Figure 5.44. In this way, the generated model can cover more positive examples, although at the expense of generating additional false alarms.
Figure 5.44. Modifying the decision boundary (from B1 to B2) to reduce the false negative errors of a classifier.
There are various ways to incorporate cost information into classification algorithms. For example, in the context of decision tree induction, the cost
information can be used to: (1) choose the best attribute to use for splitting the data, (2) determine whether a subtree should be pruned, (3) manipulate the weights of the training records so that the learning algorithm converges to a decision tree that has the lowest cost, and (4) modify the decision rule at each leaf node. To illustrate the last approach, let p(i|t) denote the fraction of training records from class i that belong to the leaf node t. A typical decision rule for a binary classification problem assigns the positive class to node t if the following condition holds:
p(+|t) > p(-|t)
⇒ p(+|t) > (1 - p(+|t))
⇒ 2 p(+|t) > 1
⇒ p(+|t) > 0.5.   (5.81)

The preceding decision rule suggests that the class label of a leaf node depends on the majority class of the training records that reach the particular node. Note that this rule assumes that the misclassification costs are identical for both positive and negative examples. This decision rule is equivalent to the expression given in Equation 4.8 on page 165.

Instead of taking a majority vote, a cost-sensitive algorithm assigns the class label i to node t if it minimizes the following expression:

C(i|t) = Σ_j p(j|t) C(j, i).   (5.82)

In the case where C(+, +) = C(-, -) = 0, a leaf node t is assigned to the positive class if:

p(+|t) C(+, -) > p(-|t) C(-, +)
⇒ p(+|t) C(+, -) > (1 - p(+|t)) C(-, +)
⇒ p(+|t) > C(-, +) / (C(-, +) + C(+, -)).   (5.83)
This expression suggests that we can modify the threshold of the decision rule from 0.5 to C(-, +)/(C(-, +) + C(+, -)) to obtain a cost-sensitive classifier. If C(-, +) < C(+, -), then the threshold will be less than 0.5. This result makes sense because the cost of making a false negative error is higher than that of generating a false alarm. Lowering the threshold will expand the decision boundary toward the negative class, as shown in Figure 5.44.
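A minimal sketch of this cost-sensitive leaf-labeling rule follows, assuming the same dictionary encoding of the cost matrix as in the earlier sketches; it compares expected costs as in Equation 5.82 and notes the equivalent threshold of Equation 5.83 for the zero-diagonal case.

def cost_sensitive_label(p_pos, cost):
    """Assign the class with the lower expected cost at a leaf (Equation 5.82).

    p_pos is p(+|t), the fraction of positive training records at leaf t.
    """
    p_neg = 1.0 - p_pos
    cost_pred_pos = p_pos * cost[('+', '+')] + p_neg * cost[('-', '+')]
    cost_pred_neg = p_pos * cost[('+', '-')] + p_neg * cost[('-', '-')]
    return '+' if cost_pred_pos < cost_pred_neg else '-'

# With C(+,+) = C(-,-) = 0, this is equivalent to thresholding p(+|t) at
# C(-,+) / (C(-,+) + C(+,-))  (Equation 5.83).
cost = {('+', '+'): 0, ('-', '-'): 0, ('+', '-'): 100, ('-', '+'): 1}
threshold = cost[('-', '+')] / (cost[('-', '+')] + cost[('+', '-')])  # about 0.0099
print(cost_sensitive_label(0.05, cost), threshold)  # '+' because 0.05 > threshold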
Figure 5.45. Illustrating the effect of oversampling of the rare class: (a) without oversampling, (b) with oversampling.
5.7.4 Sampling-Based Approaches
Sampling is another widely used approach for handling the class imbalance problem. The idea of sampling is to modify the distribution of instances so that the rare class is well represented in the training set. Some of the available techniques for sampling include undersampling, oversampling, and a hybrid of both approaches. To illustrate these techniques, consider a data set that contains 100 positive examples and 1000 negative examples.
In the case of undersampling, a random sample of 100 negative examples is chosen to form the training set along with all the positive examples. One potential problem with this approach is that some of the useful negative examples may not be chosen for training, thus resulting in a less than optimal model. A potential method to overcome this problem is to perform undersampling multiple times and to induce multiple classifiers, similar to the ensemble learning approach. Focused undersampling methods may also be used, where the sampling procedure makes an informed choice with regard to the negative examples that should be eliminated, e.g., those located far away from the decision boundary.
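A minimal sketch of random undersampling under the assumptions of this example (positive and negative examples held in separate lists; the function name and the train placeholder in the closing comment are hypothetical):

import random

def undersample(positives, negatives, seed=None):
    """Random undersampling: keep all positives plus an equally sized
    random sample of negatives, as in the 100-vs-1000 example above."""
    rng = random.Random(seed)
    kept_negatives = rng.sample(negatives, k=len(positives))
    return positives + kept_negatives

# Repeating the draw and training one classifier per sample gives the
# ensemble-style variant mentioned above:
# models = [train(undersample(pos, neg, seed=s)) for s in range(10)]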
Oversampling replicates the positive examples until the training set has an equal number of positive and negative examples. Figure 5.45 illustrates the effect of oversampling on the construction of a decision boundary using a classifier such as a decision tree. Without oversampling, only the positive examples at the bottom right-hand side of Figure 5.45(a) are classified correctly. The positive example in the middle of the diagram is misclassified because there
are not enough examples to justify the creation of a new decision boundary to separate the positive and negative instances. Oversampling provides the additional examples needed to ensure that the decision boundary surrounding the positive example is not pruned, as illustrated in Figure 5.45(b).
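For comparison, a correspondingly minimal sketch of oversampling by replication, using the same hypothetical list-based representation as the undersampling sketch:

import random

def oversample_by_replication(positives, negatives, seed=None):
    """Replicate positive examples (sampling with replacement) until the
    training set has as many positives as negatives."""
    rng = random.Random(seed)
    replicated = [rng.choice(positives) for _ in range(len(negatives))]
    return replicated + negatives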
However, for noisy data, oversampling may cause model overfitting because some of the noisy examples may be replicated many times. In principle, oversampling does not add any new information into the training set. Replication of positive examples only prevents the learning algorithm from pruning certain parts of the model that describe regions containing very few training examples (i.e., the small disjuncts). The additional positive examples also tend to increase the computation time for model building.
The hybrid approach uses a combination of undersampling the majority class and oversampling the rare class to achieve a uniform class distribution. Undersampling can be performed using random or focused subsampling. Oversampling, on the other hand, can be done by replicating the existing positive examples or generating new positive examples in the neighborhood of the existing positive examples. In the latter approach, we must first determine the k-nearest neighbors for each existing positive example. A new positive example is then generated at some random point along the line segment that joins the positive example to one of its k-nearest neighbors. This process is repeated until the desired number of positive examples is reached. Unlike the data replication approach, the new examples allow us to extend the decision boundary for the positive class outward, similar to the approach shown in Figure 5.44. Nevertheless, this approach may still be quite susceptible to model overfitting.
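The neighborhood-based generation just described can be sketched as follows, assuming examples are lists of numeric features, Euclidean distance, at least two positive examples, and no special tie handling; this is an illustration of the idea, not a reference implementation.

import math
import random

def synthetic_positives(positives, n_new, k=5, seed=None):
    """Generate n_new examples along line segments joining a chosen
    positive example to one of its k nearest positive neighbors."""
    rng = random.Random(seed)
    new_examples = []
    for _ in range(n_new):
        x = rng.choice(positives)
        # k nearest other positive examples to x (Euclidean distance)
        neighbors = sorted((p for p in positives if p is not x),
                           key=lambda p: math.dist(x, p))[:k]
        nn = rng.choice(neighbors)
        alpha = rng.random()  # random point on the segment between x and nn
        new_examples.append([xi + alpha * (ni - xi) for xi, ni in zip(x, nn)])
    return new_examples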
5.8 Multiclass Problem
Some of the classification techniques described in this chapter, such as support vector machines and AdaBoost, are originally designed for binary classification problems. Yet there are many real-world problems, such as character recognition, face identification, and text classification, where the input data is divided into more than two categories. This section presents several approaches for extending the binary classifiers to handle multiclass problems. To illustrate these approaches, let Y = {y1, y2, ..., yK} be the set of classes of the input data.
The first approach decomposes the multiclass problem into K binary problems. For each class yi ∈ Y, a binary problem is created where all instances that belong to yi are considered positive examples, while the remaining instances are considered negative examples. A binary classifier is then constructed to separate instances of class yi from the rest of the classes. This is known as the one-against-rest (1-r) approach.
The second approach, which is known as the one-against-one (1-1) approach, constructs K(K - 1)/2 binary classifiers, where each classifier is used to distinguish between a pair of classes, (yi, yj). Instances that do not belong to either yi or yj are ignored when constructing the binary classifier for (yi, yj). In both the 1-r and 1-1 approaches, a test instance is classified by combining the predictions made by the binary classifiers. A voting scheme is typically employed to combine the predictions, where the class that receives the highest number of votes is assigned to the test instance. In the 1-r approach, if an instance is classified as negative, then all classes except for the positive class receive a vote. This approach, however, may lead to ties among the different classes. Another possibility is to transform the outputs of the binary classifiers into probability estimates and then assign the test instance to the class that has the highest probability.
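One plausible way to tally these votes is sketched below. The binary classifiers are represented abstractly as callables returning '+' or '-' (for 1-r) or the winning class of a pair (for 1-1); this interface is an assumption of the sketch, not a specific library's API.

from collections import Counter

def predict_one_vs_rest(x, classifiers, classes):
    """classifiers[c](x) -> '+' if x is judged to belong to class c, else '-'.
    A '+' prediction gives class c a vote; a '-' prediction gives every
    other class a vote, as described above."""
    votes = Counter()
    for c in classes:
        if classifiers[c](x) == '+':
            votes[c] += 1
        else:
            for other in classes:
                if other != c:
                    votes[other] += 1
    return votes.most_common(1)[0][0]   # ties broken arbitrarily

def predict_one_vs_one(x, pairwise, classes):
    """pairwise[(ci, cj)](x) -> the winning class of the pair, ci or cj."""
    votes = Counter()
    for i, ci in enumerate(classes):
        for cj in classes[i + 1:]:
            votes[pairwise[(ci, cj)](x)] += 1
    return votes.most_common(1)[0][0]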