Algorithm 5.7 AdaBoost algorithm.
1: $\mathbf{w} = \{w_j = 1/N \mid j = 1, 2, \ldots, N\}$. {Initialize the weights for all $N$ examples.}
2: Let $k$ be the number of boosting rounds.
3: for $i = 1$ to $k$ do
4:   Create training set $D_i$ by sampling (with replacement) from $D$ according to $\mathbf{w}$.
5:   Train a base classifier $C_i$ on $D_i$.
6:   Apply $C_i$ to all examples in the original training set, $D$.
7:   $\epsilon_i = \frac{1}{N}\big[\sum_j w_j\,\delta\big(C_i(\mathbf{x}_j) \neq y_j\big)\big]$  {Calculate the weighted error.}
8:   if $\epsilon_i > 0.5$ then
9:     $\mathbf{w} = \{w_j = 1/N \mid j = 1, 2, \ldots, N\}$. {Reset the weights for all $N$ examples.}
10:    Go back to Step 4.
11:  end if
12:  $\alpha_i = \frac{1}{2}\ln\frac{1 - \epsilon_i}{\epsilon_i}$
13:  Update the weight of each example according to Equation 5.69.
14: end for
15: $C^*(\mathbf{x}) = \operatorname{argmax}_y \sum_{j=1}^{k} \alpha_j\,\delta\big(C_j(\mathbf{x}) = y\big)$
where $\epsilon_i$ is the error rate of each base classifier $i$. If the error rate of the base classifier is less than 50%, we can write $\epsilon_i = 0.5 - \gamma_i$, where $\gamma_i$ measures how much better the classifier is than random guessing. The bound on the training error of the ensemble becomes

$$e_{\text{ensemble}} \le \prod_i \sqrt{1 - 4\gamma_i^2} \le \exp\Big(-2\sum_i \gamma_i^2\Big). \qquad (5.71)$$

If $\gamma_i > \gamma$ for all $i$'s, where $\gamma > 0$ is a constant, then the training error of the ensemble decreases exponentially, which leads to the fast convergence of the algorithm. Nevertheless, because of its tendency to focus on training examples that are wrongly classified, the boosting technique can be quite susceptible to overfitting.
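To make Algorithm 5.7 concrete, here is a minimal Python sketch of AdaBoost that uses one-level decision trees (decision stumps) as base classifiers on a small one-dimensional data set in the spirit of the boosting example in Figures 5.38 and 5.39. The helper names and the synthetic data are illustrative rather than part of the original text, and the weight-reset step simply moves on to the next round instead of repeating the current one.

import numpy as np

def train_stump(X, y, w):
    """Find the decision stump (threshold, left label, right label) with the
    smallest weighted error on 1-D data X. Degenerate stumps that predict the
    same class on both sides are allowed, as in round 2 of Figure 5.39."""
    best = None
    for t in np.unique(X):
        for left, right in [(-1, 1), (1, -1), (1, 1), (-1, -1)]:
            pred = np.where(X <= t, left, right)
            err = np.sum(w * (pred != y))
            if best is None or err < best[0]:
                best = (err, t, left, right)
    return best

def stump_predict(stump, X):
    _, t, left, right = stump
    return np.where(X <= t, left, right)

def adaboost(X, y, k, rng):
    """AdaBoost following Algorithm 5.7: resample according to w, fit a stump,
    compute its importance alpha, and re-weight the training examples."""
    N = len(X)
    w = np.full(N, 1.0 / N)                              # step 1: uniform weights
    stumps, alphas = [], []
    for _ in range(k):
        idx = rng.choice(N, size=N, replace=True, p=w)   # step 4: resample
        stump = train_stump(X[idx], y[idx], np.full(N, 1.0 / N))  # step 5
        pred = stump_predict(stump, X)                   # step 6: apply to D
        eps = np.sum(w * (pred != y))                    # step 7: weighted error
        if eps > 0.5:                                    # steps 8-11: reset
            w = np.full(N, 1.0 / N)
            continue                                     # simplified "go back"
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-10))  # step 12
        w = w * np.exp(-alpha * y * pred)                # step 13 (Eq. 5.69)
        w = w / w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def ensemble_predict(stumps, alphas, X):
    """Step 15: sign of the alpha-weighted votes of the base classifiers."""
    votes = sum(a * stump_predict(s, X) for s, a in zip(stumps, alphas))
    return np.sign(votes)

# Illustrative 1-D data set (same flavor as the example in Figure 5.38).
X = np.arange(0.1, 1.05, 0.1)
y = np.array([1, 1, 1, -1, -1, -1, -1, 1, 1, 1])
stumps, alphas = adaboost(X, y, k=3, rng=np.random.default_rng(0))
print(ensemble_predict(stumps, alphas, X))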
5.6.6 Random Forests
Random forest is a class of ensemble methods specifically designed for decision tree classifiers. It combines the predictions made by multiple decision trees, where each tree is generated based on the values of an independent set of random vectors, as shown in Figure 5.40. The random vectors are generated from a fixed probability distribution, unlike the adaptive approach used in AdaBoost, where the probability distribution is varied to focus on examples that are hard to classify. Bagging using decision trees is a special case of random forests, where randomness is injected into the model-building process by randomly choosing N samples, with replacement, from the original training set.
(a) Training records chosen during boosting:

  Boosting Round 1:  x:  0.1  0.4  0.5  0.6  0.6  0.7  0.7  0.7  0.8  1
                     y:    1    1   -1   -1   -1   -1   -1   -1    1    1
  Boosting Round 2:  x:  0.1  0.1  0.2  0.2  0.2  0.2  0.3  0.3  0.3  0.3
                     y:    1    1    1    1    1    1    1    1    1    1
  Boosting Round 3:  x:  0.2  0.2  0.4  0.4  0.4  0.4  0.5  0.6  0.6  0.7
                     y:    1    1   -1    1   -1   -1   -1   -1   -1   -1

(b) Weights of training records:

  x:        0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9    1.0
  Round 1:  0.1    0.1    0.1    0.1    0.1    0.1    0.1    0.1    0.1    0.1
  Round 2:  0.311  0.311  0.311  0.01   0.01   0.01   0.01   0.01   0.01   0.01
  Round 3:  0.029  0.029  0.029  0.228  0.228  0.228  0.228  0.009  0.009  0.009

Figure 5.38. Example of boosting.
Bagging also uses the same uniform probability distribution to generate its bootstrapped samples throughout the entire model-building process.
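As a small illustration of the bootstrap sampling described above (all names and values below are illustrative), every bagging round draws N records with replacement under the same uniform distribution:

import numpy as np

rng = np.random.default_rng(42)
N = 10                                   # number of training records
D = np.arange(N)                         # indices of the original training set

# Bagging: every round uses the same uniform distribution over records,
# unlike AdaBoost, which adapts the sampling weights between rounds.
bootstrap_rounds = [rng.choice(D, size=N, replace=True) for _ in range(3)]
for i, sample in enumerate(bootstrap_rounds, start=1):
    print(f"round {i}: {sorted(sample)}")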
It was theoretically proven that the upper bound for the generalization error of random forests converges to the following expression when the number of trees is sufficiently large:

$$\text{Generalization error} \le \frac{\bar{\rho}\,(1 - s^2)}{s^2}, \qquad (5.72)$$

where $\bar{\rho}$ is the average correlation among the trees and $s$ is a quantity that measures the "strength" of the tree classifiers. The strength of a set of classifiers refers to the average performance of the classifiers, where performance is measured probabilistically in terms of the classifier's margin:

$$M(\mathbf{X}, Y) = P(\hat{Y}_\theta = Y) - \max_{Z \neq Y} P(\hat{Y}_\theta = Z), \qquad (5.73)$$

where $\hat{Y}_\theta$ is the predicted class of $\mathbf{X}$ according to a classifier built from some random vector $\theta$. The higher the margin is, the more likely it is that the classifier correctly predicts a given example $\mathbf{X}$.
(a) Boosted decision stumps and their importance:

  Round   Split Point   Left Class   Right Class      α
    1        0.75           -1            1         1.738
    2        0.05            1            1         2.7784
    3        0.3             1           -1         4.1195

(b) Predictions made on the training records:

  x:        0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0
  Round 1:   -1    -1    -1    -1    -1    -1    -1     1     1     1
  Round 2:    1     1     1     1     1     1     1     1     1     1
  Round 3:    1     1     1    -1    -1    -1    -1    -1    -1    -1
  Sum:     5.16  5.16  5.16 -3.08 -3.08 -3.08 -3.08 0.397 0.397 0.397
  Sign:       1     1     1    -1    -1    -1    -1     1     1     1

Figure 5.39. Example of combining classifiers constructed using the AdaBoost approach.
Figure 5.40. Random forests: random vectors are generated from the original training data (Step 1), each random vector is used to build a decision tree (Step 2), and the decision trees are combined to form the final prediction (Step 3).
Equation 5.72 is quite intuitive; as the trees become more correlated or the strength of the ensemble decreases, the generalization error bound tends to increase. Randomization helps to reduce the correlation among decision trees so that the generalization error of the ensemble can be improved.
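As a purely numerical illustration of Equation 5.72 (the values of the average correlation and strength below are made up), the bound grows as the correlation increases or the strength decreases:

def rf_error_bound(rho_bar, s):
    """Upper bound on the random forest generalization error (Eq. 5.72)."""
    return rho_bar * (1 - s**2) / s**2

# Illustrative values only: higher correlation or lower strength
# loosens (increases) the bound.
for rho_bar, s in [(0.1, 0.5), (0.3, 0.5), (0.1, 0.4)]:
    print(f"rho_bar={rho_bar}, s={s}: bound={rf_error_bound(rho_bar, s):.3f}")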
Each decision tree uses a random vector that is generated from some fixed probability distribution. A random vector can be incorporated into the tree-growing process in many ways. The first approach is to randomly select F input features to split at each node of the decision tree. As a result, instead of examining all the available features, the decision to split a node is determined from these F selected features. The tree is then grown to its entirety without any pruning. This may help reduce the bias present in the resulting tree. Once the trees have been constructed, the predictions are combined using a majority voting scheme. This approach is known as Forest-RI, where RI refers to random input selection. To increase randomness, bagging can also be used to generate bootstrap samples for Forest-RI. The strength and correlation of random forests may depend on the size of F. If F is sufficiently small, then the trees tend to become less correlated. On the other hand, the strength of the tree classifier tends to improve with a larger number of features, F. As a tradeoff, the number of features is commonly chosen to be F = log₂ d + 1, where d is the number of input features. Since only a subset of the features needs to be examined at each node, this approach helps to significantly reduce the runtime of the algorithm.
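A minimal sketch of the node-splitting step in the spirit of Forest-RI, written from scratch with NumPy; the synthetic data, the Gini-based split criterion used here, and the helper names are illustrative assumptions rather than details taken from the text:

import numpy as np

def gini(y):
    """Gini impurity of a set of class labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p**2)

def forest_ri_split(X, y, rng):
    """One node-splitting step in the spirit of Forest-RI: consider only
    F = floor(log2(d)) + 1 randomly chosen input features instead of all d."""
    n, d = X.shape
    F = int(np.log2(d)) + 1
    candidate_features = rng.choice(d, size=F, replace=False)
    best = None                      # (weighted impurity, feature, threshold)
    for f in candidate_features:
        for t in np.unique(X[:, f]):
            left, right = y[X[:, f] <= t], y[X[:, f] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or impurity < best[0]:
                best = (impurity, f, t)
    return best

# Illustrative data: 20 records, 8 input features, binary class labels.
rng = np.random.default_rng(1)
X = rng.random((20, 8))
y = (X[:, 0] + X[:, 3] > 1).astype(int)
print(forest_ri_split(X, y, rng))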
If the number of original features d is too small, then it is difficult to choose an independent set of random features for building the decision trees. One way to increase the feature space is to create linear combinations of the input features. Specifically, at each node, a new feature is generated by randomly selecting L of the input features. The input features are linearly combined using coefficients generated from a uniform distribution in the range of [-1, 1]. At each node, F of such randomly combined new features are generated, and the best of them is subsequently selected to split the node. This approach is known as Forest-RC.
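The Forest-RC idea can be sketched along the same lines; at a node, each of F candidate features is a random linear combination of L randomly chosen input features with coefficients drawn uniformly from [-1, 1] (the function and variable names are illustrative):

import numpy as np

def forest_rc_features(X, L, F, rng):
    """Generate F candidate features, each a random linear combination of
    L randomly selected input features with coefficients in [-1, 1]."""
    n, d = X.shape
    new_features = np.empty((n, F))
    for j in range(F):
        cols = rng.choice(d, size=L, replace=False)
        coeffs = rng.uniform(-1.0, 1.0, size=L)
        new_features[:, j] = X[:, cols] @ coeffs
    return new_features

# The node would then be split on the best of these F combined features,
# for example by reusing a split-selection routine like the one sketched above.
rng = np.random.default_rng(2)
X = rng.random((20, 3))                  # small original feature space (d = 3)
Z = forest_rc_features(X, L=2, F=5, rng=rng)
print(Z.shape)                           # (20, 5) candidate features at this node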
A third approach for generating the random trees is to randomly select one of the F best splits at each node of the decision tree. This approach may potentially generate trees that are more correlated than Forest-RI and Forest-RC, unless F is sufficiently large. It also does not have the runtime savings of Forest-RI and Forest-RC because the algorithm must examine all the splitting features at each node of the decision tree.
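A compact sketch of this third variant, assuming the impurity scores of all candidate splits at a node have already been computed (names and values are illustrative):

import numpy as np

def random_top_f_split(split_scores, F, rng):
    """Given impurity scores for all candidate splits at a node (lower is
    better), return the index of a split chosen uniformly among the F best."""
    order = np.argsort(split_scores)      # every split must still be evaluated,
    top_f = order[:F]                     # so there is no runtime saving
    return rng.choice(top_f)

rng = np.random.default_rng(3)
scores = rng.random(12)                   # illustrative impurity scores
print(random_top_f_split(scores, F=3, rng=rng))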
It has been shown empirically that the classification accuracies of random forests are quite comparable to those of the AdaBoost algorithm. Random forests are also more robust to noise and run much faster than AdaBoost. The classification accuracies of various ensemble algorithms are compared in the next section.
Table 5.5. Comparing the accuracy of a decision tree classifier against three ensemble methods.
Data Set       (Attributes, Classes, Records)   Decision Tree (%)   Bagging (%)   Boosting (%)   RF (%)
Anneal         (39, 6, 898)                     92.09               94.43         95.43          ?
Australia      (15, 2, 690)                     85.51               87.10         85.22          85.80
Auto           (26, 7, 205)                     81.95               85.37         85.37          84.39
Breast         (11, 2, 699)                     95.14               96.42         ?              96.14
Cleve          (14, 2, 303)                     76.24               81.52         82.18          82.18
Credit         (16, 2, 690)                     85.8                86.23         86.09          85.8
Diabetes       (9, 2, 769)                      72.40               76.30         73.18          ?
German         (21, 2, 1000)                    70.90               73.40         73.00          74.5
Glass          (10, 7, 214)                     67.29               76.17         ?              78.04
Heart          (14, 2, 270)                     80.00               81.48         80.74          83.33
Hepatitis      (20, 2, 155)                     81.94               81.29         83.87          83.23
Horse          (23, 2, 369)                     85.33               85.87         81.25          85.33
Ionosphere     (35, 2, 351)                     89.17               92.02         93.73          93.45
Iris           (5, 3, 150)                      94.67               94.67         94.00          93.33
Labor          (17, 2, 57)                      78.95               84.21         89.47          84.21
Led7           (8, 10, 3200)                    73.34               73.66         73.34          73.06
Lymphography   (19, 4, 148)                     77.03               79.05         85.14          82.43
Pima           (9, 2, 769)                      74.35               76.69         73.44          77.60
Sonar          (61, 2, 208)                     78.85               78.85         84.62          ?
Tic-tac-toe    (10, 2, 958)                     83.72               93.84         98.54          95.82
Vehicle        (19, 4, 846)                     71.04               74.11         78.25          74.94
Waveform       (22, 3, 5000)                    76.44               83.30         83.90          84.04
Wine           (14, 3, 178)                     94.38               96.07         97.75          97.75
Zoo            (17, 7, 101)                     93.07               93.07         95.05          97.03
5.6.7 Empirical Comparison among Ensemble Methods
Table 5.5 shows the empirical results obtained when comparing the performance of a decision tree classifier against bagging, boosting, and random forest. The base classifiers used in each ensemble method consist of fifty decision trees. The classification accuracies reported in this table are obtained from ten-fold cross-validation. Notice that the ensemble classifiers generally outperform a single decision tree classifier on many of the data sets.
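A comparison of this kind could be sketched with scikit-learn as follows, assuming the library is installed; the bundled breast cancer data set stands in for the data sets of Table 5.5, and each ensemble uses fifty base learners with ten-fold cross-validation as described above. Note that AdaBoostClassifier uses decision stumps as its default base learner, which differs slightly from the full trees implied by the table.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

models = {
    # BaggingClassifier's default base estimator is a decision tree.
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Bagging": BaggingClassifier(n_estimators=50, random_state=0),
    "Boosting": AdaBoostClassifier(n_estimators=50, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10)   # ten-fold cross-validation
    print(f"{name}: {scores.mean():.3f}")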
5.7 Class Imbalance Problem
Data sets with imbalanced class distributions are quite common in many real applications. For example, an automated inspection system that monitors products that come off a manufacturing assembly line may find that the
number of defective products is significantly smaller than the number of non-defective products. Similarly, in credit card fraud detection, fraudulent transactions are outnumbered by legitimate transactions. In both of these examples, there is a disproportionate number of instances that belong to different classes. The degree of imbalance varies from one application to another: a manufacturing plant operating under the six sigma principle may discover four defects in a million products shipped to their customers, while the amount of credit card fraud may be of the order of 1 in 100. Despite their infrequent occurrences, a correct classification of the rare class in these applications often has greater value than a correct classification of the majority class. However, because the class distribution is imbalanced, this presents a number of problems to existing classification algorithms.
The accuracy measure, which is used extensively to compare the performance of classifiers, may not be well suited for evaluating models derived from imbalanced data sets. For example, if 1% of the credit card transactions are fraudulent, then a model that predicts every transaction as legitimate has an accuracy of 99% even though it fails to detect any of the fraudulent activities. Additionally, measures that are used to guide the learning algorithm (e.g., information gain for decision tree induction) may need to be modified to focus on the rare class.
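The 1% fraud scenario is easy to reproduce; the counts below are made up to match the 99%/1% split described in the text:

# 10,000 transactions, 1% of which are fraudulent (class 1).
n_legit, n_fraud = 9900, 100
true_labels = [0] * n_legit + [1] * n_fraud

# A trivial model that predicts every transaction as legitimate (class 0).
predictions = [0] * (n_legit + n_fraud)

correct = sum(t == p for t, p in zip(true_labels, predictions))
accuracy = correct / len(true_labels)
frauds_detected = sum(p == 1 for t, p in zip(true_labels, predictions) if t == 1)

print(f"accuracy = {accuracy:.2%}")            # 99.00%, despite detecting ...
print(f"frauds detected = {frauds_detected}")  # ... none of the 100 frauds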
Detecting instances of the rare class is akin to finding a needle in a haystack. Because their instances occur infrequently, models that describe the rare class tend to be highly specialized. For example, in a rule-based classifier, the rules extracted for the rare class typically involve a large number of attributes and cannot be easily simplified into more general rules with broader coverage (unlike the rules for the majority class). Such models are also susceptible to the presence of noise in training data. As a result, many of the existing classification algorithms may not effectively detect instances of the rare class.
This section presents some of the methods developed for handling the class imbalance problem. First, alternative metrics besides accuracy are introduced, along with a graphical method called ROC analysis. We then describe how cost-sensitive learning and sampling-based methods may be used to improve the detection of rare classes.
5.7.1 Alternative Metrics
Since the accuracy measure treats every class as equally important, it may not be suitable for analyzing imbalanced data sets, where the rare class is considered more interesting than the majority class. For binary classification, the rare class is often denoted as the positive class, while the majority class is