Figure 5.33. Two decision trees with different complexities induced from the same training data.

the observed difference for a decision tree classifier. This result suggests that
the bias of a 1-nearest neighbor classifier is lower than the bias of a decision
tree classifier. On the other hand, the 1-nearest neighbor classifier is more sensitive to
the composition of its training examples. If we examine the models induced
from different training sets, there is more variability in the decision boundary
of a 1-nearest neighbor classifier than in that of a decision tree classifier. Therefore, the
decision boundary of a decision tree classifier has a lower variance than that of the
1-nearest neighbor classifier.

5.6.4 Bagging

Bagging, which is also known as bootstrap aggregating, is a technique that
repeatedly samples (with replacement) from a data set according to a uniform
probability distribution. Each bootstrap sample has the same size as the original
data. Because the sampling is done with replacement, some instances may
appear several times in the same training set, while others may be omitted
from the training set. On average, a bootstrap sample D_i contains approximately
63% of the original training data because each record has a probability
1 - (1 - 1/N)^N of being selected for D_i. If N is sufficiently large, this
probability converges to 1 - 1/e ≈ 0.632. The basic procedure for bagging is
summarized in Algorithm 5.6. After training the k classifiers, a test instance
is assigned to the class that receives the highest number of votes.

284 Chapter 5 Classification: Alternative Techniques

Figure 5.34. Bias of decision tree and 1-nearest neighbor classifiers. (a) Decision boundary for decision tree. (b) Decision boundary for 1-nearest neighbor.

Algorithm 5.6 Bagging algorithm.
1: Let k be the number of bootstrap samples.
2: for i = 1 to k do
3:   Create a bootstrap sample of size N, D_i.
4:   Train a base classifier C_i on the bootstrap sample D_i.
5: end for
6: C^*(x) = \arg\max_y \sum_i \delta(C_i(x) = y)
   {\delta(\cdot) = 1 if its argument is true and 0 otherwise.}
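The 63% figure can be checked numerically. Below is a small Python sketch (the sample size N = 1000 and the trial count are arbitrary choices, not values from the text) comparing the closed-form probability 1 - (1 - 1/N)^N against a simulated bootstrap coverage:

```python
import random

# Empirically check that a bootstrap sample covers about 63.2% of the
# original records.  N and the number of trials are arbitrary choices.
def bootstrap_coverage(N, trials=2000, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        sample = {rng.randrange(N) for _ in range(N)}  # sample with replacement
        total += len(sample) / N                       # fraction of distinct records
    return total / trials

print(round(1 - (1 - 1 / 1000) ** 1000, 4))  # analytical value for N = 1000
print(round(bootstrap_coverage(1000), 3))    # simulated value, close to 0.632
```

As N grows, both numbers approach 1 - 1/e ≈ 0.632.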

To illustrate how bagging works, consider the data set shown in Table 5.4. Let x denote a one-dimensional attribute and y denote the class label. Suppose we apply a classifier that induces only one-level binary decision trees, with a test condition x <= k, where k is a split point chosen to minimize the entropy of the leaf nodes. Such a tree is also known as a decision stump.

Table 5.4. Example of data set used to construct an ensemble of bagging classifiers.
x | 0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0
y |  1    1    1   -1   -1   -1   -1    1    1    1

5.6 Ensemble Methods 285

Without bagging, the best decision stump we can produce splits the records at
either x <= 0.35 or x <= 0.75. Either way, the accuracy of the tree is at
most 70%. Suppose we apply the bagging procedure on the data set using

ten bootstrap samples. The examples chosen for training in each bagging

round are shown in Figure 5.35. On the right-hand side of each table, we also

illustrate the decision boundary produced by the classifier. We classify the entire data set given in Table 5.4 by taking a majority

vote among the predictions made by each base classifier. The results of the predictions are shown in Figure 5.36. Since the class labels are either -1 or

+1, taking the majority vote is equivalent to summing up the predicted values
of y and examining the sign of the resulting sum (refer to the second to last

row in Figure 5.36). Notice that the ensemble classifier perfectly classifies all

ten examples in the original data. The preceding example illustrates another advantage of using ensemble

methods in terms of enhancing the representation of the target function. Even

though each base classifier is a decision stump, combining the classifiers can

lead to a decision tree of depth 2. Bagging improves generalization error by reducing the variance of the base

classifiers. The performance of bagging depends on the stability of the base

classifier. If a base classifier is unstable, bagging helps to reduce the errors

associated with random fluctuations in the training data. If a base classifier

is stable, i.e., robust to minor perturbations in the training set, then the

error of the ensemble is primarily caused by bias in the base classifier. In

this situation, bagging may not be able to improve the performance of the

base classifiers significantly. It may even degrade the classifier’s performance

because the effective size of each training set is about 37% smaller than the

original data. Finally, since every sample has an equal probability of being selected, bag-

ging does not focus on any particular instance of the training data. It is

therefore less susceptible to model overfitting when applied to noisy data.
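To make the bagging procedure concrete, here is a minimal Python sketch on the Table 5.4 data. It departs from the text in two small ways worth flagging: the stump minimizes training error rather than entropy, and the bootstrap rounds use an arbitrary random seed, so the ten stumps will generally differ from those in Figure 5.35.

```python
import random

# Bagging with decision stumps on the one-dimensional data of Table 5.4.
# Simplification: the stump minimizes training error, not entropy.
X = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
Y = [1, 1, 1, -1, -1, -1, -1, 1, 1, 1]

def fit_stump(xs, ys):
    """Return (threshold k, left label) minimizing errors of the x <= k rule."""
    best = None
    for k in [x + 0.05 for x in xs] + [0.05]:       # candidate split points
        for left in (1, -1):
            errs = sum(1 for x, y in zip(xs, ys)
                       if (left if x <= k else -left) != y)
            if best is None or errs < best[0]:
                best = (errs, k, left)
    return best[1], best[2]

def predict(stump, x):
    k, left = stump
    return left if x <= k else -left

rng = random.Random(1)
stumps = []
for _ in range(10):                                  # k = 10 bagging rounds
    idx = [rng.randrange(len(X)) for _ in range(len(X))]  # bootstrap sample
    stumps.append(fit_stump([X[i] for i in idx], [Y[i] for i in idx]))

# Majority vote = sign of the summed stump predictions.
votes = [sum(predict(s, x) for s in stumps) for x in X]
pred = [1 if v > 0 else -1 for v in votes]
print(pred)   # ensemble predictions; often, though not always, better than one stump
```

A single stump reaches at most 70% accuracy on this data; the voted ensemble can represent the two-sided concept that no single threshold captures.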

5.6.5 Boosting

Boosting is an iterative procedure used to adaptively change the distribution

of training examples so that the base classifiers will focus on examples that

are hard to classify. Unlike bagging, boosting assigns a weight to each training

Round 1:
x | 0.1  0.2  0.2  0.3  0.4  0.4  0.5  0.6  0.9  0.9
y |  1    1    1    1   -1   -1   -1   -1    1    1


Round 1:  x <= 0.35 ==> y = 1;   x > 0.35 ==> y = -1
Round 2:  x <= 0.65 ==> y = 1;   x > 0.65 ==> y = 1
Round 3:  x <= 0.35 ==> y = 1;   x > 0.35 ==> y = -1
Round 4:  x <= 0.3  ==> y = 1;   x > 0.3  ==> y = -1
Round 5:  x <= 0.35 ==> y = 1;   x > 0.35 ==> y = -1
Round 6:  x <= 0.75 ==> y = -1;  x > 0.75 ==> y = 1
Round 7:  x <= 0.75 ==> y = -1;  x > 0.75 ==> y = 1
Round 8:  x <= 0.75 ==> y = -1;  x > 0.75 ==> y = 1
Round 9:  x <= 0.75 ==> y = -1;  x > 0.75 ==> y = 1
Round 10: x <= 0.05 ==> y = -1;  x > 0.05 ==> y = 1

Figure 5.35. Example of bagging.

example and may adaptively change the weight at the end of each boosting round. The weights assigned to the training examples can be used in the following ways:

1. They can be used as a sampling distribution to draw a set of bootstrap samples from the original data.

2. They can be used by the base classifier to learn a model that is biased toward higher-weight examples.
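A minimal sketch of the first option, using the weights as a sampling distribution. The weight values below are made up for illustration (two positions are given inflated weight, mimicking hard-to-classify instances); `random.choices` draws indices with replacement in proportion to those weights:

```python
import random

# Using example weights as a sampling distribution (option 1 above).
# The weights are made-up values, not from the text: indices 3 and 7
# stand in for examples that earlier rounds found hard to classify.
rng = random.Random(0)
weights = [0.05, 0.05, 0.05, 0.30, 0.05, 0.05, 0.05, 0.30, 0.05, 0.05]
idx = rng.choices(range(10), weights=weights, k=10)  # one boosting-round sample
print(idx)   # heavily weighted indices 3 and 7 tend to appear more often
```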

Round 2:
x | 0.1  0.2  0.3  0.4  0.5  0.8  0.9  1.0  1.0  1.0
y |  1    1    1   -1   -1    1    1    1    1    1

Round 3:
x | 0.1  0.2  0.3  0.4  0.4  0.5  0.7  0.7  0.8  0.9
y |  1    1    1   -1   -1   -1   -1   -1    1    1

Round 4:
x | 0.1  0.1  0.2  0.4  0.4  0.5  0.5  0.7  0.8  0.9
y |  1    1    1   -1   -1   -1   -1   -1    1    1

Round 5:
x | 0.1  0.1  0.2  0.5  0.6  0.6  0.6  1.0  1.0  1.0
y |  1    1    1   -1   -1   -1   -1    1    1    1

Round 6:
x | 0.2  0.4  0.5  0.6  0.7  0.7  0.7  0.8  0.9  1.0
y |  1   -1   -1   -1   -1   -1   -1    1    1    1

Round 7:
x | 0.1  0.4  0.4  0.6  0.7  0.8  0.9  0.9  0.9  1.0
y |  1   -1   -1   -1   -1    1    1    1    1    1

Round 8:
x | 0.1  0.2  0.5  0.5  0.5  0.7  0.7  0.8  0.9  1.0
y |  1    1   -1   -1   -1   -1   -1    1    1    1

Round 9:
x | 0.1  0.3  0.4  0.4  0.6  0.7  0.7  0.8  1.0  1.0
y |  1    1   -1   -1   -1   -1   -1    1    1    1

Round 10:
x | 0.1  0.1  0.1  0.1  0.3  0.3  0.8  0.8  0.9  0.9
y |  1    1    1    1    1    1    1    1    1    1


Round | x=0.1 x=0.2 x=0.3 x=0.4 x=0.5 x=0.6 x=0.7 x=0.8 x=0.9 x=1.0
  1   |   1     1     1    -1    -1    -1    -1    -1    -1    -1
  2   |   1     1     1     1     1     1     1     1     1     1
  3   |   1     1     1    -1    -1    -1    -1    -1    -1    -1
  4   |   1     1     1    -1    -1    -1    -1    -1    -1    -1
  5   |   1     1     1    -1    -1    -1    -1    -1    -1    -1
  6   |  -1    -1    -1    -1    -1    -1    -1     1     1     1
  7   |  -1    -1    -1    -1    -1    -1    -1     1     1     1
  8   |  -1    -1    -1    -1    -1    -1    -1     1     1     1
  9   |  -1    -1    -1    -1    -1    -1    -1     1     1     1
 10   |   1     1     1     1     1     1     1     1     1     1
Sum   |   2     2     2    -6    -6    -6    -6     2     2     2
Sign  |   1     1     1    -1    -1    -1    -1     1     1     1
True  |   1     1     1    -1    -1    -1    -1     1     1     1

Figure 5.36. Example of combining classifiers constructed using the bagging approach.

This section describes an algorithm that uses weights of examples to determine
the sampling distribution of its training set. Initially, the examples

are assigned equal weights, 1/N, so that they are equally likely to be chosen

for training. A sample is drawn according to the sampling distribution of the

training examples to obtain a new training set. Next, a classifier is induced

from the training set and used to classify all the examples in the original data.

The weights of the training examples are updated at the end of each boost-

ing round. Examples that are classified incorrectly will have their weights

increased, while those that are classified correctly will have their weights de-

creased. This forces the classifier to focus on examples that are difficult to

classify in subsequent iterations. The following table shows the examples chosen during each boosting round.

Boosting (Round 1): 7 3 2 8 7 9 4 10 6 3
Boosting (Round 2): 5 4 9 4 2 5 1 7 4 2
Boosting (Round 3): 4 4 8 10 4 5 4 6 3 4

Initially, all the examples are assigned the same weights. However, some ex-

amples may be chosen more than once, e.g., examples 3 and 7, because the

sampling is done with replacement. A classifier built from the data is then

used to classify all the examples. Suppose example 4 is difficult to classify.

The weight for this example will be increased in future iterations as it gets

misclassified repeatedly. Meanwhile, examples that were not chosen in the previous round, e.g., examples 1 and 5, also have a better chance of being selected in the next round since their predictions in the previous round were likely to be wrong. As the boosting rounds proceed, examples that are the hardest to classify tend to become even more prevalent. The final ensemble is obtained by aggregating the base classifiers obtained from each boosting round.

Over the years, several implementations of the boosting algorithm have been developed. These algorithms differ in terms of (1) how the weights of the training examples are updated at the end of each boosting round, and (2) how the predictions made by each classifier are combined. An implementation called AdaBoost is explored in the next section.

AdaBoost

Let {(x_j, y_j) | j = 1, 2, ..., N} denote a set of N training examples. In the AdaBoost algorithm, the importance of a base classifier C_i depends on its error rate, which is defined as

\epsilon_i = \frac{1}{N} \sum_{j=1}^{N} w_j \, I\big( C_i(x_j) \neq y_j \big)    (5.68)

where I(p) = 1 if the predicate p is true, and 0 otherwise. The importance of a classifier C_i is given by the following parameter,

\alpha_i = \frac{1}{2} \ln\left( \frac{1 - \epsilon_i}{\epsilon_i} \right)

Note that \alpha_i has a large positive value if the error rate is close to 0 and a large negative value if the error rate is close to 1, as shown in Figure 5.37.

The \alpha_i parameter is also used to update the weight of the training examples. To illustrate, let w_j^{(i)} denote the weight assigned to example (x_j, y_j) during the ith boosting round. The weight update mechanism for AdaBoost is given by the equation:

w_j^{(i+1)} = \frac{w_j^{(i)}}{Z_i} \times
\begin{cases}
\exp(-\alpha_i) & \text{if } C_i(x_j) = y_j \\
\exp(\alpha_i)  & \text{if } C_i(x_j) \neq y_j
\end{cases}    (5.69)

where Z_i is the normalization factor used to ensure that \sum_j w_j^{(i+1)} = 1. The weight update formula given in Equation 5.69 increases the weights of incorrectly classified examples and decreases the weights of those classified correctly.
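As a worked numerical check of Equations 5.68 and 5.69 (the error rate and the choice of which examples are misclassified are made-up values, not from the text): with N = 10 uniformly weighted examples and \epsilon_i = 0.3, the classifier importance is \alpha_i = 0.5 ln(0.7/0.3) ≈ 0.4236, and after the update the three misclassified examples jointly carry half the total weight:

```python
import math

# Worked numbers for Equations 5.68-5.69 with a made-up error rate.
# Suppose N = 10 uniformly weighted examples and classifier C_i
# misclassifies the first 3 of them, so epsilon_i = 0.3.
eps = 0.3
alpha = 0.5 * math.log((1 - eps) / eps)   # importance of C_i, ~0.4236

w = [0.1] * 10
# Unnormalized updates: misclassified examples grow, correct ones shrink.
raw = [0.1 * math.exp(alpha) if j < 3 else 0.1 * math.exp(-alpha)
       for j in range(10)]
Z = sum(raw)                              # normalization factor Z_i
w_new = [r / Z for r in raw]
print(round(alpha, 4), round(sum(w_new), 4))   # 0.4236 1.0
```

After normalization each misclassified example has weight 1/6 and each correctly classified one 1/14, so the mistakes get half the total mass.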


Figure 5.37. Plot of \alpha as a function of training error \epsilon.

Instead of using a majority voting scheme, the prediction made by each classifier C_i is weighted according to \alpha_i. This approach allows AdaBoost to penalize models that have poor accuracy, e.g., those generated at the earlier boosting rounds. In addition, if any intermediate round produces an error rate higher than 50%, the weights are reverted back to their original uniform values, w_j = 1/N, and the resampling procedure is repeated. The AdaBoost algorithm is summarized in Algorithm 5.7.

Let us examine how the boosting approach works on the data set shown in Table 5.4. Initially, all the examples have identical weights. After three boosting rounds, the examples chosen for training are shown in Figure 5.38(a). The weights for each example are updated at the end of each boosting round using Equation 5.69.

Without boosting, the accuracy of the decision stump is, at best, 70%. With AdaBoost, the results of the predictions are given in Figure 5.39(b). The final prediction of the ensemble classifier is obtained by taking a weighted average of the predictions made by each base classifier, which is shown in the last row of Figure 5.39(b). Notice that AdaBoost perfectly classifies all the examples in the training data.
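This calculation can be reproduced with a short sketch. It is not the book's resampling procedure (Algorithm 5.7); it is the deterministic variant in which each stump directly minimizes the weighted error of Equation 5.68 (option 2 at the start of Section 5.6.5). On this data it reaches the same conclusion: after three rounds the weighted ensemble classifies all ten examples correctly.

```python
import math

# AdaBoost with decision stumps on the Table 5.4 data.  Each stump is
# trained on the weighted error directly (no resampling), a deliberate
# simplification of the algorithm described in the text.
X = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
Y = [1, 1, 1, -1, -1, -1, -1, 1, 1, 1]

def best_stump(w):
    """Stump (error, k, left) minimizing the weighted error sum w_j I(pred != y)."""
    best = None
    for k in [0.05 + 0.1 * t for t in range(11)]:    # candidate split points
        for left in (1, -1):
            err = sum(wj for x, y, wj in zip(X, Y, w)
                      if (left if x <= k else -left) != y)
            if best is None or err < best[0]:
                best = (err, k, left)
    return best

w = [1.0 / len(X)] * len(X)
ensemble = []                                        # (alpha_i, k, left) triples
for _ in range(3):                                   # three boosting rounds
    err, k, left = best_stump(w)
    alpha = 0.5 * math.log((1 - err) / err)
    ensemble.append((alpha, k, left))
    # Equation 5.69: raise weights of mistakes, lower the rest, renormalize.
    w = [wj * math.exp(-alpha if (left if x <= k else -left) == y else alpha)
         for x, y, wj in zip(X, Y, w)]
    Z = sum(w)
    w = [wj / Z for wj in w]

def ensemble_predict(x):
    score = sum(a * (lf if x <= kk else -lf) for a, kk, lf in ensemble)
    return 1 if score > 0 else -1

print([ensemble_predict(x) for x in X])  # → [1, 1, 1, -1, -1, -1, -1, 1, 1, 1]
```

The three stumps found are x <= 0.35 (vote +1/-1), x <= 0.75 (vote -1/+1), and an always-+1 stump; their \alpha-weighted sum recovers the two-sided concept that no single stump can express.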

An important analytical result of boosting shows that the training error of the ensemble is bounded by the following expression: