Example 5.6. Consider the polynomial kernel function given in Equation 5.63. Let g(x) be a function that has a finite L2 norm, i.e., $\int g(\mathbf{x})^2\, d\mathbf{x} < \infty$.

$$
\int (\mathbf{x} \cdot \mathbf{y} + 1)^p\, g(\mathbf{x}) g(\mathbf{y})\, d\mathbf{x}\, d\mathbf{y}
= \int \sum_{i=0}^{p} \binom{p}{i} (\mathbf{x} \cdot \mathbf{y})^i\, g(\mathbf{x}) g(\mathbf{y})\, d\mathbf{x}\, d\mathbf{y}
$$
$$
= \sum_{i=0}^{p} \binom{p}{i} \sum_{\alpha_1, \alpha_2, \ldots} \binom{i}{\alpha_1\, \alpha_2\, \cdots} \left[ \int x_1^{\alpha_1} x_2^{\alpha_2} \cdots\, g(x_1, x_2, \ldots)\, dx_1\, dx_2 \cdots \right]^2.
$$

Because the result of the integration is non-negative, the polynomial kernel function therefore satisfies Mercer's theorem.
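Though not a substitute for the proof above, Mercer's condition can also be spot-checked numerically: a valid kernel must yield a positive semi-definite Gram matrix on any finite sample. The following minimal sketch (assuming NumPy, with an arbitrary degree p = 2 and random data) illustrates this for the polynomial kernel.

```python
import numpy as np

# Spot check (not a proof): a Mercer kernel's Gram matrix on any
# finite sample must be positive semi-definite.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))      # 50 arbitrary points in R^3
p = 2                             # illustrative polynomial degree

K = (X @ X.T + 1) ** p            # Gram matrix of K(x, y) = (x.y + 1)^p
eigvals = np.linalg.eigvalsh(K)   # eigenvalues of the symmetric matrix

print("smallest eigenvalue:", eigvals.min())
assert eigvals.min() > -1e-8      # PSD up to floating-point round-off
```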
5.5.5 Characteristics of SVM
SVM has many desirable qualities that make it one of the most widely used classification algorithms. Following is a summary of the general characteristics of SVM:
1. The SVM learning problem can be formulated as a convex optimization problem, in which efficient algorithms are available to find the global minimum of the objective function. Other classification methods, such as rule-based classifiers and artificial neural networks, employ a greedy-based strategy to search the hypothesis space. Such methods tend to find only locally optimal solutions.
2. SVM performs capacity control by maximizing the margin of the decision boundary. Nevertheless, the user must still provide other parameters such as the type of kernel function to use and the cost function C for introducing each slack variable.
3. SVM can be applied to categorical data by introducing dummy variables for each categorical attribute value present in the data. For example, if Marital Status has three values {Single, Married, Divorced}, we can introduce a binary variable for each of the attribute values, as sketched in the code after this list.
4. The SVM formulation presented in this chapter is for binary class problems. Some of the methods available to extend SVM to multiclass problems are presented in Section 5.8.
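As a brief illustration of point 3 above, the sketch below (assuming pandas; the Income column is invented for the example) converts a categorical Marital Status attribute into binary dummy variables.

```python
import pandas as pd

# Hypothetical data set with one categorical attribute.
df = pd.DataFrame({
    "Marital Status": ["Single", "Married", "Divorced", "Married"],
    "Income": [40000, 85000, 60000, 72000],
})

# One binary (dummy) variable per attribute value; the numeric
# attribute passes through unchanged.
encoded = pd.get_dummies(df, columns=["Marital Status"])
print(encoded)
```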
5.6 Ensemble Methods
The classification techniques we have seen so far in this chapter, with the exception of the nearest-neighbor method, predict the class labels of unknown examples using a single classifier induced from training data. This section presents techniques for improving classification accuracy by aggregating the predictions of multiple classifiers. These techniques are known as the ensemble or classifier combination methods. An ensemble method constructs a
set of base classifiers from training data and performs classification by taking a vote on the predictions made by each base classifier. This section explains why ensemble methods tend to perform better than any single classifier and presents techniques for constructing the classifier ensemble.
5.6.1 Rationale for Ensemble Method
The following example illustrates how an ensemble method can improve a classifier’s performance.
Example 5.7. Consider an ensemble of twenty-five binary classifiers, each of which has an error rate of ε = 0.35. The ensemble classifier predicts the class label of a test example by taking a majority vote on the predictions made by the base classifiers. If the base classifiers are identical, then the ensemble will misclassify the same examples predicted incorrectly by the base classifiers. Thus, the error rate of the ensemble remains 0.35. On the other hand, if the base classifiers are independent, i.e., their errors are uncorrelated, then the ensemble makes a wrong prediction only if more than half of the base classifiers predict incorrectly. In this case, the error rate of the ensemble classifier is

$$
e_{\text{ensemble}} = \sum_{i=13}^{25} \binom{25}{i} \epsilon^i (1 - \epsilon)^{25-i} = 0.06, \tag{5.66}
$$

which is considerably lower than the error rate of the base classifiers.
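Equation 5.66 is easy to verify numerically. The following sketch, using only Python's standard library, computes the majority-vote error for independent base classifiers; sweeping the base error rate from 0 to 1 traces the solid curve in Figure 5.30.

```python
from math import comb

def ensemble_error(eps, n=25):
    """Error rate of a majority vote over n independent base
    classifiers, each with error rate eps (Equation 5.66)."""
    k = n // 2 + 1  # minimum number of wrong votes for a wrong majority
    return sum(comb(n, i) * eps**i * (1 - eps)**(n - i)
               for i in range(k, n + 1))

print(round(ensemble_error(0.35), 2))  # 0.06
# Varying eps from 0 to 1 reproduces the solid curve in Figure 5.30.
```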
Figure 5.30 shows the error rate of an ensemble of twenty-five binary classifiers ($e_{\text{ensemble}}$) for different base classifier error rates (ε). The diagonal line represents the case in which the base classifiers are identical, while the solid line represents the case in which the base classifiers are independent. Observe that the ensemble classifier performs worse than the base classifiers when ε is larger than 0.5.
The preceding example illustrates two necessary conditions for an ensemble classifier to perform better than a single classifier: (1) the base classifiers should be independent of each other, and (2) the base classifiers should do better than a classifier that performs random guessing. In practice, it is difficult to ensure total independence among the base classifiers. Nevertheless, improvements in classification accuracies have been observed in ensemble methods in which the base classifiers are slightly correlated.
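What happens when independence fails can be checked by simulation. In the sketch below, the correlation mechanism (each classifier copies a shared mistake with probability rho) is an invented model for illustration; it preserves each base classifier's marginal error rate of 0.35, yet the majority vote degrades noticeably as rho grows.

```python
import numpy as np

rng = np.random.default_rng(1)
n, eps, trials = 25, 0.35, 100_000

# Independent base classifiers: each errs with probability eps.
indep_errs = rng.random((trials, n)) < eps

# Partially correlated classifiers: with probability rho a classifier
# copies a shared, trial-wide mistake; otherwise it errs independently.
rho = 0.4
shared = rng.random((trials, 1)) < eps
corr_errs = np.where(rng.random((trials, n)) < rho,
                     shared,
                     rng.random((trials, n)) < eps)

for name, errs in [("independent", indep_errs), ("correlated", corr_errs)]:
    wrong = (errs.sum(axis=1) > n // 2).mean()  # majority vote is wrong
    print(name, "ensemble error:", round(float(wrong), 3))
```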
Figure 5.30. Comparison between errors of base classifiers (horizontal axis) and errors of the ensemble classifier (vertical axis).
Figure 5.31. A logical view of the ensemble learning method. Step 1: create multiple data sets. Step 2: build multiple classifiers. Step 3: combine classifiers.
5.6.2 Methods for Constructing an Ensemble Classifier
A logical view of the ensemble method is presented in Figure 5.31. The basic idea is to construct multiple classifiers from the original data and then aggregate their predictions when classifying unknown examples. The ensemble of classifiers can be constructed in many ways:
1. By manipulating the training set. In this approach, multiple training sets are created by resampling the original data according to some sampling distribution. The sampling distribution determines how likely it is that an example will be selected for training, and it may vary from one trial to another. A classifier is then built from each training set using a particular learning algorithm. Bagging and boosting are two examples of ensemble methods that manipulate their training sets. These methods are described in further detail in Sections 5.6.4 and 5.6.5.
2. By manipulating the input features. In this approach, a subset of input features is chosen to form each training set. The subset can be either chosen randomly or based on the recommendation of domain experts. Some studies have shown that this approach works very well with data sets that contain highly redundant features. Random forest, which is described in Section 5.6.6, is an ensemble method that manipulates its input features and uses decision trees as its base classifiers.
3. By manipulating the class labels. This method can be used when the number of classes is sufficiently large. The training data is transformed into a binary class problem by randomly partitioning the class labels into two disjoint subsets, A0 and A1. Training examples whose class label belongs to the subset A0 are assigned to class 0, while those that belong to the subset A1 are assigned to class 1. The relabeled examples are then used to train a base classifier. By repeating the class-relabeling and model-building steps multiple times, an ensemble of base classifiers is obtained. When a test example is presented, each base classifier Ci is used to predict its class label. If the test example is predicted as class 0, then all the classes that belong to A0 will receive a vote. Conversely, if it is predicted to be class 1, then all the classes that belong to A1 will receive a vote. The votes are tallied and the class that receives the highest vote is assigned to the test example (a code sketch of this scheme follows the list). An example of this approach is the error-correcting output coding method described on page 307.
4. By manipulating the learning algorithm. Many learning algorithms can be manipulated in such a way that applying the algorithm several times on the same training data may result in different models. For example, an artificial neural network can produce different models by changing its network topology or the initial weights of the links between neurons. Similarly, an ensemble of decision trees can be constructed by injecting randomness into the tree-growing procedure. For example, instead of choosing the best splitting attribute at each node, we can randomly choose one of the top k attributes for splitting.
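As promised in item 3, here is a minimal sketch of the class-relabeling scheme. It assumes scikit-learn's DecisionTreeClassifier as the base learner; the number of rounds and the random partitioning rule are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def relabel_ensemble(X, y, n_rounds=15, seed=0):
    """Train binary base classifiers on random class partitions (item 3)."""
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    models = []
    for _ in range(n_rounds):
        # Randomly split the class labels into disjoint subsets A0 and A1.
        mask = rng.random(len(classes)) < 0.5
        A1 = set(classes[mask])
        y_bin = np.isin(y, list(A1)).astype(int)  # relabel to {0, 1}
        clf = DecisionTreeClassifier(random_state=seed).fit(X, y_bin)
        models.append((clf, A1))
    return classes, models

def predict(x, classes, models):
    """Each base classifier votes for every class in its predicted subset."""
    votes = {c: 0 for c in classes}
    for clf, A1 in models:
        pred = clf.predict(x.reshape(1, -1))[0]
        subset = A1 if pred == 1 else set(classes) - A1
        for c in subset:
            votes[c] += 1
    return max(votes, key=votes.get)  # class with the highest vote
```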
The first three approaches are generic methods that are applicable to any classifier, whereas the fourth approach depends on the type of classifier used. The base classifiers for most of these approaches can be generated sequentially (one after another) or in parallel (all at once). Algorithm 5.5 shows the steps needed to build an ensemble classifier in a sequential manner. The first step is to create a training set from the original data D. Depending on the type of ensemble method used, the training sets are either identical to or slight modifications of D. The size of the training set is often kept the same as the original data, but the distribution of examples may not be identical; i.e., some examples may appear multiple times in the training set, while others may not appear even once. A base classifier Ci is then constructed from each training set Di. Ensemble methods work better with unstable classifiers, i.e., base classifiers that are sensitive to minor perturbations in the training set. Examples of unstable classifiers include decision trees, rule-based classifiers, and artificial neural networks. As will be discussed in Section 5.6.3, the variability among training examples is one of the primary sources of error in a classifier; aggregating base classifiers built from different training sets helps to reduce such errors.
Finally, a test example x is classified by combining the predictions made by the base classifiers $C_i(\mathbf{x})$:

$$
C^*(\mathbf{x}) = \mathit{Vote}(C_1(\mathbf{x}), C_2(\mathbf{x}), \ldots, C_k(\mathbf{x})).
$$
The class can be obtained by taking a majority vote on the individual predictions or by weighting each prediction with the accuracy of the base classifier.
Algorithm 5.5 General procedure for ensemble method.
1: Let D denote the original training data, k denote the number of base classifiers, and T be the test data.
2: for i = 1 to k do
3:   Create training set Di from D.
4:   Build a base classifier Ci from Di.
5: end for
6: for each test record x ∈ T do
7:   C*(x) = Vote(C1(x), C2(x), ..., Ck(x))
8: end for
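Algorithm 5.5 translates almost line for line into Python. The sketch below fixes two details the pseudocode leaves open, both assumptions for illustration: the training sets Di are created by bootstrap resampling, and build_classifier is any callable that fits a model (e.g., lambda X, y: DecisionTreeClassifier().fit(X, y)). A weighted vote, as mentioned above, would weight each prediction by the base classifier's accuracy instead of counting votes equally.

```python
import numpy as np
from collections import Counter

def create_training_set(X, y, rng):
    """One possible scheme (line 3): a bootstrap sample of the data."""
    idx = rng.integers(0, len(X), size=len(X))  # sample with replacement
    return X[idx], y[idx]

def build_ensemble(X, y, build_classifier, k=10, seed=0):
    """Lines 2-5 of Algorithm 5.5: build k base classifiers."""
    rng = np.random.default_rng(seed)
    return [build_classifier(*create_training_set(X, y, rng))
            for _ in range(k)]

def vote(classifiers, x):
    """Lines 6-8: majority vote over the base classifiers' predictions."""
    preds = [clf.predict(x.reshape(1, -1))[0] for clf in classifiers]
    return Counter(preds).most_common(1)[0][0]
```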
5.6.3 Bias-Variance Decomposition
Bias-variance decomposition is a formal method for analyzing the prediction error of a predictive model. The following example gives an intuitive explanation for this method.

Figure 5.32 shows the trajectories of a projectile launched at a particular angle. Suppose the projectile hits the floor surface at some location x, at a distance d away from the target position t. Depending on the force applied to the projectile, the observed distance may vary from one trial to another. The observed distance can be decomposed into several components. The first component, which is known as bias, measures the average distance between the target position and the location where the projectile hits the floor. The amount of bias depends on the angle of the projectile launcher. The second component, which is known as variance, measures the deviation between x and the average position x̄ where the projectile hits the floor. The variance can be explained as a result of changes in the amount of force applied to the projectile. Finally, if the target is not stationary, then the observed distance is also affected by changes in the location of the target. This is considered the noise component associated with variability in the target position. Putting these components together, the average distance can be expressed as:

$$
d_{f,\theta}(y, t) = \mathit{Bias}_\theta + \mathit{Variance}_f + \mathit{Noise}_t, \tag{5.67}
$$

where f refers to the amount of force applied and θ is the angle of the launcher.
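The decomposition can be made concrete with a small simulation. The sketch below is purely illustrative: the landing model, force distribution, and noise level are invented for the example, and the three terms of Equation 5.67 are estimated from repeated trials.

```python
import numpy as np

rng = np.random.default_rng(42)
n_trials = 100_000

# Invented landing model: distance traveled grows with the applied
# force f; the launcher angle theta fixes a systematic offset.
target = 10.0 + rng.normal(0, 0.2, n_trials)   # moving target (noise)
force = rng.normal(5.0, 0.5, n_trials)         # varying force (variance)
landing = 1.8 * force + 0.5                    # where the projectile lands

bias = abs(landing.mean() - 10.0)                   # systematic offset
variance = np.abs(landing - landing.mean()).mean()  # spread around mean
noise = np.abs(target - 10.0).mean()                # target variability

print(f"bias={bias:.3f}  variance={variance:.3f}  noise={noise:.3f}")
```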
The task of predicting the class label of a given example can be analyzed using the same approach. For a given classifier, some predictions may turn out to be correct, while others may be completely off the mark. We can decompose the expected error of a classifier as a sum of the three terms given in Equation 5.67, where expected error is the probability that the classifier misclassifies a given example.
Figure 5.32. Bias-variance decomposition, showing the "Bias", "Variance", and "Noise" components of the projectile's observed distance.
The remainder of this section examines the meaning of bias, variance, and noise in the context of classification.
A classifier is usually trained to minimize its training error. However, to be useful, the classifier must be able to make an informed guess about the class labels of examples it has never seen before. This requires the classifier to generalize its decision boundary to regions where there are no training examples available, a decision that depends on the design choice of the classifier. For example, a key design issue in decision tree induction is the amount of pruning needed to obtain a tree with low expected error. Figure 5.33 shows two decision trees, T1 and T2, that are generated from the same training data, but have different complexities. T2 is obtained by pruning T1 until a tree with maximum depth of two is obtained. T1, on the other hand, performs very little pruning on its decision tree. These design choices will introduce a bias into the classifier that is analogous to the bias of the projectile launcher described in the previous example. In general, the stronger the assumptions made by a classifier about the nature of its decision boundary, the larger the classifier's bias will be. T2 therefore has a larger bias because it makes stronger assumptions about its decision boundary (which is reflected by the size of the tree) compared to T1. Other design choices that may introduce a bias into a classifier include the network topology of an artificial neural network and the number of neighbors considered by a nearest-neighbor classifier.
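As a rough illustration of this point, the sketch below (assuming scikit-learn; the data-generating boundary is invented) compares a tree capped at depth two, playing the role of T2, against an unpruned tree, playing the role of T1. The shallow tree's few axis-parallel splits cannot track a curved boundary closely, and its lower training accuracy is one symptom of the stronger assumptions, i.e., the larger bias.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Illustrative data: labels follow a curved boundary, which a
# depth-two tree's axis-parallel splits cannot represent exactly.
rng = np.random.default_rng(7)
X = rng.uniform(-3, 3, size=(500, 2))
y = (X[:, 1] > np.sin(X[:, 0])).astype(int)

shallow = DecisionTreeClassifier(max_depth=2).fit(X, y)   # like T2
full = DecisionTreeClassifier().fit(X, y)                 # like T1

print("depth-2 training accuracy:", shallow.score(X, y))
print("unpruned training accuracy:", full.score(X, y))
```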
The expected error of a classifier is also affected by variability in the training data because different compositions of the training set may lead to different decision boundaries. This is analogous to the variance in x when different amounts of force are applied to the projectile. The last component of the expected error is associated with the intrinsic noise in the target class. The target class for some domains can be non-deterministic; i.e., instances with the same attribute values can have different class labels. Such errors are unavoidable even when the true decision boundary is known.
The amount of bias and variance contributing to the expected error depend on the type of classifier used. Figure 5.34 compares the decision boundaries produced by a decision tree and a 1-nearest neighbor classifier. For each classifier, we plot the decision boundary obtained by "averaging" the models induced from 100 training sets, each containing 100 examples. The true decision boundary from which the data is generated is also plotted using a dashed line. The difference between the true decision boundary and the "averaged" decision boundary reflects the bias of the classifier. After averaging the models, observe that the difference between the true decision boundary and the decision boundary produced by the 1-nearest neighbor classifier is smaller than
Figure 5.33. Two decision trees, T1 and T2, with different complexities induced from the same training data (splits include x1 < -1.24 and x1 < 11.00).