The generalization error can also be estimated as a statistical correction to the training error. Since generalization error tends to be larger than training error, the statistical correction is usually computed as an upper bound to the training error, taking into account the number of training records that reach a particular leaf node. For instance, in the C4.5 decision tree algorithm, the number of errors committed by each Ieaf node is assumed to follow a binomial distribution. To compute its generalization error, we must determine the upper bound limit to the observed training error, as illustrated in the next example.

Example 4.3. Consider the left-most branch of the binary decision trees shown in Figure 4.27. Observe that the left-most leaf node of Tp has been expanded into two child nodes in 72. Before splitting, the error rate of the node is 217 :0.286. By approximating a binomial distribution with a normal distribution, the following upper bound of the error rate e can be derived:

euw. r (N re rc r ) :

_2 T: .2 ,+7#a”,/2\,1t1+=/ +ff i

, (4 .10), 1 -l-

-d/2

r r l y ‘

where c is the confidence \evel, zo12 is the standardized value from a standard normal distribution, and -Ay’ is the total number of training records used to compute e. By replacing a : 25To, N : 7, and e : 217, the upper bound for the error rate is eupp.r(7,217,0.25):0.503, which corresponds to 7 x 0.503: 3.521 errors. If we expand the node into its child nodes as shown in ft,, the training error rates for the child nodes are Lf 4: 0.250 and 1/3 : 0.333,

L84 Chapter 4 Classification

respectively. Using Equation 4.10, the upper bounds of these error rates are eupp.,(4,714,0.25): 0.537 and e,,*” ,(3,1f 3,0.25) : 0.650, respect ively. The overall training error of the child nodes is 4 x 0.537+3 x 0.650:4.098, which is larger than the estimated error for the corresponding node in 76. r

Using a Validation Set

In this approach, instead of using the training set to estimate the generalization error, the original training data is divided into two smaller subsets. One of the subsets is used for training, while the other, known as the validation set, is used for estimating the generalization error. Typically, two-thirds of the training set is reserved for model building, while the remaining one-third is used for error estimation.

This approach is typically used with classification techniques that can be parameterized to obtain models with different levels of complexity. The com- plexity of the best model can be estimated by adjusting the parameter of the learning algorithm (e.g., the pruning level of a decision tree) until the empir- ical model produced by the learning algorithm attains the lowest error rate on the validation set. Although this approach provides a better way for esti- mating how well the model performs on previously unseen records, less data is available for training.

4.4.5 Handling Overfitting in Decision Tree Induction

In the previous section, we described several methods for estimating the gen- eralization error of a classification model. Having a reliable estimate of gener- alization error allows the learning algorithm to search for an accurate model without overfitting the training data. This section presents two strategies for avoiding model overfitting in the context of decision tree induction.

Prepruning (Early Stopping Rule) In this approach, the tree-growing algorithm is halted before generating a fully grown tree that perfectly fits the entire training data. To do this, a more restrictive stopping condition must be used; e.g., stop expanding a leaf node when the observed gain in impurity measure (or improvement in the estimated generalization error) falls below a certain threshold. The advantage of this approach is that it avoids generating overly complex subtrees that overfit the training data. Nevertheless, it is difficult to choose the right threshold for early termination. Too high of a threshold will result in underfitted models, while a threshold that is set too low may not be sufficient to overcome the model overfitting problem. Furthermore,

Model Overfitting L85

Figure 4,29. Post-pruning of the decision tree for Web robot detection,

even if no significant gain is obtained using one of the existing attribute test conditions, subsequent splitting may result in better subtrees.

Post-pruning In this approach, the decision tree is initially grown to its

maximum size. This is followed by a tree-pruning step, which proceeds to trim the fully grown tree in a bottom-up fashion. Ttimming can be done by replacing a subtree with (1) a new leaf node whose class label is determined from the majority class of records affiliated with the subtree, or (2) the most frequently used branch of the subtree. The tree-pruning step terminates when no further improvement is observed. Post-pruning tends to give better results than prepruning because it makes pruning decisions based on a fully grown

tree, unlike prepruning, which can suffer from premature termination of the tree-growing process. However, for post-pruning, the additional computations needed to grow the full tree may be wasted when the subtree is pruned.

Figure 4.29 illustrates the simplified decision tree model for the Web robot

detection example given in Section 4.3.6. Notice that the subtrees rooted at

4 .4

lmagePages> 0.375: class 0 lmagePages<= 0.375:

totalPages> 6: I breadth <= 1: class 1 I breadth > 1: class 0

1\Lul!ilP-=-o—- t ilmagePages:f o.i$3: Gtas-sT’i I i lmagePages> 0.1 333: I i breadth <= 6: class 0 i I Lbjqaltl

“]j-

jbgs_l_ _ _ _ _ _ _ _l Mul t i lP = 1: I TotalTime <= 361: class 0 I TotalTime > 361: class 1

NLul!i4s_e!t_=_ot _ I ideoth”iz-:

– c-la-ss-il

– – – – – – – – – – – – | ^ l

| | deplh <= 2: I i I MultilP = 1: class 0 l r l M u l t i l P = 0 : I I | | breadth <= 6: class 0 l r l I b read th>6 : lll | | RepeatedAccess<=0.322: classo i liLIlEgp_eet9dAc9es1>_q jq2_2i_c!_a9s_1__l MultiAgent = 1: I tota lPages<=81: c lass0 I totalPages > 81: class I

Simplified Decision Tree:

depth = 1: t iimageP-a-ges-<—o leds:

-c-tass T–l

I ilmagePages > 0.1333: I I breadth <= 6: class 0

I i I breadth > 6: class I

| | tota lPages<=81: c lass0 | | tota lPages>81: c lassl

deDth > 1:

uMuiriAsaT = oa EE;to ——‘l

T tr,tuniaGn-t-=l:

186 Chapter 4 Classification

depth : t have been replaced by one of the branches involving the attribute InagePages. This approach is also known as subtree raising. The depth ) 1 and MultiAgent : 0 subtree has been replaced by a leaf node assigned to class 0. This approach is known as subtree replacement. The subtree for depth ) 1 and MultiAgent : 1 remains intact.

4.5 Evaluating the Performance of a Classifier

Section 4.4.4 described several methods for estimating the generalization error of a model during training. The estimated error helps the learning algorithm to do model selection; i.e., to find a model of the right complexity that is not susceptible to overfitting. Once the model has been constructed, it can be applied to the test set to predict the class labels of previously unseen records.

It is often useful to measure the performance of the model on the test set because such a measure provides an unbiased estimate of its generalization error. The accuracy or error rate computed from the test set can also be used to compare the relative performance of different classifiers on the same domain. However, in order to do this, the class Iabels of the test records must be known. This section reviews some of the methods commonly used to evaluate the performance of a classifier.

4.5.L Holdout Method

In the holdout method, the original data with labeled examples is partitioned into two disjoint sets, called the training and the test sets, respectively. A classification model is then induced from the training set and its performance is evaluated on the test set. The proportion of data reserved for training and for testing is typically at the discretion of the analysts (e.g., 50-50 or two- thirds for training and one-third for testing). The accuracy of the classifier can be estimated based on the accuracy of the induced model on the test set.

The holdout method has several well-known limitations. First, fewer la- beled examples are available for training because some of the records are with- held for testing. As a result, the induced model may not be as good as when all the labeled examples are used for training. Second, the model may be highly dependent on the composition of the training and test sets. The smaller the training set size, the larger the variance of the model. On the other hand, if the training set is too large, then the estimated accuracy computed from the smaller test set is Iess reliable. Such an estimate is said to have a wide con- fidence interval. Finally, the training and test sets are no longer independent

4.5 Evaluating the Performance of a Classifier 187

of each other. Because the training and test sets are subsets of the original data, a class that is overrepresented in one subset will be underrepresented in the other, and vice versa.

4.5.2 Random Subsampling

The holdout method can be repeated several times to improve the estimation of a classifier’s performance. This approach is known as random subsampling. Let acc.i be the model accuracy during the i,th iteration. The overall accuracy is given by acq,,6 : Ditaccif k. Random subsampling still encounters some of the problems associated with the holdout method because it does not utilize

as much data as possible for training. It also has no control over the number of times each record is used for testing and training. Consequently, some records might be used for training more often than others.

4.5.3 Cross-Validation

An alternative to random subsampling is cross-validation. In this approach,

each record is used the same number of times for training and exactly once for testing. To illustrate this method, suppose we partition the data into two equal-sized subsets. First, we choose one of the subsets for training and the other for testing. We then swap the roles of the subsets so that the previous

training set becomes the test set and vice versa. This approach is called a two- fold cross-validation. The total error is obtained by summing up the errors for both runs. In this example, each record is used exactly once for training and once for testing. The k-fold cross-validation method generalizes this approach

by segmenting the data into k equal-sized partitions. During each run, one of the partitions is chosen for testing, while the rest of them are used for training.

This procedure is repeated k times so that each partition is used for testing exactly once. Again, the total error is found by summing up the errors for all k runs. A special case of the k-fold cross-validation method sets k : N,

the size of the data set. In this so-called leave-one-out approach, each test set contains only one record. This approach has the advantage of utilizing

as much data as possible for training. In addition, the test sets are mutually exclusive and they effectively cover the entire data set. The drawback of this approach is that it is computationally expensive to repeat the procedure ly’

times. Furthermore, since each test set contains only one record, the variance

of the estimated performance metric tends to be high.

188 Chapter 4 Classification

4.5.4 Bootstrap