How does the multiple comparison procedure relate to model overfitting? Many learning algorithms explore a set of independent alternatives, {7a}, and then choose an alternative, ‘y*ur., that maximizes a given criterion function. The algorithm will add 7*ur” to the current model in order to improve its overall performance. This procedure is repeated until no further improvement is observed. As an example, during decision tree growing, multiple tests are performed to determine which attribute can best split the training data. The attribute that leads to the best split is chosen to extend the tree as long as the observed improvement is statistically significant.

Let ?o be the initial decision tree and Trbe the new tree after inserting an internal node for attribute r. In principle, r can be added to the tree if the observed gain, A(?s, T*), is greater than some predefined threshold a. If there is only one attribute test condition to be evaluated, then we can avoid inserting spurious nodes by choosing a Iarge enough value of a. However, in practice,

more than one test condition is available and the decision tree algorithm must choose the best attribute r,,,u* from a set of candidates, {*r,*2,. ..,rp}, to partition the data. In this situation, the algorithm is actually using a multiple comparison procedure to decide whether a decision tree should be extended. More specifically, it is testing for A(?s, T*^u*) > a instead of A(?0, T”) > o. As the number of alternatives, k, increases, so does our chance of finding A(fo, T*^u*) ) a. Unless the gain function A or threshold a is modified to account for k, the algorithm may inadvertently add spurious nodes to the model, which leads to model overfitting.

This effect becomes more pronounced when the number of training records from which z-ur. is chosen is small, because the variance of A(Tg, Tr^u*) is high when fewer examples are available for training. As a result, the probability of finding A(fo, Tr^u*) ) c increases when there are very few training records. This often happens when the decision tree grows deeper, which in turn reduces the number of records covered by the nodes and increases the likelihood of adding unnecessary nodes into the tree. Failure to compensate for the large number of alternatives or the small number of training records will therefore lead to model overfitting.

4.4.4 Estimation of Generalization Errors

Although the primary reason for overfitting is still a subject of debate, it is generally agreed that the complexity of a model has an impact on model overfitting, as was illustrated in Figure 4.23. The question is, how do we

180 Chapter 4 Classification

determine the right model complexity? The ideal complexity is that of a model that produces the lowest generalization error. The problem is that the learning algorithm has access only to the training set during model building (see Figure 4.3). It has no knowledge of the test set, and thus, does not know how well the tree will perform on records it has never seen before. The best it can do is to estimate the generalization error of the induced tree. This section presents several methods for doing the estimation.

Using Resubstitution Estimate

The resubstitution estimate approach assumes that the training set is a good representation of the overall data. Consequently, the training error, otherwise known as resubstitution error, can be used to provide an optimistic estimate for the generalization error. Under this assumption, a decision tree induction algorithm simply selects the model that produces the lowest training error rate as its final model. However, the training error is usually a poor estimate of generalization error.

Example 4.1. Consider the binary decision trees shown in Figure 4.27. As- sume that both trees are generated from the same training data and both make their classification decisions at each leaf node according to the majority class. Note that the left tree, Ty, is more complex because it expands some of the leaf nodes in the right tree, ?a. The training error rate for the left tree is e(?:7):4124:0.167, while the training error rate for the right tree is

Decision Tree, Ta Decision Tree, Ta

Figure 4.27. Example of two decision trees generated from the same training data.

4.4 Model Overfitting L81

e(Tp) :6f24:0.25. Based on their resubstitution estimate, the left

considered better than the right tree.

Incorporating Model Complexity

As previously noted, the chance for model overfitting increases as the model

becomes more complex. For this reason, we should prefer simpler models, a

strategy that agrees with a well-known principle known as Occam’s razor or

the principle of parsimony:

Definition 4.2. Occam’s Razor: Given two models with the same general-

ization errors, the simpler model is preferred over the more complex model.

Occam’s razor is intuitive because the additional components in a complex model stand a greater chance of being fitted purely by chance. In the words of

Einstein, “Everything should be made as simple as possible, but not simpler.” Next, we present two methods for incorporating model complexity into the

evaluation of classification models.

Pessimistic Error Estimate The first approach explicitly computes gener-

alization error as the sum of training error and a penalty term for model com- plexity. The resulting generalization error can be considered its pessimistic

error estimate. For instance, let n(t) be the number of training records classi-

fied by node t and e(t) be the number of misclassified records. The pessimistic

error estimate of a decision tree 7, “s(T),

can be computed as follows:

tree is I

e.(T\ : If:, [“.(t ‘) + cl(to)] – Y’

Df:rn(tt)

/ m \ 6 + 4 x 0 . 5 es \ t R) : 24

eQ) + o(r) l/,

where k is the number of leaf nodes, e(“) is the overall training error of the

decision tree, N1 is the number of training records, and O(t6) is the penalty

term associated with each node l;.

Example 4.2. Consider the binary decision trees shown in Figure 4.27. If

the penalty term is equal to 0.5, then the pessimistic error estimate for the

left tree is 4 – t7 x0 .5 7 .5: 0 .3125eg(Tr):

24 and the pessimistic error estimate for the right tree is

24

: * , : 0 ‘ 3333 ‘

182 Chapter 4 Classification

Unlabeled

The minimum description length (MDL) principle.

Thus, the left tree has a better pessimistic error rate than the right tree. For binary trees, a penalty term of 0.5 means a node should always be expanded into its two child nodes as long as it improves the classification of at least one training record because expanding a node, which is equivalent to adding 0.5 to the overall error, is less costly than committing one training error.

If fr(r) : 1 for all the nodes t, the pessimistic error estimate for the left tree is es(Tr) : 71124: 0.458, while the pessimistic error estimate for the r ight t ree is en(Tp):10124:0.4L7. The r ight t ree therefore has a better pessimistic error rate than the left tree. Thus, a node should not be expanded into its child nodes unless it reduces the misclassification error for more than one training record.

Minimum Description Length Principle Another way to incorporate model complexity is based on an information-theoretic approach known as the minimum description length or MDL principle. To illustrate this principle, consider the example shown in Figure 4.28. ln this example, both A and B are given a set of records with known attribute values x. In addition, person A knows the exact class label for each record, while person B knows none of this information. B can obtain the classification of each record by requesting that A transmits the class labels sequentially. Such a message would require @(n) bits of information, where n is the total number of records.

Alternatively, A may decide to build a classification model that summarizes the relationship between x and g. The model can be encoded in a compact

B

.)

T A

.)

T

Figure 4.28.

4.4 Model Overfitting 183

form before being transmitted to B. If the model is 100% accurate, then the cost of transmission is equivalent to the cost of encoding the model. Otherwise, A must also transmit information about which record is classified incorrectly by the model. Thus, the overall cost of transmission is

C o st(model, dat a) : C o st (model) + C o st (datalmodel), (4.e)

where the first term on the right-hand side is the cost of encoding the model, while the second term represents the cost of encoding the mislabeled records. According to the MDL principle, we should seek a model that minimizes the overall cost function. An example showing how to compute the total descrip- tion length of a decision tree is given by Exercise 9 on page 202.

Estimating Statistical Bounds