A decision tree induction algorithm that uses the Gini index as its impurity measure is then applied to the training set. To investigate the effect of overfitting, different levels of pruning are applied to the initial, fully-grown tree. Figure 4.23 shows the training and test error rates of the decision tree.
Figure 4.23. Training and test error rates versus the number of nodes in the decision tree.
Notice that the training and test error rates of the model are large when the size of the tree is very small. This situation is known as model underfitting. Underfitting occurs because the model has yet to learn the true structure of the data. As a result, it performs poorly on both the training and the test sets. As the number of nodes in the decision tree increases, the tree will have fewer training and test errors. However, once the tree becomes too large, its test error rate begins to increase even though its training error rate continues to decrease. This phenomenon is known as model overfitting.
To understand the overfitting phenomenon, note that the training error of a model can be reduced by increasing the model complexity. For example, the leaf nodes of the tree can be expanded until it perfectly fits the training data. Although the training error for such a complex tree is zero, the test error can be large because the tree may contain nodes that accidentally fit some of the noise points in the training data. Such nodes can degrade the performance of the tree because they do not generalize well to the test examples. Figure 4.24 shows the structure of two decision trees with different numbers of nodes. The tree that contains the smaller number of nodes has a higher training error rate, but a lower test error rate, compared to the more complex tree.
Overfitting and underfitting are two pathologies that are related to model complexity. The remainder of this section examines some of the potential causes of model overfitting.
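The qualitative behavior in Figure 4.23 can be reproduced with any decision tree learner whose size can be controlled. The following sketch assumes scikit-learn and uses a synthetic two-class data set with label noise as a stand-in for the data set described above (the data set, feature count, and leaf-node grid are our own choices, not the book's); it sweeps the maximum number of leaf nodes and prints training and test errors, which typically show underfitting for very small trees and a rising test error once the tree grows large enough to fit the noise.

```python
# Sketch of the experiment behind Figure 4.23 (assumed setup, not the
# book's actual data): grow Gini-based trees of increasing size and
# compare training error against test error.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data with 10% label noise (flip_y) so that large
# trees have noise points to overfit.
X, y = make_classification(n_samples=2000, n_features=2, n_informative=2,
                           n_redundant=0, flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

for max_leaves in [2, 5, 10, 25, 50, 100, 200]:
    tree = DecisionTreeClassifier(criterion="gini",
                                  max_leaf_nodes=max_leaves,
                                  random_state=0).fit(X_train, y_train)
    train_err = 1 - tree.score(X_train, y_train)   # error rate = 1 - accuracy
    test_err = 1 - tree.score(X_test, y_test)
    print(f"{tree.tree_.node_count:4d} nodes: "
          f"train error {train_err:.3f}, test error {test_err:.3f}")
```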
(a) Decision tree with 11 leaf nodes.
(b) Decision tree with 24 leaf nodes.
Figure 4.24. Decision trees with different model complexities.
4.4.1 Overfitting Due to Presence of Noise
Consider the training and test sets shown in Tables 4.3 and 4.4 for the mammal classification problem. Two of the ten training records are mislabeled: bats and whales are classified as non-mammals instead of mammals.
A decision tree that perfectly fits the training data is shown in Figure 4.25(a). Although the training error for the tree is zero, its error rate on the test set is 30%.
Table 4.3. An example training set for classifying mammals. Class labels with asterisk symbols represent mislabeled records.

Name            Body Temperature   Gives Birth   Four-legged   Hibernates   Class Label
porcupine       warm-blooded       yes           yes           yes          yes
cat             warm-blooded       yes           yes           no           yes
bat             warm-blooded       yes           no            yes          no*
whale           warm-blooded       yes           no            no           no*
salamander      cold-blooded       no            yes           yes          no
komodo dragon   cold-blooded       no            yes           no           no
python          cold-blooded       no            no            yes          no
salmon          cold-blooded       no            no            no           no
eagle           warm-blooded       no            no            no           no
guppy           cold-blooded       yes           no            no           no
(a) Model M1. (b) Model M2.
Figure 4.25. Decision tree induced from the data set shown in Table 4.3.
Both humans and dolphins were misclassified as non-mammals because their attribute values for Body Temperature, Gives Birth, and Four-legged are identical to the mislabeled records in the training set. Spiny anteaters, on the other hand, represent an exceptional case in which the class label of a test record contradicts the class labels of other similar records in the training set. Errors due to exceptional cases are often unavoidable and establish the minimum error rate achievable by any classifier.
Table 4.4. An example test set for classifying mammals.

Name             Body Temperature   Gives Birth   Four-legged   Hibernates   Class Label
human            warm-blooded       yes           no            no           yes
pigeon           warm-blooded       no            no            no           no
elephant         warm-blooded       yes           yes           no           yes
leopard shark    cold-blooded       yes           no            no           no
turtle           cold-blooded       no            yes           no           no
penguin          cold-blooded       no            no            no           no
eel              cold-blooded       no            no            no           no
dolphin          warm-blooded       yes           no            no           yes
spiny anteater   warm-blooded       no            yes           yes          yes
gila monster     cold-blooded       no            yes           yes          no
In contrast, the decision tree M2 shown in Figure 4.25(b) has a lower test error rate (10%) even though its training error rate is somewhat higher (20%). It is evident that the first decision tree, M1, has overfitted the training data because there is a simpler model with a lower error rate on the test set. The Four-legged attribute test condition in model M1 is spurious because it fits the mislabeled training records, which leads to the misclassification of records in the test set.
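The error rates quoted for M1 and M2 can be checked directly against Tables 4.3 and 4.4. In the sketch below the two trees are hand-coded as Boolean rules following the description of Figure 4.25 (M1 tests Body Temperature, Gives Birth, and the spurious Four-legged condition; M2 stops after Gives Birth); the record encoding and helper names are ours, not part of the original example.

```python
# Verify the training/test error rates of models M1 and M2 on the data
# of Tables 4.3 and 4.4.  Each record is
# (name, body_temp, gives_birth, four_legged, hibernates, is_mammal).
train = [  # Table 4.3 -- bat and whale are deliberately mislabeled
    ("porcupine",     "warm", True,  True,  True,  True),
    ("cat",           "warm", True,  True,  False, True),
    ("bat",           "warm", True,  False, True,  False),  # mislabeled
    ("whale",         "warm", True,  False, False, False),  # mislabeled
    ("salamander",    "cold", False, True,  True,  False),
    ("komodo dragon", "cold", False, True,  False, False),
    ("python",        "cold", False, False, True,  False),
    ("salmon",        "cold", False, False, False, False),
    ("eagle",         "warm", False, False, False, False),
    ("guppy",         "cold", True,  False, False, False),
]
test = [  # Table 4.4
    ("human",          "warm", True,  False, False, True),
    ("pigeon",         "warm", False, False, False, False),
    ("elephant",       "warm", True,  True,  False, True),
    ("leopard shark",  "cold", True,  False, False, False),
    ("turtle",         "cold", False, True,  False, False),
    ("penguin",        "cold", False, False, False, False),
    ("eel",            "cold", False, False, False, False),
    ("dolphin",        "warm", True,  False, False, True),
    ("spiny anteater", "warm", False, True,  True,  True),
    ("gila monster",   "cold", False, True,  True,  False),
]

def m1(temp, gives_birth, four_legged, hibernates):
    # Overfitted tree: the Four-legged test fits the two mislabeled records.
    return temp == "warm" and gives_birth and four_legged

def m2(temp, gives_birth, four_legged, hibernates):
    # Simpler tree: Body Temperature followed by Gives Birth only.
    return temp == "warm" and gives_birth

def error_rate(model, records):
    wrong = sum(model(*r[1:5]) != r[5] for r in records)
    return wrong / len(records)

for name, model in [("M1", m1), ("M2", m2)]:
    print(name, "train:", error_rate(model, train),
                "test:",  error_rate(model, test))
# Expected output: M1 -> 0.0 train / 0.3 test, M2 -> 0.2 train / 0.1 test
```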
4.4.2 Overfitting Due to Lack of Representative Samples
Models that make their classification decisions based on a small number of training records are also susceptible to overfitting. Such models can be generated because of a lack of representative samples in the training data and learning algorithms that continue to refine their models even when few training records are available. We illustrate these effects in the example below.
Consider the five training records shown in Table 4.5. All of these training records are labeled correctly and the corresponding decision tree is depicted in Figure 4.26. Although its training error is zero, its error rate on the test set is 30%.
Table 4.5. An example training set for classifying mammals.

Name         Body Temperature   Gives Birth   Four-legged   Hibernates   Class Label
salamander   cold-blooded       no            yes           yes          no
guppy        cold-blooded       yes           no            no           no
eagle        warm-blooded       no            no            no           no
poorwill     warm-blooded       no            no            yes          no
platypus     warm-blooded       no            yes           yes          yes
Humans, elephants, and dolphins are misclassified because the decision tree classifies all warm-blooded vertebrates that do not hibernate as non-mammals. The tree arrives at this classification decision because there is only one training record, which is an eagle, with such characteristics. This example clearly demonstrates the danger of making wrong predictions when there are not enough representative examples at the leaf nodes of a decision tree.
Figure 4.26. Decision tree induced from the data set shown in Table 4.5.
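The behavior just described can be verified by writing out the tree of Figure 4.26 as a sequence of tests and scoring it against the test set of Table 4.4. The sketch below transcribes the tree as described in the text (cold-blooded animals are classified as non-mammals; warm-blooded animals are then tested on Hibernates and, in the branch containing the platypus and the poorwill, on Four-legged); the record encoding and helper names are ours.

```python
# Evaluate the tree of Figure 4.26 (built from the five records of
# Table 4.5) on the test set of Table 4.4.  Records are
# (name, body_temp, gives_birth, four_legged, hibernates, is_mammal).
train = [  # Table 4.5
    ("salamander", "cold", False, True,  True,  False),
    ("guppy",      "cold", True,  False, False, False),
    ("eagle",      "warm", False, False, False, False),
    ("poorwill",   "warm", False, False, True,  False),
    ("platypus",   "warm", False, True,  True,  True),
]
test = [  # Table 4.4 (same records as in the previous sketch)
    ("human",          "warm", True,  False, False, True),
    ("pigeon",         "warm", False, False, False, False),
    ("elephant",       "warm", True,  True,  False, True),
    ("leopard shark",  "cold", True,  False, False, False),
    ("turtle",         "cold", False, True,  False, False),
    ("penguin",        "cold", False, False, False, False),
    ("eel",            "cold", False, False, False, False),
    ("dolphin",        "warm", True,  False, False, True),
    ("spiny anteater", "warm", False, True,  True,  True),
    ("gila monster",   "cold", False, True,  True,  False),
]

def figure_4_26_tree(temp, gives_birth, four_legged, hibernates):
    # Gives Birth is never tested by this tree.
    if temp != "warm":
        return False          # cold-blooded -> non-mammal
    if not hibernates:
        return False          # leaf supported only by the eagle record
    return four_legged        # separates platypus (mammal) from poorwill

def error_rate(model, records):
    return sum(model(*r[1:5]) != r[5] for r in records) / len(records)

print("train error:", error_rate(figure_4_26_tree, train))  # 0.0
print("test error:",  error_rate(figure_4_26_tree, test))   # 0.3: human, elephant, dolphin
```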
4.4.3 Overfitting and the Multiple Comparison Procedure
Model overfitting may arise in learning algorithms that employ a methodology known as multiple comparison procedure. To understand multiple comparison procedure, consider the task of predicting whether the stock market will rise or fall in the next ten trading days. If a stock analyst simply makes random guesses, the probability that her prediction is correct on any trading day is 0.5. However, the probability that she will predict correctly at least eight out of the ten times is
\[
\frac{\binom{10}{8} + \binom{10}{9} + \binom{10}{10}}{2^{10}} = 0.0547,
\]
which seems quite unlikely. Suppose we are interested in choosing an investment advisor from a pool of
fifty stock analysts. Our strategy is to select the analyst who makes the most correct predictions in the next ten trading days. The flaw in this strategy is that even if all the analysts had made their predictions in a random fashion, the probability that at least one of them makes at least eight correct predictions is
\[
1 - (1 - 0.0547)^{50} = 0.9399,
\]
which is very high. Although each analyst has a low probability of predicting at least eight times correctly, putting them together, we have a high probability of finding an analyst who can do so. Furthermore, there is no guarantee in the
future that such an analyst will continue to make accurate predictions through random guessing.
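Both probabilities in this argument are easy to verify numerically; the short sketch below uses only the Python standard library.

```python
# Check the two probabilities used in the stock-analyst example.
from math import comb

# P(a single random guesser is right on at least 8 of 10 days)
p_single = sum(comb(10, k) for k in range(8, 11)) / 2**10
print(f"one analyst:         {p_single:.4f}")        # 0.0547

# P(at least one of 50 independent random guessers does so)
p_best_of_fifty = 1 - (1 - p_single)**50
print(f"best of 50 analysts: {p_best_of_fifty:.4f}")  # 0.9399
```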