
Gini index as its impurity measure is then applied to the training set. To investigate the effect of overfitting, different levels of pruning are applied to the initial, fully-grown tree. Figure 4.23(b) shows the training and test error rates of the decision tree.
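As a reminder of how the impurity measure behaves, the Gini index of a set of records depends only on its class distribution. The following sketch (illustrative code, not from this chapter) computes it for a list of class labels:

```python
from collections import Counter

def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

# A pure node has impurity 0; an even two-class split has the maximum, 0.5.
print(gini(["mammal"] * 4))                        # 0.0
print(gini(["mammal"] * 5 + ["non-mammal"] * 5))   # 0.5
```

Because splits are chosen to reduce this quantity, a fully-grown tree drives the impurity of its leaves toward zero, which is exactly what invites the overfitting discussed below.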

[Figure: scatter plot of the training set.]

Chapter 4 Classification

Figure 4.23. Training and test error rates (error rate versus number of nodes).

Notice that the training and test error rates of the model are large when the size of the tree is very small. This situation is known as model underfitting. Underfitting occurs because the model has yet to learn the true structure of the data. As a result, it performs poorly on both the training and the test sets. As the number of nodes in the decision tree increases, the tree will have fewer training and test errors. However, once the tree becomes too large, its test error rate begins to increase even though its training error rate continues to decrease. This phenomenon is known as model overfitting.

To understand the overfitting phenomenon, note that the training error of a model can be reduced by increasing the model complexity. For example, the leaf nodes of the tree can be expanded until it perfectly fits the training data. Although the training error for such a complex tree is zero, the test error can be large because the tree may contain nodes that accidentally fit some of the noise points in the training data. Such nodes can degrade the performance of the tree because they do not generalize well to the test examples. Figure 4.24 shows the structure of two decision trees with different numbers of nodes. The tree that contains the smaller number of nodes has a higher training error rate, but a lower test error rate compared to the more complex tree.

Overfitting and underfitting are two pathologies that are related to model complexity. The remainder of this section examines some of the potential causes of model overfitting.
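The effect described above can be reproduced with a small simulation. In the sketch below (hypothetical data, not the data set of Figure 4.23), a model that memorizes every training record, here a 1-nearest-neighbor classifier, attains zero training error on noisy labels but a higher test error than a simple single-threshold model:

```python
import random

random.seed(42)

# Hypothetical 1-D task: the true concept is "x > 5.0", but 20% of the
# training labels are flipped to simulate noise in the training data.
def make_data(n, noise):
    data = []
    for _ in range(n):
        x = random.random() * 10
        y = x > 5.0
        if random.random() < noise:
            y = not y
        data.append((x, y))
    return data

train = make_data(300, noise=0.2)
test = make_data(500, noise=0.0)

def error_rate(predict, data):
    return sum(predict(x) != y for x, y in data) / len(data)

# Simple model: the single threshold that minimizes training error.
def fit_threshold(data):
    best_t, best_err = None, 2.0
    for t, _ in data:
        err = error_rate(lambda x, t=t: x > t, data)
        if err < best_err:
            best_t, best_err = t, err
    return lambda x: x > best_t

# Complex model: 1-nearest neighbor, which memorizes every training
# record, including the mislabeled ones.
def fit_1nn(data):
    return lambda x: min(data, key=lambda p: abs(p[0] - x))[1]

simple, complex_ = fit_threshold(train), fit_1nn(train)
print(error_rate(simple, train), error_rate(simple, test))
print(error_rate(complex_, train), error_rate(complex_, test))
```

The memorizing model reaches zero training error yet roughly tracks the 20% noise rate on the test set, while the simple threshold tolerates some training error and generalizes far better.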









4.4 Model Overfitting

(a) Decision tree with 11 leaf nodes.

(b) Decision tree with 24 leaf nodes.

Figure 4.24. Decision trees with different model complexities.

4.4.1 Overfitting Due to Presence of Noise

Consider the training and test sets shown in Tables 4.3 and 4.4 for the mammal classification problem. Two of the ten training records are mislabeled: bats and whales are classified as non-mammals instead of mammals.

A decision tree that perfectly fits the training data is shown in Figure 4.25(a). Although the training error for the tree is zero, its error rate on

Table 4.3. An example training set for classifying mammals. Class labels with asterisk symbols represent mislabeled records.

Name            Body Temperature   Gives Birth   Four-legged   Hibernates   Class Label
porcupine       warm-blooded       yes           yes           yes          yes
cat             warm-blooded       yes           yes           no           yes
bat             warm-blooded       yes           no            yes          no*
whale           warm-blooded       yes           no            no           no*
salamander      cold-blooded       no            yes           yes          no
komodo dragon   cold-blooded       no            yes           no           no
python          cold-blooded       no            no            yes          no
salmon          cold-blooded       no            no            no           no
eagle           warm-blooded       no            no            no           no
guppy           cold-blooded       yes           no            no           no


(a) Model M1. (b) Model M2.

Figure 4.25. Decision tree induced from the data set shown in Table 4.3.

the test set is 30%. Both humans and dolphins were misclassified as non-mammals because their attribute values for Body Temperature, Gives Birth, and Four-legged are identical to the mislabeled records in the training set. Spiny anteaters, on the other hand, represent an exceptional case in which the class label of a test record contradicts the class labels of other similar records in the training set. Errors due to exceptional cases are often unavoidable and establish the minimum error rate achievable by any classifier.

Table 4.4. An example test set for classifying mammals.

Name             Body Temperature   Gives Birth   Four-legged   Hibernates   Class Label
human            warm-blooded       yes           no            no           yes
pigeon           warm-blooded       no            no            no           no
elephant         warm-blooded       yes           yes           no           yes
leopard shark    cold-blooded       yes           no            no           no
turtle           cold-blooded       no            yes           no           no
penguin          cold-blooded       no            no            no           no
eel              cold-blooded       no            no            no           no
dolphin          warm-blooded       yes           no            no           yes
spiny anteater   warm-blooded       no            yes           yes          yes
gila monster     cold-blooded       no            yes           yes          no


In contrast, the decision tree M2 shown in Figure 4.25(b) has a lower test error rate (10%) even though its training error rate is somewhat higher (20%). It is evident that the first decision tree, M1, has overfitted the training data because there is a simpler model with lower error rate on the test set. The Four-legged attribute test condition in model M1 is spurious because it fits the mislabeled training records, which leads to the misclassification of records in the test set.
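The error rates quoted for M1 and M2 can be verified directly from Tables 4.3 and 4.4. The sketch below hand-codes the two trees as rules; the attribute values are transcribed from the tables, and the name of the first training record, illegible in this copy, is taken to be porcupine:

```python
# Records: (name, body_temp, gives_birth, four_legged, hibernates, is_mammal)
# Bat and whale carry the mislabeled class values, as in Table 4.3.
train = [
    ("porcupine",      "warm", True,  True,  True,  True),
    ("cat",            "warm", True,  True,  False, True),
    ("bat",            "warm", True,  False, True,  False),  # mislabeled
    ("whale",          "warm", True,  False, False, False),  # mislabeled
    ("salamander",     "cold", False, True,  True,  False),
    ("komodo dragon",  "cold", False, True,  False, False),
    ("python",         "cold", False, False, True,  False),
    ("salmon",         "cold", False, False, False, False),
    ("eagle",          "warm", False, False, False, False),
    ("guppy",          "cold", True,  False, False, False),
]
test = [
    ("human",          "warm", True,  False, False, True),
    ("pigeon",         "warm", False, False, False, False),
    ("elephant",       "warm", True,  True,  False, True),
    ("leopard shark",  "cold", True,  False, False, False),
    ("turtle",         "cold", False, True,  False, False),
    ("penguin",        "cold", False, False, False, False),
    ("eel",            "cold", False, False, False, False),
    ("dolphin",        "warm", True,  False, False, True),
    ("spiny anteater", "warm", False, True,  True,  True),
    ("gila monster",   "cold", False, True,  True,  False),
]

def m1(r):  # includes the spurious Four-legged test
    _, temp, gives_birth, four_legged, _, _ = r
    return temp == "warm" and gives_birth and four_legged

def m2(r):  # simpler tree without the Four-legged test
    _, temp, gives_birth, _, _, _ = r
    return temp == "warm" and gives_birth

def error_rate(model, records):
    return sum(model(r) != r[-1] for r in records) / len(records)

print(error_rate(m1, train), error_rate(m1, test))  # 0.0 0.3
print(error_rate(m2, train), error_rate(m2, test))  # 0.2 0.1
```

M1 fits the two mislabeled records exactly and pays for it on the test set, while M2 absorbs them as training errors and generalizes better.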

4.4.2 Overfitting Due to Lack of Representative Samples

Models that make their classification decisions based on a small number of training records are also susceptible to overfitting. Such models can be generated because of a lack of representative samples in the training data and learning algorithms that continue to refine their models even when few training records are available. We illustrate these effects in the example below.

Consider the five training records shown in Table 4.5. All of these training records are labeled correctly and the corresponding decision tree is depicted in Figure 4.26. Although its training error is zero, its error rate on the test set is 30%.

Table 4.5. An example training set for classifying mammals.

Name         Body Temperature   Gives Birth   Four-legged   Hibernates   Class Label
salamander   cold-blooded       no            yes           yes          no
guppy        cold-blooded       yes           no            no           no
eagle        warm-blooded       no            no            no           no
poorwill     warm-blooded       no            no            yes          no
platypus     warm-blooded       no            yes           yes          yes

Humans, elephants, and dolphins are misclassified because the decision tree classifies all warm-blooded vertebrates that do not hibernate as non-mammals. The tree arrives at this classification decision because there is only one training record, which is an eagle, with such characteristics. This example clearly demonstrates the danger of making wrong predictions when there are not enough representative examples at the leaf nodes of a decision tree.
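This claim can also be checked mechanically. The sketch below encodes a tree consistent with Table 4.5 and the description above (cold-blooded or non-hibernating warm-blooded animals are classified as non-mammals, and the Four-legged attribute decides the rest; the exact split order is an assumption read off the training records):

```python
def tree(temp, four_legged, hibernates):
    """Decision tree of Figure 4.26, reconstructed to fit Table 4.5."""
    if temp != "warm":
        return False       # cold-blooded -> non-mammal
    if not hibernates:
        return False       # warm-blooded, does not hibernate -> non-mammal
    return four_legged     # hibernating warm-blooded: Four-legged decides

# (temp, four_legged, hibernates, is_mammal), from Tables 4.5 and 4.4
train = [("cold", True, True, False), ("cold", False, False, False),
         ("warm", False, False, False), ("warm", False, True, False),
         ("warm", True, True, True)]
test = [("warm", False, False, True), ("warm", False, False, False),
        ("warm", True, False, True), ("cold", False, False, False),
        ("cold", True, False, False), ("cold", False, False, False),
        ("cold", False, False, False), ("warm", False, False, True),
        ("warm", True, True, True), ("cold", True, True, False)]

err = lambda data: sum(tree(*r[:-1]) != r[-1] for r in data) / len(data)
print(err(train), err(test))  # 0.0 0.3
```

The three test errors are exactly the non-hibernating mammals (human, elephant, dolphin), matching the 30% test error rate quoted above.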




Figure 4.26. Decision tree induced from the data set shown in Table 4.5.

4.4.3 Overfitting and the Multiple Comparison Procedure

Model overfitting may arise in learning algorithms that employ a methodology known as multiple comparison procedure. To understand multiple comparison procedure, consider the task of predicting whether the stock market will rise or fall in the next ten trading days. If a stock analyst simply makes random guesses, the probability that her prediction is correct on any trading day is 0.5. However, the probability that she will predict correctly at least eight out of the ten times is

$$\frac{\binom{10}{8} + \binom{10}{9} + \binom{10}{10}}{2^{10}} = 0.0547,$$

which seems quite unlikely.

Suppose we are interested in choosing an investment advisor from a pool of fifty stock analysts. Our strategy is to select the analyst who makes the most correct predictions in the next ten trading days. The flaw in this strategy is that even if all the analysts had made their predictions in a random fashion, the probability that at least one of them makes at least eight correct predictions is

$$1 - (1 - 0.0547)^{50} = 0.9399,$$

which is very high. Although each analyst has a low probability of predicting at least eight times correctly, putting them together, we have a high probability of finding an analyst who can do so. Furthermore, there is no guarantee in the future that such an analyst will continue to make accurate predictions through random guessing.
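Both probabilities are straightforward to reproduce:

```python
from math import comb

# P(a single analyst guesses correctly on at least 8 of 10 days)
p_one = sum(comb(10, k) for k in (8, 9, 10)) / 2**10
# P(at least one of 50 independent random-guessing analysts does so)
p_any = 1 - (1 - p_one) ** 50
print(round(p_one, 4), round(p_any, 4))  # 0.0547 0.9399
```

The second calculation is the multiple comparison effect in miniature: a rare event per candidate becomes near-certain once enough candidates are compared.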
