The methods presented so far assume that the training records are sampled without replacement. As a result, there are no duplicate records in the training and test sets. In the bootstrap approach, the training records are sampled with replacement; i.e., a record already chosen for training is put back into the original pool of records so that it is equally likely to be redrawn. If the original data has $N$ records, it can be shown that, on average, a bootstrap sample of size $N$ contains about 63.2% of the records in the original data. This approximation follows from the fact that the probability a record is chosen by a bootstrap sample is $1 - (1 - 1/N)^N$. When $N$ is sufficiently large, the probability asymptotically approaches $1 - e^{-1} = 0.632$. Records that are not included in the bootstrap sample become part of the test set. The model induced from the training set is then applied to the test set to obtain an estimate of the accuracy of the bootstrap sample, $\epsilon_i$. The sampling procedure is then repeated $b$ times to generate $b$ bootstrap samples.
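The 63.2% figure is easy to verify empirically. The following Python sketch (our illustration, not part of the text) draws repeated bootstrap samples and measures the fraction of distinct records each one covers:

```python
# Simulation (illustrative only): the average fraction of distinct records
# in a bootstrap sample of size N approaches 1 - 1/e, about 0.632.
import random

N = 10_000       # number of records in the original data
trials = 100     # number of bootstrap samples to draw

coverage = []
for _ in range(trials):
    # Sample N record indices with replacement from {0, ..., N-1}.
    sample = [random.randrange(N) for _ in range(N)]
    coverage.append(len(set(sample)) / N)

print(sum(coverage) / trials)  # prints a value close to 0.632
```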
There are several variations to the bootstrap sampling approach in terms of how the overall accuracy of the classifier is computed. One of the more widely used approaches is the .632 bootstrap, which computes the overall accuracy by combining the accuracies of each bootstrap sample ($\epsilon_i$) with the accuracy computed from a training set that contains all the labeled examples in the original data ($acc_s$):
$$\text{Accuracy}, \; acc_{boot} = \frac{1}{b}\sum_{i=1}^{b}\left(0.632 \times \epsilon_i + 0.368 \times acc_s\right). \qquad (4.11)$$
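A minimal sketch of the .632 bootstrap follows. The helper `train_and_score` is a hypothetical stand-in (not from the text) that induces a model on the records indexed by its first argument and returns its accuracy on the records indexed by its second argument:

```python
import random

def dot632_bootstrap(n_records, b, train_and_score):
    """Estimate accuracy via the .632 bootstrap (Equation 4.11).

    train_and_score(train_idx, test_idx) is a hypothetical callable that
    trains a model on train_idx and returns its accuracy on test_idx.
    """
    all_idx = list(range(n_records))
    # acc_s: accuracy computed from a training set containing all records.
    acc_s = train_and_score(all_idx, all_idx)
    total = 0.0
    for _ in range(b):
        # Sample with replacement; records left out form the test set.
        train = [random.randrange(n_records) for _ in range(n_records)]
        test = list(set(all_idx) - set(train))
        eps_i = train_and_score(train, test)  # accuracy of bootstrap sample i
        total += 0.632 * eps_i + 0.368 * acc_s
    return total / b
```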
4.6 Methods for Comparing Classifiers
It is often useful to compare the performance of different classifiers to determine which classifier works better on a given data set. However, depending on the size of the data, the observed difference in accuracy between two classifiers may not be statistically significant. This section examines some of the statistical tests available to compare the performance of different models and classifiers.
For illustrative purposes, consider a pair of classification models, $M_A$ and $M_B$. Suppose $M_A$ achieves 85% accuracy when evaluated on a test set containing 30 records, while $M_B$ achieves 75% accuracy on a different test set containing 5000 records. Based on this information, is $M_A$ a better model than $M_B$?
The preceding example raises two key questions regarding the statistical significance of the performance metrics:
1. Although $M_A$ has a higher accuracy than $M_B$, it was tested on a smaller test set. How much confidence can we place on the accuracy of $M_A$?
2. Is it possible to explain the difference in accuracy as a result of variations in the composition of the test sets?
The first question relates to the issue of estimating the confidence interval of a given model accuracy. The second question relates to the issue of testing the statistical significance of the observed deviation. These issues are investigated in the remainder of this section.
4.6.1 Estimating a Confidence Interval for Accuracy
To determine the confidence interval, we need to establish the probability distribution that governs the accuracy measure. This section describes an approach for deriving the confidence interval by modeling the classification task as a binomial experiment. Following is a list of characteristics of a binomial experiment:
1. The experiment consists of $N$ independent trials, where each trial has two possible outcomes: success or failure.
2. The probability of success, $p$, in each trial is constant.
An example of a binomial experiment is counting the number of heads that turn up when a coin is flipped $N$ times. If $X$ is the number of successes observed in $N$ trials, then the probability that $X$ takes a particular value is given by a binomial distribution with mean $Np$ and variance $Np(1-p)$:

$$P(X = v) = \binom{N}{v} p^v (1-p)^{N-v}.$$
For example, if the coin is fair ($p = 0.5$) and is flipped fifty times, then the probability that the head shows up 20 times is
$$P(X = 20) = \binom{50}{20}(0.5)^{20}(1 - 0.5)^{30} = 0.0419.$$

If the experiment is repeated many times, then the average number of heads expected to show up is $50 \times 0.5 = 25$, while its variance is $50 \times 0.5 \times 0.5 = 12.5$.
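The arithmetic above can be checked in a few lines, assuming SciPy is available:

```python
from scipy.stats import binom

N, p = 50, 0.5
print(binom.pmf(20, N, p))  # ~0.0419, matching P(X = 20)
print(N * p)                # mean: 25.0
print(N * p * (1 - p))      # variance: 12.5
```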
The task of predicting the class labels of test records can also be considered as a binomial experiment. Given a test set that contains $N$ records, let $X$ be the number of records correctly predicted by a model and $p$ be the true accuracy of the model. By modeling the prediction task as a binomial experiment, $X$ has a binomial distribution with mean $Np$ and variance $Np(1-p)$. It can be shown that the empirical accuracy, $acc = X/N$, also has a binomial distribution with mean $p$ and variance $p(1-p)/N$ (see Exercise 12). Although the binomial distribution can be used to estimate the confidence interval for $acc$, it is often approximated by a normal distribution when $N$ is sufficiently large. Based on the normal distribution, the following confidence interval for $acc$ can be derived:
$$P\left(-Z_{\alpha/2} \le \frac{acc - p}{\sqrt{p(1-p)/N}} \le Z_{1-\alpha/2}\right) = 1 - \alpha, \qquad (4.12)$$

where $Z_{\alpha/2}$ and $Z_{1-\alpha/2}$ are the upper and lower bounds obtained from a standard normal distribution at confidence level $(1-\alpha)$. Since a standard normal distribution is symmetric around $Z = 0$, it follows that $Z_{\alpha/2} = Z_{1-\alpha/2}$. Rearranging this inequality leads to the following confidence interval for $p$:

$$\frac{2 \times N \times acc + Z_{\alpha/2}^2 \pm Z_{\alpha/2}\sqrt{Z_{\alpha/2}^2 + 4\,N\,acc - 4\,N\,acc^2}}{2(N + Z_{\alpha/2}^2)}. \qquad (4.13)$$
The following table shows the values of $Z_{\alpha/2}$ at different confidence levels:

1 - α       0.99   0.98   0.95   0.9    0.8    0.7    0.5
Z_{α/2}     2.58   2.33   1.96   1.65   1.28   1.04   0.67
Example 4.4. Consider a model that has an accuracy of 80% when evaluated on 100 test records. What is the confidence interval for its true accuracy at a 95% confidence level? The confidence level of 95% corresponds to $Z_{\alpha/2} = 1.96$ according to the table given above. Inserting this term into Equation 4.13 yields a confidence interval between 71.1% and 86.7%. The following table shows the confidence interval when the number of records, $N$, increases:
N                     20              50              100             500             1000            5000
Confidence Interval   0.584 - 0.919   0.670 - 0.888   0.711 - 0.867   0.763 - 0.833   0.774 - 0.824   0.789 - 0.811
Note that the confidence interval becomes tighter when $N$ increases.
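As a concrete check, here is a small function implementing Equation 4.13 (the function name is ours); with $acc = 0.8$, $N = 100$, and $Z_{\alpha/2} = 1.96$ it reproduces the interval of Example 4.4:

```python
from math import sqrt

def accuracy_interval(acc, n, z):
    """Confidence interval for the true accuracy p (Equation 4.13).

    z is Z_{alpha/2}, e.g. 1.96 for a 95% confidence level.
    """
    center = 2 * n * acc + z ** 2
    spread = z * sqrt(z ** 2 + 4 * n * acc - 4 * n * acc ** 2)
    denom = 2 * (n + z ** 2)
    return (center - spread) / denom, (center + spread) / denom

print(accuracy_interval(0.8, 100, 1.96))  # ~(0.711, 0.867)
```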
4.6.2 Comparing the Performance of Two Models
Consider a pair of models, $M_1$ and $M_2$, that are evaluated on two independent test sets, $D_1$ and $D_2$. Let $n_1$ denote the number of records in $D_1$ and $n_2$ denote the number of records in $D_2$. In addition, suppose the error rate for $M_1$ on $D_1$ is $e_1$ and the error rate for $M_2$ on $D_2$ is $e_2$. Our goal is to test whether the observed difference between $e_1$ and $e_2$ is statistically significant.
Assuming that $n_1$ and $n_2$ are sufficiently large, the error rates $e_1$ and $e_2$ can be approximated using normal distributions. If the observed difference in the error rate is denoted as $d = e_1 - e_2$, then $d$ is also normally distributed with mean $d_t$, its true difference, and variance $\sigma_d^2$. The variance of $d$ can be computed as follows:

$$\hat{\sigma}_d^2 \simeq \frac{e_1(1-e_1)}{n_1} + \frac{e_2(1-e_2)}{n_2}, \qquad (4.14)$$
where $e_1(1-e_1)/n_1$ and $e_2(1-e_2)/n_2$ are the variances of the error rates. Finally, at the $(1-\alpha)\%$ confidence level, it can be shown that the confidence interval for the true difference $d_t$ is given by the following equation:
$$d_t = d \pm z_{\alpha/2}\,\hat{\sigma}_d. \qquad (4.15)$$
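A short sketch of Equations 4.14 and 4.15 (the function name is ours); applied to the numbers of Example 4.5 below, it reproduces the interval computed there:

```python
from math import sqrt

def error_diff_interval(e1, n1, e2, n2, z):
    """Confidence interval for the true difference between two error rates
    measured on independent test sets (Equations 4.14 and 4.15)."""
    d = abs(e1 - e2)
    # Equation 4.14: estimated variance of the observed difference.
    var_d = e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2
    margin = z * sqrt(var_d)  # Equation 4.15
    return d - margin, d + margin

# Numbers from Example 4.5: the interval spans zero.
print(error_diff_interval(0.15, 30, 0.25, 5000, 1.96))  # ~(-0.028, 0.228)
```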
Example 4.5. Consider the problem described at the beginning of this section. Model $M_A$ has an error rate of $e_1 = 0.15$ when applied to $n_1 = 30$ test records, while model $M_B$ has an error rate of $e_2 = 0.25$ when applied to $n_2 = 5000$ test records. The observed difference in their error rates is $d = |0.15 - 0.25| = 0.1$. In this example, we are performing a two-sided test to check whether $d_t = 0$ or $d_t \ne 0$. The estimated variance of the observed difference in error rates can be computed as follows:
$$\hat{\sigma}_d^2 = \frac{0.15(1 - 0.15)}{30} + \frac{0.25(1 - 0.25)}{5000} = 0.0043$$

or $\hat{\sigma}_d = 0.0655$. Inserting this value into Equation 4.15, we obtain the following confidence interval for $d_t$ at the 95% confidence level:

$$d_t = 0.1 \pm 1.96 \times 0.0655 = 0.1 \pm 0.128.$$
As the interval spans the value zero, we can conclude that the observed difference is not statistically significant at a 95% confidence level.
At what confidence level can we reject the hypothesis that $d_t = 0$? To do this, we need to determine the value of $Z_{\alpha/2}$ such that the confidence interval for $d_t$ does not span the value zero. We can reverse the preceding computation and look for the value $Z_{\alpha/2}$ such that $d > Z_{\alpha/2}\hat{\sigma}_d$. Replacing the values of $d$ and $\hat{\sigma}_d$ gives $Z_{\alpha/2} < 1.527$. This value first occurs when $(1 - \alpha) \le 0.936$ (for a two-sided test). The result suggests that the null hypothesis can be rejected at a confidence level of 93.6% or lower.
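A quick numeric check of this reversal, assuming SciPy is available; it recovers the 1.527 threshold and shows that the 0.936 figure is the standard normal CDF evaluated at $d/\hat{\sigma}_d$:

```python
from scipy.stats import norm

d, sigma_d = 0.1, 0.0655
z = d / sigma_d     # largest Z_{alpha/2} satisfying d > Z_{alpha/2} * sigma_d
print(z)            # ~1.527
print(norm.cdf(z))  # ~0.936, the confidence level quoted above
```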
4.6.3 Comparing the Performance of Two Classifiers
Suppose we want to compare the performance of two classifiers using the $k$-fold cross-validation approach. Initially, the data set $D$ is divided into $k$ equal-sized partitions. We then apply each classifier to construct a model from $k - 1$ of the partitions and test it on the remaining partition. This step is repeated $k$ times, each time using a different partition as the test set.
Let $M_{ij}$ denote the model induced by classification technique $L_i$ during the $j$th iteration. Note that each pair of models $M_{1j}$ and $M_{2j}$ are tested on the same partition $j$. Let $e_{1j}$ and $e_{2j}$ be their respective error rates. The difference between their error rates during the $j$th fold can be written as $d_j = e_{1j} - e_{2j}$. If $k$ is sufficiently large, then $d_j$ is normally distributed with mean $d_t^{cv}$, which is the true difference in their error rates, and variance $\sigma^{cv}$. Unlike the previous approach, the overall variance in the observed differences is estimated using the following formula:
$$\hat{\sigma}_{d^{cv}}^2 = \frac{\sum_{j=1}^{k}(d_j - \bar{d})^2}{k(k-1)}, \qquad (4.16)$$
where $\bar{d}$ is the average difference. For this approach, we need to use a $t$-distribution to compute the confidence interval for $d_t^{cv}$:

$$d_t^{cv} = \bar{d} \pm t_{(1-\alpha),k-1}\,\hat{\sigma}_{d^{cv}}.$$

The coefficient $t_{(1-\alpha),k-1}$ is obtained from a probability table with two input parameters, its confidence level $(1-\alpha)$ and the number of degrees of freedom, $k - 1$. The probability table for the $t$-distribution is shown in Table 4.6.
Example 4.6. Suppose the estimated difference in the accuracy of models generated by two classification techniques has a mean equal to 0.05 and a standard deviation equal to 0.002. If the accuracy difference is estimated using a 30-fold cross-validation approach, then at a 95% confidence level, the true accuracy difference is
$$d_t^{cv} = 0.05 \pm 2.04 \times 0.002. \qquad (4.17)$$
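A sketch of this procedure, assuming SciPy for the $t$ coefficient; the per-fold differences $d_j$ are supplied by the caller, and the function name is ours:

```python
from math import sqrt
from scipy.stats import t

def paired_cv_interval(diffs, confidence=0.95):
    """Confidence interval for the true difference d_t^{cv}, given the
    per-fold differences d_j from k-fold cross-validation (Equation 4.16
    plus the t-based interval)."""
    k = len(diffs)
    d_bar = sum(diffs) / k
    # Equation 4.16: estimated variance of the average difference.
    var = sum((d - d_bar) ** 2 for d in diffs) / (k * (k - 1))
    # Two-sided t coefficient at the given confidence, k - 1 degrees of freedom.
    t_coef = t.ppf(1 - (1 - confidence) / 2, k - 1)
    return d_bar - t_coef * sqrt(var), d_bar + t_coef * sqrt(var)

# For k = 30 folds, t.ppf(0.975, 29) is about 2.04, matching Example 4.6.
```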
Table 4.6. Probability table for t-distribution.