0.99 0.98 0.95 0.9 0.8 1 l-

2 A

9 I4 19 24 29

3.08 1.89 1.53 1.38 7.34 1.33 r .32 1 .31

6.31 2.92 2 .L3 1.83 r .76 1.73 7.7r 7.70

72.7 4.30 2.78 2.26 2 .L4 2.09 2.06 2.04

31.8 6.96 3.75 2.82 2.62 2.54 2.49 2.46

63.7 9.92 4.60 3.25 2.98 2.86 2.80 2.76

Since the confidence interval does not span the value zero, the observed dif- ference between the techniques is statistically significant. r

4.7 Bibliographic Notes

Early classification systems were developed to organize a large collection of objects. For example, the Dewey Decimal and Library of Congress classifica- tion systems were designed to catalog and index the vast number of library books. The categories are typically identified in a manual fashion, with the help of domain experts.

Automated classification has been a subject of intensive research for many yearc. The study of classification in classical statistics is sometimes known as discriminant analysis, where the objective is to predict the group member- ship of an object based on a set of predictor variables. A well-known classical method is Fisher’s linear discriminant analysis [117], which seeks to find a lin- ear projection of the data that produces the greatest discrimination between objects that belong to different classes.

Many pattern recognition problems also require the discrimination of ob- jects from different classes. Examples include speech recognition, handwritten character identification, and image classification. Readers who are interested in the application of classification techniques for pattern recognition can refer to the survey articles by Jain et al. 1722] and Kulkarni et al. [128] or classic pattern recognition books by Bishop [107], Duda et al. [114], and Fukunaga

[118]. The subject of classification is also a major research topic in the fields of neural networks, statistical learning, and machine learning. An in-depth treat-

L94 Chapter 4 Classification

ment of various classification techniques is given in the books by Cherkassky and Mulier [112], Hastie et al. [120], Michie et al. [133], and Mitchell [136].

An overview of decision tree induction algorithms can be found in the survey articles by Buntine [110], Moret [137], Murthy [tSS], and Safavian et aL l7a7l. Examples of some well-known decision tree algorithms include CART

[108], ID3 [143], C4.5 [ta5], and CHAID [125]. Both ID3 and C4.5 employ the entropy measure as their splitting function. An in-depth discussion of the C4.5 decision tree algorithm is given by Quinlan [145]. Besides explaining the methodology for decision tree growing and tree pruning, Quinlan [145] also described how the algorithm can be modified to handle data sets with missing values. The CART algorithm was developed by Breiman et al. [108] and uses the Gini index as its splitting function. CHAID [tZ5] uses the statistical y2 test to determine the best split during the tree-growing process.

The decision tree algorithm presented in this chapter assumes that the splitting condition is specified one attribute at a time. An oblique decision tree can use multiple attributes to form the attribute test condition in the internal nodes [121, 152]. Breiman et al. [108] provide an option for using linear combinations of attributes in their CARf implementation. Other approaches for inducing oblique decision trees were proposed by Heath et al. [121], Murthy et al. [139], Cantil-Paz and Kamath 1111], and Utgoff and Brodley 11.52]1. Although oblique decision trees help to improve the expressiveness of a decision tree representation, learning the appropriate test condition at each node is computationally challenging. Another way to improve the expressiveness of a decision tree without using oblique decision trees is to apply a method known as constructive induction [132]. This method simplifies the task of learning complex splitting functions by creating compound features from the original attributes.

Besides the top-down approach, other strategies for growing a decision tree include the bottom-up approach by Landeweerd et al. 1130] and Pattipati and Alexandridis [142], as well as the bidirectional approach by Kim and Landgrebe

[126]. Schuermann and Doster [150] and Wang and Suen [154] proposed using a soft splitting criterion to address the data fragmentation problem. In this approach, each record is assigned to different branches of the decision tree with different probabilities.

Model overfitting is an important issue that must be addressed to ensure that a decision tree classifier performs equally well on previously unknown records. The model overfitting problem has been investigated by many authors including Breiman et al. [108], Schaffer [148], Mingers [135], and Jensen and Cohen [123]. While the presence of noise is often regarded as one of the

Bibliography 195

primary reasons for overfitting [t35, 140], Jensen and Cohen [123] argued that overfitting is the result of using incorrect hypothesis tests in a multiple comparison procedure.

Schapire 1149] defined generalization error as “the probability of misclas- sifying a new example” and test error as “the fraction of mistakes on a newly sampled test set.” Generalization error can therefore be considered as the ex- pected test error of a classifier. Generalization error may sometimes refer to the true error 1136] of a model, i.e., its expected error for randomly drawn data points from the same population distribution where the training set is sampled. These definitions are in fact equivalent if both the training and test sets are gathered from the same population distribution, which is often the case in many data mining and machine learning applications.

The Occam’s razor principle is often attributed to the philosopher William of Occam. Domingos [113] cautioned against the pitfall of misinterpreting Occam’s razor as comparing models with similar training errors, instead of generalization errors. A survey on decision tree-pruning methods to avoid overfitting is given by Breslow and Aha [109] and Esposito et al. [116]. Some of the typical pruning methods include reduced error pruning [144], pessimistic error pruninglL44], minimum error pruning [141], critical value pruning 1134], cost-complexity pruning [108], and error-based pruning [1a5]. Quinlan and Rivest proposed using the minimum description length principle for decision tree pruning in [146].

Kohavi [127] had performed an extensive empirical study to compare the performance metrics obtained using different estimation methods such as ran- dom subsampling, bootstrapping, and k-fold cross-validation. Their results suggest that the best estimation method is based on the ten-fold stratified cross-validation. Efron and Tibshirani [115] provided a theoretical and empir- ical comparison between cross-validation and a bootstrap method known as the 632* rule.

Current techniques such as C4.5 require that the entire training data set fit into main memory. There has been considerable effort to develop parallel and scalable versions of decision tree induction algorithms. Some of the proposed algorithms include SLIQ by Mehta et al. [131], SPRINT by Shafer et al. [151], CMP by Wang and Zaniolo [153], CLOUDS by Alsabti et al. [106], RainForest by Gehrke et al. [119], and ScalParC by Joshi et al. j2al. A general survey of parallel algorithms for data mining is available in 1129].

11071

l1o8l

l1Oel

196 Chapter 4 Classification

Bibliography [106] K. Alsabti, S. Ranka, and V. Singh. CLOUDS: A Decision TYee Classifier for Large

Datasets. In Proc. of the lth Intl. Conf. on Knowledge D’iscouery and Data M’ining, pages 2-8, New York. NY, August 1998.

C. M. Bishop. Neural Networks for Pattern Recogniti,on. Oxford University Press,

Oxford, U.K., 1995.

L. Breiman, J. H. Friedman, R. Olshen, and C. J. Stone. Classi,fi’cati’on and Regression Trees. Chaprnan & Hall, New York, 1984.

L. A. Breslow and D. W. Aha. Simplifying Decision Tlees: A Survey. Knowledge

Engineering Reui,ew, L2(l):L-40, 1997.

[110] W. Buntine. Learning classification trees. In Artifici,al Intelligence Frontiers i;n Statis-

lics, pages I82 20I. Chapman & Hall, London, 1993.

1111] E. Cantri-Paz and C. Kamath. Using evolutionary algorithms to induce oblique decision

trees. In Proc. of the Genetic and Euoluti,onarg Computation Conf., pages 1053-1060,

San Flancisco, CA, 2000.

[112] V. Cherkassky and F. Mulier. Learning from Data: Concepts, Theory, and Method’s.

Wiley Interscience, 1998.

[113] P. Domingos. The Role of Occam’s Razor in Knowledge Discovery. Data Mi’ning and

Knouledg e Discouery, 3(4) :409-425, t999.

[114] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Class’ification John Wiley & Sons, Inc., New York, 2nd edition, 2001.

[115] B. Efron and R. Tibshirani. Cross-validation and the Bootstrap: Estimating the Error

Rate of a Prediction Rule. Technical report, Stanford University, 1995.

[116] F. Esposito, D. Malerba, and G. Semeraro. A Comparative Analysis of Methods for

Pruning Decision Trees. IEEE Trans. Pattern Analysis and Machine Intelli,gence, 19 (5):476-491, May 1997. R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of

Eugenics, 7:179 188, 1936. K. Fukunaga. Introduct’ion to Statist’ical Pattern Recognit’i,on. Academic Press, New

York, 1990.

1119] J. Gehrke, R. Ramakrishnan, and V. Ganti. RainForest-A Framework for Fast De-

cision Tree Construction of Large Datasets. Data Mi,ning and Knowledge D’iscouerg, 4

(213):127-162, 2000.

[120] T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Stati.stical Leanting:

Data Mi,ning, Inference, Pred’icti,on. Springer, New York, 2001.

f121] D. Heath, S. Kasif, and S. Salzberg. Induction of Oblique Decision Trees. In Proc. of

the 13th IntI. Joint Conf. on Arti,fici,al Intelligence, pages 1002-1007, Chambery, France,

August 1993.