Learning, 38(3):309-338, March 2000.
[124] M. V. Joshi, G. Karypis, and V. Kumar. ScalParC: A New Scalable and Efficient Parallel Classification Algorithm for Mining Large Datasets. In Proc. of 12th Intl. Parallel Processing Symp. (IPPS/SPDP), pages 573-579, Orlando, FL, April 1998.
[125] G. V. Kass. An Exploratory Technique for Investigating Large Quantities of Categorical Data. Applied Statistics, 29:119-127, 1980.
[130] G. Landeweerd, T. Timmers, E. Gelsema, M. Bins, and M. Halic. Binary tree versus single level tree classification of white blood cells. Pattern Recognition, 16:571-577, 1983.
[131] M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A Fast Scalable Classifier for Data Mining. In Proc. of the 5th Intl. Conf. on Extending Database Technology, pages 18-32, Avignon, France, March 1996.
[132] R. S. Michalski. A theory and methodology of inductive learning. Artificial Intelligence, 20:111-161, 1983.
[133] D. Michie, D. J. Spiegelhalter, and C. C. Taylor. Machine Learning, Neural and Statistical Classification. Ellis Horwood, Upper Saddle River, NJ, 1994.
[134] J. Mingers. Expert Systems-Rule Induction with Statistical Data. J. Operational Research Society, 38:39-47, 1987.
[135] J. Mingers. An empirical comparison of pruning methods for decision tree induction. Machine Learning, 4:227-243, 1989.
[136] T. Mitchell. Machine Learning. McGraw-Hill, Boston, MA, 1997.
[137] B. M. E. Moret. Decision Trees and Diagrams. Computing Surveys, 14(4):593-623, 1982.
[138] S. K. Murthy. Automatic Construction of Decision Trees from Data: A Multi-Disciplinary Survey. Data Mining and Knowledge Discovery, 2(4):345-389, 1998.
[139] S. K. Murthy, S. Kasif, and S. Salzberg. A system for induction of oblique decision trees. J. of Artificial Intelligence Research, 2:1-33, 1994.
[140] T. Niblett. Constructing decision trees in noisy domains. In Proc. of the 2nd European Working Session on Learning, pages 67-78, Bled, Yugoslavia, May 1987.
[141] T. Niblett and I. Bratko. Learning Decision Rules in Noisy Domains. In Research and Development in Expert Systems III, Cambridge, 1986. Cambridge University Press.
[142] K. R. Pattipati and M. G. Alexandridis. Application of heuristic search and information theory to sequential fault diagnosis. IEEE Trans. on Systems, Man, and Cybernetics, 20(4):872-882, 1990.
[143] J. R. Quinlan. Discovering rules by induction from large collections of examples. In D. Michie, editor, Expert Systems in the Micro Electronic Age. Edinburgh University Press, Edinburgh, UK, 1979.
[144] J. R. Quinlan. Simplifying Decision Trees. Intl. J. Man-Machine Studies, 27:221-234, 1987.
[145] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan-Kaufmann Publishers, San Mateo, CA, 1993.
[146] J. R. Quinlan and R. L. Rivest. Inferring Decision Trees Using the Minimum Description Length Principle. Information and Computation, 80(3):227-248, 1989.
[147] S. R. Safavian and D. Landgrebe. A Survey of Decision Tree Classifier Methodology. IEEE Trans. Systems, Man and Cybernetics, 21(3):660-674, May/June 1991.
[148] C. Schaffer. Overfitting avoidance as bias. Machine Learning, 10:153-178, 1993.
[149] R. E. Schapire. The Boosting Approach to Machine Learning: An Overview. In MSRI Workshop on Nonlinear Estimation and Classification, 2002.
[150] J. Schuermann and W. Doster. A decision-theoretic approach in hierarchical classifier design. Pattern Recognition, 17:359-369, 1984.
[151] J. C. Shafer, R. Agrawal, and M. Mehta. SPRINT: A Scalable Parallel Classifier
for Data Mining. In Proc. of the 22nd VLDB Conf., pages 544-555, Bombay, India,
September 1996.
[152] P. E. Utgoff and C. E. Brodley. An incremental method for finding multivariate splits for decision trees. In Proc. of the 7th Intl. Conf. on Machine Learning, pages 58-65, Austin, TX, June 1990.
[153] H. Wang and C. Zaniolo. CMP: A Fast Decision Tree Classifier Using Multivariate Predictions. In Proc. of the 16th Intl. Conf. on Data Engineering, pages 449-460, San Diego, CA, March 2000.
[154] Q. R. Wang and C. Y. Suen. Large tree classifier with heuristic search and global training. IEEE Trans. on Pattern Analysis and Machine Intelligence, 9(1):91-102, 1987.
4.8 Exercises
1. Draw the full decision tree for the parity function of four Boolean attributes, A, B, C, and D. Is it possible to simplify the tree?
2. Consider the training examples shown in Table 4.7 for a binary classification problem.
(a) Compute the Gini index for the overall collection of training examples.
(b) Compute the Gini index for the Customer ID attribute.
(c) Compute the Gini index for the Gender attribute.
(d) Compute the Gini index for the Car Type attribute using multiway split.
(e) Compute the Gini index for the Shirt Size attribute using multiway split.
(f) Which attribute is better, Gender, Car Type, or Shirt Size?
(g) Explain why Customer ID should not be used as the attribute test condition even though it has the lowest Gini.
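For readers who want to check their hand calculations for parts (a)-(f), a minimal Python sketch of the Gini computations is given below. The helper names and the tiny sample at the end are illustrative placeholders, not data or code from the text.

    from collections import Counter

    def gini(labels):
        """Gini index of a collection of class labels: 1 - sum_i p_i^2."""
        n = len(labels)
        if n == 0:
            return 0.0
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    def gini_split(values, labels):
        """Weighted Gini index of a multiway split on a categorical attribute."""
        n = len(labels)
        groups = {}
        for v, y in zip(values, labels):
            groups.setdefault(v, []).append(y)
        return sum(len(g) / n * gini(g) for g in groups.values())

    # Illustrative usage on a tiny made-up sample (not Table 4.7):
    genders = ["M", "M", "F", "F"]
    classes = ["C0", "C1", "C0", "C1"]
    print(gini(classes))                 # overall Gini = 0.5 for this sample
    print(gini_split(genders, classes))  # weighted Gini of the Gender split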
3. Consider the training examples shown in Table 4.8 for a binary classification problem.
(a) What is the entropy of this collection of training examples with respect to the positive class?
Table 4.7. Data set for Exercise 2.

Customer ID  Gender  Car Type  Shirt Size   Class
1            M       Family    Small        C0
2            M       Sports    Medium       C0
3            M       Sports    Medium       C0
4            M       Sports    Large        C0
5            M       Sports    Extra Large  C0
6            M       Sports    Extra Large  C0
7            F       Sports    Small        C0
8            F       Sports    Small        C0
9            F       Sports    Medium       C0
10           F       Luxury    Large        C0
11           M       Family    Large        C1
12           M       Family    Extra Large  C1
13           M       Family    Medium       C1
14           M       Luxury    Extra Large  C1
15           F       Luxury    Small        C1
16           F       Luxury    Small        C1
17           F       Luxury    Medium       C1
18           F       Luxury    Medium       C1
19           F       Luxury    Medium       C1
20           F       Luxury    Large        C1
Table 4.8. Data set for Exercise 3.

Instance  a1  a2  a3   Target Class
1         T   T   1.0  +
2         T   T   6.0  +
3         T   F   5.0  -
4         F   F   4.0  +
5         F   T   7.0  -
6         F   T   3.0  -
7         F   F   8.0  -
8         T   F   7.0  +
9         F   T   5.0  -
(b) What are the information gains of a1 and a2 relative to these training examples?
(c) For a3, which is a continuous attribute, compute the information gain for every possible split.
(d) What is the best split (among a1, a2, and a3) according to the information gain?
(e) What is the best split (between a1 and a2) according to the classification error rate?
(f) What is the best split (between a1 and a2) according to the Gini index?
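A small Python sketch of the entropy, information gain, and continuous-split computations used in parts (a)-(f) follows. The helper names are assumptions made for illustration, and the short arrays at the end are placeholders rather than the contents of Table 4.8.

    import math
    from collections import Counter

    def entropy(labels):
        """Entropy of a label collection: -sum_i p_i log2 p_i."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def info_gain(parent_labels, partitions):
        """Information gain = parent entropy - weighted entropy of the partitions."""
        n = len(parent_labels)
        weighted = sum(len(p) / n * entropy(p) for p in partitions)
        return entropy(parent_labels) - weighted

    def best_continuous_split(values, labels):
        """Try a threshold between every pair of adjacent distinct values."""
        pairs = sorted(zip(values, labels))
        best = (None, -1.0)
        for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
            if v1 == v2:
                continue
            t = (v1 + v2) / 2.0
            left = [y for v, y in pairs if v <= t]
            right = [y for v, y in pairs if v > t]
            gain = info_gain(labels, [left, right])
            if gain > best[1]:
                best = (t, gain)
        return best  # (threshold, information gain)

    # Illustrative placeholder data (not Table 4.8):
    a3 = [1.0, 3.0, 4.0, 5.0]
    cls = ["+", "-", "-", "+"]
    print(best_continuous_split(a3, cls))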
4. Show that the entropy of a node never increases after splitting it into smaller successor nodes.
5. Consider the following data set for a binary class problem.

A  B  Class Label
T  F  +
T  T  +
T  T  +
T  F  -
T  T  +
F  F  -
F  F  -
F  F  -
T  T  -
T  F  -
(a) Calculate the information gain when splitting on A and B. Which attribute would the decision tree induction algorithm choose?
(b) Calculate the gain in the Gini index when splitting on A and B. Which attribute would the decision tree induction algorithm choose?
(c) Figure 4.13 shows that entropy and the Gini index are both monotonically increasing on the range [0, 0.5] and they are both monotonically decreasing on the range [0.5, 1]. Is it possible that information gain and the gain in the Gini index favor different attributes? Explain.
6. Consider the following set of training examples.
X  Y  Z  Number of Class C1 Examples  Number of Class C2 Examples
0  0  0   5  40
0  0  1   0  15
0  1  0  10   5
0  1  1  45   0
1  0  0  10   5
1  0  1  25   0
1  1  0   5  20
1  1  1   0  15
(a) Compute a two-level decision tree using the greedy approach described in this chapter. Use the classification error rate as the criterion for splitting. What is the overall error rate of the induced tree?
(b) Repeat part (a) using X as the first splitting attribute and then choose the best remaining attribute for splitting at each of the two successor nodes. What is the error rate of the induced tree?
(c) Compare the results of parts (a) and (b). Comment on the suitability of the greedy heuristic used for splitting attribute selection.
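One possible way to mechanize the greedy, error-rate-based induction asked for in parts (a) and (b) is sketched below in Python. This is only an illustrative sketch, not the book's algorithm or code; the attribute names and the toy records in the driver are made up.

    from collections import Counter

    def error_rate(labels):
        """Classification error of a node labeled with its majority class."""
        if not labels:
            return 0.0
        return 1.0 - Counter(labels).most_common(1)[0][1] / len(labels)

    def weighted_error(records, labels, attr):
        """Weighted error rate after splitting on attribute attr."""
        n = len(labels)
        groups = {}
        for r, y in zip(records, labels):
            groups.setdefault(r[attr], []).append(y)
        return sum(len(g) / n * error_rate(g) for g in groups.values())

    def best_split(records, labels, attrs):
        """Greedily pick the attribute whose split gives the lowest weighted error."""
        return min(attrs, key=lambda a: weighted_error(records, labels, a))

    # Hypothetical records (dicts of attribute values) and labels:
    records = [{"X": 0, "Y": 0}, {"X": 0, "Y": 1}, {"X": 1, "Y": 0}, {"X": 1, "Y": 1}]
    labels = ["C1", "C2", "C2", "C2"]
    root = best_split(records, labels, ["X", "Y"])
    print("root split on:", root)
    for v in sorted(set(r[root] for r in records)):
        child_recs = [r for r in records if r[root] == v]
        child_lbls = [y for r, y in zip(records, labels) if r[root] == v]
        rest = [a for a in ["X", "Y"] if a != root]
        print(root, "=", v, "-> second-level split on:", best_split(child_recs, child_lbls, rest))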
7. The following table summarizes a data set with three attributes A, B, C and two class labels +, -. Build a two-level decision tree.
A  B  C  Number of Instances
         +   -
T  T  T   5   0
F  T  T   0  20
T  F  T  20   0
F  F  T   0   5
T  T  F   0   0
F  T  F  25   0
T  F  F   0   0
F  F  F   0  25
(a) According to the classification error rate, which attribute would be chosen as the first splitting attribute? For each attribute, show the contingency table and the gains in classification error rate.
(b) Repeat for the two children of the root node.
(c) How many instances are misclassified by the resulting decision tree?
(d) Repeat parts (a), (b), and (c) using C as the splitting attribute.
(e) Use the results in parts (c) and (d) to conclude about the greedy nature of the decision tree induction algorithm.
8. Consider the decision tree shown in Figure 4.30.
(a) Compute the generalization error rate of the tree using the optimistic approach.
(b) Compute the generalization error rate of the tree using the pessimistic approach. (For simplicity, use the strategy of adding a factor of 0.5 to each leaf node.)
(c) Compute the generalization error rate of the tree using the validation set shown in Figure 4.30. This approach is known as reduced error pruning.
Figure 4.30. Decision tree and data sets for Exercise 8.

Training set:
Instance  A  B  C  Class
1         0  0  0  +
2         0  0  1  +
3         0  1  0  +
4         0  1  1  -
5         1  0  0  +
6         1  0  0  +
7         1  1  0  -
8         1  0  1  +
9         1  1  0  -
10        1  1  0  -

Validation set:
Instance  A  B  C  Class
11        0  0  0  +
12        0  1  1  +
13        1  1  0  +
14        1  0  1  -
15        1  0  0  +
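Because parts (a) and (b) depend only on per-leaf training error counts, the two estimates can be checked with a very small Python sketch such as the one below; the function names and the leaf counts in the example are hypothetical, not values read from Figure 4.30.

    def optimistic_error(leaf_errors, n_train):
        """Optimistic estimate: the resubstitution (training) error of the tree."""
        return sum(leaf_errors) / n_train

    def pessimistic_error(leaf_errors, n_train, penalty=0.5):
        """Pessimistic estimate: add a fixed penalty (0.5 here) for each leaf node."""
        return (sum(leaf_errors) + penalty * len(leaf_errors)) / n_train

    # Hypothetical per-leaf error counts for a tree with four leaves trained on
    # ten records (these numbers are not read from Figure 4.30):
    errors_per_leaf = [1, 0, 2, 1]
    print(optimistic_error(errors_per_leaf, 10))   # 4/10 = 0.4
    print(pessimistic_error(errors_per_leaf, 10))  # (4 + 4*0.5)/10 = 0.6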
9. Consider the decision trees shown in Figure 4.31. Assume they are generated from a data set that contains 16 binary attributes and 3 classes, C1, C2, and C3.
(a) Decision tree with 7 errors (b) Decision tree with 4 errors
Figure 4.31. Decision trees for Exercise 9.
Compute the total description length of each tree according to the minimum description length principle.
• The total description length of a tree is given by:

  Cost(tree, data) = Cost(tree) + Cost(data|tree).
• Each internal node of the tree is encoded by the ID of the splitting attribute. If there are m attributes, the cost of encoding each attribute is log2 m bits.
• Each leaf is encoded using the ID of the class it is associated with. If there are k classes, the cost of encoding a class is log2 k bits.
• Cost(tree) is the cost of encoding all the nodes in the tree. To simplify the computation, you can assume that the total cost of the tree is obtained by adding up the costs of encoding each internal node and each leaf node.
• Cost(data|tree) is encoded using the classification errors the tree commits on the training set. Each error is encoded by log2 n bits, where n is the total number of training instances.
Which decision tree is better, according to the MDL principle?
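The encoding costs described above can be tallied with a short sketch; the function below and its arguments are illustrative assumptions rather than code from the text.

    import math

    def mdl_cost(n_internal, n_leaves, n_attrs, n_classes, n_errors, n_train):
        """Cost(tree, data) = Cost(tree) + Cost(data|tree) using the encodings
        above: log2(m) bits per internal node, log2(k) bits per leaf, and
        log2(n) bits per training error."""
        cost_tree = n_internal * math.log2(n_attrs) + n_leaves * math.log2(n_classes)
        cost_data = n_errors * math.log2(n_train)
        return cost_tree + cost_data

    # Hypothetical counts: 2 internal nodes, 3 leaves, 16 attributes, 3 classes,
    # 5 training errors out of 100 training instances.
    print(mdl_cost(2, 3, 16, 3, 5, 100))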
10. While the .632 bootstrap approach is useful for obtaining a reliable estimate of model accuracy, it has a known limitation [127]. Consider a two-class problem, where there are equal numbers of positive and negative examples in the data. Suppose the class labels for the examples are generated randomly. The classifier used is an unpruned decision tree (i.e., a perfect memorizer). Determine the accuracy of the classifier using each of the following methods.
(a) The holdout method, where two-thirds of the data are used for training and the remaining one-third are used for testing.
(b) Ten-fold cross-validation.
(c) The .632 bootstrap method.
(d) From the results in parts (a), (b), and (c), which method provides a more reliable evaluation of the classifier's accuracy?
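For part (c), one common form of the .632 bootstrap averages a weighted combination of the out-of-bag accuracy and the resubstitution accuracy over b bootstrap samples. The Python sketch below illustrates that bookkeeping under those assumptions; the scorer is a stand-in, and the exact details of the estimator vary across presentations.

    import random

    def bootstrap_632(n, score, b=10, seed=0):
        """Average 0.632*acc_oob + 0.368*acc_s over b bootstrap samples, where
        acc_oob is the accuracy on records left out of a sample and acc_s is the
        resubstitution accuracy of a model trained on all n records.
        score(train_idx, test_idx) is assumed to return an accuracy."""
        rng = random.Random(seed)
        everything = list(range(n))
        acc_s = score(everything, everything)  # resubstitution accuracy
        estimates = []
        for _ in range(b):
            boot = [rng.randrange(n) for _ in range(n)]       # sample with replacement
            chosen = set(boot)
            oob = [i for i in everything if i not in chosen]  # out-of-bag records
            acc_oob = score(boot, oob) if oob else acc_s
            estimates.append(0.632 * acc_oob + 0.368 * acc_s)
        return sum(estimates) / b

    # Illustrative stand-in scorer; a real scorer would train a classifier on
    # train_idx and report its accuracy on test_idx.
    def dummy_score(train_idx, test_idx):
        return 0.75

    print(bootstrap_632(n=100, score=dummy_score))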
11. Consider the following approach for testing whether a classifier A beats another classifier B. Let N be the size of a given data set, p_A be the accuracy of classifier A, p_B be the accuracy of classifier B, and p̄ = (p_A + p_B)/2 be the average accuracy for both classifiers. To test whether classifier A is significantly better than B, the following Z-statistic is used:

  Z = (p_A - p_B) / sqrt(2 p̄ (1 - p̄) / N)
Classifier A is assumed to be better than classifier B if Z > 1.96.
Table 4.9 compares the accuracies of three different classifiers, decision tree classifiers, naive Bayes classifiers, and support vector machines, on various data sets. (The latter two classifiers are described in Chapter 5.)
Table 4.9. Comparing the accuracy of various classification methods.

Data Set      Size (N)  Decision Tree (%)  Naive Bayes (%)  Support vector machine (%)
Anneal        898       92.09              79.62            87.19
Australia     690       65.51              76.81            84.78
Auto          205       81.95              58.05            70.13
Breast        699       95.14              95.99            96.42
Cleve         303       76.24              83.50            84.49
Credit        690       85.80              77.54            85.07
Diabetes      768       72.40              75.91            76.82
German        1000      70.90              74.70            74.40
Glass         214       67.29              48.59            59.81
Heart         270       80.00              84.07            83.70
Hepatitis     155       81.94              83.23            87.10
Horse         368       85.33              78.80            82.61
Ionosphere    351       89.17              82.34            88.89
Iris          150       94.67              95.33            96.00
Labor         57        78.95              94.74            92.98
Led7          3200      73.34              73.16            73.56
Lymphography  148       77.03              83.11            86.49
Pima          768       74.35              76.04            76.95
Sonar         208       78.85              69.71            76.92
Tic-tac-toe   958       83.72              70.04            98.33
Vehicle       846       71.04              45.04            74.94
Wine          178       94.38              96.63            98.88
Zoo           101       93.07              93.07            96.04
Summarize the performance of the classifiers given in Table 4.9 using the following 3 × 3 table:

win-loss-draw           Decision tree  Naive Bayes  Support vector machine
Decision tree           0 - 0 - 23
Naive Bayes                            0 - 0 - 23
Support vector machine                              0 - 0 - 23

Each cell in the table contains the number of wins, losses, and draws when comparing the classifier in a given row to the classifier in a given column.
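The Z-statistic defined in this exercise is straightforward to compute; the following minimal Python sketch is an illustration under the stated assumptions (the accuracies and data set size in the example are hypothetical, and 1.96 is the conventional 95% threshold).

    import math

    def z_statistic(p_a, p_b, n):
        """Z = (p_A - p_B) / sqrt(2*pbar*(1 - pbar)/N), where pbar = (p_A + p_B)/2."""
        pbar = (p_a + p_b) / 2.0
        return (p_a - p_b) / math.sqrt(2.0 * pbar * (1.0 - pbar) / n)

    # Hypothetical accuracies on a hypothetical data set of 1000 records:
    z = z_statistic(0.87, 0.83, 1000)
    print(z, "A is better than B" if z > 1.96 else "difference not significant")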
12. Let X be a binomial random variable with mean Np and variance Np(1 - p). Show that the ratio X/N also has a binomial distribution with mean p and variance p(1 - p)/N.
5 Classification: Alternative Techniques
The previous chapter described a simple, yet quite effective, classification technique known as decision tree induction. Issues such as model overfitting and classifier evaluation were also discussed in great detail. This chapter presents alternative techniques for building classification models, from simple techniques such as rule-based and nearest-neighbor classifiers to more advanced techniques such as support vector machines and ensemble methods. Other key issues such as the class imbalance and multiclass problems are also discussed at the end of the chapter.
5.1 Rule-Based Classifier