A vast amount of literature on cost-sensitive learning can be found in the online proceedings of the ICML'2000 Workshop on Cost-Sensitive Learning. The properties of a cost matrix were studied by Elkan in [182]. Margineantu and Dietterich [206] examined various methods for incorporating cost information into the C4.5 learning algorithm, including wrapper methods, class distribution-based methods, and loss-based methods. Other cost-sensitive learning methods that are algorithm-independent include AdaCost [183], MetaCost [177], and costing [222].
Extensive literature is also available on the subject of multiclass learning. This includes the works of Hastie and Tibshirani [191], Allwein et al. [156], Kong and Dietterich [201], and Tax and Duin [215]. The error-correcting output coding (ECOC) method was proposed by Dietterich and Bakiri [175]. They also investigated techniques for designing codes that are suitable for solving multiclass problems.
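To make the ECOC idea concrete, the following Python sketch shows only the decoding step: each class is assigned a binary code word, one binary classifier is trained per bit position, and a new record is assigned the class whose code word is nearest in Hamming distance to the vector of bit predictions. The code matrix, class labels, and function name below are illustrative assumptions, not taken from [175].

```python
import numpy as np

# Illustrative code matrix (an assumption for this sketch): each row is the
# code word for one class; each column corresponds to one binary classifier.
code_matrix = np.array([
    [1, 1, 1, 1, 1, 1, 1],   # class 0
    [0, 0, 0, 0, 1, 1, 1],   # class 1
    [0, 0, 1, 1, 0, 0, 1],   # class 2
    [0, 1, 0, 1, 0, 1, 0],   # class 3
])

def ecoc_decode(bit_predictions):
    """Return the class whose code word is closest in Hamming distance."""
    distances = (code_matrix != np.asarray(bit_predictions)).sum(axis=1)
    return int(distances.argmin())

# Suppose the seven binary classifiers output this bit vector for a record;
# the nearest code word (one bit away) belongs to class 2.
print(ecoc_decode([0, 0, 1, 1, 0, 1, 1]))
```

Because decoding tolerates bit errors as long as the predicted vector stays closest to the correct code word, code-design techniques of the kind studied in [175] aim to keep the rows (and, for learnability, the columns) of the code matrix well separated in Hamming distance.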
Bibliography
[155] D. W. Aha. A study of instance-based algorithms for supervised learning tasks: mathematical, empirical, and psychological evaluations. PhD thesis, University of California, Irvine, 1990.
[156] E. L. Allwein, R. E. Schapire, and Y. Singer. Reducing Multiclass to Binary: A Unifying Approach to Margin Classifiers. Journal of Machine Learning Research, 1:113-141, 2000.
[157] R. Andrews, J. Diederich, and A. Tickle. A Survey and Critique of Techniques For Extracting Rules From Trained Artificial Neural Networks. Knowledge Based Systems, 8(6):373-389, 1995.
[158] K. Bennett and C. Campbell. Support Vector Machines: Hype or Hallelujah. SIGKDD Explorations, 2(2):1-13, 2000.
[159] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, Oxford, U.K., 1995.
[160] A. P. Bradley. The use of the area under the ROC curve in the Evaluation of Machine Learning Algorithms. Pattern Recognition, 30(7):1145-1149, 1997.
[161] L. Breiman. Bagging Predictors. Machine Learning, 24(2):123-140, 1996.
[162] L. Breiman. Bias, Variance, and Arcing Classifiers. Technical Report 486, University of California, Berkeley, CA, 1996.
[163] L. Breiman. Random Forests. Machine Learning, 45(1):5-32, 2001.
[164] C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998.
[165] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16:321-357, 2002.
[166] N. V. Chawla, N. Japkowicz, and A. Kolcz. Editorial: Special Issue on Learning from Imbalanced Data Sets. SIGKDD Explorations, 6(1):1-6, 2004.
[167] V. Cherkassky and F. Mulier. Learning from Data: Concepts, Theory, and Methods. Wiley Interscience, 1998.
[168] P. Clark and R. Boswell. Rule Induction with CN2: Some Recent Improvements. In Machine Learning: Proc. of the 5th European Conf. (EWSL-91), pages 151-163, 1991.
[169] P. Clark and T. Niblett. The CN2 Induction Algorithm. Machine Learning, 3(4):261-283, 1989.
[170] W. W. Cohen. Fast Effective Rule Induction. In Proc. of the 12th Intl. Conf. on Machine Learning, pages 115-123, Tahoe City, CA, July 1995.
[171] S. Cost and S. Salzberg. A Weighted Nearest Neighbor Algorithm for Learning with Symbolic Features. Machine Learning, 10:57-78, 1993.
[172] T. M. Cover and P. E. Hart. Nearest Neighbor Pattern Classification. IEEE Transactions on Information Theory, 13(1):21-27, 1967.
[173] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, 2000.
[174] T. G. Dietterich. Ensemble Methods in Machine Learning. In First Intl. Workshop on Multiple Classifier Systems, Cagliari, Italy, 2000.
[175] T. G. Dietterich and G. Bakiri. Solving Multiclass Learning Problems via Error-Correcting Output Codes. Journal of Artificial Intelligence Research, 2:263-286, 1995.
[176] P. Domingos. The RISE system: Conquering without separating. In Proc. of the 6th IEEE Intl. Conf. on Tools with Artificial Intelligence, pages 704-707, New Orleans, LA, 1994.
[177] P. Domingos. MetaCost: A General Method for Making Classifiers Cost-Sensitive. In Proc. of the 5th Intl. Conf. on Knowledge Discovery and Data Mining, pages 155-164, San Diego, CA, August 1999.
[178] P. Domingos and M. Pazzani. On the Optimality of the Simple Bayesian Classifier under Zero-One Loss. Machine Learning, 29(2-3):103-130, 1997.
[179] C. Drummond and R. C. Holte. C4.5, Class imbalance, and Cost sensitivity: Why under-sampling beats over-sampling. In ICML'2003 Workshop on Learning from Imbalanced Data Sets II, Washington, DC, August 2003.
[180] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley & Sons, Inc., New York, 2nd edition, 2001.
[181] M. H. Dunham. Data Mining: Introductory and Advanced Topics. Prentice Hall, 2002.
[182] C. Elkan. The Foundations of Cost-Sensitive Learning. In Proc. of the 17th Intl. Joint Conf. on Artificial Intelligence, pages 973-978, Seattle, WA, August 2001.
[183] W. Fan, S. J. Stolfo, J. Zhang, and P. K. Chan. AdaCost: misclassification cost-sensitive boosting. In Proc. of the 16th Intl. Conf. on Machine Learning, pages 97-105, Bled, Slovenia, June 1999.
[184] J. Fürnkranz and G. Widmer. Incremental reduced error pruning. In Proc. of the 11th Intl. Conf. on Machine Learning, pages 70-77, New Brunswick, NJ, July 1994.
[185] C. Ferri, P. Flach, and J. Hernandez-Orallo. Learning Decision Trees Using the Area Under the ROC Curve. In Proc. of the 19th Intl. Conf. on Machine Learning, pages 139-146, Sydney, Australia, July 2002.
[186] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.
[187] K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, New York, 1990.
[188] E.-H. Han, G. Karypis, and V. Kumar. Text Categorization Using Weight Adjusted k-Nearest Neighbor Classification. In Proc. of the 5th Pacific-Asia Conf. on Knowledge Discovery and Data Mining, Lyon, France, 2001.
[189] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, San Francisco, 2001.
[190] D. J. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. MIT Press, 2001.
[191] T. Hastie and R. Tibshirani. Classification by pairwise coupling. Annals of Statistics, 26(2):451-471, 1998.
[192] T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, Prediction. Springer, New York, 2001.
[193] M. Hearst. Trends & Controversies: Support Vector Machines. IEEE Intelligent Systems, 13(4):18-28, 1998.
[194] D. Heckerman. Bayesian Networks for Data Mining. Data Mining and Knowledge Discovery, 1(1):79-119, 1997.
[195] R. C. Holte. Very Simple Classification Rules Perform Well on Most Commonly Used Data sets. Machine Learning, 11:63-91, 1993.
[196] N. Japkowicz. The Class Imbalance Problem: Significance and Strategies. In Proc. of the 2000 Intl. Conf. on Artificial Intelligence: Special Track on Inductive Learning, volume 1, pages 111-117, Las Vegas, NV, June 2000.
[197] M. V. Joshi. On Evaluating Performance of Classifiers for Rare Classes. In Proc. of the 2002 IEEE Intl. Conf. on Data Mining, Maebashi City, Japan, December 2002.
[198] M. V. Joshi, R. C. Agarwal, and V. Kumar. Mining Needles in a Haystack: Classifying Rare Classes via Two-Phase Rule Induction. In Proc. of 2001 ACM-SIGMOD Intl. Conf. on Management of Data, pages 91-102, Santa Barbara, CA, June 2001.
[199] M. V. Joshi, R. C. Agarwal, and V. Kumar. Predicting rare classes: can boosting make any weak learner strong? In Proc. of the 8th Intl. Conf. on Knowledge Discovery and Data Mining, pages 297-306, Edmonton, Canada, July 2002.
[200] M. V. Joshi and V. Kumar. CREDOS: Classification Using Ripple Down Structure (A Case for Rare Classes). In Proc. of the SIAM Intl. Conf. on Data Mining, pages 321-332, Orlando, FL, April 2004.
[201] E. B. Kong and T. G. Dietterich. Error-Correcting Output Coding Corrects Bias and Variance. In Proc. of the 12th Intl. Conf. on Machine Learning, pages 313-321, Tahoe City, CA, July 1995.
[202] M. Kubat and S. Matwin. Addressing the Curse of Imbalanced Training Sets: One Sided Selection. In Proc. of the 14th Intl. Conf. on Machine Learning, pages 179-186, Nashville, TN, July 1997.
[203] P. Langley, W. Iba, and K. Thompson. An analysis of Bayesian classifiers. In Proc. of the 10th National Conf. on Artificial Intelligence, pages 223-228, 1992.
[204] D. D. Lewis. Naive Bayes at Forty: The Independence Assumption in Information Retrieval. In Proc. of the 10th European Conf. on Machine Learning (ECML 1998), pages 4-15, 1998.
[205] O. Mangasarian. Data Mining via Support Vector Machines. Technical Report 01-05, Data Mining Institute, May 2001.
[206] D. D. Margineantu and T. G. Dietterich. Learning Decision Trees for Loss Minimization in Multi-Class Problems. Technical Report 99-30-03, Oregon State University, 1999.
[207] R. S. Michalski, I. Mozetic, J. Hong, and N. Lavrac. The Multi-Purpose Incremental Learning System AQ15 and Its Testing Application to Three Medical Domains. In Proc. of 5th National Conf. on Artificial Intelligence, Orlando, August 1986.
[208] T. Mitchell. Machine Learning. McGraw-Hill, Boston, MA, 1997.
[209] S. Muggleton. Foundations of Inductive Logic Programming. Prentice Hall, Englewood Cliffs, NJ, 1995.
[210] F. J. Provost and T. Fawcett. Analysis and Visualization of Classifier Performance: Comparison under Imprecise Class and Cost Distributions. In Proc. of the 3rd Intl. Conf. on Knowledge Discovery and Data Mining, pages 43-48, Newport Beach, CA, August 1997.
[211] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan-Kaufmann Publishers, San Mateo, CA, 1993.
[212] M. Ramoni and P. Sebastiani. Robust Bayes classifiers. Artificial Intelligence, 125:209-226, 2001.
[213] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.
[214] P. Smyth and R. M. Goodman. An Information Theoretic Approach to Rule Induction from Databases. IEEE Trans. on Knowledge and Data Engineering, 4(4):301-316, 1992.
[215] D. M. J. Tax and R. P. W. Duin. Using Two-Class Classifiers for Multiclass Classification. In Proc. of the 16th Intl. Conf. on Pattern Recognition (ICPR 2002), pages 124-127, Quebec, Canada, August 2002.
[216] C. J. van Rijsbergen. Information Retrieval. Butterworth-Heinemann, Newton, MA, 1978.
[217] V. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, New York, 1995.
[218] V. Vapnik. Statistical Learning Theory. John Wiley & Sons, New York, 1998.
[219] A. R. Webb. Statistical Pattern Recognition. John Wiley & Sons, 2nd edition, 2002.
[220] G. M. Weiss. Mining with Rarity: A Unifying Framework. SIGKDD Explorations, 6(1):7-19, 2004.
[221] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 1999.
[222] B. Zadrozny, J. C. Langford, and N. Abe. Cost-Sensitive Learning by Cost-Proportionate Example Weighting. In Proc. of the 2003 IEEE Intl. Conf. on Data Mining, pages 435-442, Melbourne, FL, August 2003.
5.10 Exercises
1. Consider a binary classification problem with the following set of attributes and attribute values:
• Air Conditioner = {Working, Broken}
• Engine = {Good, Bad}
• Mileage = {High, Medium, Low}
• Rust = {Yes, No}
Suppose a rule-based classifier produces the following rule set:
Mileage = High → Value = Low
Mileage = Low → Value = High
Air Conditioner = Working, Engine = Good → Value = High
Air Conditioner = Working, Engine = Bad → Value = Low
Air Conditioner = Broken → Value = Low
(a) Are the rules mutually exclusive?
(b) Is the rule set exhaustive?
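A quick way to check answers to (a) and (b) is to enumerate all 2 × 2 × 3 × 2 = 24 possible records and count how many rules cover each of them. The Python sketch below does this; the dictionary encoding of the rule set and the helper matching_rules are illustrative choices, not part of the exercise.

```python
from itertools import product

# Attribute values from the exercise statement.
ATTRIBUTES = {
    "Air Conditioner": ["Working", "Broken"],
    "Engine": ["Good", "Bad"],
    "Mileage": ["High", "Medium", "Low"],
    "Rust": ["Yes", "No"],
}

# Each rule is a (conditions, predicted value) pair taken from the rule set above.
RULES = [
    ({"Mileage": "High"}, "Low"),
    ({"Mileage": "Low"}, "High"),
    ({"Air Conditioner": "Working", "Engine": "Good"}, "High"),
    ({"Air Conditioner": "Working", "Engine": "Bad"}, "Low"),
    ({"Air Conditioner": "Broken"}, "Low"),
]

def matching_rules(record):
    """Return the indices of all rules whose conditions the record satisfies."""
    return [i for i, (conditions, _) in enumerate(RULES)
            if all(record[attr] == value for attr, value in conditions.items())]

names = list(ATTRIBUTES)
mutually_exclusive, exhaustive = True, True
for values in product(*ATTRIBUTES.values()):
    record = dict(zip(names, values))
    hits = matching_rules(record)
    if len(hits) > 1:    # some record is covered by more than one rule
        mutually_exclusive = False
    if len(hits) == 0:   # some record is covered by no rule
        exhaustive = False

print("mutually exclusive:", mutually_exclusive)
print("exhaustive:", exhaustive)
```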