Example 5.10. Consider a multiclass problem where Y = {y1, y2, y3, y4}. Suppose a test instance is classified as (+, -, -, -) according to the 1-r approach. In other words, it is classified as positive when y1 is used as the positive class and negative when y2, y3, and y4 are used as the positive class. Using a simple majority vote, notice that y1 receives the highest number of votes, which is four, while the remaining classes receive only three votes. The test instance is therefore classified as y1.
Suppose the test instance is classified as follows using the 1-1 approach:
Binary pair      +: y1    +: y1    +: y1    +: y2    +: y2    +: y3
of classes       -: y2    -: y3    -: y4    -: y3    -: y4    -: y4
Classification     +        +        -        +        -        +
The first two rows in this table correspond to the pair of classes (yi, yj) chosen to build the classifier and the last row represents the predicted class for the test instance. After combining the predictions, y1 and y4 each receive two votes, while y2 and y3 each receive only one vote. The test instance is therefore classified as either y1 or y4, depending on the tie-breaking procedure.
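To make the vote counting concrete, the following Python sketch tallies the votes for the predictions in Example 5.10. The function names and data structures are illustrative assumptions rather than part of the text: a '+' from a 1-r classifier is credited to its positive class and a '-' to every class in its negative (rest) class, while each 1-1 classifier casts a single vote for the class it predicts.

```python
from itertools import combinations

classes = ["y1", "y2", "y3", "y4"]

def one_vs_rest_winner(outputs):
    """outputs[i] is '+' or '-' from the binary classifier trained with
    classes[i] as the positive class and the remaining classes as the
    negative class."""
    votes = {c: 0 for c in classes}
    for c, out in zip(classes, outputs):
        if out == "+":
            votes[c] += 1                      # vote for the positive class
        else:
            for other in classes:
                if other != c:
                    votes[other] += 1          # vote for each class in the rest
    return max(votes, key=votes.get)

def one_vs_one_votes(pair_outputs):
    """pair_outputs maps (ci, cj) -> '+' if the classifier trained on that
    pair predicts ci (its positive class), '-' if it predicts cj."""
    votes = {c: 0 for c in classes}
    for (ci, cj), out in pair_outputs.items():
        votes[ci if out == "+" else cj] += 1
    return votes

# 1-r outputs (+, -, -, -): y1 collects the most votes (four) and wins.
print(one_vs_rest_winner(["+", "-", "-", "-"]))   # 'y1'

# 1-1 outputs from the table above: y1 and y4 tie with two votes each,
# while y2 and y3 receive one vote apiece.
pairs = dict(zip(combinations(classes, 2), ["+", "+", "-", "+", "-", "+"]))
print(one_vs_one_votes(pairs))                    # {'y1': 2, 'y2': 1, 'y3': 1, 'y4': 2}
```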
Error-Correcting Output Coding
A potential problem with the previous two approaches is that they are sensitive to errors made by the binary classifiers. For the 1-r approach given in Example 5.10,
if at least one of the binary classifiers makes a mistake in its prediction, then the ensemble may end up declaring a tie between classes or making a wrong prediction. For example, suppose the test instance is classified as (+, -, +, -) due to a misclassification by the third classifier. In this case, it will be difficult to tell whether the instance should be classified as y1 or y3, unless the probability associated with each class prediction is taken into account.

The error-correcting output coding (ECOC) method provides a more robust way for handling multiclass problems. The method is inspired by an information-theoretic approach for sending messages across noisy channels. The idea behind this approach is to add redundancy into the transmitted message by means of a codeword, so that the receiver may detect errors in the received message and perhaps recover the original message if the number of errors is small.
For multiclass learning, each class yi is represented by a unique bit string of length n known as its codeword. We then train n binary classifiers, one to predict each bit of the codeword string. The predicted class of a test instance is the class whose codeword is closest, in terms of Hamming distance, to the codeword produced by the binary classifiers. Recall that the Hamming distance between a pair of bit strings is the number of bits in which they differ.
Example 5.11. Consider a multiclass problem where Y = {y1, y2, y3, y4}. Suppose we encode the classes using the following 7-bit codewords:
Class    Codeword
y1       1 1 1 1 1 1 1
y2       0 0 0 0 1 1 1
y3       0 0 1 1 0 0 1
y4       0 1 0 1 0 1 0
Each bit of the codeword is used to train a binary classifier. If a test instance is classified as (0,1,1,1,1,1,1) by the binary classifiers, then the Hamming distance between this codeword and y1 is 1, while the Hamming distance to the remaining classes is 3. The test instance is therefore classified as y1.
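The decoding step of ECOC is easy to express in code. The sketch below uses the codewords of Example 5.11, computes the Hamming distance from the classifiers' output to each codeword, and returns the nearest class; the helper names are illustrative and not taken from the text.

```python
# ECOC decoding with the 7-bit codewords of Example 5.11.
codewords = {
    "y1": (1, 1, 1, 1, 1, 1, 1),
    "y2": (0, 0, 0, 0, 1, 1, 1),
    "y3": (0, 0, 1, 1, 0, 0, 1),
    "y4": (0, 1, 0, 1, 0, 1, 0),
}

def hamming(a, b):
    """Number of positions in which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))

def ecoc_decode(output):
    """Return the class whose codeword is nearest to the classifiers' output."""
    return min(codewords, key=lambda c: hamming(codewords[c], output))

output = (0, 1, 1, 1, 1, 1, 1)          # the classifier outputs in Example 5.11
for c, w in codewords.items():
    print(c, hamming(w, output))        # y1 -> 1, y2/y3/y4 -> 3
print(ecoc_decode(output))              # 'y1'
```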
An interesting property of an error-correcting code is that if the minimum Hamming distance between any pair of codewords is d, then any ⌊(d - 1)/2⌋ errors in the output code can be corrected using its nearest codeword. In Example 5.11, because the minimum Hamming distance between any pair of codewords is 4, the ensemble may tolerate errors made by one of the seven
binary classifiers. If there is more than one classifier that makes a mistake, then the ensemble may not be able to compensate for the error.
An important issue is how to design an appropriate set of codewords for the different classes. From coding theory, a vast number of algorithms have been developed for generating n-bit codewords with bounded Hamming distance. However, the discussion of these algorithms is beyond the scope of this book. It is worth mentioning that there is a significant difference between the design of error-correcting codes for communication tasks and those used for multiclass learning. For communication, the codewords should maximize the Hamming distance between the rows so that error correction can be performed. Multiclass learning, however, requires that both the row-wise and column-wise distances of the codewords be well separated. A larger column-wise distance ensures that the binary classifiers are mutually independent, which is an important requirement for ensemble learning methods.
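As a rough illustration (the helper names are illustrative, not from the text), the following sketch computes the minimum row-wise distance of a code matrix, which determines how many bit errors can be corrected, and the minimum column-wise distance, a simple proxy for how different the induced binary problems are.

```python
from itertools import combinations

# Rows are class codewords (Example 5.11); columns define the binary problems
# handed to the individual classifiers.
code = [
    (1, 1, 1, 1, 1, 1, 1),   # y1
    (0, 0, 0, 0, 1, 1, 1),   # y2
    (0, 0, 1, 1, 0, 0, 1),   # y3
    (0, 1, 0, 1, 0, 1, 0),   # y4
]

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def min_row_distance(matrix):
    """Smallest Hamming distance between any pair of codewords (rows);
    up to floor((d - 1) / 2) bit errors can be corrected."""
    return min(hamming(r1, r2) for r1, r2 in combinations(matrix, 2))

def min_column_distance(matrix):
    """Smallest Hamming distance between any pair of columns, a crude measure
    of how distinct the binary classification tasks are."""
    cols = list(zip(*matrix))
    return min(hamming(c1, c2) for c1, c2 in combinations(cols, 2))

d = min_row_distance(code)
print(d, (d - 1) // 2)        # 4 1 -> any single-bit error is corrected
print(min_column_distance(code))
```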
5.9 Bibliographic Notes
Mitchell [208] provides excellent coverage of many classification techniques from a machine learning perspective. Extensive coverage of classification can also be found in Duda et al. [180], Webb [219], Fukunaga [187], Bishop [159], Hastie et al. [192], Cherkassky and Mulier [167], Witten and Frank [221], Hand et al. [190], Han and Kamber [189], and Dunham [181].
Direct methods for rule-based classifiers typically employ the sequential covering scheme for inducing classification rules. Holte's 1R [195] is the simplest form of a rule-based classifier because its rule set contains only a single rule. Despite its simplicity, Holte found that for some data sets that exhibit a strong one-to-one relationship between the attributes and the class label, 1R performs just as well as other classifiers. Other examples of rule-based classifiers include IREP [184], RIPPER [170], CN2 [168, 169], AQ [207], RISE [176], and ITRULE [214]. Table 5.9 shows a comparison of the characteristics of four of these classifiers.
For rule-based classifiers, the rule antecedent can be generalized to include any propositional or first-order logical expression (e.g., Horn clauses). Readers who are interested in first-order logic rule-based classifiers may refer to references such as [208] or the vast literature on inductive logic programming [209]. Quinlan [211] proposed the C4.5rules algorithm for extracting classification rules from decision trees. An indirect method for extracting rules from artificial neural networks was given by Andrews et al. in [157].
Table 5.9. Comparison of various rule-based classifiers.

|                                     | RIPPER                      | CN2 (unordered)   | CN2 (ordered)                | AQR                                                |
|-------------------------------------|-----------------------------|-------------------|------------------------------|----------------------------------------------------|
| Rule-growing strategy               | General-to-specific         | General-to-specific | General-to-specific        | General-to-specific (seeded by a positive example) |
| Evaluation metric                   | FOIL's Info gain            | Laplace           | Entropy and likelihood ratio | Number of true positives                           |
| Stopping condition for rule-growing | All examples belong to the same class | No performance gain | No performance gain | Rules cover only positive class                    |
| Rule pruning                        | Reduced error pruning       | None              | None                         | None                                               |
| Instance elimination                | Positive and negative       | Positive only     | Positive only                | Positive and negative                              |
| Stopping condition for adding rules | Error > 50% or based on MDL | No performance gain | No performance gain        | All positive examples are covered                  |
| Rule set pruning                    | Replace or modify rules     | Statistical tests | None                         | None                                               |
| Search strategy                     | Greedy                      | Beam search       | Beam search                  | Beam search                                        |
Cover and Hart [172] presented an overview of the nearest-neighbor classification method from a Bayesian perspective. Aha provided both theoretical and empirical evaluations for instance-based methods in [155]. PEBLS, which was developed by Cost and Salzberg [171], is a nearest-neighbor classification algorithm that can handle data sets containing nominal attributes. Each training example in PEBLS is also assigned a weight factor that depends on the number of times the example helps make a correct prediction. Han et al. [188] developed a weight-adjusted nearest-neighbor algorithm, in which the feature weights are learned using a greedy, hill-climbing optimization algorithm.
Naive Bayes classifiers have been investigated by many authors, including Langley et al. [203], Ramoni and Sebastiani [212], Lewis [204], and Domingos and Pazzani [178]. Although the independence assumption used in naive Bayes classifiers may seem rather unrealistic, the method has worked surprisingly well for applications such as text classification. Bayesian belief networks provide a more flexible approach by allowing some of the attributes to be interdependent. An excellent tutorial on Bayesian belief networks is given by Heckerman in [194].
Vapnik [217, 218] has written two authoritative books on Support Vector Machines (SVM). Other useful resources on SVM and kernel methods include the books by Cristianini and Shawe-Taylor [173] and Schölkopf and Smola
[213]. There are several survey articles on SVM, including those written by Burges [164], Bennet et al. [158], Hearst [193], and Mangasarian [205].
A survey of ensemble methods in machine learning was given by Dietterich [174]. The bagging method was proposed by Breiman [161]. Freund and Schapire [186] developed the AdaBoost algorithm. Arcing, which stands for adaptive resampling and combining, is a variant of the boosting algorithm proposed by Breiman [162]. It uses the non-uniform weights assigned to training examples to resample the data for building an ensemble of training sets. Unlike AdaBoost, the votes of the base classifiers are not weighted when determining the class label of test examples. The random forest method was introduced by Breiman in [163].
Related work on mining rare and imbalanced data sets can be found in the survey papers written by Chawla et al. [166] and Weiss [220]. Sampling-based methods for mining imbalanced data sets have been investigated by many authors, such as Kubat and Matwin [202], Japkowicz [196], and Drummond and Holte [179]. Joshi et al. [199] discussed the limitations of boosting algorithms for rare class modeling. Other algorithms developed for mining rare classes include SMOTE [165], PNrule [198], and CREDOS [200].
Various alternative metrics that are well-suited for class imbalanced problems are available. Precision, recall, and the F1-measure are widely used metrics in information retrieval [216]. ROC analysis was originally used in signal detection theory. Bradley [160] investigated the use of the area under the ROC curve as a performance metric for machine learning algorithms. A method for comparing classifier performance using the convex hull of ROC curves was suggested by Provost and Fawcett in [210]. Ferri et al. [185] developed a methodology for performing ROC analysis on decision tree classifiers. They also proposed a methodology for incorporating the area under the ROC curve (AUC) as the splitting criterion during the tree-growing process. Joshi [197] examined the performance of these measures from the perspective of analyzing rare classes.