For the specific-to-general strategy, one of the positive examples is ran- domly chosen as the initial seed for the rule-growing process. During the refinement step, the rule is generalized by removing one of its conjuncts so that it can cover more positive examples. Figure 5.3(b) shows the specific-to- general approach for the vertebrate classification problem. Suppose a positive example for mammals is chosen as the initial seed. The initial rule contains the same conjuncts as the attribute values of the seed. To improve its cov- erage, the rule is generalized by removing the conjunct Hibernate=no. The refinement step is repeated until the stopping criterion is met, e.g., when the rule starts covering negative examples.

The previous approaches may produce suboptimal rules because the rules are grown in a greedy fashion. To avoid this problem, a beam search may be used, where k of the best candidate rules are maintained by the algorithm. Each candidate rule is then grown separately by adding (or removing) a con- junct from its antecedent. The quality ofthe candidates are evaluated and the k best candidates are chosen for the next iteration.

Rule Evaluation An evaluation metric is needed to determine which con- junct should be added (or removed) during the rule-growing process. Accu- racy is an obvious choice because it explicitly measures the fraction of training examples classified correctly by the rule. However, a potential limitation of ac- curacy is that it does not take into account the rule’s coverage. For example, consider a training set that contains 60 positive examples and 100 negative examples. Suppose we are given the following two candidate rules:

Rule 11 : covers 50 positive examples and 5 negative examples, Rule 12: covers 2 positive examples and no negative examples.

The accuracies for 11 and 12 are 90.9% and 100%, respectively. However, 11 is the better rule despite its lower accuracy. The high accuracy for 12 is potentially spurious because the coverage of the rule is too low.

5 .1 Rule-BasedClassifier 2L7

The following approaches can be used to handle this problem.

1. A statistical test can be used to prune rules that have poor coverage. For example, we may compute the following likelihood ratio statistic:

K

R: 2\ f i tosff , ; le), i : I

where k is the number of classes, fi is the observed frequency of class e examples that are covered by the rule, and el is the expected frequency of a rule that makes random predictions. Note that ,R has a chi-square distribution with k – 1 degrees of freedom. A large l? value suggests that the number of correct predictions made by the rule is significantly larger than that expected by random guessing. For example, since 11 covers 55 examples, the expected frequency for the positive class is ea :

55×601160:20.625, while the expected frequency for the negative class is e- : 55 x 100/160 : 34.375. Thus, the likelihood ratio for 11 is

E(r1) : 2 x [50 x log2(50120.625) f 5 x logr(5134.375)): 99.9.

Similarly, the expected frequencies for 12 a,re e+ : 2 x 60/160 : 0.75 and e- :2x 100/160:1.25. The l ikel ihood rat io stat ist ic for re is

R(r2) : 2 x [2 x log2(210.75) + 0 x log2(0/1.25)] : 5.66.

This statistic therefore suggests that rr is a better rule than 12.

2. At evaluation metric that takes into account the rule coverage can be used. Consider the following evaluation metrics:

f , – r 1 Laplace :

m-estimate :

n l k ‘

f+ -t kp+

(5.4)

( O . D J n – f k

I

where n is the number of examples covered by the rule, /-p is the number of positive examples covered by the rule, k is the total number of classes, and p1 is the prior probability for the positive class. Note that the m- estimate is equivalent to the Laplace measure by choosing p+ : llk. Depending on the rule coverage, these measures capture the trade-off

218 Chapter 5 Classification: Alternative Techniques

between rule accuracy and the prior probability of the positive class. If

the rule does not cover any training example, then the Laplace mea-

sure reduces to lf k, which is the prior probability of the positive class assuming a uniform class distribution. The m-estimate also reduces to the prior probability (p1) when n : 0. However, if the rule coverage is large, then both measures asymptotically approach the rule accuracy,

f+ln. Going back to the previous example, the Laplace measure for 11 is 51157 : 89.47To, which is quite close to its accuracy. Conversely, the Laplace measure for 12 (75%) is significantly lower than its accuracy because 12 has a much lower coverage.

3. An evaluation metric that takes into account the support count of the rule can be used. One such metric is the FOILts information gain.

The support count of a rule corresponds to the number of positive exam- ples covered by the rule. Suppose the rule r : A —— * covers ps positive

examples and ns negative examples. After adding a new conjunct B, the extended rule r’ : A tt, B ——+ * covers p1 positive examples and n1 neg- ative examples. Given this information, the FOIL’s information gain of the extended rule is defined as follows:

FoIL’s information gain: pt x ( bsr=!= – logz – 1o- ). (5.6) \ ” ‘ P r + n r ” – P o l n s l

Since the measure is proportional to pr and ptl(n*nr), it prefers rules that have high support count and accuracy. The FOIL’s information gains for rules 11 and 12 given in the preceding example are 43.12 and 2, respectively. Therefore, 11 is a better rule than 12.

Rule Pruning The rules generated by the Learn-One-Rule function can be pruned to improve their generalization errors. To determine whether pruning is necessary, wâ‚¬ may apply the methods described in Section 4.4 on page

172 to estimate the generalization error of a rule. For example, if the error on validation set decreases after pruning, we should keep the simplified rule. Another approach is to compare the pessimistic error of the rule before and after pruning (see Section 4.4.4on page 179). The simplified rule is retained in place of the original rule if the pessimistic error improves after pruning.

5.1 Rule-Based Classifier 2I9

Rationale for Sequential Covering

After a rule is extracted, the sequential covering algorithm must eliminate all the positive and negative examples covered by the rule. The rationale for doing this is given in the next example.

class = +

T

+ +

+ + + +

class =

Figure 5.4. Elimination of training records by the sequential covering algorithm. R7, R2, and R3 represent regions covered by three different rules.

Figure 5.4 shows three possible rules, R7, R2, and R3, extracted from a data set that contains 29 positive examples and 21 negative examples. The accuracies of .R1, R2, and E3 are I2115 (80%), 7lI0 (70%), and 8f L2 (66.7%), respectively. .R1 is generated first because it has the highest accuracy. After generating R1, it is clear that the positive examples covered by the rule must be removed so that the next rule generated by the algorithm is different than .R1. Next, suppose the algorithm is given the choice of generating either R2 or R3. Even though R2 has higher accuracy than -R3, Rl and -R3 together cover 18 positive examples and 5 negative examples (resulting in an overail accuracy of 78.3%), whereas R1 and .R2 together cover 19 positive examples and 6 negative examples (resulting in an overall accuracy of 76%). The incremental impact of R2 or -R3 on accuracy is more evident when the positive and negative exarnples covered by -Rl are removed before computing their accuracies. In particular, if positive examples covered by R1 are not removed, then we may overestimate the effective accuracy of ,R3, and if negative examples are not removed, then we may underestimate the accuracy of R3. In the latter caseT we might end up preferring R2 over fi3 even though half of the false positive errors committed by E3 have already been accounted for by the preceding rule, .R1.

R2R3

22O Chapter 5 Classification: Alternative Techniques

RIPPER Algorithm

To illustrate the direct method, we consider a widely used rule induction algo- rithm called RIPPER. This algorithm scales almost linearly with the number of training examples and is particularly suited for building models from data sets with imbalanced class distributions. RIPPER also works well with noisy data sets because it uses a validation set to prevent model overfitting.

For two-class problems, RIPPER chooses the majority class as its default class and learns the rules for detecting the minority class. For multiclass prob- lems, the classes are ordered according to their frequencies. Let (y,Az, . . . ,U”) be the ordered classes, where 91 is the least frequent class and g” is the most frequent class. During the first iteration, instances that belong to 91 are Ia- beled as positive examples, while those that belong to other classes are labeled as negative examples. The sequential covering method is used to generate rules that discriminate between the positive and negative examples. Next, RIPPER extracts rules that distinguish y2 frorn other remaining classes. This process is repeated until we are left with g., which is designated as the default class.

Rule Growing RIPPER employs a general-to-specific strategy to grow a rule and the FOIL’s information gain measure to choose the best conjunct to be added into the rule antecedent. It stops adding conjuncts when the rule starts covering negative examples. The new rule is then pruned based on its performance on the validation set. The following metric is computed to determine whether pruning is needed: (p-“) l@+n), where p (n) is the number of positive (negative) examples in the validation set covered by the rule. This metric is monotonically related to the rule’s accuracy on the validation set. If the metric improves after pruning, then the conjunct is removed. Pruning is done starting from the last conjunct added to the rule. For example, given a rde ABCD + a, RIPPER checks whether D should be pruned first, followed by CD, BCD, etc. While the original rule covers only positive examples, the pruned rule may cover some of the negative examples in the training set.

Building the Rule Set After generating a rule, all the positive and negative examples covered by the rule are eliminated. The rule is then added into the rule set as long as it does not violate the stopping condition, which is based on the minimum description length principle. If the new rule increases the total description length of the rule set by at least d bits, then RIPPER stops adding rules into its rule set (by default, d is chosen to be 64 bits). Another stopping condition used by RIPPER is that the error rate of the rule on the validation set must not exceed 50%.

5 .1 Rule-BasedClassifier 22L

RIPPER also performs additional optimization steps to determine whether some of the existing rules in the rule set can be replaced by better alternative rules. Readers who are interested in the details of the optimization method may refer to the reference cited at the end of this chapter.

5.1.5 Indirect Methods for Rule Extraction

This section presents a method for generating a rule set from a decision tree. In principle, every path from the root node to the Ieaf node of a decision tree can be expressed as a classification rule. The test conditions encountered along the path form the conjuncts ofthe rule antecedent, while the class label at the leaf node is assigned to the rule consequent. Figure 5.5 shows an example of a rule set generated from a decision tree. Notice that the rule set is exhaustive and contains mutually exclusive rules. However, some of the rules can be simplified as shown in the next example.

Rule Set

rl: (P=No,Q=No) =-> –

12: (P=No,Q=Yes) ==2 ‘i r3; (P=Yes,Q=No) ==1 .’ 14: (P=Yes,R=Yes,Q=No) ==> –

r5: (P=Yes,R=Yes,Q=Yes) ==> a

Flgure 5.5. Converting a decision tree into classification rules.

Example 5.2. Consider the following three rules from Figure 5.5:

12: (P : No) A (Q : Yes) ——+ -1-

r3: (P : Yes) n (R: No) —–+ 1 r5: (P : Yes) A (R: Yes) n (Q : Yes) ——+ f

Observe that the rule set always predicts a positive class when the value of Q is Yes. Therefore, we may simplify the rules as follows:

r2t; (Q : Yes) —-+ f r3: (P : Yes) A (R: No) ——+ 1.

222 Chapter 5 Classification: Alternative Techniques

Rul+Based Classifier: (Gives Birth=No, Aerial Creature=Yes) => Birds

(Gives Birth=No, Aquatic Creature=Yes) => Fishes (Gives Birth=Yes) => Mammals

(Gives Birth=No, Aerial Creature=No, Aquatic Creature=No) => Reptiles

( ) => Amphibians

Figure 5.6. Classification rules extracted from a decision tree for the vertebrate classification problem.

13 is retained to cover the remaining instances of the positive class. Although the rules obtained after simplification are no longer mutually exclusive, they are less complex and are easier to interpret. I

In the following, we describe an approach used by the C4.5rules algorithm to generate a rule set from a decision tree. Figure 5.6 shows the decision tree and resulting classification rules obtained for the data set given in Table 5.2.

Rule Generation Classification rules are extracted for every path from the root to one of the leaf nodes in the decision tree. Given a classification rule r : A ——-+ gr, we consider a simplified rule, r’ : A’ + A, where A/ is obtained by removing one of the conjuncts in A. The simplified rule with the lowest pessimistic error rate is retained provided its error rate is less than that of the original rule. The rule-pruning step is repeated until the pessimistic error of the rule cannot be improved further. Because some of the rules may become identical after pruning, the duplicate rules must be discarded.