Is ordering needed for this set of rules?

Do you need a default class for the rule set?

2. The RIPPER algorithm (by Cohen [170]) is an extension of an earlier algorithm called IREP (by Fiirnkranz and Widmer 1184]). Both algorithms apply the reduced-error pruning method to determine whether a rule needs to be pruned. The reduced error pruning method uses a validation set to estimate the generalization error of a classifier. Consider the following pair of rules:

Rt: A —– C Rz : AnB- – – – – -+C

-R2 is obtained by adding a new conjunct, B, to the left-hand side of R1. For this question, you will be asked to determine whether -R2 is preferred over fi1 from the perspectives of rule-growing and rule-pruning. To determine whether a rule should be pruned, IREP computes the following measure:

n – L ( N – n ) urREp: -_fr1,, ,

where P is the total number of positive examples in the validation set, Iy’ is the total number of negative examples in the validation set, p is the number of positive examples in the validation set covered by the rule, and n is the number of negative examples in the validation set covered by the rtIe. ulppp is actually similar to classification accuracy for the validation set. IREP favors rules that have higher values of u1pBp. On the other hand, RIPPER applies the following measure to determine whether a rule should be pruned:

un tppnn : ‘ -=- p + n

(a) Suppose -R1 is covered by 350 positive examples and 150 negative ex- amples, while R2 is covered by 300 positive examples and 50 negative examples. Compute the FOIL’s information gain for the rule Rz with respect to R1.

(b) Consider a validation set that contains 500 positive examples and 500 negative examples. For R1, suppose the number of positive examples covered by the rule is 200, and the number of negative examples covered by the rule is 50. For rR2, suppose the number of positive examples covered by the rule is 100 and the number of negative examples is 5. Compute urREp for both rules. Which rule does IREP prefer?

(c) Compute uRrppEn for the previous problem. Which rule does RIPPER prefer?

(b)

(c)

(d)

4 .

5.10 Exercises 3L7

C4.5rules is an implementation of an indirect method for generating rules from a decision tree. RIPPER is an implementation of a direct method for generating rules directly from data.

(a) Discuss the strengths and weaknesses of both methods.

(b) Consider a data set that has a large difference in the class size (i.e., some classes are much bigger than others). Which method (between C4.5rules and RIPPER) is better in terms of finding high accuracy rules for the small classes?

Consider a training set that contains 100 positive examples and 400 negative examples. For each of the following candidate rules,

R1 A .—–+ * (covers 4 positive and I negative examples), Rz: B .—–+ * (covers 30 positive and 10 negative examples), R3: C ——+ * (covers 100 positive and 90 negative examples),

determine which is the best and worst candidate rule according to:

(a) Rule accuracy.

(b) FOIL’s information gain.

(c) The likelihood ratio statistic.

(d) The Laplace measure.

(e) The m-estimate measure (with k :2 and p+ :0.2).

Figure 5.4 illustrates the coverage of the classification rules R1, ,R2, and R3. Determine which is the best and worst rule according to:

(a) The likelihood ratio statistic.

(b) The Laplace measure.

(c) The m-estimate measure (with k :2 and P+ :0.58).

(d) The rule accuracy after R1 has been discovered, where none of the exam- ples covered by R1 are discarded).

(e) The rule accuracy after rRl has been discovered, where only the positive

examples covered by iBl are discarded).

(f) The rule accuracy after Rl has been discovered, where both positive and negative examples covered by -Rl are discarded.

(a) Suppose the fraction of undergraduate students who smoke is 15% and the fraction of graduate students who smoke is 23%. If one-fifth of the college students are graduate students and the rest are undergraduates, what is the probability that a student who smokes is a graduate student?

5.

6.

Table 5.10. Data set for Exercise 7. Record A B C Class

1 2 e

t

o

7 8 9 10

0 0 0 0 0 1 1 1 1 I

0 0 I 1 0 0 0 0 1 0

0 1 1 I 1 1 1 I

1 1

+

+ +

+ +

318 Chapter 5 Classification: Alternative Techniques

(b) Given the information in part (a), is a randomly chosen college student more likely to be a graduate or undergraduate student?

(c) Repeat part (b) assuming that the student is a smoker.

(d) Suppose 30% of the graduate students live in a dorm but only l0% of the undergraduate students live in a dorm. If a student smokes and lives in the dorm, is he or she more likely to be a graduate or undergraduate student? You can assume independence between students who live in a dorm and those who smoke.

7. Consider the data set shown in Table 5.10

Estimate the conditional probabilities for P(Al-l_), P(Bi+), P(Cl+), P(Al-), P(Bl-) , and P(Cl-) .

Use the estimate of conditional probabilities given in the previous question to predict the class label for a test sample (A:0,8 – I,C :0) using the naive Bayes approach.

(c) Estimate the conditional probabilities using the m-estimate approach, wi th p : I /2 and m:4.

(d) Repeat part (b) using the conditional probabilities given in part (c).

(e) Compare the two methods for estimating probabilities. Which method is better and why?

8. Consider the data set shown in Table 5.11.

(a) Estimate the conditional probabilities for P(A : 1l+), P(B : 11a), P(C : 1 l+) , P( .4 : 1 l – ) , P(B : 1 l – ) , and P(C : 1 l – ) us ing the same approach as in the previous problem.

(a)

(b)

Table 5.11. Data set for Exercise 8. lnstance A B C Olass

I 2 J

4 5

t

7 8 q

10

0 1 0 1 1 0 1 0 0 1

0 0 1 0 0 0 I 0 1 1

1 1 0 0 I

1 0 0 0 I

+

+ +

+ +

5.10 Exercises 319

(b) Use the conditional probabilities in part (a) to predict the class label for

a test sample (A : l, B : l,C :1) using the naive Bayes approach.

(c) Compare P(A:1) , P(B: 1) , and P(A: ! ,8 :1) . State the re lat ion- ships between A and B.

(d) Repeat the analysis in part (c) using P(A : l), P(B : 0), and P(A: 1 , ,B : 0 ) .

( e ) Compare P (A : I ,B : I lC lass : * ) aga ins t P (A : l lC lass : * ) and P(B : llClass: 1). Are the variables conditionally independent given

the class?

9. (a) Explain how naive Bayes performs on the data set shown in Figure 5.46.

(b) If each class is further divided such that there are four classes (AI, 42,

,B1, and B2), will naive Bayes perform better?

(c) How will a decision tree perform on this data set (for the two-class prob-

lem)? What if there are four classes?

10. Repeat the analysis shown in Example 5.3 for finding the location of a decision boundary using the following information:

(a) The prior probabil it ies are P(Crocodile):2 x P(lrrigator).

(b) The prior probabil it ies are P(All igator):2 x P(Crocodile).

(c) The prior probabilities are the same, but their standard deviations are different; i.e., o(Crocodile) :4 and o(All igator) : 2.

11. Figure 5.47 illustrates the Bayesian belief network for the data set shown in

Table 5.12. (Assume that all the attributes are binary).

(a) Draw the probability table for each node in the network.

Attributes

32O Chapter 5 Classification: Alternative Techniques

Figure 5.46, Data set for Exercise 9.

Figure 5.47, Bayesian belief network.

(b) Use the Bayesian network to compute P(Engine : Bad, Air Conditioner : Broken).

12. Given the Bayesian network shown in Figure 5.48, compute the following prob- abilities:

(a) P(B : good, F : empty, G : empty, 5 : yes).

(b) P(B – bad, F : empty? G : not empty, S : no).

(c) Given that the battery is bad, compute the probability that the car will start.

13. Consider the one-dimensional data set shown in Table 5.13.

Table 5,12. Data set for Exercise 11. Mileage Engine Air Conditioncr Number of Records

with Car Value:Hi Number of Records with Car Value:Lo

H Hi Hi Hi Lo Lo Lo Lo

Good Good Bad Bad

Good Good Bad Bad

Working Broken

Working Broken

Working Broken

Working Broken

J

1 1 0 I

I 0

4

z K

A

0 1 2 z

5.1-0 Exercises 32L

P ( B = b a d ) = 0 . 1 P(F=empty) =0.2

(u)

P(S = no I B – good, F = not empty) = 0.1 P(S = no I 3 = good, F = emPtY) = 0.8 P(S = no I B = bad, F = not emPty) = 0.9 P(S = no I B = bad, F = emPty) = 1.0

Figure 5.48, Bayesian belief network for Exercise 12.

Classify the data point r : 5.0 according to its 1-, 3-, 5-, and 9-nearest neighbors (using majority vote).

Repeat the previous analysis using the distance-weighted voting approach described in Section 5.2.1.

(b)

14. The nearest-neighbor algorithm described in Section 5.2 can be extended to handle nominal attributes. A variant of the algorithm called PEBLS (Parallel Examplar-Based Learning System) by Cost and Salzberg ll7ll measures the distance between two values of a nominal attribute using the modified value difference metric (MVDM). Given a pair of nominal attribute values’ V1 and

P(G = empty I B = good, F = not empV) = 0.1 P(G = empty I g = good, F = emPtV) = 0.8 P(G = empty I B = bad, F = not empty) = 0.2 P(G = empty I B = bad, F = empty) = 0.9

322 Chapter 5 Classification: Alternative Techniques

Table 5.13. Data set for Exercise 13. x U . D 3.0 4 . O 4.6 4.9 5 .2 5 .3 ( t 7.0 9.5 v + + -t- +

V2,lhe distance between them is defined as follows:

(5.84)

where nii is the number of examples from class i with attribute value Vi and n, is the number of examples with attribute value [.

Consider the training set for the loan classification problem shown in 5.9. Use the MVDM measure to compute the distance between every attribute values for the Home 0wner and Marital Status attributes.

For each of the Boolean functions given below, state whether the problem is linearly separable.

(a) A AND B AND C

(b) NOT A AND B

(“) (,.+ oR B) AND (,4 oR C)

(d) (,4 xoR B) AND (A OR B)

(a) Demonstrate how the perceptron model can be used to represent the AND and OR functions between a pair of Boolean variables.

(b) Comment on the disadvantage of using linear functions as activation func- tions for multilayer neural networks.

You are asked to evaluate the performance of two classification models, M1 and M2. The test set you have chosen contains 26 binary attributes, labeled as ,4 throtgh Z.

Table 5.14 shows the posterior probabilities obtained by applying the models to the test set. (Only the posterior probabilities for the positive class are shown). As this is a two-class problem, P(-) : 1- P(+) and P(-lA, . . . , Z) : I – P(+lA, . . . , Z). Assume that we are mostly interested in detecting instances from the positive class.

(a) PIot the ROC curve for both M1 and M2. (You should plot them on the same graph.) Which model do you think is better? Explain your reasons.