Estimating Conditional Probabilities for Continuous Attributes

There are two ways to estimate the class-conditional probabilities for continuous attributes in naive Bayes classifiers:

1. We can discretize each continuous attribute and then replace the continuous attribute value with its corresponding discrete interval. This approach transforms the continuous attributes into ordinal attributes. The conditional probability \(P(X_i \mid Y = y)\) is estimated by computing the fraction of training records belonging to class \(y\) that falls within the corresponding interval for \(X_i\). The estimation error depends on the discretization strategy (as described in Section 2.3.6 on page 57), as well as the number of discrete intervals. If the number of intervals is too large, there are too few training records in each interval to provide a reliable estimate for \(P(X_i \mid Y)\). On the other hand, if the number of intervals is too small, then some intervals may aggregate records from different classes and we may miss the correct decision boundary.

2. We can assume a certain form of probability distribution for the continuous variable and estimate the parameters of the distribution using the training data. A Gaussian distribution is usually chosen to represent the class-conditional probability for continuous attributes. The distribution is characterized by two parameters, its mean, \(\mu\), and variance, \(\sigma^2\). For each class \(y_j\), the class-conditional probability for attribute \(X_i\) is

\[
P(X_i = x_i \mid Y = y_j) = \frac{1}{\sqrt{2\pi}\,\sigma_{ij}} \exp\left[-\frac{(x_i - \mu_{ij})^2}{2\sigma_{ij}^2}\right] \tag{5.16}
\]

The parameter \(\mu_{ij}\) can be estimated based on the sample mean of \(X_i\) (\(\bar{x}\)) for all training records that belong to the class \(y_j\). Similarly, \(\sigma_{ij}^2\) can be estimated from the sample variance (\(s^2\)) of such training records. For example, consider the annual income attribute shown in Figure 5.9. The sample mean and variance for this attribute with respect to the class No are

\[
\bar{x} = \frac{125 + 100 + 70 + \cdots + 75}{7} = 110
\]

\[
s^2 = \frac{(125 - 110)^2 + (100 - 110)^2 + \cdots + (75 - 110)^2}{7 - 1} = 2975
\]

\[
s = \sqrt{2975} = 54.54.
\]

234 Chapter 5 Classification: Alternative Techniques

Given a test record with taxable income equal to $120K, we can compute its class-conditional probability as follows:

\[
P(\text{Income} = 120 \mid \text{No}) = \frac{1}{\sqrt{2\pi}\,(54.54)} \exp\left[-\frac{(120 - 110)^2}{2 \times 2975}\right] = 0.0072.
\]

Note that the preceding interpretation of class-conditional probability is somewhat misleading. The right-hand side of Equation 5.16 corresponds to a probability density function, \(f(X_i; \mu_{ij}, \sigma_{ij})\). Since the function is continuous, the probability that the random variable \(X_i\) takes a particular value is zero. Instead, we should compute the conditional probability that \(X_i\) lies within the interval from \(x_i\) to \(x_i + \epsilon\), where \(\epsilon\) is a small constant:

\[
P(x_i \le X_i \le x_i + \epsilon \mid Y = y_j) = \int_{x_i}^{x_i + \epsilon} f(X_i; \mu_{ij}, \sigma_{ij})\, dX_i \approx f(x_i; \mu_{ij}, \sigma_{ij}) \times \epsilon. \tag{5.17}
\]

Since \(\epsilon\) appears as a constant multiplicative factor for each class, it cancels out when we normalize the posterior probability for \(P(Y \mid X)\). Therefore, we can still apply Equation 5.16 to approximate the class-conditional probability \(P(X_i \mid Y)\).
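The Gaussian estimate in item 2 can be checked numerically. The following is a minimal Python sketch rather than part of the original text. The helper name `gaussian_density` is hypothetical, and the list of seven class-No incomes is an assumption: it is the subset of the income values listed for Figure 5.10(a) that reproduces the stated sample mean of 110 and sample variance of 2975.

```python
import math
from statistics import mean, variance


def gaussian_density(x, mu, var):
    """Gaussian density of Equation 5.16 for mean mu and variance var."""
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


# Assumed class-No annual incomes in $K (see the lead-in for where they come from).
incomes_no = [125, 100, 70, 120, 60, 220, 75]

mu_no = mean(incomes_no)        # sample mean
var_no = variance(incomes_no)   # sample variance, which divides by n - 1

# Density at a taxable income of $120K, used in place of P(Income = 120 | No).
density_no = gaussian_density(120, mu_no, var_no)
print(mu_no, var_no, round(density_no, 4))
```

Because the value returned is a density, it only stands in for a probability up to the common factor \(\epsilon\) discussed above.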

Example of the Naive Bayes Classifier

Consider the data set shown in Figure 5.10(a). We can compute the class-conditional probability for each categorical attribute, along with the sample mean and variance for the continuous attribute using the methodology described in the previous subsections. These probabilities are summarized in Figure 5.10(b).

To predict the class label of a test record X = (Home Owner = No, Marital Status = Married, Income = $120K), we need to compute the posterior probabilities \(P(\text{No} \mid X)\) and \(P(\text{Yes} \mid X)\). Recall from our earlier discussion that these posterior probabilities can be estimated by computing the product between the prior probability \(P(Y)\) and the class-conditional probabilities \(\prod_i P(X_i \mid Y)\), which corresponds to the numerator of the right-hand side term in Equation 5.15.

The prior probabilities of each class can be estimated by calculating the fraction of training records that belong to each class. Since there are three records that belong to the class Yes and seven records that belong to the class

5.3 Bayesian Classifiers 235

P(Home Owner = Yes | No) = 3/7
P(Home Owner = No | No) = 4/7
P(Home Owner = Yes | Yes) = 0
P(Home Owner = No | Yes) = 1
P(Marital Status = Single | No) = 2/7
P(Marital Status = Divorced | No) = 1/7
P(Marital Status = Married | No) = 4/7
P(Marital Status = Single | Yes) = 2/3
P(Marital Status = Divorced | Yes) = 1/3
P(Marital Status = Married | Yes) = 0

For Annual Income:
If class = No: sample mean = 110, sample variance = 2975
If class = Yes: sample mean = 90, sample variance = 25

(a) (b)

Figure 5.10. The naive Bayes classifier for the loan classification problem.

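No, \(P(\text{Yes}) = 0.3\) and \(P(\text{No}) = 0.7\). Using the information provided in Figure 5.10(b), the class-conditional probabilities can be computed as follows:

\[
\begin{aligned}
P(X \mid \text{No}) &= P(\text{Home Owner} = \text{No} \mid \text{No}) \times P(\text{Status} = \text{Married} \mid \text{No}) \\
&\quad \times P(\text{Annual Income} = \$120\text{K} \mid \text{No}) \\
&= 4/7 \times 4/7 \times 0.0072 = 0.0024.
\end{aligned}
\]

\[
\begin{aligned}
P(X \mid \text{Yes}) &= P(\text{Home Owner} = \text{No} \mid \text{Yes}) \times P(\text{Status} = \text{Married} \mid \text{Yes}) \\
&\quad \times P(\text{Annual Income} = \$120\text{K} \mid \text{Yes}) \\
&= 1 \times 0 \times 1.2 \times 10^{-9} = 0.
\end{aligned}
\]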

Putting them together, the posterior probability for class No is \(P(\text{No} \mid X) = \alpha \times 7/10 \times 0.0024 = 0.0016\alpha\), where \(\alpha = 1/P(X)\) is a constant term. Using a similar approach, we can show that the posterior probability for class Yes is zero because its class-conditional probability is zero. Since \(P(\text{No} \mid X) > P(\text{Yes} \mid X)\), the record is classified as No.
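The calculation above can be reproduced with a short script. This is a sketch under stated assumptions, not the authors' implementation: the dictionary layout and the helper `gaussian_density` are hypothetical, while the numbers are the ones summarized in Figure 5.10(b).

```python
import math


def gaussian_density(x, mu, var):
    """Gaussian density of Equation 5.16 for mean mu and variance var."""
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


priors = {"No": 7 / 10, "Yes": 3 / 10}
p_owner_no = {"No": 4 / 7, "Yes": 1.0}   # P(Home Owner = No | class)
p_married = {"No": 4 / 7, "Yes": 0.0}    # P(Marital Status = Married | class)
income_params = {"No": (110.0, 2975.0), "Yes": (90.0, 25.0)}  # (mean, variance)

scores = {}
for label, prior in priors.items():
    mu, var = income_params[label]
    likelihood = p_owner_no[label] * p_married[label] * gaussian_density(120.0, mu, var)
    # Posterior up to the constant alpha = 1 / P(X), which is the same for both classes.
    scores[label] = prior * likelihood

predicted = max(scores, key=scores.get)
print(scores, predicted)
```

Dropping \(\alpha\) is safe here because only the ordering of the two scores matters for the decision.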

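[Figure 5.10(a), data set; columns recovered from the figure. Home Owner: Yes, No, No, Yes, No, No, Yes, No, No, No. Annual Income: 125K, 100K, 70K, 120K, 95K, 60K, 220K, 85K, 75K, 90K.]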


M-estimate of Conditional Probability

The preceding example illustrates a potential problem with estimating posterior probabilities from training data. If the class-conditional probability for one of the attributes is zero, then the overall posterior probability for the class vanishes. This approach of estimating class-conditional probabilities using simple fractions may seem too brittle, especially when there are few training examples available and the number of attributes is large.

In a more extreme case, if the training examples do not cover many of the attribute values, we may not be able to classify some of the test records. For example, if P(Marital Status = Divorced | No) is zero instead of 1/7, then a record with attribute set X = (Home Owner = Yes, Marital Status = Divorced, Income = $120K) has the following class-conditional probabilities:

\[
P(X \mid \text{No}) = 3/7 \times 0 \times 0.0072 = 0.
\]

\[
P(X \mid \text{Yes}) = 0 \times 1/3 \times 1.2 \times 10^{-9} = 0.
\]

The naive Bayes classifier will not be able to classify the record. This problem can be addressed by using the m-estimate approach for estimating the conditional probabilities:

\[
P(x_i \mid y_j) = \frac{n_c + mp}{n + m}, \tag{5.18}
\]

where \(n\) is the total number of instances from class \(y_j\), \(n_c\) is the number of training examples from class \(y_j\) that take on the value \(x_i\), \(m\) is a parameter known as the equivalent sample size, and \(p\) is a user-specified parameter. If there is no training set available (i.e., \(n = 0\)), then \(P(x_i \mid y_j) = p\). Therefore \(p\) can be regarded as the prior probability of observing the attribute value \(x_i\) among records with class \(y_j\). The equivalent sample size determines the tradeoff between the prior probability \(p\) and the observed probability \(n_c/n\).

In the example given in the previous section, the conditional probability P(Status = Married | Yes) = 0 because none of the training records for the class has the particular attribute value. Using the m-estimate approach with \(m = 3\) and \(p = 1/3\), the conditional probability is no longer zero:

\[
P(\text{Marital Status} = \text{Married} \mid \text{Yes}) = (0 + 3 \times 1/3)/(3 + 3) = 1/6.
\]
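Equation 5.18 is a one-line computation. The sketch below uses a hypothetical function name, `m_estimate`, and exact fractions so that the result matches the value 1/6 derived above; it also checks the limiting case with no training records.

```python
from fractions import Fraction


def m_estimate(n_c, n, m, p):
    """m-estimate of Equation 5.18: (n_c + m * p) / (n + m)."""
    return (n_c + m * p) / (n + m)


# No Married records among the 3 records of class Yes, with m = 3 and p = 1/3.
married_given_yes = m_estimate(0, 3, 3, Fraction(1, 3))

# With no training records at all (n = 0), the estimate falls back to the prior p.
no_data = m_estimate(0, 0, 3, Fraction(1, 3))

print(married_given_yes, no_data)
```

Larger values of `m` pull the estimate toward `p`, while smaller values leave it closer to the observed fraction.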


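If we assume \(p = 1/3\) for all attributes of class Yes and \(p = 2/3\) for all attributes of class No, then

\[
\begin{aligned}
P(X \mid \text{No}) &= P(\text{Home Owner} = \text{No} \mid \text{No}) \times P(\text{Status} = \text{Married} \mid \text{No}) \\
&\quad \times P(\text{Annual Income} = \$120\text{K} \mid \text{No}) \\
&= 6/10 \times 6/10 \times 0.0072 = 0.0026.
\end{aligned}
\]

\[
\begin{aligned}
P(X \mid \text{Yes}) &= P(\text{Home Owner} = \text{No} \mid \text{Yes}) \times P(\text{Status} = \text{Married} \mid \text{Yes}) \\
&\quad \times P(\text{Annual Income} = \$120\text{K} \mid \text{Yes}) \\
&= 4/6 \times 1/6 \times 1.2 \times 10^{-9} = 1.3 \times 10^{-10}.
\end{aligned}
\]

The posterior probability for class No is \(P(\text{No} \mid X) = \alpha \times 7/10 \times 0.0026 = 0.0018\alpha\), while the posterior probability for class Yes is \(P(\text{Yes} \mid X) = \alpha \times 3/10 \times 1.3 \times 10^{-10} = 4.0 \times 10^{-11}\alpha\). Although the classification decision has not changed, the m-estimate approach generally provides a more robust way for estimating probabilities when the number of training examples is small.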
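The smoothed posteriors can be recomputed in a few lines. This is an illustrative sketch: the dictionary layout and the name `m_estimate` are hypothetical, the attribute counts are read off the fractions in Figure 5.10(b), and the income densities are the rounded values quoted in the text (0.0072 and 1.2e-9).

```python
def m_estimate(n_c, n, m, p):
    """m-estimate of Equation 5.18: (n_c + m * p) / (n + m)."""
    return (n_c + m * p) / (n + m)


M = 3  # equivalent sample size used in the text

# Per class: number of training records n, prior p for the m-estimate, count of
# records with Home Owner = No, count with Marital Status = Married, and the
# income density at $120K.
classes = {
    "No": {"n": 7, "p": 2 / 3, "owner_no": 4, "married": 4, "income": 0.0072},
    "Yes": {"n": 3, "p": 1 / 3, "owner_no": 3, "married": 0, "income": 1.2e-9},
}
total = sum(c["n"] for c in classes.values())

scores = {}
for label, c in classes.items():
    likelihood = (
        m_estimate(c["owner_no"], c["n"], M, c["p"])
        * m_estimate(c["married"], c["n"], M, c["p"])
        * c["income"]
    )
    # Posterior up to the constant alpha = 1 / P(X).
    scores[label] = (c["n"] / total) * likelihood

predicted = max(scores, key=scores.get)
print(scores, predicted)
```

Neither score is exactly zero any more, so a record is never ruled out solely because one attribute value was unseen for a class.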

Characteristics of Naive Bayes Classifiers

Naive Bayes classifiers generally have the following characteristics:

• They are robust to isolated noise points because such points are averaged out when estimating conditional probabilities from data. Naive Bayes classifiers can also handle missing values by ignoring the example during model building and classification.

• They are robust to irrelevant attributes. If \(X_i\) is an irrelevant attribute, then \(P(X_i \mid Y)\) becomes almost uniformly distributed. The class-conditional probability for \(X_i\) has no impact on the overall computation of the posterior probability.

• Correlated attributes can degrade the performance of naive Bayes classifiers because the conditional independence assumption no longer holds for such attributes. For example, consider the following probabilities:

\[
P(A = 0 \mid Y = 0) = 0.4, \quad P(A = 1 \mid Y = 0) = 0.6,
\]

\[
P(A = 0 \mid Y = 1) = 0.6, \quad P(A = 1 \mid Y = 1) = 0.4,
\]

where \(A\) is a binary attribute and \(Y\) is a binary class variable. Suppose there is another binary attribute \(B\) that is perfectly correlated with \(A\) when \(Y = 0\), but is independent of \(A\) when \(Y = 1\). For simplicity, assume that the class-conditional probabilities for \(B\) are the same as for \(A\). Given a record with attributes \(A = 0\), \(B = 0\), we can compute its posterior probabilities as follows:

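\[
\begin{aligned}
P(Y = 0 \mid A = 0, B = 0) &= \frac{P(A = 0 \mid Y = 0)\, P(B = 0 \mid Y = 0)\, P(Y = 0)}{P(A = 0, B = 0)} \\
&= \frac{0.16 \times P(Y = 0)}{P(A = 0, B = 0)}.
\end{aligned}
\]

\[
\begin{aligned}
P(Y = 1 \mid A = 0, B = 0) &= \frac{P(A = 0 \mid Y = 1)\, P(B = 0 \mid Y = 1)\, P(Y = 1)}{P(A = 0, B = 0)} \\
&= \frac{0.36 \times P(Y = 1)}{P(A = 0, B = 0)}.
\end{aligned}
\]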

If \(P(Y = 0) = P(Y = 1)\), then the naive Bayes classifier would assign the record to class 1. However, the truth is,

\[
P(A = 0, B = 0 \mid Y = 0) = P(A = 0 \mid Y = 0) = 0.4,
\]

because \(A\) and \(B\) are perfectly correlated when \(Y = 0\). As a result, the posterior probability for \(Y = 0\) is

\[
\begin{aligned}
P(Y = 0 \mid A = 0, B = 0) &= \frac{P(A = 0, B = 0 \mid Y = 0)\, P(Y = 0)}{P(A = 0, B = 0)} \\
&= \frac{0.4 \times P(Y = 0)}{P(A = 0, B = 0)},
\end{aligned}
\]

which is larger than that for \(Y = 1\). The record should have been classified as class 0.
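The effect of the correlated attribute can be seen by comparing the two scores side by side. The sketch below is illustrative only: the variable names are hypothetical, and the equal priors of 0.5 follow from the case \(P(Y = 0) = P(Y = 1)\) considered in the text.

```python
# P(A = 0 | Y = y); B is assumed to have the same class-conditional probabilities.
p_a0 = {0: 0.4, 1: 0.6}
prior = {0: 0.5, 1: 0.5}

# Naive Bayes treats A and B as conditionally independent in both classes.
naive = {y: p_a0[y] * p_a0[y] * prior[y] for y in prior}

# True joint: B duplicates A when Y = 0, and is independent of A when Y = 1.
true_joint = {
    0: p_a0[0] * prior[0],
    1: p_a0[1] * p_a0[1] * prior[1],
}

naive_class = max(naive, key=naive.get)
true_class = max(true_joint, key=true_joint.get)
print(naive, naive_class)
print(true_joint, true_class)
```

Counting the duplicated evidence twice shrinks the score for class 0 from 0.4 to 0.16 times the prior, which is enough to flip the decision.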

5.3.4 Bayes Error Rate

Suppose we know the true probability distribution that governs \(P(X \mid Y)\). The Bayesian classification method allows us to determine the ideal decision boundary for the classification task, as illustrated in the following example.

Example 5.3. Consider the task of identifying alligators and crocodiles based on their respective lengths. The average length of an adult crocodile is about 15 feet, while the average length of an adult alligator is about 12 feet. Assuming that their length \(x\) follows a Gaussian distribution with a standard deviation equal to 2 feet, we can express their class-conditional probabilities as follows:

\[
P(X \mid \text{Crocodile}) = \frac{1}{\sqrt{2\pi} \cdot 2} \exp\left[-\frac{1}{2}\left(\frac{X - 15}{2}\right)^2\right] \tag{5.19}
\]

\[
P(X \mid \text{Alligator}) = \frac{1}{\sqrt{2\pi} \cdot 2} \exp\left[-\frac{1}{2}\left(\frac{X - 12}{2}\right)^2\right] \tag{5.20}
\]

Figure 5.11. Comparing the likelihood functions of a crocodile and an alligator.

Figure 5.11 shows a comparison between the class-conditional probabilities for a crocodile and an alligator. Assuming that their prior probabilities are the same, the ideal decision boundary is located at some length \(\hat{x}\) such that

\[
P(X = \hat{x} \mid \text{Crocodile}) = P(X = \hat{x} \mid \text{Alligator}).
\]

Using Equations 5.19 and 5.20, we obtain

\[
\left(\frac{\hat{x} - 15}{2}\right)^2 = \left(\frac{\hat{x} - 12}{2}\right)^2,
\]

which can be solved to yield \(\hat{x} = 13.5\). The decision boundary for this example is located halfway between the two means.
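The intermediate algebra is short enough to write out. The steps below are a worked expansion of the equality of squared terms, assuming only the means of 15 and 12 and the shared standard deviation of 2 given in the example.

```latex
\begin{align*}
\left(\frac{\hat{x} - 15}{2}\right)^2 &= \left(\frac{\hat{x} - 12}{2}\right)^2 \\
(\hat{x} - 15)^2 &= (\hat{x} - 12)^2 \\
\hat{x}^2 - 30\hat{x} + 225 &= \hat{x}^2 - 24\hat{x} + 144 \\
81 &= 6\hat{x} \\
\hat{x} &= 13.5
\end{align*}
```

The quadratic terms cancel because both classes share the same variance, which is why the boundary is a single point midway between the means.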
