228 Chapter 5 Classification: Alternative Techniques

knowledge of the classes with new evidence gathered from data. The use of the Bayes theorem for solving classification problems will be explained, followed by a description of two implementations of Bayesian classifiers: naiVe Bayes and the Bayesian belief network.

5.3.1 Bayes Theorem

Cons’ider a football game between two riual teams: Team 0 and Team 1. Suppose Team 0 wins 65%o of the ti,me and Team 1 w,ins the remaining matches. Among the games won by Team 0, only 30To of them come

from playing on Team 1’s football field. On the other hand,,75To of the u’ictories for Team 1 are obta’ined whi,le playi.ng at home. If Team f is to host the nert match between the two teams, wh,ich team wi,ll most likelu etnerge as the winner?

This question can be answered by using the well-known Bayes theorem. For completeness, we begin with some basic definitions from probability theory. Readers who are unfamiliar with concepts in probability may refer to Appendix C for a brief review of this topic.

Let X and Y be a pair of random variables. Their joint probability, P(X :

r,Y : g), refers to the probability that variable X will take on the value r and variable Y will take on the value g. A conditional probability is the probability that a random variable will take on a particular value given that the outcome for another random variable is known. For example, the conditional probability P(Y :UlX: r) refers to the probability that the variable Y will take on the value g, given that the variable X is observed to have the value r. The joint and conditional probabilities for X and Y are related in the following way:

P(x,Y) : P(Ylx) x P(X) : P(XIY) x P(Y). (5.e)

Rearranging the last two expressions in Equation 5.9 leads to the following formula, known as the Bayes theorem:

P(Ylx) : P(xlY)P(Y) (5 .10)

P(X)

The Bayes theorem can be used to solve the prediction problem stated at the beginning of this section. For notational convenience, let X be the random variable that represents the team hosting the match and Y be the random variable that represents the winner of the match. Both X and Y can

5.3 Bayesian Classifiers 229

take on values from the set {0,1}. We can summarize the information given in the problem as follows:

Probability Team 0 wins is P(Y :0) : 0.65. Probability Team 1 wins is P(Y :1) : 1 – P(Y : 0) : 0.35. Probability Team t hosted the match it won is P(X : llY :1) : 0.75. Probability Team t hosted the match won by Team 0 is P(X : llY :0) : 0.9.

Our objective is to compute P(Y : llx : 1), which is the conditional probability that Team 1 wins the next match it will be hosting, and compares it against P(Y :OlX: 1). Using the Bayes theorem, we obtain

P(Y : l l x : 1 ) : P(X : r lY :1) x P(Y : 1) P(X :1 )

P (X : L lY : 1 ) x P (Y : 1 ) P (X : ! ,Y : 1 ) + P (X : 1 ,Y : 0 )

P(X : L IY :1) x P(Y : 1) P(X : I IY : I )P(Y : 1) * P(X : r lY :O)P(Y : 0)

0.75 x 0.35 0 . 7 5 x 0 . 3 5 + 0 . 3 x 0 . 6 5

: 0.5738.

where the law of total probability (see Equation C.5 on page 722) was applied in the second line. Furthermore, P(Y :OlX : 1) : t – P(Y : llx – 1) :

0.4262. Since P(Y : llx : 1) > P(Y : OlX : 1), Team t has a better chance than Team 0 of winning the next match.

5.3.2 Using the Bayes Theorem for Classification

Before describing how the Bayes theorem can be used for classification, let us formalize the classification problem from a statistical perspective. Let X denote the attribute set and Y denote the class variable. If the class variable has a non-deterministic relationship with the attributes, then we can treat X and Y as random variables and capture their relationship probabilistically using P(YIX). This conditional probability is also known as the posterior probability for Y, as opposed to its prior probability, P(Y).

During the training phase, we need to learn the posterior probabilities P(ylX) for every combination of X and Y based on information gathered from the training data. By knowing these probabilities, a test record X’ can be classified by finding the class Yt that maximizes the posterior probability,

230 Chapter 5 Classification: Alternative Techniques

P(Y’lX’). To illustrate this approach, consider the task of predicting whether a loan borrower will default on their payments. Figure 5.9 shows a training set with the following attributes: Hone Owner, Marital Status, and Annual Income. Loan borrowers who defaulted on their payments are classified as Yes, while those who repaid their loans are classifi.ed as No.

a.”‘ “””””od”

Figure 5.9. Training set for predicting the loan default problem.

Suppose we are given a test record with the following attribute set: X : (Hone Owner : No, Marital Status : Married, Annual Income : $120K). To classify the record, we need to compute the posterior probabilities P(YeslX) and P(NolX) based on information available in the training data. If P(YeslX) > P(NolX), then the record is classified as Yes; otherwise, it is classified as No.

Estimating the posterior probabilities accurately for every possible combi- nation of class labiel and attribute value is a difficult problem because it re- quires a very large training set, even for a moderate number of attributes. The Bayes theorem is useful because it allows us to express the posterior probabil- ity in terms of the prior probability P(f), the class-conditional probability P(X|Y), and the evidence, P(X):

P(y lx ) : P(xlY) x P(Y) (5 .11)P(x)

When comparing the posterior probabilities for different values of Y, the de- nominator term, P(X), is always constant, and thus, can be ignored. The

BayesianClassifiers 23L

prior probability P(f) can be easily estimated from the training set by com- puting the fraction of training records that belong to each class. To estimate the class-conditional probabilities P(Xlf), we present two implementations of Bayesian classification methods: the naiVe Bayes classifier and the Bayesian belief network. These implementations are described in Sections 5.3.3 and 5.3.5, respectively.

5.3.3 NaiVe Bayes Classifier

A naive Bayes classifier estimates the class-conditional probability by assuming that the attributes are conditionally independent, given the class label g. The

conditional independence assumption can be formally stated as follows:

&

P(XIY : a) : f r lxSv : 11, (5.12) i : l

where each attribute set X : {Xr, X2,…, X4} consists of d attributes.

Conditional Independence

Before delving into the details of how a naive Bayes classifier works, let us examine the notion of conditional independence. Let X, Y, and Z denote three sets of random variables. The variables in X are said to be conditionally independent of Y, given Z, 1f the following condition holds:

P(xlY, z) : P(xlz). (5 .13)

An example of conditional independence is the relationship between a person’s

arm length and his or her reading skills. One might observe that people with longer arms tend to have higher levels of reading skills. This relationship can

be explained by the presence of a confounding factor, which is age. A young

child tends to have short arms and lacks the reading skills of an adult. If the age of a person is fixed, then the observed relationship between arm length

and reading skills disappears. Thus, we can conclude that arm length and reading skills are conditionally independent when the age variable is fixed.

5.3

232 Chapter 5 Classification: Alternative Techniques

The conditional independence between X and Y can also be written into a form that looks similar to Equation 5.12:

P(x,Ylz) : P\X ,Y ,Z ) P(Z)

P(X . Y , Z ) , . P (Y .Z )-FEq ^ P(z) P(xlY, z) x P(vlz) P(xlz) x P(Ylz), (5 .14)

where Equation 5.13 was used to obtain the last line of Equation 5.14.

How a Nai’ve Bayes Classifier Works

With the conditional independence assumption, instead of computing the class-conditional probability for every combination of X, we only have to esti- mate the conditional probability of each X4, given Y. The latter approach is more practical because it does not require a very large training set to obtain a good estimate of the probability.

To classify a test record, the naive Bayes classifier computes the posterior probability for each class Y:

(5 .15)

Since P(X) is fixed for every Y, it is sufficient to choose the class that maxi- mizes the numerator term, p(V)l[i:tP(X,lY). In the next two subsections, we describe several approaches for estimating the conditional probabilities P(X,lY) for categorical and continuous attributes.

Estimating Conditional Probabilities for Categorical Attributes

For a categorical attribute Xa, the conditional probability P(Xi : rilY : A) is estimated according to the fraction of training instances in class g that take on a particular attribute value ri. For example, in the training set given in Figure 5.9, three out of the seven people who repaid their loans also own a home. As a result, the conditional probability for P(Home Owner:Yeslno) is equal to 3/7. Similarly, the conditional probability for defaulted borrowers who are single is given by P(Marital Status : SinglelYes) : 213.

p(ytx) – PV)n!: l P(xi lY) ‘ P(x)

5.3 Bayesian Classifiers 233