24O Chapter 5 Classification: Alternative Techniques

(b) (c)

Figure 5.12. Representing probabilistic relationships using directed acyclic graphs.

When the prior probabilities are different, the decision boundary shifts toward the class with lower prior probability (see Exercise 10 on page 319). Furthermore, the minimum error rate attainable by any classifier on the given data can also be computed. The ideal decision boundary in the preceding example classifies all creatures whose lengths are less than ft as alligators and those whose lengths are greater than 0 as crocodiles. The error rate of the classifier is given by the sum of the area under the posterior probability curve for crocodiles (from length 0 to i) and the area under the posterior probability curve for alligators (from f to oo):

Error : fo”

et ‘o.odirelX)d , * Ir*

P(Alri-gat orlx)d,X.

The total error rate is known as the Bayes error rate.

5.3.5 Bayesian Belief Networks

The conditional independence assumption made by naive Bayes classifiers may seem too rigid, especially for classification problems in which the attributes are somewhat correlated. This section presents a more flexible approach for modeling the class-conditional probabilities P(Xlf). Instead of requiring all the attributes to be conditionally independent given the class, this approach allows us to specify which pair of attributes are conditionally independent. We begin with a discussion on how to represent and build such a probabilistic model, followed by an example of how to make inferences from the model.

(a)

5.3 Bayesian Classifiers 24L

Model Representation

A Bayesian belief network (BBN), or simply, Bayesian network, provides a graphical representation of the probabilistic relationships among a set of ran- dom variables. There are two key elements of a Bayesian network:

1. A directed acyclic graph (dag) encoding the dependence relationships among a set of variables.

2. A probability table associating each node to its immediate parent nodes.

Consider three random variables, -4, B, and C, in which A and B are independent variables and each has a direct influence on a third variable, C. The relationships among the variables can be summarized into the directed

acyclic graph shown in Figure 5.12(a). Each node in the graph represents a variable, and each arc asserts the dependence relationship between the pair

of variables. If there is a directed arc from X to Y, then X is the parent of Y and Y is the child of X. F\rrthermore, if there is a directed path in the network from X to Z, then X is an ancestor of Z, whlle Z is a descendant

of X. For example, in the diagram shown in Figure 5.12(b), A is a descendant of D and D is an ancestor of B. Both B and D arc also non-descendants of A. An important property of the Bayesian network can be stated as follows:

Property 1 (Conditional Independence). A node in a Bayesi’an network

i,s condi,tionally i,ndependent of its non-descendants, i’f i,ts parents are lcnown.

In the diagram shown in Figure 5.12(b), A is conditionally independent of both B and D given C because the nodes for B and D are non-descendants of node A. The conditional independence assumption made by a naive Bayes classifier can also be represented using a Bayesian network, as shown in Figure

5.12(c), where gr is the target class and {Xt,Xz,. . . ,Xa} is the attr ibute set. Besides the conditional independence conditions imposed by the network

topology, each node is also associated with a probability table.

1. If a node X does not have any parents, then the table contains only the prior probability P(X).

2. If a node X has only one parent, Y, then the table contains the condi- tional probability P(XIY).

3. If a node X has multiple parents, {Yt,Yz, . . . ,Yn}, then the table contains the condit ional probabi l i ty P(XlY,Yz,. . . , Yr.) .

242 Chapter 5 Classification: Alternative Techniques

Hb=Yes D=Healthy 0.2 D=Unhealthy 0.85

CP=Yes HD=Yes Hb=Yes 0.8

HD=Yes Hh=Nn 0.6

HD=No Hb=Yes 0.4

HD=No Hb=No 0.1

Figure 5.13. A Bayesian belief network for detecting heart disease and heartburn in patients.

Figure 5.13 shows an example of a Bayesian network for modeling patients with heart disease or heartburn problems. Each variable in the diagram is assumed to be binary-valued. The parent nodes for heart disease (HD) cor- respond to risk factors that may affect the disease, such as exercise (E) and diet (D). The child nodes for heart disease correspond to symptoms of the disease, such as chest pain (CP) and high blood pressure (BP). For example, the diagram shows that heartburn (Hb) may result from an unhealthy diet and may lead to chest pain.

The nodes associated with the risk factors contain only the prior proba- bilities, whereas the nodes for heart disease, heartburn, and their correspond- ing symptoms contain the conditional probabilities. To save space, some of the probabilities have been omitted from the diagram. The omitted prob- abilities can be recovered by noting that P(X – *) : 1 – P(X : r) and P(X : TIY) : 1 – P(X : rlY), where 7 denotes the opposite outcome of r. For example, the conditional probability

P(Hear t D isease: No lExerc ise : No,D ie t : Hea l thy)

1 – P(Heart Disease : YeslExercise : No, Diet : Healthy)

1 – 0 . 5 5 : 0 . 4 5 .

HD=Yes E=Yes D=Heatthy 0.25

E=Yes D=Unhealthl0.45

E=No D=Healthy 0.55

E=No D=Unhealth! 0.75

BayesianClassifiers 243

Model Building

Model building in Bayesian networks involves two steps: (1) creating the struc- ture of the network, and (2) estimating the probability values in the tables associated with each node. The network topology can be obtained by encod- ing the subjective knowledge of domain experts. Algorithm 5.3 presents a systematic procedure for inducing the topology of a Bayesian network.

Algorithm 5.3 Algorithm for generating the topology of a Bayesian network.

1: Let T: (XuXz,…,X4) denote a total order of the variables. 2 : f o r j : T t o d d o 3: Let X76 denote the jth highest order variable in ?. 4: Let r(X711) : {Xz(r), Xrp1, . . . , Xr1-t)} denote the set of variables preced-

ing X713y. 5: Remove the variables from r(Xrti>) that do not affect X1 (using prior knowl-

edge). 6: Create an arc between X71y; and the remaining variables in r(X711). 7: end for

Example 5.4. Consider the variables shown in Figure 5.13. After performing Step 1, Iet us assume that the variables are ordered in the following way: (E,D,HD,H\,CP,BP). From Steps 2 to 7, starting with variable D, we obtain the following conditional probabilities:

. P(DIE) is simplified to P(D).

o P(HDlE, D) cannot be simplified.

o P(HblHD,E,D) is simpl i f ied to P(HblD).

o P(C PlHb, H D, E, D) is simplified to P(C PlHb, H D).

o P(BPICP,Hb,HD,E,D) is simpl i f ied to P(BPIHD).

Based on these conditional probabilities) we can create arcs between the nodes (8, HD), (D, HD), (D, Hb), (HD, CP), (Hb, CP), and (HD, BP). These arcs result in the network structure shown in Figure 5.13. r

Algorithm 5.3 guarantees a topology that does not contain any cycles. The proof for this is quite straightforward. If a cycle exists, then there must be at least one arc connecting the lower-ordered nodes to the higher-ordered nodes, and at least another arc connecting the higher-ordered nodes to the lower- ordered nodes. Since Algorithm 5.3 prevents any arc from connecting the

5.3

244 Chapter 5 Classification: Alternative Techniques

lower-ordered nodes to the higher-ordered nodes, there cannot be any cycles in the topology.

Nevertheless, the network topology may change if we apply a different or- dering scheme to the variables. Some topology may be inferior because it produces many arcs connecting between different pairs of nodes. In principle, we may have to examine all d! possible orderings to determine the most appro- priate topology, a task that can be computationally expensive. An alternative approach is to divide the variables into causal and effect variables, and then draw the arcs from each causal variable to its corresponding effect variables. This approach eases the task of building the Bayesian network structure.

Once the right topology has been found, the probability table associated with each node is determined. Estimating such probabilities is fairly straight- forward and is similar to the approach used by naiVe Bayes classifiers.

Example of Inferencing Using BBN

Suppose we are interested in using the BBN shown in Figure 5.13 to diagnose whether a person has heart disease. The following cases illustrate how the diagnosis can be made under different scenarios.

Case 1: No Prior Information

Without any prior information, we can determine whether the person is likely to have heart disease by computing the prior probabilities P(HD : Yes) and P(HD: No). To simplify the notation, let a € {Yes,No} denote the binary values of Exercise and B e {Healthy,Unhealthy} denote the binary values of Diet.

P(HD:ves) : t tP(Hn :Yes lE : ( t ,D : P)P(E : a ,D : 0 ) d 1 3

: t tp(uo : yesl,O : ( t ,D : i lP(E : a)P(D : g) a R

0 .25 x 0 .7 x0 .25 +0 .45 x 0 .7 x 0 .75+0 .55 x 0 .3 x 0 .25

+ 0.75 x 0.3 x 0.75

0.49.

Since P(HD – no) – 1 – P(ttO : yes) : 0.51, the person has a slightly higher chance of not getting the disease.

5.3 Bayesian Classifiers 245

Case 2: High Blood Pressure

If the person has high blood pressure) we can make a diagnosis about heart disease by comparing the posterior probabilities, P(HD : YeslBP : High) against P(ttO : Nolnt : High). To do this, we must compute P(Be : High):

P(ne :High) : f r lnr : HighlHD:7)p(HD :7)

: 0.85 x 0.49 + 0.2 x 0.51 : 0.5185.

where 7 € {Yes, No}. Therefore, the posterior probability the person has heart disease is

P(Ho : yeslBp : High) : P(ge : HighlHD : Yes)P(HD : Yes)

P(ne : High) 0’85 x 0.49: f f i :o’8033′

Similarly, P(ttO : NolBe : High) – 1 – 0.8033 : 0.1967. Therefore, when a person has high blood pressure, it increases the risk of heart disease.

Case 3: High Blood Pressure, Healthy Diet, and Regular Exercise

Suppose we are told that the person exercises regularly and eats a healthy diet. How does the new information affect our diagnosis? With the new information, the posterior probability that the person has heart disease is

P(Ho :

P(BP

P(ne : HighlD : HeaIthV, E : Yes)

x P(HD : YeslD : Healthy,E : Yes)

P(ee : HighlHD : Yes)P(HD : YeslD : Healthy,E : Yes)

D, f lBe : HighlHD : ?)P(HD :.ylD: Healthy, E : Yes)

0.85 x 0.25 0 .85×0 .25+0 .2×0 .75

: 0.5862,

while the probability that the person does not have heart disease is

P(tt0 : Nolee : High,D : Healthy,E : yes) – 1 – 0.5862 : 0.4138.

Y e s l B P : H l g h , D : H e a l – t h y , E : Y e s )

: HighlHD : Yes, D : Healthy,E : Yes)l

246 Chapter 5 Classification: Alternative Techniques

The model therefore suggests that eating healthily and exercising regularly may reduce a person’s risk of getting heart disease.

Characteristics of BBN

Following are some of the general characteristics of the BBN method:

1. BBN provides an approach for capturing the prior knowledge of a par- ticular domain using a graphical model. The network can also be used to encode causal dependencies among variables.

Constructing the network can be time consuming and requires a large amount of effort. However, once the structure of the network has been determined, adding a new variable is quite straightforward.

Bayesian networks are well suited to dealing with incomplete data. In- stances with missing attributes can be handled by summing or integrat- ing the probabilities over all possible values of the attribute.

Because the data is combined probabilistically with prior knowledge, the method is quite robust to model overfitting.

5.4 Artificial Neural Network (ANN)

The study of artificial neural networks (ANN) was inspired by attempts to simulate biological neural systems. The human brain consists primarily of nerve cells called neurons) linked together with other neurons via strands of fiber called axons. Axons are used to transmit nerve impulses from one neuron to another whenever the neurons are stimulated. A neuron is connected to the axons of other neurons via dendrites, which are extensions from the cell body of the neuron. The contact point between a dendrite and an axon is called a synapse. Neurologists have discovered that the human brain learns by changing the strength of the synaptic connection between neurons upon repeated stimulation by the same impulse.

Analogous to human brain structure, an ANN is composed of an inter- connected assembly of nodes and directed links. In this section, we will exam- ine a family of ANN models, starting with the simplest model called percep- tron, and show how the models can be trained to solve classification problems.

2.