Rule Ordering After generating the rule set, C4.5rules uses the class-based ordering scheme to order the extracted rules. Rules that predict the same class are grouped together into the same subset. The total description length for each subset is computed, and the classes are arranged in increasing order of their total description length. The class that has the smallest description

5 .2 Nearest-Neighbor classifiers 223

length is given the highest priority because it is expected to contain the best set of rules. The total description length for a class is given by tr”*ception * g X

Imodel, where trexception is the number of bits needed to encode the misclassified

examples, Zmodel is the number of bits needed to encode the model, and g is a

tuning parameter whose default value is 0.5. The tuning parameter depends on the number of redundant attributes present in the model. The value of the

tuning parameter is small if the model contains many redundant attributes.

5.1.6 Characteristics of Rule-Based Classifiers

A rule-based classifier has the following characteristics:

o The expressiveness of a rule set is almost equivalent to that of a decision

tree because a decision tree can be represented by a set of mutually ex-

clusive and exhaustive rules. Both rule-based and decision tree classifiers

create rectilinear partitions of the attribute space and assign a class to

each partition. Nevertheless, if the rule-based classifier allows multiple

rules to be triggered for a given record, then a more complex decision

boundary can be constructed.

o Rule-based classifiers are generally used to produce descriptive models

that are easier to interpret, but gives comparable performance to the

decision tree classifier.

o The class-based ordering approach adopted by many rule-based classi-

fiers (such as RIPPER) is well suited for handling data sets with imbal-

anced class distributions.

5.2 Nearest-Neighbor classifiers

The classification framework shown in Figure 4.3 involves a two-step process:

(1) an inductive step for constructing a classification model from data, and (2) a deductive step for applying the model to test examples. Decision tree

and rule-based classifiers are examples of eager learners because they are

designed to learn a model that maps the input attributes to the class label as

soon as the training data becomes available. An opposite strategy would be to

delay the process of modeling the training data until it is needed to classify the

test examples. Techniques that employ this strategy are known as lazy learn-

ers. An example of alazy learner is the Rote classifier, which memorizes the

entire training data and performs classification only if the attributes of a test

instance match one of the training examples exactly. An obvious drawback of

224 Chapter 5 Classification: Alternative Techniques

++ (a) 1-nearest neighbor

Figure 5.7. The

(b) 2-nearest neighbor (c) 3-nearest neighbor

1-,2-,and 3-nearest neighbors of an instance.

this approach is that some test records may not be classified because they do not match any training example.

One way to make this approach more flexible is to find all the training examples that are relatively similar to the attributes of the test example. These examples, which are known as nearest neighbors, can be used to determine the class label of the test example. The justification for using nearest neighbors is best exemplified by the following saying: “If i,t walks li,ke a duck, quacks li,lce a duck, and looks li,ke a duck, then i,t’s probably a duck.” A nearest- neighbor classifier represents each example as a data point in a d-dimensional space, where d is the number of attributes. Given a test example, we compute its proximity to the rest of the data points in the training set, using one of the proximity measures described in Section 2.4 on page 65. The k-nearest neighbors of a given example z refer to the k points that are closest to z.

Figure 5.7 illustrates the L-, 2-, and 3-nearest neighbors of a data point located at the center of each circle. The data point is classified based on the class labels of its neighbors. In the case where the neighbors have more than one label, the data point is assigned to the majority class of its nearest neighbors. In Figure 5.7(a), the l-nearest neighbor of the data point is a negative example. Therefore the data point is assigned to the negative class. If the number of nearest neighbors is three, as shown in Figure 5.7(c), then the neighborhood contains two positive examples and one negative example. Using the majority voting scheme, the data point is assigned to the positive class. In the case where there is a tie between the classes (see Figure 5.7(b)), we may randomly choose one of them to classify the data point.

The preceding discussion underscores the importance of choosing the right value for k. If k is too small, then the nearest-neighbor classifier may be

+ +

5 .2 Nearest-Neighborclassifiers 225

Figure 5.8. k-nearest neighbor classification with large /c.

susceptible to overfitting because of noise in the training data. On the other hand, if k is too large, the nearest-neighbor classifier may misclassify the test instance because its list of nearest neighbors may include data points that are located far away from its neighborhood (see Figure 5.8).

5.2.L Algorithm

A high-level summary of the nearest-neighbor classification method is given in Algorithm 5.2. The algorithm computes the distance (or similarity) between each test example , : (*’,y’) and all the training examples (x, g) e D to determine its nearest-neighbor list, D”. Such computation can be costly if the number of training examples is large. However, efficient indexing techniques are available to reduce the amount of comoutations needed to find the nearest neighbors of a test example.

Algorithm 5.2 The k-nearest neighbor classification algorithm. 7: Let k be the number of nearest neighbors and D be the set of training examples. 2: fot each test example

” : (x’ ,E’) do

3: Compute d(x’,x), the distance between z and every example, (x,y) e D. 4: Select D, ! D, the set of k closest training examples to z. 5: y’ : argmax-D6u,ouy.n” I(u : yu1

6: end for

226 Chapter 5 Classification: Alternative Techniques

Once the nearest-neighbor list is obtained, the test example is classified based on the majority class of its nearest neighbors:

Majority Voting: g’ : argmax I ( , : A) , (5.7) U” (xi,g6)€D.

where u is a class label, 916 is the class label for one of the nearest neighbors, and 1(.) is an indicator function that returns the value 1 if its argument is true and 0 otherwise.

In the majority voting approach, every neighbor has the same impact on the classification. This makes the algorithm sensitive to the choice of k, as shown in Figure 5.7. One way to reduce the impact of k is to weight the influence of each nearest neighbor x; according to its distance: wt.:7ld(x’,x4)2. As a result, training examples that are located far away from z have a weaker impact on the classification compared to those that are located close to z. Using the distance-weighted voting scheme, the class label can be determined as follows:

Distance-Weighted Voting: A’ : argrnax u; x I(u : Ai). (5.8) (x ; ,v6)€D.

5.2.2 Characteristics of Nearest-Neighbor Classifiers

The characteristics of the nearest-neighbor classifier are summarized below:

o Nearest-neighbor classification is part of a more general technique known as instance-based learning, which uses specific training instances to make predictions without having to maintain an abstraction (or model) de- rived from data. Instance-based learning algorithms require a proximity measure to determine the similarity or distance between instances and a classification function that returns the predicted class of a test instance based on its proximity to other instances.

o Lazy learners such as nearest-neighbor classifiers do not require model building. However, classifying a test example can be quite expensive because we need to compute the proximity values individually between the test and training examples. In contrast, eager learners often spend the bulk of their computing resources for model building. Once a model has been built, classifying a test example is extremely fast.

o Nearest-neighbor classifiers make their predictions based on local infor- mation, whereas decision tree and rule-based classifiers attempt to find

BayesianClassifiers 227

a global model that fits the entire input space. Because the classification decisions are made locally, nearest-neighbor classifiers (with small values of /c) are quite susceptible to noise.

Nearest-neighbor classifiers can produce arbitrarily shaped decision bound- aries. Such boundaries provide a more flexible model representation compared to decision tree and rule-based classifiers that are often con- strained to rectilinear decision boundaries. The decision boundaries of nearest-neighbor classifiers also have high variability because they de- pend on the composition of training examples. Increasing the number of nearest neighbors may reduce such variability.

Nearest-neighbor classifiers can produce wrong predictions unless the appropriate proximity measure and data preprocessing steps are taken. For example, suppose we want to classify a group of people based on attributes such as height (measured in meters) and weight (measured in pounds). The height attribute has a low variability, ranging from 1.5 m to 1.85 m, whereas the weight attribute may vary from 90 lb. to 250 lb. If the scale of the attributes are not taken into consideration, the proximity measure may be dominated by differences in the weights of a person.

5.3 Bayesian Classifiers

In many applications the relationship between the attribute set and the class variable is non-deterministic. In other words, the class label of a test record cannot be predicted with certainty even though its attribute set is identical to some of the training examples. This situation may arise because of noisy data or the presence of certain confounding factors that affect classification but are not included in the analysis. For example, consider the task of pre-

dicting whether a person is at risk for heart disease based on the person’s diet and workout frequency. Although most people who eat healthily and exercise regularly have less chance of developing heart disease, they may still do so be- cause of other factors such as heredity excessive smoking, and alcohol abuse. Determining whether a person’s diet is healthy or the workout frequency is sufficient is also subject to interpretation, which in turn may introduce uncer- tainties into the learning problem.

This section presents an approach for modeling probabilistic relationships between the attribute set and the class variable. The section begins with an introduction to the Bayes theorem, a statistical principle for combining prior

5.3