4.3.2 How to Build a Decision Tbee

In principle, there are exponentially many decision trees that can be con- structed from a given set of attributes. While some of the trees are more accu- rate than others, finding the optimal tree is computationally infeasible because of the exponential size of the search space. Nevertheless, efficient algorithms have been developed to induce a reasonably accurate, albeit suboptimal, de- cision tree in a reasonable amount of time. These algorithms usually employ a greedy strategy that grows a decision tree by making a series of locally op-

L52 Chapter 4 Classification

Unlabeled data

Non- mammats

Figure 4.5. Classifying an unlabeled vertebrate. The dashed lines represent the outcomes of applying various attribute test conditions on the unlabeled vertebrate. The vertebrate is eventually assigned to lIg lrf 6a-aammal class.

timum decisions about which attribute to use for partitioning the data. One

such algorithm is fluntts algorithm, which is the basis of many existing de-

cision tree induction algorithms, including ID3, C4.5, and CART. This section presents a high-level discussion of Hunt’s algorithm and illustrates some of its

design issues.

flunt’s Algorithm

In Hunt’s algorithm, a decision tree is grown in a recursive fashion by parti-

tioning the training records into successively purer subsets. Let Dl be the set

of training records that are associated with node t and g : {At,U2, . . . ,A”} be

the class labels. The following is a recursive definition of Hunt’s algorithm.

Step 1: If ail the records in Dt belong to the same class y1, then t is a leaf

node labeled as y.

Step 2: If D; contains records that belong to more than one class, an at-

tribute test condition is selected to partition the records into smaller

subsets. A child node is created for each outcome of the test condi-

tion and the records in D1 are distributed to the children based on the

outcomes. The algorithm is then recursively applied to each child node.

Name Body temperature Gives Birth Class Flamingo Warm No

4.3 Decision TYee Induction 1-53

a.”” “””””””J”

Figure 4.6. Training set for predicting borrowers who will default on loan payments.

To illustrate how the algorithm works, consider the problem of predicting whether a loan applicant will repay her loan obligations or become delinquent, subsequently defaulting on her loan. A training set for this problem can be constructed by examining the records of previous borrowers. In the example shown in Figure 4.6, each record contains the personal information of a bor- rower along with a class label indicating whether the borrower has defaulted on loan payments.

The initial tree for the classification problem contains a single node with class label Defaulted = No (see Figure a.7@)), which means that most of the borrowers successfully repaid their loans. The tree, however, needs to be refined since the root node contains records from both classes. The records are subsequently divided into smaller subsets based on the outcomes of the Hone Owner test condition) as shown in Figure 4.7(b). The justification for choosing this attribute test condition will be discussed later. For now, we will assume that this is the best criterion for splitting the data at this point. Hunt’s algorithm is then applied recursively to each child of the root node. From the training set given in Figure 4.6, notice that all borrowers who are home owners successfully repaid their loans. The left child of the root is therefore a leaf node labeled Def aulted = No (see Figure 4.7(b)). For the right child, we need to continue applying the recursive step of Hunt’s algorithm until all the records belong to the same class. The trees resulting from each recursive step are shown in Figures a.7@) and (d).

Yes No No Yes No No Yes No No No

125K 100K 70K 120K 95K 60K 220K 85K 75K 90K

L54 Chapter 4 Classification

Figure 4.7. Hunt’s algorithm for inducing decision trees.

Hunt’s algorithm will work if every combination of attribute values is present in the training data and each combination has a unique class label. These assumptions are too stringent for use in most practical situations. Ad- ditional conditions are needed to handle the following cases:

1. It is possible for some of the child nodes created in Step 2 to be empty; i.e., there are no records associated with these nodes. This can happen if none of the training records have the combination of attribute values associated with such nodes. In this case the node is declared a leaf node with the same class label as the majority class of training records associated with its parent node.

2. In Step 2, if all the records associated with D; have identical attribute values (except for the class label), then it is not possible to split these records any further. In this case, the node is declared a leaf node with the same class label as the majority class of training records associated with this node.

4.3 Decision TYee Induction 155

Design Issues of Decision TYee Induction

A learning algorithm for inducing decision trees must address the following two issues.

1. How should the training records be split? Each recursive step of the tree-growing process must select an attribute test condition to divide the records into smaller subsets. To implement this step, the algorithm must provide a method for specifying the test condition for different attribute types as well as an objective measure for evaluating the goodness of each test condition.

2. How should the splitting procedure stop? A stopping condition is needed to terminate the tree-growing process. A possible strategy is to continue expanding a node until either all the records belong to the same class or all the records have identical attribute values. Although both conditions are sufficient to stop any decision tree induction algorithm, other criteria can be imposed to allow the tree-growing procedure to terminate earlier. The advantages of early termination will be discussed later in Section 4.4.5.

4.3.3 Methods for Expressing Attribute Test Conditions

Decision tree induction algorithms must provide a method for expressing an attribute test condition and its correspondins outcomes for different attribute types.

Binary Attributes The test condition for a binary attribute generates two potential outcomes, as shown in Figure 4.8.

Warm- Cold- blooded blooded

Figure 4,8. Test condition for binary attributes.

156 Chapter 4 Classification

{Manied} {Single, Divorced)

Divorced

(a) Multiway split

{Single} {Married, Divorced)

{Single, {Divorced} Married)

Single

OR

(b) Binary split {by grouping attribute values}

Figure 4.9. Test conditions for nominal attributes.

Nominal Attributes Since a nominal attribute can have many values, its

test condition can be expressed in two ways, as shown in Figure 4.9. For

a multiway split (Figure 4.9(a)), the number of outcomes depends on the

number of distinct values for the corresponding attribute. For example, if

an attribute such as marital status has three distinct values-single, married,

or divorced-its test condition will produce a three-way split. On the other

hand, some decision tree algorithms, such as CART, produce only binary splits

by considering all 2k-1 – 1 ways of creating a binary partition of k attribute

values. Figure 4.9(b) illustrates three different ways of grouping the attribute

values for marital status into two subsets.

Ordinal Attributes Ordinal attributes can also produce binary or multiway

splits. Ordinal attribute values can be grouped as long as the grouping does

not violate the order property of the attribute values. Figure 4.10 illustrates

various ways of splitting training records based on the Shirt Size attribute.

The groupings shown in Figures 4.10(a) and (b) preserve the order among

the attribute values, whereas the grouping shown in Figure a.10(c) violates

this property because it combines the attribute values Small and Large into

{Small, {Large, Medium) Extra Large)

(b) (c)

Figure 4.10. Different ways of grouping ordinal attribute values.

the same partition while Mediun and Extra Large are combined into another partition.

Continuous Attributes For continuous attributes, the test condition can be expressed as a comparison test (A < u) or (A 2 ,) with binary outcomes, or a range query with outcomes of the form ui I A l ut+t, fot ‘ i: L,… ,k. The difference between these approaches is shown in Figure 4.11. For the binary case, the decision tree algorithm must consider all possible split positions u, and it selects the one that produces the best partition. For the multiway split, the algorithm must consider all possible ranges of continuous values. One approach is to apply the discretization strategies described in Section 2.3.6 on page 57. After discretization, a new ordinal value will be assigned to each discretized interval. Adjacent intervals can also be aggregated into wider ranges as long as the order property is preserved.

(b)

Figure 4.1 1. Test condition for continuous attributes.

{Small, {Medium, Large) Extra Large)

4.3 Decision Tlee Induction L57

{Small} {Medium,Large, Extra Large)

(a)

{10K, 25K} {25K, 50K} {50K, 80K}

(a)

158 Chapter 4 Classification

Figure 4.12. Multiway versus binary splits.

4.3.4 Measures for Selecting the Best Split

There are many measures that can be used to determine the best way to split

the records. These measures are defined in terms of the class distribution of

the records before and after splitting. Let p(i.lt) denote the fraction of records belonging to class i at a given node

t. We sometimes omit the reference to node I and express the fraction as p,;.

In a two-class problem, the class distribution at any node can be written as

(po,pt), where Pt:7 -Po. To illustrate, consider the test conditions shown

in Figure 4.12. The class distribution before splitting is (0.5,0.5) because

there are an equal number of records from each class. If we split the data

using the Gender attribute, then the class distributions of the child nodes are

(0.6,0.4) and (0.4,0.6), respectively. Although the classes are no longer evenly

distributed, the child nodes still contain records from both classes. Splitting

on the second attribute, Car Type, will result in purer partitions.

The measures developed for selecting the best split are often based on the

degree of impurity of the child nodes. The smaller the degree of impurity, the

more skewed the class distribution. For example, a node with class distribu-

tion (0,1) has zero impurity, whereas a node with uniform class distribution (0.5,0.5) has the highest impurity. Examples of impurity measures include

c – l

Entropy(t) : -ln?.lt)rog2pllt), o1_,

Gini(r) : L -|,lp11t)12, i,:o

Classification error(t) : 1 -max[p(ilt)]’

where c is the number of classes and 0log2 0 : O ,” entropy calculations.

(4.3)

(4.4)

(4.5)

C0:1 C 1 : 3

C 0 : 8 C 1 : 0

C0:0 C 1 : 1

C0 :0 C 1 : 1

C 0 : 1 C 1 : 0

0.7

o.4

Entropy

eini -.’-‘ )’:-‘l’l– ‘ – ) . ‘ . . ‘ . . . – – .

– a a t . a , t

– ‘ \ . \ – ‘ a a

– a t – . a ‘ – ‘ a . – a a

, ‘ . ‘ t ‘ \ ,

– r ” – . a ‘ t

\ ‘ \ . \

,’ ,’- Misclassification error ‘\.

4.3 Decision Tlee Induction 159

Gini : r – (016)2 – (616)2 : 0 Entropy : -(016) Iog2(0/6) – (616) log2(6/6) :0 Error : 1 – maxl0/6,6/6] : 0

Gini : | – Gl6)2 – Fl6)2 : 0.278 Entropy : -(Il6)togr(716) – (516)ro92616): 0.650 Error : 1 – maxfl/6,516]l : 9.167

Gini : r – (Jl6)2 – (3/6)2 : 9.5 Entropy : -(316) Iog2(3/6) – (3/6)logr(3/6) : 1 Error : 1 – maxlS I 6, 3 I 6l : 0.b

0.1

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 o

Figure 4.13. Comparison among the impurity measures for binary classification problems.

Figure 4.13 compares the values of the impurity measures for binary classi- fication problems. p refers to the fraction of records that belong to one of the two classes. Observe that all three measures attain their maximum value when the class distribution is uniform (i.e., when p : 0.5). The minimum values for the measures are attained when all the records belong to the same class (i.e., when p equals 0 or 1). We next provide several examples of computing the different impuritv measures.

Node Nr Uount Class:0 0 Class:1 6

Node Ah Count CIass:0 I Class:1

Node l/a Uount Class:0 .) Ulass:1 3

160 Chapter 4 Classification

The preceding examples, along with Figure 4.13, illustrate the consistency

among different impurity measures. Based on these calculations, node lft has

the lowest impurity value, followed by l{z and lf3. Despite their consistency,

the attribute chosen as the test condition may vary depending on the choice

of impurity measure, as will be shown in Exercise 3 on page 198. To determine how well a test condition performs, we need to compare the

degree of impurity of the parent node (before splitting) with the degree of

impurity of the child nodes (after splitting). The larger their difference, the

better the test condition. The gain, A, is a criterion that can be used to

determine the goodness of a split:

(4.6) j : t

where 1(.) is the impurity measure of a given node, ly’ is the total number of

records at the parent node, k is the number of attribute values, and l/(u3)

is the number of records associated with the child node, u7. Decision tree

induction algorithms often choose a test condition that maximizes the gain

A. Since /(parent) is the same for all test conditions, maximizing the gain is

equivalent to minimizing the weighted average impurity measures of the child

nodes. Finally, when entropy is used as the impurity measure in Equation 4.6,

the difference in entropy is known as the information gain, 461o.

Splitting of Binary Attributes

Consider the diagram shown in Figure 4.14. Suppose there are two ways to

split the data into smaller subsets. Before splitting, the Gini index is 0.5 since

there are an equal number of records from both classes. If attribute A is chosen

to split the data, the Gini index for node N1 is 0.4898, and for node N2, it

is 0.480. The weighted average of the Gini index for the descendent nodes is (71L2) x 0.4898 + (5112) x 0.480 : 0.486. Similarly, we can show that the

weighted average of the Gini index for attribute B is 0.375. Since the subsets

for attribute B have a smaller Gini index, it is preferred over attribute A.

Splitting of Nominal Attributes

As previously noted, a nominal attribute can produce either binary or multi-

way splits, as shown in Figure 4.15. The computation of the Gini index for a

binary split is similar to that shown for determining binary attributes. For the

first binary grouping of the Car Type attribute, the Gini index of {Sports,

k

A:I(pare”t) – i Yttr),

4.3 Decision Tree Induction L6L

Figure 4.14. Splitting binary attributes.

6”A 6;) {Sports,

–:-sl,- {Family,

tuxut).,/ \Tmrry) nxuryl.z \<{orts}

(a) Binary split (b) Multiway split

Figure 4.15. Splitting nominal attributes.

Luxury) is 0.4922 and the Gini index of {Fa:rily} is 0.3750. The weighted average Gini index for the grouping is equal to

16120 x 0.4922 + 4120 x 0.3750 : 0.468.

Similarly, for the second binary grouping of {Sports} and {Fanily, Luxury}, the weighted average Gini index is 0.167. The second grouping has a lower Gini index because its corresponding subsets are much purer.

L62 Chapter 4 Classification

Figure 4.16. Splitting continuous attributes.

For the multiway split, the Gini index is computed for every attribute value.

Since Gini({raniry}) : 0.375, Gini({Sports}) : 0, and Gini({Luxury}) :

0.219, the overall Gini index for the multiway split is equal to

4120 x 0.375 + 8120 x 0 + 8120 x 0.219 : 0.163.

The multiway split has a smaller Gini index compared to both two-way splits.

This result is not surprising because the two-way split actually merges some

of the outcomes of a multiway split, and thus, results in less pure subsets.

Splitting of Continuous Attributes

Consider the example shown in Figure 4.16, in which the test condition Annual

Income ( u is used to split the training records for the loan default classifica-

tion problem. A brute-force method for finding u is to consider every value of

the attribute in the ly’ records as a candidate split position. For each candidate

u, the data set is scanned once to count the number of records with annual

income less than or greater than u. We then compute the Gini index for each

candidate and choose the one that gives the lowest value. This approach is

computationally expensive because it requires O(,nf) operations to compute

the Gini index at each candidate split position. Since there are -l[ candidates,

the overall complexity of this task is O(N\. To reduce the complexity, the

training records are sorted based on their annual income, a computation that

requires O(,n/logli) time. Candidate split positions are identified by taking

the midpoints between two adjacent sorted values: 55, 65, 72, and so on. How-

ever, unlike the brute-force approach, we do not have to examine all ly’ records

when evaluating the Gini index of a candidate split position. For the first candidate. u : 55. none of the records has annual income less

than $55K. As a result, the Gini index for the descendent node with Annual

Decision Tlee Induction 163

fncome < $55K is zero. On the other hand, the number of records with annual income greater than or equal to $55K is 3 (for class Yes) and 7 (for class No), respectively. Thus, the Gini index for this node is 0.420. The overall Gini index for this candidate split position is equal to 0 x 0 + 1 x 0.420 :0.420.