To classify the Web sessions, features are constructed to describe the characteristics of each session. Figure 4.17(c) shows some of the features used for the Web robot detection task. Notable features include the depth and breadth of the traversal. Depth determines the maximum distance of a requested page, where distance is measured in terms of the number of hyperlinks away from the entry point of the Web site. For example, the home page http://www.cs.umn.edu/~kumar is assumed to be at depth 0, whereas http://www.cs.umn.edu/~kumar/MINDS/MINDS_papers.htm is located at depth 2. Based on the Web graph shown in Figure 4.17(b), the depth attribute for the first session is equal to two. The breadth attribute measures the width of the corresponding Web graph. For example, the breadth of the Web session shown in Figure 4.17(b) is equal to two.
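The depth and breadth features can be computed directly from the session's Web graph with a breadth-first traversal. The sketch below assumes one plausible reading of "width of the Web graph": the largest number of pages at any single hyperlink distance from the entry point. The function name and graph encoding are illustrative, not from the text.

```python
from collections import deque

def session_depth_breadth(edges, entry):
    """Compute the depth and breadth features of a Web session.

    `edges` maps each page to the pages reached from it by hyperlinks
    during the session; `entry` is the session's entry point.
    Depth = maximum hyperlink distance from the entry point.
    Breadth = largest number of pages at any single depth level
    (an assumed interpretation of the graph's "width").
    """
    dist = {entry: 0}
    queue = deque([entry])
    while queue:
        page = queue.popleft()
        for nxt in edges.get(page, []):
            if nxt not in dist:
                dist[nxt] = dist[page] + 1
                queue.append(nxt)
    levels = {}
    for d in dist.values():
        levels[d] = levels.get(d, 0) + 1
    return max(dist.values()), max(levels.values())

# A session entering at the home page and fanning out two levels deep:
graph = {"/": ["/a", "/b"], "/a": ["/a/x"], "/b": ["/b/y"]}
print(session_depth_breadth(graph, "/"))  # (2, 2)
```

Like the session in Figure 4.17(b), this example graph has depth two and breadth two.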
The data set for classification contains 2916 records, with equal numbers of sessions due to Web robots (class 1) and human users (class 0). 10% of the data were reserved for training while the remaining 90% were used for testing. The induced decision tree model is shown in Figure 4.18. The tree has an error rate equal to 3.8% on the training set and 5.3% on the test set.
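The quoted error rates are simply the fraction of misclassified sessions on each partition. A minimal sketch, using hypothetical predictions rather than the actual model output:

```python
def error_rate(actual, predicted):
    """Fraction of records the classifier gets wrong."""
    misses = sum(1 for a, p in zip(actual, predicted) if a != p)
    return misses / len(actual)

# Hypothetical labels for ten test sessions (1 = robot, 0 = human):
actual    = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
predicted = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]
print(error_rate(actual, predicted))  # 0.2
```

The same computation applied to the training records gives the resubstitution error discussed later in Section 4.4.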
The model suggests that Web robots can be distinguished from human users in the following way:
1. Accesses by Web robots tend to be broad but shallow, whereas accesses by human users tend to be more focused (narrow but deep).
2. Unlike human users, Web robots seldom retrieve the image pages associated with a Web document.
3. Sessions due to Web robots tend to be long and contain a large number of requested pages.
168 Chapter 4 Classification
Decision Tree:
depth = 1:
|   breadth > 7: class 1
|   breadth <= 7:
|   |   breadth <= 3:
|   |   |   ImagePages > 0.375: class 0
|   |   |   ImagePages <= 0.375:
|   |   |   |   totalPages <= 6: class 1
|   |   |   |   totalPages > 6:
|   |   |   |   |   breadth <= 1: class 1
|   |   |   |   |   breadth > 1: class 0
|   |   breadth > 3:
|   |   |   MultiIP = 0:
|   |   |   |   ImagePages <= 0.1333: class 1
|   |   |   |   ImagePages > 0.1333:
|   |   |   |   |   breadth <= 6: class 0
|   |   |   |   |   breadth > 6: class 1
|   |   |   MultiIP = 1:
|   |   |   |   TotalTime <= 361: class 0
|   |   |   |   TotalTime > 361: class 1
depth > 1:
|   MultiAgent = 0:
|   |   depth > 2: class 0
|   |   depth <= 2:
|   |   |   MultiIP = 1: class 0
|   |   |   MultiIP = 0:
|   |   |   |   breadth <= 6: class 0
|   |   |   |   breadth > 6:
|   |   |   |   |   RepeatedAccess <= 0.322: class 0
|   |   |   |   |   RepeatedAccess > 0.322: class 1
|   MultiAgent = 1:
|   |   totalPages <= 81: class 0
|   |   totalPages > 81: class 1
Figure 4.18. Decision tree model for Web robot detection.
4. Web robots are more likely to make repeated requests for the same document since the Web pages retrieved by human users are often cached by the browser.
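The induced tree of Figure 4.18 is just a sequence of single-attribute tests, so any branch of it can be transcribed directly as nested conditionals. The sketch below covers only the depth = 1, breadth <= 3 portion; the function name and argument encoding are illustrative.

```python
def classify_robot(depth, breadth, image_pages, total_pages):
    """A fragment of the tree in Figure 4.18, transcribed as nested
    conditionals (class 1 = Web robot, class 0 = human user).
    Only the depth = 1, breadth <= 3 branches are shown; the other
    branches of the figure follow the same pattern."""
    if depth != 1:
        raise NotImplementedError("depth > 1 branches omitted")
    if breadth > 7:
        return 1
    if breadth <= 3:
        if image_pages > 0.375:
            return 0
        if total_pages <= 6:
            return 1
        return 1 if breadth <= 1 else 0
    raise NotImplementedError("3 < breadth <= 7 branches omitted")

# A shallow, broad-ish session with no image requests looks like a robot:
print(classify_robot(depth=1, breadth=2, image_pages=0.0, total_pages=4))  # 1
```

Note how the first fragment encodes characteristics 1 and 2 above: a shallow session with few image pages is routed toward class 1.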
4.3.7 Characteristics of Decision Tree Induction
The following is a summary of the important characteristics of decision tree induction algorithms.
1. Decision tree induction is a nonparametric approach for building classification models. In other words, it does not require any prior assumptions regarding the type of probability distributions satisfied by the class and other attributes (unlike some of the techniques described in Chapter 5).
2. Finding an optimal decision tree is an NP-complete problem. Many decision tree algorithms employ a heuristic-based approach to guide their search in the vast hypothesis space. For example, the algorithm presented in Section 4.3.5 uses a greedy, top-down, recursive partitioning strategy for growing a decision tree.
3. Techniques developed for constructing decision trees are computationally inexpensive, making it possible to quickly construct models even when the training set size is very large. Furthermore, once a decision tree has been built, classifying a test record is extremely fast, with a worst-case complexity of O(w), where w is the maximum depth of the tree.
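The O(w) classification cost follows because a test record visits at most one node per level of the tree. A minimal sketch, using a hypothetical dictionary encoding of the tree:

```python
def classify(record, node):
    """Classify a record by walking from the root to a leaf.
    Each internal node applies one attribute test, so the number of
    tests is bounded by the maximum depth w of the tree."""
    while "label" not in node:                 # still at an internal node
        attr, threshold = node["test"]
        branch = "left" if record[attr] <= threshold else "right"
        node = node[branch]
    return node["label"]

# A depth-2 tree: split on x at the root, then on y in the right branch.
tree = {"test": ("x", 5),
        "left": {"label": 0},
        "right": {"test": ("y", 3),
                  "left": {"label": 0},
                  "right": {"label": 1}}}
print(classify({"x": 7, "y": 4}, tree))  # 1
```

Classifying this record performs exactly two tests, regardless of how many training records were used to build the tree.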
4. Decision trees, especially smaller-sized trees, are relatively easy to interpret. The accuracies of the trees are also comparable to other classification techniques for many simple data sets.
5. Decision trees provide an expressive representation for learning discrete-valued functions. However, they do not generalize well to certain types of Boolean problems. One notable example is the parity function, whose value is 0 (1) when there is an odd (even) number of Boolean attributes with the value True. Accurate modeling of such a function requires a full decision tree with 2^d nodes, where d is the number of Boolean attributes (see Exercise 1 on page 198).
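The parity function is hard for trees because flipping any single attribute flips the class, so no subtree can be collapsed. A small sketch enumerating the function for d = 3:

```python
from itertools import product

def parity(bits):
    """Value is 0 when an odd number of attributes are True,
    1 when an even number are True (the convention used in the text)."""
    return 0 if sum(bits) % 2 == 1 else 1

# Every one of the 2^d attribute combinations carries its own label,
# and neighboring combinations (differing in one bit) always disagree,
# so a full tree is needed to model parity exactly.
d = 3
labels = {bits: parity(bits) for bits in product([0, 1], repeat=d)}
print(len(labels))        # 8 distinct leaf regions, i.e., 2^3
print(labels[(1, 1, 0)])  # even number of True values -> 1
```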
6. Decision tree algorithms are quite robust to the presence of noise, especially when methods for avoiding overfitting, as described in Section 4.4, are employed.
7. The presence of redundant attributes does not adversely affect the accuracy of decision trees. An attribute is redundant if it is strongly correlated with another attribute in the data. One of the two redundant attributes will not be used for splitting once the other attribute has been chosen. However, if the data set contains many irrelevant attributes, i.e., attributes that are not useful for the classification task, then some of the irrelevant attributes may be accidentally chosen during the tree-growing process, which results in a decision tree that is larger than necessary. Feature selection techniques can help to improve the accuracy of decision trees by eliminating the irrelevant attributes during preprocessing. We will investigate the issue of too many irrelevant attributes in Section 4.4.3.
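One very simple filter-style feature selector, standing in for the preprocessing step mentioned above, scores each attribute by how far apart its class-conditional means are. The scoring rule and data here are illustrative assumptions, not a method from the text:

```python
def class_mean_gap(records, labels, attr):
    """Crude relevance score: absolute difference between the
    attribute's mean value in class 1 and in class 0. An attribute
    whose distribution ignores the class scores near zero."""
    vals = {0: [], 1: []}
    for rec, y in zip(records, labels):
        vals[y].append(rec[attr])
    return abs(sum(vals[1]) / len(vals[1]) - sum(vals[0]) / len(vals[0]))

# "rel" tracks the class; "noise" is irrelevant by construction.
records = [{"rel": 0.9, "noise": 0.5}, {"rel": 0.8, "noise": 0.1},
           {"rel": 0.1, "noise": 0.4}, {"rel": 0.2, "noise": 0.2}]
labels = [1, 1, 0, 0]
print(class_mean_gap(records, labels, "rel"))    # ~0.7: keep
print(class_mean_gap(records, labels, "noise"))  # ~0.0: drop before induction
```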
8. Since most decision tree algorithms employ a top-down, recursive partitioning approach, the number of records becomes smaller as we traverse down the tree. At the leaf nodes, the number of records may be too small to make a statistically significant decision about the class representation of the nodes. This is known as the data fragmentation problem. One possible solution is to disallow further splitting when the number of records falls below a certain threshold.
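The threshold remedy for data fragmentation can be sketched as a stopping test inside the tree-growing recursion. The threshold value and function name below are hypothetical:

```python
def make_leaf_if_fragmented(labels_at_node, min_records=20):
    """Stopping rule for data fragmentation: when a node holds fewer
    than min_records records, stop splitting and return a leaf labeled
    with the majority class; otherwise return None to keep growing."""
    if len(labels_at_node) < min_records:
        return max(set(labels_at_node), key=labels_at_node.count)
    return None  # enough records: continue splitting this node

print(make_leaf_if_fragmented([1, 1, 0]))    # 1 (too few records: leaf)
print(make_leaf_if_fragmented([0, 1] * 50))  # None (keep splitting)
```

A tree-growing procedure such as the one in Section 4.3.5 would invoke this check before attempting to choose a split at each node.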
9. A subtree can be replicated multiple times in a decision tree, as illustrated in Figure 4.19. This makes the decision tree more complex than necessary and perhaps more difficult to interpret. Such a situation can arise from decision tree implementations that rely on a single attribute test condition at each internal node. Since most of the decision tree algorithms use a divide-and-conquer partitioning strategy, the same test condition can be applied to different parts of the attribute space, thus leading to the subtree replication problem.
Figure 4.19. Tree replication problem. The same subtree can appear at different branches.
10. The test conditions described so far in this chapter involve using only a
single attribute at a time. As a consequence, the tree-growing procedure
can be viewed as the process of partitioning the attribute space into
disjoint regions until each region contains records of the same class (see
Figure 4.20). The border between two neighboring regions of different
classes is known as a decision boundary. Since the test condition involves only a single attribute, the decision boundaries are rectilinear, i.e., parallel to the "coordinate axes." This limits the expressiveness of the decision tree representation for modeling complex relationships among continuous attributes. Figure 4.21 illustrates a data set that cannot be classified effectively by a decision tree algorithm that uses test conditions involving only a single attribute at a time.

Figure 4.20. Example of a decision tree and its decision boundaries for a two-dimensional data set.

Figure 4.21. Example of a data set that cannot be partitioned optimally using test conditions involving single attributes.
An oblique decision tree can be used to overcome this limitation because it allows test conditions that involve more than one attribute. The data set given in Figure 4.21 can be easily represented by an oblique decision tree containing a single node with test condition

x + y < 1.
Although such techniques are more expressive and can produce more compact trees, finding the optimal test condition for a given node can be computationally expensive.
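The single-node oblique tree for Figure 4.21 is just one linear test over two attributes. A minimal sketch (the mapping of the two regions to classes 0 and 1 is an assumed encoding):

```python
def oblique_classify(x, y):
    """Single-node oblique decision tree for the data in Figure 4.21:
    the one test condition, x + y < 1, involves both attributes,
    which no axis-parallel (rectilinear) split can express."""
    return 0 if x + y < 1 else 1

print(oblique_classify(0.2, 0.3))  # 0: below the line x + y = 1
print(oblique_classify(0.8, 0.7))  # 1: above the line
```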
Constructive induction provides another way to partition the data into homogeneous, nonrectangular regions (see Section 2.3.5 on page 57). This approach creates composite attributes representing an arithmetic or logical combination of the existing attributes. The new attributes provide a better discrimination of the classes and are augmented to the data set prior to decision tree induction. Unlike the oblique decision tree approach, constructive induction is less expensive because it identifies all the relevant combinations of attributes once, prior to constructing the decision tree. In contrast, an oblique decision tree must determine the right attribute combination dynamically, every time an internal node is expanded. However, constructive induction can introduce attribute redundancy in the data since the new attribute is a combination of several existing attributes.
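For the data of Figure 4.21, constructive induction amounts to computing the arithmetic combination x + y once, as a preprocessing step, so that an ordinary single-attribute test can then capture the oblique boundary. The attribute name `xy_sum` is a hypothetical label:

```python
def add_composite_attribute(records):
    """Constructive induction sketch: augment each record with a
    composite attribute (here the arithmetic combination x + y)
    before tree induction. A standard single-attribute test such as
    xy_sum < 1 then expresses the oblique boundary of Figure 4.21."""
    for rec in records:
        rec["xy_sum"] = rec["x"] + rec["y"]
    return records

data = [{"x": 0.2, "y": 0.3}, {"x": 0.8, "y": 0.7}]
for rec in add_composite_attribute(data):
    print(rec["xy_sum"] < 1)  # True, then False
```

Note the redundancy the text warns about: `xy_sum` is fully determined by `x` and `y`, so all three attributes now carry overlapping information.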
11. Studies have shown that the choice of impurity measure has little effect on the performance of decision tree induction algorithms. This is because many impurity measures are quite consistent with each other, as shown in Figure 4.13 on page 159. Indeed, the strategy used to prune the tree has a greater impact on the final tree than the choice of impurity measure.
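The consistency among impurity measures is easy to verify directly: Gini and entropy assign different numbers to a node, but they rank candidate class distributions the same way, from pure to evenly mixed.

```python
from math import log2

def gini(counts):
    """Gini impurity of a node with the given class counts."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    """Entropy of the same node (0 * log 0 taken as 0)."""
    n = sum(counts)
    return 0.0 - sum((c / n) * log2(c / n) for c in counts if c)

# Both measures agree on the ordering: the pure node scores lowest,
# the evenly mixed node highest.
for counts in [(10, 0), (7, 3), (5, 5)]:
    print(counts, round(gini(counts), 3), round(entropy(counts), 3))
```

This agreement in ranking, rather than in magnitude, is what Figure 4.13 depicts and what makes the choice of measure relatively unimportant.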
4.4 Model Overfitting
The errors committed by a classification model are generally divided into two types: training errors and generalization errors. Training error, also known as resubstitution error or apparent error, is the number of misclassification errors committed on training records, whereas generalization error is the expected error of the model on previously unseen records.
Recall from Section 4.2 that a good classification model must not only fit the training data well, it must also accurately classify records it has never seen before. In other words, a good model must have low training error as well as low generalization error. This is important because a model that fits the training data too well can have a poorer generalization error than a model with a higher training error. Such a situation is known as model overfitting.

Figure 4.22. Example of a data set with binary classes.
Overfitting Example in Two-Dimensional Data For a more concrete example of the overfitting problem, consider the two-dimensional data set shown in Figure 4.22. The data set contains data points that belong to two different classes, denoted as class o and class +, respectively. The data points for the o class are generated from a mixture of three Gaussian distributions, while a uniform distribution is used to generate the data points for the + class. There are altogether 1200 points belonging to the o class and 1800 points belonging to the + class. 30% of the points are chosen for training, while the remaining 70% are used for testing. A decision tree classifier that uses the