
For the second candidate, v = 65, we can determine its class distribution by updating the distribution of the previous candidate. More specifically, the new distribution is obtained by examining the class label of the record with the lowest annual income (i.e., $60K). Since the class label for this record is No, the count for class No is increased from 0 to 1 (for Annual Income ≤ $65K) and is decreased from 7 to 6 (for Annual Income > $65K). The distribution for class Yes remains unchanged. The new weighted-average Gini index for this candidate split position is 0.400.

This procedure is repeated until the Gini index values for all candidates are computed, as shown in Figure 4.16. The best split position corresponds to the one that produces the smallest Gini index, i.e., v = 97. This procedure is less expensive because it requires a constant amount of time to update the class distribution at each candidate split position. It can be further optimized by considering only candidate split positions located between two adjacent records with different class labels. For example, because the first three sorted records (with annual incomes $60K, $70K, and $75K) have identical class labels, the best split position should not reside between $60K and $75K. Therefore, the candidate split positions at v = $55K, $65K, $72K, $87K, $92K, $110K, $122K, $172K, and $230K are ignored because they are located between two adjacent records with the same class labels. This approach allows us to reduce the number of candidate split positions from 11 to 2.
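The incremental update described above can be sketched in Python as follows. This is a minimal illustration, not the book's implementation: the function names are hypothetical, the ten records are assumed values chosen to be consistent with the figures quoted in the text (seven No records, three Yes records, lowest income $60K), and candidate thresholds are taken as midpoints between adjacent sorted values, so the best split appears as 97.5 rather than 97. The two end candidates ($55K and $230K) are omitted because they leave one side empty.

```python
from collections import Counter


def gini(counts, total):
    """Gini index of a node, given its per-class counts and record total."""
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


def sweep_splits(values, labels, only_label_changes=False):
    """Return [(threshold, weighted_gini), ...] in increasing threshold order.

    Simplification: the label-change filter assumes distinct attribute values.
    """
    records = sorted(zip(values, labels))
    n = len(records)
    left = Counter()          # class counts for records <= threshold
    right = Counter(labels)   # class counts for records > threshold
    results = []
    for i in range(n - 1):
        value, label = records[i]
        # Constant-time update: move one record from the right side to the left.
        left[label] += 1
        right[label] -= 1
        next_value, next_label = records[i + 1]
        if next_value == value:
            continue          # no threshold fits between equal values
        if only_label_changes and next_label == label:
            continue          # boundary lies inside a run of one class
        n_left = i + 1
        weighted = (n_left * gini(left, n_left)
                    + (n - n_left) * gini(right, n - n_left)) / n
        results.append(((value + next_value) / 2, weighted))
    return results


# Assumed example data (annual income in $K, class label).
incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
labels = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

for threshold, score in sweep_splits(incomes, labels):
    print(f"v = {threshold:6.1f}  weighted Gini = {score:.3f}")

best = min(sweep_splits(incomes, labels, only_label_changes=True),
           key=lambda pair: pair[1])
print("best split:", best)
```

With these assumed records, the sweep gives a weighted Gini of 0.400 at v = 65 and a minimum of 0.300 at the split between $95K and $100K, and the label-change filter leaves only two candidates to evaluate.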


Gain Ratio

Impurity measures such as entropy and Gini index tend to favor attributes that have a large number of distinct values. Figure 4.12 shows three alternative test conditions for partitioning the data set given in Exercise 2 on page 198. Comparing the first test condition, Gender, with the second, Car Type, it is easy to see that Car Type seems to provide a better way of splitting the data since it produces purer descendent nodes. However, if we compare both conditions with Customer ID, the latter appears to produce purer partitions. Yet Customer ID is not a predictive attribute because its value is unique for each record. Even in a less extreme situation, a test condition that results in a large number of outcomes may not be desirable because the number of records associated with each partition is too small to enable us to make any reliable predictions.




164 Chapter 4 Classification

There are two strategies for overcoming this problem. The first strategy is to restrict the test conditions to binary splits only. This strategy is employed by decision tree algorithms such as CART. Another strategy is to modify the splitting criterion to take into account the number of outcomes produced by the attribute test condition. For example, in the C4.5 decision tree algorithm, a splitting criterion known as gain ratio is used to determine the goodness of a split. This criterion is defined as follows:

\[ \text{Gain ratio} = \frac{\Delta_{\text{info}}}{\text{Split Info}} \qquad (4.7) \]

Here, Split Info $= -\sum_{i=1}^{k} P(v_i) \log_2 P(v_i)$ and $k$ is the total number of splits. For example, if each attribute value has the same number of records, then $\forall i: P(v_i) = 1/k$ and the split information would be equal to $\log_2 k$. This example suggests that if an attribute produces a large number of splits, its split information will also be large, which in turn reduces its gain ratio.
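To make the effect of the split information concrete, here is a small Python sketch of the gain ratio. It assumes that the numerator of the gain ratio is the entropy-based information gain, and it uses a made-up eight-record data set. The function names and the data are illustrative only.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def split_info(partition_sizes):
    """Split information for a split into partitions of the given sizes."""
    n = sum(partition_sizes)
    return -sum((s / n) * math.log2(s / n) for s in partition_sizes if s > 0)


def gain_ratio(attribute_values, labels):
    """Information gain of a multiway split divided by its split information."""
    n = len(labels)
    partitions = {}
    for value, label in zip(attribute_values, labels):
        partitions.setdefault(value, []).append(label)
    children = sum(len(p) / n * entropy(p) for p in partitions.values())
    gain = entropy(labels) - children
    info = split_info([len(p) for p in partitions.values()])
    return gain / info if info > 0 else 0.0


# Hypothetical data: both attributes separate the classes perfectly, but
# record_id does so by creating one partition per record.
classes = ["C0"] * 4 + ["C1"] * 4
record_id = list(range(8))
binary_attr = ["a"] * 4 + ["b"] * 4

print("uniform 4-way split info:", split_info([2, 2, 2, 2]))
print("gain ratio, record_id:   ", gain_ratio(record_id, classes))
print("gain ratio, binary_attr: ", gain_ratio(binary_attr, classes))
```

Both attributes have the same information gain here, yet the eight-way split has a split information of log2 8 = 3 against 1 for the binary split. Its gain ratio is therefore only one third as large.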

4.3.5 Algorithm for Decision Tree Induction

A skeleton decision tree induction algorithm called TreeGrowth is shown in Algorithm 4.1. The input to this algorithm consists of the training records E and the attribute set F. The algorithm works by recursively selecting the best attribute to split the data (Step 7) and expanding the leaf nodes of the tree (Steps 11 and 12) until the stopping criterion is met (Step 1). The details of this algorithm are explained below:

Algorithm 4.1 A skeleton decision tree induction algorithm.
TreeGrowth(E, F)
 1: if stopping_cond(E, F) = true then
 2:    leaf = createNode().
 3:    leaf.label = Classify(E).
 4:    return leaf.
 5: else
 6:    root = createNode().
 7:    root.test_cond = find_best_split(E, F).
 8:    let V = {v | v is a possible outcome of root.test_cond}.
 9:    for each v ∈ V do
10:       E_v = {e | root.test_cond(e) = v and e ∈ E}.
11:       child = TreeGrowth(E_v, F).
12:       add child as descendent of root and label the edge (root → child) as v.
13:    end for
14: end if
15: return root.

1. The createNode() function extends the decision tree by creating a new node. A node in the decision tree has either a test condition, denoted as node.test_cond, or a class label, denoted as node.label.

2. The find_best_split() function determines which attribute should be selected as the test condition for splitting the training records. As previously noted, the choice of test condition depends on which impurity measure is used to determine the goodness of a split. Some widely used measures include entropy, the Gini index, and the χ² statistic.

3. The Classify() function determines the class label to be assigned to a leaf node. For each leaf node t, let p(i|t) denote the fraction of training records from class i associated with the node t. In most cases, the leaf node is assigned to the class that has the majority number of training records:

\[ \text{leaf.label} = \operatorname*{argmax}_{i} \; p(i \mid t), \qquad (4.8) \]

where the argmax operator returns the argument i that maximizes the expression p(i|t). Besides providing the information needed to determine the class label of a leaf node, the fraction p(i|t) can also be used to estimate the probability that a record assigned to the leaf node t belongs to class i. Sections 5.7.2 and 5.7.3 describe how such probability estimates can be used to determine the performance of a decision tree under different cost functions.

4. The stopping_cond() function is used to terminate the tree-growing process by testing whether all the records have either the same class label or the same attribute values. Another way to terminate the recursive function is to test whether the number of records has fallen below some minimum threshold.
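The four functions above can be put together in a short Python sketch of the TreeGrowth skeleton. It rests on several assumptions that the text leaves open: attributes are categorical, every split is a multiway split on one attribute, and the Gini index is the impurity measure. The `Node` class, the `predict` helper, and the toy records are hypothetical.

```python
from collections import Counter


class Node:
    """A node holds either a test attribute (internal) or a class label (leaf)."""

    def __init__(self):
        self.test_attr = None   # attribute tested at an internal node
        self.label = None       # class label at a leaf node
        self.children = {}      # outcome value -> child Node


def gini(records):
    n = len(records)
    counts = Counter(label for _, label in records)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def varying_attrs(records, attrs):
    """Attributes that take more than one value among the records."""
    return [a for a in attrs if len({x[a] for x, _ in records}) > 1]


def stopping_cond(records, attrs):
    """Stop when all records share a class label or all attribute values."""
    same_class = len({label for _, label in records}) == 1
    return same_class or not varying_attrs(records, attrs)


def classify(records):
    """Majority class of the records, i.e. the argmax over i of p(i|t)."""
    return Counter(label for _, label in records).most_common(1)[0][0]


def find_best_split(records, attrs):
    """Pick the varying attribute with the lowest weighted Gini index."""
    def weighted_gini(attr):
        parts = {}
        for x, label in records:
            parts.setdefault(x[attr], []).append((x, label))
        return sum(len(p) / len(records) * gini(p) for p in parts.values())
    # Only varying attributes are candidates, so every child is strictly
    # smaller than its parent and the recursion terminates.
    return min(varying_attrs(records, attrs), key=weighted_gini)


def tree_growth(records, attrs):
    if stopping_cond(records, attrs):
        leaf = Node()
        leaf.label = classify(records)
        return leaf
    root = Node()
    root.test_attr = find_best_split(records, attrs)
    for value in {x[root.test_attr] for x, _ in records}:
        subset = [(x, y) for x, y in records if x[root.test_attr] == value]
        root.children[value] = tree_growth(subset, attrs)
    return root


def predict(node, x):
    """Follow the matching edges to a leaf (attribute values seen in training only)."""
    while node.label is None:
        node = node.children[x[node.test_attr]]
    return node.label


# Hypothetical training records: (attribute dict, class label).
data = [
    ({"gender": "M", "car": "sports"}, "C0"),
    ({"gender": "F", "car": "sports"}, "C0"),
    ({"gender": "M", "car": "family"}, "C1"),
    ({"gender": "F", "car": "family"}, "C1"),
    ({"gender": "F", "car": "luxury"}, "C1"),
    ({"gender": "M", "car": "luxury"}, "C0"),
]
attrs = ["gender", "car"]
tree = tree_growth(data, attrs)
print(tree.test_attr, predict(tree, {"gender": "M", "car": "luxury"}))
```

The sketch omits the minimum-record threshold mentioned in item 4 and does no pruning. Both could be added inside `stopping_cond` or as a separate pass over the finished tree.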

After building the decision tree, a tree-pruning step can be performed to reduce the size of the decision tree. Decision trees that are too large are susceptible to a phenomenon known as overfitting. Pruning helps by trimming the branches of the initial tree in a way that improves the generalization capability of the decision tree. The issues of overfitting and tree pruning are discussed in more detail in Section 4.4.





Figure 4.17. Input data for Web robot detection: (a) example of a Web server log, (b) graph of a Web session, (c) derived attributes for Web robot detection.

4.3.6 An Example: Web Robot Detection

Web usage mining is the task of applying data mining techniques to extract useful patterns from Web access logs. These patterns can reveal interesting characteristics of site visitors; e.g., people who repeatedly visit a Web site and view the same product description page are more likely to buy the product if certain incentives such as rebates or free shipping are offered.

In Web usage mining, it is important to distinguish accesses made by human users from those due to Web robots. A Web robot (also known as a Web crawler) is a software program that automatically locates and retrieves information from the Internet by following the hyperlinks embedded in Web pages. These programs are deployed by search engine portals to gather the documents necessary for indexing the Web. Web robot accesses must be discarded before applying Web mining techniques to analyze human browsing behavior.

Figure 4.17(a). Example of a Web server log. The sample lists requests made on 08/Aug/2004 between 10:15:21 and 10:16:15, giving for each the client IP address, timestamp, requested URL, HTTP status and document size, and user agent, e.g., Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0).

Figure 4.17(c). Derived attributes for Web robot detection (partial): total number of pages retrieved in a Web session; total number of image pages retrieved in a Web session; total amount of time spent by Web site visitor; ... more than once in a Web session; ... of requests made using HEAD method.




This section describes how a decision tree classifier can be used to distinguish between accesses by human users and those by Web robots. The input data was obtained from a Web server log, a sample of which is shown in Figure 4.17(a). Each line corresponds to a single page request made by a Web client (a user or a Web robot). The fields recorded in the Web log include the IP address of the client, timestamp of the request, Web address of the requested document, size of the document, and the client's identity (via the user agent field). A Web session is a sequence of requests made by a client during a single visit to a Web site. Each Web session can be modeled as a directed graph, in which the nodes correspond to Web pages and the edges correspond to hyperlinks connecting one Web page to another. Figure 4.17(b) shows a graphical representation of the first Web session given in the Web server log.
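As a rough illustration of how log entries might be turned into sessions and session graphs, consider the Python sketch below. Several choices are assumptions rather than details from the text: a client is identified by IP address alone, a gap of more than 30 minutes between requests starts a new session, each request carries a referrer field that supplies the source page of an edge, and the sample requests are invented.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Assumed inactivity timeout; the text does not specify how visits are delimited.
SESSION_GAP = timedelta(minutes=30)


def build_sessions(requests):
    """Group (ip, timestamp, url, referrer) tuples into per-client sessions."""
    by_client = defaultdict(list)
    for req in sorted(requests, key=lambda r: r[1]):
        by_client[req[0]].append(req)
    sessions = []
    for reqs in by_client.values():
        current = [reqs[0]]
        for prev, req in zip(reqs, reqs[1:]):
            if req[1] - prev[1] > SESSION_GAP:
                sessions.append(current)   # long pause: close the session
                current = []
            current.append(req)
        sessions.append(current)
    return sessions


def session_graph(session):
    """Directed graph of a session: nodes are pages, edges are referrer -> page."""
    nodes, edges = set(), set()
    for _, _, url, referrer in session:
        nodes.add(url)
        if referrer:
            nodes.add(referrer)
            edges.add((referrer, url))
    return nodes, edges


# Hypothetical requests (not taken from the log in Figure 4.17).
t0 = datetime(2004, 8, 8, 10, 15, 21)
requests = [
    ("10.0.0.1", t0, "/home", None),
    ("10.0.0.1", t0 + timedelta(seconds=13), "/home/projects", "/home"),
    ("10.0.0.1", t0 + timedelta(seconds=20), "/home/projects/papers", "/home/projects"),
    ("10.0.0.2", t0 + timedelta(seconds=54), "/about", None),
    ("10.0.0.1", t0 + timedelta(hours=2), "/home", None),
]

sessions = build_sessions(requests)
nodes, edges = session_graph(sessions[0])
print(len(sessions), sorted(edges))
```

Per-session attributes, such as the number of pages retrieved or the total time spent, could then be computed from each session and used as inputs to a decision tree classifier.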
