+1 (208) 254-6996 [email protected]

3.6 Exercises L43

Comment on the use of a box plot to explore a data set with four attributes: age, weight, height, and income.

Don't use plagiarized sources. Get Your Custom Essay on
3.6 Exercises L43 Comment on the use of a box plot to explore a data set with four attributes: age, weight, height, and income.
Just from $13/Page
Order Essay

Give a possible explanation as to why most of the values of petal length and width fall in the buckets along the diagonal in Figure 3.9.

Use Figures 3.14 and 3.15 to identify a characteristic shared by the petal width and petal length attributes.

Simple line plots, such as that displayed in Figure 2.L2 on page 56, which shows two time series, can be used to effectively display high-dimensional data. For example, in Figure 2.72 iI is easy to tell that the frequencies of the two time series are different. What characteristic of time series allows the effective visualization of high-dimensional data?

Describe the types of situations that produce sparse or dense data cubes. Illus- trate with examples other than those used in the book.

How might you extend the notion of multidimensional data analysis so that the target variable is a qualitative variable? In other words, what sorts of summary statistics or data visualizations would be of interest?

Construct a data cube from Table 3.14. Is this a dense or sparse data cube? If it is sparse, identify the cells that empty.

Product ID Location ID Number Sold I 1 2 2

I 3 1 2

10 6 5 22

17. Discuss the differences between dimensionality reduction based on aggregation and dimensionality reduction based on techniques such as PCA and SVD.

Table 3.14. Fact table for Exercise 16.




Classification: Basic Concepts, Decision Trees, and Model Evaluation

Classification, which is the task of assigning objects to one of several predefined categories, is a pervasive problem that encompasses many diverse applications. Examples include detecting spam email messages based upon the message header and content, categorizing cells as malignant or benign based upon the results of MRI scans, and classifying galaxies based upon their shapes (see Figure 4.1).

(a) A spiral galaxy (b) An elliptical galaxy.

Figure 4.1. Classification of galaxies. The images are from the NASA website.



L46 Chapter 4 Classification

Input Output

onr.’l1,,” set l—) —) Class labelClassilication model

Figure 4.2. Classification as the task of mapping an input attribute set x into its class label g.

This chapter introduces the basic concepts of classification, describes some of the key issues such as model overfitting, and presents methods for evaluating and comparing the performance of a classification technique. While it focuses mainly on a technique known as decision tree induction, most of the discussion in this chapter is also applicable to other classification techniques, many of which are covered in Chapter 5.

4.L Preliminaries

The input data for a classification task is a collection of records. Each record, also known as an instance or example, is characterized by a tuple (*,A), where x is the attribute set and gr is a special attribute, designated as the class label (also known as category or target attribute). Table 4.1 shows a sample data set used for classifying vertebrates into one of the following categories: mammal, bird, fish, reptile, or amphibian. The attribute set includes properties of a vertebrate such as its body temperature, skin cover, method of reproduction, ability to fly, and ability to live in water. Although the attributes presented in Table 4.L are mostly discrete, the attribute set can also contain continuous features. The class label, on the other hand, must be a discrete attribute. This is a key characteristic that distinguishes classification from regression, a predictive modeling task in which g is a continuous attribute. Regression techniques are covered in Appendix D.

Definition 4.1 (Classification). Classification is the task of learning a tar- get function / that maps each attribute set x to one of the predefined class Iabels g.

The target function is also known informally as a classification model. A classification model is useful for the following purposes.

Descriptive Modeling A classification model can serve as an explanatory tool to distinguish between objects of different classes. For example, it would be useful-for both biologists and others-to have a descriptive model that




Table 4.1. The vertebrate data set.

Preliminaries 147

summarizes the data shown in Table 4.1 and explains what features define a vertebrate as a mammal, reptile, bird, fish, or amphibian.

Predictive Modeling A classification model can also be used to predict the class iabel of unknown records. As shown in Figure 4.2, a classification model can be treated as a black box that automatically assigns a class label when presented with the attribute set of an unknown record. Suppose we are given the following characteristics of a creature known as a gila monster:

We can use a classification model built from the data set shown in Table 4.1 to determine the class to which the creature belongs.

Classification techniques are most suited for predicting or describing data sets with binary or nominal categories. They are less effective for ordinal categories (e.g., to classify a person as a member of high-, medium-, or low- income group) because they do not consider the implicit order among the categories. Other forms of relationships, such as the subclass-superclass re- lationships among categories (e.g., humans and apes are primates, which in

Name lJody


Skin Cover

Gives Birth

Aquatic Creature

Aerral Creature

Has Legs nates

beH Ulass Label

human python salmon whale frog komodo dragon bat plgeon cat leopard shark turtle penguln porcuptne eel salamander

warm-blooded cold-blooded cold-blooded

warm-blooded cold-blooded cold-blooded

warm-blooded warm-blooded warm-blooded cold-blooded

cold-blooded warm-blooded warm-blooded cold-blooded cold-blooded

hair feathers

fur scales

scales feathers

quills scales none

hair scales scales hair none scales

yes no no yes no no

yes no yes yes

no no yes no no

no no yes yes

seml no

no no no yes

semt seml no yes


no no no no no no

yes yes no no

no no no no no

yes no no no yes yes

yes yes yes no

yes yes yes no yes

no yes no no yes no

yes no no no

no no yes no yes

mammal reptile

fish mammal

amphibian reptile

mammal bird

mammal fish

reptile bird

mammal fish


Name lJody Iemperature


Cover Gives Birth

Aquatic Creature

Aerial Creature

Has Legs

Hiber- nates

Ulass Label

gila monster cold-blooded scales no no no yes yes



1,48 Chapter 4 Classification

turn, is a subclass of mammals) are also ignored. The remainder of this chapter focuses only on binary or nominal class labels.

4.2 General Approach to Solving a Classification Problem

A classification technique (or classifier) is a systematic approach to building classification models from an input data set. Examples include decision tree

classifiers, rule-based classifiers, neural networks, support vector machines,

and naive Bayes classifiers. Each technique employs a learning algorithm to identify a model that best fits the relationship between the attribute set and class label of the input data. The model generated by a learning algorithm should both fit the input data well and correctly predict the class labels of records it has never seen before. Therefore, a key objective of the learning algorithm is to build models with good generalization capability; i.e., models that accurately predict the class labels of previously unknown records.

Figure 4.3 shows a general approach for solving classification problems. First, a training set consisting of records whose class labels are known must

t ;t” I I ruooel


Figure 4.3. General approach for building a classification model.



4.2 General Approach to Solving a Classification Problem L4g

Table 4.2. Confusion matrix for a 2-class problem.

Predicted Class C l a s s : 7 U l a s s : U

Actual Class

Class : I T . t 77 J I O

Ulass :0 ./ 01 .Ioo

be provided. The training set is used to build a classification model, which is subsequently applied to the test set, which consists of records with unknown class labels.

Evaluation of the performance of a classification model is based on the counts of test records correctly and incorrectly predicted by the model. These counts are tabulated in a table known as a confusion matrix. TabIe 4.2 depicts the confusion matrix for a binary classification problem. Each entry f4 in this table denotes the number of records from class e predicted to be of class 7. For instance, /s1 is the number of records from class 0 incorrectly predicted as class 1. Based on the entries in the confusion matrix, the total number of correct predictions made by the model ir (“ftr + /oo) and the total number of incorrect predictions ir (/ro + /or).

Although a confusion matrix provides the information needed to determine how well a classification model performs, summarizing this information with a single number would make it more convenient to compare the performance of different models. This can be done using a performance metric such as accuracy, which is defined as follows:

Accuracy: Number of correct oredictions fn* foo (4.1) Total number of predictions fr l- fn * ,for * ,foo’

Equivalently, the performance of a model can be expressed error rate, which is given by the following equation:

terms of its

n__^_- _^r^ Number of wrong predictions H : r r n r r o f 6

Total number of predictions

. t r r Jrc -T JO7

f : – r * f r c * , f o r * . f oo (4.2)

Most classification algorithms seek models that attain the highest accuracy, or equivalently, the lowest error rate when applied to the test set. We will revisit the topic of model evaluation in Section 4.5.



1-50 Chapter 4 Classification

4.3 Decision Tree Induction

This section introduces a decision tree classifier, which is a simple yet widely

used classification technique.

4.3.L How a Decision TYee Works

To illustrate how classification with a decision tree works, consider a simpler

version of the vertebrate classification problem described in the previous sec-

tion. Instead of classifying the vertebrates into five distinct groups of species,

we assign them to two categories: mammals and non-mammals.

Suppose a new species is discovered by scientists. How can we tell whether

it is a mammal or a non-mammal? One approach is to pose a series of questions

about the characteristics of the species. The first question we may ask is

whether the species is cold- or warm-blooded. If it is cold-blooded, then it is

definitely not a mammal. Otherwise, it is either a bird or a mammal. In the

latter case, we need to ask a follow-up question: Do the females of the species give birth to their young? Those that do give birth are definitely mammals,

while those that do not are likely to be non-mammals (with the exception of

egg-laying mammals such as the platypus and spiny anteater). The previous example illustrates how we can solve a classification problem

by asking a series of carefully crafted questions about the attributes of the

test record. Each time we receive an answer? a follow-up question is asked

until we reach a conclusion about the class label of the record. The series of questions and their possible answers can be organized in the form of a decision

tree, which is a hierarchical structure consisting of nodes and directed edges.

Figure 4.4 shows the decision tree for the mammal classification problem. The

tree has three types of nodes:

o A root node that has no incoming edges and zero or more outgoing


o Internal nodes, each of which has exactly one incoming edge and two

or more outgoing edges.

o Leaf or terminal nodes, each of which has exactly one incoming edge

and no outgoing edges.

In a decision tree, each leaf node is assigned a class label. The non-

terminal nodes, which include the root and other internal nodes, contain

attribute test conditions to separate records that have different characteris- tics. For example, the root node shown in Figure 4.4 uses the attribute Body



4.3 Decision Thee Induction 151

Figure 4.4. A decision tree for the mammal classification problem.

Temperature to separate warm-blooded from cold-blooded vertebrates. Since all cold-blooded vertebrates are non-mammals, a leaf node labeled Non-mamma]s is created as the right child of the root node. If the vertebrate is warm-blooded, a subsequent attribute, Gives Birth, is used to distinguish mammals from other warm-blooded creatures, which are mostly birds.

Classifying a test record is straightforward once a decision tree has been constructed. Starting from the root node, we apply the test condition to the record and follow the appropriate branch based on the outcome of the test. This will lead us either to another internal node, for which a new test condition is applied, or to a leaf node. The class label associated with the leaf node is then assigned to the record. As an illustration, Figure 4.5 traces the path in the decision tree that is used to predict the class label of a flamingo. The path terminates at a leaf node labeled Non-rna-nmal-s.

Order your essay today and save 10% with the discount code ESSAYHELP