3.6 Exercises L43
Comment on the use of a box plot to explore a data set with four attributes: age, weight, height, and income.
Give a possible explanation as to why most of the values of petal length and width fall in the buckets along the diagonal in Figure 3.9.
Use Figures 3.14 and 3.15 to identify a characteristic shared by the petal width and petal length attributes.
Simple line plots, such as that displayed in Figure 2.L2 on page 56, which shows two time series, can be used to effectively display high-dimensional data. For example, in Figure 2.72 iI is easy to tell that the frequencies of the two time series are different. What characteristic of time series allows the effective visualization of high-dimensional data?
Describe the types of situations that produce sparse or dense data cubes. Illus- trate with examples other than those used in the book.
How might you extend the notion of multidimensional data analysis so that the target variable is a qualitative variable? In other words, what sorts of summary statistics or data visualizations would be of interest?
Construct a data cube from Table 3.14. Is this a dense or sparse data cube? If it is sparse, identify the cells that empty.
Product ID Location ID Number Sold I 1 2 2
I 3 1 2
10 6 5 22
17. Discuss the differences between dimensionality reduction based on aggregation and dimensionality reduction based on techniques such as PCA and SVD.
Table 3.14. Fact table for Exercise 16.
Classification: Basic Concepts, Decision Trees, and Model Evaluation
Classification, which is the task of assigning objects to one of several predefined categories, is a pervasive problem that encompasses many diverse applications. Examples include detecting spam email messages based upon the message header and content, categorizing cells as malignant or benign based upon the results of MRI scans, and classifying galaxies based upon their shapes (see Figure 4.1).
(a) A spiral galaxy (b) An elliptical galaxy.
Figure 4.1. Classification of galaxies. The images are from the NASA website.
L46 Chapter 4 Classification
Input Output
onr.’l1,,” set l—) —) Class labelClassilication model
Figure 4.2. Classification as the task of mapping an input attribute set x into its class label g.
This chapter introduces the basic concepts of classification, describes some of the key issues such as model overfitting, and presents methods for evaluating and comparing the performance of a classification technique. While it focuses mainly on a technique known as decision tree induction, most of the discussion in this chapter is also applicable to other classification techniques, many of which are covered in Chapter 5.
4.L Preliminaries
The input data for a classification task is a collection of records. Each record, also known as an instance or example, is characterized by a tuple (*,A), where x is the attribute set and gr is a special attribute, designated as the class label (also known as category or target attribute). Table 4.1 shows a sample data set used for classifying vertebrates into one of the following categories: mammal, bird, fish, reptile, or amphibian. The attribute set includes properties of a vertebrate such as its body temperature, skin cover, method of reproduction, ability to fly, and ability to live in water. Although the attributes presented in Table 4.L are mostly discrete, the attribute set can also contain continuous features. The class label, on the other hand, must be a discrete attribute. This is a key characteristic that distinguishes classification from regression, a predictive modeling task in which g is a continuous attribute. Regression techniques are covered in Appendix D.
Definition 4.1 (Classification). Classification is the task of learning a tar- get function / that maps each attribute set x to one of the predefined class Iabels g.
The target function is also known informally as a classification model. A classification model is useful for the following purposes.
Descriptive Modeling A classification model can serve as an explanatory tool to distinguish between objects of different classes. For example, it would be useful-for both biologists and others-to have a descriptive model that
4.r
Table 4.1. The vertebrate data set.
Preliminaries 147
summarizes the data shown in Table 4.1 and explains what features define a vertebrate as a mammal, reptile, bird, fish, or amphibian.
Predictive Modeling A classification model can also be used to predict the class iabel of unknown records. As shown in Figure 4.2, a classification model can be treated as a black box that automatically assigns a class label when presented with the attribute set of an unknown record. Suppose we are given the following characteristics of a creature known as a gila monster:
We can use a classification model built from the data set shown in Table 4.1 to determine the class to which the creature belongs.
Classification techniques are most suited for predicting or describing data sets with binary or nominal categories. They are less effective for ordinal categories (e.g., to classify a person as a member of high-, medium-, or low- income group) because they do not consider the implicit order among the categories. Other forms of relationships, such as the subclass-superclass re- lationships among categories (e.g., humans and apes are primates, which in
Name lJody
Temperature
Skin Cover
Gives Birth
Aquatic Creature
Aerral Creature
Has Legs nates
beH Ulass Label
human python salmon whale frog komodo dragon bat plgeon cat leopard shark turtle penguln porcuptne eel salamander
warm-blooded cold-blooded cold-blooded
warm-blooded cold-blooded cold-blooded
warm-blooded warm-blooded warm-blooded cold-blooded
cold-blooded warm-blooded warm-blooded cold-blooded cold-blooded
hair feathers
fur scales
scales feathers
quills scales none
hair scales scales hair none scales
yes no no yes no no
yes no yes yes
no no yes no no
no no yes yes
seml no
no no no yes
semt seml no yes
semr
no no no no no no
yes yes no no
no no no no no
yes no no no yes yes
yes yes yes no
yes yes yes no yes
no yes no no yes no
yes no no no
no no yes no yes
mammal reptile
fish mammal
amphibian reptile
mammal bird
mammal fish
reptile bird
mammal fish
amphibian
Name lJody Iemperature
Dlfln
Cover Gives Birth
Aquatic Creature
Aerial Creature
Has Legs
Hiber- nates
Ulass Label
gila monster cold-blooded scales no no no yes yes
1,48 Chapter 4 Classification
turn, is a subclass of mammals) are also ignored. The remainder of this chapter focuses only on binary or nominal class labels.
4.2 General Approach to Solving a Classification Problem
A classification technique (or classifier) is a systematic approach to building classification models from an input data set. Examples include decision tree
classifiers, rule-based classifiers, neural networks, support vector machines,
and naive Bayes classifiers. Each technique employs a learning algorithm to identify a model that best fits the relationship between the attribute set and class label of the input data. The model generated by a learning algorithm should both fit the input data well and correctly predict the class labels of records it has never seen before. Therefore, a key objective of the learning algorithm is to build models with good generalization capability; i.e., models that accurately predict the class labels of previously unknown records.
Figure 4.3 shows a general approach for solving classification problems. First, a training set consisting of records whose class labels are known must
t ;t” I I ruooel
IK”,’:
Figure 4.3. General approach for building a classification model.
4.2 General Approach to Solving a Classification Problem L4g
Table 4.2. Confusion matrix for a 2-class problem.
Predicted Class C l a s s : 7 U l a s s : U
Actual Class
Class : I T . t 77 J I O
Ulass :0 ./ 01 .Ioo
be provided. The training set is used to build a classification model, which is subsequently applied to the test set, which consists of records with unknown class labels.
Evaluation of the performance of a classification model is based on the counts of test records correctly and incorrectly predicted by the model. These counts are tabulated in a table known as a confusion matrix. TabIe 4.2 depicts the confusion matrix for a binary classification problem. Each entry f4 in this table denotes the number of records from class e predicted to be of class 7. For instance, /s1 is the number of records from class 0 incorrectly predicted as class 1. Based on the entries in the confusion matrix, the total number of correct predictions made by the model ir (“ftr + /oo) and the total number of incorrect predictions ir (/ro + /or).
Although a confusion matrix provides the information needed to determine how well a classification model performs, summarizing this information with a single number would make it more convenient to compare the performance of different models. This can be done using a performance metric such as accuracy, which is defined as follows:
Accuracy: Number of correct oredictions fn* foo (4.1) Total number of predictions fr l- fn * ,for * ,foo’
Equivalently, the performance of a model can be expressed error rate, which is given by the following equation:
terms of its
n__^_- _^r^ Number of wrong predictions H : r r n r r o f 6
Total number of predictions
. t r r Jrc -T JO7
f : – r * f r c * , f o r * . f oo (4.2)
Most classification algorithms seek models that attain the highest accuracy, or equivalently, the lowest error rate when applied to the test set. We will revisit the topic of model evaluation in Section 4.5.
1-50 Chapter 4 Classification
4.3 Decision Tree Induction
This section introduces a decision tree classifier, which is a simple yet widely
used classification technique.
4.3.L How a Decision TYee Works
To illustrate how classification with a decision tree works, consider a simpler
version of the vertebrate classification problem described in the previous sec-
tion. Instead of classifying the vertebrates into five distinct groups of species,
we assign them to two categories: mammals and non-mammals.
Suppose a new species is discovered by scientists. How can we tell whether
it is a mammal or a non-mammal? One approach is to pose a series of questions
about the characteristics of the species. The first question we may ask is
whether the species is cold- or warm-blooded. If it is cold-blooded, then it is
definitely not a mammal. Otherwise, it is either a bird or a mammal. In the
latter case, we need to ask a follow-up question: Do the females of the species give birth to their young? Those that do give birth are definitely mammals,
while those that do not are likely to be non-mammals (with the exception of
egg-laying mammals such as the platypus and spiny anteater). The previous example illustrates how we can solve a classification problem
by asking a series of carefully crafted questions about the attributes of the
test record. Each time we receive an answer? a follow-up question is asked
until we reach a conclusion about the class label of the record. The series of questions and their possible answers can be organized in the form of a decision
tree, which is a hierarchical structure consisting of nodes and directed edges.
Figure 4.4 shows the decision tree for the mammal classification problem. The
tree has three types of nodes:
o A root node that has no incoming edges and zero or more outgoing
edges.
o Internal nodes, each of which has exactly one incoming edge and two
or more outgoing edges.
o Leaf or terminal nodes, each of which has exactly one incoming edge
and no outgoing edges.
In a decision tree, each leaf node is assigned a class label. The non-
terminal nodes, which include the root and other internal nodes, contain
attribute test conditions to separate records that have different characteris- tics. For example, the root node shown in Figure 4.4 uses the attribute Body
4.3 Decision Thee Induction 151
Figure 4.4. A decision tree for the mammal classification problem.
Temperature to separate warm-blooded from cold-blooded vertebrates. Since all cold-blooded vertebrates are non-mammals, a leaf node labeled Non-mamma]s is created as the right child of the root node. If the vertebrate is warm-blooded, a subsequent attribute, Gives Birth, is used to distinguish mammals from other warm-blooded creatures, which are mostly birds.
Classifying a test record is straightforward once a decision tree has been constructed. Starting from the root node, we apply the test condition to the record and follow the appropriate branch based on the outcome of the test. This will lead us either to another internal node, for which a new test condition is applied, or to a leaf node. The class label associated with the leaf node is then assigned to the record. As an illustration, Figure 4.5 traces the path in the decision tree that is used to predict the class label of a flamingo. The path terminates at a leaf node labeled Non-rna-nmal-s.