To summarize, the median is the middle value if there are an odd number of values, and the average of the two middle values if the number of values is even. Thus, for seven values, the median is $x_{(4)}$, while for ten values, the median is $\frac{1}{2}(x_{(5)} + x_{(6)})$.

102 Chapter 3 Exploring Data

Although the mean is sometimes interpreted as the middle of a set of values, this is only correct if the values are distributed in a symmetric manner. If the distribution of values is skewed, then the median is a better indicator of the middle. Also, the mean is sensitive to the presence of outliers. For data with outliers, the median again provides a more robust estimate of the middle of a set of values.

To overcome problems with the traditional definition of a mean, the notion of a trimmed mean is sometimes used. A percentage p between 0 and 100 is specified, the top and bottom (p/2)% of the data is thrown out, and the mean is then calculated in the normal way. The median is a trimmed mean with p = 100%, while the standard mean corresponds to p = 0%.

Example 3.3. Consider the set of values {1, 2, 3, 4, 5, 90}. The mean of these values is 17.5, while the median is 3.5. The trimmed mean with p = 40% is also 3.5.
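The numbers in Example 3.3 can be checked with a minimal Python sketch. The function names are illustrative only, and the rule for a fractional number of discarded values (rounding down) is an assumption, since the text does not specify one.

```python
import math


def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def median(values):
    """Middle value for odd counts, average of the two middle values for even counts."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def trimmed_mean(values, p):
    """Mean after discarding the lowest and highest (p/2)% of the values.

    Assumption: the number of values dropped from each end is rounded down.
    Restricted to 0 <= p < 100; the limiting case p = 100 is the median.
    """
    if not 0 <= p < 100:
        raise ValueError("p must satisfy 0 <= p < 100")
    ordered = sorted(values)
    k = math.floor(len(ordered) * p / 200)  # values dropped from each end
    return mean(ordered[k:len(ordered) - k])


values = [1, 2, 3, 4, 5, 90]
print(mean(values))              # 17.5
print(median(values))            # 3.5
print(trimmed_mean(values, 40))  # 3.5
```

With p = 40, one value is dropped from each end, leaving {2, 3, 4, 5}, so the outlier 90 no longer influences the result.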

Example 3.4. The means, medians, and trimmed means (p = 20%) of the four quantitative attributes of the Iris data are given in Table 3.3. The three measures of location have similar values except for the attribute petal length.

Table 3.3. Means and medians for sepal length, sepal width, petal length, and petal width. (All values are in centimeters.)

Measure              Sepal Length   Sepal Width   Petal Length   Petal Width
mean                 5.84           3.05          3.76           1.20
median               5.80           3.00          4.35           1.30
trimmed mean (20%)   5.79           3.02          3.72           1.12

3.2.4 Measures of Spread: Range and Variance

Another set of commonly used summary statistics for continuous data are those that measure the dispersion or spread of a set of values. Such measures indicate if the attribute values are widely spread out or if they are relatively concentrated around a single point such as the mean.

The simplest measure of spread is the range, which, given an attribute $x$ with a set of $m$ values $\{x_1, \ldots, x_m\}$, is defined as

$$\mathrm{range}(x) = \max(x) - \min(x) = x_{(m)} - x_{(1)}. \tag{3.4}$$

Although the range identifies the maximum spread, it can be misleading if most of the values are concentrated in a narrow band of values, but there are also a relatively small number of more extreme values. Hence, the variance is preferred as a measure of spread. The variance of the (observed) values of an attribute $x$ is typically written as $s_x^2$ and is defined below. The standard deviation, which is the square root of the variance, is written as $s_x$ and has the same units as $x$.

$$\mathrm{variance}(x) = s_x^2 = \frac{1}{m-1} \sum_{i=1}^{m} (x_i - \bar{x})^2 \tag{3.5}$$

The mean can be distorted by outliers, and since the variance is computed using the mean, it is also sensitive to outliers. Indeed, the variance is particularly sensitive to outliers since it uses the squared difference between the mean and other values. As a result, more robust estimates of the spread of a set of values are often used. Following are the definitions of three such measures: the absolute average deviation (AAD), the median absolute deviation (MAD), and the interquartile range (IQR). Table 3.4 shows these measures for the Iris data set.

$$\mathrm{AAD}(x) = \frac{1}{m} \sum_{i=1}^{m} |x_i - \bar{x}| \tag{3.6}$$

$$\mathrm{MAD}(x) = \mathrm{median}\left(\{|x_1 - \bar{x}|, \ldots, |x_m - \bar{x}|\}\right) \tag{3.7}$$

$$\text{interquartile range}(x) = x_{75\%} - x_{25\%} \tag{3.8}$$

Table 3.4. Range, standard deviation (std), absolute average difference (AAD), median absolute difference (MAD), and interquartile range (IQR) for sepal length, sepal width, petal length, and petal width. (All values are in centimeters.)

Measure   Sepal Length   Sepal Width   Petal Length   Petal Width
range     3.6            2.4           5.9            2.4
std       0.8            0.4           1.8            0.8
AAD       0.7            0.3           1.6            0.6
MAD       0.7            0.3           1.2            0.7
IQR       1.3            0.5           3.5            1.5
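As a rough illustration of Equations 3.4 through 3.8, the following Python sketch computes each measure of spread for the values of Example 3.3. The function names are illustrative. The percentile rule used for the interquartile range (linear interpolation between order statistics) is an assumption, because the text does not say how percentiles are computed; other conventions give slightly different results.

```python
import math


def mean(values):
    return sum(values) / len(values)


def median(values):
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 == 1 else (ordered[mid - 1] + ordered[mid]) / 2


def value_range(values):
    """Equation 3.4: largest value minus smallest value."""
    return max(values) - min(values)


def variance(values):
    """Equation 3.5: sum of squared deviations from the mean, divided by m - 1."""
    x_bar = mean(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)


def std(values):
    """Standard deviation: square root of the variance."""
    return math.sqrt(variance(values))


def aad(values):
    """Equation 3.6: average absolute deviation from the mean."""
    x_bar = mean(values)
    return sum(abs(x - x_bar) for x in values) / len(values)


def mad(values):
    """Equation 3.7: median of the absolute deviations from the mean."""
    x_bar = mean(values)
    return median([abs(x - x_bar) for x in values])


def percentile(values, q):
    """q-th percentile by linear interpolation (an assumed convention)."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def iqr(values):
    """Equation 3.8: 75th percentile minus 25th percentile."""
    return percentile(values, 75) - percentile(values, 25)


values = [1, 2, 3, 4, 5, 90]
for name, fn in [("range", value_range), ("variance", variance), ("std", std),
                 ("AAD", aad), ("MAD", mad), ("IQR", iqr)]:
    print(name, fn(values))
```

For this data the single outlier 90 drives the range (89) and the variance (1263.5), while the MAD (15.0) and the IQR (2.5 under the assumed percentile rule) are far less affected.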


3.2.5 Multivariate Summary Statistics

Measures of location for data that consists of several attributes (multivariate data) can be obtained by computing the mean or median separately for each attribute. Thus, given a data set, the mean of the data objects, $\bar{\mathbf{x}}$, is given by

$$\bar{\mathbf{x}} = (\bar{x}_1, \ldots, \bar{x}_n), \tag{3.9}$$

where $\bar{x}_i$ is the mean of the $i$th attribute $x_i$.

For multivariate data, the spread of each attribute can be computed independently of the other attributes using any of the approaches described in Section 3.2.4. However, for data with continuous variables, the spread of the data is most commonly captured by the covariance matrix $\mathbf{S}$, whose $ij$th entry $s_{ij}$ is the covariance of the $i$th and $j$th attributes of the data. Thus, if $x_i$ and $x_j$ are the $i$th and $j$th attributes, then

$$s_{ij} = \mathrm{covariance}(x_i, x_j). \tag{3.10}$$

In turn, $\mathrm{covariance}(x_i, x_j)$ is given by

$$\mathrm{covariance}(x_i, x_j) = \frac{1}{m-1} \sum_{k=1}^{m} (x_{ki} - \bar{x}_i)(x_{kj} - \bar{x}_j), \tag{3.11}$$

where $x_{ki}$ and $x_{kj}$ are the values of the $i$th and $j$th attributes for the $k$th object. Notice that $\mathrm{covariance}(x_i, x_i) = \mathrm{variance}(x_i)$. Thus, the covariance matrix has the variances of the attributes along the diagonal.

The covariance of two attributes is a measure of the degree to which two attributes vary together and depends on the magnitudes of the variables. A value near 0 indicates that two attributes do not have a (linear) relationship, but it is not possible to judge the degree of relationship between two variables by looking only at the value of the covariance. Because the correlation of two attributes immediately gives an indication of how strongly two attributes are (linearly) related, correlation is preferred to covariance for data exploration. (Also see the discussion of correlation in Section 2.4.5.) The $ij$th entry of the correlation matrix $\mathbf{R}$ is the correlation between the $i$th and $j$th attributes of the data. If $x_i$ and $x_j$ are the $i$th and $j$th attributes, then

$$r_{ij} = \mathrm{correlation}(x_i, x_j) = \frac{\mathrm{covariance}(x_i, x_j)}{s_i s_j}, \tag{3.12}$$


where $s_i$ and $s_j$ are the standard deviations of $x_i$ and $x_j$, respectively. The diagonal entries of $\mathbf{R}$ are $\mathrm{correlation}(x_i, x_i) = 1$, while the other entries are between -1 and 1. It is also useful to consider correlation matrices that contain the pairwise correlations of objects instead of attributes.
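The covariance and correlation matrices can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the data set below is hypothetical (rows are objects, columns are attributes), the function names are made up for the example, and each correlation is obtained by dividing the covariance by the product of the two attributes' standard deviations.

```python
import math


def column_means(data):
    """Mean of each attribute (column); data is a list of equal-length rows."""
    m = len(data)
    return [sum(row[i] for row in data) / m for i in range(len(data[0]))]


def covariance_matrix(data):
    """Entry [i][j] is the covariance of attributes i and j (divisor m - 1)."""
    m, n = len(data), len(data[0])
    means = column_means(data)
    return [[sum((row[i] - means[i]) * (row[j] - means[j]) for row in data) / (m - 1)
             for j in range(n)]
            for i in range(n)]


def correlation_matrix(data):
    """Entry [i][j] is covariance(i, j) divided by the product of the standard deviations.

    Note: an attribute with zero variance would cause a division by zero here.
    """
    S = covariance_matrix(data)
    n = len(S)
    s = [math.sqrt(S[i][i]) for i in range(n)]  # standard deviations from the diagonal
    return [[S[i][j] / (s[i] * s[j]) for j in range(n)] for i in range(n)]


# Hypothetical data: attribute 2 is twice attribute 1, attribute 3 is 10 minus attribute 1.
data = [
    [1.0, 2.0, 9.0],
    [2.0, 4.0, 8.0],
    [3.0, 6.0, 7.0],
    [4.0, 8.0, 6.0],
]

S = covariance_matrix(data)
R = correlation_matrix(data)
for row in R:
    print([round(r, 3) for r in row])
```

The covariances of the first attribute with the second and third differ in magnitude (10/3 and -5/3), yet the correlations are 1 and -1 up to rounding, which illustrates why correlation is easier to interpret than covariance.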

3.2.6 Other Ways to Summarize the Data

There are, of course, other types of summary statistics. For instance, the skewness of a set of values measures the degree to which the values are symmetrically distributed around the mean. There are also other characteristics of the data that are not easy to measure quantitatively, such as whether the distribution of values is multimodal; i.e., the data has multiple “bumps” where most of the values are concentrated. In many cases, however, the most effective approach to understanding the more complicated or subtle aspects of how the values of an attribute are distributed is to view the values graphically in the form of a histogram. (Histograms are discussed in the next section.)
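The text does not give a formula for skewness. One common definition, assumed here purely for illustration, is the third central moment divided by the second central moment raised to the power 3/2. A minimal Python sketch:

```python
def skewness(values):
    """Moment-based skewness (an assumed definition, not taken from the text).

    Returns m3 / m2**1.5, where mk is the average k-th power of the
    deviations from the mean. Undefined when all values are equal.
    """
    m = len(values)
    x_bar = sum(values) / m
    m2 = sum((x - x_bar) ** 2 for x in values) / m
    m3 = sum((x - x_bar) ** 3 for x in values) / m
    return m3 / m2 ** 1.5


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 3, 4, 5, 90]

print(skewness(symmetric))     # 0.0 for values symmetric about the mean
print(skewness(right_skewed))  # positive: a long tail to the right
```

A value near 0 suggests symmetry around the mean, a positive value indicates a longer right tail, and a negative value a longer left tail.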

3.3 Visualization

Data visualization is the display of information in a graphic or tabular format. Successful visualization requires that the data (information) be converted into a visual format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported. The goal of visualization is the interpretation of the visualized information by a person and the formation of a mental model of the information.

In everyday life, visual techniques such as graphs and tables are often the preferred approach used to explain the weather, the economy, and the results of political elections. Likewise, while algorithmic or mathematical approaches are emphasized in most technical disciplines, data mining included, visual techniques can play a key role in data analysis. In fact, sometimes the use of visualization techniques in data mining is referred to as visual data mining.

3.3.1 Motivations for Visualization

The overriding motivation for using visualization is that people can quickly absorb large amounts of visual information and find patterns in it. Consider Figure 3.2, which shows the Sea Surface Temperature (SST) in degrees Celsius for July, 1982. This picture summarizes the information from approximately 250,000 numbers and is readily interpreted in a few seconds. For example, it is easy to see that the ocean temperature is highest at the equator and lowest at the poles.

Figure 3.2. Sea Surface Temperature (SST) for July, 1982. (Horizontal axis: longitude.)

Another general motivation for visualization is to make use of the domain knowledge that is “locked up in people’s heads.” While the use of domain knowledge is an important task in data mining, it is often difficult or impossible to fully utilize such knowledge in statistical or algorithmic tools. In some cases, an analysis can be performed using non-visual tools, and then the results presented visually for evaluation by the domain expert. In other cases, having a domain specialist examine visualizations of the data may be the best way of finding patterns of interest since, by using domain knowledge, a person can often quickly eliminate many uninteresting patterns and direct the focus to the patterns that are important.

3.3.2 General Concepts

This section explores some of the general concepts related to visualization, in particular, general approaches for visualizing the data and its attributes. A number of visualization techniques are mentioned briefly and will be described in more detail when we discuss specific approaches later on. We assume that the reader is familiar with line graphs, bar charts, and scatter plots.


Representation: Mapping Data to Graphical Elements

The first step in visualization is the mapping of information to a visual format; i.e., mapping the objects, attributes, and relationships in a set of information to visual objects, attributes, and relationships. That is, data objects, their attributes, and the relationships among data objects are translated into graphical elements such as points, lines, shapes, and colors.

Objects are usually represented in one of three ways. First, if only a single categorical attribute of the object is being considered, then objects are often lumped into categories based on the value of that attribute, and these categories are displayed as an entry in a table or an area on a screen. (Examples shown later in this chapter are a cross-tabulation table and a bar chart.) Second, if an object has multiple attributes, then the object can be displayed as a row (or column) of a table or as a line on a graph. Finally, an object is often interpreted as a point in two- or three-dimensional space, where graphically, the point might be represented by a geometric figure, such as a circle, cross, or box.

For attributes, the representation depends on the type of attribute, i.e., nominal, ordinal, or continuous (interval or ratio). Ordinal and continuous attributes can be mapped to continuous, ordered graphical features such as location along the x, y, or z axes; intensity; color; or size (diameter, width, height, etc.). For categorical attributes, each category can be mapped to a distinct position, color, shape, orientation, embellishment, or column in a table. However, for nominal attributes, whose values are unordered, care should be taken when using graphical features, such as color and position, that have an inherent ordering associated with their values. In other words, the graphical features used to represent nominal values often have an order, but the nominal values themselves do not.

The representation of relationships via graphical elements occurs either explicitly or implicitly. For graph data, the standard graph representation (a set of nodes with links between the nodes) is normally used. If the nodes (data objects) or links (relationships) have attributes or characteristics of their own, then this is represented graphically. To illustrate, if the nodes are cities and the links are highways, then the diameter of the nodes might represent population, while the width of the links might represent the volume of traffic.
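The city and highway illustration can be made concrete with a small Python sketch that maps attribute values to graphical sizes. Everything specific here is an assumption made for the example: the city names, populations, traffic volumes, the output size ranges, and the choice of a linear mapping. A real plotting library would then draw the nodes and links using the computed sizes.

```python
def linear_scale(value, domain, out_range):
    """Map value from the interval domain to the interval out_range linearly."""
    d_lo, d_hi = domain
    r_lo, r_hi = out_range
    if d_hi == d_lo:  # all data values equal: use the middle of the output range
        return (r_lo + r_hi) / 2
    t = (value - d_lo) / (d_hi - d_lo)
    return r_lo + t * (r_hi - r_lo)


# Hypothetical data: node attribute (population) and link attribute (daily traffic).
cities = {"A": 50_000, "B": 200_000, "C": 1_000_000}
highways = {("A", "B"): 5_000, ("B", "C"): 40_000, ("A", "C"): 12_000}

pop_domain = (min(cities.values()), max(cities.values()))
traffic_domain = (min(highways.values()), max(highways.values()))

# Graphical attributes: node diameter and link width, in arbitrary drawing units.
node_diameter = {name: linear_scale(pop, pop_domain, (5.0, 30.0))
                 for name, pop in cities.items()}
link_width = {link: linear_scale(vol, traffic_domain, (1.0, 8.0))
              for link, vol in highways.items()}

print(node_diameter)
print(link_width)
```

Because population and traffic volume are continuous attributes, mapping them to an ordered graphical feature such as size preserves their ordering; a nominal attribute such as a city's name would instead be shown as a label.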

In most cases, though, mapping objects and attributes to graphical elements implicitly maps the relationships in the data to relationships among graphical elements. To illustrate, if the data object represents a physical object that has a location, such as a city, then the relative positions of the graphical objects corresponding to the data objects tend to naturally preserve the actual relative positions of the objects. Likewise, if there are two or three continuous attributes that are taken as the coordinates of the data points, then the resulting plot often gives considerable insight into the relationships of the attributes and the data points because data points that are visually close to each other have similar values for their attributes.

In general, it is difficult to ensure that a mapping of objects and attributes will result in the relationships being mapped to easily observed relationships among graphical elements. Indeed, this is one of the most challenging aspects of visualization. In any given set of data, there are many implicit relationships, and hence, a key challenge of visualization is to choose a technique that makes the relationships of interest easily observable.