Figure 3.14. Empirical CDFs of four Iris attributes: (a) sepal length, (b) sepal width, (c) petal length, (d) petal width.
Figure 3.15. Percentile plots for sepal length, sepal width, petal length, and petal width.
118 Chapter 3 Exploring Data
3.3 Visualization 119
There are two main uses for scatter plots. First, they graphically show the relationship between two attributes. In Section 2.4.5, we saw how scatter plots could be used to judge the degree of linear correlation. (See Figure 2.17.) Scatter plots can also be used to detect non-linear relationships, either directly or by using a scatter plot of the transformed attributes.
Second, when class labels are available, they can be used to investigate the degree to which two attributes separate the classes. If it is possible to draw a line (or a more complicated curve) that divides the plane defined by the two attributes into separate regions that contain mostly objects of one class, then it is possible to construct an accurate classifier based on the specified pair of attributes. If not, then more attributes or more sophisticated methods are needed to build a classifier. In Figure 3.16, many of the pairs of attributes (for example, petal width and petal length) provide a moderate separation of the Iris species.
Example 3.14. There are two separate approaches for displaying three attributes of a data set with a scatter plot. First, each object can be displayed according to the values of three attributes instead of two. Figure 3.17 shows a three-dimensional scatter plot for three attributes in the Iris data set. Second, one of the attributes can be associated with some characteristic of the marker, such as its size, color, or shape. Figure 3.18 shows a plot of three attributes of the Iris data set, where one of the attributes, sepal width, is mapped to the size of the marker.
Extending Two- and Three-Dimensional Plots As illustrated by Figure 3.18, two- or three-dimensional plots can be extended to represent a few additional attributes. For example, scatter plots can display up to three additional attributes using color or shading, size, and shape, allowing five or six dimensions to be represented. There is a need for caution, however. As the complexity of a visual representation of the data increases, it becomes harder for the intended audience to interpret the information. There is no benefit in packing six dimensions’ worth of information into a two- or three-dimensional plot, if doing so makes it impossible to understand.
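As a rough illustration of mapping an attribute to marker size, the following sketch rescales attribute values linearly to a range of marker areas. The function name `scale_to_marker_sizes`, the size limits, and the sepal width values are assumptions made for this example and do not come from the Iris data set.

```python
def scale_to_marker_sizes(values, min_size=20.0, max_size=200.0):
    """Linearly map attribute values to marker areas in [min_size, max_size]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are equal, so every marker gets the midpoint size.
        return [(min_size + max_size) / 2.0 for _ in values]
    span = hi - lo
    return [min_size + (v - lo) / span * (max_size - min_size) for v in values]


# Hypothetical sepal widths (cm) for five flowers, invented for illustration.
sepal_width = [2.0, 3.0, 3.5, 4.0, 2.5]
sizes = scale_to_marker_sizes(sepal_width)
print(sizes)
```

A list such as `sizes` could then be passed as the marker size argument of a plotting routine, for example the `s` parameter of matplotlib's `scatter`.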
Visualizing Spatio-temporal Data
Data often has spatial or temporal attributes. For instance, the data may consist of a set of observations on a spatial grid, such as observations of pressure on the surface of the Earth or the modeled temperature at various grid points in the simulation of a physical object. These observations can also be
Figure 3.17. Three-dimensional scatter plot of sepal width, sepal length, and petal width.
Figure 3.18. Scatter plot of petal length versus petal width, with the size of the marker indicating sepal width.
Figure 3.19. Contour plot of SST for December 1998.
made at various points in time. In addition, data may have only a temporal component, such as time series data that gives the daily prices of stocks.
Contour Plots For some three-dimensional data, two attributes specify a position in a plane, while the third has a continuous value, such as temperature or elevation. A useful visualization for such data is a contour plot, which breaks the plane into separate regions where the values of the third attribute (temperature, elevation) are roughly the same. A common example of a contour plot is a contour map that shows the elevation of land locations.
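The idea of breaking the plane into regions of roughly equal value can be sketched by assigning each grid cell to a band between consecutive contour levels. This is a simplification: it labels cells rather than tracing the contour lines themselves. The grid values, the levels, and the name `contour_bands` are hypothetical.

```python
from bisect import bisect_right


def contour_bands(grid, levels):
    """Return, for each cell, the index of the band its value falls in.

    Band 0 holds values below levels[0]; band k holds values in
    [levels[k-1], levels[k]); the last band holds values >= levels[-1].
    `levels` must be sorted in increasing order.
    """
    return [[bisect_right(levels, v) for v in row] for row in grid]


# Hypothetical temperatures on a 2 x 3 grid, invented for illustration.
temperature = [[2.0, 7.0, 12.0],
               [4.0, 9.0, 16.0]]
levels = [5.0, 10.0, 15.0]
bands = contour_bands(temperature, levels)
print(bands)
```

Cells that share a band index belong to the same region; the boundaries between neighboring cells with different indices approximate where the contour lines would be drawn.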
Example 3.15. Figure 3.19 shows a contour plot of the average sea surface temperature (SST) for December 1998. The land is arbitrarily set to have a temperature of 0°C. In many contour maps, such as that of Figure 3.19, the contour lines that separate two regions are labeled with the value used to separate the regions. For clarity, some of these labels have been deleted.
Surface Plots Like contour plots, surface plots use two attributes for the x and y coordinates. The third attribute is used to indicate the height above
Figure 3.20. Density of a set of 12 points: (a) set of 12 points; (b) overall density function (surface plot).
the plane defined by the first two attributes. While such graphs can be useful, they require that a value of the third attribute be defined for all combinations of values for the first two attributes, at least over some range. Also, if the surface is too irregular, then it can be difficult to see all the information, unless the plot is viewed interactively. Thus, surface plots are often used to describe mathematical functions or physical surfaces that vary in a relatively smooth manner.
Example 3.16. Figure 3.20 shows a surface plot of the density around a set of 12 points. This example is further discussed in Section 9.3.3.
Vector Field Plots In some data, a characteristic may have both a magnitude and a direction associated with it. For example, consider the flow of a substance or the change of density with location. In these situations, it can be useful to have a plot that displays both direction and magnitude. This type of plot is known as a vector plot.
Example 3.17. Figure 3.21 shows a contour plot of the density of the two smaller density peaks from Figure 3.20(b), annotated with the density gradient vectors.
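One way to obtain gradient vectors from gridded values is to use central differences. The sketch below is an assumption about how such vectors might be computed, not the method used to produce Figure 3.21; the grid, the spacing, and the name `gradient_field` are hypothetical.

```python
import math


def gradient_field(grid, spacing=1.0):
    """Estimate the gradient at each interior grid cell by central differences.

    Returns a dict mapping (row, col) to (dx, dy, magnitude, angle), where
    dx is the change along columns, dy the change along rows, and angle is
    the direction of the vector in radians.
    """
    rows, cols = len(grid), len(grid[0])
    field = {}
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            dx = (grid[i][j + 1] - grid[i][j - 1]) / (2 * spacing)
            dy = (grid[i + 1][j] - grid[i - 1][j]) / (2 * spacing)
            field[(i, j)] = (dx, dy, math.hypot(dx, dy), math.atan2(dy, dx))
    return field


# Hypothetical density f(x, y) = x + 2y on a 3 x 3 grid (x = column, y = row).
density = [[x + 2 * y for x in range(3)] for y in range(3)]
field = gradient_field(density)
dx, dy, magnitude, angle = field[(1, 1)]
print(dx, dy, magnitude)
```

Each entry supplies what a vector plot needs to draw one arrow: a direction (`angle`) and a length (`magnitude`).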
Lower-Dimensional Slices Consider a spatio-temporal data set that records some quantity, such as temperature or pressure, at various locations over time. Such a data set has four dimensions and cannot be easily displayed by the types
Figure 3.21. Vector plot of the gradient (change) in density for the bottom two density peaks of Figure 3.20.
of plots that we have described so far. However, separate “slices” of the data can be displayed by showing a set of plots, one for each month. By examining the change in a particular area from one month to another, it is possible to notice changes that occur, including those that may be due to seasonal factors.
Example 3.18. The underlying data set for this example consists of the average monthly sea level pressure (SLP) from 1982 to 1999 on a 2.5° by 2.5° latitude-longitude grid. The twelve monthly plots of pressure for one year are shown in Figure 3.22. In this example, we are interested in slices for a particular month in the year 1982. More generally, we can consider slices of the data along any arbitrary dimension.
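Slicing along different dimensions can be sketched with a small data cube stored as a list of two-dimensional grids, one per time step. The cube below is invented for illustration and is unrelated to the SLP data; the function names are hypothetical.

```python
def time_slice(cube, t):
    """Return the two-dimensional grid observed at time step t."""
    return cube[t]


def location_series(cube, row, col):
    """Return the time series observed at a single grid cell."""
    return [grid[row][col] for grid in cube]


# Hypothetical cube: 3 time steps on a 2 x 2 grid,
# with value = 100 * t + 10 * row + col so each entry is easy to trace.
cube = [[[100 * t + 10 * r + c for c in range(2)] for r in range(2)]
        for t in range(3)]

print(time_slice(cube, 1))          # one spatial slice (a single "month")
print(location_series(cube, 1, 0))  # one temporal slice (a single location)
```

Fixing the time index yields a map that can be drawn as one panel of a figure such as Figure 3.22, while fixing the location yields a time series.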
Animation Another approach to dealing with slices of data, whether or not time is involved, is to employ animation. The idea is to display successive two-dimensional slices of the data. The human visual system is well suited to detecting visual changes and can often notice changes that might be difficult to detect in another manner. Despite the visual appeal of animation, a set of still plots, such as those of Figure 3.22, can be more useful since this type of visualization allows the information to be studied in arbitrary order and for arbitrary amounts of time.
Figure 3.22. Monthly plots of sea level pressure over the 12 months of 1982.
3.3.4 Visualizing Higher-Dimensional Data
This section considers visualization techniques that can display more than the handful of dimensions that can be observed with the techniques just discussed. However, even these techniques are somewhat limited in that they only show some aspects of the data.
Matrices An image can be regarded as a rectangular array of pixels, where each pixel is characterized by its color and brightness. A data matrix is a rectangular array of values. Thus, a data matrix can be visualized as an image by associating each entry of the data matrix with a pixel in the image. The brightness or color of the pixel is determined by the value of the corresponding entry of the matrix.
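A minimal sketch of this mapping, assuming an 8-bit grayscale image and a linear scaling from the smallest matrix entry (black) to the largest (white), is given below. The matrix and the name `matrix_to_gray` are hypothetical.

```python
def matrix_to_gray(matrix):
    """Map each matrix entry to a gray level between 0 (black) and 255 (white)."""
    flat = [v for row in matrix for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        # A constant matrix carries no contrast, so use mid-gray everywhere.
        return [[128 for _ in row] for row in matrix]
    return [[round(255 * (v - lo) / span) for v in row] for row in matrix]


# Hypothetical 2 x 2 data matrix, invented for illustration.
data = [[0.0, 4.0],
        [10.0, 2.0]]
pixels = matrix_to_gray(data)
print(pixels)
```

Other choices, such as a color map or a nonlinear scaling, fit the same pattern: one pixel per matrix entry, with the pixel's appearance determined by the entry's value.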
Figure 3.23. Plot of the Iris data matrix where columns have been standardized to have a mean of 0 and standard deviation of 1.
Figure 3.24. Plot of the Iris correlation matrix.
There are some important practical considerations when visualizing a data matrix. If class labels are known, then it is useful to reorder the data matrix so that all objects of a class are together. This makes it easier, for example, to detect if all objects in a class have similar attribute values for some attributes. If different attributes have different ranges, then the attributes are often standardized to have a mean of zero and a standard deviation of 1. This prevents the attribute with the largest magnitude values from visually dominating the plot.
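Both preparation steps can be sketched in a few lines. The example below assumes the population standard deviation (the text does not say which form is used) and relies on a stable sort so that objects keep their original order within each class. The data, labels, and function names are hypothetical.

```python
from statistics import mean, pstdev


def standardize_columns(rows):
    """Rescale each column to mean 0 and (population) standard deviation 1."""
    cols = list(zip(*rows))
    stats = [(mean(c), pstdev(c)) for c in cols]
    return [[(v - m) / s if s > 0 else 0.0 for v, (m, s) in zip(row, stats)]
            for row in rows]


def group_by_class(rows, labels):
    """Reorder rows so that objects with the same class label are adjacent."""
    order = sorted(range(len(rows)), key=lambda i: labels[i])
    return [rows[i] for i in order], [labels[i] for i in order]


# Hypothetical data: two attributes with very different ranges.
rows = [[1.0, 100.0], [3.0, 300.0], [2.0, 200.0]]
labels = ["b", "a", "b"]

standardized = standardize_columns(rows)
sorted_rows, sorted_labels = group_by_class(standardized, labels)
print(sorted_labels)
print(sorted_rows)
```

After standardization the two columns are on the same scale, so neither dominates when the matrix is rendered as an image.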
Example 3.19. Figure 3.23 shows the standardized data matrix for the Iris data set. The first 50 rows represent Iris flowers of the species Setosa, the next 50 Versicolour, and the last 50 Virginica. The Setosa flowers have petal width and length well below the average, while the Versicolour flowers have petal width and length around average. The Virginica flowers have petal width and length above average.
It can also be useful to look for structure in the plot of a proximity matrix for a set of data objects. Again, it is useful to sort the rows and columns of the similarity matrix (when class labels are known) so that all the objects of a class are together. This allows a visual evaluation of the cohesiveness of each class and its separation from other classes.
Example 3.20. Figure 3.24 shows the correlation matrix for the Iris data set. Again, the rows and columns are organized so that all the flowers of a particular species are together. The flowers in each group are most similar to each other, but Versicolour and Virginica are more similar to one another than to Setosa.
If class labels are not known, various techniques (matrix reordering and seriation) can be used to rearrange the rows and columns of the similarity matrix so that groups of highly similar objects and attributes are together and can be visually identified. Effectively, this is a simple kind of clustering. See Section 8.5.3 for a discussion of how a proximity matrix can be used to investigate the cluster structure of data.
Parallel Coordinates Parallel coordinates have one coordinate axis for each attribute, but the different axes are parallel to one another instead of perpendicular, as is traditional. Furthermore, an object is represented as a line instead of as a point. Specifically, the value of each attribute of an object is mapped to a point on the coordinate axis associated with that attribute, and these points are then connected to form the line that represents the object.
It might be feared that this would yield quite a mess. However, in many cases, objects tend to fall into a small number of groups, where the points in each group have similar values for their attributes. If so, and if the number of data objects is not too large, then the resulting parallel coordinates plot can reveal interesting patterns.
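The construction of the line for each object can be sketched as follows. The example assumes that every axis is rescaled to [0, 1] using the attribute's minimum and maximum, which is one common choice rather than something the text prescribes; the data and the name `parallel_coordinates_lines` are hypothetical.

```python
def parallel_coordinates_lines(rows):
    """Convert each object into a list of (axis_index, position) vertices.

    Axis k is a vertical line at horizontal position k; the position of an
    object on that axis is its attribute value rescaled to [0, 1].
    """
    cols = list(zip(*rows))
    bounds = [(min(c), max(c)) for c in cols]
    lines = []
    for row in rows:
        line = []
        for axis, (v, (lo, hi)) in enumerate(zip(row, bounds)):
            position = (v - lo) / (hi - lo) if hi > lo else 0.5
            line.append((axis, position))
        lines.append(line)
    return lines


# Hypothetical objects with two attributes each, invented for illustration.
rows = [[5.0, 1.0], [7.0, 3.0], [6.0, 2.0]]
lines = parallel_coordinates_lines(rows)
for line in lines:
    print(line)
```

Connecting the vertices of each list in order gives the line that represents the object. Objects with similar attribute values produce lines that run close together.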
Example 3.21. Figure 3.25 shows a parallel coordinates plot of the four numerical attributes of the Iris data set. The lines representing objects of different classes are distinguished by their shading and the use of three different line styles: solid, dotted, and dashed. The parallel coordinates plot shows that the classes are reasonably well separated for petal width and petal length, but less well separated for sepal length and sepal width. Figure 3.26 is another parallel coordinates plot of the same data, but with a different ordering of the axes.
One of the drawbacks of parallel coordinates is that the detection of patterns in such a plot may depend on the order. For instance, if lines cross a lot, the picture can become confusing, and thus, it can be desirable to order the coordinate axes to obtain sequences of axes with less crossover. Compare Figure 3.26, where sepal width (the attribute that is most mixed) is at the left of the figure, to Figure 3.25, where this attribute is in the middle.
Star Coordinates and Chernoff Faces
Another approach to displaying multidimensional data is to encode objects as glyphs or icons, that is, symbols that impart information non-verbally. More
Figure 3.25. A parallel coordinates plot of the four Iris attributes.
Figure 3.26. A parallel coordinates plot of the four Iris attributes with the attributes reordered to emphasize similarities and dissimilarities of groups.
specifically, each attribute of an object is mapped to a particular feature of a glyph, so that the value of the attribute determines the exact nature of the feature. Thus, at a glance, we can distinguish how two objects differ.
Star coordinates are one example of this approach. This technique uses one axis for each attribute. These axes all radiate from a center point, like the spokes of a wheel, and are evenly spaced. Typically, all the attribute values are mapped to the range [0,1].
An object is mapped onto this star-shaped set of axes using the following process: Each attribute value of the object is converted to a fraction that represents its distance between the minimum and maximum values of the attribute. This fraction is mapped to a point on the axis corresponding to this attribute. Each point is connected with a line segment to the point on the axis preceding or following its own axis; this forms a polygon. The size and shape of this polygon gives a visual description of the attribute values of the object. For ease of interpretation, a separate set of axes is used for each object. In other words, each object is mapped to a polygon. An example of a star coordinates plot of flower 150 is given in Figure 3.27(a).
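The process just described can be sketched as a function that returns the polygon's vertices. Placing the first axis at angle 0 and proceeding counterclockwise is an assumption made for this example; the attribute values, minimums, and maximums are hypothetical and are not those of flower 150.

```python
import math


def star_polygon(values, minimums, maximums):
    """Return the (x, y) vertices of the star coordinates polygon of one object.

    Each attribute value is converted to a fraction of the way from the
    attribute's minimum to its maximum and placed at that distance from the
    center along the attribute's axis. The axes are evenly spaced.
    """
    n = len(values)
    vertices = []
    for k, (v, lo, hi) in enumerate(zip(values, minimums, maximums)):
        fraction = (v - lo) / (hi - lo) if hi > lo else 0.0
        angle = 2 * math.pi * k / n
        vertices.append((fraction * math.cos(angle), fraction * math.sin(angle)))
    return vertices


# Hypothetical object with four attributes, invented for illustration.
values = [5.0, 3.0, 4.0, 1.0]
minimums = [4.0, 2.0, 1.0, 0.0]
maximums = [8.0, 4.0, 7.0, 2.0]
polygon = star_polygon(values, minimums, maximums)
print(polygon)
```

Joining consecutive vertices, and the last vertex back to the first, draws the polygon whose size and shape summarize the object.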
It is also possible to map the values of features to those of more familiar objects, such as faces. This technique is named Chernoff faces for its creator, Herman Chernoff. In this technique, each attribute is associated with a specific feature of a face, and the attribute value is used to determine the way that the facial feature is expressed. Thus, the shape of the face may become more elongated as the value of the corresponding data feature increases. An example of a Chernoff face for flower 150 is given in Figure 3.27(b).
The program that we used to make this face mapped the four attributes of the flower to the four facial features listed below. Other features of the face, such as width between the eyes and length of the mouth, are given default values.