Arrangement

As discussed earlier, the proper choice of visual representation of objects and attributes is essential for good visualization. The arrangement of items within the visual display is also crucial. We illustrate this with two examples.

Example 3.5. This example illustrates the importance of rearranging a table of data. In Table 3.5, which shows nine objects with six binary attributes, there is no clear relationship between objects and attributes, at least at first glance. If the rows and columns of this table are permuted, however, as shown in Table 3.6, then it is clear that there are really only two types of objects in the table: one that has all ones for the first three attributes and one that has only ones for the last three attributes.

Table 3.5. A table of nine objects (rows) with six binary attributes (columns).

     1  2  3  4  5  6
1    0  1  0  1  1  0
2    1  0  1  0  0  1
3    0  1  0  1  1  0
4    1  0  1  0  0  1
5    0  1  0  1  1  0
6    1  0  1  0  0  1
7    0  1  0  1  1  0
8    1  0  1  0  0  1
9    0  1  0  1  1  0

Table 3.6. A table of nine objects (rows) with six binary attributes (columns) permuted so that the relationships of the rows and columns are clear.

     6  1  3  2  5  4
4    1  1  1  0  0  0
2    1  1  1  0  0  0
6    1  1  1  0  0  0
8    1  1  1  0  0  0
5    0  0  0  1  1  1
3    0  0  0  1  1  1
9    0  0  0  1  1  1
1    0  0  0  1  1  1
7    0  0  0  1  1  1
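
The rearrangement in Example 3.5 can be sketched in a few lines of Python. This is only an illustration: the values are transcribed from Table 3.5, the row and column orders are those of Table 3.6, and the helper name `permute` is a hypothetical choice, not something defined in the text.

```python
# Table 3.5, transcribed: objects 1, 3, 5, 7, 9 share one pattern and
# objects 2, 4, 6, 8 share the other.
ODD_ROW = [0, 1, 0, 1, 1, 0]
EVEN_ROW = [1, 0, 1, 0, 0, 1]
table = [ODD_ROW if obj % 2 == 1 else EVEN_ROW for obj in range(1, 10)]


def permute(table, row_order, col_order):
    """Rearrange rows and columns; both orders use 1-based labels."""
    return [[table[r - 1][c - 1] for c in col_order] for r in row_order]


# Row and column orders as they appear in Table 3.6.
row_order = [4, 2, 6, 8, 5, 3, 9, 1, 7]
col_order = [6, 1, 3, 2, 5, 4]
permuted = permute(table, row_order, col_order)

# Print each permuted row with its original object label.
for label, row in zip(row_order, permuted):
    print(label, *row)
```

After the permutation the ones form two solid blocks, which is what makes the two types of objects visible.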

3.3 Visualization 109

Example 3.6. Consider Figure 3.3(a), which shows a visualization of a graph. If the connected components of the graph are separated, as in Figure 3.3(b), then the relationships between nodes and graphs become much simpler to understand.

(a) Original view of a graph. (b) Uncoupled view of connected components of the graph.

Figure 3.3. Two visualizations of a graph.
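
Separating a graph into its connected components, as in Example 3.6, requires finding those components first. The following is a minimal sketch using breadth-first search; the small adjacency list `graph` and the function name `connected_components` are made up for illustration and do not correspond to the graph in Figure 3.3.

```python
from collections import deque


def connected_components(adjacency):
    """Group the nodes of an undirected graph into connected components."""
    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search from an unvisited node collects one component.
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(sorted(component))
    return components


# Hypothetical undirected graph: each edge is listed in both directions.
graph = {
    "a": ["b", "c"], "b": ["a"], "c": ["a"],
    "d": ["e"], "e": ["d"],
    "f": [],
}
components = connected_components(graph)
print(components)
```

Each component can then be laid out in its own region of the display.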

Selection

Another key concept in visualization is selection, which is the elimination or the de-emphasis of certain objects and attributes. Specifically, while data objects that only have a few dimensions can often be mapped to a two- or three-dimensional graphical representation in a straightforward way, there is no completely satisfactory and general approach to represent data with many attributes. Likewise, if there are many data objects, then visualizing all the objects can result in a display that is too crowded. If there are many attributes and many objects, then the situation is even more challenging.

The most common approach to handling many attributes is to choose a subset of attributes (usually two) for display. If the dimensionality is not too high, a matrix of bivariate (two-attribute) plots can be constructed for simultaneous viewing. (Figure 3.16 shows a matrix of scatter plots for the pairs of attributes of the Iris data set.) Alternatively, a visualization program can automatically show a series of two-dimensional plots, in which the sequence is user directed or based on some predefined strategy. The hope is that visualizing a collection of two-dimensional plots will provide a more complete view of the data.
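
To see why this approach only works when the dimensionality is modest, it helps to count the bivariate plots. The sketch below enumerates the attribute pairs for the four Iris attributes; with d attributes there are d(d-1)/2 unordered pairs. The variable names are illustrative.

```python
from itertools import combinations

attributes = ["sepal length", "sepal width", "petal length", "petal width"]

# Every unordered pair of attributes corresponds to one bivariate plot.
pairs = list(combinations(attributes, 2))
for x_attr, y_attr in pairs:
    print(f"{x_attr} vs. {y_attr}")

d = len(attributes)
print(len(pairs), "plots for", d, "attributes")
```

For 4 attributes this gives 6 plots; for 20 attributes it would already be 190.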

110 Chapter 3 Exploring Data

The technique of selecting a pair (or small number) of attributes is a type of dimensionality reduction, and there are many more sophisticated dimensionality reduction techniques that can be employed, e.g., principal components analysis (PCA). Consult Appendices A (Linear Algebra) and B (Dimensionality Reduction) for more information.

When the number of data points is high, e.g., more than a few hundred, or if the range of the data is large, it is difficult to display enough information about each object. Some data points can obscure other data points, or a data object may not occupy enough pixels to allow its features to be clearly displayed. For example, the shape of an object cannot be used to encode a characteristic of that object if there is only one pixel available to display it. In these situations, it is useful to be able to eliminate some of the objects, either by zooming in on a particular region of the data or by taking a sample of the data points.
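
Both ways of eliminating objects can be sketched briefly. The code below is an illustration under stated assumptions: the data are synthetic two-dimensional points, and the function names `sample_points` and `zoom` are hypothetical. Sampling here is simple random sampling without replacement, which is only one of several possible sampling schemes.

```python
import random


def sample_points(points, k, seed=0):
    """Return a simple random sample of at most k points (without replacement)."""
    if k >= len(points):
        return list(points)
    rng = random.Random(seed)  # fixed seed so the display is reproducible
    return rng.sample(points, k)


def zoom(points, x_range, y_range):
    """Keep only the points that fall inside a rectangular region."""
    (x_lo, x_hi), (y_lo, y_hi) = x_range, y_range
    return [(x, y) for x, y in points if x_lo <= x <= x_hi and y_lo <= y <= y_hi]


# Synthetic data: 1000 points, too many to label individually.
points = [(x, (x * 37) % 101) for x in range(1000)]

sample = sample_points(points, 100)
zoomed = zoom(points, (0, 50), (0, 50))
print(len(points), len(sample), len(zoomed))
```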

3.3.3 Techniques

Visualization techniques are often specialized to the type of data being analyzed. Indeed, new visualization techniques and approaches, as well as specialized variations of existing approaches, are being continuously created, typically in response to new kinds of data and visualization tasks.

Despite this specialization and the ad hoc nature of visualization, there are some generic ways to classify visualization techniques. One such classification is based on the number of attributes involved (1, 2, 3, or many) or whether the data has some special characteristic, such as a hierarchical or graph structure. Visualization methods can also be classified according to the type of attributes involved. Yet another classification is based on the type of application: scientific, statistical, or information visualization. The following discussion will use three categories: visualization of a small number of attributes, visualization of data with spatial and/or temporal attributes, and visualization of data with many attributes.

Most of the visualization techniques discussed here can be found in a wide variety of mathematical and statistical packages, some of which are freely available. There are also a number of data sets that are freely available on the World Wide Web. Readers are encouraged to try these visualization techniques as they proceed through the following sections.


Visualizing Small Numbers of Attributes

This section examines techniques for visualizing data with respect to a small number of attributes. Some of these techniques, such as histograms, give insight into the distribution of the observed values for a single attribute. Other techniques, such as scatter plots, are intended to display the relationships between the values of two attributes.

Stem and Leaf Plots Stem and leaf plots can be used to provide insight into the distribution of one-dimensional integer or continuous data. (We will assume integer data initially, and then explain how stem and leaf plots can be applied to continuous data.) For the simplest type of stem and leaf plot, we split the values into groups, where each group contains those values that are the same except for the last digit. Each group becomes a stem, while the last digits of a group are the leaves. Hence, if the values are two-digit integers, e.g., 35, 36, 42, and 51, then the stems will be the high-order digits, e.g., 3, 4, and 5, while the leaves are the low-order digits, e.g., 1, 2, 5, and 6. By plotting the stems vertically and leaves horizontally, we can provide a visual representation of the distribution of the data.

Example 3.7. The set of integers shown in Figure 3.4 is the sepal length in centimeters (multiplied by 10 to make the values integers) taken from the Iris data set. For convenience, the values have also been sorted.

The stem and leaf plot for this data is shown in Figure 3.5. Each number in Figure 3.4 is first put into one of the vertical groups (4, 5, 6, or 7) according to its tens digit. Its last digit is then placed to the right of the colon. Often, especially if the amount of data is larger, it is desirable to split the stems. For example, instead of placing all values whose tens digit is 4 in the same "bucket," the stem 4 appears twice; all values 40 to 44 are put in the bucket corresponding to the first stem and all values 45 to 49 are put in the bucket corresponding to the second stem. This approach is shown in the stem and leaf plot of Figure 3.6. Other variations are also possible.
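
The construction described above can be sketched as follows. The function name `stem_and_leaf` and its `split` parameter are hypothetical, the input is assumed to consist of non-negative integers, and the short list of values is only a small excerpt in the style of Figure 3.4, not the full data set. Stems with no values are simply omitted in this sketch.

```python
from collections import defaultdict


def stem_and_leaf(values, split=False):
    """Return the lines of a stem and leaf plot for non-negative integers.

    With split=True each stem is divided into two buckets: leaves 0-4
    and leaves 5-9.
    """
    groups = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)  # last digit is the leaf
        half = 1 if (split and leaf >= 5) else 0
        groups[(stem, half)].append(str(leaf))
    return [f"{stem} : {''.join(leaves)}"
            for (stem, _), leaves in sorted(groups.items())]


values = [43, 44, 44, 44, 45, 46, 46, 46, 46, 47, 47, 50, 51, 51, 58]

plain = stem_and_leaf(values)
split_plot = stem_and_leaf(values, split=True)
print("\n".join(plain))
print()
print("\n".join(split_plot))
```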

Histograms Stem and leaf plots are a type of histogram, a plot that displays the distribution of values for attributes by dividing the possible values into bins and showing the number of objects that fall into each bin. For categorical data, each value is a bin. If this results in too many values, then values are combined in some way. For continuous attributes, the range of values is divided into bins (typically, but not necessarily, of equal width) and the values in each bin are counted.


43 44 44 44 45 46 46 46 46 47 47 48 48 48 48 48 49 49 49 49 49 49 50 50 50 50 50 50 50 50 50 50 51 51 51 51 51 51 51 51 51 52 52 52 52 53 54 54 54 54 54 54 55 55 55 55 55 55 55 56 56 56 56 56 56 57 57 57 57 57 57 57 57 58 58 58 58 58 58 58 59 59 59 60 60 60 60 60 60 61 61 61 61 61 61 62 62 62 62 63 63 63 63 63 63 63 63 63 64 64 64 64 64 64 64 65 65 65 65 65 66 66 67 67 67 67 67 67 67 67 68 68 68 69 69 69 69 70 71 72 72 72 73 74 76 77 77 77 77 79

Figure 3.4. Sepal length data from the Iris data set.

4 : 3444566667788888999999
5 : 0000000000111111111222234444445555555666666777777778888888999
6 : 000000111111222233333333344444445555566777777778889999
7 : 0122234677779

Figure 3.5. Stem and leaf plot for the sepal length from the Iris data set.

4 : 3444
4 : 566667788888999999
5 : 000000000011111111122223444444
5 : 5555555666666777777778888888999
6 : 00000011111122223333333334444444
6 : 5555566777777778889999
7 : 0122234
7 : 677779

Figure 3.6. Stem and leaf plot for the sepal length from the Iris data set when buckets corresponding to digits are split.

Once the counts are available for each bin, a bar plot is constructed such that each bin is represented by one bar and the area of each bar is proportional to the number of values (objects) that fall into the corresponding range. If all intervals are of equal width, then all bars are the same width and the height of a bar is proportional to the number of values in the corresponding bin.
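
The binning step for a continuous attribute can be sketched as below, assuming equal-width bins, at least two distinct values, and a convention in which the maximum value is placed in the last bin. The function name `histogram` and the small data list are illustrative only.

```python
def histogram(values, num_bins):
    """Count values in equal-width bins spanning [min(values), max(values)]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins  # assumes hi > lo
    counts = [0] * num_bins
    for value in values:
        # The maximum value would index one past the end, so clamp it.
        index = min(int((value - lo) / width), num_bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(num_bins + 1)]
    return counts, edges


values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]

counts4, edges4 = histogram(values, 4)
counts2, edges2 = histogram(values, 2)

# A crude text rendering: one row of asterisks per bin.
for count, left, right in zip(counts4, edges4, edges4[1:]):
    print(f"[{left:4.1f}, {right:4.1f}) {'*' * count}")
```

Running the same data through 4 bins and 2 bins gives visibly different shapes, which is why the number of bins is worth varying.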

Example 3.8. Figure 3.7 shows histograms (with 10 bins) for sepal length, sepal width, petal length, and petal width. Since the shape of a histogram can depend on the number of bins, histograms for the same data, but with 20 bins, are shown in Figure 3.8.

(a) Sepal length. (b) Sepal width. (c) Petal length. (d) Petal width.

Figure 3.7. Histograms of four Iris attributes (10 bins).

(a) Sepal length. (b) Sepal width. (c) Petal length. (d) Petal width.

Figure 3.8. Histograms of four Iris attributes (20 bins).

There are variations of the histogram plot. A relative (frequency) histogram replaces the count by the relative frequency. However, this is just a change in scale of the y axis, and the shape of the histogram does not change. Another common variation, especially for unordered categorical data, is the Pareto histogram, which is the same as a normal histogram except that the categories are sorted by count so that the count is decreasing from left to right.
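
Both variations are small transformations of the bin counts, as the following sketch suggests. The categorical values and the function names `relative_frequencies` and `pareto_order` are made up for illustration.

```python
from collections import Counter


def relative_frequencies(counts):
    """Rescale counts so that they sum to 1; the shape is unchanged."""
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def pareto_order(counts):
    """Sort categories by decreasing count, as in a Pareto histogram."""
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


# Hypothetical unordered categorical attribute.
observations = ["red", "blue", "red", "green", "red", "blue"]
counts = Counter(observations)

relative = relative_frequencies(counts)
ordered = pareto_order(counts)
print(relative)
print(ordered)
```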

Two-Dimensional Histograms Two-dimensional histograms are also possible. Each attribute is divided into intervals and the two sets of intervals define two-dimensional rectangles of values.

Example 3.9. Figure 3.9 shows a two-dimensional histogram of petal length and petal width. Because each attribute is split into three bins, there are nine rectangular two-dimensional bins. The height of each rectangular bar indicates the number of objects (flowers in this case) that fall into each bin. Most of the flowers fall into only three of the bins: those along the diagonal. It is not possible to see this by looking at the one-dimensional distributions.


Figure 3.9. Two-dimensional histogram of petal length and width in the Iris data set.

While two-dimensional histograms can be used to discover interesting facts about how the values of two attributes co-occur, they are visually more complicated. For instance, it is easy to imagine a situation in which some of the columns are hidden by others.
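
A two-dimensional histogram can be computed by binning each attribute separately and counting the pairs of bin indices. The sketch below uses synthetic data (not the Iris measurements), equal-width bins, and hypothetical function names; it is constructed so that all points fall on the diagonal even though each one-dimensional distribution is flat.

```python
def bin_index(value, lo, hi, num_bins):
    """Index of the equal-width bin containing value; the maximum goes in the last bin."""
    width = (hi - lo) / num_bins
    return min(int((value - lo) / width), num_bins - 1)


def histogram_2d(xs, ys, num_bins):
    """Return a num_bins x num_bins grid; grid[i][j] counts points in x-bin i and y-bin j."""
    x_lo, x_hi = min(xs), max(xs)
    y_lo, y_hi = min(ys), max(ys)
    grid = [[0] * num_bins for _ in range(num_bins)]
    for x, y in zip(xs, ys):
        i = bin_index(x, x_lo, x_hi, num_bins)
        j = bin_index(y, y_lo, y_hi, num_bins)
        grid[i][j] += 1
    return grid


# Synthetic, strongly related attributes.
xs = [1, 1, 2, 4, 5, 5, 8, 9, 9]
ys = [0, 1, 1, 4, 4, 5, 8, 8, 9]

grid = histogram_2d(xs, ys, 3)
for row in grid:
    print(row)

# One-dimensional counts recovered from the grid.
x_counts = [sum(row) for row in grid]
y_counts = [sum(column) for column in zip(*grid)]
print(x_counts, y_counts)
```

Here both one-dimensional histograms are uniform, so the diagonal structure is only visible in the two-dimensional counts.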

Box Plots Box plots are another method for showing the distribution of the values of a single numerical attribute. Figure 3.10 shows a labeled box plot for sepal length. The lower and upper ends of the box indicate the 25th and 75th percentiles, respectively, while the line inside the box indicates the value of the 50th percentile. The bottom and top lines of the tails indicate the 10th and 90th percentiles, respectively. Outliers are shown by "+" marks. Box plots are relatively compact, and thus, many of them can be shown on the same plot. Simplified versions of the box plot, which take less space, can also be used.
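
The quantities drawn in such a box plot can be computed as in the sketch below. Two assumptions are made that the text does not specify: percentiles are computed by linear interpolation between closest ranks (one of several common conventions), and the "+" marks are taken to be all points beyond the 10th and 90th percentile tails. The function names are hypothetical.

```python
def percentile(sorted_values, p):
    """p-th percentile (0 <= p <= 100) by linear interpolation between closest ranks."""
    position = (len(sorted_values) - 1) * p / 100
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


def box_plot_summary(values):
    """Return the five percentiles of the box plot and the points beyond the tails."""
    data = sorted(values)
    summary = {p: percentile(data, p) for p in (10, 25, 50, 75, 90)}
    # Assumption: points outside the 10th-90th percentile tails get "+" marks.
    beyond_tails = [v for v in data if v < summary[10] or v > summary[90]]
    return summary, beyond_tails


values = list(range(1, 22))  # synthetic data: 1, 2, ..., 21
summary, beyond_tails = box_plot_summary(values)
print(summary)
print(beyond_tails)
```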

Example 3.10. The box plots for the first four attributes of the Iris data set are shown in Figure 3.11. Box plots can also be used to compare how attributes vary between different classes of objects, as shown in Figure 3.12.

Labels, from top to bottom: outlier, 90th percentile, 75th percentile, 50th percentile, 25th percentile, 10th percentile.

Figure 3.10. Description of box plot for sepal length.

Sepal Length, Sepal Width, Petal Length, Petal Width

Figure 3.11. Box plot for Iris attributes.

(a) Setosa. (b) Versicolour. (c) Virginica.

Figure 3.12. Box plots of attributes by Iris species.

Pie Chart A pie chart is similar to a histogram, but is typically used with categorical attributes that have a relatively small number of values. Instead of showing the relative frequency of different values with the area or height of a bar, as in a histogram, a pie chart uses the relative area of a circle to indicate relative frequency. Although pie charts are common in popular articles, they are used less frequently in technical publications because the size of relative areas can be hard to judge. Histograms are preferred for technical work.

Example 3.11. Figure 3.13 displays a pie chart that shows the distribution of Iris species in the Iris data set. In this case, all three flower types have the same frequency.

Figure 3.13. Distribution of the types of Iris flowers.

Percentile Plots and Empirical Cumulative Distribution Functions A type of diagram that shows the distribution of the data more quantitatively is the plot of an empirical cumulative distribution function. While this type of plot may sound complicated, the concept is straightforward. For each value of a statistical distribution, a cumulative distribution function (CDF) shows the probability that a point is less than or equal to that value. For each observed value, an empirical cumulative distribution function (ECDF) shows the fraction of points that are less than or equal to this value. Since the number of points is finite, the empirical cumulative distribution function is a step function.
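
An ECDF is simple to compute from sorted data. The sketch below uses the common "less than or equal to" convention via `bisect_right`; replacing it with `bisect_left` would give the strict "less than" version. The function name `ecdf` and the sample values are illustrative.

```python
from bisect import bisect_right


def ecdf(values):
    """Return a function F where F(x) is the fraction of values <= x."""
    data = sorted(values)
    n = len(data)

    def F(x):
        # bisect_right counts how many sorted values are <= x.
        return bisect_right(data, x) / n

    return F


F = ecdf([5, 1, 3, 3])

# The function only changes at observed values, so it is a step function.
for x in [0, 1, 2, 3, 4, 5, 6]:
    print(x, F(x))
```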

Example 3.12. Figure 3.14 shows the ECDFs of the Iris attributes. The percentiles of an attribute provide similar information. Figure 3.15 shows the percentile plots of the four continuous attributes of the Iris data set from Table 3.2. The reader should compare these figures with the histograms given in Figures 3.7 and 3.8.

Scatter Plots Most people are familiar with scatter plots to some extent, and they were used in Section 2.4.5 to illustrate linear correlation. Each data object is plotted as a point in the plane using the values of the two attributes as x and y coordinates. It is assumed that the attributes are either integer- or real-valued.

Example 3.13. Figure 3.16 shows a scatter plot for each pair of attributes of the Iris data set. The different species of Iris are indicated by different markers. The arrangement of the scatter plots of pairs of attributes in this type of tabular format, which is known as a scatter plot matrix, provides an organized way to examine a number of scatter plots simultaneously.