Data Feature    Facial Feature
sepal length    size of face
sepal width     forehead/jaw relative arc length
petal length    shape of forehead
petal width     shape of jaw

Example 3.22. A more extensive illustration of these two approaches to viewing multidimensional data is provided by Figures 3.28 and 3.29, which show the star and face plots, respectively, of 15 flowers from the Iris data set. The first 5 flowers are of species Setosa, the second 5 are Versicolour, and the last 5 are Virginica.

3.3 Visualization 129

(a) Star graph of Iris 150. (b) Chernoff face of Iris 150.

Figure 3.27. Star coordinates graph and Chernoff face of the 150th flower of the Iris data set.

Figure 3.28. Plot of 15 Iris flowers using star coordinates.

Figure 3.29. A plot of 15 Iris flowers using Chernoff faces.

130 Chapter 3 Exploring Data

Despite the visual appeal of these sorts of diagrams, they do not scale well, and thus, they are of limited use for many data mining problems. Nonetheless, they may still be of use as a means to quickly compare small sets of objects that have been selected by other techniques.

3.3.5 Do’s and Don’ts

To conclude this section on visualization, we provide a short list of visualization do’s and don’ts. While these guidelines incorporate a lot of visualization wisdom, they should not be followed blindly. As always, guidelines are no substitute for thoughtful consideration of the problem at hand.

ACCENT Principles The following are the ACCENT principles for effective graphical display put forth by D. A. Burn (as adapted by Michael Friendly):

Apprehension Ability to correctly perceive relations among variables. Does the graph maximize apprehension of the relations among variables?

Clarity Ability to visually distinguish all the elements of a graph. Are the most important elements or relations visually most prominent?

Consistency Ability to interpret a graph based on similarity to previous graphs. Are the elements, symbol shapes, and colors consistent with their use in previous graphs?

Efficiency Ability to portray a possibly complex relation in as simple a way as possible. Are the elements of the graph economically used? Is the graph easy to interpret?

Necessity The need for the graph, and the graphical elements. Is the graph a more useful way to represent the data than alternatives (table, text)? Are all the graph elements necessary to convey the relations?

Truthfulness Ability to determine the true value represented by any graphical element by its magnitude relative to the implicit or explicit scale. Are the graph elements accurately positioned and scaled?

Tufte’s Guidelines Edward R. Tufte has also enumerated the following principles for graphical excellence:


Graphical excellence is the well-designed presentation of interesting data, a matter of substance, of statistics, and of design.

Graphical excellence consists of complex ideas communicated with clarity, precision, and efficiency.

Graphical excellence is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space.

Graphical excellence is nearly always multivariate.

And graphical excellence requires telling the truth about the data.

3.4 OLAP and Multidimensional Data Analysis

In this section, we investigate the techniques and insights that come from viewing data sets as multidimensional arrays. A number of database systems support such a viewpoint, most notably, On-Line Analytical Processing (OLAP) systems. Indeed, some of the terminology and capabilities of OLAP systems have made their way into spreadsheet programs that are used by millions of people. OLAP systems also have a strong focus on the interactive analysis of data and typically provide extensive capabilities for visualizing the data and generating summary statistics. For these reasons, our approach to multidimensional data analysis will be based on the terminology and concepts common to OLAP systems.

3.4.1 Representing Iris Data as a Multidimensional Array

Most data sets can be represented as a table, where each row is an object and each column is an attribute. In many cases, it is also possible to view the data as a multidimensional array. We illustrate this approach by representing the Iris data set as a multidimensional array.

Table 3.7 was created by discretizing the petal length and petal width attributes to have values of low, medium, and high and then counting the number of flowers from the Iris data set that have particular combinations of petal width, petal length, and species type. (For petal width, the categories low, medium, and high correspond to the intervals [0, 0.75), [0.75, 1.75), [1.75, ∞), respectively. For petal length, the categories low, medium, and high correspond to the intervals [0, 2.5), [2.5, 5), [5, ∞), respectively.)
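The discretize-and-count step described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the five sample measurements are hypothetical and are not taken from the Iris data set. Only the interval boundaries come from the text.

```python
from collections import Counter


def discretize(value, low_cut, high_cut):
    """Map a measurement to a category using the half-open intervals
    [0, low_cut), [low_cut, high_cut), [high_cut, infinity)."""
    if value < low_cut:
        return "low"
    if value < high_cut:
        return "medium"
    return "high"


def petal_length_category(cm):
    # Boundaries for petal length given in the text: 2.5 and 5.
    return discretize(cm, 2.5, 5.0)


def petal_width_category(cm):
    # Boundaries for petal width given in the text: 0.75 and 1.75.
    return discretize(cm, 0.75, 1.75)


# Hypothetical (petal length, petal width, species) rows, for illustration only.
flowers = [
    (1.4, 0.2, "Setosa"),
    (1.5, 0.2, "Setosa"),
    (4.7, 1.4, "Versicolour"),
    (5.1, 1.8, "Virginica"),
    (6.0, 2.5, "Virginica"),
]

# Count flowers per (petal length, petal width, species) combination.
counts = Counter(
    (petal_length_category(length), petal_width_category(width), species)
    for length, width, species in flowers
)

for combination, count in sorted(counts.items()):
    print(combination, count)
```

Because a Counter stores only the combinations that actually occur, empty combinations are naturally absent from the result, which mirrors how the fact table is laid out.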


Table 3.7. Number of flowers having a particular combination of petal width, petal length, and species type.

Petal Length   Petal Width   Species Type   Count
low            low           Setosa         46
low            medium        Setosa         2
medium         low           Setosa         2
medium         medium        Versicolour    43
medium         high          Versicolour    3
medium         high          Virginica      3
high           medium        Versicolour    2
high           medium        Virginica      3
high           high          Versicolour    2
high           high          Virginica      44

Figure 3.30. A multidimensional data representation for the Iris data set.


Table 3.8. Cross-tabulation of flowers according to petal length and width for flowers of the Setosa species.

                 width
length     low   medium   high
low        46    2        0
medium     2     0        0
high       0     0        0

Table 3.9. Cross-tabulation of flowers according to petal length and width for flowers of the Versicolour species.

                 width
length     low   medium   high
low        0     0        0
medium     0     43       3
high       0     2        2

Table 3.10. Cross-tabulation of flowers according to petal length and width for flowers of the Virginica species.

                 width
length     low   medium   high
low        0     0        0
medium     0     0        3
high       0     3        44

Empty combinations, that is, combinations that do not correspond to at least one flower, are not shown in Table 3.7.

The data can be organized as a multidimensional array with three dimensions corresponding to petal width, petal length, and species type, as illustrated in Figure 3.30. For clarity, slices of this array are shown as a set of three two-dimensional tables, one for each species; see Tables 3.8, 3.9, and 3.10. The information contained in both Table 3.7 and Figure 3.30 is the same. However, in the multidimensional representation shown in Figure 3.30 (and Tables 3.8, 3.9, and 3.10), the values of the attributes (petal width, petal length, and species type) are array indices.

What is important are the insights that can be gained by looking at data from a multidimensional viewpoint. Tables 3.8, 3.9, and 3.10 show that each species of Iris is characterized by a different combination of values of petal length and width. Setosa flowers have low width and length, Versicolour flowers have medium width and length, and Virginica flowers have high width and length.

3.4.2 Multidimensional Data: The General Case

The previous section gave a specific example of using a multidimensional approach to represent and analyze a familiar data set. Here we describe the general approach in more detail.



The starting point is usually a tabular representation of the data, such as that of Table 3.7, which is called a fact table. Two steps are necessary in order to represent data as a multidimensional array: identification of the dimensions and identification of an attribute that is the focus of the analysis. The dimensions are categorical attributes or, as in the previous example, continuous attributes that have been converted to categorical attributes. The values of an attribute serve as indices into the array for the dimension corresponding to the attribute, and the number of attribute values is the size of that dimension. In the previous example, each attribute had three possible values, and thus, each dimension was of size three and could be indexed by three values. This produced a 3 × 3 × 3 multidimensional array.

Each combination of attribute values (one value for each different attribute) defines a cell of the multidimensional array. To illustrate using the previous example, if petal length = low, petal width = medium, and species = Setosa, a specific cell containing the value 2 is identified. That is, there are only two flowers in the data set that have the specified attribute values. Notice that each row (object) of the data set in Table 3.7 corresponds to a cell in the multidimensional array.

The contents of each cell represent the value of a target quantity (target variable or attribute) that we are interested in analyzing. In the Iris example, the target quantity is the number of flowers whose petal width and length fall within certain limits. The target attribute is quantitative because a key goal of multidimensional data analysis is to look at aggregate quantities, such as totals or averages.

The following summarizes the procedure for creating a multidimensional data representation from a data set represented in tabular form. First, identify the categorical attributes to be used as the dimensions and a quantitative attribute to be used as the target of the analysis. Each row (object) in the table is mapped to a cell of the multidimensional array. The indices of the cell are specified by the values of the attributes that were selected as dimensions, while the value of the cell is the value of the target attribute. Cells not defined by the data are assumed to have a value of 0.
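The procedure just summarized can be sketched as follows, using the Iris fact table with counts as read from Table 3.7. The names `LEVELS`, `SPECIES`, `build_array`, `cell`, and `species_slice` are hypothetical helpers introduced for this sketch, and nested lists stand in for a true array type.

```python
LEVELS = ["low", "medium", "high"]
SPECIES = ["Setosa", "Versicolour", "Virginica"]

# Fact table rows: (petal length, petal width, species, count),
# with counts as read from Table 3.7.
fact_table = [
    ("low", "low", "Setosa", 46),
    ("low", "medium", "Setosa", 2),
    ("medium", "low", "Setosa", 2),
    ("medium", "medium", "Versicolour", 43),
    ("medium", "high", "Versicolour", 3),
    ("medium", "high", "Virginica", 3),
    ("high", "medium", "Versicolour", 2),
    ("high", "medium", "Virginica", 3),
    ("high", "high", "Versicolour", 2),
    ("high", "high", "Virginica", 44),
]


def build_array(rows):
    """Map each fact-table row to a cell of a 3 x 3 x 3 array.
    Cells not defined by the data keep the default value 0."""
    array = [[[0 for _ in SPECIES] for _ in LEVELS] for _ in LEVELS]
    for length, width, species, count in rows:
        i = LEVELS.index(length)
        j = LEVELS.index(width)
        k = SPECIES.index(species)
        array[i][j][k] = count
    return array


def cell(array, length, width, species):
    """Look up one cell by its attribute values (the array indices)."""
    return array[LEVELS.index(length)][LEVELS.index(width)][SPECIES.index(species)]


def species_slice(array, species):
    """Return the two-dimensional length-by-width table for one species."""
    k = SPECIES.index(species)
    return [[array[i][j][k] for j in range(len(LEVELS))] for i in range(len(LEVELS))]


iris_array = build_array(fact_table)
print(cell(iris_array, "low", "medium", "Setosa"))
for row in species_slice(iris_array, "Versicolour"):
    print(row)
```

Fixing the species index and reading out the remaining two dimensions yields a per-species cross-tabulation of petal length against petal width.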

Example 3.23. To further illustrate the ideas just discussed, we present a more traditional example involving the sale of products. The fact table for this example is given by Table 3.11. The dimensions of the multidimensional representation are the product ID, location, and date attributes, while the target attribute is the revenue. Figure 3.31 shows the multidimensional representation of this data set. This larger and more complicated data set will be used to illustrate additional concepts of multidimensional data analysis.


3.4.3 Analyzing Multidimensional Data

In this section, we describe different multidimensional analysis techniques. In particular, we discuss the creation of data cubes, and related operations, such as slicing, dicing, dimensionality reduction, roll-up, and drill down.

Data Cubes: Computing Aggregate Quantities

A key motivation for taking a multidimensional viewpoint of data is the importance of aggregating data in various ways. In the sales example, we might wish to find the total sales revenue for a specific year and a specific product. Or we might wish to see the yearly sales revenue for each location across all products. Computing aggregate totals involves fixing specific values for some of the attributes that are being used as dimensions and then summing over all possible values for the attributes that make up the remaining dimensions. There are other types of aggregate quantities that are also of interest, but for simplicity, this discussion will use totals (sums).

Table 3.12 shows the result of summing over all locations for various combinations of date and product. For simplicity, assume that all the dates are within one year. If there are 365 days in a year and 1000 products, then Table 3.12 has 365,000 entries (totals), one for each product-date pair. We could also specify the store location and date and sum over products, or specify the location and product and sum over all dates.
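As a rough sketch of this kind of aggregation, the following Python fragment fixes some dimensions and sums revenue over the rest. It uses only the six rows that can be read from Table 3.11, so the totals cover those rows alone and not the full sales data. The names `DIMENSIONS` and `aggregate` are hypothetical.

```python
from collections import defaultdict

DIMENSIONS = ("product", "location", "date")

# (product ID, location, date, revenue): only the rows legible in Table 3.11.
sales = [
    (1, "Minneapolis", "Oct. 18, 2004", 250),
    (1, "Chicago", "Oct. 18, 2004", 79),
    (1, "Paris", "Oct. 18, 2004", 301),
    (27, "Minneapolis", "Oct. 18, 2004", 2321),
    (27, "Chicago", "Oct. 18, 2004", 3278),
    (27, "Paris", "Oct. 18, 2004", 1325),
]


def aggregate(rows, keep):
    """Sum revenue over every dimension that is not named in `keep`.
    The result has one total per combination of the kept dimensions."""
    keep_idx = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[i] for i in keep_idx)
        totals[key] += row[-1]  # the last field is the target attribute (revenue)
    return dict(totals)


# Sum over locations: one total per (product, date) pair.
by_product_date = aggregate(sales, keep=("product", "date"))
# Sum over products: one total per (location, date) pair.
by_location_date = aggregate(sales, keep=("location", "date"))

print(by_product_date)
print(by_location_date)
```

Passing a different `keep` tuple selects which dimensions stay fixed, so the same function covers all three ways of summing over a single dimension.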

Table 3.13 shows the marginal totals of Table 3.12. These totals are the result of further summing over either dates or products. In Table 3.13, the total sales revenue due to product 1, which is obtained by summing across row 1 (over all dates), is $370,000. The total sales revenue on January 1, 2004, which is obtained by summing down column 1 (over all products), is $527,362. The total sales revenue, which is obtained by summing over all rows and columns (all times and products), is $227,352,127. All of these totals are for all locations because the entries of Table 3.13 include all locations.
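Marginal totals of a product-by-date table can be computed as in the sketch below. The revenue figures are small hypothetical numbers chosen for easy checking and are not the values of Table 3.12 or Table 3.13.

```python
# Hypothetical revenue[product][date] table, already summed over locations.
revenue = {
    1: {"Jan 1": 10, "Jan 2": 20, "Jan 3": 30},
    2: {"Jan 1": 5, "Jan 2": 15, "Jan 3": 25},
}

dates = ["Jan 1", "Jan 2", "Jan 3"]

# Row marginals: sum across each row (over all dates) for each product.
row_totals = {product: sum(by_date.values()) for product, by_date in revenue.items()}

# Column marginals: sum down each column (over all products) for each date.
column_totals = {date: sum(revenue[product][date] for product in revenue) for date in dates}

# Grand total: sum over all rows and columns.
grand_total = sum(row_totals.values())

print(row_totals)
print(column_totals)
print(grand_total)
```

The row marginals and the column marginals must each add up to the same grand total, which is a convenient consistency check on any table of this kind.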

Table 3.11. Sales revenue of products (in dollars) for various locations and times.

Product ID   Location      Date            Revenue
...          ...           ...             ...
1            Minneapolis   Oct. 18, 2004   $250
1            Chicago       Oct. 18, 2004   $79
...          ...           ...             ...
1            Paris         Oct. 18, 2004   $301
...          ...           ...             ...
27           Minneapolis   Oct. 18, 2004   $2,321
27           Chicago       Oct. 18, 2004   $3,278
...          ...           ...             ...
27           Paris         Oct. 18, 2004   $1,325
...          ...           ...             ...

Figure 3.31. Multidimensional data representation for sales data.

Table 3.12. Totals that result from summing over all locations for a fixed time and product.

Table 3.13. Table 3.12 with marginal totals.
[Table body not recovered. Column headings: date (Jan 1, 2004; Jan 2, 2004; ...; Dec 31, 2004) and total.]

A key point of this example is that there are a number of different totals (aggregates) that can be computed for a multidimensional array, depending on how many attributes we sum over. Assume that there are $n$ dimensions and that the $i$th dimension (attribute) has $s_i$ possible values. There are $n$ different ways to sum only over a single attribute. If we sum over dimension $j$, then we obtain $s_1 \times \cdots \times s_{j-1} \times s_{j+1} \times \cdots \times s_n$ totals, one for each possible combination of attribute values of the $n-1$ other attributes (dimensions). The totals that result from summing over one attribute form a multidimensional array of $n-1$ dimensions, and there are $n$ such arrays of totals. In the sales example, there are three sets of totals that result from summing over only one dimension, and each set of totals can be displayed as a two-dimensional table.

If we sum over two dimensions (perhaps starting with one of the arrays of totals obtained by summing over one dimension), then we will obtain a multidimensional array of totals with $n-2$ dimensions. There will be $\binom{n}{2}$ distinct arrays of such totals. For the sales example, there will be $\binom{3}{2} = 3$ arrays of totals that result from summing over location and product, location and time, or product and time. In general, summing over $k$ dimensions yields $\binom{n}{k}$ arrays of totals, each with dimension $n-k$.
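These counts can be checked with a short Python sketch. The product and date sizes (1000 and 365) come from the text; the number of locations (10) is an assumption made only so the example is concrete. The function names are hypothetical.

```python
from itertools import combinations
from math import comb, prod

# Dimension sizes: 1000 products and 365 dates are from the text;
# the 10 locations are an assumed value for illustration.
sizes = {"product": 1000, "location": 10, "date": 365}


def arrays_of_totals(dimension_names, k):
    """List every way of choosing k dimensions to sum over.
    Each choice gives one array of totals with n - k dimensions."""
    return list(combinations(dimension_names, k))


def totals_in_array(dimension_sizes, summed):
    """Number of totals left after summing over the dimensions in `summed`:
    the product of the sizes of all remaining dimensions."""
    return prod(size for name, size in dimension_sizes.items() if name not in summed)


n = len(sizes)
for k in range(1, n + 1):
    choices = arrays_of_totals(list(sizes), k)
    # The number of arrays of totals equals the binomial coefficient C(n, k).
    assert len(choices) == comb(n, k)
    for summed in choices:
        print(k, summed, totals_in_array(sizes, summed))
```

Summing over location alone leaves 1000 × 365 = 365,000 totals, which agrees with the product-date count given for Table 3.12. Summing over all three dimensions leaves a single grand total.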