A multidimensional representation of the data, together with all possible totals (aggregates), is known as a data cube. Despite the name, the sizes of the dimensions (the number of attribute values in each) need not be equal. Also, a data cube may have either more or fewer than three dimensions. More importantly, a data cube is a generalization of what is known in statistical terminology as a cross-tabulation. If marginal totals were added, Tables 3.8, 3.9, or 3.10 would be typical examples of cross-tabulations.
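As a small illustration of adding marginal totals to a cross-tabulation, the following Python sketch uses a made-up 2 x 3 table of counts. The values are not those of Tables 3.8 to 3.10, and the variable names are assumptions chosen only for this example.

```python
# Made-up 2 x 3 table of counts; rows and columns stand for the values
# of two attributes. These are NOT the numbers from Tables 3.8-3.10.
counts = [
    [4, 1, 0],
    [2, 3, 1],
]

# Marginal totals: sum each row, sum each column, and sum everything.
row_totals = [sum(row) for row in counts]
column_totals = [sum(column) for column in zip(*counts)]
grand_total = sum(row_totals)

# Print the cross-tabulation with its margins.
for row, total in zip(counts, row_totals):
    print(*row, "|", total)
print(*column_totals, "|", grand_total)
```

The last row and last column printed here are the marginal totals; a data cube stores the analogous totals for every combination of dimensions.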


138 Chapter 3 Exploring Data

Dimensionality Reduction and Pivoting

The aggregation described in the last section can be viewed as a form of dimensionality reduction. Specifically, the jth dimension is eliminated by summing over it. Conceptually, this collapses each “column” of cells in the jth dimension into a single cell. For both the sales and Iris examples, aggregating over one dimension reduces the dimensionality of the data from 3 to 2. If s_j is the number of possible values of the jth dimension, the number of cells is reduced by a factor of s_j. Exercise 17 on page 143 asks the reader to explore the difference between this type of dimensionality reduction and that of PCA.
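The effect of summing over one dimension can be sketched in a few lines of Python. The cube below is hypothetical (2 dates x 3 products x 2 locations, with made-up numbers), and the helper names are assumptions introduced for this illustration only.

```python
# Hypothetical 3-D sales cube indexed as cube[date][product][location].
# The numbers are made up purely for illustration.
cube = [
    [[1, 2], [3, 4], [5, 6]],     # date 0: 3 products x 2 locations
    [[7, 8], [9, 10], [11, 12]],  # date 1
]


def sum_over_last_dimension(array3d):
    """Eliminate the last dimension by summing each 'column' of cells."""
    return [[sum(column) for column in plane] for plane in array3d]


def count_cells(nested):
    """Count the leaf cells of a nested list."""
    if isinstance(nested, list):
        return sum(count_cells(item) for item in nested)
    return 1


reduced = sum_over_last_dimension(cube)   # now indexed as [date][product]
cells_before = count_cells(cube)          # 2 * 3 * 2 cells
cells_after = count_cells(reduced)        # 2 * 3 cells
reduction_factor = cells_before // cells_after

print(reduced)
print(cells_before, cells_after, reduction_factor)
```

Here the eliminated dimension (location) has two possible values, so the number of cells drops by a factor of two, from 12 to 6.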

Pivoting refers to aggregating over all dimensions except two. The result is a two-dimensional cross tabulation with the two specified dimensions as the only remaining dimensions. Table 3.13 is an example of pivoting on date and product.
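A minimal sketch of pivoting, assuming the data are held as a list of (date, product, location, units sold) records. The records, the field order, and the function name pivot are all hypothetical and are not taken from Table 3.13.

```python
from collections import defaultdict

# Hypothetical fact table: (date, product, location, units_sold).
sales = [
    ("2004-01-01", "shirt", "Minneapolis", 3),
    ("2004-01-01", "shirt", "Chicago", 2),
    ("2004-01-01", "shoes", "Chicago", 1),
    ("2004-01-02", "shirt", "Minneapolis", 4),
    ("2004-01-02", "shoes", "Minneapolis", 5),
    ("2004-01-02", "shoes", "Chicago", 6),
]


def pivot(records, row_index, col_index, value_index):
    """Sum the value field over every dimension except the two given ones."""
    table = defaultdict(int)
    for record in records:
        table[(record[row_index], record[col_index])] += record[value_index]
    return dict(table)


# Pivot on date (field 0) and product (field 1); location is summed away.
by_date_product = pivot(sales, 0, 1, 3)

for (date, product), units in sorted(by_date_product.items()):
    print(date, product, units)
```

The result is a two-dimensional table keyed by (date, product); every other dimension has been aggregated out.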

Slicing and Dicing

These two colorful names refer to rather straightforward operations. Slicing is selecting a group of cells from the entire multidimensional array by specifying a specific value for one or more dimensions. Tables 3.8, 3.9, and 3.10 are three slices from the Iris set that were obtained by specifying three separate values for the species dimension. Dicing involves selecting a subset of cells by specifying a range of attribute values. This is equivalent to defining a subarray from the complete array. In practice, both operations can also be accompanied by aggregation over some dimensions.
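The two operations can be sketched as follows, assuming the cube is stored as a dictionary that maps (date, product, location) coordinates to a value. The data and the function names slice_cube and dice_cube are hypothetical.

```python
# Hypothetical cube stored as {(date, product, location): units_sold}.
cube = {
    ("2004-01-01", "shirt", "Chicago"): 2,
    ("2004-01-01", "shoes", "Chicago"): 1,
    ("2004-01-02", "shirt", "Chicago"): 4,
    ("2004-01-02", "shoes", "Minneapolis"): 5,
    ("2004-01-03", "shirt", "Minneapolis"): 7,
}


def slice_cube(cells, dimension, value):
    """Slicing: keep cells whose coordinate on `dimension` equals `value`."""
    return {key: v for key, v in cells.items() if key[dimension] == value}


def dice_cube(cells, dimension, low, high):
    """Dicing: keep cells whose coordinate on `dimension` is in [low, high]."""
    return {key: v for key, v in cells.items() if low <= key[dimension] <= high}


# Slice: fix the location dimension (index 2) at a single value.
chicago_slice = slice_cube(cube, 2, "Chicago")

# Dice: restrict the date dimension (index 0) to a range of values.
# ISO-formatted date strings compare correctly as plain strings.
early_dice = dice_cube(cube, 0, "2004-01-01", "2004-01-02")

print(len(chicago_slice), len(early_dice))
```

Either result could then be aggregated over its remaining dimensions, as noted above.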

Roll-Up and Drill-Down

In Chapter 2, attribute values were regarded as being “atomic” in some sense. However, this is not always the case. In particular, each date has a number of properties associated with it, such as the year, month, and week. The date can also be identified as belonging to a particular business quarter or, if the application relates to education, a school quarter or semester. A location also has various properties: continent, country, state (province, etc.), and city. Products can also be divided into various categories, such as clothing, electronics, and furniture.

Often these categories can be organized as a hierarchical tree or lattice. For instance, years consist of months or weeks, both of which consist of days. Locations can be divided into nations, which contain states (or other units of local government), which in turn contain cities. Likewise, any category of products can be further subdivided. For example, the product category furniture can be subdivided into the subcategories chairs, tables, sofas, etc.

This hierarchical structure gives rise to the roll-up and drill-down operations. To illustrate, starting with the original sales data, which is a multidimensional array with entries for each date, we can aggregate (roll up) the sales across all the dates in a month. Conversely, given a representation of the data where the time dimension is broken into months, we might want to split the monthly sales totals (drill down) into daily sales totals. Of course, this requires that the underlying sales data be available at a daily granularity.

Thus, roll-up and drill-down operations are related to aggregation. Notice, however, that they differ from the aggregation operations discussed until now in that they aggregate cells within a dimension, not across the entire dimension.
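A minimal sketch of rolling daily totals up to months and drilling back down, assuming the daily data are still available. The dates, totals, and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical daily sales totals keyed by ISO date string (YYYY-MM-DD).
daily_sales = {
    "2004-01-01": 10,
    "2004-01-02": 15,
    "2004-02-01": 7,
    "2004-02-15": 3,
}


def roll_up_to_month(daily):
    """Roll up: aggregate the days within each month into one total."""
    monthly = defaultdict(int)
    for date, total in daily.items():
        monthly[date[:7]] += total  # "YYYY-MM" identifies the month
    return dict(monthly)


def drill_down(daily, month):
    """Drill down: recover the daily totals that make up one month.

    This only works because the daily-granularity data were kept.
    """
    return {date: total for date, total in daily.items()
            if date.startswith(month)}


monthly_sales = roll_up_to_month(daily_sales)
january_days = drill_down(daily_sales, "2004-01")

print(monthly_sales)
print(january_days)
```

Note that the date dimension is not eliminated by the roll-up; it is only coarsened from days to months, which is the sense in which cells are aggregated within a dimension.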

3.4.4 Final Comments on Multidimensional Data Analysis

Multidimensional data analysis, in the sense implied by OLAP and related systems, consists of viewing the data as a multidimensional array and aggregating data in order to better analyze the structure of the data. For the Iris data, the differences in petal width and length are clearly shown by such an analysis. The analysis of business data, such as sales data, can also reveal many interesting patterns, such as profitable (or unprofitable) stores or products.

As mentioned, there are various types of database systems that support the analysis of multidimensional data. Some of these systems are based on relational databases and are known as ROLAP systems. More specialized database systems that specifically employ a multidimensional data representation as their fundamental data model have also been designed. Such systems are known as MOLAP systems. In addition to these types of systems, statistical databases (SDBs) have been developed to store and analyze various types of statistical data, e.g., census and public health data, that are collected by governments or other large organizations. References to OLAP and SDBs are provided in the bibliographic notes.

3.5 Bibliographic Notes

Summary statistics are discussed in detail in most introductory statistics books, such as [92]. References for exploratory data analysis are the classic text by Tukey [104] and the book by Velleman and Hoaglin [105].

The basic visualization techniques are readily available, being an integral part of most spreadsheets (Microsoft EXCEL [95]), statistics programs (SAS [99], SPSS [102], R [96], and S-PLUS [98]), and mathematics software (MATLAB [94] and Mathematica [93]). Most of the graphics in this chapter were generated using MATLAB. The statistics package R is freely available as an open source software package from the R project.

The literature on visualization is extensive, covering many fields and many decades. One of the classics of the field is the book by Tufte [103]. The book by Spence [101], which strongly influenced the visualization portion of this chapter, is a useful reference for information visualization, both principles and techniques. This book also provides a thorough discussion of many dynamic visualization techniques that were not covered in this chapter. Two other books on visualization that may also be of interest are those by Card et al. [87] and Fayyad et al. [89].

Finally, there is a great deal of information available about data visualization on the World Wide Web. Since Web sites come and go frequently, the best strategy is a search using “information visualization,” “data visualization,” or “statistical graphics.” However, we do want to single out for attention “The Gallery of Data Visualization,” by Friendly [90]. The ACCENT Principles for effective graphical display as stated in this chapter can be found there, or as originally presented in the article by Burn [86].

There are a variety of graphical techniques that can be used to explore whether the distribution of the data is Gaussian or some other specified distribution. Also, there are plots that display whether the observed values are statistically significant in some sense. We have not covered any of these techniques here and refer the reader to the previously mentioned statistical and mathematical packages.

Multidimensional analysis has been around in a variety of forms for some time. One of the original papers was a white paper by Codd [88], the father of relational databases. The data cube was introduced by Gray et al. [91], who described various operations for creating and manipulating data cubes within a relational database framework. A comparison of statistical databases and OLAP is given by Shoshani [100]. Specific information on OLAP can be found in documentation from database vendors and many popular books. Many database textbooks also have general discussions of OLAP, often in the context of data warehousing. For example, see the text by Ramakrishnan and Gehrke [97].

Bibliography

[86] D. A. Burn. Designing Effective Statistical Graphs. In C. R. Rao, editor, Handbook of Statistics 9. Elsevier/North-Holland, Amsterdam, The Netherlands, September 1993.


[87] S. K. Card, J. D. MacKinlay, and B. Shneiderman, editors. Readings in Information Visualization: Using Vision to Think. Morgan Kaufmann Publishers, San Francisco, CA, January 1999.

[88] E. F. Codd, S. B. Codd, and C. T. Smalley. Providing OLAP (On-line Analytical Processing) to User-Analysts: An IT Mandate. White Paper, E.F. Codd and Associates, 1993.

[89] U. M. Fayyad, G. G. Grinstein, and A. Wierse, editors. Information Visualization in Data Mining and Knowledge Discovery. Morgan Kaufmann Publishers, San Francisco, CA, September 2001.

[90] M. Friendly. Gallery of Data Visualization. http://www.math.yorku.ca/SCS/Gallery/, 2005.

[91] J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh. Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals. Journal Data Mining and Knowledge Discovery, 1(1):29-53, 1997.

[92] B. W. Lindgren. Statistical Theory. CRC Press, January 1993.

[93] Mathematica 5.1. Wolfram Research, Inc. http://www.wolfram.com/, 2005.

[94] MATLAB 7.0. The MathWorks, Inc. http://www.mathworks.com, 2005.

[95] Microsoft Excel 2003. Microsoft, Inc. http://www.microsoft.com/, 2003.

[96] R: A language and environment for statistical computing and graphics. The R Project for Statistical Computing. http://www.r-project.org/, 2005.

[97] R. Ramakrishnan and J. Gehrke. Database Management Systems. McGraw-Hill, 3rd edition, August 2002.

[98] S-PLUS. Insightful Corporation. http://www.insightful.com, 2005.

[99] SAS: Statistical Analysis System. SAS Institute Inc. http://www.sas.com/, 2005.

[100] A. Shoshani. OLAP and statistical databases: similarities and differences. In Proc. of the Sixteenth ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems, pages 185-196. ACM Press, 1997.

[101] R. Spence. Information Visualization. ACM Press, New York, December 2000.

[102] SPSS: Statistical Package for the Social Sciences. SPSS, Inc. http://www.spss.com/, 2005.

[103] E. R. Tufte. The Visual Display of Quantitative Information. Graphics Press, Cheshire, CT, March 1986.

[104] J. W. Tukey. Exploratory data analysis. Addison-Wesley, 1977.

[105] P. Velleman and D. Hoaglin. The ABC’s of EDA: Applications, Basics, and Computing of Exploratory Data Analysis. Duxbury, 1981.

3.6 Exercises

1. Obtain one of the data sets available at the UCI Machine Learning Repository and apply as many of the different visualization techniques described in the chapter as possible. The bibliographic notes and book Web site provide pointers to visualization software.


2. Identify at least two advantages and two disadvantages of using color to visually represent information.

3. What are the arrangement issues that arise with respect to three-dimensional plots?

4. Discuss the advantages and disadvantages of using sampling to reduce the number of data objects that need to be displayed. Would simple random sampling (without replacement) be a good approach to sampling? Why or why not?

5. Describe how you would create visualizations to display information that describes the following types of systems.

(a) Computer networks. Be sure to include both the static aspects of the network, such as connectivity, and the dynamic aspects, such as traffic.

(b) The distribution of specific plant and animal species around the world for a specific moment in time.

(c) The use of computer resources, such as processor time, main memory, and disk, for a set of benchmark database programs.

(d) The change in occupation of workers in a particular country over the last thirty years. Assume that you have yearly information about each person that also includes gender and level of education.

Be sure to address the following issues:

• Representation. How will you map objects, attributes, and relationships to visual elements?

• Arrangement. Are there any special considerations that need to be taken into account with respect to how visual elements are displayed? Specific examples might be the choice of viewpoint, the use of transparency, or the separation of certain groups of objects.

• Selection. How will you handle a large number of attributes and data objects?

6. Describe one advantage and one disadvantage of a stem and leaf plot with respect to a standard histogram.

7. How might you address the problem that a histogram depends on the number and location of the bins?

8. Describe how a box plot can give information about whether the value of an attribute is symmetrically distributed. What can you say about the symmetry of the distributions of the attributes shown in Figure 3.11?

9. Compare sepal length, sepal width, petal length, and petal width, using Figure 3.12.