Practical considerations can also be important. Sometimes, one or more proximity measures are already in use in a particular field, and thus, others will have answered the question of which proximity measures should be used. Other times, the software package or clustering algorithm being used may drastically limit the choices. If efficiency is a concern, then we may want to choose a proximity measure that has a property, such as the triangle inequality, that can be used to reduce the number of proximity calculations. (See Exercise 25.)
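For instance, the following sketch (our own illustration; the function names and the use of a single reference point are assumptions, not a prescribed method) uses the triangle inequality to skip distance computations in a nearest-neighbor search. Because |d(q, r) - d(x, r)| <= d(q, x) for any metric, a candidate x can be discarded without computing d(q, x) whenever that lower bound already exceeds the best distance found so far:

```python
import math

def dist(a, b):
    # Euclidean distance, a metric, so the triangle inequality holds
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(query, points, ref):
    """Find the nearest neighbor of `query` among `points`, skipping any
    candidate whose triangle-inequality lower bound exceeds the current best."""
    d_to_ref = [dist(p, ref) for p in points]  # computed once per reference point
    d_query_ref = dist(query, ref)
    best, best_d = None, float("inf")
    skipped = 0
    for p, d_pr in zip(points, d_to_ref):
        # Triangle inequality: |d(query, ref) - d(p, ref)| <= d(query, p)
        if abs(d_query_ref - d_pr) >= best_d:
            skipped += 1  # d(query, p) cannot beat best_d, so it is never computed
            continue
        d_qp = dist(query, p)
        if d_qp < best_d:
            best, best_d = p, d_qp
    return best, best_d, skipped
```

With points spread along a line and a reference point at one end, most distance computations are pruned; the savings grow as the current best distance shrinks.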

However, if common practice or practical restrictions do not dictate a choice, then the proper choice of a proximity measure can be a time-consuming task that requires careful consideration of both domain knowledge and the purpose for which the measure is being used. A number of different similarity measures may need to be evaluated to see which ones produce results that make the most sense.

2.5 Bibliographic Notes

It is essential to understand the nature of the data that is being analyzed, and at a fundamental level, this is the subject of measurement theory. In particular, one of the initial motivations for defining types of attributes was to be precise about which statistical operations were valid for what sorts of data. We have presented the view of measurement theory that was initially described in a classic paper by S. S. Stevens [79]. (Tables 2.2 and 2.3 are derived from those presented by Stevens [80].) While this is the most common view and is reasonably easy to understand and apply, there is, of course, much more to measurement theory. An authoritative discussion can be found in a three-volume series on the foundations of measurement theory [63, 69, 81]. Also of interest is a wide-ranging article by Hand [55], which discusses measurement theory and statistics, and is accompanied by comments from other researchers in the field. Finally, there are many books and articles that describe measurement issues for particular areas of science and engineering.

Data quality is a broad subject that spans every discipline that uses data. Discussions of precision, bias, accuracy, and significant figures can be found in many introductory science, engineering, and statistics textbooks. The view of data quality as “fitness for use” is explained in more detail in the book by Redman [76]. Those interested in data quality may also be interested in MIT’s Total Data Quality Management program [70, 84]. However, the knowledge needed to deal with specific data quality issues in a particular domain is often best obtained by investigating the data quality practices of researchers in that field.

Aggregation is a less well-defined subject than many other preprocessing tasks. However, aggregation is one of the main techniques used by the database area of Online Analytical Processing (OLAP), which is discussed in Chapter 3. There has also been relevant work in the area of symbolic data analysis (Bock and Diday [47]). One of the goals in this area is to summarize traditional record data in terms of symbolic data objects whose attributes are more complex than traditional attributes. Specifically, these attributes can have values that are sets of values (categories), intervals, or sets of values with weights (histograms). Another goal of symbolic data analysis is to be able to perform clustering, classification, and other kinds of data analysis on data that consists of symbolic data objects.

Sampling is a subject that has been well studied in statistics and related fields. Many introductory statistics books, such as the one by Lindgren [65], have some discussion on sampling, and there are entire books devoted to the subject, such as the classic text by Cochran [49]. A survey of sampling for data mining is provided by Gu and Liu [54], while a survey of sampling for databases is provided by Olken and Rotem [72]. There are a number of other data mining and database-related sampling references that may be of interest, including papers by Palmer and Faloutsos [74], Provost et al. [75], Toivonen [82], and Zaki et al. [85].

In statistics, the traditional techniques that have been used for dimensionality reduction are multidimensional scaling (MDS) (Borg and Groenen [48], Kruskal and Uslaner [64]) and principal component analysis (PCA) (Jolliffe [58]), which is similar to singular value decomposition (SVD) (Demmel [50]). Dimensionality reduction is discussed in more detail in Appendix B.
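To make the PCA-SVD connection concrete, here is a minimal NumPy sketch (our own illustration, not drawn from the cited references): the right singular vectors of a centered data matrix are the principal components, and the squared singular values, scaled by n - 1, are the component variances.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))     # 50 observations, 3 attributes
Xc = X - X.mean(axis=0)          # center each attribute (required for PCA)

# SVD of the centered data matrix: Xc = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Rows of Vt are the principal components (eigenvectors of the covariance
# matrix), and s**2 / (n - 1) are the corresponding variances (eigenvalues).
variances = s ** 2 / (len(Xc) - 1)

# Project onto the first two components to reduce 3 dimensions to 2.
X_reduced = Xc @ Vt[:2].T
```

Computing PCA this way avoids forming the covariance matrix explicitly, which is numerically preferable when attributes are nearly collinear.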

Discretization is a topic that has been extensively investigated in data mining. Some classification algorithms only work with categorical data, and association analysis requires binary data, and thus, there is a significant motivation to investigate how to best binarize or discretize continuous attributes. For association analysis, we refer the reader to work by Srikant and Agrawal [78], while some useful references for discretization in the area of classification include work by Dougherty et al. [51], Elomaa and Rousu [52], Fayyad and Irani [53], and Hussain et al. [56].
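As a simple point of reference for the cited methods, equal-width binning, one of the most basic unsupervised discretization schemes (shown here purely as our own illustration, not as a method from those papers), and the subsequent binarization for association analysis look like this:

```python
import numpy as np

values = np.array([2.0, 3.5, 4.1, 7.8, 9.9, 15.2, 18.4, 21.0])

# Equal-width discretization into 3 intervals. The cited work studies more
# sophisticated, often supervised, alternatives to this naive scheme.
n_bins = 3
edges = np.linspace(values.min(), values.max(), n_bins + 1)
# digitize assigns each value to an interval; clip keeps the maximum in the last bin
codes = np.clip(np.digitize(values, edges) - 1, 0, n_bins - 1)

# Binarization for association analysis: one binary attribute per interval
binary = np.eye(n_bins, dtype=int)[codes]
```

Each continuous value becomes an interval code, and each code becomes a row with exactly one nonzero entry, which is the binary form association analysis requires.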

Feature selection is another topic well investigated in data mining. A broad coverage of this topic is provided in a survey by Molina et al. [71] and two books by Liu and Motoda [66, 67]. Other useful papers include those by Blum and Langley [46], Kohavi and John [62], and Liu et al. [68].

It is difficult to provide references for the subject of feature transformations because practices vary from one discipline to another. Many statistics books have a discussion of transformations, but typically the discussion is restricted to a particular purpose, such as ensuring the normality of a variable or making sure that variables have equal variance. We offer two references: Osborne [73] and Tukey [83].

While we have covered some of the most commonly used distance and similarity measures, there are hundreds of such measures and more are being created all the time. As with so many other topics in this chapter, many of these measures are specific to particular fields; e.g., in the area of time series see papers by Kalpakis et al. [59] and Keogh and Pazzani [61]. Clustering books provide the best general discussions. In particular, see the books by Anderberg [45], Jain and Dubes [57], Kaufman and Rousseeuw [60], and Sneath and Sokal [77].

Bibliography

[45] M. R. Anderberg. Cluster Analysis for Applications. Academic Press, New York, December 1973.

[46] A. Blum and P. Langley. Selection of Relevant Features and Examples in Machine Learning. Artificial Intelligence, 97(1-2):245-271, 1997.


[47] H. H. Bock and E. Diday. Analysis of Symbolic Data: Exploratory Methods for Extracting Statistical Information from Complex Data (Studies in Classification, Data Analysis, and Knowledge Organization). Springer-Verlag Telos, January 2000.

[48] I. Borg and P. Groenen. Modern Multidimensional Scaling: Theory and Applications. Springer-Verlag, February 1997.

[49] W. G. Cochran. Sampling Techniques. John Wiley & Sons, 3rd edition, July 1977.

[50] J. W. Demmel. Applied Numerical Linear Algebra. Society for Industrial & Applied Mathematics, September 1997.

[51] J. Dougherty, R. Kohavi, and M. Sahami. Supervised and Unsupervised Discretization of Continuous Features. In Proc. of the 12th Intl. Conf. on Machine Learning, pages 194-202, 1995.

[52] T. Elomaa and J. Rousu. General and Efficient Multisplitting of Numerical Attributes. Machine Learning, 36(3):201-244, 1999.

[53] U. M. Fayyad and K. B. Irani. Multi-interval discretization of continuous-valued attributes for classification learning. In Proc. 13th Int. Joint Conf. on Artificial Intelligence, pages 1022-1027. Morgan Kaufmann, 1993.

[54] B. Gu, F. Hu, and H. Liu. Sampling and Its Application in Data Mining: A Survey. Technical Report TRA6/00, National University of Singapore, Singapore, 2000.

[55] D. J. Hand. Statistics and the Theory of Measurement. Journal of the Royal Statistical Society: Series A (Statistics in Society), 159(3):445-492, 1996.

[56] F. Hussain, H. Liu, C. L. Tan, and M. Dash. Discretization: An Enabling Technique. Technical Report TRC6/99, National University of Singapore, Singapore, 1999.

[57] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall Advanced Reference Series. Prentice Hall, March 1988. Book available online at http://www.cse.msu.edu/~jain/Clustering-Jain-Dubes.pdf.

[58] I. T. Jolliffe. Principal Component Analysis. Springer-Verlag, 2nd edition, October 2002.

[59] K. Kalpakis, D. Gada, and V. Puttagunta. Distance Measures for Effective Clustering of ARIMA Time-Series. In Proc. of the 2001 IEEE Intl. Conf. on Data Mining, pages 273-280. IEEE Computer Society, 2001.

[60] L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley Series in Probability and Statistics. John Wiley and Sons, New York, November 1990.