Characteristics of the Data Sets and Attributes As discussed in the

introduction, the type of data set and attributes can dictate the type of algo-

rithm to use. For instance, the K-means algorithm can only be used on data

for which an appropliate proximity measure is available that allows meaning-

ful computation of a cluster centroid. For other clustering techniques, such

as many agglomerative hierarchical approaches, the underlying nature of the

data sets and attributes is less important as long as a proximity matrix can

be created.

Noise and Outliers Noise and outliers are particularly important aspects

of the data. We have tried to indicate the effect of noise and outliers on the

various clustering algorithms that we have discussed. In practice, however, it

may be difficult to evaluate the amount of noise in the data set or the number

of outliers. More than that, what is noise oI an outlier to one person may

be interesting to another pelson. For example, if we are using clustering to

segment an area into regions of different population density, we do not want

to use a density-based clustering technique, such as DBSCAN, that assumes

that regions or points with density lower than a global threshold are noise or

outliers. As another example, hierarchical clustering schemes, such as CURE,

often discard clusters of points that are growing slowly since such groups tend

to represent outliers. However, in some applications we may be most interested

in relatively small clustersl e.g., in market segmentation, such groups might

represent the most profitable customers.

Number of Data Objects We have considered how clustering is affected

by the number of data objects in considerable detail in previous sections. We

reiterate, however, that this factor often plays an important role in determining

the type of clustering algorithm to be used. Suppose that we want to create

a hierarchical clustering of a set of data, we are not interested in a complete

hierarchy that extends all the way to individual objects, but only to the point

at which we have split the data into a few hundred clusters. If the data is

very large, we cannot directly apply an agglomerative hierarchical clustering

technique. We could, however, use a divisive clustering technique, such as

the minimum spanning tree (MST) algorithm, which is the divisive analog to

single link, but this would only work if the data set is not too large. Bisecting

K-means would also work for many data sets, but if the data set is large enough

that it cannot be contained completely in memory, then this scheme also runs

into problems. In this situation, a technique such as BIRCH, which does not

require that all data be in main memory, becomes more useful’

642 Chapter 9 Cluster Analysis: Additional Issues and Algorithms

Number of Attributes We have also discussed the impact of dimension- ality at some length. Again, the key point is to realize that an algorithm that works well in low or moderate dimensions may not work well in high dimensions. As in many other cases in which a clustering algorithm is inap- propriately applied, the clustering algorithm may run and produce clusters, but the clusters may not represent the true structure of the data.

Cluster Description One aspect of clustering techniques that is often over- looked is how the resulting clusters are described. Prototype clusters are suc- cinctly described by a small set of cluster prototypes. In the case of mixture models, the clusters are described in terms of small sets of parameters, such as the mean vector and the covariance matrix. This is also a very compact and understandable representation. For SOM, it is typically possible to visualize the relationships between clusters in a two-dimensional plot, such as that of Figure 9.8. For graph- and density-based clustering approaches, however, clus- ters are typically described as sets of cluster members. Nonetheless, in CURE, clusters can be described by a (relatively) small set of representative points. Also, for grid-based clustering schemes, such as CLIQUE, more compact de- scriptions can be generated in terms of conditions on the attribute values that describe the grid cells in the cluster.

Algorithmic Considerations There are also important aspects of algo- rithms that need to be considered. Is the algorithm non-deterministic or order-dependent? Does the algorithm automatically determine the number of clusters? Is there a technique for determining the values of various pa- rameters? Many clustering algorithms try to solve the clustering problem by trying to optimize an objective function. Is the objective a good match for the application objective? If not, then even if the algorithm does a good job of finding a clustering that is optimal or close to optimal with respect to the objective function, the result is not meaningful. Also, most objective functions give preference to larger clusters at the expense of smaller clusters.

Sumrnary The task of choosing the proper clustering algorithm involves considering all of these issues, and domain-specific issues as well. There is no formula for determining the proper technique. Nonetheless, a general knowl- edge of the types of clustering techniques that are available and consideration of the issues mentioned above, together with a focus on the intended appli- cation, should allow a data analyst to make an informed decision on which clustering approach (or approaches) to try.

BibliographicNotes 643

9.7 Bibliographic Notes

An extensive discussion of fiizzy clustering, including a description of finzy

c-means and formal derivations of the formulas presented in Section 9.2.7, can

be found in the book on fuzzy cluster analysis by Hcippner et al. f4al]. While not discussed in this chapter, AutoClass by Cheeseman et al. 1424] is one of the earliest and most prominent mixture-model clustering programs. An introduction to mixture models can be found in the tutorial of Bilmes [420], the book by Mitchell [450] (which also describes how the K-means algorithm can be derived from a mixture model approach), and the article by Fraley and

Raftery [429]. Besides data exploration, SOM and its supervised learning variant, Learn-

ing Vector Quantization (LVQ), have been used for many tasks: image segmen-

tation, organization of document files, and speech processing. Our discussion of SOM was cast in the terminology of prototype-based clustering. The book

on SOM by Kohonen et al. 14471 contains an extensive introduction to SOM that emphasizes its neural network origins, as well as a discussion of some of its variations and applications. One important SOM-related clustering devel- opment is the Generative Topographic Map (GTM) algorithm by Bishop et

al. [421], which uses the EM algorithm to find Gaussian models satisfying two-dimensional topographic constraints.

The description of Chameleon can be found in the paper by Karypis et

al. [ 5]. Capabilities similar, although not identical to those of Chameleon have been implemented in the CLUTO clustering package by Karypis [425]. The METIS graph partitioning package by Karypis and Kumar [446] is used to perform graph partitioning in both programs, as well as in the OPOSSUM clustering algorithm by Strehl and Ghosh 1459]. The notion of SNN similarity

was introduced by Jarvis and Patrick 1442). A hierarchical clustering scheme

based on a similar concept of mutual nearest neighbors was proposed by Gowda and Krishna la3\. Guha et al. [437] created ROCK, a hierarchical graph-

based clustering algorithm for clustering transaction data, which among other interesting features, also uses a notion of similarity based on shared neighbors

that closely resembles the SNN similarity developed by Jarvis and Patrick. A

description of the SNN density-based clustering technique can be found in the publications of Ertoz et al. 1426,427]. SNN density-based clustering was used

by Steinbach et al. l457l to find climate indices. Examples of grid-based clustering algorithms are OptiGrid (Hinneburg and

Keim [4a0]), the BANG clustering system (Schikuta and Erhart [a55]), and

WaveCluster (Sheikholeslami et al. [456]). The CLIQUtr algorithm is de-

scribed in the paper by Guha et al. [418]. MAFIA (Nagesh et al. [452]) is

9 .7

644 Chapter 9 Cluster Analysis: Additional Issues and Alsorithms

a modification of CLIQUE whose goal is improved efficiency. Kailing et al.

14441 have developed SUBCLU (density-connected SUBspace ClUstering), a subspace clustering algorithm based on DBSCAN. The DENCLUtr algorithm was proposed by Hinneburg and Keim [439].

Our discussion of scalability was strongly influenced bv the article of Ghosh

1432]. A wide-ranging discussion of specific techniques for clustering massive data sets can be found in the paper by Murtagh [451]. CURE is work by Guha et al. [436], while details of BIRCH are in the paper by Zhang et al. [460]. CLARANS (Ng and Han [453]) is an algorithm for scaling K-medoid clustering to larger databases. A discussion of scaling EM and K-means clustering to large data sets is provided by Bradley et al. 1422, 423].

There are many aspects of clustering that we have not covered. Additional pointers are given in the books and surveys mentioned in the bibliographic notes of the previous chapter. Here, we mention four areas-omitting, un- fortunately, many more. Clustering of transaction data (Ganti et al. [430], Gibson et al. [433], Han et al. l 38], and Peters andZaki [ 5 ]) is an important area, as transaction data is common and of commercial importance. Streaming data is also becoming increasingly common and important as communications and sensor networks become pervasive. Two introductions to clustering for data streams are given in articles by Barbard [a19] and Guha et al. [4gb]. Conceptual clustering (Fisher and Langley la28], Jonyer et al. 1443], Mishra et al. [449], Michalski and Stepp [448], Stepp and Michalski l 5S]), which uses more complicated definitions of clusters that often correspond better to human notions of a cluster, is an area of clustering whose potential has perhaps not been fully realized. Finally, there has been a great deal of clustering work for data compression in the area of vector quantization. The book bv Gersho and Gray [a31] is a standard text in this area.

Bibliography 1418] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering

of high dimensional data for data mining applications. In Proc. of 1998 ACM-SIGMOD IntL. Conf. on Managernent of Data, pages 94 105, Seattle, Washington, June 1998. ACM Press.

[419] D. Barbar6. Requirements for clustering data streams. SIGKDD Etplorations Newslet- ter, 3(2):23-27, 2002.

1420] J. Bilmes. A Gentle Tutorial on the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models. Technical Report ICSI- TR-97-021, University of California at Berkeley, 1997.

l42Il C. M. Bishop, M. Svensen, and C. K. I. Williams. GTM: A principled alternative to the self-organizing map. In C. von der Malsburg, W. von Seelen, J. C. Vorbruggen, and

Bibliography 645

B. Sendhoff, editors, Arti,fi,c’ial Neural Netuorks-ICANN96. Intl. Conf, Proc., pages

165 170. Springer-Verlag, Berlin, Germany, 1996.

14221 P. S. Bradley, U. M. Fayyad, and C. Reina. Scaling Clustering Algorithms to Large

Databases. In Proc. of the lth IntI. Conf. on Knowledge Discouerltr and Data Mining, pages 9 15, New York City, August 1998. AAAI Press.

1423] P. S. Bradley, U. M. Fayyad, and C. Reina. Scaling EM (Expectation Maximization)

Clustering to Large Databases. Technical Report MSR-TR-98-35, Microsoft Research,

October 1999.

1424] P. Cheeseman, J. Kelly, M. Self, J. Stutz, W. Taylor, and D. Freeman. AutoClass:

a Bayesian classification system In Read’ings ‘in knowled,ge acqu’isition and” learning:

automating the construction and improuement of erper”t sgstems, pages 431 441. Morgan

Kaufmann Publishers Inc., 1993.