Characteristics of the Data Sets and Attributes As discussed in the introduction, the type of data set and attributes can dictate the type of algorithm to use. For instance, the K-means algorithm can only be used on data for which an appropriate proximity measure is available that allows meaningful computation of a cluster centroid. For other clustering techniques, such as many agglomerative hierarchical approaches, the underlying nature of the data sets and attributes is less important as long as a proximity matrix can be computed.
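To make the contrast concrete, here is a minimal sketch (illustrative code, not from the text): the K-means step averages raw objects into a centroid, which is only meaningful for some data types, while a single-link agglomerative step consults nothing but a proximity matrix.

```python
# Sketch: K-means needs a meaningful centroid (here, the component-wise
# mean of Euclidean vectors), while agglomerative clustering needs only
# a proximity matrix, whatever the underlying data type.

def centroid(points):
    """Component-wise mean -- only meaningful for data where averaging
    makes sense (e.g., continuous Euclidean attributes)."""
    n = len(points)
    dim = len(points[0])
    return tuple(sum(p[i] for p in points) / n for i in range(dim))

def merge_closest(proximity, clusters):
    """One agglomerative step using single link: merge the two clusters
    whose closest pair of members has the smallest distance.  Only the
    proximity matrix is consulted; the raw objects are never touched."""
    best = None
    for a in range(len(clusters)):
        for b in range(a + 1, len(clusters)):
            d = min(proximity[i][j] for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
    _, a, b = best
    merged = clusters[a] | clusters[b]
    return [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
print(centroid(points[:2]))            # -> (0.0, 0.5)

prox = [[abs(i - j) for j in range(4)] for i in range(4)]  # toy 4x4 matrix
clusters = [{0}, {1}, {2}, {3}]
print(merge_closest(prox, clusters))   # merges the two closest singletons
```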
Noise and Outliers Noise and outliers are particularly important aspects of the data. We have tried to indicate the effect of noise and outliers on the various clustering algorithms that we have discussed. In practice, however, it may be difficult to evaluate the amount of noise in the data set or the number of outliers. More than that, what is noise or an outlier to one person may be interesting to another person. For example, if we are using clustering to segment an area into regions of different population density, we do not want to use a density-based clustering technique, such as DBSCAN, that assumes that regions or points with density lower than a global threshold are noise or outliers. As another example, hierarchical clustering schemes, such as CURE, often discard clusters of points that are growing slowly since such groups tend to represent outliers. However, in some applications we may be most interested in relatively small clusters; e.g., in market segmentation, such groups might represent the most profitable customers.
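The DBSCAN behavior described above can be sketched in a few lines (hypothetical parameter values; the real algorithm also expands clusters from core points): any region whose density falls below the single global threshold is flagged as noise, whether or not that region is interesting to the analyst.

```python
# Sketch of DBSCAN's global density test (hypothetical EPS and MIN_PTS):
# a point is "core" if at least MIN_PTS points lie within radius EPS;
# points that are neither core nor within EPS of a core point are noise.

EPS, MIN_PTS = 1.5, 3

def dist(p, q):
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

def label_points(points):
    core = [p for p in points
            if sum(dist(p, q) <= EPS for q in points) >= MIN_PTS]  # count includes p
    noise = [p for p in points
             if p not in core and all(dist(p, c) > EPS for c in core)]
    return core, noise

dense  = [(0, 0), (0, 1), (1, 0), (1, 1)]   # tight grid: high density
sparse = [(10, 10), (14, 10), (10, 14)]     # spread out: low density
core, noise = label_points(dense + sparse)
print(len(core), len(noise))                # -> 4 3
```

The entire sparse region is discarded as noise, which is exactly the wrong outcome if, as in the population-density example, low-density regions are themselves of interest.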
Number of Data Objects We have considered how clustering is affected by the number of data objects in considerable detail in previous sections. We reiterate, however, that this factor often plays an important role in determining the type of clustering algorithm to be used. Suppose that we want to create a hierarchical clustering of a set of data, but are not interested in a complete hierarchy that extends all the way to individual objects, only in the point at which the data has been split into a few hundred clusters. If the data is very large, we cannot directly apply an agglomerative hierarchical clustering technique. We could, however, use a divisive clustering technique, such as the minimum spanning tree (MST) algorithm, which is the divisive analog to single link, but this would only work if the data set is not too large. Bisecting K-means would also work for many data sets, but if the data set is large enough that it cannot be contained completely in memory, then this scheme also runs into problems. In this situation, a technique such as BIRCH, which does not require that all data be in main memory, becomes more useful.
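The bisecting K-means option can be sketched as follows (a toy 1-D version, not a production implementation): the loosest cluster is repeatedly split by 2-means until the desired number of clusters is reached, so no hierarchy below that level is ever built.

```python
# Sketch of bisecting K-means (1-D data for brevity): repeatedly split the
# cluster with the largest SSE using 2-means, stopping once the desired
# number of clusters is reached -- no full hierarchy down to single objects.

def two_means(xs, iters=10):
    """Plain 2-means on a list of numbers, seeded with the min and max."""
    c0, c1 = min(xs), max(xs)
    for _ in range(iters):
        left  = [x for x in xs if abs(x - c0) <= abs(x - c1)]
        right = [x for x in xs if abs(x - c0) >  abs(x - c1)]
        if left:  c0 = sum(left) / len(left)
        if right: c1 = sum(right) / len(right)
    return left, right

def sse(xs):
    """Sum of squared distances of the values to their mean."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs)

def bisecting_kmeans(xs, k):
    clusters = [list(xs)]
    while len(clusters) < k:
        worst = max(clusters, key=sse)      # split the loosest cluster
        clusters.remove(worst)
        left, right = two_means(worst)
        clusters += [left, right]
    return clusters

data = [1.0, 1.1, 1.2, 5.0, 5.1, 9.0, 9.2]
print(sorted(bisecting_kmeans(data, 3)))   # -> [[1.0, 1.1, 1.2], [5.0, 5.1], [9.0, 9.2]]
```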
642 Chapter 9 Cluster Analysis: Additional Issues and Algorithms
Number of Attributes We have also discussed the impact of dimensionality at some length. Again, the key point is to realize that an algorithm that works well in low or moderate dimensions may not work well in high dimensions. As in many other cases in which a clustering algorithm is inappropriately applied, the clustering algorithm may run and produce clusters, but the clusters may not represent the true structure of the data.
Cluster Description One aspect of clustering techniques that is often overlooked is how the resulting clusters are described. Prototype-based clusters are succinctly described by a small set of cluster prototypes. In the case of mixture models, the clusters are described in terms of small sets of parameters, such as the mean vector and the covariance matrix. This is also a very compact and understandable representation. For SOM, it is typically possible to visualize the relationships between clusters in a two-dimensional plot, such as that of Figure 9.8. For graph- and density-based clustering approaches, however, clusters are typically described as sets of cluster members. Nonetheless, in CURE, clusters can be described by a (relatively) small set of representative points. Also, for grid-based clustering schemes, such as CLIQUE, more compact descriptions can be generated in terms of conditions on the attribute values that describe the grid cells in the cluster.
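The three main description styles can be illustrated on a toy 2-D cluster (a sketch; the cluster, the attribute names, and the use of min/max ranges for the grid-style conditions are hypothetical):

```python
# Sketch of three cluster description styles: a prototype (mean vector),
# an explicit member list, and CLIQUE-style conditions on attribute
# ranges (hypothetical toy cluster of 2-D points).

cluster = [(2.0, 3.0), (2.2, 3.4), (1.8, 3.2)]

# 1. Prototype description: a single representative point.
prototype = tuple(sum(x[i] for x in cluster) / len(cluster) for i in range(2))

# 2. Extensional description: list every member (graph/density-based style).
members = sorted(cluster)

# 3. Grid-style description: a condition on each attribute's range.
conditions = {f"attr{i}": (min(x[i] for x in cluster), max(x[i] for x in cluster))
              for i in range(2)}

print(prototype)     # one point instead of the whole member list
print(conditions)    # e.g., attr0 in [1.8, 2.2] and attr1 in [3.0, 3.4]
```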
Algorithmic Considerations There are also important aspects of algorithms that need to be considered. Is the algorithm non-deterministic or order-dependent? Does the algorithm automatically determine the number of clusters? Is there a technique for determining the values of various parameters? Many clustering algorithms try to solve the clustering problem by trying to optimize an objective function. Is the objective a good match for the application objective? If not, then even if the algorithm does a good job of finding a clustering that is optimal or close to optimal with respect to the objective function, the result is not meaningful. Also, most objective functions give preference to larger clusters at the expense of smaller clusters.
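The last point can be made concrete with the common SSE objective (a sketch on toy 1-D data): because SSE sums squared errors over all points, a large spread-out cluster dominates the total, so an optimizer will favor reducing its error even at the cost of mangling a small, tight cluster.

```python
# Sketch: the SSE objective sums squared distances to each cluster's mean,
# so a large, spread-out cluster dominates the total -- an SSE optimizer
# gains far more by splitting the big cluster than by preserving the small one.

def sse(cluster):
    m = sum(cluster) / len(cluster)
    return sum((x - m) ** 2 for x in cluster)

big   = [float(x) for x in range(100)]   # 100 points, wide spread
small = [200.0, 200.1, 200.2]            # tight 3-point cluster

print(round(sse(big), 2), round(sse(small), 2))   # -> 83325.0 0.02
```

The small cluster contributes a vanishing fraction of the objective, which is why, in applications like market segmentation, an objective mismatched to the goal can silently discard the most interesting groups.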
Summary The task of choosing the proper clustering algorithm involves considering all of these issues, and domain-specific issues as well. There is no formula for determining the proper technique. Nonetheless, a general knowledge of the types of clustering techniques that are available and consideration of the issues mentioned above, together with a focus on the intended application, should allow a data analyst to make an informed decision on which clustering approach (or approaches) to try.
9.7 Bibliographic Notes
An extensive discussion of fuzzy clustering, including a description of fuzzy c-means and formal derivations of the formulas presented in Section 9.2.7, can be found in the book on fuzzy cluster analysis by Höppner et al. [441]. While not discussed in this chapter, AutoClass by Cheeseman et al. [424] is one of the earliest and most prominent mixture-model clustering programs. An introduction to mixture models can be found in the tutorial of Bilmes [420], the book by Mitchell (which also describes how the K-means algorithm can be derived from a mixture model approach), and the article by Fraley and Raftery.

Besides data exploration, SOM and its supervised learning variant, Learning Vector Quantization (LVQ), have been used for many tasks: image segmentation, organization of document files, and speech processing. Our discussion of SOM was cast in the terminology of prototype-based clustering. The book on SOM by Kohonen et al. [447] contains an extensive introduction to SOM that emphasizes its neural network origins, as well as a discussion of some of its variations and applications. One important SOM-related clustering development is the Generative Topographic Map (GTM) algorithm by Bishop et al. [421], which uses the EM algorithm to find Gaussian models satisfying two-dimensional topographic constraints.
The description of Chameleon can be found in the paper by Karypis et al. Capabilities similar, although not identical, to those of Chameleon have been implemented in the CLUTO clustering package by Karypis. The METIS graph partitioning package by Karypis and Kumar is used to perform graph partitioning in both programs, as well as in the OPOSSUM clustering algorithm by Strehl and Ghosh [459]. The notion of SNN similarity was introduced by Jarvis and Patrick [442]. A hierarchical clustering scheme based on a similar concept of mutual nearest neighbors was proposed by Gowda and Krishna [433]. Guha et al. created ROCK, a hierarchical graph-based clustering algorithm for clustering transaction data, which, among other interesting features, also uses a notion of similarity based on shared neighbors that closely resembles the SNN similarity developed by Jarvis and Patrick. A description of the SNN density-based clustering technique can be found in the publications of Ertoz et al. [426, 427]. SNN density-based clustering was used by Steinbach et al. [457] to find climate indices.

Examples of grid-based clustering algorithms are OptiGrid (Hinneburg and Keim [440]), the BANG clustering system (Schikuta and Erhart [455]), and WaveCluster (Sheikholeslami et al.). The CLIQUE algorithm is described in the paper by Agrawal et al. [418]. MAFIA (Nagesh et al.) is
a modification of CLIQUE whose goal is improved efficiency. Kailing et al. [444] have developed SUBCLU (density-connected SUBspace ClUstering), a subspace clustering algorithm based on DBSCAN. The DENCLUE algorithm was proposed by Hinneburg and Keim.
Our discussion of scalability was strongly influenced by the article of Ghosh [432]. A wide-ranging discussion of specific techniques for clustering massive data sets can be found in the paper by Murtagh. CURE is the work of Guha et al., while details of BIRCH are in the paper by Zhang et al. CLARANS (Ng and Han) is an algorithm for scaling K-medoid clustering to larger databases. A discussion of scaling EM and K-means clustering to large data sets is provided by Bradley et al. [422, 423].
There are many aspects of clustering that we have not covered. Additional pointers are given in the books and surveys mentioned in the bibliographic notes of the previous chapter. Here, we mention four areas, omitting, unfortunately, many more. Clustering of transaction data (Ganti et al., Gibson et al., Han et al. [438], and Peters and Zaki) is an important area, as transaction data is common and of commercial importance. Streaming data is also becoming increasingly common and important as communications and sensor networks become pervasive. Two introductions to clustering for data streams are given in articles by Barbará [419] and Guha et al. [435]. Conceptual clustering (Fisher and Langley [428], Jonyer et al. [443], Mishra et al., Michalski and Stepp, Stepp and Michalski [458]), which uses more complicated definitions of clusters that often correspond better to human notions of a cluster, is an area of clustering whose potential has perhaps not been fully realized. Finally, there has been a great deal of clustering work for data compression in the area of vector quantization. The book by Gersho and Gray [431] is a standard text in this area.
Bibliography

[418] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In Proc. of 1998 ACM-SIGMOD Intl. Conf. on Management of Data, pages 94-105, Seattle, Washington, June 1998. ACM Press.
[419] D. Barbará. Requirements for clustering data streams. SIGKDD Explorations Newsletter, 3(2):23-27, 2002.
[420] J. Bilmes. A Gentle Tutorial on the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models. Technical Report ICSI-TR-97-021, University of California at Berkeley, 1997.
[421] C. M. Bishop, M. Svensen, and C. K. I. Williams. GTM: A principled alternative to the self-organizing map. In C. von der Malsburg, W. von Seelen, J. C. Vorbruggen, and B. Sendhoff, editors, Artificial Neural Networks-ICANN 96, Intl. Conf. Proc., pages 165-170. Springer-Verlag, Berlin, Germany, 1996.
[422] P. S. Bradley, U. M. Fayyad, and C. Reina. Scaling Clustering Algorithms to Large Databases. In Proc. of the 4th Intl. Conf. on Knowledge Discovery and Data Mining, pages 9-15, New York City, August 1998. AAAI Press.
[423] P. S. Bradley, U. M. Fayyad, and C. Reina. Scaling EM (Expectation Maximization) Clustering to Large Databases. Technical Report MSR-TR-98-35, Microsoft Research, 1998.
[424] P. Cheeseman, J. Kelly, M. Self, J. Stutz, W. Taylor, and D. Freeman. AutoClass: a Bayesian classification system. In Readings in knowledge acquisition and learning: automating the construction and improvement of expert systems, pages 431-441. Morgan Kaufmann Publishers Inc., 1993.