Consider the following four faces shown in Figure 8.39. Again, darkness or number of dots represents density. Lines are used only to distinguish regions and do not represent points.

(a) For each figure, could you use single link to find the patterns represented by the nose, eyes, and mouth? Explain.

(b) For each figure, could you use K-means to find the patterns represented by the nose, eyes) and mouth? Explain.

18.

1 0

20.

8.7 Exercises 565

Figure 8.39. Figure for Exercise 20.

(c) What limitation does clustering have in detecting all the patterns formed by the points in Figure 8.39(c)?

21. Compute the entropy and purity for the confusion matrix in Table 8.14.

22. You are given two sets of 100 points that fall within the unit square. One set of points is arranged so that the points are uniformly spaced. The other set of points is generated from a uniform distribution over the unit square.

(a) Is there a difference between the two sets of points?

(b) If so, which set of points will typically have a smaller SSE for K:10 clusters?

(c) What will be the behavior of DBSCAN on the uniform data set? The random data set?

Using the data in Exercise 24, compute the silhouette coefficient for each point, each of the two clusters, and the overall clustering.

Given the set of cluster labels and similarity matrix shown in Tables 8.15 and 8.16, respectively, compute the correlation between the similarity matrix and the ideal similarity matrix, i.e., the matrix whose ijth enfty is 1 if two objects belons to the same cluster. and 0 otherwise.

(d)(b)

23.

@@ @

\#

Table 8.14. Confusion matrix for Exercise 21. Cluster Entertainment Financial Foreign Metro National Sports Total

#1 I 1 0 1 1 / 676 693 4 ‘ 27 89 333 827 253 .).) 7562 -JJ.’ 326 40,1 8 105 16 29 949

Total 354 555 347 943 273 738 3204

24.

566 Chapter 8 Cluster Analysis:

Table 8,15. Table of cluster labels for Exercise 24.

Point Cluster Label P 1 1 P2 1 P3 2 P4 2

Basic Concepts and Algorithms

Table 8.16. Similarity matrix for Exercise 24.

Point P1 P2 P3 P4 P 1 1 0.8 0.65 t.r.5l)

P2 0.8 1 0.7 0.6 P3 t r .oc 0.7

‘I 0.9 P4 u . 5 5 0.6 0.9 1

25. Compute the hierarchical F-measure for the eight objects {p1, p2, p3, p4, p5, p6, p7, p8) and hierarchical clustering shown in Figure 8.40. Class A contains points pI,p2, and p3, while p4, p5, p6, p7, and p8 belong to class B.

Figure 8.40. Hierarchical clustering for Exercise 25.

Compute the cophenetic correlation coefficient for the hierarchical clusterings in Exercise 16. (You will need to convert the similarities into dissimilarities.)

Prove Equation 8.14.

Prove Equation 8.16.

Prove that Df-rD,.. (r – m)(m – *o) :0. This fact was used in the proof

that TSS : SSE + SSB in Section 8.5.2.

Clusters of documents can be summarized by finding the top terms (words) for the documents in the cluster, e.g., by taking the most frequent k terms, where k is a constant, say 10, or by taking all terms that occur more frequently than a specified threshold. Suppose that K-means is used to find clusters of both documents and words for a document data set.

(a) How might a set of term clusters defined by the top terms in a document cluster differ from the word clusters found by clustering the terms with K-means?

(b) How could term clustering be used to define clusters of documents?

We can represent a data set as a collection of object nodes and a collection of attribute nodes, where there is a link between each object and each attribute,

26.

27.

28.

29.

30.

(p1, p2, p3, p4, p5, p6, p7, p8)

{p1, p2, p4, p5,}

3 1 .

32.

8.7 Exercises 567

and where the weight of that link is the value of the object for that attribute. For sparse data, if the value is 0, the Iink is omitted. Bipartite clustering attempts to partition this graph into disjoint clusters, where each cluster consists of a set of object nodes and a set of attribute nodes. The objective is to maximize the weight of links between the object and attribute nodes of a cluster, while minimizing the weight of links between object and attribute links in different clusters. This type of clustering is also known as co-clustering since the objects and attributes are clustered at the same time.

(a) How is bipartite clust;ering (co-clustering) different from clustering the sets of objects and attributes separately?

(b) Are there any cases in which these approaches yield the same clusters?

(c) What are the strengths and weaknesses of co-clustering as compared to ordinary clustering?

In Figure 8.41, match the sirnilarity matrices, which are sorted according to cluster labels, with the sets of points. Differences in shading and marker shape distinguish between clusters, and each set of points contains 100 points and three clusters. In the set of points labeled 2, there are three very tight, equal- sized clusters.

‘1

0 9

0 8

0 7

0 6

> 0 5

o 4

0 3

568 Chapter 8 Cluster Analysis: Basic Concepts and Algorithms

,i#’ ds}{ 4* ‘ . .oh”

x

B

0 2 0 4 0 6 0 8 ‘ 1

40 60 80 100

Figure 8,41. Points and

2A 40 60

similarity matrices for Exercise 32.

Cluster Analysis: Additional lssues and Algorithms

A large number of clustering algorithms have been developed in a variety of domains for different types of applications. None of these algorithms is suitable for all types of data, clusters, and applications. In fact, it seems that there is arlways room for a new clustering algorithm that is more effi.cient or better suited to a particular type of data, cluster, or application. Instead, we can only claim that we have techniques that work well in some situations. The reason is that, in many cases, what constitutes a good set of clusters is open to subjective interpretation. Furthermore) when an objective measure is employed to give a precise definition of a cluster, the problem of finding the optimal clustering is often computationally infeasible.

This r:hapter focuses on important issues in cluster analysis and explores the concerpts and approaches that have been developed to address them. We begin with a discussion of the key issues of cluster analysis, namely, the char- acteristics of data, clusters, and algorithms that strongly impact clustering. These issues are important for understanding, describing, and comparing clus- tering techniques, and provide the basis for deciding which technique to use in a specific situation. For example) many clustering algorithms have a time or space connplexity of O(m2) (m being the number of objects) and thus, are not suitable fbr large data sets. We then discuss additional clustering techniques. For each rbechnique, we describe the algorithm, including the issues it addresses and the rnethods that it uses to address them. We conclude this chapter by providing; some general guidelines for selecting a clustering algorithm for a given application.

57O Chapter 9 Cluster Analysis: Additional Issues and Algorithms

9.1 Characteristics of Data, Clusters, and Cluster- ing Algorithms

This section explores issues related to the characteristics of data, clusters, and algorithms that are important for a broad understanding of cluster analysis. Some of these issues represent challenges, such as handling noise and outliers.

Other issues involve a desired feature of an algorithm, such as an ability to produce the same result regardless of the order in which the data objects are processed. The discussion in this section, along with the discussion of different types of clusterings in Section 8.1.2 and different types of clusters in Section 8.1.3, identifies a number of “dimensions” that can be used to describe and compare various clustering algorithms and the clustering results that they pro-

duce. To illustrate this, we begin this section with an example that compares two clustering algorithms that were described in the previous chapter, DB- SCAN and K-means. This is followed by a more detailed description of the characteristics of data, clusters, and algorithms that impact cluster analysis.

9.1.1 Example: Comparing K-means and DBSCAN

To simplify the comparison, we assume that that there are no ties in distances for either K-means or DBSCAN and that DBSCAN always assigns a border point that is associated with several core points to the closest core point.

o Both DBSCAN and K-means are partitional clustering algorithms that assign each object to a single cluster, but K-means typically clusters all the objects, while DBSCAN discards objects that it classifies as noise.

o K-means uses a prototype-based notion of a cluster; DBSCAN uses a density-based concept.

o DBSCAN can handle clusters of different sizes and shapes and is not strongly affected by noise or outliers. K-means has difficulty with non- globular clusters and clusters of different sizes. Both algorithms can perform poorly when clusters have widely differing densities.

o K-means can only be used for data that has a well-defined centroid, such as a mean or median. DBSCAN requires that its definition of density, which is based on the traditional Euclidean notion of density, be meaningful for the data.

o K-means can be applied to sparse, high-dimensional data, such as doc- ument data. DBSCAN typically performs poorly for such data because

9.1_ Characteristics of Data, Clusters, and Clustering Algorithms 57L

the traditional Euclidean definition of density does not work well for high-dimensional data.

The original versions of K-means and DBSCAN were designed for Eu- clidean data, but both have been extended to handle other types ofdata.

DIISCAN makes no assumption about the distribution of the data. The basic K-means algorithm is equivalent to a statistical clustering approach (nLixture models) that assumes all clusters come from spherical Gaussian dir;tributions with different means but the same covariance matrix. See Section 9.2.2.

o DI3SCAN and K-means both look for clusters using all attributes, that is, they do not look for clusters that may involve only a subset of the attbributes.

K-means can find clusters that are not well separated, even if they over- lap (see Figure 8.2(b)), but DBSCAN merges clusters that overlap.

Tlre K-means algorithm has a time complexity of O(m), while DBSCAN tdkes O(m2) time, except for special cases such as low-dimensional Eu- clidean data.

o DIISCAN produces the same set of clusters from one run to another, wtrile K-means, which is typically used with random initialization of centroids. does not.

DIISCAN automatically determines the number of clusters; for K-means, th,e number of clusters needs to be specified as a parameter. However, DIISCAN has two other parameters that must be specified, Eps and Mi,nPts.

K-means clustering can be viewed as an optimization problem; i.e., min- imize the sum of the squared error of each point to its closest centroid, and as a specific case of a statistical clustering approach (mixture mod- els). DBSCAN is not based on any formal model.

9. I .2 Data Character ist ics

The foll:wing are some characteristics of data that can strongly affect cluster analvsis,,

572 Chapter 9 Cluster Analysis: Additional Issues and Algorithms

High Dimensionality In high-dimensional data sets, the traditional Eu- clidean notion of density, which is the number of points per unit volume, becomes meaningless. To see this, consider that as the number of dimensions increases, the volume increases rapidly, and unless the number of points grov/s exponentially with the number of dimensions, the density tends to 0. (Volume is exponential in the number of dimensions. For instance, a hypersphere with radius, r, and dimension, d, has volume proportional to rd.) Also, proximity tends to become more uniform in high-dimensional spaces. Another way to view this fact is that there are more dimensions (attributes) that contribute to the proximity between two points and this tends to make the proximity more uniform. Since most clustering techniques are based on proximity or density, they can often have difficulty with high-dimensional data. One approach to addressing such problems is to employ dimensionality reduction techniques. Another approach, as discussed in Sections 9.4.5 andg.4.7, is to redefine the notions of proximity and density.

Size Many clustering algorithms that work well for small or medium-size data sets are unable to handle larger data sets. This is addressed further in the discussion of the characteristics of clustering algorithms scalability is one such characteristic-and in Section 9.5, which discusses scalable clustering algorithms.

Sparseness Sparse data often consists of asymmetric attributes, where zeto values are not as important as non-zero values. Therefore, similarity measures appropriate for asymmetric attributes are commonly used. However, other, related issues also arise. For example, are the magnitudes of non-zero en- tries important, or do they distort the clustering? In other words, does the clustering work best when there are only two values, 0 and 1?

Noise and Outliers An atypical point (outlier) can often severely degrade the performance of clustering algorithms, especially algorithms such as K- means that are prototype-based. On the other hand, noise can cause tech- niques, such as single link, to join clusters that should not be joined. In some cases, algorithms for removing noise and outliers are applied before a clus- tering algorithm is used. Alternatively, some algorithms can detect points that represent noise and outliers during the clustering process and then delete them or otherwise eliminate their negative effects. In the previous chapter, for instanceT we saw that DBSCAN automatically classifies low-density points as noise and removes them from the clustering process. Chameleon (Section

9 . 1 Characteristics of Data, Clusters, and Clustering Algorithms 573

9.4.4), SNN density-based clustering (Section 9.4.8), and CURE (Section 9.5.3) are three of the algorithms in this chapter that explicitly deal with noise and outliers during the clustering process.

Type of Attributes and Data Set As discussed in Chapter 2, data sets can be ofvarious types, such as structured, graph, or ordered, while attributes can be categorical (nominal or ordinal) or quantitative (interval or ratio), and are binary, discrete, or continuous. Different proximity and density measures are appropriate for different types of data. In some situations, data may need to be discretized or binarized so that a desired proximity measure or clustering algorithm can be used. Another complication occurs when attributes are of widely differing types, e.g., continuous and nominal. In such cases, proximity and density are more difficult to define and often more ad hoc. Finally, special data structures and algorithms may be needed to handle certain types of data efficiently.

Scale Different attributes, e.g., height and weight, m&y be measured on dif- ferent scales. These differences can strongly affect the distance or similarity between two objects and, consequently, the results of a cluster analysis. Con- sider clustering a group of people based on their heights, which are measured in meters, and their weights, which are measured in kilograms. If we use Eu- clidean distance as our proximity measure) then height will have little impact and people will be clustered mostly based on the weight attribute. If, however, we standardize each attribute by subtracting off its mean and dividing by its standard deviation, then we will have eliminated effects due to the difference in scale. More generally, normalization techniques, such as those discussed in Section 2.3.7, are typically used to handle these issues.

Mathematical Properties of the Data Space Some clustering tech- niques calculate the mean of a collection of points or use other mathematical operations that only make sense in Euclidean space or in other specific data spaces. Other algorithms require that the definition of density be meaningful for the data.

9.1.3 Cluster Characteristics

The different types of clusters, such as prototype-, graph-, and density-based, were described earlier in Section 8.1.3. Here, we describe other important characteristics of clusters.

574 Chapter I Cluster Analysis: Additional Issues and Algorithms

Data Distribution Some clustering techniques assume a particular type of distribution for the data. More specifically, they often assume that data can be modeled as arising from a mixture of distributions, where each cluster corresponds to a distribution. Clustering based on mixture models is discussed in Sect ion 9.2.2.

Shape Some clusters are regularly shaped, e.g., rectangular or globular, but in general, clusters can be of arbitrary shape. Techniques such as DBSCAN and single link can handle clusters of arbitrary shape, but prototype-based

schemes and some hierarchical techniques, such as complete link and group

average) cannot. Chameleon (Section 9.4.4) and CURE (Section 9.5.3) are examples of techniques that were specifically designed to address this problem.

Differing Sizes Many clustering methods, such as K-means, don’t work well when clusters have different sizes. (See Section 8.2.4.) This topic is discussed further in Section 9.6.

Differing Densities Clusters that have widely varying density can cause problems for methods such as DBSCAN and K-means. The SNN density- based clustering technique presented in Section 9.4.8 addresses this issue.

Poorly Separated Clusters When clusters touch or overlap, some cluster- ing techniques combine clusters that should be kept separate. Even techniques that find distinct clusters arbitrarily assign points to one cluster or another. Ftzzy clustering, which is described in Section 9.2.1, is one technique for deal- ing with data that does not form well-separated clusters.

Relationships among Clusters In most clustering techniques, there is no explicit consideration of the relationships between clusters, such as their rel- ative position. Self-organizing maps (SOM), which are described in Section 9.2.3, are a clustering technique that directly considers the relationships be- tween clusters during the clustering process. Specifically, the assignment of a point to one cluster affects the definitions of nearby clusters.

Subspace Clusters Clusters may only exist in a subset of dimensions (at- tributes), and the clusters determined using one set of dimensions may be quite different from the clusters determined by using another set. While this issue can arise with as few as two dimensions, it becomes more acute as di- mensionality increases, since the number of possible subsets of dimensions is

9 . 1 Characteristics of Data, Clusters, and Clustering Algorithms 575

exponential in the total number of dimensions. For that reason, it is not feasi- ble to simply look for clusters in all possible subsets of dimensions unless the number of dimensions is relatively low.

One approach is to apply feature selection, which was discussed in Sec- tion 2.3.4. However, this approach assumes that there is only one subset of dimensions in which the clusters exist. In reality, clusters can exist in many distinct subspaces (sets of dimensions), some of which overlap. Section 9.3.2 considers techniques that address the general problem of subspace clustering, i.e., of finding both clusters and the dimensions they span.

9.L.4 General Characteristics of Clustering Algorithms

Clustering algorithms are quite varied. We provide a general discussion of important characteristics of clustering algorithms here, and make more specific comments during our discussion of particular techniques.

Order Dependence For some algorithms, the quality and number of clus- ters produced can vary, perhaps dramatically, depending on the order in which the data is processed. While it would seem desirable to avoid such algorithms, sometimes the order dependence is relatively minor or the algorithm may have other desirable characteristics. SOM (Section 9.2.3) is an example of an algo- rithm that is order dependent.

Nondeterminism Clustering algorithms, such as K-means, are not order- dependent, but they produce different results for each run since they rely on an initialization step that requires a random choice. Because the quality of the clusters can vary from one run to another, multiple runs can be necessary.

Scalability It is not unusual for a data set to contain millions of objects, and the clustering algorithms used for such data sets should have linear or near- linear time and space complexity. Even algorithms that have a complexity of O(*’) are not practical for large data sets. Furthermore, clustering techniques for data sets cannot always assume that all the data will fit in main memory or that data elements can be randomly accessed. Such algorithms are infeasible for large data sets. Section 9.5 is devoted to the issue of scalabilitv.

Parameter Selection Most clustering algorithms have one or more pa- rameters that need to be set by the user. It can be difficult to choose the

576 Chapter 9 Cluster Analysis: Additional Issues and Algorithms

proper valuesl thus, the attitude is usually, “the fewer parameters, the bet- ter.” Choosing parameter values becomes even more challenging if a small change in the parameters drastically changes the clustering results. Finally, unless a procedure (which may involve user input) is provided for determining parameter values, a user of the algorithm is reduced to using trial and error to find suitable parameter values.

Perhaps the most well-known parameter selection problem is that of “choos- ing the right number of clusters” for partitional clustering algorithms, such as K-means. One possible approach to that issue is given in Section 8.5.5, while references to others are provided in the bibliographic notes.

Tlansforming the Clustering Problem to Another Domain One ap- proach taken by some clustering techniques is to map the clustering problem to a problem in a different domain. Graph-based clustering, for instance, maps the task of finding clusters to the task of partitioning a proximity graph into connected components.

TYeating Clustering as an Optirnization Problem Clustering is often viewed as an optimization problem: divide the points into clusters in a way that maximizes the goodness of the resulting set of clusters as measured by a user-specifi.ed objective function. For example, the K-means clustering algo- rithm (Section 8.2) tries to find the set of clusters that minimizes the sum of the squared distance of each point from its closest cluster centroid. In theory, such problems can be solved by enumerating all possible sets of clusters and selecting the one with the best value of the objective function, but this exhaus- tive approach is computationally infeasible. For this reason, many clustering techniques are based on heuristic approaches that produce good, but not op- timal clusterings. Another approach is to use objective functions on a greedy

or local basis. In particular, the hierarchical clustering techniques discussed in Section 8.3 proceed by making locally optimal (greedy) decisions at each step of the clustering process.

Road Map

We arrange our discussion of clustering algorithms in a manner similar to that of the previous chapter, grouping techniques primarily according to whether they are prototype-based, density-based, or graph-based. There is, however, a separate discussion for scalable clustering techniques. We conclude this chapter with a discussion of how to choose a clustering algorithm.

9.2 Prototype-BasedClustering 577

9.2 Prototype-Based Clustering

In prototype-based clustering, a cluster is a set of objects in which any object is closer to the prototype that defines the cluster than to the prototype of any other cluster. Section 8.2 described K-means, a simple prototype-based clustering algorithm that uses the centroid of the objects in a cluster as the prototype of the cluster. This section discusses clustering approaches that expand on the concept of prototype-based clustering in one or rrore ways, as discussed next:

o Objects are allowed to belong to more than one cluster. More specifically, an object belongs to every cluster with some weight. Such an approach addresses the fact that some objects are equally close to several cluster prototypes.

o A cluster is modeled as a statistical distribution, i.e., objects are gen- erated by a random process from a statistical distribution that is char- acterized by a number of statistical parameters, such as the mean and variance. This viewpoint generalizes the notion of a prototype and en- ables the use of well-established statistical techniques.

o Clusters are constrained to have fixed relationships. Most commonly, these relationships are constraints that specify neighborhood relation- ships; i.e., the degree to which two clusters are neighbors of each other. Constraining the relationships among clusters can simplify the interpre- tation and visualization of the data.

We consider three specific clustering algorithms to illustrate these exten- sions of prototype-based clustering. Ftzzy c-means uses concepts from the field of fitzzy logic and fiizzy set theory to propose a clustering scheme, which is much like K-means, but which does not require a hard assignment of a point to only one cluster. Mixture model clustering takes the approach that a set of clusters can be modeled as a mixture of distributions, one for each cluster. The clustering scheme based on Self-Organizing Maps (SOM) performs clustering within a framework that requires clusters to have a prespecified relationship to one another, e.8., & two-dimensional grid structure.

9.2.L hnzy Clustering

If data objects are distributed in well-separated groups, then a crisp clas- sification of the objects into disjoint clusters seems like an ideal approach. However, in most cases, the objects in a data set cannot be partitioned into

578 Chapter 9 Cluster Analysis: Additional Issues and Algorithms

well-separated clusters, and there will be a certain arbitrariness in assigning an object to a particular cluster. Consider an object that lies near the boundary of two clusters, but is slightly closer to one of them. In many such cases, it might be more appropriate to assign a weight to each object and each cluster that indicates the degree to which the object belongs to the cluster. Mathe- matically, wii is the weight with which object 4 belongs to cluster Cr.

As shown in the next section, probabilistic approaches can also provide such weights. While probabilistic approaches are useful in many situations, there are times when it is difficult to determine an appropriate statistical model. In such cases, non-probabilistic clustering techniques are needed to provide similar capabilities. Fuzzy clustering techniques are based on fitzzy set theory and provide a natural technique for producing a clustering in which membership weights (the w61) have a natural (but not probabilistic) interpre- tation. This section describes the general approach of finzy clustering and provides a specific example in terms of finzy c-means (fuzzy K-means).

F\rzzy Sets

Lotfi Zadeh introduced frzzy set theor:y and finzy logic in 1965 as a way of dealing with imprecision and uncertainty. Briefly, fuzzy set theory allows an object to belong to a set with a degree of membership between 0 and 1, while htzzy logic allows a statement to be true with a degree of certainty between 0 and 1. Traditional set theory and logic are special cases of their finzy counter- parts that restrict the degree of set membership or the degree of certainty to be either 0 or 1. Ftzzy concepts have been applied to many different areas, in- cluding control systems, pattern recognition, and data analysis (classification and clustering).

Consider the following example of hnzy logic. The degree of truth of the statement “It is cloudy” can be defined to be the percentage of cloud cover in the sky, e.g., if the sky is 50% covered by clouds, then we would assign “It is cloudy” a degree of truth of 0.5. If we have two sets, “cloudy days” and “non- cloudy days,” then we can similarly assign each day a degree of membership in the two sets. Thus, if a day were 2570 cloudy, it would have a 25Vo degree of membership in “cloudy days” and a 75To degree of membership in “non-cloudy days.tt

F\tzzy Clusters