graph-based separation and cohesionD;:g; Prori,mits(x,Y)

Cluster Evaluation 539

Note that any unsupervised measure of cluster validity potentially can be used as an objective function lbr a clustering algorithm and vice versa. The ClUstering TOolkit (CLUTO) (see the bibliographic notes) uses the cluster evaluation measures described in Table 8.6, as well as some other evaluation measures not mentioned here, to drive the clustering process. It does this by using an algorithm that is similar to the incremental K-means algorithm dis- cussed in Section 8.2.2. Specifically, each point is assigned to the cluster that produces the best value for the cluster evaluation function. The cluster eval- uation measure Z2 corresponds to traditional K-means and produces clusters that have good SSE values. The other measures produce clusters that are not as good with respect to SSE, but that are more optimal with respect to the specified cluster validity measure.

Relationship between Prototype-Based Cohesion and Graph-Based Cohesion

While the graph-based and prototype-based approaches to measuring the co- hesion and separation of a cluster seem distinct, for some proximity measures they are equivalent. For instance, for the SSE and points in Euclidean space, it can be shown (Equation 8.14) that the average pairwise distance between the points in a cluster is equivzr,lent to the SSE of the cluster. See Exercise 27 on page 566.

cluster ssE : I dtrt1″6,*)2 : r^L D I d,tst(x,y)2 xeC i

– ” ” 1 xQCay€C i

(8 .14)

Two Approaches to Prototype-Based Separation

When proximity is measured by Euclidean distance, the traditional measure of separation between clusters is the between group sum of squares (SSB), which is the sum of the squared distance of a cluster centroid, c6, to the overall meanT c, of all the data points. By summing the SSB over all clusters, we obtain the total SSB, which is given by Equation 8.15, where ca is the mean of the i,th cluster and c is the overall mean. The higher the total SSB of a clustering, the more separated the clusters are from one another.

Total SSB :D*o d, i ,st(c i ,c)2 x : l

It is straightforward to show that the total SSB is directly related to the pairwise distances between the centroids. In particular, if the cluster sizes are

8 .5

(8.15)

54O Chapter 8 Cluster Analysis: Basic Concepts and Algorithms

equal, i.e., mi : m I K , then this relationship takes the simple form given by Equation 8.16. (See Exercise 28 on page 566.) It is this type of equivalence that motivates the definition of prototype separation in terms of both Equations 8.12 and 8 .13 .

1 K K

Total SSB : + t , ry dist(c; .c i )2 2 K ? ? ‘ x

r – t l = I

Relationship between Cohesion and Separation

(8 .16)

In some cases, there is also a strong relationship between cohesion and separa- tion. Specifically, it is possible to show that the sum of the total SSE and the total SSB is a constant; i.e., that it is equal to the total sum of squares (TSS), which is the sum of squares of the distance of each point to the overall mean of the data. The importance of this result is that minimizing SSE (cohesion) is equivalent to maximizing SSB (separation).

We provide the proof of this fact below, since the approach illustrates techniques that are also applicable to proving the relationships stated in the last two sections. To simplify the notation, we assume that the data is one- dimensional, i.e., d’ist(r,A) : (r-A;2. Also, we use the fact that the cross-term

Di:tD”ecn@ – c,1)(c – q) is 0. (See Exercise 29 on page 566.)

TSS K

\- ZJ

K

\- Z-/

K

K

t D (” – “n)’ +Dlcil(c – c)2

t

(* – ,) ‘

( ( ” – ” n )

– ( r – ” n ) ) ‘

K K

(* – ro)’ – 2DI (” – “o)(” -“,) + D l k – “u)’

\- Zr

r e u

\- /J

TE L;

\-/ r e C

? : 1 r e u ,

i :L r€C;

SSE + SSB

? : I J E U ) :

K

i : l r € C t

K

i :7 r€Ct

8.5 Cluster Evaluation 54L

Evaluating Individual Clusters and Objects

So far, we have focused on using cohesion and separation in the overall eval-

uation of a group of clusters. Many of these measures of cluster validity also

can be used to evaluate individual clusters and objects. For example, we can

rank individual clusters according to their specific value of cluster validity, i.e.,

cluster cohesion or separation. A cluster that has a high value of cohesion may

be considered better than a clutter that has a lower value. This information

often can be used to improve the quality of a clustering. If, for example, a

cluster is not very cohesive, then we may want to split it into several subclus-

ters. On the other hand, if two clusters are relatively cohesive, but not well

separated, we may want to merge them into a single cluster.

We can also evaluate the objects within a cluster in terms of their con-

tribution to the overall cohesion or separation of the cluster. Objects that

contribute more to the cohesion and separation are near the “interior” of the

cluster. Those objects for which the opposite is true are probably near the

“edge” of the cluster. In the following section, we consider a cluster evalua-

tion measure that uses an approach based on these ideas to evaluate points,

clusters, and the entire set of clusters.

The Silhouette Coefficient

The popular method of silhouette coefficients combines both cohesion and sep-

aration. The following steps explain how to compute the silhouette coefficient

for an individual point, a process that consists of the following three steps.

We use distances, but an analog;ous approach can be used for similarities.

1. For the ith object, calculate its average distance to all other objects in

its cluster. Call this value a;.

2. For the i,th object and any cluster not containing the object, calculate

the object’s average distance to all the objects in the given cluster. Find

the minimum such value with respect to all clusters; call this value b.;.

3. For the i,th object, the silhouette coefficient is s; : (b’i – a.i)f rnax(ai,b1).

The value of the silhouette coefficient can vary between -1 and 1. A

negative value is undesirable because this corresponds to a case in which a;,

the average distance to points in the cluster, is greater than ba, the minimum

average distance to points in another cluster. We want the silhouette coefficient

to be positive (a; .–b), and for o; to be as close to 0 as possible, since the

coefficient assumes its maximum value of 1 when at.:0.

542 Chapter 8 Cluster Analysis: Basic Concepts and Algorithms

e :t,,ti’tt

et$&s

,,ti l:i* s gr:’r i: i :*b

, r;-

r:,;:!:1:i

di r” t,

..d{-.*4 + i-:Y,$

;I’ir,’irr’,

* 4#F* o

“l’+*-I’u ti’u,in*

“i-’11,.,, zu –

: i;,f{,e

* I t

“‘” :”j:r, s “.tiii: f.i,:i ,.r:,bs.

o 01 02 03 :, l r**,”0:*n.,:”:

07 08 oe 1

Figure 8.29. Silhouette coefficients for points in ten clusters.

We can compute the average silhouette coemcient of a cluster by simply taking the average of the silhouette coefficients of points belonging to the cluster. An overall measure of the goodness of a clustering can be obtained by computing the average silhouette coefficient of all points.

Example 8.8 (Silhouette Coefficient). Figure 8.29 shows a plot of the silhouette coeffi.cients for points in 10 clusters. Darker shades indicate lower silhouette coefficients. I

8.5.3 lJnsupervised Cluster Evaluation Using the Proximity Matrix

In this section, we examine a couple of unsupervised approaches for assessing cluster validity that are based on the proximity matrix. The first compares an actual and idealized proximity matrix, while the second uses visualization.

Measuring Cluster Validity via Correlation

If we are given the similarity matrix for a data set and the cluster labels from a cluster analysis of the data set, then we can evaluate the “goodness” of the clustering by looking at the correlation between the similarity matrix and an ideal version of the similarity matrix based on the cluster labels. (With minor changes, the following applies to proximity matrices, but for simplicity, we discuss only similarity matrices.) More specifically, an ideal cluster is one whose points have a similarity of 1 to all points in the cluster, and a similarity of 0 to all points in other clusters. Thus, if we sort the rows and columns of the similarity matrix so that all objects belonging to the same class are together, then an ideal similarity matrix has a block diagonal structure. In other words, the similarity is non-zero, i.e., 1, inside the blocks of the similarity

8.5 Cluster Evaluation 543

matrix whose entries represent intra-cluster similarity, and 0 elsewhere. The ideal similarity matrix is constructed by creating a matrix that has one row and one column for each data point just like an actual similarity matrix- and assigning a 1 to an entry if the associated pair of points belongs to the same cluster. All other entries are 0.

High correlation between the ideal and actual similarity matrices indicates that the points that belong to the same cluster are close to each other, while low correlation indicates the opposite. (Since the actual and ideal similarity matrices are symmetric, the correlation is calculated only among the n(n-L) l2 entries below or above the diagonal of the matrices.) Consequently, this is not a good measure for many density- or contiguity-based clusters, because they are not globular and may be closely intertwined with other clusters.

Example 8.9 (Correlation of Actual and Ideal Similarity Matrices). To illustrate this measure, we calculated the correlation between the ideal and actual similarity matrices for the K-means clusters shown in Figure 8.26(c) (random data) and Figure S.30(a) (data with three well-separated clusters). The correlations were 0.5810 and 0.9235, respectively, which reflects the ex- pected result that the clusters found by K-means in the random data are worse than the clusters found by K-means in data with well-separated clusters. r

Judging a Clustering Visually by Its Similarity Matrix

The previous technique suggestsi a more general, qualitative approach to judg-

ing a set of clusters: Order the riimilarity matrix with respect to cluster labels and then plot it. In theory, if we have well-separated clusters, then the simi- larity matrix should be roughly block-diagonal. If not, then the patterns dis- played in the similarity matrix can reveal the relationships between clusters. Again, all of this can be applied to dissimilarity matrices, but for simplicity, we will only discuss similarity matrices.

Example 8.10 (Visualizing a Similarity Matrix). Consider the points in Figure 8.30(a), which form three well-separated clusters. If we use K-means to group these points into three clusters, then we should have no trouble finding these clusters since they are well-separated. The separation of these clusters is illustrated by the reordered similarity matrix shown in Figure 8.30(b). (For uniformity, we have transformed the distances into similarities using the for- mula s – 1- (d-rni,n-d)l(mar^d-min-d).) Figure 8.31 shows the reordered similarity matrices for clusters found in the random data set of Figure 8.26 by DBSCAN, K-means, and complete link.

544 Chapter 8 Cluster Analysis: Basic Concepts and Algorithms

(a) Well-separated clusters. (b) Similarity matrix sorted by K-means cluster labels.

Figure 8.30. Similarity matrix for well-separated clusters.

The well-separated clusters in Figure 8.30 show a very strong, block- diagonal pattern in the reordered similarity matrix. However, there are also weak block diagonal patterns-see Figure 8.31 in the reordered similarity matrices of the clusterings found by K-means, DBSCAN, and complete link in the random data. Just as people can find patterns in clouds, data mining algorithms can find clusters in random data. While it is entertaining to find patterns in clouds, it is pointless and perhaps embarrassing to find clusters in noise. r

This approach may seem hopelessly expensive for large data sets, since the computation of the proximity matrix takes O(m2) time, where rn is the number of objects, but with sampling, this method can still be used. We can take a sample of data points from each cluster, compute the similarity between these points, and plot the result. It may be necessary to oversample small clusters and undersample large ones to obtain an adequate representation of all clusters.

8.5.4 lJnsupervised Evaluation of Hierarchical Clustering

The previous approaches to cluster evaluation are intended for partitional clusterings. Here we discuss the cophenetic correlation, a popular evaluation measure for hierarchical clusterings. The cophenetic distance between two objects is the proximity at which an agglomerative hierarchical clustering tech-

Similarity

(a) Similarity matrix sorted by DBSCAN cluster labels.

(b) Similarity matrix sorted by K-means cluster labels.

8.5 Cluster Evaluation 545

(c) Similarity matrix sorted by complete link cluster labels.

Figure 8.31. Similarity matrices for clusters from random data.

nique puts the objects in the same cluster for the first time. For example, if at

some point in the agglomerative hierarchical clustering process, the smallest

distance between the two clusters that are merged is 0.1, then all points in

one cluster have a cophenetic clistance of 0.1 with respect to the points in the

other cluster. In a cophenetic distance matrix, the entries are the cophenetic

distances between each pair of objects. The cophenetic distance is different

for each hierarchical clustering of a set of points.

Example 8.L1- (Cophenetic Distance Matrix). Table 8.7 shows the cophen-

tic distance matrix for the single link clustering shown in Figure 8.16. (The

data for this figure consists of the 6 two-dimensional points given in Table

8 .3 . )

Table 8.7. Cophenetic distance matrix for single link and data in table 8.3

Point P 1 P2 P3 P4 P5 P6 P1 0 0.222 0.222 0.222 0.222 0.222 P2 0.222 0 0.148 0.15r 0.139 0.148 P3 0.222 0.148 0 0.151 0.148 0.110 P4 0.222 0.151 0.151 0 0.151 0.151 P5 0.222 0.139 0.148 0.151 0 0.148

P6 0.222 0.148 0.1 10 0.151 0.148 0

The between

I

CoPhenetic Correlation Coefficient (CPCC) is the correlation

the entries of this matrix and the original dissimilarity matrix and is

546 Chapter 8 Cluster Analysis: Basic Concepts and Algorithms

a standard measure of how well a hierarchical clustering (of a particular type) fits the data. One of the most common uses of this measure is to evaluate which type of hierarchical clustering is best for a particular type of data.

Example 8.12 (Cophenetic Correlation Coefficient). We calculated the CPCC for the hierarchical clusterings shown in Figures 8.16-8.1g.These values are shown in Table 8.8. The hierarchical clustering produced by the single link technique seems to fit the data less well than the clusterings produced by complete link, group average, and Ward’s method.

Table 8.8. Cophenetic correlation coefficient for data of Table 8.3 and four agglomerative hierarchical clustering techniques.

Technique CPCC Single Link

Complete Link Group Average

Ward’s

0.44 0.63 0.66 0.64

8.5.5 Determining the Correct Number of Clusters

Various unsupervised cluster evaluation measures can be used to approxi- mately determine the correct or natural number of clusters.

Example 8.13 (Number of Clusters). The data set of Figure 8.29 has 10 natural clusters. Figure 8.32 shows a plot of the SSE versus the number of clusters for a (bisecting) K-means clustering of the data set, while Figure 8.33 shows the average silhouette coefficient versus the number of clusters for the same data. There is a distinct knee in the SSE and a distinct peak in the silhouette coefficient when the number of clusters is equal to 10. r

Thus, we can try to find the natural number of clusters in a data set by looking for the number of clusters at which there is a knee, peak, or dip in the plot of the evaluation measure when it is plotted against the number of clusters. Of course, such an approach does not always work well. Clusters may be considerably more intertwined or overlapping than those shown in Figure 8.29. Also, the data may consist of nested clusters. Actually, the clusters in Figure 8.29 are somewhat nested; i.e., there are 5 pairs of clusters since the clusters are closer top to bottom than they are left to right. There is a knee that indicates this in the SSE curve, but the silhouette coefficient curve is not

gr-br.v*,’F/r”P 8.5 Cluster Evaluation 547

0

0

o

a

0 5 5

0 5

0 4 5

0 4

0 3 5

0 3 0 5 1 0 1 5 2 0 2 5 3 0

Number of Clusters

Figure 8.32, SSE versus number of clusters for the data of Figure 8.29.

0 5 1 0 1 5 2 0 2 5 3 0 Number of ClusteF

Figure 8.33. Average silhouette coefficient ver- sus number of clusters for the data of Figure 8.29,

as clear. In summary, while caution is needed, the technique we have just

described can provide insight into the number of clusters in the data.

8.5.6 Clustering Tendency

One obvious way to determine if a data set has clusters is to try to cluster it. However, almost all clusterirrg algorithms will dutifully find clusters when given data. To address this issue, we could evaluate the resulting clusters and only claim that a data set has clusters if at least some of the clusters are of good quality. However, this approactr does not address the fact the clusters in the

data can be of a different type than those sought by our clustering algorithm. To handle this additional problem, we could use multiple algorithms and again evaluate the quality of the resulting clusters. If the clusters are uniformly poor,

then this may indeed indicate that there are no clusters in the data. Alternatively, and this is the focus of measures of clustering tendency, we

can try to evaluate whether a data set has clusters without clustering. The most common approach, especially for data in Euclidean space, has been to

use statistical tests for spatial randomness. Unfortunately, choosing the cor- rect model, estimating the parameters, and evaluating the statistical signifi- cance of the hypothesis that the data is non-random can be quite challenging. Nonetheless, many approaches have been developed, most of them for points

in low-dimensional Euclidean space.

Exarnple 8.14 (Hopkins Statistic). For this approach, we generatep points

that are randomly distributed across the data space and also sample p actual

548 Chapter 8 Cluster Analysis: Basic Concepts and Algorithms

data points. For both sets of points we find the distance to the nearest neigh- bor in the original data set. Let theui be the nearest neighbor distances ofthe artificially generated points, while the ui are the nearest neighbor distances of the sample of points from the original data set. The Hopkins statistic fI is then defined by Equation 8.17.

H- DT:t.t (8 .17) D,Pl:rut, + Dl:twt

If the randomly generated points and the sample of data points have roughly the same nearest neighbor distances, then 11 will be near 0.5. Values of fI near 0 and 1 indicate, respectively, data that is highly clustered and data that is regularly distributed in the data space. To give an example, the Hopkins statistic for the data of Figure 8.26 was computed for p : 20 and 100 different trials. The average value of 11 was 0.56 with a standard deviation of 0.03. The same experiment was performed for the well-separated points of Figure 8.30. The average value of fI was 0.95 with a standard deviation of 0.006. r

8.5.7 Supervised Measures of Cluster Validity

When we have external information about data, it is typically in the form of externally derived class labels for the data objects. In such casesT the usual procedure is to measure the degree of correspondence between the cluster labels and the class labels. But why is this of interest? After all, if we have the class labels, then what is the point in performing a cluster analysis? Motivations for such an analysis are the comparison of clustering techniques with the “ground truth” or the evaluation of the extent to which a manual classification process can be automatically produced by cluster analysis.

We consider two different kinds of approaches. The first set of techniques use measures from classification, such as entropy, purity, and the F-measure. These measures evaluate the extent to which a cluster contains objects of a single class. The second group of methods is related to the similarity measures for binary data, such as the Jaccard measure that we saw in Chapter 2. These approaches measure the extent to which two objects that are in the same class are in the same cluster and vice versa. For convenience, we will refer to these two types of measures as classification-oriented and similarity-oriented, respectively.

8 .5 Cluster Evaluation 549

Classification-Oriented Measures of Cluster Validity

There are a number of measures entropy, purity, precision, recall, and the F-measure that are commonly used to evaluate the performance of a classi- fication model. In the case of classification? we measure the degree to which predicted class labels correspond to actual class labels, but for the measures just mentioned, nothing fundamental is changed by using cluster labels in- stead of predicted class labels. Next, we quickly review the definitions of these measures, which were discussed in Chapter 4.

Entropy: The degree to which each cluster consists of objects of a single class. For each cluster, the class distribution ofthe data is calculated first, i.e., for cluster jl we compute pU, the probability that a member of cluster ri belongs to class j as pU : mt j /mt, where mi is the number of objects in cluster i, and mii is the number of objects of class j in cluster i. Using this class distribution, the entropy of each cluster ri is calculated using the standard formula, €i == -Di-tptilogzptj, where tr is the number of classes. The total entropy for a set of clusters is calculated as the sum of the e_ntropies of each cluster weighted by the size of each cluster, i.e., e : DLtffei, where 1{ is the number of clusters and rn is the total number of data points.

Purity: Another measure of ttLe extent to which a cluster contains objects of a single class. Using the previous terminology, the purity of cluster i is pi : rta;xpri, the overall purity of a clustering is puri,ta : Di:tTpt,.

Precision: The fraction of a cluster that consists of objects of a specified class. The precision of cluster i with respect to class j is preci,si,on(i., j) : pij.

Recall: The extent to which a cluster contains all objects of a specified class. The recall of cluster z with respect to class j is recall(i., j) : miif mi, where m3 is the number of objects in class j.