F-measure A combination of both precision and recall that measures the extent to which a cluster contains only objects of a particular class and all objects of that class. The F-measure of cluster i with respect to class 3 is F(i , , j ) : (2xpreci ,s i ,on( i , j )xrecal l ( i , j ) ) l (precision( i , , j )*recal l ( i , j ) ) .
Example 8.15 (Supervised Evaluation Measures). We present an exam- ple to illustrate these measures. Specifically, we use K-means with the cosine similarity measure to cluster 3204 newspaper articles from the Los Angeles
Tabfe 8.9. K-means clustering results for the LA Times document data set.
Cluster Enter- tainment
Financial Foreign Metro National Sports Entropy Purity
1 3 K 40 506 96 27 r.2270 0.7474 2 4 7 280 29 39 2 1.1472 0.7756 .-) 1 1 I 7 4 677 0.1813 0.9796 4 10 r62 ,f 1 1 9 I J 2 r.7487 0.4390 r 331 22 r 70 I J 23 1.3976 0.7134 o ( 358 1 2 2r2 48 I J r.5523 0.5525
Total D < A cDo 34r 943 273 738 1.1450 0.7203
550 Chapter 8 Cluster Analysis: Basic Concepts and Algorithms
Ti,mes. These articles come from six different classes: Entertainment, Finan- cial, Foreign, Metro, National, and Sports. Table 8.9 shows the results of a K-means clustering to find six clusters. The first column indicates the clus- ter, while the next six columns together form the confusion matrix; i.e., these columns indicate how the documents of each category are distributed among the clusters. The last two columns are the entropy and purity of each cluster, respectively.
Ideally, each cluster will contain documents from only one class. In reality, each cluster contains documents from many classes. Nevertheless, many clus- ters contain documents primarily from just one class. In particular, cluster 3, which contains mostly documents from the Sports section, is exceptionally good, both in terms of purity and entropy. The purity and entropy of the other clusters is not as good, but can typically be greatly improved if the data is partitioned into a larger number of clusters.
Precision. recall. and the F-measure can be calculated for each cluster. To give a concrete example, we consider cluster 1 and the Metro class of Table 8.9. The precision is 5061677: 0.75, recall is 5061943:0.26, and hence, the F value is 0.39. In contrast, the F value for cluster 3 and Sports is 0.94. r
Similarity-Oriented Measures of Cluster Validity
The measures that we discuss in this section are all based on the premise that any two objects that are in the same cluster should be in the same class and vice versa. We can view this approach to cluster validity as involving the comparison of two matrices: (1) the ideal cluster similarity rnatrix discussed previously, which has a 1 in the i,jth entry if two objects, i. and j, are in the same cluster and 0, otherwise, and (2) an ideal class similarity matrix defined with respect to class labels, which has a 1 in the i,jth entry if
Cluster Evaluation 551
two objects, i and j, belong to the same class, and a 0 otherwise. As before, we can take the correlation of these two matrices as the measure of cluster validity. This measure is known as the f statistic in clustering validation literature.
Example 8.16 (Correlation between Cluster and Class Matrices). To demonstrate this idea more concretely, we give an example involving five data points, pt tpztpstp4,p5, two clusters, Ct: {pt ,p2,p3} and Cz: {pq,p5}, and two classes, Lr : {pt,,pz} and L2 : {pz,p+,p5}. The ideal cluster and class similarity matrices are given in Tables 8.10 and 8.11. The correlation between the entries of these two matrices is 0.359.
I
More generally, we can use any of the measures for binary similarity that we sav/ in Section 2.4.5. (For erxample, we can convert these two matrices into
binary vectors by appending the rows.) We repeat the definitions of the four quantities used to define those similarity measures, but modify our descriptive text to fit the current context. Specifically, we need to compute the following four quantities for all pairs of distinct objects. (There are m(m – I) 12 stch pairs, if m is the number of objects.)
,foo : number of pairs of objects having a different class and a different cluster
.for : number of pairs of objects having a different class and the same cluster
,fio : number of pairs of objects having the same class and a different cluster
.fir : number of pairs of objects having the same class and the same cluster
In particular, the simple matching coefficient, which is known as the Rand statistic in this context, and the Jaccard coefficient are two of the most fre- quently used cluster validity measures.
foo -l ft
8 .5
Table 8.10. ldeal cluster similaritv matrix.
Point pl p2 p3 p4 p5 p1 p2 p3 p4 p5
1 1 1 0 0 1 1 1 0 0 1 1 1 0 0 0 0 0 1 1 0 0 0 1 1
Table 8.11. ldeal class similariW matrix.
Point p1 p2 p3 p4 p5 p1 p2 p3 p4 p5
1 1 0 0 0 1 1 0 0 0 0 0 1 1 1 0 0 1 1 1 0 0 1 1 1
Rand statistic : ,foo * “for
-l frc -l fn (8.18)
552 Chapter 8 Cluster Analysis: Basic Concepts and Algorithms
Jaccard coefficient : (8.1e) fot -t fn’l ft
Example 8.17 (Rand and Jaccard Measures). Based on these formulas, we can readily compute the Rand statistic and Jaccard coefficient for the example based on Tables 8.10 and 8.11. Not ing that foo: 4, fot :2, f rc:2, and /rr : 2, the Rand statistic : (2 + 4) 110: 0.6 and the Jaccard coefficient :2 lQ+2+2) :0 .33 . r
We also note that the four quantities, foo, fot, fis, and fi1, define a con- t’ingency table as shown in Table 8.12.
Table 8.12. Two-way contingency table for determining whether pairs of objects are in the same class and same cluster.
Same Cluster Different Cluster Same Ulass . t 1 7 Iro
Diflerent Class . l 07 ./ 00
Previously, in the context of association analysis-see Section 6.7.1-we presented an extensive discussion of measures of association that can be used for this type of contingency table. (Compare Table 8.12 with Table 6.7.) Those measures can also be applied to cluster validity.
Cluster Validity for Hierarchical Clusterings
So far in this section, we have discussed supervised measures of cluster va- lidity only for partitional clusterings. Supervised evaluation of a hierarchical clustering is more difficult for a variety of reasons, including the fact that a preexisting hierarchical structure often does not exist. Here, we will give an example of an approach for evaluating a hierarchical clustering in terms of a (flat) set of class labels, which are more likely to be available than a preexisting hierarchical structure.
The key idea of this approach is to evaluate whether a hierarchical clus- tering contains, for each class, at least one cluster that is relatively pure and includes most of the objects of that class. To evaluate a hierarchical cluster- ing with respect to this goal, we compute, for each class, the F-measure for each cluster in the cluster hierarchy. For each class, we take the maximum F- measure attained for any cluster. Finally, we calculate an overall F-measure for the hierarchical clustering by computing the weighted average of all per-class F-measures, where the weights are based on the class sizes. More formally,
. l 1 r
8.5 Cluster Evaluation 553
this hierarchical F-measure is defined as follows:
F-
where the maximum is taken over all clusters z at all levels, m1 is the number
of objects in class j, and rn is the total number of objects.
8.5.8 Assessing the Significance of Cluster Validity Measures
Cluster validity measures are intended to help us measure the goodness of the clusters that we have obtained. Indeed, they typically give us a single number
as a measure of that goodness. However, we are then faced with the problem
of interpreting the significance of this number, a task that may be even more
difficult. The minimum and maximum values of cluster evaluation measures may
provide some guidance in many cases. For instance, by definition, a purity of
0 is bad, while a purity of 1 is good, at least if we trust our class labels and want our cluster structure to reflect the class structure. Likewise, an entropy
of 0 is good, as is an SSE of 0. Sometimes, however, there may not be a minimum or maximum value,
or the scale of the data may affect the interpretation. Also, even if there are minimum and maximum values with obvious interpretations, intermediate
values still need to be interpreted. In some cases, we can use an absolute
standard. If, for example, we Erre clustering for utility, we may be willing to
tolerate only a certain level of error in the approximation of our points by a
cluster centroid. But if this is not the case, then we must do something else. A common
approach is to interpret the value of our validity measure in statistical terms.
Specifically, we attempt to judge how likely it is that our observed value may
be achieved by random chance. The value is good if it is unusual; i.e., if it is
unlikely to be the result of random chance. The motivation for this approach is that we are only interested in clusters that reflect non-random structure in
the data, and such structures should generate unusually high (low) values of
our cluster validity measure, at least if the validity measures are designed to
reflect the presence of strong cluster structure.
Example 8.18 (Significance of SSE). To show how this works, we present
an example based on K-means and the SSE. Suppose that we want a measure of
how good the well-separated clusters of Figure 8.30 are with respect to random
data. We generate many random sets of 100 points having the same range as
S- 7Zl
z
554 Chapter 8 Cluster Analysis: Basic Concepts and Algorithms
50
Figure 8.34. Histogram of SSE for 500 random data sets.
the points in the three clusters, find three clusters in each data set using K- means, and accumulate the distribution of SSE values for these clusterings. By using this distribution of the SSE values, we can then estimate the probability of the SSE value for the original clusters. Figure 8.34 shows the histogram of the SSE from 500 random runs. The lowest SSE shown in Figure 8.34 is 0.0173. For the three clusters of Figure 8.30, the SSE is 0.0050. We could therefore conservatively claim that there is less than a I%o chance that a clustering such as that of Figure 8.30 could occur by chance. r
To conclude, we stress that there is more to cluster evaluation-supervised or unsupervised than obtaining a numerical measure of cluster validity. Un- less this value has a natural interpretation based on the definition of the mea- sure, we need to interpret this value in some way. If our cluster evaluation measure is defined such that lower values indicate stronger clusters, then we can use statistics to evaluate whether the value we have obtained is unusually low, provided we have a distribution for the evaluation measure. We have pre- sented an example of how to find such a distribution, but there is considerably more to this topic, and we refer the reader to the bibliographic notes for more pointers.
Finally, even when an evaluation measure is used as a relative measure, i.e., to compare two clusterings, we still need to assess the significance in the difference between the evaluation measures of the two clusterings. Although one value will almost always be better than another, it can be difficult to determine if the difference is significant. Note that there are two aspects to this significance: whether the difference is statistically significant (repeatable)
BibliographicNotes 555
and whether the magnitude of the difference is meaningful with respect to the
application. Many would not regard a difference of 0.1% as significant, even if
it is consistently reproducible.
8.6 Bibliographic Notes
Discussion in this chapter has been most heavily influenced by the books on
cluster analysis written by Jain and Dubes [396], Anderberg [374], and Kauf-
man and Rousseeuw [400]. Additional clustering books that may also be of
interest include those by Aldenderfer and Blashfield [373], Everitt et al. [388], Hartigan [394], Mirkin [405], Murtagh [407], Romesburg [409], and Spzith [413]. A more statistically oriented approach to clustering is given by the pattern
recognition book of Duda et al. [385], the machine learning book of Mitchell
[406], and the book on statistical learning by Hastie et al. 1395]. A general
survey of clustering is given by Jain et al. 1394, while a survey of spatial data
mining techniques is provided by Han et al. 1393]. Behrkin [379] provides a
survey of clustering techniques for data mini