Example 8.4 (Single Link). Figure 8.16 shows the result of applying the single link technique to our example data set of six points. Figure 8.16(a) shows the nested clusters as a sequence of nested ellipses, where the numbers associated with the ellipses indicate the order of the clustering. Figure 8.16(b) shows the same information, but as a dendrogram. The height at which two clusters are merged in the dendrogram reflects the distance between the two clusters. For instance, from Table 8.4, we see that the distance between points 3 and 6

520 Chapter 8 Cluster Analysis: Basic Concepts and Algorithms

Figure 8.16. Single link clustering of the six points shown in Figure 8.15: (a) single link clustering; (b) single link dendrogram.

is 0.11, and that is the height at which they are joined into one cluster in the dendrogram. As another example, the distance between clusters {3,6} and {2,5} is given by

dist({3,6}, {2,5}) = min(dist(3,2), dist(6,2), dist(3,5), dist(6,5))
                   = min(0.15, 0.25, 0.28, 0.39)
                   = 0.15.
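The MIN computation above can be sketched in a few lines of Python. This is an illustration, not the book's code; the pairwise distances are the Table 8.4 values quoted in the text.

```python
# Single link (MIN) cluster proximity, using the pairwise
# distances from Table 8.4 that are quoted in Example 8.4.
dist = {
    (2, 3): 0.15, (2, 6): 0.25,
    (3, 5): 0.28, (3, 6): 0.11,
    (5, 6): 0.39,
}

def d(i, j):
    """Look up a pairwise distance regardless of point order."""
    return dist[(min(i, j), max(i, j))]

def single_link(c1, c2):
    """MIN proximity: smallest distance over all cross-cluster pairs."""
    return min(d(i, j) for i in c1 for j in c2)

print(single_link({3, 6}, {2, 5}))  # 0.15, matching the text
```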


Complete Link or MAX or CLIQUE

For the complete link or MAX version of hierarchical clustering, the proximity of two clusters is defined as the maximum of the distance (minimum of the similarity) between any two points in the two different clusters. Using graph terminology, if you start with all points as singleton clusters and add links between points one at a time, shortest links first, then a group of points is not a cluster until all the points in it are completely linked, i.e., form a clique. Complete link is less susceptible to noise and outliers, but it can break large clusters and it favors globular shapes.

Example 8.5 (Complete Link). Figure 8.17 shows the results of applying MAX to the sample data set of six points. As with single link, points 3 and 6

Agglomerative Hierarchical Clustering 521

Figure 8.17. Complete link clustering of the six points shown in Figure 8.15: (a) complete link clustering; (b) complete link dendrogram.

are merged first. However, {3,6} is merged with {4}, instead of {2,5} or {1}, because

dist({3,6}, {4}) = max(dist(3,4), dist(6,4)) = max(0.15, 0.22) = 0.22.

dist({3,6}, {2,5}) = max(dist(3,2), dist(6,2), dist(3,5), dist(6,5)) = max(0.15, 0.25, 0.28, 0.39) = 0.39.

dist({3,6}, {1}) = max(dist(3,1), dist(6,1)) = max(0.22, 0.23) = 0.23.
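The MAX comparisons above can be sketched the same way. This is an illustration, not the book's code; the distances are again the Table 8.4 values quoted in the text.

```python
# Complete link (MAX) cluster proximity for Example 8.5, using the
# pairwise distances from Table 8.4 that are quoted in the text.
dist = {
    (1, 3): 0.22, (1, 6): 0.23,
    (2, 3): 0.15, (2, 6): 0.25,
    (3, 4): 0.15, (3, 5): 0.28,
    (4, 6): 0.22, (5, 6): 0.39,
}

def d(i, j):
    """Look up a pairwise distance regardless of point order."""
    return dist[(min(i, j), max(i, j))]

def complete_link(c1, c2):
    """MAX proximity: largest distance over all cross-cluster pairs."""
    return max(d(i, j) for i in c1 for j in c2)

# {3,6} is closest to {4} under MAX, so {3,6,4} forms next.
print(complete_link({3, 6}, {4}))     # 0.22
print(complete_link({3, 6}, {2, 5}))  # 0.39
print(complete_link({3, 6}, {1}))     # 0.23
```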

Group Average

For the group average version of hierarchical clustering, the proximity of two clusters is defined as the average pairwise proximity among all pairs of points in the different clusters. This is an intermediate approach between the single and complete link approaches. Thus, for group average, the cluster proximity proximity(Ci, Cj) of clusters Ci and Cj, which are of size mi and mj, respectively, is expressed by the following equation:

proximity(Ci, Cj) = ( Σ x∈Ci, y∈Cj proximity(x, y) ) / (mi × mj).   (8.6)

Figure 8.18. Group average clustering of the six points shown in Figure 8.15: (a) group average clustering; (b) group average dendrogram.

Example 8.6 (Group Average). Figure 8.18 shows the results of applying the group average approach to the sample data set of six points. To illustrate how group average works, we calculate the distance between some clusters.

dist({3,6,4}, {1}) = (0.22 + 0.37 + 0.23) / (3 × 1) = 0.28

dist({2,5}, {1}) = (0.2357 + 0.3421) / (2 × 1) = 0.2889

dist({3,6,4}, {2,5}) = (0.15 + 0.28 + 0.25 + 0.39 + 0.20 + 0.29) / (3 × 2) = 0.26

Because dist({3,6,4}, {2,5}) is smaller than dist({3,6,4}, {1}) and dist({2,5}, {1}), clusters {3,6,4} and {2,5} are merged at the fourth stage.
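Equation 8.6 can be sketched directly. This is an illustration, not the book's code; the distances are the Table 8.4 values quoted in the examples, rounded to the precision the text gives (so the first result comes out ≈ 0.27 rather than the 0.28 the text reports from unrounded values).

```python
# Group average proximity (Equation 8.6), using the pairwise
# distances from Table 8.4 that are quoted in the text.
dist = {
    (1, 2): 0.2357, (1, 3): 0.22, (1, 4): 0.37, (1, 5): 0.3421, (1, 6): 0.23,
    (2, 3): 0.15, (2, 4): 0.20, (2, 6): 0.25,
    (3, 5): 0.28, (4, 5): 0.29, (5, 6): 0.39,
}

def d(i, j):
    """Look up a pairwise distance regardless of point order."""
    return dist[(min(i, j), max(i, j))]

def group_average(c1, c2):
    """Average of all pairwise distances across the two clusters."""
    total = sum(d(i, j) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

print(round(group_average({3, 6, 4}, {1}), 4))     # 0.2733 (text: 0.28)
print(round(group_average({2, 5}, {1}), 4))        # 0.2889
print(round(group_average({3, 6, 4}, {2, 5}), 4))  # 0.26
```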



Figure 8.19. Ward’s clustering of the six points shown in Figure 8.15: (a) Ward’s clustering; (b) Ward’s dendrogram.

Ward’s Method and Centroid Methods

For Ward’s method, the proximity between two clusters is defined as the increase in the squared error that results when two clusters are merged. Thus, this method uses the same objective function as K-means clustering. While it may seem that this feature makes Ward’s method somewhat distinct from other hierarchical techniques, it can be shown mathematically that Ward’s method is very similar to the group average method when the proximity between two points is taken to be the square of the distance between them.
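This definition can be sketched directly as "SSE after the merge minus SSE before the merge". The sketch below is an illustration, not the book's code, and the point coordinates are made up for the example.

```python
# Ward's proximity: the increase in total SSE caused by merging
# two clusters. Coordinates here are illustrative, not from Figure 8.15.
def centroid(points):
    """Componentwise mean of a list of points (tuples)."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(len(points[0])))

def sse(points):
    """Sum of squared distances of the points to their centroid."""
    c = centroid(points)
    return sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in points)

def ward_proximity(c1, c2):
    """SSE of the merged cluster minus the SSEs of the separate clusters."""
    return sse(c1 + c2) - sse(c1) - sse(c2)

a = [(0.0, 0.0), (0.0, 1.0)]
b = [(5.0, 0.0)]
print(ward_proximity(a, b))  # ≈ 16.8333
```

The result agrees with the closed form ΔSSE = (n1 n2 / (n1 + n2)) ‖c1 − c2‖², which is why Ward's method fits the Lance-Williams framework of Section 8.3.3.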

Example 8.7 (Ward’s Method). Figure 8.19 shows the results of applying Ward’s method to the sample data set of six points. The clustering that is produced is different from those produced by single link, complete link, and group average.

Centroid methods calculate the proximity between two clusters by calculating the distance between the centroids of the clusters. These techniques may seem similar to K-means, but as we have remarked, Ward’s method is the correct hierarchical analog.

Centroid methods also have a characteristic, often considered bad, that is not possessed by the other hierarchical clustering techniques that we have discussed: the possibility of inversions. Specifically, two clusters that are merged may be more similar (less distant) than the pair of clusters that were merged in a previous step. For the other methods, the distance between


Table 8.5. Table of Lance-Williams coefficients for common hierarchical clustering approaches.

Clustering Method | αA | αB | β | γ
Single Link       | 1/2 | 1/2 | 0 | −1/2
Complete Link     | 1/2 | 1/2 | 0 | 1/2
Group Average     | mA/(mA + mB) | mB/(mA + mB) | 0 | 0
Centroid          | mA/(mA + mB) | mB/(mA + mB) | −mA mB/(mA + mB)² | 0
Ward’s            | (mA + mQ)/(mA + mB + mQ) | (mB + mQ)/(mA + mB + mQ) | −mQ/(mA + mB + mQ) | 0

merged clusters monotonically increases (or is, at worst, non-decreasing) as we proceed from singleton clusters to one all-inclusive cluster.
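A small sketch makes the inversion concrete; the coordinates are illustrative, not the book's data. For an equilateral triangle, the first (centroid-method) merge happens at distance 1, but the centroid of the merged pair lies only √3/2 ≈ 0.87 from the third point, so the second merge happens at a *smaller* distance than the first.

```python
# Demonstrating a centroid-method inversion on an equilateral triangle.
import math

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

a, b, c = (0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2)

first_merge = dist(a, b)             # 1.0: merge the two closest points
ab_centroid = (0.5, 0.0)             # centroid of the merged cluster {a, b}
second_merge = dist(ab_centroid, c)  # sqrt(3)/2 ≈ 0.866

print(second_merge < first_merge)  # True: an inversion
```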

8.3.3 The Lance-Williams Formula for Cluster Proximity

Any of the cluster proximities that we have discussed in this section can be viewed as a choice of different parameters (in the Lance-Williams formula shown below in Equation 8.7) for the proximity between clusters Q and R, where R is formed by merging clusters A and B. In this equation, p(·, ·) is a proximity function, while mA, mB, and mQ are the number of points in clusters A, B, and Q, respectively. In other words, after we merge clusters A and B to form cluster R, the proximity of the new cluster, R, to an existing cluster, Q, is a linear function of the proximities of Q with respect to the original clusters A and B. Table 8.5 shows the values of these coefficients for the techniques that we have discussed.

p(R, Q) = αA p(A, Q) + αB p(B, Q) + β p(A, B) + γ |p(A, Q) − p(B, Q)|   (8.7)

Any hierarchical clustering technique that can be expressed using the Lance-Williams formula does not need to keep the original data points. Instead, the proximity matrix is updated as clustering occurs. While a general formula is appealing, especially for implementation, it is easier to understand the different hierarchical methods by looking directly at the definition of cluster proximity that each method uses.
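One update step of Equation 8.7 can be sketched as follows. This is an illustration, not the book's code; the coefficients are the single link row of Table 8.5, and the numeric proximities are the values from Example 8.4.

```python
# One Lance-Williams update (Equation 8.7): after merging A and B into R,
# p(R, Q) is a linear combination of p(A, Q), p(B, Q), and p(A, B).
def lance_williams(pAQ, pBQ, pAB, alpha_a, alpha_b, beta, gamma):
    return alpha_a * pAQ + alpha_b * pBQ + beta * pAB + gamma * abs(pAQ - pBQ)

# Single link coefficients (1/2, 1/2, 0, -1/2) reduce the update
# to min(pAQ, pBQ), as the definition of MIN proximity requires.
p = lance_williams(0.15, 0.25, 0.11, 0.5, 0.5, 0.0, -0.5)
print(round(p, 6))  # 0.15 == min(0.15, 0.25)
```

Swapping γ = −1/2 for γ = 1/2 (the complete link row) turns the same update into max(pAQ, pBQ), which is why only the proximity matrix, not the data, needs to be kept.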

8.3.4 Key Issues in Hierarchical Clustering

Lack of a Global Objective Function

We previously mentioned that agglomerative hierarchical clustering cannot be viewed as globally optimizing an objective function. Instead, agglomerative hierarchical clustering techniques use various criteria to decide locally, at each


step, which clusters should be merged (or split for divisive approaches). This approach yields clustering algorithms that avoid the difficulty of attempting to solve a hard combinatorial optimization problem. (It can be shown that the general clustering problem for an objective function such as “minimize SSE” is computationally infeasible.) Furthermore, such approaches do not have problems with local minima or difficulties in choosing initial points. Of course, the time complexity of O(m² log m) and the space complexity of O(m²) are prohibitive in many cases.

Ability to Handle Different Cluster Sizes

One aspect of agglomerative hierarchical clustering that we have not yet discussed is how to treat the relative sizes of the pairs of clusters that are merged. (This discussion applies only to cluster proximity schemes that involve sums, such as centroid, Ward’s, and group average.) There are two approaches: weighted, which treats all clusters equally, and unweighted, which takes the number of points in each cluster into account. Note that the terminology of weighted or unweighted refers to the data points, not the clusters. In other words, treating clusters of unequal size equally gives different weights to the points in different clusters, while taking the cluster size into account gives points in different clusters the same weight.

We will illustrate this using the group average technique discussed in Section 8.3.2, which is the unweighted version of the group average technique. In the clustering literature, the full name of this approach is the Unweighted Pair Group Method using Arithmetic averages (UPGMA). In Table 8.5, which gives the formula for updating cluster similarity, the coefficients for UPGMA involve the size of each of the clusters that were merged: αA = mA/(mA + mB), αB = mB/(mA + mB), β = 0, γ = 0. For the weighted version of group average, known as WPGMA, the coefficients are constants: αA = 1/2, αB = 1/2, β = 0, γ = 0. In general, unweighted approaches are preferred unless there is reason to believe that individual points should have different weights; e.g., perhaps classes of objects have been unevenly sampled.
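The contrast between UPGMA and WPGMA can be sketched through their coefficients. This is an illustration, not the book's code; the cluster sizes and proximities below are made up for the example.

```python
# UPGMA vs. WPGMA: same update rule, different alpha coefficients
# (beta = gamma = 0 for both group average variants).
def update(pAQ, pBQ, alpha_a, alpha_b):
    return alpha_a * pAQ + alpha_b * pBQ

m_a, m_b = 3, 1        # an unequal-sized pair of merged clusters (illustrative)
pAQ, pBQ = 0.2, 0.6    # hypothetical proximities of A and B to some cluster Q

upgma = update(pAQ, pBQ, m_a / (m_a + m_b), m_b / (m_a + m_b))
wpgma = update(pAQ, pBQ, 0.5, 0.5)

print(round(upgma, 6))  # 0.3: points count equally, so A's 3 points dominate
print(round(wpgma, 6))  # 0.4: clusters count equally
```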

Merging Decisions Are Final

Agglomerative hierarchical clustering algorithms tend to make good local decisions about combining two clusters since they can use information about the pairwise similarity of all points. However, once a decision is made to merge two clusters, it cannot be undone at a later time. This approach prevents a local optimization criterion from becoming a global optimization criterion.


For example, although the “minimize squared error” criterion from K-means is used in deciding which clusters to merge in Ward’s method, the clusters at each level do not represent local minima with respect to the total SSE. Indeed, the clusters are not even stable, in the sense that a point in one cluster may be closer to the centroid of some other cluster than it is to the centroid of its current cluster. Nonetheless, Ward’s method is often used as a robust method of initializing a K-means clustering, indicating that a local “minimize squared error” objective function does have a connection to a global “minimize squared error” objective function.

There are some techniques that attempt to overcome the limitation that merges are final. One approach attempts to fix up the hierarchical clustering by moving branches of the tree around so as to improve a global objective function. Another approach uses a partitional clustering technique such as K-means to create many small clusters, and then performs hierarchical clustering using these small clusters as the starting point.

8.3.5 Strengths and Weaknesses