8.2 K-means 507

Reducing the SSE with Postprocessing

An obvious way to reduce the SSE is to find more clusters, i.e., to use a larger K. However, in many cases, we would like to improve the SSE, but don't want to increase the number of clusters. This is often possible because K-means typically converges to a local minimum. Various techniques are used to "fix up" the resulting clusters in order to produce a clustering that has lower SSE. The strategy is to focus on individual clusters since the total SSE is simply the sum of the SSE contributed by each cluster. (We will use the terminology total SSE and cluster SSE, respectively, to avoid any potential confusion.) We can change the total SSE by performing various operations on the clusters, such as splitting or merging clusters. One commonly used approach is to alternate cluster splitting and merging phases. During a splitting phase, clusters are divided, while during a merging phase, clusters are combined. In this way, it is often possible to escape local SSE minima and still produce a clustering solution with the desired number of clusters. The following are some techniques used in the splitting and merging phases.

Two strategies that decrease the total SSE by increasing the number of clusters are the following:

Split a cluster: The cluster with the largest SSE is usually chosen, but we could also split the cluster with the largest standard deviation for one particular attribute.

Introduce a new cluster centroid: Often the point that is farthest from any cluster center is chosen. We can easily determine this if we keep track of the SSE contributed by each point. Another approach is to choose randomly from all points or from the points with the highest SSE.

Two strategies that decrease the number of clusters, while trying to minimize the increase in total SSE, are the following:

Disperse a cluster: This is accomplished by removing the centroid that corresponds to the cluster and reassigning the points to other clusters. Ideally, the cluster that is dispersed should be the one that increases the total SSE the least.

Merge two clusters: The clusters with the closest centroids are typically chosen, although another, perhaps better, approach is to merge the two clusters that result in the smallest increase in total SSE. These two merging strategies are the same ones that are used in the hierarchical clustering techniques known as the centroid method and Ward's method, respectively. Both methods are discussed in Section 8.3.

508 Chapter 8 Cluster Analysis: Basic Concepts and Algorithms
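To make the merging criterion concrete, the increase in total SSE caused by merging two clusters can be computed in closed form from their sizes and centroids, without re-scanning the points; it is the same quantity that Ward's method minimizes. The following sketch (the function names are our own, not the text's) selects the pair of clusters whose merge increases the total SSE least:

```python
import numpy as np

def merge_cost(points_a, points_b):
    """SSE increase from merging two clusters, computed in closed form:
    n_a * n_b / (n_a + n_b) * ||c_a - c_b||^2, where c_a and c_b are
    the cluster centroids (the Ward merging cost)."""
    n_a, n_b = len(points_a), len(points_b)
    c_a, c_b = points_a.mean(axis=0), points_b.mean(axis=0)
    return n_a * n_b / (n_a + n_b) * ((c_a - c_b) ** 2).sum()

def best_merge(clusters):
    """Return (pair, cost) for the merge with the smallest total-SSE increase."""
    best_pair, best_cost = None, np.inf
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            cost = merge_cost(clusters[i], clusters[j])
            if cost < best_cost:
                best_pair, best_cost = (i, j), cost
    return best_pair, best_cost
```

Because only sizes and centroids enter the formula, the cost of every candidate merge can be evaluated without touching the individual points.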

Updating Centroids Incrementally

Instead of updating cluster centroids after all points have been assigned to a cluster, the centroids can be updated incrementally, after each assignment of a point to a cluster. Notice that this requires either zero or two updates to cluster centroids at each step, since a point either moves to a new cluster (two updates) or stays in its current cluster (zero updates). Using an incremental update strategy guarantees that empty clusters are not produced since all clusters start with a single point, and if a cluster ever has only one point, then that point will always be reassigned to the same cluster.
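The two updates themselves are cheap: a centroid is just the mean of its cluster's points, so adding a point to one cluster and removing it from another each amount to adjusting a running mean in O(d) time. A minimal sketch (the helper names are our own, not from the text):

```python
import numpy as np

def add_point(centroid, n, x):
    """New (centroid, size) after adding point x to a cluster of size n."""
    return centroid + (x - centroid) / (n + 1), n + 1

def remove_point(centroid, n, x):
    """New (centroid, size) after removing point x from a cluster of size n >= 2."""
    return centroid + (centroid - x) / (n - 1), n - 1
```

When a point switches clusters, `remove_point` is applied to its old cluster and `add_point` to its new one; no other centroids change.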

In addition, if incremental updating is used, the relative weight of the point being added may be adjusted; e.g., the weight of points is often decreased as the clustering proceeds. While this can result in better accuracy and faster convergence, it can be difficult to make a good choice for the relative weight, especially in a wide variety of situations. These update issues are similar to those involved in updating weights for artificial neural networks.

Yet another benefit of incremental updates has to do with using objectives other than "minimize SSE." Suppose that we are given an arbitrary objective function to measure the goodness of a set of clusters. When we process an individual point, we can compute the value of the objective function for each possible cluster assignment, and then choose the one that optimizes the objective. Specific examples of alternative objective functions are given in Section 8.5.2.
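As a sketch of this idea (the helpers below are illustrative, not from the text), the per-point decision only needs a scoring function for each candidate assignment; swapping in, say, the L1 distance changes the objective without changing the structure of the algorithm:

```python
import numpy as np

def assign_point(x, centroids, objective):
    """Assign point x to the cluster whose prototype optimizes the objective.

    `objective(x, c)` scores assigning x to the cluster with prototype c;
    smaller is better.
    """
    scores = [objective(x, c) for c in centroids]
    return int(np.argmin(scores))

def l1(x, c):
    """Example alternative objective: L1 (Manhattan) distance."""
    return np.abs(x - c).sum()
```

Any function of the point and the candidate cluster can be plugged in as `objective`, which is what makes the incremental scheme flexible.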

On the negative side, updating centroids incrementally introduces an order dependency. In other words, the clusters produced may depend on the order in which the points are processed. Although this can be addressed by randomizing the order in which the points are processed, the basic K-means approach of updating the centroids after all points have been assigned to clusters has no order dependency. Also, incremental updates are slightly more expensive. However, K-means converges rather quickly, and therefore, the number of points switching clusters quickly becomes relatively small.

8.2.3 Bisecting K-means

The bisecting K-means algorithm is a straightforward extension of the basic K-means algorithm that is based on a simple idea: to obtain K clusters, split the set of all points into two clusters, select one of these clusters to split, and

8 .2 K-means 509

so on, until K clusters have been produced. The details of bisecting K-means are given by Algorithm 8.2.

Algorithm 8.2 Bisecting K-means algorithm.
1: Initialize the list of clusters to contain the cluster consisting of all points.
2: repeat
3:    Remove a cluster from the list of clusters.
4:    {Perform several "trial" bisections of the chosen cluster.}
5:    for i = 1 to number of trials do
6:       Bisect the selected cluster using basic K-means.
7:    end for
8:    Select the two clusters from the bisection with the lowest total SSE.
9:    Add these two clusters to the list of clusters.
10: until the list of clusters contains K clusters.
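A rough, self-contained Python sketch of Algorithm 8.2 follows (this is our illustration, not the book's code). It splits the cluster with the largest SSE, performs several trial bisections with a simple Lloyd-style K-means, and keeps the bisection with the lowest SSE:

```python
import numpy as np

def kmeans(points, k, n_iter=20, rng=None):
    """Basic Lloyd-style K-means. Returns (labels, sse)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    centroids = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        d = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Recompute each centroid as the mean of its points (skip empty clusters).
        for j in range(k):
            if (labels == j).any():
                centroids[j] = points[labels == j].mean(axis=0)
    sse = ((points - centroids[labels]) ** 2).sum()
    return labels, sse

def bisecting_kmeans(points, k, n_trials=5):
    """Split the cluster with the largest SSE until k clusters remain."""
    clusters = [np.asarray(points, dtype=float)]
    while len(clusters) < k:
        # Choose the cluster with the largest SSE as the one to bisect.
        sses = [((c - c.mean(axis=0)) ** 2).sum() for c in clusters]
        target = clusters.pop(int(np.argmax(sses)))
        # Try several bisections and keep the one with the lowest total SSE.
        best_labels, best_sse = None, np.inf
        for trial in range(n_trials):
            labels, sse = kmeans(target, 2, rng=np.random.default_rng(trial))
            if sse < best_sse:
                best_labels, best_sse = labels, sse
        clusters += [target[best_labels == 0], target[best_labels == 1]]
    return clusters
```

The split criterion on line "largest SSE" is just one of the choices discussed below; using cluster size, or a combination of size and SSE, is a one-line change.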

There are a number of different ways to choose which cluster to split. We can choose the largest cluster at each step, choose the one with the largest SSE, or use a criterion based on both size and SSE. Different choices result in different clusters.

We often refine the resulting clusters by using their centroids as the initial centroids for the basic K-means algorithm. This is necessary because, although the K-means algorithm is guaranteed to find a clustering that represents a local minimum with respect to the SSE, in bisecting K-means we are using the K-means algorithm "locally," i.e., to bisect individual clusters. Therefore, the final set of clusters does not represent a clustering that is a local minimum with respect to the total SSE.

Example 8.3 (Bisecting K-means and Initialization). To illustrate that bisecting K-means is less susceptible to initialization problems, we show, in Figure 8.8, how bisecting K-means finds four clusters in the data set originally shown in Figure 8.6(a). In iteration 1, two pairs of clusters are found; in iteration 2, the rightmost pair of clusters is split; and in iteration 3, the leftmost pair of clusters is split. Bisecting K-means has less trouble with initialization because it performs several trial bisections and takes the one with the lowest SSE, and because there are only two centroids at each step.

Finally, by recording the sequence of clusterings produced as K-means bisects clusters, we can also use bisecting K-means to produce a hierarchical clustering.

(a) Iteration 1. (b) Iteration 2. (c) Iteration 3.

Figure 8.8. Bisecting K-means on the four clusters example.

8.2.4 K-means and Different Types of Clusters

K-means and its variations have a number of limitations with respect to finding different types of clusters. In particular, K-means has difficulty detecting the "natural" clusters, when clusters have non-spherical shapes or widely different sizes or densities. This is illustrated by Figures 8.9, 8.10, and 8.11. In Figure 8.9, K-means cannot find the three natural clusters because one of the clusters is much larger than the other two, and hence, the larger cluster is broken, while one of the smaller clusters is combined with a portion of the larger cluster. In Figure 8.10, K-means fails to find the three natural clusters because the two smaller clusters are much denser than the larger cluster. Finally, in Figure 8.11, K-means finds two clusters that mix portions of the two natural clusters because the shape of the natural clusters is not globular.

The difficulty in these three situations is that the K-means objective function is a mismatch for the kinds of clusters we are trying to find since it is minimized by globular clusters of equal size and density or by clusters that are well separated. However, these limitations can be overcome, in some sense, if the user is willing to accept a clustering that breaks the natural clusters into a number of subclusters. Figure 8.12 shows what happens to the three previous data sets if we find six clusters instead of two or three. Each smaller cluster is pure in the sense that it contains only points from one of the natural clusters.

8.2.5 Strengths and Weaknesses

K-means is simple and can be used for a wide variety of data types. It is also quite efficient, even though multiple runs are often performed. Some variants, including bisecting K-means, are even more efficient, and are less susceptible to initialization problems. K-means is not suitable for all types of data,

(a) Original points. (b) Three K-means clusters.

Figure 8.9. K-means with clusters of different size.

(a) Original points. (b) Three K-means clusters.

Figure 8.10. K-means with clusters of different density.

(a) Original points. (b) Two K-means clusters.

Figure 8.11. K-means with non-globular clusters.

(a) Unequal sizes. (b) Unequal densities. (c) Non-spherical shapes.

Figure 8.12. Using K-means to find clusters that are subclusters of the natural clusters.
