5O2 Chapter 8 Cluster Analysis: Basic Concepts and Algorithms

A A

r A n A A N A

a ^ a A

o +1o!loP “a 9 U v O v n

o ^o”o n” ” o o o

– –

A A

A

o t r t r

u n u O

o

tr on

* t r t r

t r t r

tr

(a) Optimal clustering (b) Suboptimal clustering.

Figure 8.4. Three optimal and non-optimal clusters.

the SSE for three clusters, while Figure 8.4(b) shows a suboptimal clustering that is only a local minimum.

Choosing the proper initial centroids is the key step of the basic K-means procedure. A common approach is to choose the initial centroids randomly, but the resulting clusters are often poor.

Exarnple 8.1 (Poor Initial Centroids). Randomly selected initial cen- troids may be poor. We provide an example of this using the same data set used in Figures 8.3 and 8.4. Figures 8.3 and 8.5 show the clusters that re- sult from two particular choices of initial centroids. (For both figures, the positions of the cluster centroids in the various iterations are indicated by crosses.) In Figure 8.3, even though all the initial centroids are from one natu- ral cluster, the minimum SSE clustering is still found. In Figure 8.5, however, even though the initial centroids seem to be better distributed, we obtain a suboptimal clustering, with higher squared error. r

Example 8.2 (Lirnits of Random Initialization). One technique that is commonly used to address the problem of choosing initial centroids is to perform multiple runs, each with a different set of randomly chosen initial centroids, and then select the set of clusters with the minimum SSE. While simple, this strategy may not work very well, depending on the data set and the number of clusters sought. We demonstrate this using the sample data set shown in Figure 8.6(a). The data consists of two pairs of clusters, where the clusters in each (top-bottom) pair are closer to each other than to the clusters in the other pair. Figure 8.6 (b d) shows that if we start with two initial centroids per pair of clusters, then even when both centroids are in a single

8.2 K-means 503

A A

” A r

A A A A ^

a ^61

” s “.”&”:4. v o ” v O- O O

o

D t r

^ A a ^aa ^-a o ofo.’ q :’:f”.4 t o ” o – o

tr o t r

o – S o

tr

tr

n t r v 0 6 o o o i ” o o ” – o *

o o o o

o o n U t r –

o o o

t r o t r – t r o

tr

tr o o

t r – o o

o

?_oo E o o

(a) Iteration 1. (b) Iteration 2, (c) Iteration 3. (d) Iteration 4.

Figure 8.5. Poor starting centroids for K-means.

cluster, the centroids will redistribute themselves so that the “true” clusters are found. However, Figure 8.7 shows that if a pair of clusters has only one initial centroid and the other pair has three, then two of the true clusters will be combined and one true cluster will be split.

Note that an optimal clustering will be obtained as long as two initial centroids fall anywhere in a pair of clusters, since the centroids will redistribute themselves, one to each cluster. Unfortunately, as the number of clusters becomes larger, it is increasingly likely that at least one pair of clusters will have only one initial centroid. (See Exercise 4 on page 559.) In this case, because the pairs of clusters are farther apart than clusters within a pair, the K-means algorithm will not redistribute the centroids between pairs of clusters, and thus, only a local minimum will be achieved. r

Because of the problems with using randomly selected initial centroids, which even repeated runs may not overcome, other techniques are often em- ployed for initialization. One effective approach is to take a sample of points and cluster them using a hierarchical clustering technique. K clusters are ex- tracted from the hierarchical clustering, and the centroids of those clusters are used as the initial centroids. This approach often works well, but is practical only if (1) the sample is relatively small, â‚¬.9., a few hundred to a few thousand (hierarchical clustering is expensive), and (2) K is relatively small compared to the sample size.

The following procedure is another approach to selecting initial centroids. Select the first point at random or take the centroid of all points. Then, for each successive initial centroid, select the point that is farthest from any of the initial centroids already selected. In this way, we obtain a set of initial

5O4 Chapter 8 Cluster Analysis: Basic Concepts and Algorithms

A

A

A

A

A

a o a a A-aA a

a a a a

A

(a) Initial points.

A

A A A A A ^

a a A a

t

o a O O

o o o v o

o

(b) Iteration 1.

A

A A A A A ^ a a ^

A A A A A

A { a t r a – t r

“o ifo t r t r

A A ^

A A

f;Sa A A A A

A

o a O Oo – ^ o o

o ” o o o v O 9

V

^oo 9- v _ : oo {Fot-

V \ /

v

tr tr trtr tr

t r n t r _ B u o

n n

t r t r

A

A

o o o Oln- o

A

A

tr trtr tr t r n t r

o o { o D ‘ t r

o t r

tr tr otr tr

t r d – o o t r t r

t r t r

(c) Iteration 2. (d) Iteration 3.

Figure 8.6, Two pairs of clusters with a pair of initial centroids within each pair of clusters.

centroids that is guaranteed to be not only randomly selected but also well separated. Unfortunately, such an approach can select outliers, rather than points in dense regions (clusters). AIso, it is expensive to compute the farthest point from the current set of initial centroids. To overcome these problems? this approach is often applied to a sample of the points. Since outliers are rare, they tend not to show up in a random sample. In contrast, points from every dense region are likely to be included unless the sample size is very small. Also, the computation involved in finding the initial centroids is greatly reduced because the sample size is typically much smaller than the number of points.

Later on, we will discuss two other approaches that are useful for produc- ing better-quality (lower SSE) clusterings: using a variant of K-means that

8.2 K-means 505

A

A

A

A

A

A

A

A

A

A A

a A ^ A A .

a a A a

+

A A A A A a A

A A A ^ A A .

A 4

(a) Iteration 1.

A A ^

A A

A A ^ A A ^

A A A A

o – ^ o oo”+o o o o

v o o

A A ^

A A

oot o a a A a

A

+ A

A A A A r A

a a ? n A A –

A a

(b) Iteration 2

A A ^

A.

A A ^ a a ^

a a A a

^ O o o – ^ o o o “{o o o o o o

o

A

+ A

A A A A

^ A A A

A A _ A A

A

+ A

A A A A A n A

a a – A n

A A _ A A

o

o ” ^ o o o “+o o o o

o

-trv u.u v r Vo-+’-w – u v

o t r v o

(c) Iteration 3. (d) Iteration 4.

Figure 8.7. Two pairs of clusters with more or fewer than two initial centroids within a pair of clusters.

is less susceptible to initialization problems (bisecting K-means) and using postprocessing to “fixup” the set of clusters produced.

Time and Space Complexity

The space requirements for K-means are modest because only the data points

and centroids are stored. Specifically, the storage required is O((m-l K)n), where rn is the number of points and n is the number of attributes. The time requirements for K-means are also modest-basically linear in the number of data points. In particular, the time required is O(1x K +mxn), where 1is the number of iterations required for convergence. As mentioned, 1 is often small and can usually be safely bounded, as most changes typically occur in the

506 Chapter 8 Cluster Analysis: Basic Concepts and Algorithms

first few iterations. Therefore, K-means is linear in rn, the number of points, and is efficient as well as simple provided that K, the number of clusters, is significantly less than m.

8.2.2 K-means: Additional Issues

Handling Empty Clusters

One of the problems with the basic K-means algorithm given earlier is that empty clusters can be obtained if no points are allocated to a cluster during the assignment step. If this happens, then a strategy is needed to choose a replacement centroid, since otherwise, the squared error will be larger than necessary. One approach is to choose the point that is farthest away from any current centroid. If nothing else, this eliminates the point that currently contributes most to the total squared error. Another approach is to choose the replacement centroid from the cluster that has the highest SSE. This will typically split the cluster and reduce the overall SSE of the clustering. If there are several empty clusters, then this process can be repeated several times.

Outliers

When the squared error criterion is used, outliers can unduly influence the clusters that are found. In particular, when outliers are present, the resulting cluster centroids (prototypes) -uy not be as representative as they otherwise would be and thus, the SSE will be higher as well. Because of this, it is often useful to discover outliers and eliminate them beforehand. It is important, however, to appreciate that there are certain clustering applications for which outliers should not be eliminated. When clustering is used for data com- pression, every point must be clustered, and in some cases, such as financial analysis, apparent outliers, e.g., unusually profitable customers, can be the most interesting points.

An obvious issue is how to identify outliers. A number of techniques for identifying outliers will be discussed in Chapter 10. If we use approaches that remove outliers before clustering, we avoid clustering points that will not clus- ter well. Alternatively, outliers can also be identified in a postprocessing step. For instance) we can keep track of the SSE contributed by each point, and eliminate those points with unusually high contributions, especially over mul- tiple runs. Also, we may want to eliminate small clusters since they frequently represent groups of outliers.