until The parameters do not change. (Alternatively, stop if the change in the parameters is below a specified threshold.)

The trM algorithm is similar to the K-means algorithm given in Section 8.2.1. Indeed, the K-means algorithm for Euclidean data is a special case of the EM algorithm for spherical Gaussian distributions with equal covariance ma- trices, but different means. The expectation step corresponds to the K-means step of assigning each object to a cluster. Instead, each object is assigned to every cluster (distribution) with some probability. The maximization step corresponds to computing the cluster centroids. Instead, all the parameters of the distributions, as well as the weight parameters, are selected to maximize the likelihood. This process is often straightforward, as the parameters are typically computed using formulas derived from maximum likelihood estima- tion. For instance, for a single Gaussian distribution, the MLE estimate of the

9.2 Prototype-BasedClustering 589

mean is the mean of the objects in the distribution. In the context of mixture models and the trM algorithm, the computation of the mean is modified to account for the fact that every object belongs to a distribution with a certain probability. This is illustrated further in the following example.

Example 9.a (Simple Example of EM Algorithm). This example illus- trates how EM operates when applied to the data in Figure 9.2. To keep the example as simple as possible) we assume that we know that the standard deviation of both distributions is 2.0 and that points were generated with equal probability from both distributions. We will refer to the left and right distributions as distributions 1 and 2, respectively.

We begin the EM algorithm by making initial guesses for p,1 and p,2) say)

Ft : -2 and p,2 : 3. Thus, the initial parameters,0 : (Lt,o), for the two distributions are? respectively, 01 : (-2,2) and 02: (3,2). The set of parameters for the entire mixture model is O: {0t,02}.For the expectation step of EM, we want to compute the probability that a point came from a particular distribution; i.e., we want to compute prob(di,stri,bution 1lr;, O) and prob(di,stri,bution 2lri,O). These values can be expressed by the following equation, which is a straightforward application of Bayes rule (see Appendix c),

pr ob(distribution j l, i, 0) : 0.5 prob(r.il?i)

(e.13) 0.5 prob(ri l?t) + 0.5 prob(ri l?2)’

where 0.5 is the probability (weight) of each distribution and j is 7 or 2. For instance, assume one of the points is 0. Using the Gaussian den-

sity function given in Equation 9.7, we compute that prob(010t) : 0.12 and prob(0102): 0.06. (Again, we are really computing probability densities.) Us- ing these values and Equation 9.13, we find that prob(di,stribution 110, O) :

0.12 I Q.12+ 0.06) : 0.66 and pr ob(distri,buti,on 210, O) : 0.06/(0. 12 + 0.06) :

0.33. This means that the point 0 is twice as likely to belong to distribution 1 as distribution 2 based on the current assumptions for the parameter values.

After computing the cluster membership probabilities for all 20,000 points,

we compute new estimates for pr,1 and pt2 (using Equations 9.14 and 9.15) in

the maximization step of the EM algorithm. Notice that the new estimate for the mean of a distribution is just a weighted average of the points, where the weights are the probabilities that the points belong to the distribution, i.e., the pr ob(di,stri,buti,on 7 | 16 ) values.

p r obldi,stri but ion 1 | r;, O) (e.14)’

D}foo p, ob(d,i stri but ion Llr ;, Q)

20,000 \-

1 4 : ) . r i :7

590 Chapter 9 Cluster Analysis: Additional Issues and Algorithms

Table 9.1. First few iterations of the EM algorithm for the simple example.

20,000

\- L

Iteration l-rt Hz 0 1

.l

4

-2.00 , a A

-3.94 -3.97 -3.98 -3.98

3.00 4. r0 4.07 4.04 4.03 4.03

U 2 : (e.15)

We repeat these two steps until the estimates of p1 and p,2 either don’t change or change very little. Table 9.1 gives the first few iterations of the EM algorithm when it is applied to the set of 20,000 points. For this data, we know which distribution generated which point, so we can also compute the mean of the points from each distribution. The means are p1 : -3.98 and

l tz : 4 .03 . I

Example 9.5 (The EM Algorithm on Sample Data Sets). We give three examples that illustrate the use of the EM algorithm to find clusters using mixture models. The first example is based on the data set used to illustrate the finzy c-means algorithm-see Figure 9.1. We modeled this data as a mixture of three two-dimensional Gaussian distributions with different means and identical covariance matrices. We then clustered the data using the EM algorithm. The results are shown in Figure 9.4. Each point was assigned to the cluster in which it had the largest membership weight. The points belonging to each cluster are shown by different marker shapes, while the degree of membership in the cluster is shown by the shading. Membership in a cluster is relatively weak for those points that are on the border of the two clusters, but strong elsewhere. It is interesting to compare the membership weights and probabilities of Figures 9.4 and 9.1. (See Exercise 11 on page 648.)

For our second example, we apply mixture model clustering to data that contains clusters with different densities. The data consists of two natural clusters, each with roughly 500 points. This data was created by combining two sets of Gaussian data, one with a center at (-4,1) and a standard deviation of 2, and one with a center at (0,0) and a standard deviation of 0.5. Figure 9.5 shows the clustering produced by the EM algorithm. Despite the differences

9 .2 Prototvpe-Based Clusterins 591

in the density, the EM algorithm is quite successful at identifying the original clusters.

For our third example, we use mixture model clustering on a data set that K-means cannot properly handle. Figure 9.6(a) shows the clustering produced by a mixture model algorithm, while Figure 9.6(b) shows the K-means cluster- ing of the same set of 1000 points. For mixture model clustering, each point has been assigned to the cluster for which it has the highest probability. In both figures, different markers are used to distinguish different clusters. Do not confuse the ‘f’ and ‘x’ markers in Figure 9.6(a).

Advantages and Limitations of Mixture Model Clustering Using the EM Algorithm

Finding clusters by modeling the data using mixture models and applying the EM algorithm to estimate the parameters of those models has a variety of advantages and disadvantages. On the negative side, the trM algorithm can be slow, it is not practical for models with large numbers of components, and it does not work well when clusters contain only a few data points or if the data points are nearly co-linear. There is also a problem in estimating the number of clusters or7 more generally, in choosing the exact form of the model to use. This problem typically has been dealt with by applying a Bayesian approach, which, roughly speaking, gives the odds of one model versus another, based on an estimate derived from the data. Mixture models may also have difficulty with noise and outliers, although work has been done to deal with this problem.

On the positive side, mixture models are more general than K-means or finzy c-means because they can use distributions of various types. As a result, mixture models (based on Gaussian distributions) can find clusters of different sizes and elliptical shapes. AIso, a model-based approach provides a disciplined way of eliminating some of the complexity associated with data. To see the patterns in data, it is often necessary to simplify the data, and fitting the data to a model is a good way to do that if the model is a good match for the data. Furthermore, it is easy to characterize the clusters produced, since they can be described by a small number of parameters. Finally, many sets of data are indeed the result of random processes) and thus should satisfy the statistical assumptions of these models.

592 Chapter 9 Cluster Analysis: Additional Issues and Algorithms

o

O

o o

a

O O

o o a o

^.

A

A

^.

a o o

a o

o o

1 O a

o a

a

a

a

a

a

I

85

8

7

a

^.

A A

^.

A A A

A

r .4. t l

a t t a

.65

.6

0.55

0.5

l l

I

l r ^’ 1 A

l r

l r r

I

Maximum Probablility

Figure 9.4. EM clustering of a two-dimensional point set with three clusters.

!-o ..

a

Figure 9.5. EM clustering of a two-dimensional point set with two clusters of differing density.

9,2 Prototype-Based Clustering

* X X

x *

* rx +*

: * l<

x **

* )e

X< { x( , \x

x

+ + . +

f l T

+ + ‘ t * , r’ +T +TJ+]- + + -+=

x ? ( x : X

* a < **

(a) Clusters produced by mixture model clustering.

;r

* h,(

*

*x * ,x ,t*

* : x)( * **

(b) Clusters produced by K-means clustering.

Figure 9.6. Mixture model and K-means clustering of a set of two-dimensional points.

j;tr+if;

‘ ‘ +

594 Chapter I Cluster Analysis: Additional Issues and Algorithms

9.2.3 Self-Organizing Maps (SOM)

The Kohonen Self-Organizing Feature Map (SOFM or SOM) is a clustering and data visualization technique based on a neural network viewpoint. Despite the neural network origins of SOM, it is more easily presented-at least in the context of this chapter-as a variation of prototype-based clustering. As with other types of centroid-based clustering, the goal of SOM is to find a set of centroids (reference vectors in SOM terminology) and to assign each object in the data set to the centroid that provides the best approximation of that object. In neural network terminology, there is one neuron associated with each centroid.

As with incremental K-means, data objects are processed one at a time and the closest centroid is updated. Unlike K-means, SOM imposes a topographic ordering on the centroids and nearby centroids are also updated. F\rrthermore, SOM does not keep track of the current cluster membership of an object, and, unlike K-means, if an object switches clusters, there is no explicit update of the old cluster centroid. Of course, the old cluster may well be in the neighborhood of the new cluster and thus may be updated for that reason. The processing of points continues until some predetermined limit is reached or the centroids are not changing very much. The final output of the SOM technique is a set of centroids that implicitly define clusters. Each cluster consists of the points closest to a particular centroid. The following section explores the details of this process.

The SOM Algorithm

A distinguishing feature of SOM is that it imposes a topographic (spatial) organization on the centroids (neurons). Figure 9.7 shows an example of a two-dimensional SOM in which the centroids are represented by nodes that are organized in a rectangular lattice. Each centroid is assigned a pair of coor- dinates (2, j). Sometimes, such a network is drawn with links between adjacent nodes, but that can be misleading because the influence of one centroid on an- other is via a neighborhood that is defined in terms of coordinates, not links. There are many types of SOM neural networks, but we restrict our discussion to two-dimensional SOMs with a rectangular or hexagonal organization of the centroids.

Even though SOM is similar to K-means or other prototype-based ap- proaches, there is a fundamental difference. Centroids used in SOM have a predetermined topographic ordering relationship. During the training process, SOM uses each data point to update the closest centroid and centroids that are

9.2 Prototype-BasedClustering 595

@@@

@@@ r.;) r.;) G)\r\rv

Figure 9,7. Two-dimensional 3-by-3 rectangular SOM neural network.

nearby in the topographic ordering. In this way, SOM produces an ordered set

of centroids for any given data set. In other words, the centroids that are close

to each other in the SOM grid are more closely related to each other than to

the centroids that are farther away. Because of this constraint, the centroids of

a two-dimensional SOM can be viewed as lying on a two-dimensional surface

that tries to fit the n-dimensional data as well as possible. The SOM centroids

can also be thought of as the result of a nonlinear regression with respect to

the data points. At a high level, clustering using the SOM technique consists of the steps

described in Algorithm 9.3.

Algorithm 9.3 Basic SOM Algorithm. 1: Initialize the centroids. 2: repeat 3: Select the next object. 4: Determine the closest centroid to the object. 5: Update this centroid and the centroids that are close, i.e., in a specified neigh-