Intuitively, the weight u_ij, which indicates the degree of membership of point x_i in cluster C_j, should be relatively high if x_i is close to centroid c_j (if dist(x_i, c_j) is low) and relatively low if x_i is far from centroid c_j (if dist(x_i, c_j) is high). If u_ij = 1/dist(x_i, c_j)^2, which is the numerator of Equation 9.4, then this will indeed be the case. However, the membership weights for a point will not sum to one unless they are normalized, i.e., divided by the sum of all the weights, as in Equation 9.4. To summarize, the membership weight of a point in a cluster is just the reciprocal of the square of the distance between the point and the cluster centroid, divided by the sum of these reciprocals over all clusters.

Now consider the impact of the exponent 1/(p − 1) in Equation 9.3. If p > 2, then this exponent decreases the weight assigned to clusters that are close to the point. Indeed, as p goes to infinity, the exponent tends to 0 and the weights tend to the value 1/k. On the other hand, as p approaches 1, the exponent increases the membership weights of clusters that are close to the point. As p goes to 1, the membership weight goes to 1 for the closest cluster and to 0 for all the other clusters. This corresponds to K-means.
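The membership computation just described can be sketched in a few lines. This is a minimal illustration, assuming Euclidean distance; the function name and argument layout are ours, not from the text. It forms the unnormalized weight (1/dist(x, c_j)^2)^(1/(p−1)) for each centroid, following the exponent 1/(p − 1) of Equation 9.3, and then normalizes as in Equation 9.4.

```python
def fcm_memberships(x, centroids, p=2.0):
    """Sketch of the fuzzy c-means membership weights for a single point x.

    x: sequence of coordinates; centroids: list of coordinate sequences;
    p: fuzzifier (p > 1). Returns weights that sum to one.
    """
    # squared Euclidean distance from x to each centroid
    d2 = [sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in centroids]
    # unnormalized weights: (1 / dist^2)^(1/(p-1)), the exponent of Equation 9.3
    w = [(1.0 / d) ** (1.0 / (p - 1.0)) for d in d2]
    total = sum(w)
    return [wi / total for wi in w]  # normalize so the weights sum to one
```

With p close to 1, the closest centroid receives nearly all of the weight (the K-means limit), while a very large p drives every weight toward 1/k, as described above.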

582 Chapter 9 Cluster Analysis: Additional Issues and Algorithms

Figure 9.1. Fuzzy c-means clustering of a two-dimensional point set.

Example 9.1 (Fuzzy c-means on Three Circular Clusters). Figure 9.1 shows the result of applying fuzzy c-means to find three clusters for a two-dimensional data set of 100 points. Each point was assigned to the cluster in which it had the largest membership weight. The points belonging to each cluster are shown by different marker shapes, while the degree of membership in the cluster is shown by the shading. The darker the points, the stronger their membership in the cluster to which they have been assigned. The membership in a cluster is strongest toward the center of the cluster and weakest for those points that are between clusters.

Strengths and Limitations

A positive feature of FCM is that it produces a clustering that provides an indication of the degree to which any point belongs to any cluster. Otherwise, it has much the same strengths and weaknesses as K-means, although it is somewhat more computationally intensive.

9.2 Prototype-Based Clustering 583

9.2.2 Clustering Using Mixture Models

This section considers clustering based on statistical models. It is often convenient and effective to assume that data has been generated as a result of a statistical process and to describe the data by finding the statistical model that best fits the data, where the statistical model is described in terms of a distribution and a set of parameters for that distribution. At a high level, this process involves deciding on a statistical model for the data and estimating the parameters of that model from the data. This section describes a particular kind of statistical model, mixture models, which model the data by using a number of statistical distributions. Each distribution corresponds to a cluster and the parameters of each distribution provide a description of the corresponding cluster, typically in terms of its center and spread.

The discussion in this section proceeds as follows. After providing a description of mixture models, we consider how parameters can be estimated for statistical data models. We first describe how a procedure known as maximum likelihood estimation (MLE) can be used to estimate parameters for simple statistical models and then discuss how we can extend this approach for estimating the parameters of mixture models. Specifically, we describe the well-known Expectation-Maximization (EM) algorithm, which makes an initial guess for the parameters, and then iteratively improves these estimates. We present examples of how the EM algorithm can be used to cluster data by estimating the parameters of a mixture model and discuss its strengths and limitations.

A firm understanding of statistics and probability, as covered in Appendix C, is essential for understanding this section. Also, for convenience in the following discussion, we use the term probability to refer to both probability

and probability density.

Mixture Models

Mixture models view the data as a set of observations from a mixture of different probability distributions. The probability distributions can be anything, but are often taken to be multivariate normal, since this type of distribution is well understood, mathematically easy to work with, and has been shown to produce good results in many instances. These types of distributions can model ellipsoidal clusters.

Conceptually, mixture models correspond to the following process of generating data. Given several distributions, usually of the same type, but with different parameters, randomly select one of these distributions and generate


an object from it. Repeat the process m times, where m is the number of objects.

More formally, assume that there are K distributions and m objects, X = {x_1, ..., x_m}. Let the jth distribution have parameters θ_j, and let Θ be the set of all parameters, i.e., Θ = {θ_1, ..., θ_K}. Then, prob(x_i | θ_j) is the probability of the ith object if it comes from the jth distribution. The probability that the jth distribution is chosen to generate an object is given by the weight w_j, 1 ≤ j ≤ K, where these weights (probabilities) are subject to the constraint that they sum to one, i.e., \sum_{j=1}^{K} w_j = 1. Then, the probability of an object x is given by Equation 9.5.

If the objects are generated in an independent manner, then the probability of the entire set of objects is just the product of the probabilities of each individual x_i, as shown in Equation 9.6.

prob(x | \Theta) = \sum_{j=1}^{K} w_j \, p_j(x | \theta_j)          (9.5)

prob(X | \Theta) = \prod_{i=1}^{m} prob(x_i | \Theta) = \prod_{i=1}^{m} \sum_{j=1}^{K} w_j \, p_j(x_i | \theta_j)          (9.6)

For mixture models, each distribution describes a different group, i.e., a different cluster. By using statistical methods, we can estimate the parameters of these distributions from the data and thus describe these distributions (clusters). We can also identify which objects belong to which clusters. However, mixture modeling does not produce a crisp assignment of objects to clusters, but rather gives the probability with which a specific object belongs to a particular cluster.
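As a concrete sketch of Equations 9.5 and 9.6, the functions below compute the mixture probability of one object and of an independent set of objects, assuming one-dimensional Gaussian components (the Gaussian density anticipates Equation 9.7 in the example that follows). The function names are ours, for illustration only.

```python
import math

def gaussian_pdf(x, mu, sigma):
    # one-dimensional Gaussian density (Equation 9.7)
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def mixture_prob(x, weights, params):
    # Equation 9.5: prob(x | Theta) = sum_j w_j * p_j(x | theta_j)
    return sum(w * gaussian_pdf(x, mu, s) for w, (mu, s) in zip(weights, params))

def dataset_prob(xs, weights, params):
    # Equation 9.6: the objects are independent, so their probabilities multiply
    prod = 1.0
    for x in xs:
        prod *= mixture_prob(x, weights, params)
    return prod
```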

Example 9.2 (Univariate Gaussian Mixture). We provide a concrete illustration of a mixture model in terms of Gaussian distributions. The probability density function for a one-dimensional Gaussian distribution at a point x_i is

prob(x_i | \Theta) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x_i - \mu)^2}{2\sigma^2}}          (9.7)

The parameters of the Gaussian distribution are given by Θ = (μ, σ), where μ is the mean of the distribution and σ is the standard deviation. Assume that there are two Gaussian distributions, with a common standard deviation of 2 and means of −4 and 4, respectively. Also assume that each of the two

(a) Probability density function for the mixture model. (b) 20,000 points generated from the mixture model.

Figure 9.2. Mixture model consisting of two normal distributions with means of −4 and 4, respectively. Both distributions have a standard deviation of 2.

distributions is selected with equal probability, i.e., w_1 = w_2 = 0.5. Then Equation 9.5 becomes the following:

prob(x | \Theta) = \frac{1}{2} \cdot \frac{1}{2\sqrt{2\pi}} e^{-\frac{(x+4)^2}{8}} + \frac{1}{2} \cdot \frac{1}{2\sqrt{2\pi}} e^{-\frac{(x-4)^2}{8}}          (9.8)

Figure 9.2(a) shows a plot of the probability density function of this mixture model, while Figure 9.2(b) shows the histogram for 20,000 points generated from this mixture model.
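The generative process behind Figure 9.2(b) can be sketched directly: pick one of the components at random according to its weight, then draw from that Gaussian, and repeat m times. This is a minimal illustration; the function name and the fixed seed are our own choices.

```python
import random

def sample_mixture(m, weights, params, seed=0):
    """Draw m points from a mixture of one-dimensional Gaussians.

    weights: component selection probabilities; params: (mean, std) pairs.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    points = []
    for _ in range(m):
        # choose a component with probability proportional to its weight
        j = rng.choices(range(len(weights)), weights=weights)[0]
        mu, sigma = params[j]
        points.append(rng.gauss(mu, sigma))
    return points
```

Calling sample_mixture(20000, [0.5, 0.5], [(-4, 2), (4, 2)]) reproduces the setup of Example 9.2.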

Estimating Model Parameters Using Maximum Likelihood

Given a statistical model for the data, it is necessary to estimate the parameters of that model. A standard approach used for this task is maximum likelihood estimation, which we now explain.

To begin, consider a set of m points that are generated from a one-dimensional Gaussian distribution. Assuming that the points are generated independently, the probability of these points is just the product of their individual probabilities. (Again, we are dealing with probability densities, but to keep our terminology simple, we will refer to probabilities.) Using Equation 9.7, we can write this probability as shown in Equation 9.9. Since this probability would be a very small number, we typically will work with the log probability, as shown in Equation 9.10.


We would like to find a procedure to estimate μ and σ if they are unknown. One approach is to choose the values of the parameters for which the data is most probable (most likely). In other words, choose the μ and σ that maximize Equation 9.9. This approach is known in statistics as the maximum likelihood principle, and the process of applying this principle to estimate the parameters of a statistical distribution from the data is known as maximum likelihood estimation (MLE).

The principle is called the maximum likelihood principle because, given a set of data, the probability of the data, regarded as a function of the parameters, is called a likelihood function. To illustrate, we rewrite Equation 9.9 as Equation 9.11 to emphasize that we view the statistical parameters μ and σ as our variables and that the data is regarded as a constant. For practical reasons, the log likelihood is more commonly used. The log likelihood function derived from the log probability of Equation 9.10 is shown in Equation 9.12. Note that the parameter values that maximize the log likelihood also maximize the likelihood since log is a monotonically increasing function.

prob(X | \Theta) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x_i - \mu)^2}{2\sigma^2}}          (9.9)

\log prob(X | \Theta) = -\sum_{i=1}^{m} \frac{(x_i - \mu)^2}{2\sigma^2} - 0.5\,m \log 2\pi - m \log \sigma          (9.10)

likelihood(\Theta | X) = L(\Theta | X) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x_i - \mu)^2}{2\sigma^2}}          (9.11)

\log likelihood(\Theta | X) = \ell(\Theta | X) = -\sum_{i=1}^{m} \frac{(x_i - \mu)^2}{2\sigma^2} - 0.5\,m \log 2\pi - m \log \sigma          (9.12)
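Equations 9.10 and 9.12 share the same right-hand side, so one helper suffices to evaluate the log likelihood at candidate parameter values. This is a sketch; the function name is ours.

```python
import math

def gaussian_log_likelihood(mu, sigma, xs):
    # Equation 9.12: log likelihood of 1-D Gaussian parameters given data xs
    m = len(xs)
    return (-sum((x - mu) ** 2 for x in xs) / (2 * sigma ** 2)
            - 0.5 * m * math.log(2 * math.pi)
            - m * math.log(sigma))
```

Evaluating this function over a grid of (μ, σ) values produces a surface like the one plotted in Figure 9.3(b).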

Example 9.3 (Maximum Likelihood Parameter Estimation). We provide a concrete illustration of the use of MLE for finding parameter values. Suppose that we have the set of 200 points whose histogram is shown in Figure 9.3(a). Figure 9.3(b) shows the log likelihood plot for the 200 points under consideration. The values of the parameters for which the log probability is a maximum are μ = −4.1 and σ = 2.1, which are close to the parameter values of the underlying Gaussian distribution, μ = −4.0 and σ = 2.0.


(a) Histogram of 200 points from a Gaussian distribution. (b) Log likelihood plot of the 200 points for different values of the mean and standard deviation.

Figure 9.3. 200 points from a Gaussian distribution and their log probability for different parameter values.

Graphing the likelihood of the data for different values of the parameters is not practical, at least if there are more than two parameters. Thus, the standard statistical procedure is to derive the maximum likelihood estimate of a statistical parameter by taking the derivative of the likelihood function with respect to that parameter, setting the result equal to 0, and solving. In particular, for a Gaussian distribution, it can be shown that the mean and standard deviation of the sample points are the maximum likelihood estimates of the corresponding parameters of the underlying distribution. (See Exercise 9 on page 648.) Indeed, for the 200 points considered in our example, the parameter values that maximized the log likelihood were precisely the mean and standard deviation of the 200 points, i.e., μ = −4.1 and σ = 2.1.
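In code, these closed-form estimates are one line each. A sketch, with a name of our own choosing; note that the maximum likelihood estimate of the variance divides by m, not by m − 1.

```python
import math

def mle_gaussian(xs):
    # closed-form maximum likelihood estimates for a 1-D Gaussian:
    # the sample mean, and the sample standard deviation with a 1/m divisor
    m = len(xs)
    mu = sum(xs) / m
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / m)
    return mu, sigma
```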

Estimating Mixture Model Parameters Using Maximum Likelihood: The EM Algorithm

We can also use the maximum likelihood approach to estimate the model parameters for a mixture model. In the simplest case, we know which data objects come from which distributions, and the situation reduces to one of estimating the parameters of a single distribution given data from that distribution. For most common distributions, the maximum likelihood estimates of the parameters are calculated from simple formulas involving the data.



In a more general (and more realistic) situation, we do not know which points were generated by which distribution. Thus, we cannot directly calculate the probability of each data point, and hence, it would seem that we cannot use the maximum likelihood principle to estimate parameters. The solution to this problem is the EM algorithm, which is shown in Algorithm 9.2. Briefly, given a guess for the parameter values, the EM algorithm calculates the probability that each point belongs to each distribution and then uses these probabilities to compute a new estimate for the parameters. (These parameters are the ones that maximize the likelihood.) This iteration continues until the estimates of the parameters either do not change or change very little. Thus, we still employ maximum likelihood estimation, but via an iterative search.

Algorithm 9.2 EM algorithm.
1: Select an initial set of model parameters. (As with K-means, this can be done randomly or in a variety of ways.)
2: repeat
3:   Expectation Step For each object, calculate the probability that each object belongs to each distribution, i.e., calculate prob(distribution j | x_i, Θ).
4:   Maximization Step Given the probabilities from the expectation step, find the new estimates of the parameters that maximize the expected likelihood.
5: until the estimates of the parameters do not change, or change very little.
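The steps above can be sketched for a mixture of k one-dimensional Gaussians as follows. This is a minimal sketch, not a production implementation: the initialization scheme (means spread over the data range, unit standard deviations, equal weights), the fixed iteration count, and the function names are our own simplifications; a real implementation would test for convergence and guard against numerical underflow.

```python
import math

def gaussian_pdf(x, mu, sigma):
    # one-dimensional Gaussian density (Equation 9.7)
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def em_gaussian_mixture(xs, k=2, iters=50):
    m = len(xs)
    lo, hi = min(xs), max(xs)
    # initial parameter guesses (step 1)
    mus = [lo + (j + 0.5) * (hi - lo) / k for j in range(k)]
    sigmas = [1.0] * k
    ws = [1.0 / k] * k
    for _ in range(iters):
        # Expectation step: prob(distribution j | x_i, Theta) for every object
        resp = []
        for x in xs:
            p = [ws[j] * gaussian_pdf(x, mus[j], sigmas[j]) for j in range(k)]
            total = sum(p)
            resp.append([pj / total for pj in p])
        # Maximization step: responsibility-weighted MLE updates
        for j in range(k):
            nj = sum(r[j] for r in resp)
            ws[j] = nj / m
            mus[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            sigmas[j] = math.sqrt(
                sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, xs)) / nj)
    return ws, mus, sigmas
```

On data drawn from the two-component mixture of Example 9.2, this sketch recovers means near −4 and 4 and standard deviations near 2.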