7: Assign each object to its closest centroid and return the centroids and clusters’

Initialization This step (line 1) can be performed in a number of ways.

One approach is to choose each component of a centroid randomly from the

range of values observed in the data for that component. While this approach

works, it is not necessarily the best apploach, especially for producing rapid

convergence. Another approach is to randomly choose the initial centroids

596 Chapter 9 Cluster Analysis: Additional Issues and Algorithms

from the available data points. This is very much like randomly selecting centroids for K-means.

Selection of an object The first step in the loop (line 3) is the selection of the next object. This is fairly straightforward, but there are some difficulties. Since convergence may require many steps, each data object may be used multiple times, especially if the number of objects is small. However, if the number of objects is large, then not every object needs to be used. It is also possible to enhance the influence of certain groups of objects by increasing their frequency in the training set.

Assignment The determination of the closest centroid (line 4) is also straight- forward, although it requires the specification of a distance metric. The Eu- clidean distance metric is often used, as is the dot product metric. When using the dot product distance, the data vectors are typically normalized beforehand and the reference vectors are normalized at each step. In such cases, using the dot product metric is equivalent to using the cosine measure.

Update The update step (line 5) is the most complicated. Let m1, …, rn7, be the centroids. (For a rectangular grid, note that k is the product of the number of rows and the number of columns.) For time step f, let p(t) be the current object (point) and assume that the closest centroid to p(l) is m;. Then, for time t+I, the jth centroid is updated by using the following equation. (We will see shortly that the update is really restricted to centroids whose neurons are in a small neighborhood of m;.)

rni(t t 1) : “‘i(r) + hj(t)(p(t) –

“‘i(t)) (e.16)

Thus, at time f, a centroid mi(t) is updated by adding a term, hj(t) (p(t) – mj(t)), which is proportional to the difference, p(t) – -i(t), between the centroid, mi(t), and the current object, p(t). hj(t) determines the effect that the difference, p(l) -^i(t), will have and is chosen so that (t) it diminishes with time and (2) it enforces a neighborhood effect, i.e., the effect of an object is strongest on the centroids closest to the centroid m7. Here we are referring to the distance in the grid, not the distance in the data space. Typically, h7(l) is chosen to be one of the followins two functions:

9.2 Prototype-BasedClustering 597

hi(t) : a(t)erp(-di,st(ri,rp72 1 1Zo2 (t)) (Gaussian function)

hi?) : a(t) if di,st(ri,ri < threshold, 0 otherwise (step function)

These functions require more explanation. a(t) is a learning rate param-

eter, 0 < a(t) < 1, which decreases monotonically with time and controls

the rate of convergence. r/, : (rk,gp) is the two-dimensional point that gives

the grid coordinates of the k’h centroid. dist(ri,16) is the Euclidean distance

b e t w e e n t h e g r i d l o c a t i o n o f t h e t w o c e n t r o i d s , i . e . , ‘ ] ( r 1 – r 4 f + ( g i – a r ) , . Consequently, for centroids whose grid locations are far from the grid location

of centroid m;, the influence of object p(t) wilt be either greatly diminished or

non-existent. Finally. note that o is the typical Gaussian variance parameter

and controls the width of the neighborhood, i.e., a small o will yield a small

neighborhood, while a large o will yield a wide neighborhood. The threshold

used for the step function also controls the neighborhood size.

Remember, it is the neighborhood updating technique that enforces a Ie-

Iationship (ordering) between centroids associated with neighboring neurons.

Termination Deciding when we are close enough to a stable set of centroids

is an important issue. Ideally, iteration should continue until convergence

occurs, that is, until the reference vectors either do not change or change very

little. The rate of convergence will depend on a number of factors, such as

the data and a(f). We will not discuss these issues further, except to mention

that, in general, convergence can be slow and is not guaranteed.

Example 9.6 (Document Data). We present two examples. In the first

case, we apply SOM with a 4-by-4 hexagonal grid to document data. We

clustered 3204 newspaper articles from the Los Angeles T’imes, which come

from 6 different sections: Entertainment, Financial, Foreign, Metro, National,

and sports. Figure 9.8 shows the SoM grid. we have used a hexagonal

grid, which allows each centroid to have six immediate neighbors instead of

four. Each SOM grid cell (cluster) has been labeled with the majority class

label of the associated points. The clusters of each particular category form

contiguous groups, and their position relative to other categories of clusters

gives us additional information, e.g., that the Metro section contains stories

related to all other sections. I

Example 9.7 (Two-Dimensional Points). In the second case) we use a

rectangular SOM and a set of two-dimensional data points. Figure 9.9(a)

598 Chapter 9 Cluster Analysis: Additional Issues and Algorithms

Figure 9.8. Visualization of the relationships between SOM cluster lor Los Angeles Times document data set.

shows the points and the positions of the 36 reference vectors (shown as x’s) produced by SOM. The points are arranged in a checkerboard pattern and are split into five classes: circles, triangles, squares, diamonds, and hexagons (stars). A 6-by-6 two-dimensional rectangular grid of centroids was used with random initialization. As Figure 9.9(a) shows, the centroids tend to distribute themselves to the dense areas. Figure 9.9(b) indicates the majority class of the points associated with that centroid. The clusters associated with triangle points are in one contiguous area, as are the centroids associated with the four other types of points. This is a result of the neighborhood constraints enforced by SOM. While there are the same number of points in each of the five groups, notice also that the centroids are not evenly distributed. This is partly due to the overall distribution of points and partly an artifact of putting each centroid in a sinsle cluster.

Applications

Once the SOM vectors are found, they can be used for many purposes other than clustering. For example, with a two-dimensional SOM, it is possible to associate various quantities with the grid points associated with each centroid (cluster) and to visualize the results via various types of plots. For example, plotting the number of points associated with each cluster yields a plot that reveals the distribution of points among clusters. A two-dimensional SOM is a non-linear projection of the original probability distribution function into two

9.2 Prototype-BasedClustering 599

o s n { o d X ‘ o o o o E ^ oo o o g l

; ; ;

” .*r* * S””\

^ X d o e o

q X –

i ” x * ” ” ?

” J ”

o x e . i ”

– o o ‘ o X * *

” ” ^ o x

” o o ” 8 > E

? X o – x o o ^ s

^ & ” : , ^ ^ o x o

9 { ” – . o o

o o^ ^-g( o o o ‘ o > q t

– “l -4

-o@

r n % a ^ a A ^ a a d’ A a ^ ” ) \ a ^ X X

r ^ o A A A

” ^ X a a

” A

diamond diamond diamond h9Xagon nexagon nexagon

diamond dlamond diamond ctrde hexagon hengon

diamond dimond ctrcle nexagon

square cfcb triangle tiangle

quare tiangle hangle

quare quare tianglâ‚¬ tdan9le hangb

” . ‘ x 8k”.”‘ ” , x 8 k

” d . ” X F ”

X . F o

o o ” E o o X D

E B o

(a) Distribution of SOM reference vectors (X’s) for a two-dimensional point set.

(b) Classes of the SOM cen- broids.

Figure 9.9. SOM applied to two-dimensional data points.

dimensions. This projection attempts to preserve topological features; thus,

using SOM to capture the structure of the data has been compared to the process of ttpressing a flower.tt

Strengths and Limitations

SOM is a clustering technique that enforces neighborhood relationships on the

resulting cluster centroids. Because of this, clusters that are neighbors are

more related to one another than clusters that are not. Such relationships facilitate the interpretation and visualization of the clustering results. Indeed,

this aspect of SOM has been exploited in many areas, such as visualizing Web

documents or gene array data. SOM also has a number of limitations, which are listed next. Some of

the listed limitations are only valid if we consider SOM to be a standard clustering technique that aims to find the true clusters in the data, rather

than a technique that uses clustering to help discover the structure of the

data. Also, some of these limitations have been addressed either by extensions of SOM or by clustering algorithms inspired by SOM. (See the bibliographic notes.)

o The user must choose the settings of parameters, the neighborhood func-

tion, the grid type, and the number of centroids.

600 Chapter 9 Cluster Analysis: Additional Issues and Algorithms

A SOM cluster often does not correspond to a single natural cluster. In some cases, a SOM cluster may encompass several natural clusters, while in other cases a single natural cluster is split into several SOM clusters. This problem is partly due to the use of a grid of centroids and partly due to the fact that SOM, like other prototype-based clustering techniques, tends to split or combine natural clusters when they are of varying sizes, shapes, and densities.

SOM lacks a specific objective function. SOM attempts to find a set of centroids that best approximate the data, subject to the topographic constraints among the centroids, but the success of SOM in doing this cannot be expressed by a function. This can make it difficult to compare different SOM clustering results.

o SOM is not guaranteed to converge, although, in practice, it typically does.

9.3 Density-Based Clustering

In Section 8.4, we considered DBSCAN, a simple, but effective algorithm for finding density-based clusters, i.e., dense regions ofobjects that are surrounded by low-density regions. This section examines additional density-based clus- tering techniques that address issues of efficiency, finding clusters in subspaces, and more accurately modeling density. First, we consider grid-based cluster- ing, which breaks the data space into grid cells and then forms clusters from cells that are sufficiently dense. Such an approach can be efficient and effec- tive, at least for low-dimensional data. Next, we consider subspace clustering, which looks for clusters (dense regions) in subsets of all dimensions. For a data space with n dimensions, potentially 2″ – 1 subspaces need to be searched, and thus an efficient technique is needed to do this. CLIQUE is a grid-based clus- tering algorithm that provides an efficient approach to subspace clustering based on the observation that dense areas in a high-dimensional space imply the existence of dense areas in lower-dimensional space. Finally, we describe DENCLUE, a clustering technique that uses kernel density functions to model density as the sum of the influences of individual data objects. While DEN- CLUE is not fundamentally a grid-based technique, it does employ a grid-based approach to improve efficiency.

9.3 Density-Based Clustering 601

9.3.1 Grid-Based Clustering

A grid is an efficient way to organize a set of data, at least in low dimensions. The idea is to split the possible values of each attribute into a number of contiguous intervals, creating a set of grid cells. (We are assuming, for this discussion and the remainder of the section, that our attributes are ordinal, interval, or continuous.) Each object falls into a grid cell whose corresponding attribute intervals contain the values ofthe object. Objects can be assigned to grid cells in one pass through the data, and information about each cell, such as the number of points in the cell, can also be gathered at the same time.

There are a number of ways to perform clustering using a grid, but most approaches are based on density, at least in part, and thus, in this section, we will use grid-based clustering to mean density-based clustering using a grid. Algorithm 9.4 describes a basic approach to grid-based clustering. Various aspects of this approach are explored next.

Algorithm 9.4 Basic grid-based clustering algorithm. 1: Defi.ne a set of grid cells. 2: Assign objects to the appropriate cells and compute the density of each cell. 3: Eliminate cells having a density below a specified threshold, r. 4: Form clusters from contiguous (adjacent) groups of dense cells.