
8.1 Overview

fades into the noise and does not form a cluster in Figure 8.2(d). A density-based definition of a cluster is often employed when the clusters are irregular or intertwined, and when noise and outliers are present. By contrast, a contiguity-based definition of a cluster would not work well for the data of Figure 8.2(d) since the noise would tend to form bridges between clusters.


Shared-Property (Conceptual Clusters) More generally, we can define a cluster as a set of objects that share some property. This definition encompasses all the previous definitions of a cluster; e.g., objects in a center-based cluster share the property that they are all closest to the same centroid or medoid. However, the shared-property approach also includes new types of clusters. Consider the clusters shown in Figure 8.2(e). A triangular area (cluster) is adjacent to a rectangular one, and there are two intertwined circles (clusters). In both cases, a clustering algorithm would need a very specific concept of a cluster to successfully detect these clusters. The process of finding such clusters is called conceptual clustering. However, too sophisticated a notion of a cluster would take us into the area of pattern recognition, and thus, we only consider simpler types of clusters in this book.

Road Map

In this chapter, we use the following three simple, but important techniques to introduce many of the concepts involved in cluster analysis.

• K-means. This is a prototype-based, partitional clustering technique that attempts to find a user-specified number of clusters (K), which are represented by their centroids.

• Agglomerative Hierarchical Clustering. This clustering approach refers to a collection of closely related clustering techniques that produce a hierarchical clustering by starting with each point as a singleton cluster and then repeatedly merging the two closest clusters until a single, all-encompassing cluster remains. Some of these techniques have a natural interpretation in terms of graph-based clustering, while others have an interpretation in terms of a prototype-based approach.

• DBSCAN. This is a density-based clustering algorithm that produces a partitional clustering, in which the number of clusters is automatically determined by the algorithm. Points in low-density regions are classified as noise and omitted; thus, DBSCAN does not produce a complete clustering.



496 Chapter 8 Cluster Analysis: Basic Concepts and Algorithms

(a) Well-separated clusters. Each point is closer to all of the points in its cluster than to any point in another cluster.

(b) Center-based clusters. Each point is closer to the center of its cluster than to the center of any other cluster.

(c) Contiguity-based clusters. Each point is closer to at least one point in its cluster than to any point in another cluster.

(d) Density-based clusters. Clusters are regions of high density separated by regions of low density.

(e) Conceptual clusters. Points in a cluster share some general property that derives from the entire set of points. (Points in the intersection of the circles belong to both.)

Figure 8.2. Different types of clusters as illustrated by sets of two-dimensional points.

8.2 K-means

Prototype-based clustering techniques create a one-level partitioning of the data objects. There are a number of such techniques, but two of the most prominent are K-means and K-medoid. K-means defines a prototype in terms of a centroid, which is usually the mean of a group of points, and is typically applied to objects in a continuous n-dimensional space. K-medoid defines a prototype in terms of a medoid, which is the most representative point for a group of points, and can be applied to a wide range of data since it requires only a proximity measure for a pair of objects. While a centroid almost never corresponds to an actual data point, a medoid, by its definition, must be an actual data point. In this section, we will focus solely on K-means, which is one of the oldest and most widely used clustering algorithms.
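The contrast between a centroid and a medoid can be made concrete with a small sketch. It assumes Euclidean distance and takes "most representative point" to mean the data point with the smallest total distance to the other points in the group, which is one common reading rather than the only possible one. The function names are illustrative.

```python
import math

def centroid(points):
    """Coordinate-wise mean of the points; usually not one of the points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def medoid(points, dist=math.dist):
    """Data point with the smallest total distance to all points in the group.

    Assumption: this is one common way to define the 'most representative'
    point; it needs only a pairwise proximity measure, not coordinates.
    """
    return min(points, key=lambda p: sum(dist(p, q) for q in points))

group = [(1.0, 1.0), (2.0, 3.0), (6.0, 2.0)]
print(centroid(group))  # the mean, which is not a member of the group here
print(medoid(group))    # always a member of the group
```

Because `medoid` only calls `dist` on pairs of objects, any proximity function could be passed in, which matches the remark that K-medoid applies to a wide range of data.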

8.2.1 The Basic K-means Algorithm

The K-means clustering technique is simple, and we begin with a description of the basic algorithm. We first choose K initial centroids, where K is a user-specified parameter, namely, the number of clusters desired. Each point is then assigned to the closest centroid, and each collection of points assigned to a centroid is a cluster. The centroid of each cluster is then updated based on the points assigned to the cluster. We repeat the assignment and update steps until no point changes clusters, or equivalently, until the centroids remain the same.

K-means is formally described by Algorithm 8.1. The operation of K-means is illustrated in Figure 8.3, which shows how, starting from three centroids, the final clusters are found in four assignment-update steps. In these and other figures displaying K-means clustering, each subfigure shows (1) the centroids at the start of the iteration and (2) the assignment of the points to those centroids. The centroids are indicated by the “*” symbol; all points belonging to the same cluster have the same marker shape.

Algorithm 8.1 Basic K-means algorithm.

1: Select K points as initial centroids.
2: repeat
3:     Form K clusters by assigning each point to its closest centroid.
4:     Recompute the centroid of each cluster.
5: until Centroids do not change.
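Algorithm 8.1 can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes Euclidean distance, the mean as the centroid, and random selection of data points as initial centroids when none are supplied. The `initial`, `seed`, and `max_iter` parameters and the handling of empty clusters are choices made for this sketch.

```python
import math
import random

def kmeans(points, k, initial=None, seed=0, max_iter=100):
    """Basic K-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Line 1: select K points as initial centroids (random unless given).
    centroids = list(initial) if initial is not None else rng.sample(points, k)
    labels = None
    for _ in range(max_iter):  # Line 2: repeat
        # Line 3: assign each point to its closest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Line 5: if no point changed cluster, the centroids cannot change.
        if new_labels == labels:
            break
        labels = new_labels
        # Line 4: recompute each centroid as the mean of its points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # sketch choice: keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels

data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
# Both initial centroids start in the same group, as in Figure 8.3(a).
final_centroids, final_labels = kmeans(data, 2, initial=[(0.0, 0.0), (0.0, 1.0)])
print(final_labels)
```

The loop tests for unchanged assignments rather than unchanged centroids; as noted above, the two conditions are equivalent.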

In the first step, shown in Figure 8.3(a), points are assigned to the initial centroids, which are all in the larger group of points. For this example, we use the mean as the centroid. After points are assigned to a centroid, the centroid is then updated. Again, the figure for each step shows the centroid at the beginning of the step and the assignment of points to those centroids. In the second step, points are assigned to the updated centroids, and the centroids are updated again. In steps 2, 3, and 4, which are shown in Figures 8.3(b), (c), and (d), respectively, two of the centroids move to the two small groups of points at the bottom of the figures. When the K-means algorithm terminates in Figure 8.3(d), because no more changes occur, the centroids have identified the natural groupings of points.

(a) Iteration 1. (b) Iteration 2. (c) Iteration 3. (d) Iteration 4.

Figure 8.3. Using the K-means algorithm to find three clusters in sample data.

For some combinations of proximity functions and types of centroids, K-means always converges to a solution; i.e., K-means reaches a state in which no points are shifting from one cluster to another, and hence, the centroids don’t change. Because most of the convergence occurs in the early steps, however, the condition on line 5 of Algorithm 8.1 is often replaced by a weaker condition, e.g., repeat until only 1% of the points change clusters.
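The weaker stopping condition can be written as a small check on two successive label assignments. This is a sketch under the assumption that cluster labels are kept in lists aligned with the data points; the function names and the `tolerance` parameter are illustrative.

```python
def fraction_changed(old_labels, new_labels):
    """Fraction of points whose cluster label differs between two iterations."""
    changed = sum(1 for a, b in zip(old_labels, new_labels) if a != b)
    return changed / len(new_labels)

def should_stop(old_labels, new_labels, tolerance=0.01):
    """Relaxed test: stop once at most `tolerance` of the points changed clusters."""
    return fraction_changed(old_labels, new_labels) <= tolerance

previous = [0] * 200
current = [0] * 199 + [1]   # 1 of 200 points moved to another cluster
print(should_stop(previous, current))
```

Setting `tolerance=0` recovers the strict condition on line 5 of Algorithm 8.1.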

We consider each of the steps in the basic K-means algorithm in more detail and then provide an analysis of the algorithm’s space and time complexity.

Assigning Points to the Closest Centroid

To assign a point to the closest centroid, we need a proximity measure that quantifies the notion of “closest” for the specific data under consideration. Euclidean (L2) distance is often used for data points in Euclidean space, while cosine similarity is more appropriate for documents. However, there may be several types of proximity measures that are appropriate for a given type of data. For example, Manhattan (L1) distance can be used for Euclidean data, while the Jaccard measure is often employed for documents.
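The four proximity measures named above can each be written in a few lines. The sketch below assumes points are equal-length numeric sequences and that, for the Jaccard measure, each document is represented as a set of terms; it does not handle zero vectors or empty sets.

```python
import math

def euclidean(x, y):
    """Euclidean (L2) distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Manhattan (L1) distance."""
    return sum(abs(a - b) for a, b in zip(x, y))

def cosine_similarity(x, y):
    """Cosine of the angle between two vectors, e.g. rows of a document-term matrix."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def jaccard(a, b):
    """Jaccard coefficient: shared items divided by all items in either set."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

print(euclidean((0, 0), (3, 4)), manhattan((0, 0), (3, 4)))
print(jaccard({"data", "mining", "cluster"}, {"mining", "cluster", "noise"}))
```

Note that the first two are distances (smaller means closer) while the last two are similarities (larger means closer), so the assignment step takes a minimum for one kind and a maximum for the other.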

Usually, the similarity measures used for K-means are relatively simple since the algorithm repeatedly calculates the similarity of each point to each centroid. In some cases, however, such as when the data is in low-dimensional Euclidean space, it is possible to avoid computing many of the similarities, thus significantly speeding up the K-means algorithm. Bisecting K-means (described in Section 8.2.3) is another approach that speeds up K-means by reducing the number of similarities computed.

Table 8.1. Table of notation.

Symbol   Description
x        An object.
C_i      The ith cluster.
c_i      The centroid of cluster C_i.
c        The centroid of all points.
m_i      The number of objects in the ith cluster.
m        The number of objects in the data set.
K        The number of clusters.

Centroids and Objective Functions

Step 4 of the K-means algorithm was stated rather generally as “recompute the centroid of each cluster,” since the centroid can vary, depending on the proximity measure for the data and the goal of the clustering. The goal of the clustering is typically expressed by an objective function that depends on the proximities of the points to one another or to the cluster centroids; e.g., minimize the squared distance of each point to its closest centroid. We illustrate this with two examples. However, the key point is this: once we have specified a proximity measure and an objective function, the centroid that we should choose can often be determined mathematically. We provide mathematical details in Section 8.2.6, and provide a non-mathematical discussion of this observation here.

Data in Euclidean Space Consider data whose proximity measure is Euclidean distance. For our objective function, which measures the quality of a clustering, we use the sum of the squared error (SSE), which is also known as scatter. In other words, we calculate the error of each data point, i.e., its Euclidean distance to the closest centroid, and then compute the total sum of the squared errors. Given two different sets of clusters that are produced by two different runs of K-means, we prefer the one with the smallest squared error since this means that the prototypes (centroids) of this clustering are a better representation of the points in their cluster. Using the notation in Table 8.1, the SSE is formally defined as follows:




$$\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}(c_i, x)^2 \tag{8.1}$$

where dist is the standard Euclidean (L2) distance between two objects in Euclidean space.

Given these assumptions, it can be shown (see Section 8.2.6) that the centroid that minimizes the SSE of the cluster is the mean. Using the notation in Table 8.1, the centroid (mean) of the ith cluster is defined by Equation 8.2.

$$c_i = \frac{1}{m_i} \sum_{x \in C_i} x \tag{8.2}$$

To illustrate, the centroid of a cluster containing the three two-dimensional points (1,1), (2,3), and (6,2) is ((1 + 2 + 6)/3, (1 + 3 + 2)/3) = (3, 2).

Steps 3 and 4 of the K-means algorithm directly attempt to minimize the SSE (or more generally, the objective function). Step 3 forms clusters by assigning points to their nearest centroid, which minimizes the SSE for the given set of centroids. Step 4 recomputes the centroids so as to further minimize the SSE. However, the actions of K-means in Steps 3 and 4 are only guaranteed to find a local minimum with respect to the SSE since they are based on optimizing the SSE for specific choices of the centroids and clusters, rather than for all possible choices. We will later see an example in which this leads to a suboptimal clustering.

Document Data To illustrate that K-means is not restricted to data in Euclidean space, we consider document data and the cosine similarity measure. Here we assume that the document data is represented as a document-term matrix as described on page 31. Our objective is to maximize the similarity of the documents in a cluster to the cluster centroid; this quantity is known as the cohesion of the cluster. For this objective it can be shown that the cluster centroid is, as for Euclidean data, the mean. The analogous quantity to the total SSE is the total cohesion, which is given by Equation 8.3.

$$\text{Total Cohesion} = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{cosine}(x, c_i) \tag{8.3}$$
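Equations 8.1 through 8.3 can be checked numerically with a short sketch. It assumes each cluster is a list of equal-length numeric tuples and that the centroid is the mean in both the Euclidean and the cosine case, as stated above; the function names are illustrative.

```python
import math

def mean_centroid(cluster):
    """Equation 8.2: coordinate-wise mean of the points in one cluster."""
    m = len(cluster)
    return tuple(sum(coord) / m for coord in zip(*cluster))

def sse(clusters):
    """Equation 8.1: sum of squared Euclidean distances of points to their centroid."""
    total = 0.0
    for cluster in clusters:
        c = mean_centroid(cluster)
        total += sum(sum((a - b) ** 2 for a, b in zip(x, c)) for x in cluster)
    return total

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def total_cohesion(clusters):
    """Equation 8.3: sum of cosine similarities of points to their centroid."""
    return sum(cosine(x, mean_centroid(cluster))
               for cluster in clusters for x in cluster)

example = [(1.0, 1.0), (2.0, 3.0), (6.0, 2.0)]
print(mean_centroid(example))  # the worked example: (3.0, 2.0)
print(sse([example]))          # 5 + 2 + 9 = 16.0
```

A lower `sse` or a higher `total_cohesion` indicates that the centroids represent their clusters better, which is the basis for comparing two runs of K-means.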

The General Case There are a number of choices for the proximity function, centroid, and objective function that can be used in the basic K-means algorithm and that are guaranteed to converge. Table 8.2 shows some possible choices, including the two that we have just discussed. Notice that for Manhattan (L1) distance and the objective of minimizing the sum of the distances, the appropriate centroid is the median of the points in a cluster.

Table 8.2. K-means: Common choices for proximity, centroids, and objective functions.

Proximity Function        Centroid   Objective Function
Manhattan (L1)            median     Minimize sum of the L1 distance of an object to its cluster centroid
Squared Euclidean (L2²)   mean       Minimize sum of the squared L2 distance of an object to its cluster centroid
cosine                    mean       Maximize sum of the cosine similarity of an object to its cluster centroid
Bregman divergence        mean       Minimize sum of the Bregman divergence of an object to its cluster centroid
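The pairing of Manhattan distance with the median can be seen on a tiny one-dimensional example. The values below are made up for illustration; for multi-dimensional data the same comparison would be applied to each coordinate separately.

```python
import statistics

def sum_abs_dev(values, center):
    """Sum of L1 distances from each value to a candidate centroid."""
    return sum(abs(v - center) for v in values)

def sum_sq_dev(values, center):
    """Sum of squared L2 distances from each value to a candidate centroid."""
    return sum((v - center) ** 2 for v in values)

values = [1, 2, 10]                 # hypothetical cluster with one distant point
median = statistics.median(values)  # 2
mean = sum(values) / len(values)    # 13/3

print(sum_abs_dev(values, median), sum_abs_dev(values, mean))
print(sum_sq_dev(values, median), sum_sq_dev(values, mean))
```

The median gives the smaller sum of absolute deviations, while the mean gives the smaller sum of squared deviations, which is why each objective function in Table 8.2 has its own centroid.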

The last entry in the table, Bregman divergence (Section 2.4.5), is actually a class of proximity measures that includes the squared Euclidean distance, L2², the Mahalanobis distance, and cosine similarity. The importance of Bregman divergence functions is that any such function can be used as the basis of a K-means style clustering algorithm with the mean as the centroid. Specifically, if we use a Bregman divergence as our proximity function, then the resulting clustering algorithm has the usual properties of K-means with respect to convergence, local minima, etc. Furthermore, the properties of such a clustering algorithm can be developed for all possible Bregman divergences. Indeed, K-means algorithms that use cosine similarity or squared Euclidean distance are particular instances of a general clustering algorithm based on Bregman divergences.
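As a small check of the claim that squared Euclidean distance is a Bregman divergence, the sketch below assumes the usual definition, D(x, y) = φ(x) − φ(y) − ⟨∇φ(y), x − y⟩ for a strictly convex function φ, and uses φ(x) = ‖x‖². Section 2.4.5 is not reproduced here, so the definition and the function names should be read as assumptions of this example.

```python
def bregman(phi, grad_phi, x, y):
    """Bregman divergence D(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>."""
    inner = sum(g * (a - b) for g, a, b in zip(grad_phi(y), x, y))
    return phi(x) - phi(y) - inner

def sq_norm(v):
    """phi(v) = squared Euclidean norm of v."""
    return sum(a * a for a in v)

def grad_sq_norm(v):
    """Gradient of the squared norm: 2v."""
    return [2 * a for a in v]

def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

x, y = (1.0, 2.0), (4.0, 6.0)
print(bregman(sq_norm, grad_sq_norm, x, y))  # 5 - 52 + 72 = 25.0
print(squared_euclidean(x, y))               # 9 + 16 = 25.0
```

Other choices of φ would give other members of the class without changing `bregman` itself.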

For the rest of our K-means discussion, we use two-dimensional data since it is easy to explain K-means and its properties for this type of data. But, as suggested by the last few paragraphs, K-means is a very general clustering algorithm and can be used with a wide variety of data types, such as documents and time series.

Choosing Initial Centroids

When random initialization of centroids is used, different runs of K-means typically produce different total SSEs. We illustrate this with the set of two-dimensional points shown in Figure 8.3, which has three natural clusters of points. Figure 8.4(a) shows a clustering solution that is the global minimum of
