however. It cannot handle non-globular clusters or clusters of different sizes and densities, although it can typically find pure subclusters if a large enough number of clusters is specified. K-means also has trouble clustering data that contains outliers. Outlier detection and removal can help significantly in such situations. Finally, K-means is restricted to data for which there is a notion of a center (centroid). A related technique, K-medoid clustering, does not have this restriction, but is more expensive.
8.2.6 K-means as an Optimization Problem
Here, we delve into the mathematics behind K-means. This section, which can be skipped without loss of continuity, requires knowledge of calculus through partial derivatives. Familiarity with optimization techniques, especially those based on gradient descent, may also be helpful.
As mentioned earlier, given an objective function such as "minimize SSE," clustering can be treated as an optimization problem. One way to solve this problem, that is, to find a global optimum, is to enumerate all possible ways of dividing the points into clusters and then choose the set of clusters that best satisfies the objective function, e.g., that minimizes the total SSE. Of course, this exhaustive strategy is computationally infeasible and, as a result, a more practical approach is needed, even if such an approach finds solutions that are not guaranteed to be optimal. One technique, which is known as gradient descent, is based on picking an initial solution and then repeating the following two steps: compute the change to the solution that best optimizes the objective function and then update the solution.
We assume that the data is one-dimensional, i.e., dist(x, y) = (x - y)^2. This does not change anything essential, but greatly simplifies the notation.
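To make the gradient descent idea above concrete in this one-dimensional setting, the short sketch below (a hypothetical illustration; the function name, step size, and data are not from the text) repeatedly moves a single center against the gradient of its SSE. The iterates approach the mean of the points, foreshadowing the derivation that follows.

import numpy as np

# Hypothetical illustration of gradient descent on the SSE of a single
# one-dimensional center c: SSE(c) = sum((c - x)^2), so d(SSE)/dc = 2*sum(c - x).

def gradient_descent_center(points, c0=0.0, learning_rate=0.01, n_steps=200):
    c = c0
    for _ in range(n_steps):
        grad = 2.0 * np.sum(c - points)   # gradient of the SSE at c
        c = c - learning_rate * grad      # update: move against the gradient
    return c

points = np.array([1.0, 2.0, 6.0, 7.0])
print(gradient_descent_center(points))   # approaches 4.0 ...
print(points.mean())                     # ... the mean of the points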
Derivation of K-means as an Algorithm to Minimize the SSE
In this section, we show how the centroid for the K-means algorithm can be mathematically derived when the proximity function is Euclidean distance and the objective is to minimize the SSE. Specifically, we investigate how we can best update a cluster centroid so that the cluster SSE is minimized. In mathematical terms, we seek to minimize Equation 8.1, which we repeat here, specialized for one-dimensional data.
$$
\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} (c_i - x)^2 \qquad (8.4)
$$
Here, C_i is the ith cluster, x is a point in C_i, and c_i is the mean of the ith cluster. See Table 8.1 for a complete list of notation.
We can solve for the kth centroid c_k, which minimizes Equation 8.4, by differentiating the SSE, setting it equal to 0, and solving, as indicated below.
$$
\frac{\partial}{\partial c_k}\,\mathrm{SSE}
  = \frac{\partial}{\partial c_k} \sum_{i=1}^{K} \sum_{x \in C_i} (c_i - x)^2
  = \sum_{i=1}^{K} \sum_{x \in C_i} \frac{\partial}{\partial c_k} (c_i - x)^2
  = \sum_{x \in C_k} 2\,(c_k - x)
$$

$$
\sum_{x \in C_k} 2\,(c_k - x) = 0
  \;\Rightarrow\; m_k c_k = \sum_{x \in C_k} x
  \;\Rightarrow\; c_k = \frac{1}{m_k} \sum_{x \in C_k} x,
$$

where m_k is the number of points in the kth cluster.
Thus, as previously indicated, the best centroid for minimizing the SSE of a cluster is the mean of the points in the cluster.
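As a quick numerical sanity check of this result (a sketch with invented one-dimensional data, not an example from the text), the snippet below scans candidate centers and confirms that the SSE is smallest at the cluster mean.

import numpy as np

# For a fixed cluster, SSE(c) = sum((c - x)^2) should be minimized at the mean.
cluster = np.array([0.40, 0.22, 0.35, 0.26])   # made-up one-dimensional points

def sse(c, points):
    return np.sum((c - points) ** 2)

candidates = np.linspace(cluster.min(), cluster.max(), 1001)
best = candidates[np.argmin([sse(c, cluster) for c in candidates])]

print(best)             # approximately 0.3075 ...
print(cluster.mean())   # ... which is the cluster mean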
Derivation of K-means for SAE
To demonstrate that the K-means algorithm can be applied to a variety of different objective functions, we consider how to partition the data into K clusters such that the sum of the Manhattan (L1) distances of points from the center of their clusters is minimized. We are seeking to minimize the sum of the L1 absolute errors (SAE) as given by the following equation, where dist_{L1} is the L1 distance. Again, for notational simplicity, we use one-dimensional data, i.e., dist_{L1} = |c_i - x|.
$$
\mathrm{SAE} = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}_{L_1}(c_i, x) \qquad (8.5)
$$
We can solve for the kth centroid c_k, which minimizes Equation 8.5, by differentiating the SAE, setting it equal to 0, and solving.
$$
\frac{\partial}{\partial c_k}\,\mathrm{SAE}
  = \frac{\partial}{\partial c_k} \sum_{i=1}^{K} \sum_{x \in C_i} |c_i - x|
  = \sum_{x \in C_k} \frac{\partial}{\partial c_k} |c_k - x|
  = \sum_{x \in C_k} \operatorname{sign}(c_k - x) = 0
$$
If we solve for c_k, we find that c_k = median{x in C_k}, the median of the points in the cluster. The median of a group of points is straightforward to compute and less susceptible to distortion by outliers.
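A companion check (again with invented one-dimensional data) shows that the SAE is minimized at the median rather than the mean, and illustrates the robustness to outliers noted above.

import numpy as np

# SAE(c) = sum(|c - x|) is minimized at the median, not the mean.
cluster = np.array([0.1, 0.2, 0.3, 0.4, 5.0])   # note the outlier at 5.0

def sae(c, points):
    return np.sum(np.abs(c - points))

candidates = np.linspace(cluster.min(), cluster.max(), 10001)
best = candidates[np.argmin([sae(c, cluster) for c in candidates])]

print(best)               # close to 0.3, the median
print(np.median(cluster)) # 0.3, unaffected by the outlier
print(cluster.mean())     # 1.2, pulled toward the outlier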
8.3 Agglomerative Hierarchical Clustering
Hierarchical clustering techniques are a second important category of clustering methods. As with K-means, these approaches are relatively old compared to many clustering algorithms, but they still enjoy widespread use. There are two basic approaches for generating a hierarchical clustering:
Agglomerative: Start with the points as individual clusters and, at each step, merge the closest pair of clusters. This requires defining a notion of cluster proximity.
Divisive: Start with one, all-inclusive cluster and, at each step, split a cluster until only singleton clusters of individual points remain. In this case, we need to decide which cluster to split at each step and how to do the splitting.
Agglomerative hierarchical clustering techniques are by far the most common, and, in this section, we will focus exclusively on these methods. A divisive hierarchical clustering technique is described in Section 9.4.2.
A hierarchical clustering is often displayed graphically using a tree-like diagram called a dendrogram, which displays both the cluster-subcluster relationships and the order in which the clusters were merged (agglomerative view) or split (divisive view). For sets of two-dimensional points, such as those that we will use as examples, a hierarchical clustering can also be graphically represented using a nested cluster diagram. Figure 8.13 shows an example of these two types of figures for a set of four two-dimensional points. These points were clustered using the single-link technique that is described in Section 8.3.2.

(a) Dendrogram. (b) Nested cluster diagram.
Figure 8.13. A hierarchical clustering of four points shown as a dendrogram and as nested clusters.
8.3.1 Basic Agglomerative Hierarchical Clustering Algorithm
Many agglomerative hierarchical clustering techniques are variations on a single approach: starting with individual points as clusters, successively merge the two closest clusters until only one cluster remains. This approach is expressed more formally in Algorithm 8.3.
Algorithm 8.3 Basic agglomerative hierarchical clustering algorithm.
1: Compute the proximity matrix, if necessary.
2: repeat
3:    Merge the closest two clusters.
4:    Update the proximity matrix to reflect the proximity between the new cluster and the original clusters.
5: until Only one cluster remains.
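The following sketch is one possible rendering of Algorithm 8.3 in code. It assumes Euclidean distance and single-link (MIN) cluster proximity, which is defined in the next subsection; for simplicity it recomputes cluster proximities from the point-level distance matrix rather than incrementally updating a cluster-level proximity matrix, and the function name and sample points are purely illustrative.

import numpy as np

def basic_agglomerative(points):
    """Return the sequence of merges as (cluster_a, cluster_b) pairs."""
    # Step 1: compute the proximity (distance) matrix.
    prox = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    clusters = [[i] for i in range(len(points))]   # start with singleton clusters
    merges = []

    # Steps 2-5: repeatedly merge the closest pair of clusters.
    while len(clusters) > 1:
        best_pair, best_dist = None, np.inf
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single-link (MIN) proximity between clusters a and b.
                d = min(prox[i, j] for i in clusters[a] for j in clusters[b])
                if d < best_dist:
                    best_pair, best_dist = (a, b), d
        a, b = best_pair
        merges.append((list(clusters[a]), list(clusters[b])))
        clusters[a] = clusters[a] + clusters[b]    # merge cluster b into cluster a
        del clusters[b]
    return merges

# Illustrative points (not the data from Figure 8.13).
pts = np.array([[0.40, 0.53], [0.22, 0.38], [0.35, 0.32], [0.26, 0.19]])
for step, (ca, cb) in enumerate(basic_agglomerative(pts), 1):
    print("merge", step, ":", ca, cb)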
Defining Proximity between Clusters
The key operation of Algorithm 8.3 is the computation of the proximity between two clusters, and it is the definition of cluster proximity that differentiates the various agglomerative hierarchical techniques that we will discuss. Cluster proximity is typically defined with a particular type of cluster in mind (see Section 8.1.2). For example, many agglomerative hierarchical clustering techniques, such as MIN, MAX, and Group Average, come from a graph-based view of clusters. MIN defines cluster proximity as the proximity between the closest two points that are in different clusters, or using graph terms, the shortest edge between two nodes in different subsets of nodes. This yields contiguity-based clusters as shown in Figure 8.2(c). Alternatively, MAX takes the proximity between the farthest two points in different clusters to be the cluster proximity, or using graph terms, the longest edge between two nodes in different subsets of nodes. (If our proximities are distances, then the names, MIN and MAX, are short and suggestive. For similarities, however, where higher values indicate closer points, the names seem reversed. For that reason, we usually prefer to use the alternative names, single link and complete link, respectively.) Another graph-based approach, the group average technique, defines cluster proximity to be the average pairwise proximities (average length of edges) of all pairs of points from different clusters. Figure 8.14 illustrates these three approaches.
(a) MIN (single link). (b) MAX (complete link). (c) Group average.
Figure 8.14. Graph-based definitions of cluster proximity.
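In code, the three graph-based definitions reduce to a minimum, a maximum, and an average over the pairwise proximities between the two clusters. The helper functions below are an illustrative sketch (the names are invented), operating on a point-level distance matrix and two lists of point indices.

import numpy as np

def min_proximity(dist, A, B):        # MIN / single link
    return min(dist[i, j] for i in A for j in B)

def max_proximity(dist, A, B):        # MAX / complete link
    return max(dist[i, j] for i in A for j in B)

def group_average_proximity(dist, A, B):
    return sum(dist[i, j] for i in A for j in B) / (len(A) * len(B))

# Small made-up symmetric distance matrix over four points.
dist = np.array([[0.0, 0.2, 0.5, 0.6],
                 [0.2, 0.0, 0.4, 0.7],
                 [0.5, 0.4, 0.0, 0.1],
                 [0.6, 0.7, 0.1, 0.0]])
A, B = [0, 1], [2, 3]
print(min_proximity(dist, A, B))            # 0.4
print(max_proximity(dist, A, B))            # 0.7
print(group_average_proximity(dist, A, B))  # 0.55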
If, instead, we take a prototype-based view, in which each cluster is represented by a centroid, different definitions of cluster proximity are more natural. When using centroids, the cluster proximity is commonly defined as the proximity between cluster centroids. An alternative technique, Ward's method, also assumes that a cluster is represented by its centroid, but it measures the proximity between two clusters in terms of the increase in the SSE that results from merging the two clusters. Like K-means, Ward's method attempts to minimize the sum of the squared distances of points from their cluster centroids.
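For Ward's method, the increase in SSE caused by merging two clusters can be computed from the cluster sizes and centroids alone, using the standard identity delta_SSE = n_A * n_B / (n_A + n_B) * ||c_A - c_B||^2. The sketch below (the helper name and sample data are invented) applies that identity.

import numpy as np

# Sketch of Ward's cluster proximity: the increase in total SSE that results
# from merging clusters A and B, computed from sizes and centroids only.

def ward_proximity(A, B):
    """A, B: arrays of points, one row per point."""
    n_a, n_b = len(A), len(B)
    c_a, c_b = A.mean(axis=0), B.mean(axis=0)
    return (n_a * n_b) / (n_a + n_b) * np.sum((c_a - c_b) ** 2)

# Made-up example: merging two well-separated pairs of points.
A = np.array([[0.0, 0.0], [0.2, 0.0]])
B = np.array([[1.0, 1.0], [1.2, 1.0]])
print(ward_proximity(A, B))   # 2.0, the SSE increase caused by the merge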
Time and Space Complexity
The basic agglomerative hierarchical clustering algorithm just presented uses a proximity matrix. This requires the storage of (1/2)m^2 proximities (assuming the proximity matrix is symmetric), where m is the number of data points. The space needed to keep track of the clusters is proportional to the number of clusters, which is m - 1, excluding singleton clusters. Hence, the total space complexity is O(m^2).
The analysis of the basic agglomerative hierarchical clustering algorithm is also straightforward with respect to computational complexity. O(m^2) time is required to compute the proximity matrix. After that step, there are m - 1 iterations involving steps 3 and 4 because there are m clusters at the start and two clusters are merged during each iteration. If performed as a linear search of the proximity matrix, then for the ith iteration, step 3 requires O((m - i + 1)^2) time, which is proportional to the current number of clusters squared. Step 4 only requires O(m - i + 1) time to update the proximity matrix after the merger of two clusters. (A cluster merger affects only O(m - i + 1) proximities for the techniques that we consider.) Without modification, this would yield a time complexity of O(m^3). If the distances from each cluster to all other clusters are stored as a sorted list (or heap), it is possible to reduce the cost of finding the two closest clusters to O(m - i + 1). However, because of the additional complexity of keeping data in a sorted list or heap, the overall time required for a hierarchical clustering based on Algorithm 8.3 is O(m^2 log m).
The space and time complexity of hierarchical clustering severely limits the size of data sets that can be processed. We discuss scalability approaches for clustering algorithms, including hierarchical clustering techniques, in Section 9.5.
8.3.2 Specific Techniques
Sample Data
To illustrate the behavior of the various hierarchical clustering algorithms, we shall use sample data that consists of 6 two-dimensional points, which are shown in Figure 8.15. The x and y coordinates of the points and the Euclidean distances between them are shown in Tables 8.3 and 8.4, respectively.
Figure 8.15. Set of 6 two-dimensional points.

Point   x Coordinate   y Coordinate
p1      0.40           0.53
p2      0.22           0.38
p3      0.35           0.32
p4      0.26           0.19
p5      0.08           0.41
p6      0.45           0.30

Table 8.3. The x and y coordinates of 6 points.

      p1     p2     p3     p4     p5     p6
p1    0.00   0.24   0.22   0.37   0.34   0.23
p2    0.24   0.00   0.15   0.20   0.14   0.25
p3    0.22   0.15   0.00   0.15   0.28   0.11
p4    0.37   0.20   0.15   0.00   0.29   0.22
p5    0.34   0.14   0.28   0.29   0.00   0.39
p6    0.23   0.25   0.11   0.22   0.39   0.00

Table 8.4. Euclidean distance matrix for 6 points.
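As a small reproducibility check, the snippet below recomputes the Euclidean distance matrix from the coordinates in Table 8.3 (a sketch; values are rounded to two decimals, and tiny differences from Table 8.4 may remain if the table was computed from unrounded coordinates).

import numpy as np

# Recompute the Euclidean distance matrix of Table 8.4 from Table 8.3.
points = np.array([[0.40, 0.53],   # p1
                   [0.22, 0.38],   # p2
                   [0.35, 0.32],   # p3
                   [0.26, 0.19],   # p4
                   [0.08, 0.41],   # p5
                   [0.45, 0.30]])  # p6

# Pairwise differences broadcast to shape (6, 6, 2); norm over the last axis.
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
print(np.round(dist, 2))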
Single Link or MIN
For the single link or MIN version of hierarchical clustering, the proximity of two clusters is defined as the minimum of the distance (maximum of the similarity) between any two points in the two different clusters. Using graph terminology, if you start with all points as singleton clusters and add links between points one at a time, shortest links first, then these single links combine the points into clusters. The single link technique is good at handling non-elliptical shapes, but is sensitive to noise and outliers.