Consider a set of adjacent, k-dimensional cells that form a cluster; i.e., there is a collection of adjacent cells that have a density above the specified threshold ξ. A corresponding set of cells in k − 1 dimensions can be found by omitting one of the k dimensions (attributes). The lower-dimensional cells are still adjacent, and each low-dimensional cell contains all points of the corresponding high-dimensional cell. It may contain additional points as well. Thus, a low-dimensional cell has a density greater than or equal to that of its corresponding high-dimensional cell. Consequently, the low-dimensional cells form a cluster; i.e., the points form a cluster with the reduced set of attributes.

Algorithm 9.5 gives a simplified version of the steps involved in CLIQUE. Conceptually, the CLIQUE algorithm is similar to the Apriori algorithm for finding frequent itemsets. See Chapter 6.
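The Apriori-like, bottom-up search over subspaces can be sketched as follows. This is an illustrative simplification, not the published CLIQUE implementation: the data is assumed to lie in the unit hypercube, the grid resolution `n_bins` and the density threshold `xi` are arbitrary choices, and all helper names are ours.

```python
from itertools import combinations

def clique_dense_cells(points, n_bins=4, xi=2, lo=0.0, hi=1.0):
    """Find dense grid cells bottom-up, one dimensionality at a time."""
    dim = len(points[0])
    width = (hi - lo) / n_bins

    def cell_of(p, dims):
        # Cell id: the grid interval index of the point along each chosen dimension.
        return tuple((d, min(int((p[d] - lo) / width), n_bins - 1)) for d in dims)

    def dense(dims):
        counts = {}
        for p in points:
            c = cell_of(p, dims)
            counts[c] = counts.get(c, 0) + 1
        return {c for c, n in counts.items() if n >= xi}  # keep cells with >= xi points

    # Dense one-dimensional cells (step 1 of the algorithm).
    dense_by_k = {1: set().union(*(dense((d,)) for d in range(dim)))}
    k = 2
    while k <= dim:
        # Candidate k-dimensional cells: cells all of whose (k-1)-dimensional
        # projections were dense (the monotonicity property discussed above).
        cands = set()
        for dims in combinations(range(dim), k):
            for c in dense(dims):
                if all(tuple(cc for cc in c if cc[0] != d) in dense_by_k[k - 1]
                       for d in dims):
                    cands.add(c)
        if not cands:
            break
        dense_by_k[k] = cands
        k += 1
    return dense_by_k
```

Adjacent dense cells in the result can then be unioned into clusters, as in the final steps of the algorithm.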

Strengths and Limitations of CLIQUE

The most useful feature of CLIQUE is that it provides an efficient technique for searching subspaces for clusters. Since this approach is based on the well-

608 Chapter 9 Cluster Analysis: Additional Issues and Algorithms

Algorithm 9.5 CLIQUE.
1: Find all the dense areas in the one-dimensional spaces corresponding to each attribute. This is the set of dense one-dimensional cells.
2: k ← 2
3: repeat
4: Generate all candidate dense k-dimensional cells from dense (k − 1)-dimensional cells.
5: Eliminate cells that have fewer than ξ points.
6: k ← k + 1
7: until There are no candidate dense k-dimensional cells.
8: Find clusters by taking the union of all adjacent, high-density cells.
9: Summarize each cluster using a small set of inequalities that describe the attribute ranges of the cells in the cluster.

known Apriori principle from association analysis, its properties are well understood. Another useful feature is CLIQUE's ability to summarize the list of cells that comprises a cluster with a small set of inequalities.

Many limitations of CLIQUE are identical to the previously discussed limitations of other grid-based density schemes. Other limitations are similar to those of the Apriori algorithm. Specifically, just as frequent itemsets can share items, the clusters found by CLIQUE can share objects. Allowing clusters to overlap can greatly increase the number of clusters and make interpretation difficult. Another issue is that Apriori (and CLIQUE) potentially have exponential time complexity. In particular, CLIQUE will have difficulty if too many dense cells are generated at lower values of k. Raising the density threshold ξ can alleviate this problem. Still another potential limitation of CLIQUE is explored in Exercise 20 on page 650.

9.3.3 DENCLUE: A Kernel-Based Scheme for Density-Based Clustering

DENCLUE (DENsity CLUstEring) is a density-based clustering approach that models the overall density of a set of points as the sum of influence functions associated with each point. The resulting overall density function will have local peaks, i.e., local density maxima, and these local peaks can be used to define clusters in a natural way. Specifically, for each data point, a hill-climbing procedure finds the nearest peak associated with that point, and the set of all data points associated with a particular peak (called a local density attractor) becomes a cluster. However, if the density at a local peak is too low, then the points in the associated cluster are classified as noise


Figure 9.13. Illustration of DENCLUE density concepts in one dimension.

and discarded. Also, if a local peak can be connected to a second local peak by a path of data points, and the density at each point on the path is above the minimum density threshold, then the clusters associated with these local peaks are merged. Therefore, clusters of any shape can be discovered.

Example 9.10 (DENCLUE Density). We illustrate these concepts with Figure 9.13, which shows a possible density function for a one-dimensional data set. Points A-E are the peaks of this density function and represent local density attractors. The dotted vertical lines delineate local regions of influence for the local density attractors. Points in these regions will become center-defined clusters. The dashed horizontal line shows a density threshold, ξ. All points associated with a local density attractor that has a density less than ξ, such as those associated with C, will be discarded. All other clusters are kept. Note that this can include points whose density is less than ξ, as long as they are associated with local density attractors whose density is greater than ξ. Finally, clusters that are connected by a path of points with a density above ξ are combined. Clusters A and B would remain separate, while clusters D and E would be combined.

The high-level details of the DENCLUE algorithm are summarized in Algorithm 9.6. Next, we explore various aspects of DENCLUE in more detail. First, we provide a brief overview of kernel density estimation and then present the grid-based approach that DENCLUE uses for approximating the density.
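The attractor-finding steps of this procedure can be sketched as follows. This is our simplification, not DENCLUE's actual implementation: the Gaussian kernel with bandwidth `sigma`, the fixed-step hill climb, and the coarse rounding that merges climbs ending at the same peak are all assumptions made for illustration.

```python
import math

def density(x, points, sigma=0.3):
    """Overall density at x: sum of Gaussian influence functions, one per point."""
    return sum(math.exp(-sum((a - b) ** 2 for a, b in zip(x, p)) / (2 * sigma ** 2))
               for p in points)

def climb(x, points, sigma=0.3, step=0.05, iters=200):
    """Hill-climb from x toward its density attractor (a local density maximum)."""
    x = list(x)
    for _ in range(iters):
        grad = [0.0] * len(x)
        for p in points:
            w = math.exp(-sum((a - b) ** 2 for a, b in zip(x, p)) / (2 * sigma ** 2))
            for d in range(len(x)):
                grad[d] += w * (p[d] - x[d]) / sigma ** 2  # analytic gradient
        norm = math.sqrt(sum(g * g for g in grad)) or 1.0
        new = [a + step * g / norm for a, g in zip(x, grad)]
        if density(new, points, sigma) <= density(x, points, sigma):
            break  # no further uphill progress at this step size
        x = new
    # Snap to a coarse grid so climbs that reach the same peak agree exactly.
    return tuple(round(v * 2) / 2 for v in x)

def denclue_clusters(points, sigma=0.3):
    """Group the points by the density attractor their hill climb reaches."""
    clusters = {}
    for p in points:
        clusters.setdefault(climb(p, points, sigma), []).append(p)
    return clusters
```

The noise-discarding and cluster-merging steps would then compare each attractor's density, and the densities along paths between attractors, against the threshold ξ.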

Kernel Density Estimation

DENCLUE is based on a well-developed area of statistics and pattern recognition that is known as kernel density estimation. The goal of this collection



Algorithm 9.6 DENCLUE algorithm.
1: Derive a density function for the space occupied by the data points.
2: Identify the points that are local maxima. (These are the density attractors.)
3: Associate each point with a density attractor by moving in the direction of maximum increase in density.
4: Define clusters consisting of points associated with a particular density attractor.
5: Discard clusters whose density attractor has a density less than a user-specified threshold of ξ.
6: Combine clusters that are connected by a path of points that all have a density of ξ or higher.

of techniques (and many other statistical techniques as well) is to describe the distribution of the data by a function. For kernel density estimation, the contribution of each point to the overall density function is expressed by an influence or kernel function. The overall density function is simply the sum of the influence functions associated with each point.

Typically, the influence or kernel function is symmetric (the same in all directions) and its value (contribution) decreases as the distance from the point increases. For example, for a particular point, x, the Gaussian function,

K(y) = e^(−distance(x, y)² / 2σ²),

is often used as a kernel function. (σ is a parameter, analogous to the standard deviation, that governs how quickly the influence of a point drops off.) Figure 9.14(a) shows what a Gaussian density function would look like for a single point in two dimensions, while Figures 9.14(c) and 9.14(d) show the overall density function produced by applying the Gaussian influence function to the set of points shown in Figure 9.14(b).
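For instance, a one-dimensional kernel density estimate with a Gaussian kernel can be written directly from the definition above; the bandwidth σ = 0.5 and the sample points are arbitrary choices for illustration.

```python
import math

def kde(x, points, sigma=0.5):
    """Kernel density estimate at x: sum of the Gaussian influences of all points."""
    return sum(math.exp(-(x - p) ** 2 / (2 * sigma ** 2)) for p in points)

points = [1.0, 1.2, 1.4, 4.0]
# The density is highest near the tight group around 1.2 and much lower near
# the isolated point at 4.0, which contributes essentially only to itself.
```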

Implementation Issues

Computation of kernel density can be quite expensive, and DENCLUE uses a number of approximations to implement its basic approach efficiently. First, it explicitly computes density only at data points. However, this still would result in an O(m²) time complexity since the density at each point is a function of the density contributed by every point. To reduce the time complexity, DENCLUE uses a grid-based implementation to efficiently define neighborhoods and thus limit the number of points that need to be considered to define the density at a point. First, a preprocessing step creates a set of grid cells. Only occupied cells are created, and these cells and their related information can be efficiently accessed via a search tree. Then, when computing the density of a point and finding its nearest density attractor, DENCLUE considers only the points in

Figure 9.14. Example of the Gaussian influence (kernel) function and an overall density function. (b) Set of 12 points. (c) Overall density, gray scale plot. (d) Overall density, surface plot.

the neighborhood; i.e., points in the same cell and in cells that are connected to the point’s cell. While this approach can sacrifice some accuracy with respect to density estimation, computational complexity is greatly reduced.
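The grid-based approximation can be sketched as follows. This is our simplification of the idea, not DENCLUE's implementation: occupied cells are indexed in a dictionary rather than a search tree, and only the query's cell and its immediately adjacent cells contribute to the density.

```python
import math
from itertools import product

def grid_density(points, query, width=1.0, sigma=0.5):
    """Approximate the kernel density at `query` using only points that fall in
    the query's grid cell or one of the adjacent cells."""
    # Preprocessing: index only the occupied cells (cell id -> points in it).
    cells = {}
    for p in points:
        key = tuple(math.floor(c / width) for c in p)
        cells.setdefault(key, []).append(p)

    qkey = tuple(math.floor(c / width) for c in query)
    total = 0.0
    for offset in product((-1, 0, 1), repeat=len(qkey)):  # the 3^d neighborhood
        for p in cells.get(tuple(q + o for q, o in zip(qkey, offset)), []):
            d2 = sum((a - b) ** 2 for a, b in zip(query, p))
            total += math.exp(-d2 / (2 * sigma ** 2))
    return total
```

Points outside the neighborhood are ignored entirely, which is exactly the accuracy-for-speed trade-off described in the text: their kernel contributions are small but not zero.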

Strengths and Limitations of DENCLUE

DENCLUE has a solid theoretical foundation because it is based on kernel density functions and the notion of kernel density estimation, which is a well-developed area of statistics. For this reason, DENCLUE provides a more flexible and potentially more accurate way to compute density than other grid-based clustering techniques and DBSCAN. (DBSCAN is a special case of DENCLUE.) An approach based on kernel density functions is inherently


computationally expensive, but DENCLUE employs grid-based techniques to address such issues. Nonetheless, DENCLUE can be more computationally expensive than other density-based clustering techniques. Also, the use of a grid can adversely affect the accuracy of the density estimation, and it makes DENCLUE susceptible to problems common to grid-based approaches; e.g., the difficulty of choosing the proper grid size. More generally, DENCLUE shares many of the strengths and limitations of other density-based approaches. For instance, DENCLUE is good at handling noise and outliers and it can find clusters of different shapes and sizes, but it has trouble with high-dimensional data and data that contains clusters of widely different densities.

9.4 Graph-Based Clustering

Section 8.3 discussed a number of clustering techniques that took a graph-based view of data, in which data objects are represented by nodes and the proximity between two data objects is represented by the weight of the edge between the corresponding nodes. This section considers some additional graph-based clustering algorithms that use a number of key properties and characteristics of graphs. The following are some key approaches, different subsets of which are employed by these algorithms.

1. Sparsify the proximity graph to keep only the connections of an object with its nearest neighbors. This sparsification is useful for handling noise and outliers. It also allows the use of highly efficient graph partitioning algorithms that have been developed for sparse graphs.

2. Define a similarity measure between two objects based on the number of nearest neighbors that they share. This approach, which is based on the observation that an object and its nearest neighbors usually belong to the same class, is useful for overcoming problems with high dimensionality and clusters of varying density.

3. Define core objects and build clusters around them. To do this for graph-based clustering, it is necessary to introduce a notion of density based on a proximity graph or a sparsified proximity graph. As with DBSCAN, building clusters around core objects leads to a clustering technique that can find clusters of differing shapes and sizes.

4. Use the information in the proximity graph to provide a more sophisticated evaluation of whether two clusters should be merged. Specifically, two clusters are merged only if the resulting cluster will have characteristics similar to the original two clusters.

We begin by discussing the sparsification of proximity graphs, providing two examples of techniques whose approach to clustering is based solely on this technique: MST, which is equivalent to the single link clustering algorithm, and OPOSSUM. We then discuss Chameleon, a hierarchical clustering algorithm that uses a notion of self-similarity to determine if clusters should be merged. We next define Shared Nearest Neighbor (SNN) similarity, a new similarity measure, and introduce the Jarvis-Patrick clustering algorithm, which uses this similarity. Finally, we discuss how to define density and core objects based on SNN similarity and introduce an SNN density-based clustering algorithm, which can be viewed as DBSCAN with a new similarity measure.

9.4.1 Sparsification

The m by m proximity matrix for m data points can be represented as a dense graph in which each node is connected to all others and the weight of the edge between any pair of nodes reflects their pairwise proximity. Although every object has some level of similarity to every other object, for most data sets, objects are highly similar to a small number of objects and weakly similar to most other objects. This property can be used to sparsify the proximity graph (matrix), by setting many of these low-similarity (high-dissimilarity) values to 0 before beginning the actual clustering process. The sparsification may be performed, for example, by breaking all links that have a similarity (dissimilarity) below (above) a specified threshold or by keeping only links to the k nearest neighbors of a point. This latter approach creates what is called a k-nearest neighbor graph.

Sparsification has several beneficial effects:

Data size is reduced. The amount of data that needs to be processed to cluster the data is drastically reduced. Sparsification can often eliminate more than 99% of the entries in a proximity matrix. As a result, the size of problems that can be handled is increased.

Clustering may work better. Sparsification techniques keep the connections of an object to its nearest neighbors while breaking the connections to more distant objects. This is in keeping with the nearest neighbor principle that the nearest neighbors of an object tend to belong to the same class (cluster) as the object itself. This reduces the impact of noise and outliers and sharpens the distinction between clusters.

9 .4

61.4 Chapter 9 Cluster Analysis: Additional Issues and Algorithms

Figure 9.15. Ideal process of clustering using sparsification.

Graph partitioning algorithms can be used. There has been a considerable amount of work on heuristic algorithms for finding min-cut partitionings of sparse graphs, especially in the areas of parallel computing and the design of integrated circuits. Sparsification of the proximity graph makes it possible to use graph partitioning algorithms for the clustering process. For example, OPOSSUM and Chameleon use graph partitioning.

Sparsification of the proximity graph should be regarded as an initial step before the use of actual clustering algorithms. In theory, a perfect sparsification could leave the proximity matrix split into connected components corresponding to the desired clusters, but in practice, this rarely happens. It is easy for a single edge to link two clusters or for a single cluster to be split into several disconnected subclusters. Indeed, as we shall see when we discuss Jarvis-Patrick and SNN density-based clustering, the sparse proximity graph is often modified to yield a new proximity graph. This new proximity graph can again be sparsified. Clustering algorithms work with the proximity graph that is the result of all these preprocessing steps. This process is summarized in Figure 9.15.
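The k-nearest neighbor sparsification described above can be sketched as follows. Euclidean distance and the toy data are assumptions for illustration; note that the resulting graph is directed, since nearest-neighbor relations are not symmetric.

```python
def knn_graph(points, k=2):
    """Keep, for each point, only the edges to its k nearest neighbors."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    edges = {}
    for i, p in enumerate(points):
        # Sort all other points by distance and keep the k closest.
        others = sorted((dist(p, q), j) for j, q in enumerate(points) if j != i)
        edges[i] = [j for _, j in others[:k]]  # indices of the k nearest neighbors
    return edges
```

For m points this drops the proximity structure from m(m−1)/2 entries to at most km directed edges, which is the data-size reduction discussed above.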

9.4.2 Minimum Spanning Tree (MST) Clustering

In Section 8.3, where we described agglomerative hierarchical clustering techniques, we mentioned that divisive hierarchical clustering algorithms also exist. We saw an example of one such technique, bisecting K-means, in Section 8.2.3. Another divisive hierarchical technique, MST, starts with the minimum spanning tree of the proximity graph and can be viewed as an application of sparsification for finding clusters. We briefly describe this algorithm. Interestingly, this algorithm also produces the same clustering as single link agglomerative clustering. See Exercise 13 on page 648.

A minimum spanning tree of a graph is a subgraph that (1) has no cycles, i.e., is a tree, (2) contains all the nodes of the graph, and (3) has the minimum total edge weight of all possible spanning trees. The terminology,


Figure 9.16. Minimum spanning tree for a set of six two-dimensional points.

minimum spanning tree, assumes that we are working only with dissimilarities or distances, and we will follow this convention. This is not a limitation, however, since we could convert similarities to dissimilarities or modify the notion of a minimum spanning tree to work with similarities. An example of a minimum spanning tree for some two-dimensional points is shown in Figure 9.16.

The MST divisive hierarchical algorithm is shown in Algorithm 9.7. The first step is to find the MST of the original dissimilarity graph. Note that a minimum spanning tree can be viewed as a special type of sparsified graph. Step 3 can also be viewed as graph sparsification. Hence, MST can be viewed as a clustering algorithm based on the sparsification of the dissimilarity graph.

Algorithm 9.7 MST divisive hierarchical clustering algorithm.
1: Compute a minimum spanning tree for the dissimilarity graph.
2: repeat
3: Create a new cluster by breaking the link corresponding to the largest dissimilarity.
4: until Only singleton clusters remain.
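Algorithm 9.7 can be sketched with Kruskal's algorithm for the MST; stopping when a requested number of clusters remains, rather than continuing to singletons, is a convenience we add for illustration.

```python
def mst_clusters(points, n_clusters):
    """Build an MST (Kruskal) over the complete distance graph, then drop the
    largest MST edges so that n_clusters connected components remain."""
    m = len(points)

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    edges = sorted((dist(points[i], points[j]), i, j)
                   for i in range(m) for j in range(i + 1, m))

    # Union-find for Kruskal's algorithm.
    parent = list(range(m))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    mst = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            mst.append((w, i, j))

    # Divisive step: breaking the n_clusters - 1 largest links is the same as
    # keeping only the m - n_clusters smallest MST links.
    keep = sorted(mst)[:m - n_clusters]
    parent = list(range(m))
    for _, i, j in keep:
        parent[find(i)] = find(j)
    return [find(i) for i in range(m)]  # component label per point
```

Because the edges broken are exactly the largest MST links, the components produced at each level match those of single link agglomerative clustering, as noted above.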


9.4.3 OPOSSUM: Optimal Partitioning of Sparse Similarities Using METIS

OPOSSUM is a clustering technique that was specifically designed for clustering sparse, high-dimensional data, such as document or market basket data. Like MST, it performs clustering based on the sparsification of a proximity graph. However, OPOSSUM uses the METIS algorithm, which was specifically created for partitioning sparse graphs. The steps of OPOSSUM are given in Algorithm 9.8.

Algorithm 9.8 OPOSSUM clustering algorithm.
1: Compute a sparsified similarity graph.
2: Partition the similarity graph into k distinct components (clusters) using METIS.

The similarity measures used are those appropriate for sparse, high-dimensional data, such as the extended Jaccard measure or the cosine measure. The METIS graph partitioning program partitions a sparse graph into k distinct components, where k is a user-specified parameter, in order to (1) minimize the weight of the edges (the similarity) between components and (2) fulfill a balance constraint. OPOSSUM uses one of the following two balance constraints: (1) the number of objects in each cluster must be roughly the same, or (2) the sum of the attribute values must be roughly the same. The second constraint is useful when, for example, the attribute values represent the cost of an item.
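The two similarity measures mentioned above can be sketched for sparse vectors; representing a vector as a dict mapping attribute index to value is our illustration choice.

```python
import math

def cosine(a, b):
    """Cosine similarity for sparse vectors stored as {index: value} dicts."""
    dot = sum(v * b.get(i, 0.0) for i, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def extended_jaccard(a, b):
    """Extended (Tanimoto) Jaccard: dot / (||a||^2 + ||b||^2 - dot)."""
    dot = sum(v * b.get(i, 0.0) for i, v in a.items())
    na2 = sum(v * v for v in a.values())
    nb2 = sum(v * v for v in b.values())
    denom = na2 + nb2 - dot
    return dot / denom if denom else 0.0
```

Both measures ignore attributes that are zero in both vectors, which is what makes them appropriate for sparse document and market basket data.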

Strengths and Weaknesses

OPOSSUM is simple and fast. It partitions the data into roughly equal-sized clusters, which, depending on the goal of the clustering, can be viewed as an advantage or a disadvantage. Because they are constrained to be of roughly equal size, clusters can be broken or combined. However, if OPOSSUM is used to generate a large number of clusters, then these clusters are typically relatively pure pieces of larger clusters. Indeed, OPOSSUM is similar to the initial step of the Chameleon clustering routine, which is discussed next.

9.4.4 Chameleon: Hierarchical Clustering with Dynamic Modeling

Agglomerative hierarchical clustering techniques operate by merging the two most similar clusters, where the definition of cluster similarity depends on

Figure 9.17. Situation in which closeness is not the appropriate merging criterion. ©1999, IEEE

the particular algorithm. Some agglomerative algorithms, such as group average, base their notion of similarity on the strength of the connections between the two clusters (e.g., the pairwise similarity of points in the two clusters), while other techniques, such as the single link method, use the closeness of the clusters (e.g., the minimum distance between points in different clusters) to measure cluster similarity. Although there are two basic approaches, using only one of these two approaches may lead to mistakes in merging clusters. Consider Figure 9.17, which shows four clusters. If we use the closeness of clusters (as measured by the closest two points in different clusters) as our merging criterion, then we would merge the two circular clusters, (c) and (d), which almost touch, instead of the rectangular clusters, (a) and (b), which are separated by a small gap. However, intuitively, we should have merged rectangular clusters, (a) and (b). Exercise 15 on page 649 asks for an example of a situation in which the strength of connections likewise leads to an unintuitive result.

Another problem is that most clustering techniques have a global (static) model of clusters. For instance, K-means assumes that the clusters will be globular, while DBSCAN defines clusters based on a single density threshold. Clustering schemes that use such a global model cannot handle cases in which cluster characteristics, such as size, shape, and density, vary widely between clusters. As an example of the importance of the local (dynamic) modeling of clusters, consider Figure 9.18. If we use the closeness of clusters to determine which pair of clusters should be merged, as would be the case if we used, for example, the single link clustering algorithm, then we would merge clusters (a) and (b). However, we have not taken into account the characteristics of each individual cluster. Specifically, we have ignored the density of the individual clusters. For clusters (a) and (b), which are relatively dense, the distance between the two clusters is significantly larger than the distance between a point and its nearest neighbors within the same cluster. This is not the case for clusters (c) and (d), which are relatively sparse. Indeed, when clusters (c)


and (d) are merged, they yield a cluster that seems more similar to the original clusters than the cluster that results from merging clusters (a) and (b).

Chameleon is an agglomerative clustering algorithm that addresses the issues of the previous two paragraphs. It combines an initial partitioning of the data, using an efficient graph partitioning algorithm, with a novel hierarchical clustering scheme that uses the notions of closeness and interconnectivity, together with the local modeling of clusters. The key idea is that two clusters should be merged only if the resulting cluster is similar to the two original clusters. Self-similarity is described first, and then the remaining details of the Chameleon algorithm are presented.

Deciding Which Clusters to Merge

The agglomerative hierarchical clustering techniques considered in Section 8.3 repeatedly combine the two closest clusters and are principally distinguished from one another by the way they define cluster proximity. In contrast, Chameleon aims to merge the pair of clusters that results in a cluster that is most similar to the original pair of clusters, as measured by closeness and interconnectivity. Because this approach depends only on the pair of clusters and not on a global model, Chameleon can handle data that contains clusters with widely different characteristics.

Following are more detailed explanations of the properties of closeness and interconnectivity. To understand these properties, it is necessary to take a proximity graph viewpoint and to consider the number of the links and the strength of those links among points within a cluster and across clusters.

Relative Closeness (RC) is the absolute closeness of two clusters normalized by the internal closeness of the clusters. Two clusters are combined only if the points in the resulting cluster are almost as close to each other as in each of the original clusters. Mathematically,

RC = S_EC(C_i, C_j) / [ (m_i / (m_i + m_j)) S_EC(C_i) + (m_j / (m_i + m_j)) S_EC(C_j) ],   (9.17)

where m_i and m_j are the sizes of clusters C_i and C_j, respectively; S_EC(C_i, C_j) is the average weight of the edges (of the k-nearest neighbor graph) that connect clusters C_i and C_j; S_EC(C_i) is the average weight of edges if we bisect cluster C_i; and S_EC(C_j) is the average weight of edges if we bisect cluster C_j. (EC stands for edge cut.) Figure 9.18 illustrates the notion of relative closeness. As discussed previously, while clusters

Figure 9.18. Illustration of the notion of relative closeness. ©1999, IEEE

Figure 9.19. Illustration of the notion of relative interconnectivity. ©1999, IEEE

(a) and (b) are closer in absolute terms than clusters (c) and (d), this is not true if the nature of the clusters is taken into account.

Relative Interconnectivity (RI) is the absolute interconnectivity of two clusters normalized by the internal connectivity of the clusters. Two clusters are combined if the points in the resulting cluster are almost as strongly connected as points in each of the original clusters. Mathematically,

RI = EC(C_i, C_j) / [ (1/2) (EC(C_i) + EC(C_j)) ],   (9.18)

where EC(C_i, C_j) is the sum of the edges (of the k-nearest neighbor graph) that connect clusters C_i and C_j; EC(C_i) is the minimum sum of the cut edges if we bisect cluster C_i; and EC(C_j) is the minimum sum of the cut edges if we bisect cluster C_j. Figure 9.19 illustrates the notion of relative interconnectivity. The two circular clusters, (c) and


(d), have more connections than the rectangular clusters, (a) and (b).

However, merging (c) and (d) produces a cluster that has connectivity quite different from that of (c) and (d). In contrast, merging (a) and (b) produces a cluster with connectivity very similar to that of (a) and (b).

RI and RC can be combined in many different ways to yield an overall measure of self-similarity. One approach used in Chameleon is to merge the pair of clusters that maximizes RI(C_i, C_j) * RC(C_i, C_j)^α, where α is a user-specified parameter that is typically greater than 1.
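Given the quantities from Equations 9.17 and 9.18, the merge score can be sketched as follows. The function takes the edge-cut statistics as precomputed inputs, since computing the bisection edge cuts is itself a graph partitioning problem, and the parameter names are ours.

```python
def self_similarity(ec_ij, ec_i, ec_j, sec_ij, sec_i, sec_j, m_i, m_j, alpha=2.0):
    """Chameleon-style merge score: RI(Ci, Cj) * RC(Ci, Cj)**alpha.

    ec_ij         : sum of k-NN graph edge weights connecting Ci and Cj
    ec_i, ec_j    : minimum bisection edge-cut sums of Ci and Cj
    sec_ij        : average weight of edges connecting Ci and Cj
    sec_i, sec_j  : average bisection edge weights of Ci and Cj
    m_i, m_j      : cluster sizes
    """
    ri = ec_ij / (0.5 * (ec_i + ec_j))                       # Equation 9.18
    rc = sec_ij / ((m_i / (m_i + m_j)) * sec_i
                   + (m_j / (m_i + m_j)) * sec_j)            # Equation 9.17
    return ri * rc ** alpha
```

At each step, the pair of clusters with the highest score would be merged; raising `alpha` above 1 weights relative closeness more heavily than relative interconnectivity.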

Chameleon Algorithm

Chameleon consists of three key steps: sparsification, graph partitioning, and hierarchical clustering. Algorithm 9.9 and Figure 9.20 describe these steps.

Algorithm 9.9 Chameleon algorithm.
1: Build a k-nearest neighbor graph.
2: Partition the graph using a multilevel graph partitioning algorithm.
3: repeat
4: Merge the clusters that best preserve the cluster self-similarity with respect to relative interconnectivity and relative closeness.
5: until No more clusters can be merged.

Figure 9.20. Overall process by which Chameleon performs clustering. ©1999, IEEE

Sparsification The first step in Chameleon is to generate a k-nearest neighbor graph. Conceptually, such a graph is derived from the proximity graph, and it contains links only between a point and its k nearest neighbors, i.e., the points to which it is closest. As mentioned, working with a sparsified proximity graph instead of the full proximity graph can significantly reduce the effects of noise and outliers and improve computational efficiency.


Graph Partitioning

Once a sparsified graph has been obtained, an efficient multilevel graph partitioning algorithm, such as METIS (see bibliographic notes), can be used to partition the data set. Chameleon starts with an all-inclusive graph (cluster) and then bisects the largest current subgraph (cluster) until no cluster has more than MIN_SIZE points, where MIN_SIZE is a user-specified parameter. This process results in a large number of roughly equally sized groups of well-connected vertices (highly similar data points). The goal is to ensure that each partition contains objects mostly from one true cluster.

Agglomerative Hierarchical Clustering As discussed previously, Chameleon merges clusters based on the notion of self-similarity. Chameleon can be parameterized to merge more than one pair of clusters in a single step and to stop before all objects have been merged into a single cluster.

Complexity Assume that m is the number of data points and p is the number of partitions. Performing an agglomerative hierarchical clustering of the p partitions obtained from the graph partitioning requires time O(p² log p). (See Section 8.3.1.) The amount of time required for partitioning the graph is O(mp + m log m). The time complexity of graph sparsification depends on how much time it takes to build the k-nearest neighbor graph. For low-dimensional data, this takes O(m log m) time if a k-d tree or a similar type of data structure is used. Unfortunately, such data structures only work well for low-dimensional data sets, and thus, for high-dimensional data sets, the time complexity of the sparsification becomes O(m²). Since only the k-nearest neighbor list needs to be stored, the space complexity is O(km) plus the space required to store the data.