Example 9.11. Chameleon was applied to two data sets that clustering algo- rithms such as K-means and DBSCAN have difficulty clustering. The results of this clustering are shown in Figure 9.21. The clusters are identified by the shading of the points. In Figure 9.21,(a), the two clusters are irregularly shaped and quite close to each other. AIso, noise is present. In Figure 9.21(b), the two clusters are connected by a bridge, and again, noise is present. Nonetheless, Chameleon identifies what most people would identify as the natural clusters. Chameleon has specifically been shown to be very effective for clustering spa- tial data. Finally, notice that Chameleon does not discard noise points, as do other clustering schemes, but instead assigns them to the clusters. I

622 Chapter 9 Cluster Analysis: Additional Issues and Algorithms

(a) (b)

Figure 9.21. Chameleon applied to cluster a pair of two-dimensional sets of points. @1999, IEEE

Strengths and Limitations

Chameleon can effectively cluster spatial data, even though noise and outliers are present and the clusters are of different shapes, sizes, and density. Cha- meleon assumes that the groups of objects produced by the sparsification and graph partitioning process are subclusters; i.e., that most of the points in a partition belong to the same true cluster. If not, then agglomerative hier- archical clustering will only compound the errors since it can never separate objects that have been wrongly put together. (See the discussion in Section 8.3.4.) Thus, Chameleon has problems when the partitioning process does not produce subclusters, as is often the case for high-dimensional data.

9.4.5 Shared Nearest Neighbor Similarity

In some cases, clustering techniques that rely on standard approaches to sim- ilarity and density do not produce the desired clustering results. This section examines the reasons for this and introduces an indirect approach to similarity that is based on the following principle:

If two points are similar to many of the same points, then they are similar to one another, even if a direct measurement of similarity does not indicate this.

We motivate the discussion by first explaining two problems that an SNN version of similarity addresses: low similarity and differences in density.

Problerns with Tladitional Sirnilarity in High-Dirnensional Data

In high-dimensional spaces) it is not unusual for similarity to be low. Con- sider, for example, a set of documents such as a collection of newspaper articles

Graph-Based Clustering 623

Table 9.3. Similarity among documents in different sections of a newspaper.

9 .4

Section Average Cosine Similarity Entertainment Financial Foreign Metro National Sports AII Sections

0.032 0.030 0.030 0.021 0.027 0.036 0.014

that come from a variety of sections of the newspaper: Entertainment, Finan- cial, Foreign, Metro, National, and Sports. As explained in Chapter 2, these documents can be viewed as vectors in a high-dimensional space) where each component of the vector (attribute) records the number of times that each word in a vocabulary occurs in a document. The cosine similarity measure is often used to assess the similarity between documents. For this example, which comes from a collection of articles from the Los Angeles T’imes, Table 9.3 gives the average cosine similarity in each section and among the entire set of documents.

The similarity of each document to its most similar document (the first nearest neighbor) is better, 0.39 on average. However, a consequence of low similarity among objects of the same class is that their nearest neighbor is often not of the same class. In the collection of documents from which Table 9.3 was generated, about 20% of the documents have a nearest neighbor of a different class. In general, if direct similarity is low, then it becomes an unreliable guide for clustering objects, especially for agglomerative hierarchical clustering, where the closest points are put together and cannot be separated afterward. Nonetheless, it is still usually the case that a large majority of the nearest neighbors of an object belong to the same class; this fact can be used to define a proximity measure that is more suitable for clustering.

Problems with Differences in Density

Another problem relates to differences in densities between clusters. Figure 9.22 shows a pair of two-dimensional clusters of points with differing density. The lower density of the rightmost cluster is reflected in a lower average dis- tance among the points. Even though the points in the less dense cluster form an equally valid cluster, typical clustering techniques will have more difficulty finding such clusters. Also, normal measures of cohesion, such as SSE, will in- dicate that these clusters are less cohesive. To illustrate with a real example.

624 Chapter 9 Cluster Analysis:

Figure 9.22. Two circular clusters of 200 uniformly distributed points.

the stars in a galaxy are no less real clusters of stellar objects than the planets

in a solar system? even though the planets in a solar system are considerably closer to one another on average, than the stars in a galaxy.

SNN Similarity Computation

In both situations, the key idea is to take the context of points into account in defining the similarity measure. This idea can be made quantitative by using a shared nearest neighbor definition of similarity in the manner indicated by Algorithm 9.10. Essentially, the SNN similarity is the number of shared neighbors as long as the two objects are on each other’s nearest neighbor lists. Note that the underlying proximity measure can be any meaningful similarity or dissimilarity measure.

Algorithm 9.10 Computing shared nearest neighbor similarity 1: Find the k-nearest neighbors of all points. 2: if two points, x and y are not among the k-nearest neighbors of each other then 3: si,mi,lari,ty(x, y) – 0 4: else 5: si,rni,lari,ty(x, y) – number of shared neighbors 6: end if

The computation of SNN similarity is described by Algorithm 9.10 and graphically illustrated by Figure 9.23. Each of the two black points has eight nearest neighbors, including each other. Four of those nearest neighbors the points in gray are shared. Thus, the shared nearest neighbor similarity between the two points is 4.

9.4 Graph-Based Clustering 625

+

Figure 9.23. Computation of SNN similarity between two points.

The similarity graph of the SNN similarities among objects is called the SNN similarity graph. Since many pairs of objects will have an SNN simi- larity of 0, this is a very sparse graph.

SNN Similarity versus Direct Similarity

SNN similarity is useful because it addresses some of the problems that occur with direct similarity. First, since it takes into account the context of an object by using the number of shared nearest neighbors, SNN similarity handles the situation in which an object happens to be relatively close to another object, but belongs to a different class. In such cases, the objects typically do not share many near neighbors and their SNN similarity is low.

SNN similarity also addresses problems with clusters of varying density. In a low-density region, the objects are farther apart than objects in denser regions. However, the SNN similarity of a pair of points only depends on the number of nearest neighbors two objects share, not how far these neighbors are from each object. Thus, SNN similarity performs an automatic scaling with respect to the density of the points.

9.4.6 The Jarvis-Patrick Clustering Algorithm

Algorithm 9.11 expresses the Jarvis-Patrick clustering algorithm using the con- cepts of the last section. The JP clustering algorithm replaces the proximity between two points with the SNN similarity, which is calculated as described in Algorithm 9.10. A threshold is then used to sparsify this matrix of SNN similarities. In graph terms, an SNN similarity graph is created and sparsified. Clusters are simply the connected components of the SNN graph.

Algorithm 9.11 Jarvis-Patrick clustering algorithm. 1: Compute the SNN similarity graph. 2: Sparsify the SNN similarity graph by applying a similarity threshold. 3: Find the connected components (clusters) of the sparsified SNN similarity graph.

626 Chapter 9 Cluster Analysis: Additional Issues and Algorithms

‘q?

v v

vvw

“;qw j

notion of SNN similarity, it is good can handle clusters of different sizes,

(a) Original data (b) Clusters found by Jarvis-Patrick

Figure 9.24. Jarvis-Patrick clustering of a two-dimensional point set.

The storage requirements of the JP clustering algorithm are only O(km), since it is not necessary to store the entire similarity matrix, even initially. The basic time complexity of JP clustering is Olm2), since the creation of

the k-nearest neighbor list can require the computation of O(*’) proximities.

However, for certain types of data, such as low-dimensional Euclidean data, special techniques, €.8., D k-d tree, can be used to more efficiently find the

k-nearest neighbors without computing the entire similarity matrix. This can reduce the time complexity frorn Olm2) to O(mlogrn).

Example 9.12 (JP Clustering of a Two-Dimensional Data Set). We applied JP clustering to the “fish” data set shown in Figure 9.2a@) to find

the clusters shown in Figure 9.24(b). The size of the nearest neighbor list was 20, and two points were placed in the same cluster if they shared at least 10 points. The different clusters are shown by the different markers and different shading. The points whose marker is an “x” were classified as noise by Jarvis- Patrick. They are mostly in the transition regions between clusters of different density.

Strengths and Limitations