Graph-Based Clustering 627

Because JP clustering is based on SNN similarity, it is good at dealing with noise and outliers and can handle clusters of different sizes, shapes, and densities. The algorithm works well for high-dimensional data and is particularly good at finding tight clusters of strongly related objects.

However, JP clustering defines a cluster as a connected component in the

SNN similarity graph. Thus, whether a set of objects is split into two clusters or left as one may depend on a single link. Hence, JP clustering is somewhat brittle; i.e., it may split true clusters or join clusters that should be kept separate.

Another potential limitation is that not all objects are clustered. However, these objects can be added to existing clusters, and in some cases there is no requirement for a complete clustering. JP clustering has a basic time complexity of O(m²), which is the time required to compute the nearest neighbor list for a set of objects in the general case. In certain cases, e.g., low-dimensional data, special techniques can be used to reduce the time complexity for finding nearest neighbors to O(m log m). Finally, as with other clustering algorithms, choosing the best values for the parameters can be challenging.
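The O(m²) cost comes from computing the k-nearest-neighbor list of every object against every other object. As a minimal sketch of that step (the function name and toy points are our own, not from the text), in numpy:

```python
import numpy as np

def knn_lists(X, k):
    """Brute-force k-nearest-neighbor lists, the O(m^2) step of JP clustering.

    For low-dimensional data, a spatial index such as a k-d tree can reduce
    this to O(m log m), as the text notes."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)  # all pairwise squared distances
    np.fill_diagonal(d2, np.inf)           # a point is not its own neighbor
    return np.argsort(d2, axis=1)[:, :k]   # indices of the k closest points per row

# two tight groups: each point's nearest neighbors come from its own group
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0]])
print(knn_lists(X, 2))
```

The (m, m, …) broadcast makes the quadratic cost explicit: both memory and time grow as m², which is why the scalability techniques of Section 9.5 matter for large data sets.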

9.4.7 SNN Density

As discussed in the introduction to this chapter, traditional Euclidean density becomes meaningless in high dimensions. This is true whether we take a grid-based view, such as that used by CLIQUE; a center-based view, such as that used by DBSCAN; or a kernel-density estimation approach, such as that used by DENCLUE. It is possible to use the center-based definition of density with a similarity measure that works well for high dimensions, e.g., cosine or Jaccard, but as described in Section 9.4.5, such measures still have problems. However, since the SNN similarity measure reflects the local configuration of the points in the data space, it is relatively insensitive to variations in density and the dimensionality of the space, and is a promising candidate for a new measure of density.

This section explains how to define a concept of SNN density by using SNN similarity and following the DBSCAN approach described in Section 8.4. For clarity, the definitions of that section are repeated, with appropriate modification to account for the fact that we are using SNN similarity.

Core points. A point is a core point if the number of points within a given neighborhood around the point, as determined by SNN similarity and a supplied parameter Eps, exceeds a certain threshold MinPts, which is also a supplied parameter.


628 Chapter 9 Cluster Analysis: Additional Issues and Algorithms

Figure 9.25. SNN density of two-dimensional points: (a) all points; (b) high SNN density; (c) medium SNN density; (d) low SNN density.

Border points. A border point is a point that is not a core point, i.e., there are not enough points in its neighborhood for it to be a core point, but it falls within the neighborhood of a core point.

Noise points. A noise point is any point that is neither a core point nor a border point.
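The three definitions above can be sketched directly in code. The following is a minimal numpy reading of them (the function name, the SNN-similarity construction, and the toy data set are our own choices, not the book's): SNN similarity counts shared nearest neighbors for pairs that appear in each other's kNN lists, SNN density counts neighbors whose similarity is at least Eps, and points are then labeled core, border, or noise.

```python
import numpy as np

def snn_core_border_noise(X, k, eps, min_pts):
    """Label each point as core, border, or noise using SNN density (a sketch)."""
    m = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)
    knn = np.argsort(d2, axis=1)[:, :k]        # k-nearest-neighbor lists
    nbr = np.zeros((m, m), dtype=bool)
    nbr[np.arange(m)[:, None], knn] = True     # nbr[i, j]: j is in kNN(i)

    # SNN similarity: number of shared neighbors, kept only for pairs that
    # appear in each other's kNN lists (the Jarvis-Patrick convention)
    shared = (nbr[:, None, :] & nbr[None, :, :]).sum(axis=-1)
    snn = np.where(nbr & nbr.T, shared, 0)

    density = (snn >= eps).sum(axis=1)              # points with SNN similarity >= Eps
    core = density >= min_pts                       # enough similar points nearby
    border = ~core & (snn[:, core] >= eps).any(axis=1)  # near a core point
    noise = ~core & ~border                         # neither
    return core, border, noise

# five tightly knit points and one distant outlier
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [0.5, 0.5], [50, 50]], float)
core, border, noise = snn_core_border_noise(X, k=4, eps=3, min_pts=4)
print(np.where(core)[0], np.where(noise)[0])
```

With k=4, every point in the tight group has the other four as its nearest neighbors, so each pair shares three neighbors; the outlier appears in nobody's list, so its SNN density is zero and it is labeled noise.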

SNN density measures the degree to which a point is surrounded by similar points (with respect to nearest neighbors). Thus, points in regions of high and low density will typically have relatively high SNN density, while points in regions where there is a transition from low to high density (points that are between clusters) will tend to have low SNN density. Such an approach may be better suited for data sets in which there are wide variations in density, but clusters of low density are still interesting.

Example 9.13 (Core, Border, and Noise Points). To make the preceding discussion of SNN density more concrete, we provide an example of how SNN density can be used to find core points and remove noise and outliers. There are 10,000 points in the 2D point data set shown in Figure 9.25(a). Figures 9.25(b-d) distinguish between these points based on their SNN density. Figure 9.25(b) shows the points with the highest SNN density, while Figure 9.25(c) shows points of intermediate SNN density, and Figure 9.25(d) shows the points with the lowest SNN density. From these figures, we see that the points that have high density (i.e., high connectivity in the SNN graph) are candidates for being representative or core points since they tend to be located well inside the cluster, while the points that have low connectivity are candidates for being noise points and outliers, as they are mostly in the regions surrounding the clusters.


9.4.8 SNN Density-Based Clustering

The SNN density defined above can be combined with the DBSCAN algorithm to create a new clustering algorithm. This algorithm is similar to the JP clustering algorithm in that it starts with the SNN similarity graph. However, instead of using a threshold to sparsify the SNN similarity graph and then taking connected components as clusters, the SNN density-based clustering algorithm simply applies DBSCAN.

The SNN Density-based Clustering Algorithm

The steps of the SNN density-based clustering algorithm are shown in Algorithm 9.12.

Algorithm 9.12 SNN density-based clustering algorithm.


1: Compute the SNN similarity graph.
2: Apply DBSCAN with user-specified parameters for Eps and MinPts.

The algorithm automatically determines the number of clusters in the data. Note that not all the points are clustered. The points that are discarded include noise and outliers, as well as points that are not strongly connected to a group of points. SNN density-based clustering finds clusters in which the points are strongly related to one another. Depending on the application, we might want to discard many of the points. For example, SNN density-based clustering is good for finding topics in groups of documents.
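A compact sketch of the two steps of Algorithm 9.12, assuming scikit-learn is available: step 1 builds the SNN similarity graph, and step 2 runs DBSCAN on it, here by converting similarity into a precomputed distance (k minus the shared-neighbor count). The conversion and the toy data are our own choices, not the book's.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def snn_distances(X, k):
    """Step 1: SNN similarity graph, returned as a distance matrix
    (distance = k - shared neighbors, so similar points are close)."""
    m = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)
    knn = np.argsort(d2, axis=1)[:, :k]
    nbr = np.zeros((m, m), dtype=bool)
    nbr[np.arange(m)[:, None], knn] = True
    shared = (nbr[:, None, :] & nbr[None, :, :]).sum(axis=-1)
    snn = np.where(nbr & nbr.T, shared, 0)   # require mutual kNN membership
    D = (k - snn).astype(float)
    np.fill_diagonal(D, 0.0)                 # precomputed metric needs zero self-distance
    return D

# two well-separated six-point groups plus one outlier
clump = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [0.5, 0.5], [0.5, 1.5]], float)
X = np.vstack([clump, clump + 10.0, [[100.0, 100.0]]])

# Step 2: DBSCAN with user-specified Eps and MinPts on the SNN graph
labels = DBSCAN(eps=1.0, min_samples=5,
                metric="precomputed").fit_predict(snn_distances(X, k=5))
print(labels)  # one label per group; the outlier is labeled -1 (noise)
```

As the text says, the number of clusters comes out automatically, and points that are not strongly connected to any group (here, the outlier) are discarded as noise.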

Example 9.14 (SNN Density-based Clustering of Time Series). The SNN density-based clustering algorithm presented in this section is more flexible than Jarvis-Patrick clustering or DBSCAN. Unlike DBSCAN, it can be used for high-dimensional data and situations in which the clusters have different densities. Unlike Jarvis-Patrick, which performs a simple thresholding and then takes the connected components as clusters, SNN density-based clustering uses a less brittle approach that relies on the concepts of SNN density and core points.

To demonstrate the capabilities of SNN density-based clustering on high-dimensional data, we applied it to monthly time series data of atmospheric pressure at various points on the Earth. More specifically, the data consists of the average monthly sea-level pressure (SLP) for a period of 41 years at each point on a 2.5° longitude-latitude grid. The SNN density-based clustering algorithm found the clusters (gray regions) indicated in Figure 9.26. Note


that these are clusters of time series of length 492 months, even though they are visualized as two-dimensional regions. The white areas are regions in which the pressure was not as uniform. The clusters near the poles are elongated because of the distortion of mapping a spherical surface to a rectangle.

Using SLP, Earth scientists have defined time series, called climate indices, that are useful for capturing the behavior of phenomena involving the Earth's climate. For example, anomalies in climate indices are related to abnormally low or high precipitation or temperature in various parts of the world. Some of the clusters found by SNN density-based clustering have a strong connection to some of the climate indices known to Earth scientists.

Figure 9.27 shows the SNN density structure of the data from which the clusters were extracted. The density has been normalized to be on a scale between 0 and 1. The density of a time series may seem like an unusual concept, but it measures the degree to which the time series and its nearest neighbors have the same nearest neighbors. Since each time series is associated with a location, it is possible to plot these densities on a two-dimensional plot. Because of temporal autocorrelation, these densities form meaningful patterns, e.g., it is possible to visually identify the clusters of Figure 9.27.

Strengths and Limitations

The strengths and limitations of SNN density-based clustering are similar to those of JP clustering. However, the use of core points and SNN density adds considerable power and flexibility to this approach.

9.5 Scalable Clustering Algorithms

Even the best clustering algorithm is of little value if it takes an unacceptably long time to execute or requires too much memory. This section examines clustering techniques that place significant emphasis on scalability to the very large data sets that are becoming increasingly common. We start by discussing some general strategies for scalability, including approaches for reducing the number of proximity calculations, sampling the data, partitioning the data, and clustering a summarized representation of the data. We then discuss two specific examples of scalable clustering algorithms: CURE and BIRCH.

9.5.1 Scalability: General Issues and Approaches

The amount of storage required for many clustering algorithms is more than linear; e.g., with hierarchical clustering, memory requirements are usually

Figure 9.26. Clusters of pressure time series found using SNN density-based clustering. [map of latitude (-90 to 90) vs. longitude (-180 to 180)]

Figure 9.27. SNN density of pressure time series. [map of latitude vs. longitude, shaded by normalized density]

O(m²), where m is the number of objects. For 10,000,000 objects, for example, the amount of memory required is proportional to 10^14, a number still well beyond the capacities of current systems. Note that because of the requirement for random data access, many clustering algorithms cannot easily be modified to efficiently use secondary storage (disk), for which random data access is slow. Likewise, the amount of computation required for some clustering algorithms is more than linear. In the remainder of this section, we discuss a variety of techniques for reducing the amount of computation and storage required by a clustering algorithm. CURE and BIRCH use some of these techniques.

Multidimensional or Spatial Access Methods Many techniques, such as K-means, Jarvis-Patrick clustering, and DBSCAN, need to find the closest centroid, the nearest neighbors of a point, or all points within a specified distance. It is possible to use special techniques called multidimensional or spatial access methods to more efficiently perform these tasks, at least for low-dimensional data. These techniques, such as the k-d tree or R*-tree, typically produce a hierarchical partition of the data space that can be used to reduce the time required to find the nearest neighbors of a point. Note that grid-based clustering schemes also partition the data space.
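As a small illustration of a spatial access method (assuming SciPy's `cKDTree`; the random data is purely illustrative), a k-d tree answers both query types the paragraph mentions:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(42)
X = rng.random((1000, 2))     # low-dimensional data, where k-d trees work well
tree = cKDTree(X)             # builds a hierarchical partition of the data space

# nearest-neighbor query: point 0 itself plus its 3 nearest neighbors
dists, idxs = tree.query(X[0], k=4)

# range query: all points within distance 0.05 of point 0
within = tree.query_ball_point(X[0], r=0.05)
print(idxs, len(within))
```

Each query descends the tree and prunes whole regions that cannot contain an answer, which is what reduces the per-query cost from linear toward logarithmic. In high dimensions this pruning becomes ineffective, which is why the text restricts the benefit to low-dimensional data.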