The strengths and weakness of specific agglomerative hierarchical clustering algorithms were discussed above. More generally, such algorithms are typi- cally used because the underlying application, e.g., creation of a taxonomy, requires a hierarchy. Also, there have been some studies that suggest that these algorithms can produce better-quality clusters. However, agglomerative hierarchical clustering algorithms are expensive in terms of their computa- tional and storage requirements. The fact that all merges are final can also cause trouble for noisy, high-dimensional data, such as document data. In turn, these two problems can be addressed to some degree by first partially clustering the data using another technique, such as K-means.

8.4 DBSCAN

Density-based clustering locates regions of high density that are separated from one another by regions of low density. DBSCAN is a simple and effec- tive density-based clustering algorithm that illustrates a number of important concepts that are important for any density-based clustering approach. In this section, we focus solely on DBSCAN after first considering the key notion of density. Other algorithms for finding density-based clusters are described in the next chapter.

8.4 DBSCAN 527

8.4.1 Tladitional Density: Center-Based Approach

Although there are not as many a,pproaches for defining density as there are for defining similarity, there are several distinct methods. In this section we dis- cuss the center-based approach on which DBSCAN is based. Other definitions of density will be presented in Chapter 9.

In the center-based approach, density is estimated for a particular point in the data set by counting the nurnber of points within a specified radius, .Eps, of that point. This includes the point itself. This technique is graphically illustrated by Figure 8.20. The number of points within a radius of Eps of point A is 7, including A itself.

This method is simple to irnplement, but the density of any point will depend on the specified radius. For instance, if the radius is large enough, then all points will have a density of rn, the number of points in the data set. Likewise, if the radius is too small, then all points will have a density of 1. An approach for deciding on ther appropriate radius for low-dimensional data is given in the next section in the context of our discussion of DBSCAN.

Classification of Points According to Center-Based Density

The center-based approach to density allows us to classify a point as being (1)

in the interior ofa dense region (a core point), (2) on the edge ofa dense region (a border point), or (3) in a sparsely occupied region (a noise or background point). Figure 8.21 graphically illustrates the concepts of core, border, and noise points using a collection of two-dimensional points. The following text provides a more precise description.

Core points: These points are in the interior of a density-based cluster. A point is a core point if the number of points within a given neighborhood around the point as determined by the distance function and a user-

specified distance parametr:r, -Eps, exceeds a certain threshold, Mi’nPts, which is also a user-speci{ied parameter. In Figure 8.21, point A is a

core point, for the indicated radius (Ept) 1f MinPts < 7.

Border points: A border point is not a core point, but falls within the neigh- borhood of a core point. In Figure 8.21, point B is a border point. A

border point can fall within the neighborhoods of several core points.

Noise points: A noise point is any point that is neither a core point nor a

border point. In Figure 8.21, point C is a noise point.

528 Chapter 8 Cluster Analysis: Basic Concepts and Algorithms

core point

Figure 8.20. Center-based density. Figure 8.21. Core, border, and noise points.

8.4.2 The DBSCAN Algorithm

Given the previous definitions of core points, border points, and noise points, the DBSCAN algorithm can be informally described as follows. Any two core points that are close enough-within a distance Eps of one another-are put in the same cluster. Likewise, any border point that is close enough to a core point is put in the same cluster as the core point. (Ties may need to be resolved if a border point is close to core points from different clusters.) Noise points are discarded. The formal details are given in Algorithm 8.4. This algorithm uses the same concepts and finds the same clusters as the original DBSCAN, but is optimized for simplicity, not efficiency.

Algorithm 8.4 DBSCAN algorithm. 1: Label all points as core, border, or noise points. 2: Eliminate noise points, 3: Put an edge between all core points that are within Eps of each other. 4: Make each group of connected core points into a separate cluster. 5: Assign each border point to one of the clusters of its associated core points.

Time and Space Complexity

The basic time complexity of the DBSCAN algorithm is O(m x time to find points in the -Eps-neighborhood), where rn is the number of points. In the worst case, this complexity is O(^’). However, in low-dimensional spaces, there are data structures, such as kd-trees, that allow efficient retrieval of all

8.4 DBSCAN 529

points within a given distance of a specified point, and the time complexity can be as low as O(mlogm). The space requirement of DBSCAN, even for high-dimensional data, is O(m) because it is only necessary to keep a small amount of data for each point, i.e., the cluster label and the identification of each point as a core, border, or noise point.

Selection of DBSCAN Parameters

There is, of course, the issue of how to determine the parameters Eps and MinPts. The basic approach is to look at the behavior of the distance from a point to its kth nearest neighbor, which we will call the k-dist. For points that belong to some cluster, the value of k-dist will be small if k is not larger than the cluster size. Note that there will be some variation, depending on the density of the cluster and the r:r,ndom distribution of points, but on average, the range of variation will not be huge if the cluster densities are not radically different. However, for points that are not in a cluster, such as noise points, the k-dist will be relatively large. Therefore, if we compute the k-dist for all the data points for some ,k, sort them in increasing order, and then plot the sorted values, we expect to see a sharp change at the value of k-dist that corresponds to a suitable value of Eps. lf we select this distance as the Eps parameter and take the value oi k as the MinPts parameter, then points for which ,k-dist is less than -Ops will be labeled as core points, while other points will be labeled as noise or borderr points.

Figure 8.22 shows a sample data set, while the k-dist graph for the data is given in Figure 8.23. The value of Eps that is determined in this way depends on k, but does not change dramatically as k changes. If the value of k is too small, then even a small number of closely spaced points that are noise or outliers will be incorrectly labeled as clusters. If the value of k is too large, then small clusters (of size less than k) are likely to be labeled as noise. The original DBSCAN algorithm used a value of k : 4, which appears to be a reasonable value for most two-dimensional data sets.

Clusters of Varying Density

DBSCAN can have trouble with density if the density of clusters varies widely. Consider Figure 8.24, which shows four clusters embedded in noise. The den- sity of the clusters and noise regions is indicated by their darkness. The noise around the pair of denser clusters, A and B, has the same density as clusters C and D. If the -Eps threshold is low enough that DBSCAN finds C and D as clusters, then A and B and the points surrounding them will become a single

530 Chapter 8 Cluster Analysis: Basic Concepts and Algorithms

Figure 8.22. Sample data.

Poinb Sodd by Disan€ro 4th Neared Neighbor

Figure 8,23. K-dist plot for sample data.

Figure 8.24. Fourclusters embedded in noise.

cluster. If the .Ops threshold is high enough that DBSCAN finds A and B as separate clusters, and the points surrounding them are marked as noise, then C and D and the points surrounding them will also be marked as noise.

An Example

To illustrate the use of DBSCAN, we show the clusters that it finds in the relatively complicated two-dimensional data set shown in Figure 8.22. This data set consists of 3000 two-dimensional points. The Eps threshold for this data was found by plotting the sorted distances ofthe fourth nearest neighbor of each point (Figure 8.23) and identifying the value at which there is a sharp increase. We selected Eps :10, which corresponds to the knee of the curve. The clusters found by DBSCAN using these parameters, i.e., Mi,nPts: 4 and Eps : 10, are shown in Figure 8.25(a). The core points, border points, and noise points are displayed in Figure 8.25(b).

8.4.3 Strengths and Weaknesses

Because DBSCAN uses a density-based definition of a cluster, it is relatively resistant to noise and can handle clusters of arbitrary shapes and sizes. Thus,

E

z 6

z

1 i+-I+.tJtlr, I I

DBSCAN 531

I x x

X

X

X

” ” t

X

X

t r X

€ I

X

i. X

8.4

, j f l a

I

X’i+fX

X

x x X X X

X X

X

x X

X*. X

xr

X

(a) Clusl;ers found by DBSCAN

x – Noise Point + – Border Point

(b) Core, border, and

e – Core Point

points.nolse

Figure 8.25, DBSCAN clustering of 3000 two-dimensional points.

“1

532 Chapter 8 Cluster Analysis: Basic Concepts and Algorithms

DBSCAN can find many clusters that could not be found using K-means, such as those in Figure 8.22. As indicated previously, however, DBSCAN has trouble when the clusters have widely varying densities. It also has trouble with high-dimensional data because density is more difficult to define for such data. One possible approach to dealing with such issues is given in Section 9.4.8. Finally, DBSCAN can be expensive when the computation of nearest neighbors requires computing all pairwise proximities, as is usually the case for high-dimensional data.