CURE takes advantage of certain characteristics of the hierarchical clustering process to eliminate outliers at two different points in the clustering process. First, if a cluster is growing slowly, then this may mean that it consists mostly of outliers, since by definition, outliers are far from others and will not be merged with other points very often. In CURE, this first phase of outlier elimination typically occurs when the number of clusters is 1/3 the original number of points. The second phase of outlier elimination occurs when the number of clusters is on the order of K, the number of desired clusters. At this point, small clusters are again eliminated.

9.5 Scalable Clustering Algorithms 637

Since the worst-case complexity of CURE is O(m² log m), it cannot be applied directly to large data sets. For this reason, CURE uses two techniques to speed up the clustering process. The first technique takes a random sample and performs hierarchical clustering on the sampled data points. This is followed by a final pass that assigns each remaining point in the data set to one of the clusters by choosing the cluster with the closest representative point. We discuss CURE’s sampling approach in more detail later.

In some cases, the sample required for clustering is still too large and a second additional technique is required. In this situation, CURE partitions the sample data and then clusters the points in each partition. This preclustering step is then followed by a clustering of the intermediate clusters and a final pass that assigns each point in the data set to one of the clusters. CURE’s partitioning scheme is also discussed in more detail later.

Algorithm 9.14 summarizes CURE. Note that K is the desired number of clusters, m is the number of points, p is the number of partitions, and q is the desired reduction of points in a partition, i.e., the number of clusters in a partition is m/(pq). Therefore, the total number of clusters is m/q. For example, if m = 10,000, p = 10, and q = 100, then each partition contains 10,000/10 = 1000 points, there would be 1000/100 = 10 clusters in each partition, and 10,000/100 = 100 clusters overall.
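These counts follow directly from the definitions of p and q; a quick check of the arithmetic:

```python
m, p, q = 10_000, 10, 100             # points, partitions, reduction per partition

points_per_partition = m // p          # 10,000 / 10
clusters_per_partition = m // (p * q)  # 10,000 / (10 * 100)
total_clusters = m // q                # 10,000 / 100

print(points_per_partition, clusters_per_partition, total_clusters)  # -> 1000 10 100
```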

Algorithm 9.14 CURE.
1: Draw a random sample from the data set. The CURE paper is notable for explicitly deriving a formula for what the size of this sample should be in order to guarantee, with high probability, that all clusters are represented by a minimum number of points.
2: Partition the sample into p equal-sized partitions.
3: Cluster the points in each partition into m/(pq) clusters using CURE’s hierarchical clustering algorithm to obtain a total of m/q clusters. Note that some outlier elimination occurs during this process.
4: Use CURE’s hierarchical clustering algorithm to cluster the m/q clusters found in the previous step until only K clusters remain.
5: Eliminate outliers. This is the second phase of outlier elimination.
6: Assign all remaining data points to the nearest cluster to obtain a complete clustering.
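To make the control flow of Algorithm 9.14 concrete, here is a minimal sketch of the pipeline. It substitutes a plain centroid-distance agglomerative merge for CURE’s representative-point hierarchical clustering and omits both outlier-elimination phases; all function names are ours, not from the CURE paper.

```python
import math
import random

def centroid(cluster):
    """Mean point of a list of equal-length coordinate tuples."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def agglomerate(points, k):
    """Repeatedly merge the two clusters with the closest centroids until
    only k clusters remain (a stand-in for CURE's hierarchical step)."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        pairs = ((math.dist(centroid(clusters[i]), centroid(clusters[j])), i, j)
                 for i in range(len(clusters)) for j in range(i + 1, len(clusters)))
        _, i, j = min(pairs)
        clusters[i] += clusters.pop(j)       # j > i, so index i stays valid
    return clusters

def cure_sketch(data, K, sample_size, p, q):
    sample = random.sample(data, sample_size)                    # step 1: sample
    size = len(sample) // p
    partitions = [sample[i*size:(i+1)*size] for i in range(p)]   # step 2: partition
    intermediate = []                                            # step 3: precluster
    for part in partitions:
        intermediate += agglomerate(part, max(1, len(part) // q))
    # step 4: cluster the intermediate clusters (here: their centroids) down to K
    merged = agglomerate([centroid(c) for c in intermediate], K)
    centers = [centroid(c) for c in merged]
    # step 6: assign every point in the full data set to its nearest center
    labels = [min(range(K), key=lambda c: math.dist(x, centers[c])) for x in data]
    return centers, labels

# toy usage: two well-separated blobs
random.seed(0)
data = ([(random.gauss(0, 0.5), random.gauss(0, 0.5)) for _ in range(20)]
        + [(random.gauss(10, 0.5), random.gauss(10, 0.5)) for _ in range(20)])
centers, labels = cure_sketch(data, K=2, sample_size=20, p=2, q=5)
print(sorted(set(labels[:20])), sorted(set(labels[20:])))
```

The slicing assumes sample_size is divisible by p; real CURE also keeps several shrunken representative points per cluster rather than a single centroid, which is what lets it find non-globular clusters.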

638 Chapter 9 Cluster Analysis: Additional Issues and Algorithms

Sampling in CURE

A key issue in using sampling is whether the sample is representative, that is, whether it captures the characteristics of interest. For clustering, the issue is whether we can find the same clusters in the sample as in the entire set of objects. Ideally, we would like the sample to contain some objects for each cluster and for there to be a separate cluster in the sample for those objects that belong to separate clusters in the entire data set.

A more concrete and attainable goal is to guarantee (with a high probability) that we have at least some points from each cluster. The number of points required for such a sample varies from one data set to another and depends on the number of objects and the sizes of the clusters. The creators of CURE derived a bound for the sample size that would be needed to ensure (with high probability) that we obtain at least a certain number of points from a cluster. Using the notation of this book, this bound is given by the following theorem.

Theorem 9.1. Let f be a fraction, 0 ≤ f ≤ 1. For cluster C_i of size m_i, we will obtain at least f · m_i objects from cluster C_i with a probability of 1 − δ, 0 ≤ δ ≤ 1, if our sample size s is given by the following:

$$ s = fm + \frac{m}{m_i}\log\frac{1}{\delta} + \frac{m}{m_i}\sqrt{\left(\log\frac{1}{\delta}\right)^2 + 2\,f\,m_i\log\frac{1}{\delta}} \qquad (9.19) $$

where m is the number of objects.

While this expression may look intimidating, it is reasonably easy to use. Suppose that there are 100,000 objects and that the goal is to have an 80% chance of obtaining 10% of the objects in cluster C_i, which has a size of 1000. In this case, f = 0.1, δ = 0.2, m = 100,000, m_i = 1000, and thus s = 11,962. If the goal is a 5% sample of C_i, which is 50 objects, then a sample size of 6440 will suffice.
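The bound is easy to evaluate directly; a small helper (the function name is ours, not from the CURE paper) reproduces both figures above:

```python
import math

def cure_sample_size(f, delta, m, m_i):
    """Sample size s from Eq. 9.19: enough points so that, with probability
    1 - delta, the sample holds at least f * m_i points of cluster C_i."""
    t = math.log(1.0 / delta)
    return f * m + (m / m_i) * t + (m / m_i) * math.sqrt(t * t + 2 * f * m_i * t)

# 80% chance (delta = 0.2) of drawing 10% of a 1000-point cluster from 100,000 objects
print(round(cure_sample_size(0.10, 0.2, 100_000, 1_000)))  # -> 11962
# ... and 5% of the same cluster
print(round(cure_sample_size(0.05, 0.2, 100_000, 1_000)))  # -> 6440
```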

Again, CURE uses sampling in the following way. First a sample is drawn, and then CURE is used to cluster this sample. After clusters have been found, each unclustered point is assigned to the closest cluster.

Partitioning

When sampling is not enough, CURE also uses a partitioning approach. The idea is to divide the points into p groups of size m/p and to use CURE to cluster each partition in order to reduce the number of objects by a factor of q > 1, where q can be roughly thought of as the average size of a cluster in a partition.


Overall, m/q clusters are produced. (Note that since CURE represents each cluster by a number of representative points, the reduction in the number of objects is not pq.) This preclustering step is then followed by a final clustering of the m/(pq) intermediate clusters to produce the desired number of clusters (K). Both clustering passes use CURE’s hierarchical clustering algorithm and are followed by a final pass that assigns each point in the data set to one of the clusters.

The key issue is how p and q should be chosen. Algorithms such as CURE have a time complexity of O(m²) or higher, and furthermore, require that all the data be in main memory. We therefore want to choose p small enough so that an entire partition can be processed in main memory and in a “reasonable” amount of time. At the current time, a typical desktop computer can perform a hierarchical clustering of a few thousand objects in a few seconds.

Another factor for choosing p, and also q, concerns the quality of the clustering. Specifically, the objective is to choose the values of p and q such that objects from the same underlying cluster end up in the same clusters eventually. To illustrate, suppose there are 1000 objects and a cluster of size 100. If we randomly generate 100 partitions, then each partition will, on average, have only one point from our cluster. These points will likely be put in clusters with points from other clusters or will be discarded as outliers. If we generate only 10 partitions of 100 objects, but q is 50, then the 10 points from each cluster (on average) will likely still be combined with points from other clusters, since there are only (on average) 10 points per cluster and we need to produce, for each partition, two clusters. To avoid this last problem, which concerns the proper choice of q, a suggested strategy is not to combine clusters if they are too dissimilar.
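The arithmetic behind these two scenarios is easy to check (the text fixes no q for the 100-partition case, so only p enters there):

```python
m, cluster_size = 1000, 100     # total objects; size of the cluster of interest

# 100 partitions: each holds m/p = 10 points, so on average only
# cluster_size/p = 1 point of our cluster lands in any given partition.
p = 100
print(m // p, cluster_size // p)                 # -> 10 1

# 10 partitions with q = 50: each holds 100 points, ~10 from our cluster,
# and must be reduced to m/(p*q) = 2 clusters.
p, q = 10, 50
print(m // p, cluster_size // p, m // (p * q))   # -> 100 10 2
```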

9.6 Which Clustering Algorithm?

A variety of factors need to be considered when deciding which type of clustering technique to use. Many, if not all, of these factors have been discussed to some extent in the current and previous chapters. Our goal in this section is to succinctly summarize these factors in a way that sheds some light on which clustering algorithm might be appropriate for a particular clustering task.

Type of Clustering One important factor in making sure that the type of clustering matches the intended use is the type of clustering produced by the algorithm. For some applications, such as creating a biological taxonomy, a hierarchy is preferred. In the case of clustering for summarization, a partitional clustering is typical. In yet other applications, both may prove useful.

Most clustering applications require a clustering of all (or almost all) of the objects. For instance, if clustering is used to organize a set of documents for browsing, then we would like most documents to belong to a group. However, if we wanted to find the strongest themes in a set of documents, then we might prefer to have a clustering scheme that produces only very cohesive clusters, even if many documents were left unclustered.

Finally, most applications of clustering assume that each object is assigned to one cluster (or one cluster on a level for hierarchical schemes). As we have seen, however, probabilistic and fuzzy schemes provide weights that indicate the degree or probability of membership in various clusters. Other techniques, such as DBSCAN and SNN density-based clustering, have the notion of core points, which strongly belong to one cluster. Such concepts may be useful in certain applications.

Type of Cluster Another key aspect is whether the type of cluster matches the intended application. There are three commonly encountered types of clusters: prototype-, graph-, and density-based. Prototype-based clustering schemes, as well as some graph-based clustering schemes (complete link, centroid, and Ward’s), tend to produce globular clusters in which each object is sufficiently close to the cluster’s prototype or to the other objects in the cluster. If, for example, we want to summarize the data to reduce its size and we want to do so with the minimum amount of error, then one of these types of techniques would be most appropriate. In contrast, density-based clustering techniques, as well as some graph-based clustering techniques, such as single link, tend to produce clusters that are not globular and thus contain many objects that are not very similar to one another. If clustering is used to segment a geographical area into contiguous regions based on the type of land cover, then one of these techniques is more suitable than a prototype-based scheme such as K-means.

Characteristics of Clusters Besides the general type of cluster, other cluster characteristics are important. If we want to find clusters in subspaces of the original data space, then we must choose an algorithm such as CLIQUE, which explicitly looks for such clusters. Similarly, if we are interested in enforcing spatial relationships between clusters, then SOM or some related approach would be appropriate. Also, clustering algorithms differ widely in their ability to handle clusters of varying shapes, sizes, and densities.
