Bounds on Proximities Another approach to avoiding proximity computations is to use bounds on proximities. For instance, when using Euclidean distance, it is possible to use the triangle inequality to avoid many distance calculations. To illustrate, at each stage of traditional K-means, it is necessary to evaluate whether a point should stay in its current cluster or be moved to a new cluster. If we know the distance between the centroids and the distance of a point to the (newly updated) centroid of the cluster to which it currently belongs, then we may be able to use the triangle inequality to avoid computing the distance of the point to any of the other centroids. See Exercise 21.
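As a concrete sketch of this idea (the function and its arguments are illustrative, not part of the standard K-means formulation): if the distance from the current centroid to another centroid is at least twice the point's distance to the current centroid, the triangle inequality guarantees the other centroid cannot be closer, so its distance need not be computed.

```python
import math

def closest_centroid(point, centroids, current_idx):
    """Find the centroid closest to `point`, using the triangle inequality
    to skip provably unnecessary distance computations:
    if d(c_cur, c_j) >= 2 * d(x, c_cur), then d(x, c_j) >= d(x, c_cur),
    so centroid j cannot be closer than the current one."""
    def dist(a, b):
        return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

    d_cur = dist(point, centroids[current_idx])
    best_idx, best_dist = current_idx, d_cur
    for j, c in enumerate(centroids):
        if j == current_idx:
            continue
        # Prune using the centroid-to-centroid distance bound.
        if dist(centroids[current_idx], c) >= 2 * d_cur:
            continue  # c cannot be closer than the current centroid
        d = dist(point, c)
        if d < best_dist:
            best_idx, best_dist = j, d
    return best_idx
```

In practice the centroid-to-centroid distances would be computed once per iteration and reused for every point, which is where the savings come from.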

Sampling Another approach to reducing the time complexity is to sample. In this approach, a sample of points is taken, these points are clustered, and then the remaining points are assigned to the existing clusters, typically to the closest cluster. If the number of points sampled is √m, then the time complexity of an O(m²) algorithm is reduced to O(m). A key problem with sampling, though, is that small clusters can be lost. When we discuss CURE, we will provide a technique for investigating how frequently such problems occur.
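The two-phase structure can be sketched as follows (`cluster_fn` is a hypothetical stand-in for any expensive clustering routine that returns centroids; it is not a specific algorithm from the text):

```python
import math
import random

def sample_and_cluster(points, k, cluster_fn):
    """Cluster a sqrt(m)-size sample, then assign every point to the
    nearest resulting centroid.  If `cluster_fn` is O(n^2), running it on
    sqrt(m) points costs O(m); the final assignment pass costs O(m * k)."""
    m = len(points)
    sample = random.sample(points, max(k, math.isqrt(m)))
    centroids = cluster_fn(sample, k)

    def nearest(p):
        return min(range(k),
                   key=lambda i: sum((a - b) ** 2
                                     for a, b in zip(p, centroids[i])))

    return centroids, [nearest(p) for p in points]
```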

Partitioning the Data Objects Another common approach to reducing time complexity is to use some efficient technique to partition the data into disjoint sets and then cluster these sets separately. The final set of clusters either is the union of these separate sets of clusters or is obtained by combining and/or refining the separate sets of clusters. We only discuss bisecting K-means (Section 8.2.3) in this section, although many other approaches based on partitioning are possible. One such approach will be described when we describe CURE later in this section.

If K-means is used to find K clusters, then the distance of each point to each cluster centroid is calculated at each iteration. When K is large, this can be very expensive. Bisecting K-means starts with the entire set of points and uses K-means to repeatedly bisect an existing cluster until we have obtained K clusters. At each step, the distance of points to two cluster centroids is computed. Except for the first step, in which the cluster being bisected consists of all the points, we only compute the distance of a subset of points to the two centroids being considered. Because of this fact, bisecting K-means can run significantly faster than regular K-means.
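The control flow of bisecting K-means can be sketched as follows (`two_means` stands in for a call to ordinary K-means with K = 2; here it may be any routine that splits one cluster's points into two lists, and the choice of which cluster to bisect is simply the largest one, though other criteria such as highest SSE are possible):

```python
def bisecting_kmeans(points, k, two_means):
    """Repeatedly bisect an existing cluster with 2-means until k clusters
    exist.  After the first step, only the points of the cluster being
    bisected have their distances to the two candidate centroids computed."""
    clusters = [list(points)]
    while len(clusters) < k:
        victim = max(clusters, key=len)    # pick the cluster to bisect
        clusters.remove(victim)
        clusters.extend(two_means(victim))  # split it into two clusters
    return clusters
```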

Summarization Another approach to clustering is to summarize the data, typically in a single pass, and then cluster the summarized data. In particular, the leader algorithm (see Exercise 12 on page 562) either puts a data object in the closest cluster (if that cluster is sufficiently close) or starts a new cluster that contains the current object. This algorithm is linear in the number of objects and can be used to summarize the data so that other clustering techniques can be used. The BIRCH algorithm uses a similar concept.
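A minimal sketch of the leader algorithm described above (function name and parameters are illustrative):

```python
def leader_clustering(points, threshold, dist):
    """One-pass leader algorithm: assign each object to the first 'leader'
    within `threshold`, otherwise start a new cluster with the object as
    its leader.  Linear in the number of objects (times the number of
    leaders found)."""
    leaders, assignments = [], []
    for p in points:
        for i, leader in enumerate(leaders):
            if dist(p, leader) <= threshold:
                assignments.append(i)
                break
        else:
            leaders.append(p)  # p becomes the leader of a new cluster
            assignments.append(len(leaders) - 1)
    return leaders, assignments
```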

Parallel and Distributed Computation If it is not possible to take advantage of the techniques described earlier, or if these approaches do not yield the desired accuracy or reduction in computation time, then other approaches are needed. A highly effective approach is to distribute the computation among multiple processors.

9.5.2 BIRCH

BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is a highly efficient clustering technique for data in Euclidean vector spaces, i.e., data for which averages make sense. BIRCH can efficiently cluster such data with one pass and can improve that clustering with additional passes. BIRCH can also deal effectively with outliers.

BIRCH is based on the notion of a Clustering Feature (CF) and a CF tree. The idea is that a cluster of data points (vectors) can be represented by a triple of numbers (N, LS, SS), where N is the number of points in the cluster, LS is the linear sum of the points, and SS is the sum of squares of the points. These are common statistical quantities that can be updated incrementally and that can be used to compute a number of important quantities, such as the centroid of a cluster and its variance (standard deviation). The variance is used as a measure of the diameter of a cluster.
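The incremental bookkeeping can be sketched as follows (a simplified illustration, not BIRCH's actual implementation):

```python
class CF:
    """Clustering feature (N, LS, SS) for d-dimensional points, as defined
    in the text: N is the count, LS the per-attribute linear sum, and SS
    the sum of squared norms.  All updates are incremental."""
    def __init__(self, dim):
        self.n = 0
        self.ls = [0.0] * dim  # linear sum, one entry per attribute
        self.ss = 0.0          # sum of squares of the points

    def add(self, point):
        self.n += 1
        for i, x in enumerate(point):
            self.ls[i] += x
        self.ss += sum(x * x for x in point)

    def centroid(self):
        return [s / self.n for s in self.ls]

    def variance(self):
        # E[||x||^2] - ||E[x]||^2, used by BIRCH as a diameter measure
        c = self.centroid()
        return self.ss / self.n - sum(x * x for x in c)
```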

These quantities can also be used to compute the distance between clusters. The simplest approach is to calculate an L1 (city block) or L2 (Euclidean) distance between centroids. We can also use the diameter (variance) of the merged cluster as a distance. A number of different distance measures for clusters are defined by BIRCH, but all can be computed using the summary statistics.
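Two consequences of the summary representation are worth spelling out: merging two clusters requires only adding their CF triples, and the simplest inter-cluster distance needs nothing beyond the triples either. A sketch, with CFs written as plain tuples for brevity (illustrative helpers, not BIRCH's API):

```python
import math

def merge_cf(cf1, cf2):
    """The CF of a merged cluster is the component-wise sum of the two
    (N, LS, SS) triples -- the original points are never needed."""
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return (n1 + n2, [a + b for a, b in zip(ls1, ls2)], ss1 + ss2)

def centroid_distance(cf1, cf2):
    """Simplest inter-cluster distance: L2 distance between centroids,
    both computed from the summary statistics alone."""
    c1 = [s / cf1[0] for s in cf1[1]]
    c2 = [s / cf2[0] for s in cf2[1]]
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))
```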

A CF tree is a height-balanced tree. Each interior node has entries of the form [CFi, childi], where childi is a pointer to the ith child node. The space that each entry takes and the page size determine the number of entries in an interior node. The space of each entry is, in turn, determined by the number of attributes of each point.

Leaf nodes consist of a sequence of clustering features, CFi, where each clustering feature represents a number of points that have been previously scanned. Leaf nodes are subject to the restriction that each leaf node must have a diameter that is less than a parameterized threshold, T. The space that each entry takes, together with the page size, determines the number of entries in a leaf.

By adjusting the threshold parameter T, the height of the tree can be controlled. T controls the fineness of the clustering, i.e., the extent to which the data in the original data set is reduced. The goal is to keep the CF tree in main memory by adjusting the T parameter as necessary.

A CF tree is built as the data is scanned. As each data point is encountered, the CF tree is traversed, starting from the root and choosing the closest node at each level. When the closest leaf cluster for the current data point is finally identified, a test is performed to see if adding the data item to the candidate cluster will result in a new cluster with a diameter greater than the given threshold, T. If not, then the data point is added to the candidate cluster by updating the CF information. The cluster information for all nodes from the leaf to the root is also updated.

If the new cluster has a diameter greater than T, then a new entry is created if the leaf node is not full. Otherwise the leaf node must be split. The two entries (clusters) that are farthest apart are selected as seeds and the remaining entries are distributed to one of the two new leaf nodes, based on which leaf node contains the closest seed cluster. Once the leaf node has been split, the parent node is updated and split if necessary; i.e., if the parent node is full. This process may continue all the way to the root node.
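The seed-based redistribution at the heart of the split can be sketched as follows (a simplified stand-alone helper, assuming at least two entries and a caller-supplied distance between entries):

```python
def split_leaf(entries, dist):
    """Sketch of BIRCH's leaf split: the two entries farthest apart become
    seeds, and every other entry joins the new leaf whose seed is closer."""
    pairs = [(i, j) for i in range(len(entries))
             for j in range(i + 1, len(entries))]
    # Seeds: the pair of entries at maximum distance.
    i, j = max(pairs, key=lambda p: dist(entries[p[0]], entries[p[1]]))
    leaf_a, leaf_b = [entries[i]], [entries[j]]
    for k, e in enumerate(entries):
        if k in (i, j):
            continue
        (leaf_a if dist(e, entries[i]) <= dist(e, entries[j])
         else leaf_b).append(e)
    return leaf_a, leaf_b
```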

BIRCH follows each split with a merge step. At the interior node where the split stops, the two closest entries are found. If these entries do not correspond to the two entries that just resulted from the split, then an attempt is made to merge these entries and their corresponding child nodes. This step is intended to increase space utilization and avoid problems with skewed data input order.

BIRCH also has a procedure for removing outliers. When the tree needs to be rebuilt because it has run out of memory, then outliers can optionally be written to disk. (An outlier is defined to be a node that has far fewer data points than average.) At certain points in the process, outliers are scanned to see if they can be absorbed back into the tree without causing the tree to grow in size. If so, they are reabsorbed. If not, they are deleted.

BIRCH consists of a number of phases beyond the initial creation of the CF tree. All the phases of BIRCH are described briefly in Algorithm 9.13.

9.5.3 CURE

CURE (Clustering Using REpresentatives) is a clustering algorithm that uses a variety of different techniques to create an approach that can handle large data sets, outliers, and clusters with non-spherical shapes and non-uniform sizes. CURE represents a cluster by using multiple representative points from the cluster. These points will, in theory, capture the geometry and shape of the cluster. The first representative point is chosen to be the point farthest from the center of the cluster, while the remaining points are chosen so that they are farthest from all the previously chosen points. In this way, the representative points are naturally relatively well distributed. The number of points chosen is a parameter, but it was found that a value of 10 or more worked well.
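The farthest-first selection just described can be sketched as follows (the function name and arguments are illustrative):

```python
def pick_representatives(points, center, count, dist):
    """Farthest-first selection of CURE representative points: the first
    is the point farthest from the cluster center; each subsequent one is
    the point farthest from all representatives chosen so far."""
    reps = [max(points, key=lambda p: dist(p, center))]
    while len(reps) < min(count, len(points)):
        candidates = [p for p in points if p not in reps]
        # Farthest from the chosen set = largest minimum distance to it.
        reps.append(max(candidates,
                        key=lambda p: min(dist(p, r) for r in reps)))
    return reps
```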

Once the representative points are chosen, they are shrunk toward the center by a factor, α. This helps moderate the effect of outliers, which are usually farther away from the center and thus, are shrunk more. For example, a representative point that was a distance of 10 units from the center would move by 3 units (for α = 0.7), while a representative point at a distance of 1 unit would only move 0.3 units.

Algorithm 9.13 BIRCH.
1: Load the data into memory by creating a CF tree that summarizes the data.
2: Build a smaller CF tree if it is necessary for phase 3. T is increased, and then the leaf node entries (clusters) are reinserted. Since T has increased, some clusters will be merged.
3: Perform global clustering. Different forms of global clustering (clustering that uses the pairwise distances between all the clusters) can be used. However, an agglomerative, hierarchical technique was selected. Because the clustering features store summary information that is important to certain kinds of clustering, the global clustering algorithm can be applied as if it were being applied to all the points in a cluster represented by the CF.
4: Redistribute the data points using the centroids of clusters discovered in step 3, and thus, discover a new set of clusters. This overcomes certain problems that can occur in the first phase of BIRCH. Because of page size constraints and the T parameter, points that should be in one cluster are sometimes split, and points that should be in different clusters are sometimes combined. Also, if the data set contains duplicate points, these points can sometimes be clustered differently, depending on the order in which they are encountered. By repeating this phase multiple times, the process converges to a locally optimum solution.

CURE uses an agglomerative hierarchical scheme to perform the actual clustering. The distance between two clusters is the minimum distance between any two representative points (after they are shrunk toward their respective centers). While this scheme is not exactly like any other hierarchical scheme that we have seen, it is equivalent to centroid-based hierarchical clustering if α = 0, and roughly the same as single link hierarchical clustering if α = 1. Notice that while a hierarchical clustering scheme is used, the goal of CURE is to find a given number of clusters as specified by the user.
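The shrinking step and the resulting inter-cluster distance can be sketched together (illustrative helpers; note that with this convention α = 1 means no shrinking and α = 0 collapses every representative onto the center, matching the two limiting cases above):

```python
import math

def shrink(rep, center, alpha):
    """Shrink a representative point toward the cluster center: the new
    point lies at a fraction alpha of its original distance from the
    center, so with alpha = 0.7 a point 10 units away ends up 7 units
    away, i.e. it moves 3 units."""
    return tuple(c + alpha * (r - c) for r, c in zip(rep, center))

def cluster_distance(reps1, reps2):
    """CURE's inter-cluster distance: the minimum Euclidean distance over
    all pairs of (already shrunk) representative points."""
    return min(math.dist(a, b) for a in reps1 for b in reps2)
```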