Bounds on Proximities Another approach to avoiding proximity computations is to use bounds on proximities. For instance, when using Euclidean distance, it is possible to use the triangle inequality to avoid many distance calculations. To illustrate, at each stage of traditional K-means, it is necessary to evaluate whether a point should stay in its current cluster or be moved to a new cluster. If we know the distance between the centroids and the distance of a point to the (newly updated) centroid of the cluster to which it currently belongs, then we may be able to use the triangle inequality to avoid computing the distance of the point to any of the other centroids. See Exercise 21.
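For illustration, the following Python sketch applies one such bound during the reassignment step; the function and variable names are ours and it assumes NumPy arrays for the points and centroids. If d(c, c') >= 2 d(x, c) for the current centroid c and another centroid c', then the triangle inequality gives d(x, c') >= d(c, c') - d(x, c) >= d(x, c), so c' cannot be closer and its distance need not be computed.

import numpy as np

def reassign_with_bounds(points, centroids, assignments):
    # Pairwise centroid-to-centroid distances, computed once per iteration.
    cc = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=2)
    new_assignments = assignments.copy()
    for i, x in enumerate(points):
        c = assignments[i]
        d_xc = np.linalg.norm(x - centroids[c])      # distance to current centroid
        best, best_dist = c, d_xc
        for j in range(len(centroids)):
            # If d(c, c') >= 2 d(x, c), centroid j cannot be closer; skip it.
            if j == c or cc[c, j] >= 2 * d_xc:
                continue
            d_xj = np.linalg.norm(x - centroids[j])  # computed only when needed
            if d_xj < best_dist:
                best, best_dist = j, d_xj
        new_assignments[i] = best
    return new_assignments

The centroid-to-centroid distances are computed once per iteration, so when K is much smaller than the number of points, the savings from skipped point-to-centroid distances can be substantial.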
Sampling Another approach to reducing the time complexity is to sample. In this approach, a sample of points is taken, these points are clustered, and then the remaining points are assigned to the existing clusters, typically to the closest cluster. If the number of points sampled is √m, then the time complexity of an O(m²) algorithm is reduced to O(m). A key problem with sampling, though, is that small clusters can be lost. When we discuss CURE,
we will provide a technique for investigating how frequently such problems
occur.
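A minimal sketch of this sample-and-assign strategy is shown below. The function name is ours, the data is assumed to be a NumPy array, and scikit-learn's KMeans stands in for whatever clustering algorithm is applied to the sample.

import numpy as np
from sklearn.cluster import KMeans  # stand-in clusterer for the sample

def cluster_by_sampling(points, k, seed=0):
    # Cluster a sample of roughly sqrt(m) points, then assign every point
    # (sampled or not) to the closest of the resulting centroids.
    m = len(points)
    rng = np.random.default_rng(seed)
    sample = points[rng.choice(m, size=max(k, int(np.sqrt(m))), replace=False)]
    centroids = KMeans(n_clusters=k, n_init=10).fit(sample).cluster_centers_
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)  # cluster label for each point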
Partitioning the Data Objects Another common approach to reducing
time complexity is to use some efficient technique to partition the data into
disjoint sets and then cluster these sets separately. The final set of clusters
either is the union of these separate sets of clusters or is obtained by combining
and/or refining the separate sets of clusters. We only discuss bisecting K-
means (Section 8.2.3) in this section, although many other approaches based
on partitioning are possible. One such approach will be described when we discuss CURE later in this section.

If K-means is used to find K clusters, then the distance of each point to
each cluster centroid is calculated at each iteration. When K is large, this can
be very expensive. Bisecting K-means starts with the entire set of points and
uses K-means to repeatedly bisect an existing cluster until we have obtained
K clusters. At each step, the distance of points to two cluster centroids is
computed. Except for the first step, in which the cluster being bisected consists
of all the points, we only compute the distance of a subset of points to the two
centroids being considered. Because of this fact, bisecting K-means can run
significantly faster than regular K-means.
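A sketch of the bisecting procedure follows. Here the largest cluster is chosen for the next bisection, although other criteria, such as the cluster with the highest SSE, can also be used; scikit-learn's KMeans stands in for the two-way K-means step, and the function name is ours.

import numpy as np
from sklearn.cluster import KMeans  # used for each two-way split

def bisecting_kmeans(points, k):
    # Repeatedly split one existing cluster in two until k clusters remain.
    clusters = [np.asarray(points)]
    while len(clusters) < k:
        # Choose the cluster to bisect (here, simply the largest one).
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        target = clusters.pop(idx)
        labels = KMeans(n_clusters=2, n_init=10).fit_predict(target)
        clusters.append(target[labels == 0])
        clusters.append(target[labels == 1])
    return clusters

Because each split only involves the points of the cluster being bisected, every step after the first computes distances for a subset of the data, which is the source of the speedup described above.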
Summarization Another approach to clustering is to summarize the data,
typically in a single pass, and then cluster the summarized data. In particular,
the leader algorithm (see Exercise 12 on page 562) either puts a data object
in the closest cluster (if that cluster is sufficiently close) or starts a new clus-
ter that contains the current object. This algorithm is linear in the number
of objects and can be used to summarize the data so that other clustering
techniques can be used. The BIRCH algorithm uses a similar concept.
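A one-pass sketch of the leader algorithm is given below; the names and the simple count-only summary kept for each cluster are ours.

import numpy as np

def leader_clustering(points, threshold):
    # Single pass: assign each point to the closest existing leader if it is
    # within `threshold`; otherwise the point starts a new cluster.
    leaders, counts = [], []
    for x in points:
        if leaders:
            d = np.linalg.norm(np.asarray(leaders) - x, axis=1)
            j = int(d.argmin())
            if d[j] <= threshold:
                counts[j] += 1
                continue
        leaders.append(np.asarray(x, dtype=float))
        counts.append(1)
    return np.asarray(leaders), counts

The resulting leaders (with their counts or other summary statistics) can then be clustered by a more expensive technique.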
Parallel and Distributed Computation If it is not possible to take ad-
vantage of the techniques described earlier, or if these approaches do not yield
the desired accuracy or reduction in computation time, then other approaches
are needed. A highly effective approach is to distribute the computation among
multiple processors.
9.5.2 BIRCH
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is a
highly efficient clustering technique for data in Euclidean vector spaces, i.e.,
data for which averages make sense. BIRCH can efficiently cluster such data with one pass and can improve that clustering with additional passes. BIRCH can also deal effectively with outliers.
BIRCH is based on the notion of a Clustering Feature (CF) and a CF tree. The idea is that a cluster of data points (vectors) can be represented by a triple of numbers (N, LS, SS), where N is the number of points in the cluster, LS is the linear sum of the points, and SS is the sum of squares of the points. These are common statistical quantities that can be updated incrementally and that can be used to compute a number of important quantities, such as the centroid of a cluster and its variance (standard deviation). The variance is used as a measure of the diameter of a cluster.
These quantities can also be used to compute the distance between clusters. The simplest approach is to calculate an L1 (city block) or L2 (Euclidean) distance between centroids. We can also use the diameter (variance) of the merged cluster as a distance. A number of different distance measures for clusters are defined by BIRCH, but all can be computed using the summary statistics.
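The triple (N, LS, SS) is easy to maintain incrementally. The sketch below, whose class and function names are ours rather than BIRCH's internal interfaces, shows the updates and how the centroid, the variance, and one possible intercluster distance follow directly from the summary statistics.

import numpy as np

class ClusteringFeature:
    # Summary of a set of d-dimensional points: count, linear sum, sum of squares.
    def __init__(self, d):
        self.N, self.LS, self.SS = 0, np.zeros(d), 0.0

    def add(self, x):                  # incremental update for one new point
        self.N += 1
        self.LS += x
        self.SS += float(np.dot(x, x))

    def merge(self, other):            # CFs are additive, so clusters merge cheaply
        self.N += other.N
        self.LS += other.LS
        self.SS += other.SS

    def centroid(self):
        return self.LS / self.N

    def variance(self):                # average squared distance to the centroid,
        c = self.centroid()            # usable as a (squared) diameter measure
        return self.SS / self.N - float(np.dot(c, c))

def centroid_distance(cf1, cf2):       # one simple intercluster distance
    return float(np.linalg.norm(cf1.centroid() - cf2.centroid()))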
A CF tree is a height-balanced tree. Each interior node has entries of the form [CF_i, child_i], where child_i is a pointer to the ith child node. The space that each entry takes and the page size determine the number of entries in an interior node. The space of each entry is, in turn, determined by the number of attributes of each point.
Leaf nodes consist of a sequence of clustering features, CF_i, where each clustering feature represents a number of points that have been previously scanned. Leaf nodes are subject to the restriction that each leaf node must have a diameter that is less than a parameterized threshold, T. The space that each entry takes, together with the page size, determines the number of entries in a leaf.
By adjusting the threshold parameter T, the height of the tree can be controlled. T also controls the fineness of the clustering, i.e., the extent to which the original data set is reduced. The goal is to keep the CF tree in main memory by adjusting the T parameter as necessary.
A CF tree is built as the data is scanned. As each data point is encountered, the CF tree is traversed, starting from the root and choosing the closest node at each level. When the closest leaf cluster for the current data point is finally identified, a test is performed to see if adding the data item to the candidate cluster will result in a new cluster with a diameter greater than the given threshold, T. If not, then the data point is added to the candidate cluster by
updating the CF information. The cluster information for all nodes from the
leaf to the root is also updated. If the new cluster has a diameter greater than T, then a new entry is
created if the leaf node is not full. Otherwise the leaf node must be split.
The two entries (clusters) that are farthest apart are selected as seeds and the
remaining entries are distributed to one of the two new leaf nodes, based on
which leaf node contains the closest seed cluster. Once the leaf node has been
split, the parent node is updated and split if necessary; i.e., if the parent node
is full. This process may continue all the way to the root node.
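A simplified sketch of the leaf-level logic follows. It reuses the ClusteringFeature class from the earlier sketch, approximates the diameter test by the square root of the variance, and omits the tree traversal and the updating of parent entries; all names are ours, not BIRCH's.

import copy
import numpy as np

def insert_into_leaf(entries, x, T, max_entries):
    # `entries` is a list of ClusteringFeature objects in one leaf node.
    # Returns (entries, None) if no split was needed, else the two new leaves.
    if entries:
        j = min(range(len(entries)),
                key=lambda i: np.linalg.norm(entries[i].centroid() - x))
        trial = copy.deepcopy(entries[j])
        trial.add(x)
        if np.sqrt(trial.variance()) < T:    # still within the threshold
            entries[j] = trial
            return entries, None
    new_cf = ClusteringFeature(len(x))       # otherwise start a new entry
    new_cf.add(x)
    entries.append(new_cf)
    if len(entries) <= max_entries:
        return entries, None
    # Split: the two farthest-apart entries seed two new leaves; the rest go
    # to whichever seed's centroid is closer.
    pairs = [(a, b) for a in range(len(entries)) for b in range(a + 1, len(entries))]
    s1, s2 = max(pairs, key=lambda p: np.linalg.norm(
        entries[p[0]].centroid() - entries[p[1]].centroid()))
    leaf1, leaf2 = [entries[s1]], [entries[s2]]
    for i, cf in enumerate(entries):
        if i in (s1, s2):
            continue
        d1 = np.linalg.norm(cf.centroid() - entries[s1].centroid())
        d2 = np.linalg.norm(cf.centroid() - entries[s2].centroid())
        (leaf1 if d1 <= d2 else leaf2).append(cf)
    return leaf1, leaf2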
BIRCH follows each split with a merge step. At the interior node where the
split stops, the two closest entries are found. If these entries do not correspond
to the two entries that just resulted from the split, then an attempt is made to
merge these entries and their corresponding child nodes. This step is intended
to increase space utilization and avoid problems with skewed data input order.
BIRCH also has a procedure for removing outliers. When the tree needs
to be rebuilt because it has run out of memory, then outliers can optionally
be written to disk. (An outlier is defined to be a node that has far fewer data points than average.) At certain points in the process, outliers are scanned
to see if they can be absorbed back into the tree without causing the tree to
grow in size. If so, they are reabsorbed. If not, they are deleted.
BIRCH consists of a number of phases beyond the initial creation of the
CF tree. All the phases of BIRCH are described briefly in Algorithm 9.13.
9.5.3 CURE
CURE (Clustering Using REpresentatives) is a clustering algorithm that uses
a variety of different techniques to create an approach that can handle large
data sets, outliers, and clusters with non-spherical shapes and non-uniform
sizes. CURE represents a cluster by using multiple representative points from
the cluster. These points will, in theory, capture the geometry and shape of the
cluster. The first representative point is chosen to be the point farthest from
the center of the cluster, while the remaining points are chosen so that they are
farthest from all the previously chosen points. In this way, the representative points are naturally relatively well distributed. The number of points chosen
is a parameter, but it was found that a value of 10 or more worked well.
Once the representative points are chosen, they are shrunk toward the
center by a factor, α. This helps moderate the effect of outliers, which are
usually farther away from the center and thus, are shrunk more. For example,
a representative point that was a distance of 10 units from the center would
move by 3 units (for α = 0.7), while a representative point at a distance of 1 unit would only move 0.3 units.

Algorithm 9.13 BIRCH.
1: Load the data into memory by creating a CF tree that summarizes the data.
2: Build a smaller CF tree if it is necessary for Phase 3. T is increased, and then the leaf node entries (clusters) are reinserted. Since T has increased, some clusters will be merged.
3: Perform global clustering. Different forms of global clustering (clustering that uses the pairwise distances between all the clusters) can be used. However, an agglomerative, hierarchical technique was selected. Because the clustering features store summary information that is important to certain kinds of clustering, the global clustering algorithm can be applied as if it were being applied to all the points in a cluster represented by the CF.
4: Redistribute the data points using the centroids of the clusters discovered in step 3, and thus, discover a new set of clusters. This overcomes certain problems that can occur in the first phase of BIRCH. Because of page size constraints and the T parameter, points that should be in one cluster are sometimes split, and points that should be in different clusters are sometimes combined. Also, if the data set contains duplicate points, these points can sometimes be clustered differently, depending on the order in which they are encountered. By repeating this phase multiple times, the process converges to a locally optimal solution.
CURE uses an agglomerative hierarchical scheme to perform the actual clustering. The distance between two clusters is the minimum distance between any two representative points (after they are shrunk toward their respective centers). While this scheme is not exactly like any other hierarchical scheme that we have seen, it is equivalent to centroid-based hierarchical clustering if α = 0, and roughly the same as single link hierarchical clustering if α = 1. Notice that while a hierarchical clustering scheme is used, the goal of CURE is to find a given number of clusters as specified by the user.
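The sketch below shows how the representative points might be selected and shrunk, and how the resulting intercluster distance is computed. The function names and defaults are ours, and the shrinking convention follows the example above: each representative's distance to the centroid is multiplied by α.

import numpy as np

def representative_points(cluster, num_rep=10, alpha=0.7):
    # Pick the point farthest from the centroid first, then repeatedly pick
    # the point farthest from all previously chosen representatives, and
    # finally shrink each representative toward the centroid by the factor alpha.
    cluster = np.asarray(cluster, dtype=float)
    center = cluster.mean(axis=0)
    reps = [cluster[np.linalg.norm(cluster - center, axis=1).argmax()]]
    while len(reps) < min(num_rep, len(cluster)):
        # distance of every point to its nearest already-chosen representative
        d = np.min([np.linalg.norm(cluster - r, axis=1) for r in reps], axis=0)
        reps.append(cluster[d.argmax()])
    return center + alpha * (np.asarray(reps) - center)

def cure_distance(reps_a, reps_b):
    # CURE's intercluster distance: the minimum distance between any pair of
    # (shrunken) representative points from the two clusters.
    return min(float(np.linalg.norm(a - b)) for a in reps_a for b in reps_b)

With α = 0 all representatives collapse onto the centroid, so cure_distance reduces to the distance between centroids; with α = 1 no shrinking occurs and the measure behaves much like single link.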