Defining Grid Cells

This is a key step in the process, but also the least well defined, as there are many ways to split the possible values of each attribute into a number of contiguous intervals. For continuous attributes, one common approach is to split the values into equal width intervals. If this approach is applied to each attribute, then the resulting grid cells all have the same volume, and the density of a cell is conveniently defined as the number of points in the cell.

However, more sophisticated approaches can also be used. In particular, for continuous attributes any of the techniques that are commonly used to discretize attributes can be applied. (See Section 2.3.6.) In addition to the equal width approach already mentioned, this includes (1) breaking the values of an attribute into intervals so that each interval contains an equal number of points, i.e., equal frequency discretization, or (2) using clustering. Another approach, which is used by the subspace clustering algorithm MAFIA, initially

602 Chapter 9 Cluster Analysis: Additional Issues and Algorithms

breaks the set of values of an attribute into a large number of equal width intervals and then combines intervals of similar density.

Regardless of the approach taken, the definition of the grid has a strong impact on the clustering results. We will consider specific aspects of this later.

The Density of Grid Cells

A natural way to define the density of a grid cell (or a more generally shaped region) is as the number of points divided by the volume of the region. In other words, density is the number of points per amount of space, regardless of the dimensionality of that space. Specific, low-dimensional examples of density are the number of road signs per mile (one dimension), the number of eagles per square kilometer of habitat (two dimensions), and the number of molecules of a gas per cubic centimeter (three dimensions). Ar mentioned, however, a common approach is to use grid cells that have the same volume so that the number of points per cell is a direct measure of the cell’s density.

Example 9.8 (Grid-Based Density). Figure 9.10 shows two sets of two- dimensional points divided into 49 cells using a 7-by-7 grid. The first set contains 200 points generated from a uniform distribution over a circle centered at (2, 3) of radius 2, while the second set has 100 points generated from a uniform distribution over a circle centered at (6, 3) of radius 1. The counts for the grid cells are shown in Table 9.2. Since the cells have equal volume (area), we can consider these values to be the densities of the cells. r

0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 1 7 1 8 6 0 0 0

L 4 L 4 1 3 1 3 0 L 8 2 7 1 1 1 8 1 0 2 L 0 2 4 3 1 3 2 0 1 4 4 0 0 0 0 0 0 0 0 0 0

Figute 9.10. Grid-based density. Table 9.2. Point counts for orid cells.

Density-Based Clustering 603

Forming Clusters from Dense Grid Cells

Forming clusters from adjacent groups of dense cells is relatively straightfor- ward. (In Figure 9.10, for example, it is clear that there would be two clusters.) There are, however, some issues. We need to define what we mean by adjacent cells. For example, does a two-dimensional grid cell have 4 adjacent cells or 8? Also, we need an efficient technique to find the adjacent cells, particularly when only occupied cells are stored.

The clustering approach defined by Algorithrn 9.4 has some limitations that could be addressed by making the algorithm slightly more sophisticated. For example, there are likely to be partially empty cells on the boundary of a cluster. Often, these cells are not dense. If so, they will be discarded and parts of a cluster will be lost. Figure 9.10 and Table 9.2 show that four parts of the larger cluster would be lost if the density threshold is 9. The clustering process could be modified to avoid discarding such cells, although this would require additional processing.

It is also possible to enhance basic grid-based clustering by using more than just density information. In many cases, the data has both spatial and non- spatial attributes. In other words, some of the attributes describe the location of objects in time or space, while other attributes describe other aspects of the objects. A common example is houses, which have both a location and a number of other characteristics, such as price or floor space in square feet. Because of spatial (or temporal) autocorrelation, objects in a particular cell often have similar values for their other attributes. In such cases, it is possible to filter the cells based on the statistical properties of one or more non-spatial attributes, e.9., average house price, and then form clusters based on the density of the remaining points.

Strengths and Limitations

On the positive side, grid-based clustering can be very efficient and effective. Given a partitioning of each attribute, a single pass through the data can determine the grid cell of every object and the count of every grid. Also, even though the number of potential grid cells can be high, grid cells need to be created only for non-empty cells. Thus, the time and space complexity of defining the grid, assigning each object to a cell, and computing the density of each cell is only O(m), where m is the number of points. If adjacent, occupied cells can be efficiently accessed, for example, by using a search tree, then the entire clustering process will be highly efficient, e.g., with a time complexity of O(m log rz). For this reason, the grid-based approach to density

9 .3

604 Chapter 9 Cluster Analysis: Additional Issues and Algorithms

clustering forms the basis of a number of clustering algorithms, such as STIlttrG,

GRIDCLUS, WaveCluster, Bang-Clustering, CLIQUE, and MAFIA.

On the negative side, grid-based clustering, like most density-based clus-

tering schemes, is very dependent on the choice of the density threshold z’

If r is too high, then clusters will be lost. If r is too low, two clusters that

should be separate may be joined. Furthermore, if there are clusters and noise

of differing densities, then it may not be possible to fi.nd a single value of r

that works for all parts of the data space. There are also a number of issues related to the grid-based approach. In

Figure 9.10, for example, the rectangular grid cells do not accurately capture

the density of the circular boundary areas. We could attempt to alleviate

this problem by making the grid finer, but the number of points in the grid

cells associated with a cluster would likely show more fluctuation since points

in the cluster are not evenly distributed. Indeed, some grid cells, including

those in the interior of the cluster, might even be empty. Another issue is

that, depending on the placement or size of the cells, a group of points may

appear in just one cell or be split between several different cells. The same group of points might be part of a cluster in the first case, but be discarded

in the second. Finally, as dimensionality increases, the number of potential

grid cells increases rapidly-exponentially in the number of dimensions. Even

though it is not necessary to explicitly consider empty grid cells, it can easily

happen that most grid cells contain a single object. In other words, grid-based

clustering tends to work poorly for high-dimensional data.

9.3.2 Subspace Cluster ing

The clustering techniques considered until now found clusters by using all

of the attributes. However, if only subsets of the features are considered, i.e.,

subspaces ofthe data, then the clusters that we find can be quite different from

one subspace to another. There are two reasons that subspace clusters may

be interesting. First, the data may be clustered with respect to a small set of

attributes, but randomly distributed with respect to the remaining attributes.

Second, there are cases in which different clusters exist in different sets of

dimensions. Consider a data set that records the sales of various items at various times. (The times are the dimensions and the items are the objects.)

Some items might show similar behavior (cluster together) for particular sets

of months, e.g.) summer, but different clusters would likely be characterized

by different months (dimensions).

0 6 0 8 1 1 2 1 4 1 6 1 8 x

(a) Four clusters in three dimensions.

9.3 Density-Based Clustering 605

(b) View in the rg plane

a

t

a

. ! ttt ,a’

r . t l : a : , a

a l

i

(c) View in the rE plane. (d) View in the rg plane.

Figure 9.1 1. Example f igures for subspace clustering.

Example 9.9 (Subspace Clusters). Figure 9.11(a) shows a set of points in three-dimensional space. There are three clusters of points in the full space, which are represented by squares) diamonds, and triangles. In addition, there is one set of points, represented by circles, that is not a cluster in three- dimensional space. Each dimension (attribute) of the example data set is split into a fixed number (rl) of equal width intervals. There are n – 20 intervals, each of size 0.1. This partitions the data space into rectangular cells of equal volume, and thus, the density of each unit is the fraction of points it contains. Clusters are contiguous groups of dense cells. To illustrate, if the threshold

606 Chapter 9 Cluster Analysis: Additional Issues and Algorithms

0 .14

o .12

0.04

0.02

0 0 . 5 1 * t . u 2 2 . 5

Figure 9.12, Histogram showing the distribution of points for the r attribute.

for a dense cell is ( : 0.06, or 6%o of the points, then three one-dimensional clusters can be identified in Figure 9.12, which shows a histogram of the data points of Figure 9.11(a) for the r attribute.

Figure 9.11(b) shows the points plotted in the ry plane. (The z attribute is ignored.) This figure also contains histograms along the r and g axes that show the distribution of the points with respect to their r and g coordinates, respectively. (A higher bar indicates that the corresponding interval contains relatively more points, and vice versa.) When we consider the gr axis, we see three clusters. One is from the circle points that do not form a cluster in the full space, one consists of the square points, and one consists of the diamond and triangle points. There are also three clusters in the r dimension; they correspond to the three clusters diamonds, triangles, and squares in the full space. These points also form distinct clusters in the ry plane. Figure 9.11(c) shows the points plotted in the rz plane. There are two clusters, if we consider only the z attribute. One cluster corresponds to the points

represented by circles, while the other consists of the diamond, triangle, and square points. These points also form distinct clusters in the rz plane. In Figure 9.11(d), there are three clusters when we consider both the y and z coordinates. One of these clusters consists of the circlesl another consists

U’ c o(L

o(U LL

9.3 Density-Based Clustering 60T

of the points marked by squares. The diamonds and triangles form a single cluster in the yz plane. I

These figures illustrate a couple of important facts. First, a set of points- the circles may not form a cluster in the entire data space, but may form a cluster in a subspace. Second, clusters that exist in the full data space (or even a subspace) show up as clusters in lower-dimensional spaces. The first fact tells us that we may need to look in subsets of dimensions to find clusters, while the second fact tells us that many of the clusters we find in subspaces may only be “shadows” (projections) of higher-dimensional clusters. The goal is to find the clusters and the dimensions in which they exist, but we are typically not as interested in clusters that are projections of higher-dimensional clusters.

CLIQUE

CLIQUtr (Clustering In QUEst) is a grid-based clustering algorithm that me- thodically finds subspace clusters. It is impractical to check each subspace for clusters since the number of such subspaces is exponential in the number of dimensions. Instead, CLIQUE relies on the following property;

Monotonicity property of density-based clusters If a set of points forms a density-based cluster in k dimensions (attributes), then the same set of points is also part of a density-based cluster in all possible subsets of those dimensions.