Definition 10.7 (Count of Points within a Given Radius). The density around an object is equal to the number of objects that are within a specified distance d of the object.

The parameter d needs to be chosen carefully. If d is too small, then many normal points may have low density and thus a high outlier score. If d is chosen to be large, then many outliers may have densities (and outlier scores) that are similar to normal points.

Detecting outliers using any of the definitions of density has similar char- acteristics and limitations to those of the proximity-based outlier schemes discussed in Section 10.3. In particular, they cannot identify outliers correctly when the data contains regions of differing densities. (See Figure 10.7.) To correctly identify outliers in such data sets, we need a notion of density that is relative to the neighborhood of the object. For example, point D in Figure

LO.4 Density-Based Outlier Detection 669

10.7 has a higher absolute density, according to Definitions 10.6 and 10.7, than point A, but its density is lower relative to its nearest neighbors.

There are many ways to define the relative density of an object. One method that is used by the SNN density-based clustering algorithm is discussed in Section 9.4.8. Another method is to compute the relative density as the ratio of the density of a point x and the average density of its nearest neighbors y as follows:

&uerage relat’iue densi,ty(x,k) : density(x, k)

(10.7) Dye n(*,r) densi’tv(v, k) l lN (x’ k)l ‘

10.4.1 Detection of Outliers Using Relative Density

In this section, we describe a technique that is based on the notion of relative density. This technique, which is a simplified version of the Local Outlier Factor (LOF) technique (see bibliographic notes), is described in Algorithm 10.2. The details of the algorithm are examined in more detail below, but in summary, it works as follows. We calculate the outlier score for each object for a specified number of neighbors (k) by first computing the density of an object density(x,k) based on its nearest neighbors. The average density of the neighbors of a point is then calculated and used to compute the average relative density of the point as indicated in Equation 10.7. This quantity provides an indication of whether x is in a denser or sparser region of the neighborhood than its neighbors and is taken as the outlier score of x.

Algorithm 10.2 Relative density outlier score algorithm. 1: {k is the number of riearest neighbors} 2: for all objects x do 3: Determine l[(x, k), the k-nearest neighbors of x. 4: Determine dens’ity(x,k), the density of x usinlg its nearest neighbors, i.e., the

objects in l[(x, k). end for for all objects x do

Set the outli.er score(x,k) : ou”roge relat’iae dens’ity(x,k) from Equation r0.7.

8: end for

Example 10.2 (Relative Density Outlier Detection). We illustrate the performance of the relative density outlier detection method by using the ex- ample data set shown in Figure 10.7. Here, k : 10. The outlier scores for

5 : o :

670 Chapter 10 Anomaly Detection

o o

Figure 10.8. Relative density (LOF) outlier scores for two-dimensional points of Figure 10.7.

these points are shown in Figure 10.8. The shading of each point is determined by its score; i.e., points with a high score are darker. We have labeled points A, C, and D, which have the largest outlier scores, with these values. Respec- tively, these points are the most extreme outlier, the most extreme point with respect to the compact set of points, and the most extreme point in the loose set of points.

LO.4.2 Strengths and Weaknesses

Outlier detection based on relative density gives a quantitative measure of the degree to which an object is an outlier and can work well even if data has regions of differing density. Like distance-based approaches, these approaches naturally have O(m2) time complexity (where rn is the number of objects), although this can be reduced to O(rnlog rn) for low-dimensional data by using special data structures. Parameter selection can also be difficult, although the standard LOF algorithm addresses this by looking at a variety of values for k and then taking the maximum outlier scores. However, the upper and lower bounds of these values still need to be chosen.

o.uc

o

O n t

Bo

B

10.5 Clustering-Based Techniques 67L

10.5 Clustering-Based Techniques

Cluster analysis finds groups of strongly related objects, while anomaly detec- tion finds objects that are not strongly related to other objects. It should not be surprising, then, that clustering can be used for outlier detection. In this section, we will discuss several such techniques.

One approach to using clustering for outlier detection is to discard small clusters that are far from other clusters. This approach can be used with any clustering technique, but requires thresholds for the minimum cluster size and the distance between a small cluster and other clusters. Often, the process is simplified by discarding all clusters smaller than a minimum size. This scheme is highly sensitive to the number of clusters chosen. AIso, it is hard to attach an outlier score to objects using this scheme. Note that considering groups of objects as outliers extends the notion of outliers from individual objects to groups of objects, but does not change anything essential.

A more systematic approach is to first cluster all objects and then assess the degree to which an object belongs to any cluster. For prototype-based clustering, the distance of an object to its cluster center can be used to mea- sure the degree to which the object belongs to a cluster. More generally, for clustering techniques that are based on an objective function, we can use the objective function to assess how well an object belongs to any cluster. In par- ticular, if the elimination of an object results in a substantial improvement in the objective, then we would classify the object as an outlier. To illustrate, for K-means, eliminating an object that is far from the center of its associated cluster can substantially improve the sum of the squared error (SSE) of the cluster. In summary, clustering creates a model of the data and anomalies distort that model. This idea is captured in Definition 10.8.

Definition 10.8 (Clustering-Based Outlier). An object is a cluster-based outlier if the object does not strongly belong to any cluster.

When used with clustering schemes that have an objective function, this definition is a special case of the definition of a model-based anomaly. At- though Definition 10.8 is more natural for prototype-based schemes or schemes that have an objective function, it can also encompass density- and connectivity- based clustering approaches to outlier detection. In particular, for density- based clustering, an object does not strongly belong to any cluster ifits density is too low, while for connectivity-based clustering, an object does not strongly belong to any cluster if it is not strongly connected.

672 Chapter 10 Anomaly Detection

Below, we will discuss issues that need to be addressed by any technique for clustering-based outlier detection. Our discussion will focus on prototype- based clustering techniques, such as K-means.

10.5.1 Assessing the Extent to Which an Object Belongs to a Cluster

For prototype-based clusters, there are several ways to assess the extent to which an object belongs to a cluster. One method is to measure the distance of the object to the cluster prototype and take this as the outlier score of the object. However, if the clusters are of differing densities, then we can construct an outlier score that measures the relative distance of an object from the cluster prototype with respect to the distances of the other objects in the cluster. Another possibility, provided that the clusters can be accurately modeled in terms of Gaussian distributions, is to use the Mahalanobis distance.

For clustering techniques that have an objective function, we can assign an outlier score to an object that reflects the improvement in the objective function when that obiect is eliminated. However, assessing the degree to which a point is an outlier based on the objective function can be compu- tationally intensive. For that reason) the distance-based approaches of the previous paragraph are often preferred.

Example 10.3 (Clustering-Based Example). This example is based on the set of points shown in Figure 10.7. Prototype-based clustering uses the K-means algorithm, and the outlier score of a point is computed in two ways: (t) bV the point’s distance from its closest centroid, and (2) by the point’s relative distance from its closest centroid, where the relative distance is the ratio of the point’s distance from the centroid to the median distance of all points in the cluster from the centroid. The latter approach is used to adjust for the large difference in density between the compact and loose clusters.

The resulting outlier scores are shown in Figures 10.9 and 10.10. As before, the outlier score, measured in this case by the distance or relative distance, is indicated by the shading. We use two clusters in each case. The approach based on raw distance has problems with the differing densities of the clusters, e.8., D is not considered an outlier. For the approach based on relative dis- tances, the points that have previously been identified as outliers using LOF (A, C, and D) also show up as outliers here. r

10.5 Clustering-Based Techniques 673

3.5

2.5

1 . 5

1

0 5

Figure 10.9. Distance of points from closest centroid.

4.6 a

@ @

O a

I

B a

7 6 9 o

70

60

50

40

30

20

1 0

f:) 6x

BiJXffi Figure 10.10. Relative distance of points from closest centroid.

Distance

674 Chapter 10 Anomaly Detection

LO.5.2 Impact of Outliers on the Initial Clustering

If outliers are detected by clustering, there is a question of whether the results are valid since outliers affect the clustering. To address this issue, the fbllowing approach can be used: objects are clustered, outliers are removed, and then the objects are clustered again. While there is no guarantee that this approach will yield optimal results, it is easy to use. A more sophisticated approach is to have a special group for objects that do not currently fit well in any cluster. This group represents potential outliers. As the clustering process proceeds, clusters change. Objects that no longer belong strongly to any cluster are added to the set of potential outliers, while objects currently in the set are tested to see if they now strongly belong to a cluster and can be removed from the set of potential outliers. The objects remaining in the set at the end of the clustering are classified as outliers. Again, there is no guarantee of an optimal solution or even that this approach will work better than the simpler one described previously. For example, a cluster of noise points may look like a real cluster with no outliers. This problem is particularly serious if the outlier score is computed using the relative distance.

10.5.3 The Number of Clusters to IJse

Clustering techniques such as K-means do not automatically determine the number of clusters. This is a problem when using clustering in outlier detec- tion, since whether an object is considered an outlier or not may depend on the number of clusters. For instance, a group of 10 objects may be relatively close to one another, but may be included as part of a larger cluster if only a few large clusters are found. In that case, each of the 10 points could be regarded as an outlier, even though they would have formed a cluster if a large enough number of clusters had been specified.

As with some of the other issues, there is no simple answer to this problem. One strategy is to repeat the analysis for different numbers of clusters. Another approach is to find a large number of small clusters. The idea here is that (1) smaller clusters tend to be more cohesive and (2) if an object is an outlier even when there are a large number of small clusters, then it is likely a true outlier. The downside is that groups of outliers may form small clusters and thus escape detection.

LO.5.4 Strengths and Weaknesses

Some clustering techniques, such as K-means, have linear or near-linear time and space complexity and thus, an outlier detection technique based on such

10.6 Bibliographic Notes 675

. the definition of a cluster is often com-

plementary to that of an outlier and thus, it is usually possible to find both

clusters and outliers at the same time. On the negative side, the set of outliers

procluced and their scores can be heavily dependent upon the number of clus-

ters used as well as the plesence of outliers in the data. For example, clusters

produced by prototype-based algorithms can be distorted by the presence of

outliers. The quality of outliers produced by a clustering algorithm is heavily

impacted by the quality of clusters produced by the algorithm. As discussed

in Chapters 8 and 9, each clustering algorithm is suitable only for a certain

type of d.ata; hence the clustering algorithm needs to be chosen carefully.

10.6 Bibliographic Notes