Figure 10,3. Mahalanobis distance of points from the center of a two-dimensional set of 2002 points.

664 Chapter 10 Anomaly Detection

objects are put in a set of normal objects and the set of anomalous objects is empty. An iterative procedure then transfers objects from the ordinary set to the anomalous set as long as the transfer increases the overall likelihood of the data.

Assume that the data set D contains objects from a mixture of two prob-

ability distributions: M, the distribution of the majority of (normal) objects, and A, the distribution of anomalous objects. The overall probability distri- bution of the data can be written as

D(x) : (1 – ) )M(x) +. la ix; . (10.3)

where x is an object and ,\ is a number between 0 and 1 that gives the expected fraction of outliers. The distribrtion M is estimated from the data, while the distribution A is often taken to be uniform. Let Mt and A1 be the set of normal and anomalous objects, respectively, at time t. Initially, at time f : 0, M0 : D and As is empty. At an arbitrary time l, the likelihood and log likelihood of the entire data set D are given by the following two equations, respectively:

L,(D) II x i e D

Lh(D) lMltog(I – ^) + !-

log Pv, (*u) + lA| Iog.\ * ! log Pa, (x;) (to.s) x , € M t x i € 4 7

where Po, Pnr, and P4, are the probability distribution functions for D, M1 and A1, respectively. This equation can be derived from the general definition of a mixture model given in Equation 9.6 (Section 9.2.2). To do so, it is necessary to make the simplifying assumption that the probability is 0 for both of the following situations: (1) an object in ,4 is a normal object, and (2) an object in M is an outlier. Algorithm 10.1 gives the details.

Because the number of normal objects is large compared to the number of anomalies, the distribution of the normal objects may not change much when an object is moved to the set of anomalies. In that case, the contribution of each normal object to the overall likelihood of the normal objects will remain relatively constant. Furthermore, if a uniform distribution is assumed for anomalies, then each object moved to the set of anomalies contributes a fixed amount to the likelihood of the anomalies. Thus, the overall change in the total likelihood of the data when an object is moved to the set of anomalies is roughly equal to the probability of the object under a uniform distribution

po(xe): (,t – ,l),-,,*,ll. r,,,r*,)) (^,^,,_,U, e,,i*,)){ro.a)

LO.2 Statistical Approaches 665

Algorithm 10.1 Likelihood-based outlier detection. 1: Initialization: At time t : 0, let Mr contain all the objects, while A1 is empty.

Let LL1(D) : LL(MI) + LL(AI) be the log likelihood of all the data. 2: for each point x that belongs to M1 do 3: Move x from M1 to A7 to produce the new data sets A1a1 and M7a1. 4: Compute the new log l ikelihood of D, LLt+r(D) : LL(Mt+r) + LL(At+r) 5: Compute the difference, A : LL^D) – Lh+r(D) 6: if A > c, where c is some threshold then 7: x is classified as an anomaly, i.e., M11y and ,4111 are left unchanged and

become the current normal and anomaly sets. 8: end if 9: end for

(weighted by ,\) minus the probability of the object under the distribution of the normal data points (weighted by 1- )). Consequently, the set of anomalies will tend to consist of those objects that have significantly higher probability under a uniform distribution rather than under the distribution of the normal objects.

In the situation just discussed, the approach described by Algorithm 10.1 is roughly equivalent to classifying objects with a low probability under the distribution of normal objects as outliers. For example, when applied to the points in Figure 10.3, this technique would classify points A and B (and other points far from the mean) as outliers. However, if the distribution of the nor- mal objects changes significantly as anomalies are removed or the distribution of the anomalies can be modeled in a more sophisticated manner, then the results produced by this approach will be different than the results of simply classifying low-probability objects as outliers. Also, this approach can work even when the distribution of obiects is multimodal.

IO.2.4 Strengths and Weaknesses

Statistical approaches to outlier detection have a firm foundation and build on standard statistical techniques, such as estimating the parameters of a distribution. When there is sufficient knowledge of the data and the type of test that should be applied these tests can be very effective. There are a wide variety of statistical outliers tests for single attributes. Fewer options are available for multivariate data, and these tests can perform poorly for high-dimensional data.

666 Chapter 10 Anomaly Detection

10.3 Proximity-Based Outlier Detection

Although there are several variations on the idea of proximity-based anomaly detection, the basic notion is straightforward. An object is an anomaly if it is distant from most points. This approach is more general and more easily applied than statistical approaches, since it is easier to determine a meaningful proximity measure for a data set than to determine its statistical distribution.

One of the simplest ways to measure whether an object is distant from most points is to use the distance to the k-nearest neighbor. This is captured by Definition 10.4. The lowest value of the outlier score is 0, while the highest value is the maximum possible value of the distance function-usually infinity.

Definition 10.4 (Distance to k-Nearest Neighbor). The outlier score of an object is given by the distance to its k-nearest neighbor.

Figure 10.4 shows a set of two-dimensional points. The shading of each point indicates its outlier score using a value of k : 5. Note that outlying point C has been correctly assigned a high outlier score.

The outlier score can be highly sensitive to the value of k. If k is too small, e.8., 1, then a small number of nearby outliers can cause a low outlier score. For example, Figure 10.5 shows a set of two-dimensional points in which another point is close to C. The shading reflects the outlier score using a value of k : 1. Note that both C and its neighbor have a low outlier score. If k is too large, then it is possible for all objects in a cluster that has fewer objects than k to become outliers. For example, Figure 10.6 shows a two-dimensional data set that has a natural cluster of size 5 in addition to a larger cluster of size 30. For k : 5, the outlier score of all points in the smaller cluster is very high. To make the scheme more robust to the choice of k, Definition 10.4 can be modified to use the average ofthe distances to the first k-nearest neighbors.

10.3.1 Strengths and Weaknesses

The distance-based outlier detection scheme described above, and other re- lated schemes, are simple. However, proximity-based approaches typically take O(m2) time. For large data sets this can be too expensive, although specialized algorithms can be used to improve performance in the case of low- dimensional data. Also, the approach is sensitive to the choice of parameters. Furthermore, it cannot handle data sets with regions of widely differing den- sities because it uses global thresholds that cannot take into account such density variations.

10.3 Proximity-Based Outlier Detection 667

Outlier Score

Outlier Score

Outlier Score

Outlier Score

a

. c a

a a o

I

& 6

6 n

o O O

o ^ ^ o t-J

o 6

@ o

@

6 @ e

@

@

@

@ o o

@

@

o o o

a

B

B o

@ .

@ 6 @

@ o o ^ @ o

6 o

a @

Figure 10,4. Outlier score based on the distance to fifth nearest neiohbor.

@ @

& @ 6

o E o O

o ^ŵ ^ o o 0 8

6 o 6

@ @

Figure 10.6. Outlier score based on distance to the fifth nearest neighbor. A small cluster becomes an outlier.

Figure 10.5. Outlier score based on the dis- tance to the first nearest neighbor. Nearby out- liers have low outlier scores.

@ @ B

@ @ @

Figure 10.7. Outlier score based on the dis- tance to the fifth nearest neiohbor. Clusters of differing density.

c a

To illustrate this, consider the set of two-dimensional points in Figure 10.7. This figure has one rather loose cluster of points, another dense cluster of points, and two points, C and D, that are quite far from these two clusters. Assigning the outlier score to points according to Definition 10.4 for k : 5, correctly identifies point C to be an outlier, but shows a low outlier score for

668 Chapter 10 Anomaly Detection

point D. In fact, the outlier score for D is much lower than many points that are oart of the loose cluster.

LO.4 Density-Based Outlier Detection

From a density-based viewpoint, outliers are objects that are in regions of low density.

Definition 10.5 (Density-Based Outlier). The outlier score of an object is the inverse of the density around an object.

Density-based outlier detection is closely related to proximity-based outlier detection since density is usually defined in terms of proximity. One common approach is to define density as the reciprocal of the average distance to the k nearest neighbors. If this distance is small, the density is high, and vice versa. This is captured by Definition 10.6.

Definition 10.6 (Inverse Distance).

where l/(x, k) is the set containing the k-nearest the size of that set, and y is a nearest neighbor.

Another definition of density is the one used

(10.6)

neighbors of x, ll/(x, k)l is

by the DBSCAN clustering

d,ensi,ty(x,-l :(ffi)-‘

algorithm. See Section 8.4.