Identifying One Anomaly at a Time versus Many Anomalies at Once In some techniques, anomalies are removed one at a time; i.e., the most anoma- lous instance is identified and removed and then the process repeats. For other techniques, a collection of anomalies is identified together. Techniques that attempt to identify one anomaly at a time are often subject to a problem known as masking, where the presence of several anomalies masks the pres- ence of all. On the other hand, techniques that detect multiple outliers at once can experience swamping, where normal objects are classified as outliers. In model-based approaches, these effects can happen because the anomalies dis- tort the data model.

Evaluation If class labels are available to identify anomalies and normal data, then the effectiveness of an anomaly detection scheme can be evaluated by using measures of classification performance discussed in Section 5.7. But since the anomalous class is usually much smaller than the normal class, mea- sures such as precision, recall, and false positive rate are more appropriate than accuracy. If class labels are not available, then evaluation is difficult. However, for model-based approaches, the effectiveness of outlier detection can be judged with respect to the improvement in the model once anomalies are eliminated.

Efficiency There are significant differences in the computational cost of var- ious anomaly detection schemes. Classification-based schemes can require sig- nificant resources to create the classification model, but are usually inexpensive to apply. Likewise, statistical approaches create a statistical model and can

658 Chapter 10 Anomaly Detection

then categorize an object in constant time. Proximity-based approaches nat- urally have a time complexity of O(*’), where rn is the number of objects, because the information they require can usually only be obtained by com- puting the proximity matrix. This time complexity can be reduced in specific casesl such as low-dimensional data, by the use of special data structure and algorithms. The time complexity of other approaches is considered in Exercise 6 on page 681.

Road Map

The next four sections describe several major categories of anomaly detection approaches: statistical, proximity-based, density-based, and cluster-based. One or more specific techniques are considered within each of these categories. In these sections, we will follow common practice and use the term outlier instead of anomaly.

1-O.2 Statistical Approaches

Statistical approaches are model-based approaches; i.e., a model is created for the data, and objects are evaluated with respect to how well they fit the model. Most statistical approaches to outlier detection are based on building a probability distribution model and considering how Iikely objects are under that model. This idea is expressed by Definition 10.2.

Definition 10.2 (Probabilistic Definition of an Outlier). An outlier is an object that has a low probability with respect to a probability distribution model of the data.

A probability distribution model is created from the data by estimating the parameters of a user-specified distribution. If the data is assumed to have a Gaussian distribution, then the mean and standard deviation of the underlying distribution can be estimated by computing the mean and standard deviation of the data. The probability of each object under the distribution can then be estimated.

A wide variety of statistical tests based on Definitiont0.2 have been devised to detect outliers, or discordant observations, as they are often called in the statistical literature. Many of these discordancy tests are highly specialized and assume a level of statistical knowledge beyond the scope of this text. Thus, we illustrate the basic ideas with a few examples and refer the reader to the bibliographic notes for further pointers.

LO.2 Statistical Approaches 659

Issues

Among the important issues facing this approach to outlier detection are the following:

Identifying the specific distribution of a data set. While many types of data can be described by a small number of common distributions, such as Gaussian, Poisson, or binomial, data sets with non-standard distributions are relatively common. Of course, if the wrong model is chosen, then an object can be erroneously identified as an outlier. For example, the data may be modeled as coming from a Gaussian distribution, but may actually come from a distribution that has a higher probability (than the Gaussian distribution) of having values far from the mean. Statistical distributions with this type of behavior are common in practice and are known as heavy-tailed distributions.

The number of attributes used. Most statistical outlier detection tech- niques apply to a single attribute, but some techniques have been defined for multivariate data.

Mixtures of distributions. The data can be modeled as a mixture of distri- butions, and outlier detection schemes can be developed based on such models. Although potentially more powerful, such models are more complicated, both to understand and to use. For example, the distributions need to be identi- fied before objects can be classified as outliers. See the discussion of mixture models and the EM algorithm in Section 9.2.2.

10.2.1 Detecting Outliers in a Univariate Normal Distribution

The Gaussian (normal) distribution is one of the most frequently used dis- tributions in statistics, and we will use it to describe a simple approach to statistical outlier detection. This distribution has two parameters, p and o, which are the mean and standard deviation, respectively, and is represented using the notation N(p,o). Figure 10.1 shows the density function of l[(0,1).

There is little chance that an object (value) from a N(0, 1) distribution will occur in the tails of the distribution. For instance, there is only a probability of 0.0027 that an object lies beyond the central area between t3 standard deviations. More generally, if c is a constant and r is the attribute value of an object, then the probability that lzl ) c decreases rapidly as c increases. Let a : prob(lrl > c). Table 10.1 shows some sample values for c and the

660 Chapter 10 Anomaly Detection

o 4

Figure 10,1 . Probability density function of a Gaussian distribution with a mean of 0 and a standard deviation of 1.

corresponding values for a when the distribution is l/(0,1). Note that a value that is more than 4 standard deviations from the mean is a one-in-ten-thousand occurrence.

Table 10.1, Sample pairs (c, a), a : prob(lrl > c) for a Gaussian distribution with mean 0 and standard deviation 1.

) 0.25 q c o

> 0.2

G ! g 0 . 1 5

(!

– 1 0 1 2 3 4 5 x

c a fo r N (0 ,1 ) 1.00 1.50 2.00 2.50 3.00 3.50 4.00

0.3173 0.1336 0.0455 0.0124 0.0027 0.0005 0.0001

Because a value’s distance directly related to the value’s for whether an object (value)

c from the center of the N(0,1) distribution is probability, it can be used as the basis of a test is an outlier as indicated in Definition 10.3.

LO.2 Statistical Approaches 661

Definition 10.3 (Outlier for a Single N(0,1) Gaussian Attribute). At object with attribute value r from a Gaussian distribution with mean of 0 and standard deviation 1 is an outlier if

l r l > “,

(10 .1)

where c is a constant chosen so that prob(lrl) ) c: e.

To use this definition it is necessary to specify a value for a. From the viewpoint that unusual values (objects) indicate a value from a different dis- tribution, c indicates the probability that we mistakenly classify a value from the given distribution as an outlier. From the viewpoint that an outlier is a rare value of a l/(0,1) distribution, a specifies the degree of rareness.

If the distribution of an attribute of interest (for the normal objects) has a Gaussian distribution with mean p and a standard deviation o, i.e., a N(p,,o) distribution, then to use Definition 10.3, we need to transform the attribute r to a new attribute z, which has a ,A/(0,1) distribution. In particular, the approach is to set z: (r – p)lo. (z is typically called a z score.) However, pr and o are typically unknown and are estimated using the sample mean 7 and sample standard deviation s”. In practice, this works well when the number of observations is large. However, we note that the distribution of z is not actually N(0,1). A more sophisticated statistical procedure (Grubbs’ test) is explored in Exercise 7 on page 681.

LO.2.2 Outliers in a Multivariate Normal Distribution

For multivariate Gaussian observations, we would like to take an approach similar to that given for a univariate Gaussian distribution. In particular, we would like to classify points as outliers if they have low probability with respect to the estimated distribution of the data. Furthermore? we would like to be able to judge this with a simple test, for example, the distance of a point from the center of the distribution.

However, because of the correlation between the different variables (at- tributes), a rnultivariate normal distribution is not symmetrical with respect to its center. Figure 10.2 shows the probability density of a two-dimensional multivariate Gaussian distribution with mean of (0,0) and a covariance matrix of

\ ‘ – / 1’oo “- \ o.zs3r3)

662 Chapter 1O Anomaly Detection

If we are to use a simple threshold for whether an object is an outlier, then

we will need a distance measure that takes the shape of the data distribution into account. The Mahalanobis distance is such a distance. See Equation 2.14

on page 81. The Mahalanobis distance between a point x and the mean of the data x is shown in Equation 10.2.

mahalanobzs(x, x) : (x – x)S*1(x – *) t , (10.2)

where S is the covariance matrix of the data. It is easy to show that the Mahalanobis distance of a point to the mean of

the underlying distribution is directly related to the probability of the point.

In particular, the Mahalanobis distance is equal to the log of the probability

density of the point plus a constant. See Exercise 9 on page 682.

Example 10.1 (Outliers in a Multivariate Normal Distribution). Fig- ure 10.3 shows the Mahalanobis distance (from the mean of the distribution) for points in a two-dimensional data set. The points A (-4,4) and B (5, 5) are outliers that were added to the data set, and their Mahalanobis distance is indicated in the figure. The other 2000 points of the data set were randomly generated using the distribution used for Figure 10.2.

Both A and B have large Mahalanobis distances. However, even though A is closer to the center (the large black x at (0,0)) as measured by Euclidean dis- tance, it is farther away than B in terms of the Mahalanobis distance because the Mahalanobis distance takes the shape of the distribution into account. In particular, point B has a Euclidean distance of 5t/2 and a Mahalanobis distance of 24, while the point A has a Euclidean distance of 4Jd and a Ma- halanobis distance of 35. I

10.2.3 A Mixture Model Approach for Anomaly Detection

This section presents an anomaly detection technique that uses a mixture model approach. In clustering (see Section 9.2.2), the mixture model approach assumes that the data comes from a mixture of probability distributions and that each cluster can be identified with one of these distributions. Similarly, for anomaly detection, the data is modeled as a mixture of two distributions, one for ordinary data and one for outliers.

For both clustering and anomaly detection, the goal is to estimate the parameters of the distributions in order to maximize the overall likelihood (probability) of the data. In clustering, the EM algorithm is used to esti- mate the parameters of each probability distribution. However, the anomaly detection technique presented here uses a simpler approach. Initially, all the

IO.2 Statistical Approaches 663