In anomaly detection, the goal is to find objects that are different from most other objects. Often, anomalous objects are known as outliers, since, on a scatter plot of the data, they lie far away from other data points. Anomaly detection is also known as deviation detection, because anomalous objects have attribute values that deviate significantly from the expected or typical attribute values, or as exception mining, because anomalies are exceptional in some sense. In this chapter, we will mostly use the terms anomaly or outlier.
There are a variety of anomaly detection approaches from several areas, including statistics, machine learning, and data mining. All try to capture the idea that an anomalous data object is unusual or in some way inconsistent with other objects. Although unusual objects or events are, by definition, relatively rare, this does not mean that they do not occur frequently in absolute terms. For example, an event that is “one in a thousand” can occur millions of times when billions of events are considered.
In the natural world, human society, or the domain of data sets, most events and objects are, by definition, commonplace or ordinary. However, we have a keen awareness of the possibility of objects that are unusual or extraordinary. This includes exceptionally dry or rainy seasons, famous athletes, or an attribute value that is much smaller or larger than all others. Our interest in anomalous events and objects stems from the fact that they are often of unusual importance: A drought threatens crops, an athlete’s exceptional skill may lead to victory, and anomalous values in experimental results may indicate either a problem with the experiment or a new phenomenon to be investigated.
The following examples illustrate applications for which anomalies are of considerable interest.
652 Chapter 10 Anomaly Detection
• Fraud Detection. The purchasing behavior of someone who steals a credit card is probably different from that of the original owner. Credit card companies attempt to detect a theft by looking for buying patterns that characterize theft or by noticing a change from typical behavior. Similar approaches are used for other types of fraud.
• Intrusion Detection. Unfortunately, attacks on computer systems and computer networks are commonplace. While some of these attacks, such as those designed to disable or overwhelm computers and networks, are obvious, other attacks, such as those designed to secretly gather information, are difficult to detect. Many of these intrusions can only be detected by monitoring systems and networks for unusual behavior.
• Ecosystem Disturbances. In the natural world, there are atypical events that can have a significant effect on human beings. Examples include hurricanes, floods, droughts, heat waves, and fires. The goal is often to predict the likelihood of these events and the causes of them.
• Public Health. In many countries, hospitals and medical clinics report various statistics to national organizations for further analysis. For example, if all children in a city are vaccinated for a particular disease, e.g., measles, then the occurrence of a few cases scattered across various hospitals in a city is an anomalous event that may indicate a problem with the vaccination programs in the city.
• Medicine. For a particular patient, unusual symptoms or test results may indicate potential health problems. However, whether a particular test result is anomalous may depend on other characteristics of the patient, such as age and sex. Furthermore, the categorization of a result as anomalous or not incurs a cost: unneeded additional tests if a patient is healthy, and potential harm to the patient if a condition is left undiagnosed and untreated.
Although much of the recent interest in anomaly detection has been driven by applications in which anomalies are the focus, historically, anomaly detection (and removal) has been viewed as a technique for improving the analysis of typical data objects. For instance, a relatively small number of outliers can distort the mean and standard deviation of a set of values or alter the set of clusters produced by a clustering algorithm. Therefore, anomaly detection (and removal) is often a part of data preprocessing.
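The distortion of the mean and standard deviation mentioned above can be illustrated with a minimal sketch. The data values are made up for illustration and the variable names are ours; only the Python standard library is assumed.

```python
import statistics

# Hypothetical measurements clustered around 10.
values = [10.0, 11.0, 9.0, 10.0, 12.0, 8.0, 10.0]

# The same measurements with a single extreme value appended.
with_outlier = values + [100.0]

mean_clean = statistics.mean(values)
std_clean = statistics.stdev(values)          # sample standard deviation
mean_dirty = statistics.mean(with_outlier)
std_dirty = statistics.stdev(with_outlier)

print(f"without outlier: mean={mean_clean:.2f}, std={std_clean:.2f}")
print(f"with outlier:    mean={mean_dirty:.2f}, std={std_dirty:.2f}")
```

One extreme value out of eight more than doubles the mean and inflates the standard deviation by more than an order of magnitude, which is why outlier removal is often treated as a preprocessing step.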
10.1 Preliminaries 653
In this chapter, we will focus on anomaly detection. After a few preliminaries, we provide a detailed discussion of some important approaches to anomaly detection, illustrating them with examples of specific techniques.
10.1 Preliminaries
Before embarking on a discussion of specific anomaly detection algorithms, we provide some additional background. Specifically, we (1) explore the causes of anomalies, (2) consider various anomaly detection approaches, (3) draw distinctions among approaches based on whether they use class label information, and (4) describe issues common to anomaly detection techniques.
10.1.1 Causes of Anomalies
The following are some common causes of anomalies: data from different classes, natural variation, and data measurement or collection errors.
Data from Different Classes An object may be different from other objects, i.e., anomalous, because it is of a different type or class. To illustrate, someone committing credit card fraud belongs to a different class of credit card users than those people who use credit cards legitimately. Most of the examples presented at the beginning of the chapter, namely, fraud, intrusion, outbreaks of disease, and abnormal test results, are examples of anomalies that represent a different class of objects. Such anomalies are often of considerable interest and are the focus of anomaly detection in the field of data mining.
The idea that anomalous objects come from a different source (class) than most of the data objects is stated in the often-quoted definition of an outlier by the statistician Douglas Hawkins.
Definition 10.1 (Hawkins’ Definition of an Outlier). An outlier is an observation that differs so much from other observations as to arouse suspicion that it was generated by a different mechanism.
Natural Variation Many data sets can be modeled by statistical distributions, such as a normal (Gaussian) distribution, where the probability of a data object decreases rapidly as the distance of the object from the center of the distribution increases. In other words, most of the objects are near a center (average object) and the likelihood that an object differs significantly from this average object is small. For example, an exceptionally tall person is not anomalous in the sense of being from a separate class of objects, but only
in the sense of having an extreme value for a characteristic (height) possessed
by all the objects. Anomalies that represent extreme or unlikely variations are often interesting.
Data Measurement and Collection Errors Errors in the data collection or measurement process are another source of anomalies. For example, a measurement may be recorded incorrectly because of human error, a problem
with the measuring device, or the presence of noise. The goal is to eliminate such anomalies, since they provide no interesting information but only reduce
the quality of the data and the subsequent data analysis. Indeed, the removal
of this type of anomaly is the focus of data preprocessing, specifically data cleaning.
Summary An anomaly may be a result of the causes given above or of other causes that we did not consider. Indeed, the anomalies in a data set may have several sources, and the underlying cause of any particular anomaly is often unknown. In practice, anomaly detection techniques focus on finding objects that differ substantially from most other objects, and the techniques themselves are not affected by the source of an anomaly. Thus, the underlying cause of the anomaly is only important with respect to the intended application.
10.1.2 Approaches to Anomaly Detection
Here, we provide a high-level description of some anomaly detection techniques and their associated definitions of an anomaly. There is some overlap between these techniques, and relationships among them are explored further in Exercise 1 on page 680.
Model-Based Techniques Many anomaly detection techniques first build a model of the data. Anomalies are objects that do not fit the model very well. For example, a model of the distribution of the data can be created by using the data to estimate the parameters of a probability distribution. An object does not fit the model very well, i.e., it is an anomaly, if it is not very likely under the distribution. If the model is a set of clusters, then an anomaly is an object that does not strongly belong to any cluster. When a regression model is used, an anomaly is an object that is relatively far from its predicted value.
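As a minimal sketch of the distribution-based case, assume one-dimensional data and a Gaussian model. Both choices are ours, as are the data values and the helper name gaussian_log_density. The parameters are estimated from the data, and the object with the lowest log density is the one that fits the model least well.

```python
import math
import statistics

def gaussian_log_density(x, mu, sigma):
    """Log of the normal density with mean mu and standard deviation sigma at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

# Hypothetical one-dimensional data; the last value is far from the rest.
data = [4.9, 5.1, 5.0, 4.8, 5.2, 5.0, 5.1, 4.9, 9.0]

# Build the model: estimate the distribution parameters from the data.
mu = statistics.mean(data)
sigma = statistics.stdev(data)

# Lower log density means the object fits the model less well.
scores = [gaussian_log_density(x, mu, sigma) for x in data]
least_likely = data[scores.index(min(scores))]
print("least likely object:", least_likely)
```

Note that the parameters here are estimated from all of the data, including the extreme value, so the extreme value itself shifts the mean and widens the standard deviation of the fitted model.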
Because anomalous and normal objects can be viewed as defining two distinct classes, classification techniques can be used for building models of these
two classes. Of course, classification techniques can only be used if class labels are available for some of the objects so that a training set can be constructed. Also, anomalies are relatively rare, and this needs to be taken into account when choosing both a classification technique and the measures to be used for evaluation. (See Section 5.7.)
In some cases, it is difficult to build a model, e.g., because the statistical distribution of the data is unknown or no training data is available. In these situations, techniques that do not require a model, such as those described below, can be used.
Proximity-Based Techniques It is often possible to define a proximity measure between objects, and a number of anomaly detection approaches are based on proximities. Anomalous objects are those that are distant from most of the other objects. Many of the techniques in this area are based on distances and are referred to as distance-based outlier detection techniques. When the data can be displayed as a two- or three-dimensional scatter plot, distance-based outliers can be detected visually, by looking for points that are separated from most other points.
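One simple way to turn "distant from most of the other objects" into a number is to score each object by the distance to its k-th nearest neighbor. The sketch below assumes Euclidean distance and k = 2. The scoring rule, the function name knn_distance_scores, and the data are illustrative choices, not the only possible definition.

```python
import math

def knn_distance_scores(points, k):
    """Score each point by the Euclidean distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# Hypothetical 2-D data: four nearby points and one point far away.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]

scores = knn_distance_scores(points, k=2)
most_anomalous = points[scores.index(max(scores))]
print("most anomalous point:", most_anomalous)
```

This brute-force version compares every pair of points, so its cost grows quadratically with the number of objects.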
Density-Based Techniques Estimates of the density of objects are relatively straightforward to compute, especially if a proximity measure between objects is available. Objects that are in regions of low density are relatively distant from their neighbors, and can be considered anomalous. A more sophisticated approach accommodates the fact that data sets can have regions of widely differing densities, and classifies a point as an outlier only if it has a local density significantly less than that of most of its neighbors.
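The difference between a plain density estimate and a density measured relative to the neighbors can be sketched as follows. We assume, for illustration only, that density is the inverse of the average distance to the k nearest neighbors and that relative density is the ratio of a point's density to the average density of those neighbors. The function names and data are hypothetical, and duplicate points (zero distances) are not handled.

```python
import math

def k_nearest(points, i, k):
    """Indices of the k points closest to points[i] (excluding i itself)."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]

def density(points, i, k):
    """Inverse of the average distance from points[i] to its k nearest neighbors."""
    neighbors = k_nearest(points, i, k)
    avg_dist = sum(math.dist(points[i], points[j]) for j in neighbors) / k
    return 1.0 / avg_dist

def relative_density(points, i, k):
    """Density of points[i] divided by the average density of its k nearest neighbors."""
    neighbors = k_nearest(points, i, k)
    neighbor_avg = sum(density(points, j, k) for j in neighbors) / k
    return density(points, i, k) / neighbor_avg

# Hypothetical data: a tight cluster, a loose cluster, and one point
# lying just outside the tight cluster.
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),          # tight cluster
    (10.0, 10.0), (12.0, 10.0), (10.0, 12.0), (12.0, 12.0),  # loose cluster
    (1.0, 1.0),                                               # near the tight cluster
]
k = 2

rel = [relative_density(points, i, k) for i in range(len(points))]
lowest = points[rel.index(min(rel))]
print("lowest relative density:", lowest)
```

In this data the point (1.0, 1.0) has a higher plain density than the members of the loose cluster, so a single global density threshold would flag the loose cluster first. Its relative density, however, is far below 1, while every cluster member has a relative density of about 1.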
10.1.3 The Use of Class Labels
There are three basic approaches to anomaly detection: unsupervised, supervised, and semi-supervised. The major distinction is the degree to which class labels (anomaly or normal) are available for at least some of the data.
Supervised anomaly detection Techniques for supervised anomaly detection require the existence of a training set with both anomalous and normal objects. (Note that there may be more than one normal or anomalous class.) As mentioned previously, classification techniques that address the so-called rare class problem are particularly relevant because
anomalies are relatively rare with respect to normal objects. See Section 5.7.
Unsupervised anomaly detection In many practical situations, class labels are not available. In such cases, the objective is to assign a score (or a label) to each instance that reflects the degree to which the instance is anomalous. Note that the presence of many anomalies that are similar to each other can cause them all to be labeled normal or have a low outlier score. Thus, for unsupervised anomaly detection to be successful, anomalies must be distinct from one another, as well as from normal objects.
Semi-supervised anomaly detection Sometimes training data contains labeled normal data, but has no information about the anomalous objects. In the semi-supervised setting, the objective is to find an anomaly label or score for a set of given objects by using the information from labeled normal objects. Note that in this case, the presence of many related
outliers in the set of objects to be scored does not impact the outlier evaluation. However, in many practical situations, it can be difficult to find a small set of representative normal objects.
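The semi-supervised setting can be sketched as follows, assuming one-dimensional data and a simple deviation-from-the-mean score. The scoring rule, the data, and the names normal_train and score are our illustrative choices. The model is built only from objects labeled normal, so a group of mutually similar outliers among the objects being scored cannot pull the model toward themselves.

```python
import statistics

# Hypothetical objects that are labeled normal.
normal_train = [50.0, 52.0, 48.0, 51.0, 49.0, 50.0]

# The model of "normal" is estimated from the labeled normal objects only.
mu = statistics.mean(normal_train)
sigma = statistics.stdev(normal_train)

def score(x):
    """Number of standard deviations between x and the mean of the normal objects."""
    return abs(x - mu) / sigma

# Objects to be scored: one typical value and four similar outliers.
to_score = [50.5, 80.0, 81.0, 79.5, 80.5]
scores = [score(x) for x in to_score]

for x, s in zip(to_score, scores):
    print(f"{x:6.1f} -> score {s:.1f}")
```

All four similar outliers receive high scores, because none of them contributed to the estimate of the mean or standard deviation.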
All anomaly detection schemes described in this chapter can be used in
supervised or unsupervised mode. Supervised schemes are essentially the same as classification schemes for rare classes discussed in Section 5.7.
10.1.4 Issues
There are a variety of important issues that need to be addressed when dealing with anomalies.
Number of Attributes Used to Define an Anomaly The question of
whether an object is anomalous based on a single attribute is a question of
whether the object’s value for that attribute is anomalous. However, since an object may have many attributes, it may have anomalous values for some attributes, but ordinary values for other attributes. Furthermore, an object may be anomalous even if none of its attribute values are individually anomalous. For example, it is common to have people who are two feet tall (children) or who weigh 300 pounds, but uncommon to have a two-foot tall person who weighs 300 pounds. A general definition of an anomaly must specify how the
values of multiple attributes are used to determine whether or not an object is an anomaly. This is a particularly important issue when the dimensionality of the data is high.
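The height and weight example can be made concrete with a small sketch. We use the Mahalanobis distance as one possible way to combine two attributes. This is our choice for illustration, and the data values and function names are hypothetical. A two-foot, 300-pound person has unremarkable per-attribute z-scores but a large combined distance.

```python
import math
import statistics

# Hypothetical (height in feet, weight in pounds) pairs; weight rises with height.
people = [(2.0, 30.0), (3.0, 60.0), (4.0, 100.0), (5.0, 180.0), (6.0, 300.0)]

heights = [h for h, _ in people]
weights = [w for _, w in people]
mean_h, mean_w = statistics.mean(heights), statistics.mean(weights)
var_h, var_w = statistics.variance(heights), statistics.variance(weights)
# Sample covariance between height and weight.
cov_hw = sum((h - mean_h) * (w - mean_w) for h, w in people) / (len(people) - 1)

def z_scores(h, w):
    """Per-attribute z-scores, each attribute considered on its own."""
    return ((h - mean_h) / math.sqrt(var_h), (w - mean_w) / math.sqrt(var_w))

def mahalanobis(h, w):
    """Mahalanobis distance of (h, w) from the mean, using the 2x2 sample covariance."""
    dh, dw = h - mean_h, w - mean_w
    det = var_h * var_w - cov_hw ** 2
    q = (var_w * dh * dh - 2 * cov_hw * dh * dw + var_h * dw * dw) / det
    return math.sqrt(q)

print("z-scores of (2 ft, 300 lb):", z_scores(2.0, 300.0))
print("Mahalanobis distance of (2 ft, 300 lb):", mahalanobis(2.0, 300.0))
print("largest distance in the data:", max(mahalanobis(h, w) for h, w in people))
```

Both the height of 2 feet and the weight of 300 pounds occur in the data, and each z-score is below 2 in absolute value. Only the combination is unusual, which the combined distance reflects.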
Global versus Local Perspective An object may seem unusual with respect to all objects, but not with respect to objects in its local neighborhood. For example, a person whose height is 6 feet 5 inches is unusually tall with respect to the general population, but not with respect to professional basketball players.
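The basketball example can be sketched by computing the same z-score against two different reference groups. The heights below (in inches) are made up for illustration and are not real population statistics. The helper z_score is ours.

```python
import statistics

def z_score(x, group):
    """Number of sample standard deviations between x and the mean of group."""
    return (x - statistics.mean(group)) / statistics.stdev(group)

# Hypothetical heights in inches.
general_population = [62, 64, 65, 66, 67, 68, 69, 70, 71, 72]
basketball_players = [74, 75, 76, 77, 78, 79, 80, 81, 82]

height = 77  # 6 feet 5 inches

z_global = z_score(height, general_population)
z_local = z_score(height, basketball_players)
print(f"relative to general population: z = {z_global:.2f}")
print(f"relative to basketball players: z = {z_local:.2f}")
```

The same height is about three standard deviations above the mean of the first group but within half a standard deviation of the mean of the second.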
Degree to Which a Point Is an Anomaly The assessment of whether an object is an anomaly is reported by some techniques in a binary fashion: An object is either an anomaly or it is not. Frequently, this does not reflect the underlying reality that some objects are more extreme anomalies than others. Hence, it is desirable to have some assessment of the degree to which an object is anomalous. This assessment is known as the anomaly or outlier score.
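The difference between a binary label and a score can be shown in a few lines. The scores, object names, and threshold below are hypothetical. A threshold turns scores into labels, but the labels no longer distinguish a borderline anomaly from an extreme one, whereas a ranking by score does.

```python
# Hypothetical outlier scores for four objects (higher means more anomalous).
scores = {"a": 0.2, "b": 0.4, "c": 3.1, "d": 9.7}

# Binary view: an object is an anomaly if its score exceeds a chosen threshold.
threshold = 3.0
labels = {name: s > threshold for name, s in scores.items()}

# Score view: objects ranked from most to least anomalous.
ranking = sorted(scores, key=scores.get, reverse=True)

print("labels: ", labels)
print("ranking:", ranking)
```

Objects c and d receive the same binary label even though d's score is roughly three times that of c.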
Identifying One Anomaly at a Time versus Many Anomalies at Once In some techniques, anomalies are removed one at a time; i.e.