1499] T. Lane and C. E. Brodley. An Application of Machine Learning to Anomaly Detection. In Proc. 20th NIST-NCSC Nati,onal Informat’ion Systems Security Conf.,pages 366 380, 1997.

[500] A. Lazarev\c,L.Ertoz, V. Kumar, A. Ozgur, and J. Srivastava. A Comparative Study of Anomaly Detection Schemes in Network Intrusion Detection. In Proc. of the 2003 SIAM Intl. Conf. on Data Mini,ng,2OO3.

[501] A. Lazarevic, V. Kumar, and J. Srivastava. Intrusion Detection: A Survey. In Manag- ing Cyber Threats: Issues, Approaches and Challenges, pages 19-80. Kluwer Academic Publisher, 2005.

[502] W. Lee and S. J. Stolfo. Data Mining Approaches for Intrusion Detection. In 7th USENIX Security Sympos’ium, pages 26-29, January 1998.

[503] W. Lee, S. J. Stolfo, and K. W. Mok. A Data Mining Framework for Building Intrusion Detection Models. In IEEE Sgmposium on Securitg and Priuacg, pages 120-132, 1999.

[504] W. Lee and D. Xiang. Information-theoretic measures for anomaly detection In Proc. of the 2001 IEEE Syrnpos’ium on Security and Priuacy, pages 130-143, May 2001.

1505] R. Y. Liu, J. M. Parelius, and K Singh. Multivariate analysis by data depth: descrip- tive statistics, graphics and inference. Annals of Stat’istics,27(3):783-858, 1999.

1506] M. Markou and S. Singh. Novelty detection: A review part 1: Statistical approaches. Signal Processi,ng, 83(L2):248I 2497, 2003.

15071 M. Markou and S. Singh. Novelty detection: A review part 2: Neural network based approaches. Signal Processi,ng, 83(12):2499 2521, 2003.

[508] C. R. Muirhead. Distinguishing Outlier Types in Time Series. Journal of the Royal Stati,stical Soci,ety. Series B (Methodologi,cal), 48(l):39 47, 1986.

[509] L. Portnoy, E. Eskin, and S. J. Stolfo. Intrusion detection with unlabeled data using clustering. In In ACM Workshop on Data Mi.ning Appli,ed to Securi’ty,200I.

[510] S. Ramaswamy, R. Rastogi, and K Shim. Efficient algorithms for mining outliers from large data sets. In Proc. of 2000 ACM-SIGMOD IntI. Conf. on Management of Data, pages 427 438. ACM Press, 2000.

1511] D. M. Rocke and D. L. Woodruff. Identification of Outliers in Multivariate Data.

Journal oJ the Ameri,can Stati,stical Associat’ion, 91(435):1047-106t, September 1996.

f512] B. Rosner. On the Detection of Many Outliers. Technometrics, 17(3):227 227, 1975.

[513] P. J. Rousseeuw and A. M. Leroy. Robust Regression and Outlier Detection. Wlley

Series in Probability and Statistics. John Wiley & Sons, September 2003.

[514] P. J. Rousseeuw, I. Ruts, and J. W. T[rkey. The Bagplot: A Bivariate Boxplot. ?he American Statist’ic’i an, 53 (4) :382-387, November 1999.

1515] P. J. Rousseeuw and B. C. van Zomeren. Unmasking Multivariate Outliers and Lever- age Points. Journal of the American Stati,st’ical Association,85(411):633-639, September 1990.

[516] D. W. Scott. Partial Mixture Estimation and Outlier Detection in Data and Regres- sion. In M. Hubert, G. Pison, A. Struyf, and S. V. Aelst, editors, Theory and’ Appli” cat’ions of Recent Robust Methods, Statistics for Industry and Technology. Birkhauser, 2003.

[517] S. Shekhar, C.-T. Lu, and P. Zhang. A Unified Approach to Detecting Spatial Outliers. G eo Inf orrnatica, 7 (2) :139-166, June 2003.

680 Chapter 10 Anomaly Detection

1518] M.-L. Shyu, S.-C. Chen, K. Sarinnapakorn, and L. Chang. A Novel Anomaly Detection Scheme Based on Principal Component Classifier. In Proc. of the 2003 IEEE Intl. Conf. on Data Mi,n’ing, pages 353 365, 2003.

1519] P. Sykacek. Equivalent error bars for neural network classifiers trained by bayesian inference. ln Proc. of the European Symposium on Artifi,cial Neural Networks, pages l2t 126, 1997.

f5201 N. Ye and Q. Chen. Chi-square Statistical Profiling for Anomaly Detection. In Proc. of the 2000 IEEE Workshop on Inforrnation Assurance and Securitg, pages 187 193, June 2000.

LO.7 Exercises

1 . Compare and contrast the different techniques for anomaly detection that were presented in Section I0.7.2. In particular, try to identify circumstances in which the definitions of anomalies used in the different techniques might be equivalent or situations in which one might make sense, but another would not. Be sure to consider different types of data.

Consider the following definition of an anomaly: An anomaly is an object that is unusually influential in the creation of a data model.

(a) Compare this definition to that of the standard model-based definition of an anomaly.

(b) For what sizes of data sets (small, medium, or large) is this definition appropriate?

In one approach to anomaly detection, objects are represented as points in a multidimensional space, and the points are grouped into successive shells, where each shell represents a layer around a grouping of points, such as a convex hull. An object is an anomaly if it lies in one of the outer shells.

(a) To which of the definitions of an anomaly in Section 10.1.2 is this definition most closely related?

(b) Name two problems with this definition of an anomaly.

Association analysis can be used to find anomalies as follows. Find strong asso- ciation patterns, which involve some minimum number of objects. Anomalies are those objects that do not belong to any such patterns. To make this more concrete, we note that the hyperclique association pattern discussed in Section 6.8 is particularly suitable for such an approach. Specifically, given a user- selected h-confidence level, maximal hyperclique patterns of objects are found. All objects that do not appear in a maximal hyperclique pattern of at least size three are classified as outliers.

2.

3 .

4 .

LO.7 Exercises 681

(a) Does this technique fall into any ofthe categories discussed in this chapter? If so, which one?

(b) Name one potential strength and one potential weakness ofthis approach.

Discuss techniques for combining multiple anomaly detection techniques to im- prove the identification of anomalous objects. Consider both supervised and unsupervised cases.

Describe the potential time complexity of anomaly detection approaches based on the following approaches: model-based using clustering, proximity-based, and density. No knowledge of specific techniques is required. Rather, focus on the basic computational requirements of each approach, such as the time required to compute the density of each object.

The Grubbs’ test, which is described by Algorithm 10.3, is a more statistically sophisticated procedure for detecting outliers than that of Definition 10.3. It is iterative and also takes into account the fact that the z-score does not have a normal distribution. This algorithm computes the z-score of each value based on the sample mean and standard deviation of the current set of values. The value with the largest magnitude z-score is discarded if its z-score is larger tlnan g., the critical value of the test for an outlier at significance level cr. This process is repeated until no objects are eliminated. Note that the sample mean, standard deviation, and g” are updated at each iteration.

Algorithm 10.3 Grubbs’ approach for outlier elimination. 1: Input the values and o

{m is the number of values, a is a parameter, and f. is a value chosen so that 61 : prob(r ) l.) for a t distribution with m – 2 degrees of freedom.) repeat

Compute the sample mean (r) and standard deviation (s”). Compute a value g. so that prob(lzl 2 9) : a.

(In terms of t. and m. 9.: # lF^-** I Compute the z-score of each value, i.e., z : (r – *)lt’. Let g: maxlzl, i.e., f ind the z-score of largest magnitude and call i t g. i f g > 9 . t h e n

Eliminate the value corresponding to g. m + m – I

end if until No objects are eliminated.

K

b .

2: 3 : 4:

o :

8: o .

10: 1 1 :

(a) What is the limit of the value #{H” used for Grubbs’ test as m

approaches infinity? Use a significance level of 0.05.

682 Chapter 10 Anomaly Detection

(b) Describe, in words, the meaning of the previous result.

Many statistical tests for outliers were developed in an environment in which a few hundred observations was a large data set. We explore the limitations of such approaches.

(a) For a set of 1,000,000 values, how likely are we to have outliers according to the test that says a value is an outlier if it is more than three standard deviations from the average? (Assume a normal distribution.)

(b) Does the approach that states an outlier is an object of unusually low probability need to be adjusted when dealing with large data sets? If so, how?

The probability density of a point x with respect to a multivariate normal distribution having a mean p and covariance matrix I is given by the equation

7 / , I – ( x – p t x – 1 ( x – ! ) P r o b l x l : – s – 2

(t/2r)-lEl’/’ (10 .8)

Using the sample mean x and covariance matrix S as estimates of the mean p and covariance matrix !, respectively, show that the logprob(x) is equal to the Mahalanobis distance between a data point x and the sample mean x plus a constant that does not depend on x.

compare the following two measures of the extent to which an object belongs to a cluster: (1) distance of an object from the centroid of its closest cluster and (2) the silhouette coefficient described in Section 8.5.2.

Consider the (relative distance) K-means scheme for outlier detection described in Section 10.5 and the accompanying figure, Figure 10.10.

(a) The points at the bottom of the compact cluster shown in Figure 10.10 have a somewhat higher outlier score than those points at the top of the compact cluster. Why?

(b) Suppose that we choose the number of clusters to be much larger, e.g., 10. Would the proposed technique still be effective in finding the most extreme outlier at the top of the figure? Why or why not?

(c) The use of relative distance adjusts for differences in density. Give an example of where such an approach might lead to the wrong conclusion.

If the probability that a normal object is classified as an anomaly is 0.01 and the probability that an anomalous object is classified as anomalous is 0.99, then

8.

9 .

10.

1 1 .

12,

LO.7 Exercises 683

what is the false alarm rate and detection rate if 99Ya of the objects are normal? (Use the definitions given below.)

number of anomalies detected detection rate :

false alarm rate

total number of anomalies

number of false anomalies

(10.e)

(10.10)

13.

number of objects classified as anomalies

When a comprehensive training set is available, a supervised anomaly detection technique can typically outperform an unsupervised anomaly technique when performance is evaluated using measures such as the detection and false alarm rate. However, in some cases, such as fraud detection, new types of anomalies are always developing. Performance can be evaluated according to the detection and false alarm rates, because it is usually possible to determine upon investiga- tion whether an object (transaction) is anomalous. Discuss the relative merits of supervised and unsupervised anomaly detection under such conditions.

Consider a group of documents that has been selected from a much larger set of diverse documents so that the selected documents are as dissimilar from one another as possible. If we consider documents that are not highly related (con-

nected, similar) to one another as being anomalous, then all of the documents that we have selected might be classified as anomalies. Is it possible for a data

set to consist only of anomalous objects or is this an abuse of the terminology?

Consider a set of points, where most points are in regions of low density, but a few points are in regions of high density. If we define an anomaly as a point in

a region of low density, then most points will be classified as anomalies. Is this

an appropriate use of the density-based definition of an anomaly or should the

definition be modified in some way?

Consider a set of points that are uniformly distributed on the interval 10,1]. Is

the statistical notion of an outlier as an infrequently observed value meaningful for this data?

An analyst applies an anomaly detection algorithm to a data set and finds a set of anomalies. Being curious, the analyst then applies the anomaly detection algorithm to the set of anomalies.

(a) Discuss the behavior of each of the anomaly detection techniques described in this chapter. (If possible, try this for real data sets and algorithms.)

(b) What do you think the behavior of an anomaly detection algorithm should be when applied to a set of anomalous objects?

14,

15 .

16 .

17 .

Linear Algebra

This appendix provides a brief introduction to linear alp;ebra with a focus on topics that are relevant to the material in this text. Wr begin by describing vectors, which can be used to represent both data objects and attributes. We then discuss matrices, which can be used both to reprer;ent data sets and to describe transformations on them.

A.1 Vectors

A.1.1 Def ini t ion

In Euclidean space, such as the ordinary two- and three-dimensional space with which we are familiar, a vector is a quantity that has magnitude and direction. It is traditionally represented as an arrow that has a length equal to its magnitude and an orientation given by its direction, Figure A.1(a) shows two vectors: vector u, which has a length of 1 and is parallel to the g axis, and vector v, which has a length of 2 and a direction c,f 45″ with respect to the r axis. (We shall use lowercase bold letters, such as u and v, to represent vectors. They are often also represented by italic lower(;ase letters, such as z and u.) Since a point can be regarded as a displacemenb from the origin in a particular direction, it can be represented by a vector from the origin to the point.