To a large extent, the answer depends on the kind of data and data dis- tributions that can be analyzed by the dimensionality reduction approach. Techniques such as PCA, SVD, and factor analysis assume that there is a lin- ear relationship between the old and new sets of attributes. Although this may

716 Appendix B Dimensionality Reduction

be approximately true in many cases, there are many cases where a non-linear

approach is necessary. In particular, algorithms such as ISOMAP and LLE

have been developed to deal with nonlinear relationships. The time and space complexity of dimensionality reduction algorithms is

a key issue. Most of the algorithms that we have discussed have time and/or

space complexity of O(m2) or higher, where rn is the number of objects. This

limits their applicability to larger data sets, although sampling can sometimes be used quite effectively. FastMap is the only algorithm presented here that

has linear time and space complexity. Another important aspect of dimensionality reduction algorithms is whether

they produce the same answer every time they are run. PCA, SVD, and LLE

do. Factor analysis and the MDS techniques can produce different answers

on different runs. Many of the techniques that we did not discuss also have

this characteristic because they try to optimize some objective, and this re- quires a search that may become trapped in a local minimum. Search-based approaches can also have poor time complexity.

Finally, a key issue is determining the number of dimensions for the di- mensionality reduction. The techniques that we have considered can typically perform a dimensionality reduction to almost any number of dimensions. The goodness of the reduction is typically measured by some quantity that can be plotted, as in a scree plot. In some cases, this curve provides a clear indication of the intrinsic dimensionality. In many other situations, a choice needs to

be made between a smaller number of dimensions and a larger approximation error) and a smaller approximation error and more dimensions.

B.3 Bibliographic Notes

Dimensionality reduction is a broad topic, and the relevant references are

scattered across many fields. A comprehensive discussion of PCA can be found in the book by Jolliffe [531], while an introduction to SVD is given by Demmel [527] and other linear algebra texts. Kernel PCA is described by

Scholkopf et al. [534]. Many books on multivariate statistical analysis, such

as that by Anderson 15241, also include discussions on PCA, as well as factor

analysis. More details on MDS can be found in the book by Kruskal and

Wish [532]. The FastMap algorithm was proposed by Faloutsos and Lin [529]. The papers for LLE (Roweis and Saul [535]) and ISOMAP (Tenenbaum et al.

[533]) appeared in the same issue of Sc’ience. MATLAB code for the ISOMAP and LLtr algorithms is available on the Web. Other articles that may be of

BIBLIOGRAPHY 7L7

interest include those by M. Belkin and P. Niyogi [525], Donoho and Grimes

[528], and Ye et al. [536, 537] There are many other techniques that are often used for dimensionality

reduction or are strongly related to it. These areas include principal curves and surfaces, non-linear PCA (including neural network approaches), vector quantization, random projections, Independent Components Analysis (ICA), Self-Organizing Maps (SOM), projection pursuit, regression-based approaches, genetic algorithms, and optimization-based approaches such as simulated or deterministic annealing. Descriptions of these areas and additional references can be found in two surveys on dimensionality reduction by Fodor [530] and Carreira-Perpinan [526]. SOM is discussed in Section 9.2.3.

Bibliography 1524] T. W. Anderson. An Introduct’ion to Multi,uariate Statistical Analysis. Wiley, 2nd

edition, July 2003.

15251 M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Technical Report TR 2002-01, Department of Computer Science and Statistics, University of Chicago, Jarluary 2002.

1526] M. A. Carreira-Perpinan. A Review of Dimension Reduction Techniques. Technical Report CS-96 09, Dept. of Computer Science, University of Sheffield, January 1997.

15271 J. W. Demmel. Appli,ed Numer,ical L,inear Algebra. SIAM Press, September 1997.

15281 D. L. Donoho and C. Grimes. Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data. PNAS, 100(10):5591-5596, 2003.

[529] C. Faloutsos and K.-I. Lin. FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets. In Proc. of the 1995 ACM SIGMOD IntI. Conf. on Management of Data, pages 163-174, San Jose, California, June 1995.

[530] I. K. Fodor. A survey of dimension reduction techniques. Technical Report UCRL-ID- t48494, LLNL, June 2002

f5311 L T. Jolliffe. Principal Component Anallls’is. Springer-Verlag, 2nd edition, October 2002.

[532] J. B. Kruskal and M. Wish. Multi.dzmensional Scali,ng. SAGE Publications, January 7978.

1533] S. T. Roweis and L. K. Saul. Nonlinear Dimensionality Reduction by Locally Linear Embedding. S cience, 290 ( 5500) : 232 3-2326, 2000.

[534] B. Schcilkopf, A. J. Smola, and K.-R. Miiller. Nonlinear Component Analysis as a Kernel Eigenvalue Problem. N eural C omputat’ion, 10(5) : 1299-1319, 1998.

[535] J. B. Tenenbaum, V. d. Silva, and J. C. Langford. A Global Geometric Framework for Nonlinear Dimensionality Reduction S cience, 290(5500) :2319-2323, 2OOO.

[536] J. Ye, R. Janardan, and Q. Li. GPCA: an efficient dimension reduction scheme for image compression and retrieval. In Proc. of the 1Oth Intl. Conf. on Knowledge Discouery and” Data Min’ing, pages 354 363, Seattle, Washington, August 2004. ACM.

[537] J. Ye, Q. Li, H Xiong, H. Park, R. Janardan, and V. Kumar. IDR/QR: an incremental dimension reduction algorithm via QR decomposition. In Proc. of the 1Oth Intl. Conf. on Knowledge Dzscouery and Data Mining, pages 364-373, Seattle, Washington, 2004. ACM.

Probability and Statistics This appendix presents some of the basic concepts in probability and statistics used throughout this book.

C.1 Probability

A random experirnent is the act of measuring a process whose outcome is uncertain. Examples include rolling a die, drawing from a deck of cards, and monitoring the types of traffic across a network router. The set of all possible outcomes of a random experiment is known as the sample space, f) . For example, O: {1,2,3,4,5,6} is the sample space for rol l ing a die. An event .D corresponds to a subset of these outcomes, i.e., E e f). For example E : {2,4,6} is the event of observing an even number when rolling a die.

A probability P is a real-valued function defined on the sample space Q that satisfies the following properties:

1. For any event E C 0, 0 < P(E) < 1.

2 . P ( Q ) : 7 .

3. For any set of disjoint events, Er, E2, . . . , Ep e {‘1,

k k pt l I R. ‘ r : \ – prn. t. ,Y ” , ‘ –

, ? t \ u t r .

The probability of an event E, which is written as P(E), is the fraction of times event -E is observed in a potentially unlimited number of experiments.

72O Appendix C Probability and Statistics

In a random experiment, there is often a quantity of interest we want to measure; e.g., counting the number of times a tail turns up when tossing a coin fifty times or measuring the height of a person taking a roller coaster ride at a theme park. Since the value of the quantity depends on the outcome of a random experiment, the quantity of interest is known as a random variable. The value of a random variable can be discrete or continuous. A Bernoulli random variable, for example, is a discrete random variable whose only possible

values are 0 and 1. For a discrete random variable X, the probability X takes on a particular

value z is given by the total probability of all outcomes e in which X(e) : v:

P(X : u) : P(A : {e le e Q, X(e) : , } ) . (c .1)

The probability distribution of a discrete random variable X is also known as its probability mass function.

Example C.l-. Consider a random experiment where a fair coin is tossed four times. There are 16 possible outcomes of this experiment: HHHH, HHHT, HHTH, HTHH, THHH, HHTT, HTHT, THHT, HTTH, THTH, TTHH, HTTT, THTT, TTHT, TTTH, and TTTT, where H (T) indicates that a head (tail) is observed. Let X be a random variable that measures the number of times a tail is observed in the experiment. The five possible values fot X are 0, 7) 2) 3, and 4. The probability mass function for X is given by the following table:

X 0 1 2 3 4 P(X) r l76 4l16 6176 4l16 1116

For example, P(X – 2) :6/16 because there are six outcomes in which the tail is observed twice during the four tosses. r

On the other hand, if X is a continuous random variable, then the proba- bility that X has a value between a and b is

P(o< r<b ) : f (r)dr

The function f (r) is known as the probability density cause / is a continuous distribution, the probability that value r is always zero.

(c.2)

function (pdf). Be- X takes a particular

C.1 Probabilitv 72L

Table C.1. Examples of probability functions. (f (n + 1) : nf (n) and f (1) : 1)

Probability F\rnction Parameters

Gaussian , t \ f , P ) 2

plrl : –+- exp ‘z o V z T o 11, o

Binomial p ( r ) : ( 7 ) p ” ( t – p ) n – ” n r P

Poisson p(r) : j.0″ exp o 0

Exponential P(r) : 0 exP-o’ 0

Gamma p(r) : { } r ‘ – t

exp-)z \ , d

Chi-square I b l u _ t r / ) P\:t ) : 2ro1-n7O”c'”1

– €xP – ‘ –

Table C.1 shows some of the well-known discrete and continuous proba- bility functions. The notion of a probability (mass or density) function can be extended to more than one random variable. For example, 1f X and Y are random variables, then p(X,Y) denotes their joint probability function. The random variables are independent of each other if P(X,Y) : P(X) x P(Y). If two random variables are independent, it means that the value for one vari- able has no impact on the value for the other.

Conditional probability is another useful concept for understanding the dependencies among random variables. The conditional probability for vari- able Y given X, denoted as P()’lX), is defined as

Pff tx1: P-(4J) , P(X)

(c .3)

If X and Y are independent, then P(YIX) : P(Y). The conditional proba- bilities P(YIX) and P(XIY) can be expressed in terms of one another using a formula known as the Baves theorem:

P(Ylx) : P(xlY)P(Y) (c.4) P(X)

If {X1, Xz,…,Xn} is the set of mutually exclusive and exhaustive outcomes of a random variable X, then the denominator of the above equation can be

C.1.1 Expected Values

The expected value of a function g of a random variable X, denoted as Elg(X)], is the weighted-average value of g(X), where the weights are given

by the probability function for X. If X is a discrete random variable, then the exoected value can be computed as follows:

722 Appendix C

expressed as follows:

Probabilitv and Statistics

k k

P(x) : I P(x,Y) : t P(xlV)P(Y). i :7 i : r

Equation C.5 is called the law of total probability.

nlg(x l : T

s@6)P(x : ,0) .

On the other hand, if X is a continuous random variable,

nlg(x)l : I:s@) f

(x)d.x,

px: ElXl: t

r i P(X : ro).

o2* : o11x – t tx) ‘ l : \ ( r i – I rx)2 P(x : r i ) .

(c.5)

(c.6)

(c.8)

(c.e)

(c.7)

where /(X) is the probability density function for X. The remainder of this section considers only the expected values for discrete random variables. The corresponding expected values for continuous random variables are obtained by replacing the summation with an integral.

There are several particularly useful expected values in probability theory. First , i f g(X): X, then

This expected value corresponds to the mean value of the random variable X. Another useful expected value is when g(X) : (X – p,y). The expected value of this function is

This expected value corresponds to the variance of the random variable X. The square root of the variance corresponds to the standard deviation of the random variable X.

C.2 Statistics 723