The singular value decomposition (SVD) of a matrix can also be expressed with the following equation. Note that while $u_i v_i^T$ might look like a dot product, it is not; it is an outer product, and the result is a rank 1, $m$ by $n$ matrix.

$$A = \sum_{i=1}^{\operatorname{rank}(A)} \sigma_i\, u_i v_i^T \qquad (A.16)$$

The importance of the above representation is that every matrix can be expressed as a sum of rank 1 matrices that are weighted by singular values. Since singular values, which are sorted in non-increasing order, often decline rapidly in magnitude, it is possible to obtain a good approximation of a matrix by using only a few singular values and singular vectors. This is useful for dimensionality reduction and will be discussed further in Appendix B.
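The rank-1 expansion and its truncation can be sketched with NumPy. This is an illustrative example on made-up data, not from the text: it rebuilds a matrix from the outer-product sum in Equation A.16 and keeps only the first few terms to form a low-rank approximation.

```python
import numpy as np

# Made-up data for illustration.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))

# Compact SVD: columns of U and rows of Vt are the singular vectors.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Reconstruct A exactly as sum_i sigma_i * u_i * v_i^T (Equation A.16).
A_full = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(s)))

# Keep only the k largest singular values for a rank-k approximation.
k = 2
A_k = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(k))
```

Because the singular values are sorted in non-increasing order, the spectral-norm error of the rank-`k` truncation is exactly the first singular value that was dropped.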


A.2.6 Matrices and Data Analysis

We can represent a data set as a data matrix, where each row is a data object and each column is an attribute. (We can, with equal validity, have attributes as rows and objects as columns.) Matrix representation provides a compact and well-structured representation for our data and permits the easy manipulation of the objects or attributes of the data through various matrix operations.

Systems of linear equations are one very common example of the usefulness of the matrix representation of data. A system of linear equations can be written as the matrix equation $Ax = b$ and solved using matrix operations.

$$\begin{aligned}
a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n &= b_1\\
a_{21}x_1 + a_{22}x_2 + \cdots + a_{2n}x_n &= b_2\\
&\;\;\vdots\\
a_{m1}x_1 + a_{m2}x_2 + \cdots + a_{mn}x_n &= b_m
\end{aligned}$$

In particular, if A has an inverse, the system of equations has a solution $x = A^{-1}b$. If not, then the system of equations has either no solution or an infinite number of solutions. Note that in this case, our rows (data objects) were equations and our columns were variables (attributes).
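A minimal sketch of this case, using a small invertible system chosen purely for illustration:

```python
import numpy as np

# A square, invertible system Ax = b (values made up for illustration).
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])

# The solution x = A^{-1} b, exactly as in the text.
x = np.linalg.inv(A) @ b

# In practice, np.linalg.solve is preferred: it avoids forming the
# inverse explicitly and is more numerically stable.
x_solve = np.linalg.solve(A, b)
```

Both approaches give the same answer here; forming `A^{-1}` explicitly is shown only because it mirrors the formula in the text.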

For many statistical and data analysis problems, we want to solve linear systems of equations, but these equations cannot be solved in the manner just described. For example, we may have a data matrix where the rows represent patients and the columns represent characteristics of the patients (height, weight, and age) and their response to a particular medication, e.g., a change in blood pressure. We want to express blood pressure (the dependent variable) as a linear function of the other (independent) variables, and we can write a matrix equation in much the same way as above. However, if we have more patients than variables (the usual case), the inverse of the matrix does not exist.

In this case, we still want to find the best solution for the set of equations. This means that we want to find the best linear combination of the independent variables for predicting the dependent variable. Using linear algebra terminology, we want to find the vector $Ax$ that is as close to $b$ as possible; in other words, we want to minimize $\|b - Ax\|$, which is the length of the vector $b - Ax$. This is known as the least squares problem. Many statistical techniques (e.g., linear regression, which is discussed in Appendix D) require the solution of a least squares problem. It can be shown that the least squares solution of the equation $Ax = b$ is $x = (A^T A)^{-1} A^T b$.
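The least squares formula can be sketched on made-up data (the patient scenario is only an interpretation for the shapes, not real measurements):

```python
import numpy as np

# Overdetermined system: more rows (objects) than columns (variables).
rng = np.random.default_rng(1)
A = rng.standard_normal((10, 3))   # e.g., 10 patients, 3 characteristics
b = rng.standard_normal(10)        # e.g., change in blood pressure

# Normal-equations solution x = (A^T A)^{-1} A^T b, as in the text.
x_normal = np.linalg.inv(A.T @ A) @ A.T @ b

# Equivalent, numerically safer library call.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
```

A useful sanity check on any least squares solution is that the residual $b - Ax$ is orthogonal to the columns of $A$.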

Singular value and eigenvalue decomposition are also very useful in analyzing data, particularly in the area of dimensionality reduction, which is discussed in Appendix B. Note that noise reduction can also occur as a side effect of dimensionality reduction.

While we have given a few examples of the application of linear algebra, we have omitted many more. Examples of other areas where linear algebra is important in the formulation and solution of problems include solving systems of differential equations, optimization problems (such as linear programming), and graph partitioning.

A.3 Bibliographic Notes

There are many books that provide good coverage of linear algebra, including those by Demmel [521], Golub and Van Loan [522], and Strang [523].

Bibliography

[521] J. W. Demmel. Applied Numerical Linear Algebra. SIAM Press, September 1997.

[522] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, 3rd edition, November 1996.

[523] G. Strang. Linear Algebra and Its Applications. Harcourt Brace & Company, Orlando, FL, 3rd edition, 1986.

Appendix B Dimensionality Reduction

This appendix considers various techniques for dimensionality reduction. The goal is to expose the reader to the issues involved and to describe some of the more common approaches. We begin with a discussion of Principal Components Analysis (PCA) and Singular Value Decomposition (SVD). These methods are described in some detail since they are among the most commonly used approaches and we can build on the discussion of linear algebra in Appendix A. However, there are many other approaches that are also employed for dimensionality reduction, and thus, we provide a quick overview of several other techniques. We conclude with a short review of important issues.

B.1 PCA and SVD

PCA and SVD are two closely related techniques. For PCA, the mean of the data is removed, while for SVD, it is not. These techniques have been widely used for decades in a number of fields. In the following discussion, we will assume that the reader is familiar with linear algebra at the level presented in Appendix A.

B.1.1 Principal Components Analysis (PCA)

The goal of PCA is to find a new set of dimensions (attributes) that better captures the variability of the data. More specifically, the first dimension is chosen to capture as much of the variability as possible. The second dimension is orthogonal to the first, and, subject to that constraint, captures as much of the remaining variability as possible, and so on.


PCA has several appealing characteristics. First, it tends to identify the strongest patterns in the data. Hence, PCA can be used as a pattern-finding technique. Second, often most of the variability of the data can be captured by a small fraction of the total set of dimensions. As a result, dimensionality reduction using PCA can result in relatively low-dimensional data and it may be possible to apply techniques that don’t work well with high-dimensional data. Third, since the noise in the data is (hopefully) weaker than the patterns, dimensionality reduction can eliminate much of the noise. This is beneficial both for data mining and other data analysis algorithms.

We briefly describe the mathematical basis of PCA and then present an example.

Mathematical Details

Statisticians summarize the variability of a collection of multivariate data, i.e., data that has multiple continuous attributes, by computing the covariance matrix S of the data.

Definition B.1. Given an m by n data matrix D, whose m rows are data objects and whose n columns are attributes, the covariance matrix of D is the matrix S, which has entries $s_{ij}$ defined as

$$s_{ij} = \operatorname{covariance}(d_{*i}, d_{*j}). \qquad (B.1)$$

In words, $s_{ij}$ is the covariance of the ith and jth attributes (columns) of the data.

The covariance of two attributes is defined in Appendix C, and is a measure of how strongly the attributes vary together. If $i = j$, i.e., the attributes are the same, then the covariance is the variance of the attribute. If the data matrix D is preprocessed so that the mean of each attribute is 0, then $S = D^T D$.
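This can be sketched on made-up data. Note one assumption in the comparison below: the library's sample covariance divides by $m - 1$, so $D^T D$ on centered data matches `np.cov` only after that normalization.

```python
import numpy as np

# Made-up data matrix: m = 100 objects (rows), n = 3 attributes (columns).
rng = np.random.default_rng(2)
D = rng.standard_normal((100, 3))

# Preprocess so the mean of each attribute (column) is 0.
D0 = D - D.mean(axis=0)

# Covariance matrix as in the text: S = D^T D (an n-by-n symmetric matrix).
S = D0.T @ D0

m = D.shape[0]
# Dividing by (m - 1) gives the usual sample covariance matrix.
```

The diagonal entries of `S / (m - 1)` are the sample variances of the attributes, matching the $i = j$ case described above.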

A goal of PCA is to find a transformation of the data that satisfies the following properties:

1. Each pair of new attributes has 0 covariance (for distinct attributes).

2. The attributes are ordered with respect to how much of the variance of the data each attribute captures.

3. The first attribute captures as much of the variance of the data as possible.


4. Subject to the orthogonality requirement, each successive attribute captures as much of the remaining variance as possible.

A transformation of the data that has these properties can be obtained by using eigenvalue analysis of the covariance matrix. Let $\lambda_1, \ldots, \lambda_n$ be the eigenvalues of S. The eigenvalues are all non-negative and can be ordered such that $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_{n-1} \geq \lambda_n$. (Covariance matrices are examples of what are called positive semidefinite matrices, which, among other properties, have non-negative eigenvalues.) Let $U = [u_1, \ldots, u_n]$ be the matrix of eigenvectors of S. These eigenvectors are ordered so that the ith eigenvector corresponds to the ith largest eigenvalue. Finally, assume that the data matrix D has been preprocessed so that the mean of each attribute (column) is 0. We can make the following statements.

• The data matrix $D' = DU$ is the set of transformed data that satisfies the conditions posed above.

• Each new attribute is a linear combination of the original attributes. Specifically, the weights of the linear combination for the ith attribute are the components of the ith eigenvector. This follows from the fact that the jth column of $D'$ is given by $Du_j$ and the definition of matrix-vector multiplication given in Equation A.12.

• The variance of the ith new attribute is $\lambda_i$.

• The sum of the variance of the original attributes is equal to the sum of the variance of the new attributes.

• The new attributes are called principal components; i.e., the first new attribute is the first principal component, the second new attribute is the second principal component, and so on.

The eigenvector associated with the largest eigenvalue indicates the direction in which the data has the most variance. In other words, if all of the data vectors are projected onto the line defined by this vector, the resulting values would have the maximum variance with respect to all possible directions. The eigenvector associated with the second largest eigenvalue is the direction (orthogonal to that of the first eigenvector) in which the data has the largest remaining variance.

The eigenvectors of S define a new set of axes. Indeed, PCA can be viewed as a rotation of the original coordinate axes to a new set of axes that are aligned with the variability in the data. The total variability of the data is preserved, but the new attributes are now uncorrelated.
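The whole procedure can be sketched end to end on made-up data: center the data, eigen-decompose the covariance matrix, sort the eigenpairs by decreasing eigenvalue, and rotate the data with $D' = DU$.

```python
import numpy as np

# Made-up 2-attribute data with correlated columns, for illustration.
rng = np.random.default_rng(3)
D = rng.standard_normal((200, 2)) @ np.array([[3.0, 1.0],
                                              [0.0, 0.5]])

# Preprocess so the mean of each attribute is 0.
D0 = D - D.mean(axis=0)

# Sample covariance matrix of the centered data.
S = (D0.T @ D0) / (D.shape[0] - 1)

# eigh handles symmetric matrices; it returns eigenvalues in
# ascending order, so reorder them to be non-increasing.
vals, U = np.linalg.eigh(S)
order = np.argsort(vals)[::-1]
vals, U = vals[order], U[:, order]

# Rotate the data onto the new axes: D' = DU.
D_new = D0 @ U
```

The checks that matter here are exactly the properties listed above: the new attributes are uncorrelated (the covariance of `D_new` is diagonal), the variance of the ith new attribute is the ith eigenvalue, and the total variance is preserved.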


Figure B.1. Using PCA to transform the data: (a) original points; (b) points after transformation.