Example B.1 (Two-Dimensional Data). We illustrate the use of PCA

for aligning the axes in the directions of the maximum variability of the data.

Figure B.1 shows a set of 1000 two-dimensional data points, before and after

a PCA transformation. The total variance for the original set of points is the

sum of the variance of the r and g attributes, which is equal to 2.84 -t 2.95 :

5.79. After transformation, the variance is 4.81 + 0.98 : 5.79. I

Example B.2 (Iris Data). This example uses the Iris data set to demon-

strate the use of PCA for dimensionality reduction. This data set contains

150 data objects (flowers); there are 50 flowers from each of three different

Iris species: Setosa, Versicolour, and Virginica. Each flower is described by

four attributes: sepal length, sepal width, petal length, and petal width. See

Chapter 3 for more details. Figure 8.2(a) shows a plot of the fraction of the overall variance accounted

for by each eigenvalue (principal component) of the covariance matrix. This

type of plot is known as a scree plot and is useful for determining how many principal components need to be kept to capture most of the variability of the

data. For the Iris data, the first principal component accounts for most of

the variation (92.5%), the second for only 5.3T0, and the last two components for just 2.2%0. Thuq keeping only the fi.rst two principal components preserves

most of the variability in the data set. Figure 8.2(b) shows a scatter plot of the

Iris data based on the first two principal components. Note that the Setosa

flowers are well separated from the Versicolour and Virginica flowers. The

latter two sets of flowers, while much closer to each other, are still relatively

well separated.

o o o G

o

o o E Lr

1

0.9

0.8

n ‘7

0.6

A E

0.4

0.3

0.2

0 .1

0

B.1 PCA and SVD 7O5

1 2 3 Eigenvalue

(a) Fraction of variance accounted for by each principal component.

* * * E o – * * – t r

* * E E

*

t – 2 – 1 0 1 2

First Principal Componet

(b) Plot of first two principal components of lris data.

Figure 8.2, PCA applied to the lris data set.

706 Appendix B Dimensionalitv Reduction

8.T .2 SVD

PCA is equivalent to an SVD analysis of the data matrix, once the mean

of each variable has been removed. Nonetheless, it is informative to look at

dimensionality reduction from the SVD point of view, since it is not always

desirable to remove the mean from data, especially if the data is relatively sparse.

Mathematical Details

From Appendix A, we know that an m by n matrix A can be written as

rank(A)

A- t oppf , :UEV”. i : l

(B.2)

where oi is the ith singular value of A (the ith diagonal entry of )), u6 is the ith left singular vector of A (the ith column of U), and the v; is the i.th right singular vector of A (the frh column of V). (See Section A.2.5.) An SVD decomposition of a data matrix has the following properties.

o Patterns among the attributes are captured by the right singular vectots, i.e., the columns of V.

o Patterns among the objects are captured by the left singular vectors, i.e., the columns of IJ.

o A matrix A can be successively approximated in an optimal manner by taking, in order, the terms of Equation B.2. We do not explain what we mean by optimal, but refer the reader to the bibliographic notes. Informally, the larger a singular value, the larger the fraction of a matrix that is accounted for by the singular value and its associated singular vectors.

o To obtain a new data matrix with k attributes, we compute the matrix D/: Dx [v1 ,v2,. ‘ . ,vk] . I t might seem from the previous discussion that we would take the matrix that results from the first k terms of Equation A.12. However, while the resulting matrix is of rank k, it still has n columns (attributes).

Example B.3 (Document Data). SVD decomposition can be used to an- alyze document data. The data for this example consists of 3204 newspaper

B.1 PCA and SVD 7O7

articles from the Los Angeles T’imes. These articles come from 6 different sec- tions: Entertainment, Financial, Foreign, Metro, National, and Sports. The data matrix is a document-term matrix, where each row represents a docu- ment and each column is a term (word). The value of the i,jth entry is the number of times the jth term occurs in the iih document. The data was pro- cessed using standard techniques to remove common words, to adjust for the different frequencies with which terms appear? and to adjust for the different lengths of documents. (See Section 2.3.7 for more details.)

An SVD analysis of the data was performed to find the first 100 singular values and vectors. (For many data sets, it is too expensive to find a full SVD or PCA decomposition and often pointless since relatively few of the singular values or eigenvalues are required to capture the structure of the matrix.) The largest singular value is associated with common terms that are frequent, but not eliminated by the preprocessing. (It can happen that the strongest patterns represent noise or uninteresting patterns.)

However, the patterns associated with other singular values were more interesting. For example, the following are the top 10 terms (words) associated with the strongest components in the second right singular vector:

game, score, lead, team, plaj , rebound, season, coach, league, goal

These are all terms associated with sports. Not surprisingly, the documents associated with the strongest components of the second left singular vector are predominantly from the Sports section

The top 10 terms associated with the strongest components in the third right singular vector are the following:

earn, mi l l ion, quarter, bank, rose, bi l l ion, stock, company, corporation, revenue

These are all financial terms, and, not surprisingly, the documents associated with the strongest components in the third left singular vector are predomi- nantly from the Financial section.

We reduced the dimensionality of the data using the second and third singular vectors, i.e., D/: D x [vz,ve] . In other words, all documents were expressed in terms of two attributes, one relating to Sports and one relating to Finance. A scatter plot of documents is given by Figure B.3. For clarity, non- Sports, non-Financial documents have been eliminated. The Sports documents are shown in a lighter shade of gray, while the Financial documents are a darker gray. The two different categories of documents are well separated for

708 Appendix B Dimensionality Reduction

Documents

” i .

-o.2 -0.1 0 0.1 0.2 0.3 0.4 Second Singular Component

Figure B.3. Plot of Sports and Financial documents from the LA Times using the second and third singular values.

the most part. Indeed, the Sports documents do not vary much with respect to the Financial variable (component 3) and the Financial documents do not vary much with respect to the Sports variable (component 2).

8.2 Other Dimensionality Reduction Techniques

In this section, we review a few other dimensionality reduction techniques. These techniques will be discussed more briefly, with a focus on their general motivation and approach.

8.2.I Factor Analysis

For PCA and SVD, the new attributes that are produced are linear combina- tions of the original variables. With factor analysis, the goal is to express the original variables as linear combinations of a small number of hidden or la- tent attributes. The motivation is based on the following observation. Often there are characteristics of data objects that are hard to measure directly, but that seem to be related to measurable characteristics. One common example is intelligence and performance on various types of IQ tests. Another common

0.6

E o t c

E o n e

E f

.= v.1 a

c

a a

a a ‘ ,

8.2 Other Dimensionality Reduction Techniques 7Og

example is the connection between performance in various athletic events and an athlete’s speed and strength. If a small number of attributes can be found that group and summarize the original attributes, then we will have achieved both a reduction in dimensionality and an increase in our understanding of the data.

The motivation for factor analysis is sometimes also explained in terms of the covariance or correlation matrix of the data. Suppose that a group of attributes are not very highly correlated to other attributes, but are strongly correlated to one another, perhaps because they measure the same underlying quantity. In this case, it would seem desirable to develop techniques that could find a single underlying attribute that summarizes each such group.