For example, consider a data set that records the performance of a group of athletes in the ten separate events that comprise the decathlon. We might find that athletes tend to show the same performance in all events that emphasize speed; i.e., slow athletes are consistently slow and fast athletes are consistently fast. Likewise, we might find that an athlete's behavior in an event that requires strength indicates how he or she will perform in another event that emphasizes strength. Hence, we might hypothesize that an athlete's performance in any given event is really determined by the nature of the event and two underlying factors: speed and strength. Factor analysis attempts to discover such relationships.
More formally, let $\mathbf{f}_1, \mathbf{f}_2, \ldots, \mathbf{f}_p$ be the latent factors, i.e., the underlying or hidden attributes. Note that these are the new attributes and have a value for each object. If the original data matrix is $\mathbf{D}$, an $m$ by $n$ matrix, then the new data matrix is $\mathbf{F} = [\mathbf{f}_{*1}, \mathbf{f}_{*2}, \ldots, \mathbf{f}_{*p}]$, which is an $m$ by $p$ matrix. (Note that $\mathbf{f}_{*j} = \mathbf{f}_j$.) The $ij^{th}$ entry of $\mathbf{F}$ is $f_{ij}$, the $j^{th}$ component of $\mathbf{f}_i$.
Assume that the mean of each attribute is 0. If $\mathbf{d}_{i*}$ is the $i^{th}$ row of the original data matrix $\mathbf{D}$, then $\mathbf{f}_{i*}$ is the corresponding row of the new data matrix, $\mathbf{F}$. The standard factor analysis model assumes the following relationship between the old and new data objects:
$$\mathbf{d}_{i*}^{T} = \boldsymbol{\Lambda} \mathbf{f}_{i*}^{T} + \boldsymbol{\epsilon}$$

or equivalently by

$$d_{ij} = \lambda_{j1} f_{i1} + \lambda_{j2} f_{i2} + \cdots + \lambda_{jp} f_{ip} + \epsilon_j.$$
$\boldsymbol{\Lambda}$, which has entries $\lambda_{kl}$, is an $n$ by $p$ matrix of factor loadings that indicate, for each of the original attributes, how its value depends on the latent factors, i.e., the new attributes. To illustrate, in the decathlon
710 Appendix B Dimensionality Reduction
example, there would be two latent factors: speed and strength. These correspond to columns of $\mathbf{F}$. Each athlete would be represented by a row of $\mathbf{F}$ with entries recording the athlete's speed and strength. Each column of $\mathbf{D}$ would correspond to one of the ten events of the decathlon, while each row again corresponds to an athlete. The $ij^{th}$ entry of $\mathbf{D}$ is the performance of the $i^{th}$ athlete in the $j^{th}$ event. $\boldsymbol{\Lambda}$ would be a 10 by 2 matrix. If the first column of $\mathbf{D}$ records the performance of the athletes on the 100-meter dash, then the performance of athlete $i$ in the 100-meter dash is written as $d_{i1} = \lambda_{11} f_{i1} + \lambda_{12} f_{i2}$, where $f_{i1}$ is a value indicating the speed of athlete $i$, and $f_{i2}$ is a value indicating the strength of athlete $i$. $\lambda_{11}$ and $\lambda_{12}$ indicate how an athlete's speed and strength, respectively, should be weighted to predict an athlete's performance in the 100-meter dash. We would expect that $\lambda_{11}$ would be relatively large compared to $\lambda_{12}$. Note that these weights are the same across all objects (athletes).
Since all latent factors are involved in the determination of the value of any original attribute, they are known as common factors. $\boldsymbol{\epsilon}$ is an error term that accounts for the portion of the attributes that is not accounted for by the common factors, and hence, the components of $\boldsymbol{\epsilon}$ are known as the specific factors.
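The model can be sketched numerically: if the factor scores form an $m$ by $p$ matrix and the loadings an $n$ by $p$ matrix, the data matrix is (up to the specific factors) the product of the scores and the transposed loadings. The sizes and values below are illustrative assumptions, not real decathlon data.

```python
import numpy as np

rng = np.random.default_rng(0)

m, n, p = 5, 10, 2                    # 5 athletes, 10 events, 2 latent factors
F = rng.normal(size=(m, p))           # factor scores: speed and strength per athlete
Lam = rng.normal(size=(n, p))         # loadings: how each event weights the factors
eps = 0.01 * rng.normal(size=(m, n))  # specific factors (small noise)

D = F @ Lam.T + eps                   # model: d_ij = sum_k lambda_jk * f_ik + eps_ij

# Only the specific factors remain after removing the common-factor part.
print(np.abs(D - F @ Lam.T).max())    # small
```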
Example B.4 (Factor Analysis of Iris Data). This example is based on the Iris data set. For this data, only a single factor could be found. The flowers in the Iris data set are organized so that the first 50 flowers are of species Setosa, the second 50 are Versicolour, and the last 50 are Virginica. This single factor (attribute) is plotted against flower as shown in Figure B.4. This factor seems to capture the distinction among the three species.
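A single-factor analysis of the Iris data along these lines can be sketched with scikit-learn's `FactorAnalysis`; the choice of scikit-learn is ours, since the text does not specify an implementation.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import FactorAnalysis

X = load_iris().data                          # 150 flowers x 4 attributes
fa = FactorAnalysis(n_components=1, random_state=0)
scores = fa.fit_transform(X)                  # one latent factor value per flower

print(scores.shape)                           # (150, 1)
```

Plotting `scores` against flower index should show the species-by-species pattern described in the example.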
B.2.2 Locally Linear Embedding (LLE)
LLE is a technique for dimensionality reduction based on the idea of analyzing overlapping local neighborhoods in order to determine the local structure. The LLE algorithm is given below.
Algorithm B.1 LLE algorithm.
1: Find the nearest neighbors of each data point.
2: Express each point $\mathbf{x}_i$ as a linear combination of the other points, i.e., $\mathbf{x}_i = \sum_j w_{ij} \mathbf{x}_j$, where $\sum_j w_{ij} = 1$ and $w_{ij} = 0$ if $\mathbf{x}_j$ is not a near neighbor of $\mathbf{x}_i$.
3: Find the coordinates of each point in a lower-dimensional space of specified dimension $p$ by using the weights found in step 2.
B.2 Other Dimensionality Reduction Techniques 711
Figure B.4. Plot of the flowers of the Iris data set versus the single latent factor.
In step 2, the weight matrix $\mathbf{W}$, whose entries are $w_{ij}$, is found by minimizing the squared approximation error as measured by the following equation:

$$\mathrm{error}(\mathbf{W}) = \sum_i \left\| \mathbf{x}_i - \sum_j w_{ij} \mathbf{x}_j \right\|^2$$

$\mathbf{W}$ can be found by solving a least squares problem. (Such problems were discussed in Appendix A.)
Step 3 performs the actual dimensionality reduction. Given the weight matrix and a number of dimensions, $p$, specified by the user, the algorithm constructs a "neighborhood preserving embedding" of the data into the lower-dimensional space. If $\mathbf{y}_i$ is the vector in the lower-dimensional space that corresponds to $\mathbf{x}_i$ and $\mathbf{Y}$ is the new data matrix whose $i^{th}$ row is $\mathbf{y}_i$, then this can be accomplished by finding a $\mathbf{Y}$ that minimizes the following equation:

$$\mathrm{error}(\mathbf{Y}) = \sum_i \left\| \mathbf{y}_i - \sum_j w_{ij} \mathbf{y}_j \right\|^2$$
Figure B.5. Plot of the flowers of the Iris data set based on two new attributes from LLE.
Example B.5. The use of LLE for dimensionality reduction is illustrated using the Iris data set. Specifically, the data was projected to two dimensions. A neighborhood of 30 points was used. A scatter plot of the projected data is shown in Figure B.5. The data can also be projected to one dimension. In that case, it looks much like Figure B.4.
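This projection can be reproduced with scikit-learn's `LocallyLinearEmbedding` (our choice of implementation; the neighborhood size of 30 matches the example):

```python
from sklearn.datasets import load_iris
from sklearn.manifold import LocallyLinearEmbedding

X = load_iris().data
lle = LocallyLinearEmbedding(n_neighbors=30, n_components=2)
Y = lle.fit_transform(X)      # rows of Y are the low-dimensional points y_i

print(Y.shape)                # (150, 2)
```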
B.2.3 Multidimensional Scaling, FastMap, and ISOMAP
Multidimensional scaling is a technique that is often used for dimensionality reduction. A number of variations of this technique have been proposed, but the general strategy of these techniques is the same: Find a projection of the data to a lower-dimensional space that preserves pairwise distances as well as possible, as measured by an objective function. Because of this strategy, MDS starts from a dissimilarity matrix, and thus, can be used even for data that does not originally have a vector space representation, e.g., strings.
Standard MDS Techniques
We begin by describing the classical MDS approach for projecting data to a $p$-dimensional space. Assume that we are given a distance matrix $\mathbf{D}$, where the entry $d_{ij}$ is the distance between the $i^{th}$ and $j^{th}$ objects. Let $d'_{ij}$ be the
distance between the objects after they have been transformed. Classical MDS tries to assign each object to a p-dimensional point such that a quantity called stress is minimized, where stress is defined as
$$\mathrm{stress} = \sqrt{\frac{\sum_{ij} \left( d'_{ij} - d_{ij} \right)^2}{\sum_{ij} d_{ij}^2}} \qquad \text{(B.7)}$$
The classical version of MDS is an example of metric MDS techniques, which assume that the dissimilarities are continuous variables (interval or ratio). Non-metric MDS techniques assume that the data is categorical (or, at best, ordinal). We will not discuss the details of these algorithms, except to say that the typical approach is to initially assign objects to $p$-dimensional points in some manner and then try to modify the points to reduce the stress.
When classical MDS or some of the other standard variants of MDS are applied to the Iris data set, they yield almost the same results as shown in Figure B.2. Indeed, classical MDS for Euclidean distance is equivalent to PCA.
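The classical MDS computation, and its distance-preserving behavior for Euclidean input, can be sketched with a double centering followed by an eigendecomposition; the function name is ours.

```python
import numpy as np

def classical_mds(D, p):
    """Embed objects into p dimensions given a matrix D of pairwise distances."""
    m = D.shape[0]
    J = np.eye(m) - np.ones((m, m)) / m   # centering matrix
    B = -0.5 * J @ (D ** 2) @ J           # double-centered squared distances
    vals, vecs = np.linalg.eigh(B)        # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:p]      # keep the p largest eigenvalues
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))

# For Euclidean distances, embedding into the original dimension reproduces
# the distances exactly (and matches PCA up to rotation and sign).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
Y = classical_mds(D, 3)
D2 = np.sqrt(((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1))
print(np.abs(D - D2).max())               # close to zero
```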
A recent development in the area of MDS is the algorithm FastMap. It has the same goal as other MDS techniques, but has two important differences.

• It is faster (linear complexity).

• It can operate incrementally.
The FastMap algorithm identifies a pair of objects and then computes the distance of each remaining object in this direction. This can be accomplished using only pairwise distances by employing certain facts of geometry, namely, the law of cosines. This distance is taken as the value of the first attribute. The objects are then projected onto an (n – 1)-dimensional subspace. Again, this can be performed using only pairwise distances. The process is then repeated.
The FastMap algorithm is initially applied to an entire data set. However, if we keep track of the pairs of objects that are chosen at each step, then we can incrementally apply FastMap to a new object. The only information needed is the distance of the new object to the selected pairs.
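The law-of-cosines projection that produces the first FastMap coordinate can be sketched as follows. This is a minimal illustration under our own naming, not the full algorithm: pivot selection and the projection onto the residual subspace are omitted.

```python
import numpy as np

def fastmap_coordinate(D, a, b):
    """First FastMap coordinate of every object, given only pairwise distances D
    and a pivot pair (a, b), via the law of cosines:
    x_i = (d(a,i)^2 + d(a,b)^2 - d(b,i)^2) / (2 d(a,b))."""
    dab = D[a, b]
    return (D[a] ** 2 + dab ** 2 - D[b] ** 2) / (2.0 * dab)

# Collinear points at positions 0, 1, and 3: the coordinate recovers them.
pos = np.array([0.0, 1.0, 3.0])
D = np.abs(pos[:, None] - pos[None, :])
print(fastmap_coordinate(D, 0, 2))        # [0. 1. 3.]
```

For the next coordinate, the residual pairwise distances satisfy $d'(i,j)^2 = d(i,j)^2 - (x_i - x_j)^2$, which is how the projection onto the lower-dimensional subspace is carried out using only distances.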
Figure B.6. Plot of the Swiss roll data set.
MDS and PCA are not good at dimensionality reduction when the points have a complicated, non-linear relationship to one another. (An exception is kernel PCA; see the bibliographic notes.) ISOMAP, which is an extension of traditional MDS, was developed to handle such data sets. An example of the type of data set that it can handle is given in Figure B.6, which shows a plot of the "Swiss roll" surface. A data set with this structure constitutes a two-dimensional set of data in a three-dimensional space, but one that cannot be successfully handled by PCA or MDS. However, ISOMAP can successfully analyze this data set.
Algorithm B.2 outlines the basic ISOMAP algorithm. Nearest neighbors
Algorithm B.2 ISOMAP algorithm.
1: Find the nearest neighbors of each data point and create a weighted graph by connecting a point to its nearest neighbors. The nodes are the data points and the weights of the links are the distances between points.
2: Redefine the distances between points to be the length of the shortest path between the two points in the neighborhood graph.
3: Apply classical MDS to the new distance matrix.
can be defined, either by taking the k-nearest points, where k is a parameter, or by taking all points within a specified radius of the point. The purpose of step 2 is to compute the geodesic distance; i.e., the distance between two points that stays on the surface, rather than the Euclidean distance. As an
Figure B.7. Plot of the flowers of the Iris data set based on two new attributes from ISOMAP.
example, the Euclidean distance between two cities on opposite sides of the Earth is the length of a line segment that passes through the Earth, while the geodesic distance between two cities is the length of the shortest arc on the surface of the Earth.
Example B.6. ISOMAP was used to project the Iris data into two dimensions. See Figure B.7. The result is similar to that of the previous techniques.
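A two-dimensional projection of the Iris data along these lines can be sketched with scikit-learn's `Isomap`; the implementation and the neighborhood size are our assumptions, since the example does not state them.

```python
from sklearn.datasets import load_iris
from sklearn.manifold import Isomap

X = load_iris().data
Y = Isomap(n_neighbors=30, n_components=2).fit_transform(X)

print(Y.shape)                # (150, 2)
```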
B.2.4 Common Issues
As with other data analysis techniques, we can distinguish between different dimensionality reduction techniques in a number of areas. One key issue is the quality of the result: Can a technique produce a reasonably faithful representation of the data in a lower-dimensional space? Does this representation capture the characteristics of the data that are important to the intended application (e.g., clusters), while eliminating aspects that are irrelevant or even detrimental (e.g., noise)?