z

Figure 2.8. Histograms of standard deviation for monthly and yearly precipitation in Australia for the oeriod 1982 to 1993.

2.3.2 Sampling

Sampling is a commonly used approach for selecting a subset of the data objects to be analyzed. In statistics, it has long been used for both the pre- Iiminary investigation of the data and the final data analysis. Sampling can also be very useful in data mining. However, the motivations for sampling in statistics and data mining are ofben different. Statisticians use sampling because obtaining the entire set of data of interest is too expensive or time consuming, while data miners sample because it is too expensive or time con- suming to process all the data. In some cases, using a sampling algorithm can reduce the data size to the point where a better, but more expensive algorithm can be used.

The key principle for effective sampling is the following: Using a sample will work almost as well as using the entire data set if the sample is repre- sentative. fn turn, a sample is representative if it has approximately the same property (of interest) as the original set of data. If the mean (average) of the data objects is the property of interest, then a sample is representative if it has a mean that is close to that of the original data. Because sampling is a statistical process, the representativeness of any particular sample will vary, and the best that we can do is choose a sampling scheme that guarantees a high probability of getting a representative sample. As discussed next, this involves choosing the appropriate sample size and sampling techniques.

48 Chapter 2

Sampling Approaches

There are many sampling techniques, but only a few of the most basic ones and their variations will be covered here. The simplest type of sampling is simple random sampling. For this type of sampling, there is an equal prob- ability of selecting any particular item. There are two variations on random sampling (and other sampling techniques as well): (1) sampling without re- placement-as each item is selected, it is removed from the set of all objects that together constitute the population, and (2) sampling with replace- ment-objects are not removed from the population as they are selected for the sample. In sampling with replacement, the same object can be picked more than once. The samples produced by the two methods are not much different when samples are relatively small compared to the data set size, but sampling with replacement is simpler to analyze since the probability of selecting any object remains constant during the sampling process.

When the population consists of different types of objects, with widely different numbers of objects, simple random sampling can fail to adequately represent those types of objects that are less frequent. This can cause prob- lems when the analysis requires proper representation of all object types. For example, when building classification models for rare classes, it is critical that the rare classes be adequately represented in the sample. Hence, a sampling scheme that can accommodate differing frequencies for the items of interest is needed. Stratified sampling, which starts with prespecified groups of ob- jects, is such an approach. In the simplest version, equal numbers of objects are drawn from each group even though the groups are ofdifferent sizes. In an- other variation, the number of objects drawn from each group is proportional to the size of that group.

Example 2.8 (Sampling and Loss of Information). Once a sampling technique has been selected, it is still necessary to choose the sample size. Larger sample sizes increase the probability that a sample will be representa- tive, but they also eliminate much of the advantage of sampling. Conversely, with smaller sample sizes, patterns may be missed or erroneous patterns can be detected. Figure 2.9(a) shows a data set that contains 8000 two-dimensional points, while Figures 2.9(b) and 2.9(c) show samples from this data set of size 2000 and 500, respectively. Although most of the structure of this data set is present in the sample of 2000 points, much of the structure is missing in the sample of 500 points. r

Data

2.3 Data Preprocessing 49

(a) 8000 points (b) 2000 points (c) 500 points

Figure 2.9. Example of the loss of structure with sampling,

Example 2.9 (Determining the Proper Sample Size). To illustrate that determining the proper sample size requires a methodical approach, consider the following task.

Given a set of data that consists of a small number of almost equal- sized groups, find at least one representative point for each of the groups. Assume that the objects in each group are highly similar to each other, but not very similar to objects in different groups. Also assume that there are a relatively small number of groups, e.g., 10. Figure 2.10(a) shows an idealized set of clusters (groups) from which these points might be drawn.

This problem can be efficiently solved using sampling. One approach is to take a small sample of data points, compute the pairwise similarities between points, and then form groups of points that are highly similar. The desired set of representative points is then obtained by taking one point from each of these groups. To follow this approach, however, we need to determine a sample size that would guarantee, with a high probability, the desired outcome; that is, that at least one point will be obtained from each cluster. Figure 2.10(b) shows the probability of getting one object from each of the 10 groups as the sample size runs from 10 to 60. Interestingly, with a sample size of 20, there is little chance (20%) of getting a sample that includes all 10 clusters. Even with a sample size of 30, there is still a moderate chance (almost a0%) of getting a sample that doesn’t contain objects from all 10 clusters. This issue is further explored in the context of clustering by Exercise 4 on page 559.

I

ii

50 Chapter 2 Data

oo oo oo

(a) ren sroups or points’ f,tl-T:i:1ti[ ;,””T”1″

contains points

Figure 2.10. Finding representative points from 10 groups.

Progressive Sampling

The proper sample size can be difficult to determine, so adaptive or progres- sive sampling schemes are sometimes used. These approaches start with a small sample, and then increase the sample size until a sample of sufficient size has been obtained. While this technique eliminates the need to determine the correct sample size initially, it requires that there be a way to evaluate the sample to judge if it is large enough.

Suppose, for instance, that progressive sampling is used to learn a pre- dictive model. Although the accuracy of predictive models increases as the sample size increases, at some point the increase in accuracy levels off. We want to stop increasing the sample size at this leveling-off point. By keeping track of the change in accuracy of the model as we take progressively larger samples, and by taking other samples close to the size of the current one, we can get an estimate as to how close we are to this leveling-off point, and thus, stop sampling.

2.3.3 Dimensionality Reduction

Data sets can have a large number of features. Consider a set of documents, where each document is represented by a vector whose components are the frequencies with which each word occurs in the document. In such cases,

(d

e n

o o

o o

Sample Size

2.3 Data Preprocessing 51

there are typically thousands or tens of thousands of attributes (components), one for each word in the vocabulary. As another example, consider a set of time series consisting of the daily closing price of various stocks over a period of 30 years. In this case, the attributes, which are the prices on specific days, again number in the thousands.

There are a variety of benefits to dimensionality reduction. A key benefit is that many data mining algorithms work better if the dimensionality the number of attributes in the data-is lower. This is partly because dimension- ality reduction can eliminate irrelevant features and reduce noise and partly because of the curse of dimensionality, which is explained below. Another ben- efit is that a reduction of dimensionality can lead to a more understandable model because the model may involve fewer attributes. Also, dimensionality reduction may allow the data to be more easily visualized. Even if dimen- sionality reduction doesn’t reduce the data to two or three dimensions, data is often visualized by looking at pairs or triplets of attributes, and the num- ber of such combinations is greatly reduced. Finally, the amount of time and memory required by the data mining algorithm is reduced with a reduction in dimensionality.

The term dimensionality reduction is often reserved for those techniques that reduce the dimensionality of a data set by creating new attributes that are a combination of the old attributes. The reduction of dimensionality by selecting new attributes that are a subset of the old is known as feature subset selection or feature selection. It will be discussed in Section 2.3.4.

In the remainder of this section, we briefly introduce two important topics: the curse of dimensionality and dimensionality reduction techniques based on linear algebra approaches such as principal components analysis (PCA). More details on dimensionality reduction can be found in Appendix B.

The Curse of Dimensionality

llhe curse of dimensionality refers to the phenomenon that many types of data analysis become significantly harder as the dimensionality of the data increases. Specifically, as dimensionality increases, the data becomes increas- ingly sparse in the space that it occupies. For classification, this can mean that there are not enough data objects to allow the creation of a model that reliably assigns a class to all possible objects. For clustering, the definitions of density and the distance between points, which are critical for clustering, become less meaningful. (This is discussed further in Sections g.1.2, 9.4.5, and 9.4.7.) As a result) many clustering and classification algorithms (and other

52 Chapter 2 Data

data analysis algorithms) have trouble with high-dimensional data-reduced

classification accuracy and poor quality clusters.

Linear Algebra Techniques for Dimensionality Reduction

Some of the most common approaches for dimensionality reduction, partic-

ularly for continuous data, use techniques from linear algebra to project the

data from a high-dimensional space into a lower-dimensional space. Principal

Components Analysis (PCA) is a linear algebra technique for continuous

attributes that finds new attributes (principal components) that (1) are linear

combinations of the original attributes, (2) are orthogonal (perpendicular) to

each other, and (3) capture the maximum amount of variation in the data. For

example, the first two principal components capture as much of the variation

in the data as is possible with two orthogonal attributes that are linear combi-

nations of the original attributes. Singular Value Decomposition (SVD)

is a linear algebra technique that is related to PCA and is also commonly used

for dimensionality reduction. For additional details, see Appendices A and B.

2.3.4 Feature Subset Selection

Another way to reduce the dimensionality is to use only a subset of the fea-

tures. While it might seem that such an approach would lose information, this

is not the case if redundant and irrelevant features are present. Redundant

features duplicate much or all of the information contained in one or more

other attributes. For example, the purchase price of a product and the amount

of sales tax paid contain much of the same information. Irrelevant features

Oontain almost no useful information for the data mining task at hand. For

instance, students’ ID numbers are irrelevant to the task of predicting stu-

dents’ grade point averages. Redundant and irrelevant features can reduce

classification accuracy and the quality of the clusters that are found.

While some irrelevant and redundant attributes can be eliminated imme-

diately by using common sense oI domain knowledge, selecting the best subset

of features frequently requires a systematic approach. The ideal approach to

feature selection is to try all possible subsets of features as input to the data

mining aigorithm of interest, and then take the subset that produces the best

results. This method has the advantage of reflecting the objective and bias of

the data mining algorithm that will eventually be used. Unfortunately, since

the number of subsets involving n attributes is 2n, such an approach is imprac-

tical in most situations and alternative strategies are needed. There are three

standard approaches to feature selection: embedded, filter, and wrapper.

Data Preprocessing 53

Embedded approaches Feature selection occurs naturally as part of the data mining algorithm. Specifically, during the operation of the data mining algorithm, the algorithm itself decides which attributes to use and which to ignore. Algorithms for building decision tree classifiers, which are discussed in Chapter 4, often operate in this manner.

Filter approaches Features are selected before the data mining algorithm is run, using some approach that is independent of the data mining task. For example, we might select sets of attributes whose pairwise correlation is as low as possible.

Wrapper approaches These methods use the target data mining algorithm as a black box to find the best subset of attributes, in a way similar to that of the ideal algorithm described above, but typically without enumerating all possible subsets.

Since the embedded approaches are algorithm-specific, only the filter and wrapper approaches will be discussed further here.

An Architecture for Feature Subset Selection