It is possible to encompass both the filter and wrapper approaches within a common architecture. The feature selection process is viewed as consisting of four parts: a measure for evaluating a subset, a search strategy that controls the generation of a new subset of features, a stopping criterion, and a valida- tion procedure. Filter methods and wrapper methods differ only in the way in which they evaluate a subset of features. For a wrapper method, subset evaluation uses the target data mining algorithm, while for a filter approach, the evaluation technique is distinct from the target data mining algorithm. The following discussion provides some details of this approach, which is sum- marized in Figure 2.11.

Conceptually, feature subset selection is a search over all possible subsets of features. Many different types of search strategies can be used, but the search strategy should be computationally inexpensive and should find optimal or near optimal sets of features. It is usually not possible to satisfy both requirements, and thus, tradeoffs are necessary.

An integral part ofthe search is an evaluation step tojudge how the current subset of features compares to others that have been considered. This requires an evaluation measure that attempts to determine the goodness of a subset of attributes with respect to a particular data mining task, such as classification

2 .3

54 Chapter 2 Data

Figure 2,11. Flowchart of a feature subset selection process.

or clustering. For the filter approach, such measures attempt to predict how

well the actual data mining algorithm will perform on a given set of attributes. For the wrapper approach, where evaluation consists of actually running the

target data mining application, the subset evaluation function is simply the

criterion normally used to measure the result of the data mining. Because the number of subsets can be enormous and it is impractical to

examine them all, some sort of stopping criterion is necessary. This strategy is

usualiy based on one or more conditions involving the following: the number

of iterations, whether the value of the subset evaluation measure is optimal or

exceeds a certain threshold, whether a subset of a certain size has been ob-

tained, whether simultaneous size and evaluation criteria have been achieved, and whether any improvement can be achieved by the options available to the

search strategy. Finally, once a subset of features has been selected, the results of the

target data mining algorithm on the selected subset should be validated. A

straightforward evaluation approach is to run the algorithm with the full set

of features and compare the full results to results obtained using the subset of

features. Hopefully, the subset of features will produce results that are better

than or almost as good as those produced when using all features. Another validation approach is to use a number of different feature selection algorithms to obtain subsets of features and then compare the results of running the data

mining algorithm on each subset.

2.3 Data Preprocessing 55

Feature Weighting

Feature weighting is an alternative to keeping or eliminating features. More important features are assigned a higher weight, while less important features are given a lower weight. These weights are sometimes assigned based on do- main knowledge about the relative importance of features. Alternatively, they may be determined automatically. For example, some classification schemes, such as support vector machines (Chapter 5), produce classification models in which each feature is given a weight. Features with larger weights play a more important role in the model. The normalization of objects that takes place when computing the cosine similarity (Section 2.4.5) can also be regarded as a type of feature weighting.

2.3.5 Feature Creation

It is frequently possible to create, from the original attributes, a new set of attributes that captures the important information in a data set much more effectively. Furthermore, the number of new attributes can be smaller than the number of original attributes, allowing us to reap all the previously described benefits of dimensionality reduction. Three related methodologies for creating new attributes are described next: feature extraction, mapping the data to a new space, and feature construction.

Feature Extraction

The creation of a new set of features from the original raw data is known as feature extraction. Consider a set of photographs, where each photograph is to be classified according to whether or not it contains a human face. The raw data is a set of pixels, and as such, is not suitable for many types of classification algorithms. However, if the data is processed to provide higher- level features, such as the presence or absence of certain types of edges and areas that are highly correlated with the presence of human faces, then a much broader set of classification techniques can be applied to this problem.

Unfortunately, in the sense in which it is most commonly used, feature extraction is highly domain-specific. For a particular field, such as image processing, various features and the techniques to extract them have been developed over a period of time, and often these techniques have limited ap- plicability to other fields. Consequently, whenever data mining is applied to a relatively new area, a key task is the development of new features and feature extraction methods.

56 Chapter 2 Data

(a) Two time series. (b) Noisy time series. (c) Power spectrum

Figure 2.12. Application of the Fourier transform to identify the underlying frequencies in time series

data.

Mapping the Data to a New Space

A totally different view of the data can reveal important and interesting fea-

tures. Consider, for example, time series data, which often contains periodic

patterns. If there is only a single periodic pattern and not much noise’ then

the pattern is easily detected. If, on the other hand, there are a number of periodic patterns and a significant amount of noise is present, then these pat-

terns are hard to detect. Such patterns can, nonetheless, often be detected

by applying a Fourier transform to the time series in order to change to a

representation in which frequency information is explicit. In the example that

follows, it will not be necessary to know the details of the Fourier transform.

It is enough to know that, for each time series, the Fourier transform produces

a new data object whose attributes are related to frequencies.

Example 2.10 (Fourier Analysis). The time series presented in Figure

2.I2(b) is the sum of three other time series, two of which are shown in Figure

2.12(a) and have frequencies of 7 and 17 cycles per second, respectively. The

third time series is random noise. Figure 2.12(c) shows the power spectrum

that can be computed after applying a Fourier transform to the original time

series. (Informally, the power spectrum is proportional to the square of each

frequency attribute.) In spite ofthe noise, there are two peaks that correspond to the periods of the two original, non-noisy time series. Again, the main point

is that better features can reveal important aspects of the data. I

2.3 Data Preprocessing 57

Many other sorts of transformations are also possible. Besides the Fourier transform, the wavelet transform has also proven very useful for time series and other types of data.

Feature Construction

Sometimes the features in the original data sets have the necessary information, but it is not in a form suitable for the data mining algorithm. In this situation, one or more new features constructed out of the original features can be more useful than the original features.

Example 2.11- (Density). To illustrate this, consider a data set consisting of information about historical artifacts, which, along with other information, contains the volume and mass of each artifact. For simplicity, assume that these artifacts are made of a small number of materials (wood, clay, bronze, gold) and that we want to classify the artifacts with respect to the material of which they are made. In this case, a density feature constructed from the mass and volume features, i.e., density : mass/uolume, would most directly yield an accurate classification. Although there have been some attempts to automatically perform feature construction by exploring simple mathematical combinations of existing attributes, the most common approach is to construct features using domain expertise.

2.3.6 Discretization and Binarization

Some data mining algorithms, especially certain classification algorithms, re- quire that the data be in the form of categorical attributes. Algorithms that find association patterns require that the data be in the form of binary at- tributes. Thus, it is often necessary to transform a continuous attribute into a categorical attribute (discretization), and both continuous and discrete attributes may need to be transformed into one or more binary attributes (binarization). Additionally, if a categorical attribute has a large number of values (categories), or some values occur infrequently, then it may be beneficial for certain data mining tasks to reduce the number of categories by combining some of the values.