Categorical attributes can sometimes have too many values. If the categorical attribute is an ordinal attribute, then techniques similar to those for continuous attributes can be used to reduce the number of categories. If the categorical attribute is nominal, however, then other approaches are needed.

Consider a university that has a large number of departments. Consequently, a department name attribute might have dozens of different values. In this situation, we could use our knowledge of the relationships among different departments to combine departments into larger groups, such as engineering, social sciences, or biological sciences. If domain knowledge does not serve as a useful guide or such an approach results in poor classification performance, then it is necessary to use a more empirical approach, such as grouping values

Figure 2.14. Discretizing x and y attributes for four groups (classes) of points. (a) Three intervals.

together only if such a grouping results in improved classification accuracy or achieves some other data mining objective.

2.3.7 Variable Transformation

A variable transformation refers to a transformation that is applied to all the values of a variable. (We use the term variable instead of attribute to adhere to common usage, although we will also refer to attribute transformation on occasion.) In other words, for each object, the transformation is applied to the value of the variable for that object. For example, if only the magnitude of a variable is important, then the values of the variable can be transformed by taking the absolute value. In the following section, we discuss two important types of variable transformations: simple functional transformations and normalization.

Simple Functions

For this type of variable transformation, a simple mathematical function is applied to each value individually. If x is a variable, then examples of such transformations include x^k, log x, e^x, √x, 1/x, sin x, or |x|. In statistics, variable transformations, especially √x, log x, and 1/x, are often used to transform data that does not have a Gaussian (normal) distribution into data that does. While this can be important, other reasons often take precedence in data min-


ing. Suppose the variable of interest is the number of data bytes in a session, and the number of bytes ranges from 1 to 1 billion. This is a huge range, and it may be advantageous to compress it by using a log₁₀ transformation. In this case, sessions that transferred 10^8 and 10^9 bytes would be more similar to each other than sessions that transferred 10 and 1000 bytes (9 − 8 = 1 versus 3 − 1 = 2). For some applications, such as network intrusion detection, this may be what is desired, since the first two sessions most likely represent transfers of large files, while the latter two sessions could be two quite distinct types of sessions.

Variable transformations should be applied with caution since they change the nature of the data. While this is what is desired, there can be problems if the nature of the transformation is not fully appreciated. For instance, the transformation 1/x reduces the magnitude of values that are 1 or larger, but increases the magnitude of values between 0 and 1. To illustrate, the values {1, 2, 3} go to {1, 1/2, 1/3}, but the values {1, 1/2, 1/3} go to {1, 2, 3}. Thus, for all sets of values, the transformation 1/x reverses the order. To help clarify the effect of a transformation, it is important to ask questions such as the following: Does the order need to be maintained? Does the transformation apply to all values, especially negative values and 0? What is the effect of the transformation on the values between 0 and 1? Exercise 17 on page 92 explores other aspects of variable transformation.
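The byte-count example above can be sketched in a few lines of Python (the session sizes are the hypothetical values from the text):

```python
import math

# Hypothetical session sizes in bytes, as in the example above.
sessions = [10, 1000, 10**8, 10**9]

# log10 compresses the huge range [1, 10^9] down to [0, 9].
log_sizes = [math.log10(b) for b in sessions]
print(log_sizes)  # [1.0, 3.0, 8.0, 9.0]

# After the transformation, the two large sessions are closer together
# (9 - 8 = 1) than the two small ones (3 - 1 = 2).
```

On the original scale the two large sessions differ by 900 million bytes; on the log scale they differ by 1.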

Normalization or Standardization

Another common type of variable transformation is the standardization or normalization of a variable. (In the data mining community the terms are often used interchangeably. In statistics, however, the term normalization can be confused with the transformations used for making a variable normal, i.e., Gaussian.) The goal of standardization or normalization is to make an entire set of values have a particular property. A traditional example is that of "standardizing a variable" in statistics. If x̄ is the mean (average) of the attribute values and s_x is their standard deviation, then the transformation x′ = (x − x̄)/s_x creates a new variable that has a mean of 0 and a standard deviation of 1. If different variables are to be combined in some way, then such a transformation is often necessary to avoid having a variable with large values dominate the results of the calculation. To illustrate, consider comparing people based on two variables: age and income. For any two people, the difference in income will likely be much higher in absolute terms (hundreds or thousands of dollars) than the difference in age (less than 150). If the differences in the range of values of age and income are not taken into account, then


the comparison between people will be dominated by differences in income. In particular, if the similarity or dissimilarity of two people is calculated using the similarity or dissimilarity measures defined later in this chapter, then in many cases, such as that of Euclidean distance, the income values will dominate the calculation.
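A minimal sketch of this dominance effect, using two hypothetical people described by (age, income):

```python
import math

# Two hypothetical people as (age in years, income in dollars).
p = (25, 50_000)
q = (45, 52_000)

# Unstandardized Euclidean distance: the $2000 income difference
# overwhelms the 20-year age difference.
dist = math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

# Fraction of the distance attributable to income alone.
income_share = abs(p[1] - q[1]) / dist
```

Here `income_share` is greater than 0.999: age contributes almost nothing to the distance until both variables are put on a comparable scale.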

The mean and standard deviation are strongly affected by outliers, so the above transformation is often modified. First, the mean is replaced by the median, i.e., the middle value. Second, the standard deviation is replaced by the absolute standard deviation. Specifically, if x is a variable, then the absolute standard deviation of x is given by σ_A = Σ_{i=1}^{m} |x_i − μ|, where x_i is the i-th value of the variable, m is the number of objects, and μ is either the mean or median. Other approaches for computing estimates of the location (center) and spread of a set of values in the presence of outliers are described in Sections 3.2.3 and 3.2.4, respectively. These measures can also be used to define a standardization transformation.
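Both variants can be sketched as follows; this is an illustrative implementation (function names are mine), using the sum-of-absolute-deviations definition of σ_A given above:

```python
def mean(xs):
    return sum(xs) / len(xs)

def median(xs):
    s = sorted(xs)
    m = len(s) // 2
    return s[m] if len(s) % 2 else (s[m - 1] + s[m]) / 2

def standardize(xs):
    """Classic z-score: x' = (x - mean) / standard deviation."""
    mu = mean(xs)
    sd = (sum((x - mu) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - mu) / sd for x in xs]

def robust_standardize(xs):
    """Outlier-resistant variant: center on the median and divide by
    the absolute standard deviation sigma_A = sum |x_i - mu|."""
    mu = median(xs)
    sigma_a = sum(abs(x - mu) for x in xs)
    return [(x - mu) / sigma_a for x in xs]
```

A single extreme value drags the mean (and hence every z-score) toward it, whereas the median-based center is barely affected.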

2.4 Measures of Similarity and Dissimilarity

Similarity and dissimilarity are important because they are used by a number of data mining techniques, such as clustering, nearest neighbor classification, and anomaly detection. In many cases, the initial data set is not needed once these similarities or dissimilarities have been computed. Such approaches can be viewed as transforming the data to a similarity (dissimilarity) space and then performing the analysis.

We begin with a discussion of the basics: high-level definitions of similarity and dissimilarity, and a discussion of how they are related. For convenience, the term proximity is used to refer to either similarity or dissimilarity. Since the proximity between two objects is a function of the proximity between the corresponding attributes of the two objects, we first describe how to measure the proximity between objects having only one simple attribute, and then consider proximity measures for objects with multiple attributes. This includes measures such as correlation and Euclidean distance, which are useful for dense data such as time series or two-dimensional points, as well as the Jaccard and cosine similarity measures, which are useful for sparse data like documents. Next, we consider several important issues concerning proximity measures. The section concludes with a brief discussion of how to select the right proximity measure.


2.4.1 Basics

Definitions

Informally, the similarity between two objects is a numerical measure of the degree to which the two objects are alike. Consequently, similarities are higher for pairs of objects that are more alike. Similarities are usually non-negative and are often between 0 (no similarity) and 1 (complete similarity).

The dissimilarity between two objects is a numerical measure of the degree to which the two objects are different. Dissimilarities are lower for more similar pairs of objects. Frequently, the term distance is used as a synonym for dissimilarity, although, as we shall see, distance is often used to refer to a special class of dissimilarities. Dissimilarities sometimes fall in the interval [0, 1], but it is also common for them to range from 0 to ∞.

Transformations

Transformations are often applied to convert a similarity to a dissimilarity, or vice versa, or to transform a proximity measure to fall within a particular range, such as [0, 1]. For instance, we may have similarities that range from 1 to 10, but the particular algorithm or software package that we want to use may be designed to only work with dissimilarities, or it may only work with similarities in the interval [0, 1]. We discuss these issues here because we will employ such transformations later in our discussion of proximity. In addition, these issues are relatively independent of the details of specific proximity measures.

Frequently, proximity measures, especially similarities, are defined or transformed to have values in the interval [0, 1]. Informally, the motivation for this is to use a scale in which a proximity value indicates the fraction of similarity (or dissimilarity) between two objects. Such a transformation is often relatively straightforward. For example, if the similarities between objects range from 1 (not at all similar) to 10 (completely similar), we can make them fall within the range [0, 1] by using the transformation s′ = (s − 1)/9, where s and s′ are the original and new similarity values, respectively. In the more general case, the transformation of similarities to the interval [0, 1] is given by the expression s′ = (s − min_s)/(max_s − min_s), where max_s and min_s are the maximum and minimum similarity values, respectively. Likewise, dissimilarity measures with a finite range can be mapped to the interval [0, 1] by using the formula d′ = (d − min_d)/(max_d − min_d).
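The general min-max formula can be written as a small helper (the function name is mine, for illustration):

```python
def to_unit_interval(values):
    """Map values with a finite range onto [0, 1]:
    v' = (v - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Similarities on a 1-to-10 scale: equivalent to s' = (s - 1) / 9.
scaled = to_unit_interval([1, 5.5, 10])  # [0.0, 0.5, 1.0]
```

The same helper works unchanged for dissimilarities with a finite range, since the formula d′ = (d − min_d)/(max_d − min_d) has the same shape.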

There can be various complications in mapping proximity measures to the interval [0, 1], however. If, for example, the proximity measure originally takes values in the interval [0, ∞), then a non-linear transformation is needed and values will not have the same relationship to one another on the new scale. Consider the transformation d′ = d/(1 + d) for a dissimilarity measure that ranges from 0 to ∞. The dissimilarities 0, 0.5, 2, 10, 100, and 1000 will be transformed into the new dissimilarities 0, 0.33, 0.67, 0.90, 0.99, and 0.999, respectively. Larger values on the original dissimilarity scale are compressed into the range of values near 1, but whether or not this is desirable depends on the application. Another complication is that the meaning of the proximity measure may be changed. For example, correlation, which is discussed later, is a measure of similarity that takes values in the interval [−1, 1]. Mapping these values to the interval [0, 1] by taking the absolute value loses information about the sign, which can be important in some applications. See Exercise 22 on page 94.
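The compression can be reproduced directly:

```python
# d' = d / (1 + d) maps [0, inf) onto [0, 1).
ds = [0, 0.5, 2, 10, 100, 1000]
compressed = [d / (1 + d) for d in ds]

# This yields 0, 1/3, 2/3, 10/11, 100/101, 1000/1001 -- i.e., the
# sequence 0, 0.33, 0.67, 0.90..., 0.99, 0.999 given in the text.
# Note how 100 and 1000, far apart originally, land within 0.01 of
# each other near 1.
```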

Transforming similarities to dissimilarities and vice versa is also relatively straightforward, although we again face the issues of preserving meaning and changing a linear scale into a non-linear scale. If the similarity (or dissimilarity) falls in the interval [0, 1], then the dissimilarity can be defined as d = 1 − s (s = 1 − d). Another simple approach is to define similarity as the negative of the dissimilarity (or vice versa). To illustrate, the dissimilarities 0, 1, 10, and 100 can be transformed into the similarities 0, −1, −10, and −100,

respectively.

The similarities resulting from the negation transformation are not restricted to the range [0, 1], but if that is desired, then transformations such as s = 1/(d + 1), s = e^(−d), or s = 1 − (d − min_d)/(max_d − min_d) can be used. For the transformation s = 1/(d + 1), the dissimilarities 0, 1, 10, and 100 are transformed into 1, 0.5, 0.09, and 0.01, respectively. For s = e^(−d), they become 1.00, 0.37, 0.00, and 0.00, respectively, while for s = 1 − (d − min_d)/(max_d − min_d) they become 1.00, 0.99, 0.90, and 0.00, respectively. In this discussion, we have focused on converting dissimilarities to similarities. Conversion in the opposite direction is considered in Exercise 23 on page 94.
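The three transformations above can be checked in a few lines (variable names are mine):

```python
import math

# The dissimilarities used in the text.
ds = [0, 1, 10, 100]
min_d, max_d = min(ds), max(ds)

s_recip = [1 / (d + 1) for d in ds]           # 1, 0.5, ~0.09, ~0.01
s_exp = [math.exp(-d) for d in ds]            # 1.00, 0.37, ~0, ~0
s_minmax = [1 - (d - min_d) / (max_d - min_d) for d in ds]  # 1.00, 0.99, 0.90, 0.00
```

All three are monotonic decreasing in d, but they distort the scale very differently: e^(−d) collapses everything beyond small dissimilarities to essentially 0, while the min-max form stays linear.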

In general, any monotonic decreasing function can be used to convert dis- similarities to similarities, or vice versa. Of course, other factors also must be considered when transforming similarities to dissimilarities, or vice versa, or when transforming the values of a proximity measure to a new scale. We have mentioned issues related to preserving meaning, distortion of scale, and requirements of data analysis tools, but this list is certainly not exhaustive.

2.4.2 Similarity and Dissimilarity between Simple Attributes

The proximity of objects with a number of attributes is typically defined by combining the proximities of individual attributes, and thus, we first discuss