As with feature selection, the best discretization and binarization approach is the one that “produces the best result for the data mining algorithm that will be used to analyze the data.” It is typically not practical to apply such a criterion directly. Consequently, discretization or binarization is performed in

Categorical Value Integer Value :x7 u 5 tr4

awtuL poor OK good great

0 I 2 3 4

1 0 0 0 0

U 1 0 0 0

0 0 1 0 0

0 0 0 1 0

0 0 0 0 1

58 Chapter 2 Data

Table 2.5. Conversion of a categorical attribute to three binary attributes.

Catesorical Value Integer Value f i1 T a

awful poor OK good great

0 I 2 3 4

0 0 0 0 1

0 0 1 1 0

0 1 0 1 0

Table 2.6. Conversion of a categorical attribute to five asymmetric binary attributes.

a way that satisfies a criterion that is thought to have a relationship to good

performance for the data mining task being considered.

Binarization

A simple technique to binarize a categorical attribute is the following: If there

are rn categorical values, then uniquely assign each original value to an integer

in the interval [0,rn – 1]. If the attribute is ordinal, then order must be

maintained by the assignment. (Note that even if the attribute is originally

represented using integers, this process is necessary ifthe integers are not in the

interval 10, ̂ – 1] . ) Next, convert each of these m integers to a binary number.

Since n : flogr(m)l binary digits are required to represent these integers,

represent these binary numbers using n binary attributes. To illustrate, a

categorical variable with 5 values {awful, poor, OK, good, great} would require

three binary variables frrt fr2t and 23. The conversion is shown in Table 2.5.

Such a transformation can cause complications, such as creating unin-

tended relationships among the transformed attributes. For example, in Table

2.5, attributes 12 and 13 are correlated because information about the good

value is encoded using both attributes. Furthermore, association analysis re-

quires asymmetric binary attributes, where only the presence of the attribute (value : 1) is important. For association problems, it is therefore necessary to

introduce one binary attribute for each categorical value) as in Table 2.6. Ifthe

2.3 Data Preprocessing 59

number of resulting attributes is too large, then the techniques described below can be used to reduce the number of categorical values before binarization.

Likewise, for association problems, it may be necessary to replace a single binary attribute with two asymmetric binary attributes. Consider a binary attribute that records a person’s gender, male or female. For traditional as- sociation rule algorithms, this information needs to be transformed into two asymmetric binary attributes, one that is a 1 only when the person is male and one that is a 1 only when the person is female. (For asymmetric binary attributes, the information representation is somewhat inefficient in that two bits of storage are required to represent each bit of information.)

Discretization of Continuous Attributes

Discretization is typically applied to attributes that are used in classification or association analysis. In general, the best discretization depends on the algo- rithm being used, as well as the other attributes being considered. Typically, however, the discretization of an attribute is considered in isolation.

tansformation of a continuous attribute to a categorical attribute involves two subtasks: deciding how many categories to have and determining how to map the values ofthe continuous attribute to these categories. In the first step, after the values of the continuous attribute are sorted, they are then divided into n intervals by specifying n – r split points. In the second, rather trivial step, all the values in one interval are mapped to the same categorical value. Therefore, the problem of discretization is one of deciding how many split points to choose and where to place them. The result can be represented either as a set of intervals {(ro, ,t], (*t, ,zl, . . . , (*n-t, r”,) }, wher e ro and rn may be *oo or -oo, respectively, or equivalently, as a series of inequalities : t g 1 i t 1 I I t . . . t I n – I 1 r 1 r n .

IJnsupervised Discretization A basic distinction between discretization methods for classification is whether class information is used (supervised) or not (unsupervised). If class information is not used, then relatively simple approaches are common. For instance, the equal width approach divides the range of the attribute into a user-specified number of intervals each having the same width. Such an approach can be badly affected by outliers, and for that reason, an equal frequency (equal depth) approach, which tries to put the same number of objects into each interval, is often preferred. As another example of unsupervised discretization, a clustering method, such as K-means (see Chapter 8), can also be used. Finally, visually inspecting the data can sometimes be an effective approach.

60 Chapter 2 Data

Example 2.12 (Discretization Techniques). This example demonstrates how these approaches work on an actual data set. Figure 2.13(a) shows data points belonging to four different groups, along with two outliers-the large

dots on either end. The techniques of the previous paragraph were applied

to discretize the r values of these data points into four categorical values. (Points in the data set have a random g component to make it easy to see

how many points are in each group.) Visually inspecting the data works quite

well, but is not automatic, and thus, we focus on the other three approaches. The split points produced by the techniques equal width, equal frequency, and

K-means are shown in Figures 2.13(b), 2.13(c), and 2.13(d), respectively. The

split points are represented as dashed lines. If we measure the performance of

a discretization technique by the extent to which different objects in different groups are assigned the same categorical value, then K-means performs best, followed by equal frequency, and finally, equal width. I

Supervised Discretization The discretization methods described above

are usually better than no discretization, but keeping the end purpose in mind

and using additional information (class labels) often produces better results. This should not be surprising, since an interval constructed with no knowledge

of class labels often contains a mixture of class labels. A conceptually simple approach is to place the splits in a way that maximizes the purity of the intervals. In practice, however, such an approach requires potentially arbitrary decisions about the purity of an interval and the minimum size of an interval. To overcome such concerns, some statistically based approaches start with each attribute value as a separate interval and create larger intervals by merging adjacent intervals that are similar according to a statistical test. Entropy- based approaches are one of the most promising approaches to discretization, and a simple approach based on entropy will be presented.

First, it is necessary to define entropy. Let k be the number of different

class labels, mibe the number of values in the eth interval of a partition, and

mai be the number of values of class j in interval i. Then the entropy ei of the ‘ith interval is given by the equation

k

e r : l p i i l o g z p i j , i : l

where pq : m4 fm2 is the probability (fraction of values) of class j in the i’th interval. The total entropy, e, of the partition is the weighted average of the individual interval entropies, i.e.,

2 .3

:1. t r .

Data Preprocessing 61,

t

s- 3′

t

. r i I

! , . 3

l .

. ; . !

.ti i . .

i..

r ; .

‘l

? r

2 . 1

10 15 1 0 1 5 20

(a) Original data. (b) Equal width discretization.

i :-i i *| ‘ . . . I . i .

i .’T

i ,i. : i.’ : g

| . , ‘ I . . .

! : .

i i,:, 1 0 1 5 20

(c) Equal frequency discretization. (d) K-means discretization.

Figure 2.1 3. Different discretization techniques.

” – \ -” : / ,= i to ‘o ‘

where rn is the number of values, wi : mif m is the fraction of values in the ith interval. and n is the number of intervals. Intuitively, the entropy of an interval is a measure of the purity of an interval. If an interval contains only values ofone class (is perfectly pure), then the entropy is 0 and it contributes

62 Chapter 2 Data

nothing to the overall entropy. If the classes of values in an interval occur

equally often (the interval is as impure as possible), then the entropy is a

maximum. A simple approach for partitioning a continuous attribute starts by bisect-

ing the initial values so that the resulting two intervals give minimum entropy.

This technique only needs to consider each value as a possible split point, be-

cause it is assumed that intervals contain ordered sets of values. The splitting process is then repeated with another interval, typically choosing the interval

with the worst (highest) entropy, until a user-specified number of intervals is

reached, or a stopping criterion is satisfied.

Example 2.13 (Discretization of Two Attributes). This method was

used to independently discretize both the z and y attributes of the two-

dimensional data shown in Figure 2.I4. In the first discretization, shown in

Figure 2.14(a), the r and g attributes were both split into three intervals. (The

dashed lines indicate the split points.) In the second discretization, shown in

Figure 2.L4(b), the r and gr attributes were both split into five intervals. I

This simple example illustrates two aspects of discretization. First, in two

dimensions, the classes of points are well separated, but in one dimension, this

is not so. In general, discretizing each attribute separately often guarantees

suboptimal results. Second, five intervals work better than three, but six

intervals do not improve the discretization much, at least in terms of entropy. (Entropy values and results for six intervals are not shown.) Consequently, it is desirable to have a stopping criterion that automatically finds the right

number of partitions.

Categorical Attributes with Too Many Values