Figure 7.1. A pie chart with a merged category called 0thers.

2. Some attribute values may have considerably higher frequencies than others. For example, suppose 85% of the survey participants own a home computer. By creating a binary item for each attribute value that appears frequently in the data, we may potentially generate many redundant patterns, as illustrated by the following example:

{Computer at hone = Yes, Shop Online = Yes}

——- {Privacy Concerns = Yes}.

4LB Chapter 7 Association Analysis: Advanced Concepts

The rule is redundant because it is subsumed by the more general rule given at the beginning of this section. Because the high-frequency items correspond to the typical values of an attribute, they seldom carry any new information that can help us to better understand the pattern. It may therefore be useful to remove such items before applying standard association analysis algorithms. Another possibility is to apply the tech- niques presented in Section 6.8 for handling data sets with a wide range of support values.

3. Although the width of every transaction is the same as the number of attributes in the original data, the computation time may increase es- pecially when many of the newly created items become frequent. This is because more time is needed to deal with the additional candidate itemsets generated by these items (see Exercise 1 on page 473). One way to reduce the computation time is to avoid generating candidate itemsets that contain more than one item from the same attribute. For example, we do not have to generate a candidate itemset such as {State = X, State = Y, . . .) because the support count of the itemset is zero.

7.2 Handling Continuous Attributes

The Internet survey data described in the previous section may also contain continuous attributes such as the ones shown in Table 7.3. Mining the con- tinuous attributes may reveal useful insights about the data such as “users whose annual income is more than $120K belong to the 45 60 age group” or “users who have more than 3 email accounts and spend more than 15 hours online per week are often concerned about their personal privacy.” Association rules that contain continuous attributes are commonly known as quantitative association rules.

This section describes the various methodologies for applying association analysis to continuous data. We will specifically discuss three types of meth- ods: (i) discretization-based methods, (2) statistics-based methods, and (3) non-discretization methods. The quantitative association rules derived using these methods are quite different in nature.

7.2.I Discretization-Based Methods

Discretization is the most common approach for handling continuous attributes. This approach groups the adjacent values of a continuous attribute into a finite number of intervals. For example, the Age attribute can be divided into the

7.2 Handling Continuous Attributes ALg

Table 7.3. Internet survey data with continuous attributes.

Gender Age Annual Income

No. of Hours Spent Online per Week

No. of Email Accounts

Privacy Concern

Female Male Male

Female Female Male Male Male t”i1″

26 51 29 45 31 25 37 47 26

90K 135K 80K 120K 95K 55K 100K 65K*1

20 10 10 15 20 25 10 8 12

4

2 3 3 D

D

I 2 I

Yes No Yes Yes Yes Yes No No

I:

following intervals:

Age â‚¬ [12 ,16) , Age â‚¬ [16 ,20) , Age e 120,24) , . . . ,Age â‚¬ [56 ,60) ,

where [a, b) represents an interval that includes a but not b. Discretization can be performed using any of the techniques described in Section 2.3.6 (equal

interval width, equal frequen,cy, entropy-based, or clustering). The discrete intervals are then mapped into asymmetric binary attributes so that existing

association analysis algorithms can be applied. Table 7.4 shows the Internet

survey data after discretization and binarization.

Table 7.4. Internet survey data after binarizing categorical and continuous attributes.

Male Female Age <13

Age â‚¬ [13 ,21 )

Age â‚¬ [21,30)

Privacy : Yes

Privacy : N O

0 1 1 0 0 1 1 1

:

1 0 0 1 I 0 0 0

i

0 0 0 0 U

0 0 0

:

0 0 0 0 0 0 0 0 0

1 0 1 0 0 I 0 0 I

I 0 I 1 1 1 0 0

:

U

0 0 0 0

I

1 I I

42O Chapter 7 Association Analysis: Advanced Concepts

Table 7.5. A breakdown of Internet users who participated in online chat according to their age group.

Age Group Chat Online : Yes Chat Online: No 12,16) 16,20) .20,24) 24,28) 28,32) 32,36) 36,40) 40,44) 44,48) 48,52) 52,56) 56.60)

1 1 1 1 12 T4 15 16 16

13 2 3 13 t2 L2 t4 t4 10 1 1 10 1 1

A key parameter in attribute discretization is the number of intervals used to partition each attribute. This parameter is typically provided by the users and can be expressed in terms of the interval width (for the equal interval width approach), the average number of transactions per interval (for the equal frequency approach), or the number of desired clusters (for the clustering- based approach). The difficulty in determining the right number of intervals can be illustrated using the data set shown in Table 7.5, which summarizes the responses of 250 users who participated in the survey. There are two strong rules embedded in the data:

Er: Age â‚¬ 116,24) –r Chat Online : Yes (s : 8.8%, c : 81.5%). Rzt Age â‚¬ [44,60) ——+ Chat Online : No (s : t6.8To, c : 70To).

These rules suggest that most of the users from the age group of 16-24 often participate in online chatting, while those from the age group of. 44-60 are less likely to chat online. In this example, we consider a rule to be interesting only if its support (s) exceeds 5% and its confidence (c) exceeds 65%. One of the problems encountered when discretizing the Age attribute is how to determine the interval width.

1. If the interval is too wide, then we may lose some patterns because of their lack of confidence. For example, when the interval width is 24 years, .Rr and R2 arc replaced by the following rules:

R ‘ r , Age â‚¬ [12 ,36) – -+ Chat On l ine : Yes (s :30%, c :57 .7Vo) . RL, Age â‚¬ [36,60) —+ Chat Online : No (s : 28y0, c: 58.3Vo).

7.2 Handling Continuous Attributes 42I

Despite their higher supports, the wider intervals have caused the con- fidence for both rules to drop below the minimum confidence threshold. As a result, both patterns are lost after discretization.

If the interval is too narrow, then we may lose some patterns because of their lack of support. For example, if the interval width is 4 years, then .Rr is broken up into the following two subrules:

Rftn), Age â‚¬ [16,20) ——+ Chat Online: Yes (s:4.4T0, c:84.6%). l i \

RYl, Age â‚¬ 120,,24) –+ Chat Online : No (s:4.4To, c:78.6T0).

Since the supports for the subrules are less than the minimum support threshold, -R1 is lost after discretization. Similarly, the rule -Bz, which is broken up into four subrules, will also be lost because the support of each subrule is less than the minimum support threshold.

If the interval width is 8 years, then the rule R2 is broken up into the following two subrules:

Age â‚¬ 144,52) ——+ Chat Online : No (s:8.4To, c:70To).

Age â‚¬ [52,60) ——+ Chat Online : No (s:8.4To, c:70T0).

Since El!) and ,t?l| have sufficient support and confidence, R2 can be recovered by aggregating both subrules. Meanwhile, E1 is broken up into the following two subrules:

nl?)’ Age â‚¬ 112,20) —– Chat Online : Yes (s:9.2To, c:60.5%).

R[?’ Age â‚¬ [20,28) ——+ Chat Online: Yes (s:9.2T0, c:60.0%).

Unlike R2, we cannot recover the rule ftr by aggregating the subrules because both subrules fail the confidence threshold.

One way to address these issues is to consider every possible grouping of adjacent intervals. For example, we can start with an interval width of 4 years and then merge the adjacent intervals into wider intervals, Age â‚¬ [12,16), Age â‚¬ [L2,20),.. . , Age â‚¬ [12,60), Age â‚¬ [16,20), Age â‚¬ [16,24), etc. This approach enables the detection of both -Br and R2 as strong rules. However, it also leads to the following computational issues:

1. The computation becomes extremely expensive. If the range is initially divided into,k intervals, then k(k -I)12 binary items must be

3.

Rt?, Rf),

422 Chapter 7 Association Analysis: Advanced Concepts

generated to represent all possible intervals. F\rrthermore, if an item corresponding to the interval [a,b) is frequent, then all other items cor- responding to intervals that subsume [a,b) must be frequent too. This approach can therefore generate far too many candidate and frequent itemsets. To address these problems, a maximum support threshold can be applied to prevent the creation of items corresponding to very wide intervals and to reduce the number of itemsets.

2. Many redundant rules are extracted. For example, consider the following pair of rules:

{Age e [16,20), Gender: MaIe] —–* {CUat Onl-ine = Yes},

{Age e 116,24), Gender: Male} —-* {Cnat Onl-ine = Yes}.

lB+ is a generalization of ,?3 (and R3 is a specialization of -Ra) because Ba has a wider interval for the Age attribute. If the confidence values for both rules are the same, then .E+ should be more interesting be- cause it covers more examples-including those for R3. fi3 is therefore a redundant rule.

7.2.2 Statistics-Based Methods

Quantitative association rules can be used to infer the statistical properties of a population. For example, suppose we are interested in finding the average age ofcertain groups oflnternet users based on the data provided in Tables 7.1 and 7.3. Using the statistics-based method described in this section, quantitative association rules such as the following can be extracted:

{Amual Incone > $100K, Shop Online = Yes} —-+ Age: Mear : 38.

The rule states that the average age of Internet users whose annual income exceeds $100K and who shop online regularly is 38 years old.

Rule Generation

To generate the statistics-based quantitative association rules, the target at- tribute used to characterize interesting segments of the population must be specified. By withholding the target attribute, the remaining categorical and continuous attributes in the data are binarized using the methods described in the previous section. Existing algorithms such as Apri,ori, or FP-growth are then applied to extract frequent itemsets from the binarized data. Each

Rs,

R a , :

7.2 Handling Continuous Attributes 423

frequent itemset identifies an interesting segment of the population. The dis- tribution of the target attribute in each segment can be summarized using descriptive statistics such as mean, median, variance, or absolute deviation. For example, the preceding rule is obtained by averaging the age of Inter- net users who support the frequent itemset {Annual Incorne > $100K, Snop Onl ine = Yes).

The number of quantitative association rules discovered using this method is the same as the number of extracted frequent itemsets. Because of the way the quantitative association rules are defined, the notion of confidence is not applicable to such rules. An alternative method for validating the quantitative association rules is presented next.

Rule Validation

A quantitative association rule is interesting only if the statistics computed from transactions covered by the rule are different than those computed from transactions not covered by the rule. For example, the rule given at the be- ginning of this section is interesting only if the average age of Internet users who do not support the frequent itemset {Annua1 Income > 100K, Shop Online = Yes) is significantly higher or lower than 38 years old. To deter- mine whether the difference in their average ages is statistically significant, statistical hypothesis testing methods should be applied.

Consider the quantitative association rule, -4. ——+ t : p, where A is a frequent itemset, t is the continuous target attribute, and p, is the average value of f among transactions covered by A. Furthermore, let p/ denote the average value of f among transactions not covered by A. The goal is to test whether the difference between p and p/ is greater than some user-specified threshold, A. In statistical hypothesis testing, two opposite propositions, known as the null hypothesis and the alternative hypothesis, are given. A hypothesis test is performed to determine which of these two hypotheses should be accepted, based on evidence gathered from the data (see Appendix C).

In this case, assuming that F I lt’ , the null hypothesis is ,FIs : pt : p,l L, while the alternative hypothesis is Ifi : Lt’ > p * L. To determine which hypothesis should be accepted, the following Z-statistic is computed:

l- t ‘ – t – t – L (7 .1)