
Z = (μ1 − μ2 − Δ) / √(s1²/n1 + s2²/n2)        (7.1)

where n1 is the number of transactions supporting A, n2 is the number of transactions not supporting A, s1 is the standard deviation for t among transactions

Chapter 7 Association Analysis: Advanced Concepts

that support A, and s2 is the standard deviation for t among transactions that do not support A. Under the null hypothesis, Z has a standard normal distribution with mean 0 and variance 1. The value of Z computed using Equation 7.1 is then compared against a critical value, Zα, which is a threshold that depends on the desired confidence level. If Z > Zα, then the null hypothesis is rejected and we may conclude that the quantitative association rule is interesting. Otherwise, there is not enough evidence in the data to show that the difference in means is statistically significant.

Example 7.1. Consider the quantitative association rule

{Income > 100K, Shop Online = Yes} → Age: μ = 38.

Suppose there are 50 Internet users who supported the rule antecedent. The standard deviation of their ages is 3.5. On the other hand, the average age of the 200 users who do not support the rule antecedent is 30 and their standard deviation is 6.5. Assume that a quantitative association rule is considered interesting only if the difference between μ and μ′ is more than 5 years. Using Equation 7.1 we obtain

Z = (38 − 30 − 5) / √(3.5²/50 + 6.5²/200) = 4.4414.

For a one-sided hypothesis test at a 95% confidence level, the critical value for rejecting the null hypothesis is 1.64. Since Z > 1.64, the null hypothesis can be rejected. We therefore conclude that the quantitative association rule is interesting because the difference between the average ages of users who support and do not support the rule antecedent is more than 5 years.
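As a sanity check on the arithmetic in Example 7.1, the test statistic of Equation 7.1 can be sketched in a few lines (the function name and layout are ours; the numbers come from the example):

```python
from math import sqrt

def z_statistic(mu1, mu2, delta, s1, n1, s2, n2):
    """Z statistic of Equation 7.1: compares the difference of two
    group means against a minimum required difference delta."""
    return (mu1 - mu2 - delta) / sqrt(s1**2 / n1 + s2**2 / n2)

# Example 7.1: 50 supporting users (mean age 38, sd 3.5),
# 200 non-supporting users (mean age 30, sd 6.5), delta = 5 years.
z = z_statistic(38, 30, 5, 3.5, 50, 6.5, 200)
print(round(z, 4))  # 4.4414
print(z > 1.64)     # True: reject the null hypothesis
```

Since Z exceeds the one-sided critical value 1.64, the rule is deemed interesting, matching the conclusion in the text.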

7.2.3 Non-discretization Methods

There are certain applications in which analysts are more interested in finding associations among the continuous attributes, rather than associations among discrete intervals of the continuous attributes. For example, consider the problem of finding word associations in text documents, as shown in Table 7.6. Each entry in the document-word matrix represents the normalized frequency count of a word appearing in a given document. The data is normalized by dividing the frequency of each word by the sum of the word frequency across all documents. One reason for this normalization is to make sure that the resulting support value is a number between 0 and 1. However, a more

Table 7.6. Normalized document-word matrix.

Document   word1   word2   word3   word4   word5   word6
d1         0.3     0.6     0       0       0       0.2
d2         0.1     0.2     0       0       0       0.2
d3         0.4     0.2     0.7     0       0       0.2
d4         0.2     0       0.3     0       0       0.1
d5         0       0       0       1.0     1.0     0.3


important reason is to ensure that the data is on the same scale so that sets of words that vary in the same way have similar support values.

In text mining, analysts are more interested in finding associations between words (e.g., data and mining) instead of associations between ranges of word frequencies (e.g., data ∈ [1,4] and mining ∈ [2,3]). One way to do this is to transform the data into a 0/1 matrix, where the entry is 1 if the normalized frequency count exceeds some threshold t, and 0 otherwise. While this approach allows analysts to apply existing frequent itemset generation algorithms to the binarized data set, finding the right threshold for binarization can be quite tricky. If the threshold is set too high, it is possible to miss some interesting associations. Conversely, if the threshold is set too low, there is a potential for generating a large number of spurious associations.
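A minimal sketch of this binarization step, using the Table 7.6 values and an illustrative threshold of 0.25 (the threshold choice is ours, precisely the tricky knob the text warns about):

```python
# Rows = documents d1..d5, columns = word1..word6 (values from Table 7.6).
matrix = [
    [0.3, 0.6, 0.0, 0.0, 0.0, 0.2],  # d1
    [0.1, 0.2, 0.0, 0.0, 0.0, 0.2],  # d2
    [0.4, 0.2, 0.7, 0.0, 0.0, 0.2],  # d3
    [0.2, 0.0, 0.3, 0.0, 0.0, 0.1],  # d4
    [0.0, 0.0, 0.0, 1.0, 1.0, 0.3],  # d5
]

def binarize(rows, t):
    """Map each normalized frequency to 1 if it exceeds threshold t, else 0."""
    return [[1 if v > t else 0 for v in row] for row in rows]

print(binarize(matrix, 0.25)[0])  # d1 -> [1, 1, 0, 0, 0, 0]
```

With a lower threshold more entries turn into 1s (risking spurious itemsets); with a higher one, genuine associations such as word1 in d1 disappear.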

This section presents another methodology for finding word associations known as min-Apriori. Analogous to traditional association analysis, an itemset is considered to be a collection of words, while its support measures the degree of association among the words. The support of an itemset can be computed based on the normalized frequency of its corresponding words. For example, consider the document d1 shown in Table 7.6. The normalized frequencies for word1 and word2 in this document are 0.3 and 0.6, respectively. One might think that a reasonable approach to compute the association between both words is to take the average value of their normalized frequencies, i.e., (0.3 + 0.6)/2 = 0.45. The support of an itemset can then be computed by summing up the averaged normalized frequencies across all the documents:

s({word1, word2}) = (0.3 + 0.6)/2 + (0.1 + 0.2)/2 + (0.4 + 0.2)/2 + (0.2 + 0)/2 = 1.

This result is by no means an accident. Because every word frequency is normalized to 1, averaging the normalized frequencies makes the support for every itemset equal to 1. All itemsets are therefore frequent using this approach, making it useless for identifying interesting patterns.
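The observation that averaging drives every support to 1 can be checked directly (values below are the word1 and word2 columns of Table 7.6):

```python
# Averaging normalized frequencies: since each word's column sums to 1,
# the summed averages for any pair of words also come out to 1.
word1 = [0.3, 0.1, 0.4, 0.2, 0.0]
word2 = [0.6, 0.2, 0.2, 0.0, 0.0]

support = sum((a + b) / 2 for a, b in zip(word1, word2))
print(round(support, 4))  # 1.0
```

The same holds for any itemset, which is exactly why the averaging approach cannot distinguish interesting word combinations from uninteresting ones.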


In min-Apriori, the association among words in a given document is obtained by taking the minimum value of their normalized frequencies, i.e., min(word1, word2) = min(0.3, 0.6) = 0.3. The support of an itemset is computed by aggregating its association over all the documents:

s({word1, word2}) = min(0.3, 0.6) + min(0.1, 0.2) + min(0.4, 0.2) + min(0.2, 0) = 0.6.

The support measure defined in min-Apriori has the following desired properties, which make it suitable for finding word associations in documents:

1. Support increases monotonically as the normalized frequency of a word increases.

2. Support increases monotonically as the number of documents that con- tain the word increases.

3. Support has an anti-monotone property. For example, consider a pair of itemsets {A, B} and {A, B, C}. Since min({A, B}) ≥ min({A, B, C}), s({A, B}) ≥ s({A, B, C}). Therefore, support decreases monotonically as the number of words in an itemset increases.

The standard Apriori algorithm can be modified to find associations among words using the new support definition.
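The min-based support and its anti-monotone behavior can be sketched as follows (columns are the first three words of Table 7.6; the helper name is ours):

```python
# min-Apriori support: for each document take the minimum normalized
# frequency over the words in the itemset, then sum over all documents.
matrix = {
    "word1": [0.3, 0.1, 0.4, 0.2, 0.0],
    "word2": [0.6, 0.2, 0.2, 0.0, 0.0],
    "word3": [0.0, 0.0, 0.7, 0.3, 0.0],
}

def min_support(itemset, m):
    n_docs = len(next(iter(m.values())))
    return sum(min(m[w][d] for w in itemset) for d in range(n_docs))

print(round(min_support({"word1", "word2"}, matrix), 4))           # 0.6
# Anti-monotone: adding word3 cannot increase the support.
print(round(min_support({"word1", "word2", "word3"}, matrix), 4))  # 0.2
```

This matches the hand computation above, and the drop from 0.6 to 0.2 illustrates property 3: growing the itemset can only shrink the per-document minimum.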

7.3 Handling a Concept Hierarchy

A concept hierarchy is a multilevel organization of the various entities or concepts defined in a particular domain. For example, in market basket analysis, a concept hierarchy has the form of an item taxonomy describing the "is-a" relationships among items sold at a grocery store, e.g., milk is a kind of food and DVD is a kind of home electronics equipment (see Figure 7.2). Concept hierarchies are often defined according to domain knowledge or based on a standard classification scheme defined by certain organizations (e.g., the Library of Congress classification scheme is used to organize library materials based on their subject categories).

A concept hierarchy can be represented using a directed acyclic graph, as shown in Figure 7.2. If there is an edge in the graph from a node p to another node c, we call p the parent of c and c the child of p. For example,

Figure 7.2. Example of an item taxonomy.

milk is the parent of skim milk because there is a directed edge from the node milk to the node skim milk. X′ is called an ancestor of X (and X a descendent of X′) if there is a path from node X′ to node X in the directed acyclic graph. In the diagram shown in Figure 7.2, food is an ancestor of skim milk and AC adaptor is a descendent of electronics.

The main advantages of incorporating concept hierarchies into association analysis are as follows:

1. Items at the lower levels of a hierarchy may not have enough support to appear in any frequent itemsets. For example, although the sale of AC adaptors and docking stations may be low, the sale of laptop accessories, which is their parent node in the concept hierarchy, may be high. Unless the concept hierarchy is used, there is a potential to miss interesting patterns involving the laptop accessories.

2. Rules found at the lower levels of a concept hierarchy tend to be overly specific and may not be as interesting as rules at the higher levels. For example, staple items such as milk and bread tend to produce many low-level rules such as skim milk → wheat bread, 2% milk → wheat bread, and skim milk → white bread. Using a concept hierarchy, they can be summarized into a single rule, milk → bread. Considering only items residing at the top level of their hierarchies may not be good enough because such rules may not be of any practical use. For example, although the rule electronics → food may satisfy the support and



confidence thresholds, it is not informative because the combination of electronics and food items that are frequently purchased by customers is unknown. If milk and batteries are the only items sold together frequently, then the pattern {food, electronics} may have overgeneralized the situation.

Standard association analysis can be extended to incorporate concept hierarchies in the following way. Each transaction t is initially replaced with its extended transaction t′, which contains all the items in t along with their corresponding ancestors. For example, the transaction {DVD, wheat bread} can be extended to {DVD, wheat bread, home electronics, electronics, bread, food}, where home electronics and electronics are the ancestors of DVD, while bread and food are the ancestors of wheat bread. With this approach, existing algorithms such as Apriori can be applied to the extended database to find rules that span different levels of the concept hierarchy. This approach has several obvious limitations:

1. Items residing at the higher levels tend to have higher support counts than those residing at the lower levels of a concept hierarchy. As a result, if the support threshold is set too high, then only patterns involving the high-level items are extracted. On the other hand, if the threshold is set too low, then the algorithm generates far too many patterns (most of which may be spurious) and becomes computationally inefficient.

2. Introduction of a concept hierarchy tends to increase the computation time of association analysis algorithms because of the larger number of items and wider transactions. The number of candidate patterns and frequent patterns generated by these algorithms may also grow expo- nentially with wider transactions.

3. Introduction of a concept hierarchy may produce redundant rules. A rule X → Y is redundant if there exists a more general rule X′ → Y′, where X′ is an ancestor of X, Y′ is an ancestor of Y, and both rules have very similar confidence. For example, suppose {bread} → {milk}, {white bread} → {2% milk}, {wheat bread} → {2% milk}, {white bread} → {skim milk}, and {wheat bread} → {skim milk} have very similar confidence. The rules involving items from the lower levels of the hierarchy are considered redundant because they can be summarized by a rule involving the ancestor items. An itemset such as {skim milk, milk, food} is also redundant because food and milk are ancestors of skim milk. Fortunately, it is easy to eliminate such redundant itemsets during frequent itemset generation, given the knowledge of the hierarchy.
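The transaction-extension step and the hierarchy-based redundancy check described above can be sketched as follows (the child-to-parent map is a toy slice of the Figure 7.2 taxonomy; the function names are ours):

```python
# Child -> parent map encoding a small piece of the item taxonomy.
parent = {
    "skim milk": "milk", "milk": "food",
    "wheat bread": "bread", "bread": "food",
    "DVD": "home electronics", "home electronics": "electronics",
}

def ancestors(item):
    """All ancestors of item obtained by walking the child -> parent map."""
    out = set()
    while item in parent:
        item = parent[item]
        out.add(item)
    return out

def extend(transaction):
    """Replace t with t': t plus the ancestors of every item in t."""
    ext = set(transaction)
    for item in transaction:
        ext |= ancestors(item)
    return ext

def is_redundant(itemset):
    """An itemset is redundant if it contains an item and one of its ancestors."""
    return any(ancestors(item) & itemset for item in itemset)

print(sorted(extend({"DVD", "wheat bread"})))
# ['DVD', 'bread', 'electronics', 'food', 'home electronics', 'wheat bread']
print(is_redundant({"skim milk", "milk", "food"}))  # True
print(is_redundant({"skim milk", "bread"}))         # False
```

Running a frequent itemset algorithm over extended transactions and discarding itemsets flagged by `is_redundant` corresponds to the pruning the text says is easy to perform given knowledge of the hierarchy.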

7.4 Sequential Patterns
