) Consider the database shown in Table 7.15. What are the support and confidence values for the following negative association rules involving regular and diet soda?

i. -Regular ——+ Diet.

I

II

20.

27.

7.8 Exercises 485

iii. -Diet ——+ Regular.

iv. Diet —+ -Regular.

Suppose we would like to extract positive and negative itemsets from a data set that contains d items.

(a) Consider an approach where we introduce a new variable to represent each negative item. With this approach, the number of items grows from d to 2d. What is the total size of the itemset lattice, assuming that an itemset may contain both positive and negative items of the same variable?

(b) Assume that an itemset must contain positive or negative items of different variables. For example, the itemset {a, a,b,Z} is invalid because it contains both positive and negative items for variable o. What is the total size of the itemset lattice?

For each type of pattern defined below, determine whether the support measure is monotone, anti-monotone, or non-monotone (i.e., neither monotone nor anti- monotone) with respect to increasing itemset size.

(a) Itemsets that contain both positive and negative items such as {a,b,Z,d}. Is the support measure monotone, anti-monotone, or non-monotone when applied to such patterns?

(b) Boolean logical patterns such as {(a v b V c), d, e}, which may con- tain both disjunctions and conjunctions of items. Is the support measure monotone, anti-monotone, or non-monotone when applied to such pat- terns?

Many association analysis algorithms rely on an Apriori-like approach for find- ing frequent patterns. The overall structure of the algorithm is given below.

Algorithm 7.3 Apri,ori,-like algorithm.

22.

23.

I : k : I . 2: F6 : { i I i eIA <#D 2 minsup}. {Find frequent l-patterns.} 3: repeat 4 : k : k + I . 5: Cr – genCandidate(Fp-1). {Candidate Generation} 6: 6o : pruneCandidate(Cr, Fr-r). {Candidate Pruning} 7: Cx: cotnt(Cu, D). {Support Counting}

8: Fu: {cl c€Cp A # > rninsup]1. {Extract frequent patterns} 9: until Fx : A

10: Answer : lJFr.

486 Chapter 7 Association Analysis: Advanced Concepts

Suppose we are interested in finding boolean logical rules such as

{o v b} —— {c,d},

which may contain both disjunctions and conjunctions of items. The corre- sponding itemset can be written as {(o Y b),c,d}.

(a) Does the Apriori. principle still hold for such itemsets?

(b) How should the candidate generation step be modified to find such pat- terns?

(c) How should the candidate pruning step be modified to find such patterns?

(d) How should the support counting step be modified to find such patterns?

Cluster Analysis: Basic Concepts and Algorithms

Cluster analysis divides data into groups (clusters) that are meaningful, useful, or both. Ifmeaningful groups are the goal, then the clusters should capture the natural structure of the data. In some cases, however, cluster analysis is only a useful starting point for other purposes, such as data summarization. Whether for understanding or utility, cluster analysis has long played an important role in a wide variety of fields: psychology and other social sciences, biology, statistics, pattern recognition, information retrieval, machine learning, and data mining.

There have been many applications of cluster analysis to practical prob- lems. We provide some specific examples, organized by whether the purpose of the clustering is understanding or utility.

Clustering for Understanding Classes, or conceptually meaningful groups of objects that share common characteristics, play an important role in how people analyze and describe the world. Indeed, human beings are skilled at dividing objects into groups (clustering) and assigning particular objects to these groups (classification). For example, even relatively young children can quickly label the objects in a photograph as buildings, vehicles, people, ani- mals, plants, etc. In the context of understanding data, clusters are potential classes and cluster analysis is the study of techniques for automatically finding classes. The following are some examples:

488 Chapter 8 Cluster Analysis: Basic Concepts and Algorithms

Biology. Biologists have spent many years creating a taxonomy (hi- erarchical classification) of all living things: kingdom, phylum, class, order, family, genus) and species. Thus, it is perhaps not surprising that much of the early work in cluster analysis sought to create a discipline of mathematical taxonomy that could automatically find such classifi- cation structures. More recently, biologists have applied clustering to analyze the large amounts of genetic information that are now available. For example, clustering has been used to find groups of genes that have similar functions.

Information Retrieval. The World Wide Web consists of billions of Web pages, and the results of a query to a search engine can return thousands of pages. Clustering can be used to group these search re- sults into a small number of clusters, each of which captures a particular aspect of the query. For instance, a query of “movie” might return Web pages grouped into categories such as reviews, trailers, stars, and theaters. Each category (cluster) can be broken into subcategories (sub- clusters), producing a hierarchical structure that further assists a user’s exploration of the query results.

Climate. Understanding the Earth’s climate requires finding patterns in the atmosphere and ocean. To that end, cluster analysis has been applied to find patterns in the atmospheric pressure of polar regions and areas of the ocean that have a significant impact on land climate.

Psychology and Medicine. An illness or condition frequently has a number of variations, and cluster analysis can be used to identify these different subcategories. For example, clustering has been used to identify different types of depression. Cluster analysis can also be used to detect patterns in the spatial or temporal distribution of a disease.

Business. Businesses collect large amounts of information on current and potential customers. Clustering can be used to segment customers into a small number of groups for additional analysis and marketing activities.

Clustering for Utility Cluster analysis provides an abstraction from in- dividual data objects to the clusters in which those data objects reside. Ad- ditionally, some clustering techniques characterize each cluster in terms of a cluster prototype; i.e., a data object that is representative of the other ob- jects in the cluster. These cluster prototypes can be used as the basis for a

489

number of data analysis or data processing techniques. Therefore, in the con- text of utility, cluster analysis is the study of techniques for finding the most representative cluster prototypes.

o Summarization. Many data analysis techniques, such as regression or PCA, have a time or space complexity of O(m2) or higher (where rn is the number of objects), and thus, are not practical for large data sets. However, instead of applying the algorithm to the entire data set, it can be applied to a reduced data set consisting only of cluster prototypes. Depending on the type of analysis, the number of prototypes, and the accuracy with which the prototypes represent the data, the results can be comparable to those that would have been obtained if all the data could have been used.