425] CLUTO 2.1.1: Software for Clustering High-Dimensional Datasets’

/www.cs.umn.edu/-161to1s, November 2003.

14261 L. Ertoz, M. Steinbach, and V. Kumar. A New Shared Nearest Neighbor Clustering

Algorithm and its Applications. In Workshop on Clustering Hi,gh Dimensional Data

and,’its Appl’ications, Proc. of Tent Mine’}l, Fi,rst SIAM IntI. Conf. on Data M’in’ing,

Ch’icago, IL, USA, 2001.

14271 L. Ertriz, M. Steinbach, and V. Kumar. Finding Clusters of Different Sizes, Shapes,

and Densities in Noisy, High Dimensional Data. In Proc. of the 2003 SIAM Intl. Conf.

on Data Min’ing, San FYancisco, May 2003. SIAM.

1428] D. Fisher and P. Langley. Conceptual clustering and its relation to numerical taxon-

olny. Artifi,cial Intelli,gence and Stati,sti,cs, pages 77-1L6, 1986.

14291 C. Ftaley and A. E. Raftery. How Many Clusters? Which Clustering Method? Answers

Via Model-Based Cluster Analysis. The Computer Journal,4l(8):578 588, 1998.

1430] V. Ganti, J. Gehrke, and R. Ramakrishnan. CACTUS Clustering Categorical Data

Using Summaries. In Proc. of the 5th Intl. Conf. on Knowled,ge Discouery and Data

Min’ing, pages 73 83. ACM Press, 1999.

[431] A. Gersho and R. M. Gray. Vector Quantizati,on and Si,gnal Cornpression, volume 159

of Kluwer Intemr,ational Series i,n Engr,neering and Computer Sc’ience. Kluwer Academic

Publishers, 1992.

1432] J. Ghosh. Scalable Clustering Methods for Data Mining. In N. Ye, editor, Handbook

of Data Mining, pages 247-277. Lawrence Ealbaum Assoc, 2003.

[433] D. Gibson, J. M. Kleinberg, and P. Raghavan. Clustering Categorical Data: An

Approach Based on Dynamical Systems. VLDB Journal’ 8(3-4):222 236,2000.

1434] K. C. Gowda and G. Krishna. Agglomerative Clustering Using the Concept of Mutual

Nearest Neighborhoo d. P att ern Recognit’ion, 10 (2) : 105-1 1 2, 1978.

[435] S. Guha, A. Meyerson, N. Mishra, R. Motwani, and L. O’Callaghan. Clustering Data

Streams: Theory and Practice. IEEE Transacti,ons on Knowled,ge and Data Eng’ineering,

I 5(3):51 5-528, May/June 2003.

1436] S. Guha, R. Rastogi, and K. Shim. CURE: An Efficient Clustering Algorithm for Large

Databases. In Proc. of 1998 ACM-SIGMOD Intl. Conf. on Management of Data, pages

73 84. ACM Press, June 1998.

[437] S. Guha, R. Rastogi, and K. Shim. ROCK: A Robust Clustering Algorithm for Cate-

gorical Attributes. In Proc. of the 15th Intl. Conf. on Data Engineering, pages 512 521.

IEEE Computer Society, March 1999.

[438] tr.-H. Han, G. Karypis, V. Kumar, and B. Mobasher. Hypergraph Based Clustering

in High-Dimensional Data Sets: A Summary of Results. IEEE Data Eng. Bulletin,2l ( 1 ) : 1 5 – 2 2 , 1 9 9 8 .

646 Chapter 9 Cluster Analysis: Additional Issues and Algorithms

1439] A. Hinneburg and D. A. Keim. An Efficient Approach to Clustering in Large Multi- media Databases with Noise. In Proc. of the lth Intl. Conf. on Knowledge D’iscouerg and Data M’ining, pages 58 65, New York City, August 1998. AAAI Press.

[440] A. Hinneburg and D. A. Keim. Optimal Grid-Clustering: Towards Breaking the Curse of Dimensionality in High-Dimensional Clustering. In Proc. of the 25th VLDB Conf., pages 506 517, trdinburgh, Scotland, UK, September 1999. Morgan Kaufmann.

l44ll F. Hcippner, F. Klawonn, R. Kruse, and T. Runkler. Fuzzy Cluster Analysis: Methods

for Classi,fi,cat’ion, Data Analysi,s and Image Recogni,tion. John Wiley & Sons, New York, Julv 2 1999.

14421 R. A. Jarvis and E. A. Patrick. Clustering Using a Similarity Measure Based on Shared Nearest Neighbors. IEEE Transactions on Computers, C-22(ll):1025 1034, 1973.

14431 I. Jonyer, D. J. Cook, and L. B. Holder. Graph-based hierarchical conceptual cluster- ing. Journal of Machine Learning Research,2:19-43, 2002.

1444] K. Kailing, H.-P. Kriegel, and P. Kroger. Density-Connected Subspace Clustering for High-Dimensional Data. In Proc. of the 2001 SIAM Intl. Conf. on Data Mini,ng, pages 428-439, Lake Buena Vista, Florida, April 2004. SIAM.

1445] G. Karypis, E.-H. Han, and V. Kumar CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling. IEEE Computer,32(8):68-75, August 1999.

1446] G. Karypis and V. Kumar. Multilevel k-way Partitioning Scheme for Irregular Graphs. Journal of Parallel and Di,stri,buted Computing, 48(1):96-129, 1998.

14471 T. Kohonen, T. S. Huang, and M. R. Schroeder. Self-Organ’izing Maps. Springer- Verlag, December 2000.

[448] R. S. Michalski and R. E. Stepp. Automated Construction of Classifications: Concep- tual Clustering Versus Numerical Taxonomy. IEEE Transactions on Pattern Analyszs and” M achine Intelli,g ence, 5(4) :396 409, 1983.

[449] N. Mishra, D. Ron, and R. Swaminathan. A New Conceptual Clustering Framework. M achine Learni,ng J ournal, 56 ( 1-3) : 1 1 5-1 5 1, July/August/September 2004.

f450] T. Mitchell. Mach’ine Learning. McGraw-Hill, Boston, MA, 1997.

[451] F. Murtagh. Clustering massive data sets. In J. Abello, P. M. Pardalos, and M. G. C. Reisende, editors, Handbook of Massi,ue Data Sets. Kluwer, 2000.

14521 H. Nagesh, S. Goil, and A. Choudhary. Parallel Algorithms for Clustering High- Dimensional Large-Scale Datasets. In R. L. Grossman, C. Kamath, P. Kegelmeyer, V Kumar, and R. Namburu, editors, Data Mining for Sci,entific and Eng’ineering Appli- cat’ions, pages 335 356. Kluwer Academic Publishers, Dordrecht, Netherlands, October 2001.

1453] R. T. Ng and J. Han. CLARANS: A Method for Clustering Objects for Spatial Data Mining. IEEE Transactr,ons on Knowled.ge and Data Engineering, 14(5):1003-1016, 2002.

[454] M. Peters and M. J. Zaki. CLICKS: Clustering Categorical Data using K-partite Maximal Cliques. ln Proc. of the 21st IntI. Conf. on Data Eng’ineering, Tokyo, Japan, April 2005.

1455] E. Schikuta and M. Erhart. The BANG-Clustering System: Grid-Based Data Analy- sis. In Aduances in Intelli,gent Data Analgsis, Reasoning about Data, Second Intl. Sym-

Ttos’ium, IDA-97, London, volume 1280 of Lecture Notes ‘in Computer Sc’ience, pages 513 524. Springer, August 1997.

[456] G. Sheikholeslami, S. Chatterjee, and A. Zhang. Wavecluster: A multi-resolution clustering approach for very large spatial databases. In Proc. of the 2lth VLDB Conf., pages 428 439, New York City, August 1998. Morgan Kaufmann.

9.8 Exercises 647

[457] M. Steinbach, P.-N. Tan, V. Kumar, S. Klooster, and C. Potter. Discovery of climate

indices using clustering. In KDD ’03: Proceed,ings of the ninth ACM SIGKDD ‘interna’

tional conference on Knowledge discouery and data mi’n’ing, pages 446-455, New York,

NY. USA. 2003. ACM Press.

[45S] R. E. Stepp and R. S. Michalski. Conceptual clustering of structured objects: A

goal-oriented approach. Arti,fici,al Intelligence, 28(1) :43-69, 1986

1459] A. Strehl and J. Ghosh. A Scalable Approach to Balanced, High-dimensional Cluster-

ing of Market-Baskets. In Proc. of the 7th Intl. Conf. on High Performance Computing

(H|PC 2000), vohtme 1970 of Lecture Notes in Computer Science, pages 525 536, Ban-

galore, India, December 2000. Springer.

[460] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: an efficient data clustering method

for very large databases. In Proc. of 1996 ACM-SIGMOD Intl. Conf. on Management

of Data, pages 103-114, Montreal, Quebec, Canada, June 1996. ACM Press.

9.8 Exercises

For sparse data, discuss why considering only the presence of non-zero values

might give a more accurate view of the objects than considering the actual

magnitudes of ralues. When would such an approach not be desirable?

Describe the change in the time complexity of K-means as the number of clusters

to be found increases.

Consider a set of documents. Assume that all documents have been normalized

to have unit length of 1. What is the “shape” of a cluster that consists of all

documents whose cosine similarity to a centroid is greater than some specified

constant? In other words, cos(d, c) ) d, where 0 < d < 1.

Discuss the advantages and disadvantages of treating clustering as an optimiza-

tion problem. Among other factors, consider efficiency, non-determinism, and

whether an optimization-based approach captures all types of clusterings that

are of interest.

What is the time and space complexity of fuzzy c-means? Of SOM? How do

these complexities compare to those of K-means?

Tladitional K-means has a number of limitations, such as sensitivity to outliers

and difficulty in handling clusters of different sizes and densities, or with non-

globular shapes. Comment on the ability of fivzy c-means to handle these

situations.

7. For tlne fuzzy c-means algorithm described in this book, the sum of the mem-

bership degree of any point over all clusters is 1. Instead, we could only require

that the membership degree of a point in a cluster be between 0 and 1. What

are the advantages and disadvantages of such an approach?

8. Explain the difference between likelihood and probability.

1 .

2 .

.).

4 .

K

6.

648 Chapter 9 Cluster Analysis: Additional Issues and Algorithms

* * , (*

Figure 9.28. Data set for Exercise 12, EM clustering of a two-dimensional point set with two clusters of differing density.

Equation 9.12 gives the likelihood for a set of points from a Gaussian distribu- tion as a function of the mean p and the standard deviation o. Show math- ematically that the maximum likelihood estimate of p and o are the sample mean and the sample standard deviation, respectively.

We take a sample of adults and measure their heights. If we record the gender of each person, we can calculate the average height and the variance of the height, separately, for men and women. Suppose, however, that this information was not recorded. Would it be possible to still obtain this information? Explain.

Compare the membership weights and probabilities of Figures 9.1 and 9.4, which come, respectively, from applying fuzzy and EM clustering to the same set of data points. What differences do you detect, and how might you explain these differences?

Figure 9.28 shows a clustering of a two-dimensional point data set with two clusters: The leftmost cluster, whose points are marked by asterisks, is some- what diffuse, while the rightmost cluster, whose points are marked by circles, is compact. To the right of the compact cluster, there is a single point (marked by an arrow) that belongs to the diffuse cluster, whose center is farther away than that of the compact cluster. Explain why this is possible with EM clustering, but not K-means clustering.

a

10.

1 1 .

t2 .

13.

14.

9.8 Exercises 649

Show that the MST clustering technique of Section 9.4.2 produces the same clusters as single link. To avoid complications and special cases) assume that all the pairwise similarities are distinct.

One way to sparsify a proximity matrix is the following: For each object (row

in the matrix), set all entries to 0 except for those corresponding to the objects k-nearest neighbors. However, the sparsified proximity matrix is typically not symmetric.

(a) If object o is among the k-nearest neighbors of object b, why is b not guaranteed to be among the k-nearest neighbors of a.?

(b) Suggest at least two approaches that could be used to make the sparsified proximity matrix symmetric.

Give an example of a set of clusters in which merging based on the closeness of clusters Ieads to a more natural set of clusters than merging based on the

strength of connection (interconnectedness) of clusters.

Table 9.4 lists the two nearest neighbors of four points.

Table 9,4. Two nearest neighbors of four points.

Point First Neiehbor Second Neighbor I A

2 J 4 a A 2 4 .) 1

Calculate the SNN similarity between each pair of points using the definition of SNN similarity defined in Algorithm 9.10.

For the definition of SNN similarity provided by Algorithm 9.10, the calculation of SNN distance does not take into account the position of shared neighbors in the two nearest neighbor lists. In other words, it might be desirable to give

higher similarity to two points that share the same nearest neighbors in the same or roughly the same order.

(a) Describe how you might modify the definition of SNN similarity to give

higher similarity to points whose shared neighbors are in roughly the same order.

(b) Discuss the advantages and disadvantages of such a modification.

Name at least one situation in which you would not want to use clustering based on SNN similarity or density.

Grid-clustering techniques are different from other clustering techniques in that

they partition space instead of sets of points.

15.

16 .

77 .

18

19.

650 Chapter 9 Cluster Analysis: Additional Issues and Algorithms

(a) How does this affect such techniques in terms of the description of the resulting clusters and the types of clusters that can be found?

(b) What kind of cluster can be found with grid-based clusters that cannot be found by other types of clustering approaches? (Hint: See Exercise 20 in Chapter 8, page 564.)

In CLIQUE, the threshold used to find cluster density remains constant, even as the number of dimensions increases. This is a potential problem since density drops as dimensionality increases; i.e., to find clusters in higher dimensions the threshold has to be set at a level that may well result in the merging of low- dimensional clusters. Comment on whether you feel this is truly a problem and, if so, how you might modify CLIQUE to address this problem.

Given a set of points in Euclidean space, which are being clustered using the K-means algorithm with Euclidean distance, the triangle inequality can be used in the assignment step to avoid calculating all the distances of each point to each cluster centroid. Provide a general discussion of how this might work.

Instead of using the formula derived in CURE see Equation 9.19–we could run a Monte Carlo simulation to directly estimate the probability that a sample of size s would contain at least a certain fraction of the points from a cluster. Using a Monte Carlo simulation compute the probability that a sample of size s contains 50% of the elements of a cluster of size 100, where the total number of points is 1000, and where s can take the values 100, 200, or 500.

20.

2r .

22.

10

Anomaly Detection