1. In the initial example of Chapter 2, the statistician says, “Yes, fields 2 and 3 are basically the same.” Can you tell from the three lines of sample data that are shown why she says that?

2.6 Exercises

2. Classify the following attributes as binary, discrete, or continuous. Also classify them as qualitative (nominal or ordinal) or quantitative (interval or ratio). Some cases may have more than one interpretation, so briefly indicate your reasoning if you think there may be some ambiguity.

Example: Age in years. Answer: Discrete, quantitative, ratio

(a) Time in terms of AM or PM.

(b) Brightness as measured by a light meter.

(c) Brightness as measured by people’s judgments.

(d) Angles as measured in degrees between 0 and 360.

(e) Bronze, Silver, and Gold medals as awarded at the Olympics.

(f) Height above sea level.

(g) Number of patients in a hospital.

(h) ISBN numbers for books. (Look up the format on the Web.)

(i) Ability to pass light in terms of the following values: opaque, translucent, transparent.

(j) Military rank.

(k) Distance from the center of campus.

(l) Density of a substance in grams per cubic centimeter.

(m) Coat check number. (When you attend an event, you can often give your coat to someone who, in turn, gives you a number that you can use to claim your coat when you leave.)

3. You are approached by the marketing director of a local company, who believes that he has devised a foolproof way to measure customer satisfaction. He explains his scheme as follows: “It’s so simple that I can’t believe that no one has thought of it before. I just keep track of the number of customer complaints for each product. I read in a data mining book that counts are ratio attributes, and so, my measure of product satisfaction must be a ratio attribute. But when I rated the products based on my new customer satisfaction measure and showed them to my boss, he told me that I had overlooked the obvious, and that my measure was worthless. I think that he was just mad because our best-selling product had the worst satisfaction since it had the most complaints. Could you help me set him straight?”

(a) Who is right, the marketing director or his boss? If you answered, his boss, what would you do to fix the measure of satisfaction?

(b) What can you say about the attribute type of the original product satisfaction attribute?

4. A few months later, you are again approached by the same marketing director as in Exercise 3. This time, he has devised a better approach to measure the extent to which a customer prefers one product over other, similar products. He explains, “When we develop new products, we typically create several variations and evaluate which one customers prefer. Our standard procedure is to give our test subjects all of the product variations at one time and then ask them to rank the product variations in order of preference. However, our test subjects are very indecisive, especially when there are more than two products. As a result, testing takes forever. I suggested that we perform the comparisons in pairs and then use these comparisons to get the rankings. Thus, if we have three product variations, we have the customers compare variations 1 and 2, then 2 and 3, and finally 3 and 1. Our testing time with my new procedure is a third of what it was for the old procedure, but the employees conducting the tests complain that they cannot come up with a consistent ranking from the results. And my boss wants the latest product evaluations, yesterday. I should also mention that he was the person who came up with the old product evaluation approach. Can you help me?”

(a) Is the marketing director in trouble? Will his approach work for generating an ordinal ranking of the product variations in terms of customer preference? Explain.

(b) Is there a way to fix the marketing director’s approach? More generally, what can you say about trying to create an ordinal measurement scale based on pairwise comparisons?

(c) For the original product evaluation scheme, the overall rankings of each product variation are found by computing its average over all test subjects. Comment on whether you think that this is a reasonable approach. What other approaches might you take?

5. Can you think of a situation in which identification numbers would be useful for prediction?

6. An educational psychologist wants to use association analysis to analyze test results. The test consists of 100 questions with four possible answers each.

(a) How would you convert this data into a form suitable for association analysis?

(b) In particular, what type of attributes would you have and how many of them are there?

7. Which of the following quantities is likely to show more temporal autocorrelation: daily rainfall or daily temperature? Why?

8. Discuss why a document-term matrix is an example of a data set that has asymmetric discrete or asymmetric continuous features.

9. Many sciences rely on observation instead of (or in addition to) designed experiments. Compare the data quality issues involved in observational science with those of experimental science and data mining.

10. Discuss the difference between the precision of a measurement and the terms single and double precision, as they are used in computer science, typically to represent floating-point numbers that require 32 and 64 bits, respectively.
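One way to make the computer-science sense of these terms concrete is to store the same value in both formats and read it back. The following is a minimal sketch, assuming Python's standard struct module; the helper name round_trip and the reading of 0.1 are hypothetical and chosen only for illustration.

```python
import struct

def round_trip(value: float, fmt: str) -> float:
    """Pack a value into the given IEEE 754 format and unpack it again."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]

reading = 0.1  # hypothetical value, not taken from the text

as_single = round_trip(reading, "<f")  # stored in 32 bits
as_double = round_trip(reading, "<d")  # stored in 64 bits

# Show the storage size in bits and the value that survives the round trip.
print(struct.calcsize("<f") * 8, repr(as_single))
print(struct.calcsize("<d") * 8, repr(as_double))
```

The sketch says nothing about how the reading was obtained; it only shows what each storage format does to a number it is handed.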

11. Give at least two advantages to working with data stored in text files instead of in a binary format.

12. Distinguish between noise and outliers. Be sure to consider the following questions.

(a) Is noise ever interesting or desirable? Outliers?

(b) Can noise objects be outliers?

(c) Are noise objects always outliers?

(d) Are outliers always noise objects?

(e) Can noise make a typical value into an unusual one, or vice versa?

13. Consider the problem of finding the K nearest neighbors of a data object. A programmer designs Algorithm 2.2 for this task.

Algorithm 2.2 Algorithm for finding K nearest neighbors.
1: for i = 1 to number of data objects do
2:   Find the distances of the ith object to all other objects.
3:   Sort these distances in decreasing order. (Keep track of which object is associated with each distance.)
4:   return the objects associated with the first K distances of the sorted list
5: end for

(a) Describe the potential problems with this algorithm if there are duplicate objects in the data set. Assume the distance function will only return a distance of 0 for objects that are the same.

(b) How would you fix this problem?
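For readers who want to experiment with Algorithm 2.2, the following is a rough Python sketch of it as stated, not a repaired version. Several things are assumptions: Euclidean distance is used, the result for every object is collected in a list rather than returned inside the loop, and the sort puts the smallest distances first, since step 3 as printed ("decreasing order") would put the farthest objects first. The function name and the sample points are hypothetical.

```python
import math

def k_nearest_neighbors(points, k):
    """For each object, return the indices of its k nearest other objects."""
    neighbors = []
    for i, p in enumerate(points):
        # Step 2: distances from object i to all other objects,
        # paired with the index of the object each distance belongs to.
        distances = [(math.dist(p, q), j) for j, q in enumerate(points) if j != i]
        # Step 3 (assumed intent): smallest distances first.
        distances.sort()
        # Step 4: keep the objects behind the first k distances.
        neighbors.append([j for _, j in distances[:k]])
    return neighbors

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]  # hypothetical data
print(k_nearest_neighbors(points, 2))
```

Adding a repeated point to the list is a quick way to explore part (a).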

14. The following attributes are measured for members of a herd of Asian elephants: weight, height, tusk length, trunk length, and ear area. Based on these measurements, what sort of similarity measure from Section 2.4 would you use to compare or group these elephants? Justify your answer and explain any special circumstances.

15. You are given a set of $m$ objects that is divided into $K$ groups, where the ith group is of size $m_i$. If the goal is to obtain a sample of size $n < m$, what is the difference between the following two sampling schemes? (Assume sampling with replacement.)

(a) We randomly select $n \times m_i / m$ elements from each group.

(b) We randomly select $n$ elements from the data set, without regard for the group to which an object belongs.
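The two schemes can be tried out with a short sketch such as the one below. It is only an illustration under stated assumptions: groups are plain Python lists, n × m_i / m is assumed to be a whole number for every group, and the function names, group contents, and seed are hypothetical.

```python
import random

def sample_per_group(groups, n, rng):
    """Scheme (a): draw n * m_i / m objects from each group, with replacement."""
    m = sum(len(g) for g in groups)
    sample = []
    for g in groups:
        # Assumes n * m_i / m is a whole number; otherwise a rounding rule is needed.
        quota = n * len(g) // m
        sample.extend(rng.choices(g, k=quota))
    return sample

def sample_pooled(groups, n, rng):
    """Scheme (b): draw n objects from the whole data set, ignoring groups."""
    pooled = [obj for g in groups for obj in g]
    return rng.choices(pooled, k=n)

rng = random.Random(0)  # fixed seed so that a run can be repeated
groups = [["a"] * 60, ["b"] * 30, ["c"] * 10]  # hypothetical groups, m = 100
print(sample_per_group(groups, 10, rng))
print(sample_pooled(groups, 10, rng))
```

Running both functions several times with different seeds and counting the group labels in each sample is one way to approach the question.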

16. Consider a document-term matrix, where $tf_{ij}$ is the frequency of the ith word (term) in the jth document and $m$ is the number of documents. Consider the variable transformation that is defined by

$$tf'_{ij} = tf_{ij} \times \log \frac{m}{df_i}, \tag{2.18}$$

where $df_i$ is the number of documents in which the ith term appears, which is known as the document frequency of the term. This transformation is known as the inverse document frequency transformation.

(a) What is the effect of this transformation if a term occurs in one document? In every document?

(b) What might be the purpose of this transformation?
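The transformation in Equation 2.18 can be tried on a small matrix with a sketch like the following. The base of the logarithm is not specified here, so the natural logarithm is an assumption, as is the choice to leave a term that appears in no document at zero. The function name and the counts are hypothetical.

```python
import math

def idf_transform(tf):
    """Apply tf'_ij = tf_ij * log(m / df_i) to a term-by-document matrix.

    tf[i][j] is the frequency of term i in document j.
    """
    m = len(tf[0])  # number of documents
    transformed = []
    for row in tf:
        # Document frequency: number of documents in which the term appears.
        df = sum(1 for count in row if count > 0)
        # Assumption: a term that appears nowhere keeps all-zero entries.
        weight = math.log(m / df) if df > 0 else 0.0
        transformed.append([count * weight for count in row])
    return transformed

# Hypothetical counts: 3 terms (rows) by 4 documents (columns).
tf = [
    [2, 0, 0, 0],
    [1, 3, 0, 2],
    [1, 1, 1, 1],
]
for row in idf_transform(tf):
    print([round(v, 3) for v in row])
```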

17. Assume that we apply a square root transformation to a ratio attribute x to obtain the new attribute x*. As part of your analysis, you identify an interval (a, b) in which x* has a linear relationship to another attribute y.

(a) What is the corresponding interval (a, b) in terms of x?

(b) Give an equation that relates y to x.

18. This exercise compares and contrasts some similarity and distance measures.

(a) For binary data, the L1 distance corresponds to the Hamming distance; that is, the number of bits that are different between two binary vectors. The Jaccard similarity is a measure of the similarity between two binary vectors. Compute the Hamming distance and the Jaccard similarity between the following two binary vectors.

x = 0101010001
y = 0100011000

(b) Which approach, Jaccard or Hamming distance, is more similar to the Simple Matching Coefficient, and which approach is more similar to the cosine measure? Explain. (Note: The Hamming measure is a distance, while the other three measures are similarities, but don’t let this confuse you.)


(c) Suppose that you are comparing how similar two organisms of different species are in terms of the number of genes they share. Describe which measure, Hamming or Jaccard, you think would be more appropriate for comparing the genetic makeup of two organisms. Explain. (Assume that each animal is represented as a binary vector, where each attribute is 1 if a particular gene is present in the organism and 0 otherwise.)

(d) If you wanted to compare the genetic makeup of two organisms of the same species, e.g., two human beings, would you use the Hamming distance, the Jaccard coefficient, or a different measure of similarity or distance? Explain. (Note that two human beings share > 99.9% of the same genes.)
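The binary measures named in this exercise can be written down in a few lines. The sketch below is illustrative only: vectors are represented as strings of 0s and 1s, the vectors p and q are hypothetical and deliberately not those of part (a), and returning 0.0 for the Jaccard similarity of two all-zero vectors is a convention assumed here.

```python
def hamming_distance(x, y):
    """Number of positions in which two equal-length bit strings differ."""
    return sum(a != b for a, b in zip(x, y))

def simple_matching(x, y):
    """Fraction of positions in which the bit strings agree (1-1 or 0-0)."""
    return sum(a == b for a, b in zip(x, y)) / len(x)

def jaccard_similarity(x, y):
    """1-1 matches divided by the positions where at least one string has a 1."""
    both = sum(a == "1" and b == "1" for a, b in zip(x, y))
    either = sum(a == "1" or b == "1" for a, b in zip(x, y))
    # Assumed convention: two all-zero vectors get a similarity of 0.0.
    return both / either if either else 0.0

p, q = "110010", "100011"  # hypothetical vectors, not those of the exercise
print(hamming_distance(p, q), simple_matching(p, q), jaccard_similarity(p, q))
```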

19. For the following vectors, x and y, calculate the indicated similarity or distance measures.

(a) x = (1, 1, 1, 1), y = (2, 2, 2, 2) cosine, correlation, Euclidean

(b) x = (0, 1, 0, 1), y = (1, 0, 1, 0) cosine, correlation, Euclidean, Jaccard

(c) x = (0, -1, 0, 1), y = (1, 0, -1, 0) cosine, correlation, Euclidean

(d) x = (1, 1, 0, 1, 0, 1), y = (1, 1, 1, 0, 0, 1) cosine, correlation, Jaccard

(e) x = (2, -1, 0, 2, 0, -3), y = (-1, 1, -1, 0, 0, -1) cosine, correlation
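Hand calculations for the real-valued measures above can be checked against a small sketch such as the following. It assumes the usual definitions: cosine as the dot product divided by the product of the L2 lengths, and correlation as the cosine of the mean-centered vectors. The function names and the vectors u and v are hypothetical.

```python
import math

def euclidean(x, y):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine(x, y):
    """Dot product divided by the product of the L2 lengths."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    # Raises ZeroDivisionError if either vector has length 0.
    return dot / (norm_x * norm_y)

def correlation(x, y):
    """Pearson correlation, computed as the cosine of the centered vectors."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    # A constant vector centers to all zeros, so the call below raises
    # ZeroDivisionError: correlation is undefined in that case.
    return cosine([a - mean_x for a in x], [b - mean_y for b in y])

u, v = (1, 2, 3), (3, 2, 1)  # hypothetical vectors, not those of the exercise
print(cosine(u, v), correlation(u, v), euclidean(u, v))
```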

20. Here, we further explore the cosine and correlation measures.

(a) What is the range of values that are possible for the cosine measure?

(b) If two objects have a cosine measure of 1, are they identical? Explain.

(c) What is the relationship of the cosine measure to correlation, if any? (Hint: Look at statistical measures such as mean and standard deviation in cases where cosine and correlation are the same and different.)

(d) Figure 2.20(a) shows the relationship of the cosine measure to Euclidean distance for 100,000 randomly generated points that have been normalized to have an L2 length of 1. What general observation can you make about the relationship between Euclidean distance and cosine similarity when vectors have an L2 norm of 1?

(e) Figure 2.20(b) shows the relationship of correlation to Euclidean distance for 100,000 randomly generated points that have been standardized to have a mean of 0 and a standard deviation of 1. What general observation can you make about the relationship between Euclidean distance and correlation when the vectors have been standardized to have a mean of 0 and a standard deviation of 1?

(f) Derive the mathematical relationship between cosine similarity and Euclidean distance when each data object has an L2 length of 1.

(g) Derive the mathematical relationship between correlation and Euclidean distance when each data point has been standardized by subtracting its mean and dividing by its standard deviation.
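A small-scale version of the experiment described in part (d) can be run with the sketch below. How the points behind Figure 2.20 were generated is not stated here, so the uniform random components, the dimension of 10, the number of pairs, and the seed are all assumptions, and the function name is hypothetical.

```python
import math
import random

def unit_vector(dim, rng):
    """Random point rescaled to have an L2 length of 1."""
    v = [rng.random() for _ in range(dim)]
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

rng = random.Random(0)  # fixed seed so that a run can be repeated
pairs = []
for _ in range(5):
    x, y = unit_vector(10, rng), unit_vector(10, rng)
    # Both vectors have length 1, so the dot product is the cosine measure.
    cos = sum(a * b for a, b in zip(x, y))
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    pairs.append((cos, dist))

# Print the pairs ordered by cosine so that any pattern is easier to see.
for cos, dist in sorted(pairs):
    print(f"cosine = {cos:.3f}   Euclidean = {dist:.3f}")
```

Increasing the number of pairs and plotting them gives a picture comparable in spirit to the one the exercise refers to.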