Attribute Type | Description | Examples | Operations
Nominal | The values of a nominal attribute are just different names; i.e., nominal values provide only enough information to distinguish one object from another. (=, ≠) | zip codes, employee ID numbers, eye color, gender | mode, entropy, contingency correlation, χ² test
Ordinal | The values of an ordinal attribute provide enough information to order objects. (<, >) | hardness of minerals, {good, better, best}, grades, street numbers | median, percentiles, rank correlation, run tests, sign tests
Interval | For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, -) | calendar dates, temperature in Celsius or Fahrenheit | mean, standard deviation, Pearson’s correlation, t and F tests
Ratio | For ratio variables, both differences and ratios are meaningful. (*, /) | temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current | geometric mean, harmonic mean, percent variation

Table 2.3. Transformations that define attribute levels.

Attribute Type | Transformation | Comment
Nominal | Any one-to-one mapping, e.g., a permutation of values | If all employee ID numbers are reassigned, it will not make any difference.
Ordinal | An order-preserving change of values, i.e., new_value = f(old_value), where f is a monotonic function. | An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.
Interval | new_value = a * old_value + b, a and b constants. | The Fahrenheit and Celsius temperature scales differ in the location of their zero value and the size of a degree (unit).
Ratio | new_value = a * old_value | Length can be measured in meters or feet.

2.1 Types of Data 27

the meaning of a length attribute is unchanged if it is measured in meters instead of feet.

The statistical operations that make sense for a particular type of attribute are those that will yield the same results when the attribute is transformed using a transformation that preserves the attribute’s meaning. To illustrate, the average length of a set of objects is different when measured in meters rather than in feet, but both averages represent the same length. Table 2.3 shows the permissible (meaning-preserving) transformations for the four attribute types of Table 2.2.
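The idea of a meaning-preserving transformation can be checked numerically. The following is a minimal sketch, not part of the original text: the sample lengths, the two groups of ordinal ratings, and all variable names are illustrative assumptions. It shows that a mean survives a ratio transformation (meters to feet), that a median survives an order-preserving relabeling, and that a conclusion based on means of ordinal codes can change under such a relabeling.

```python
from statistics import mean, median

# Hypothetical lengths in meters (illustrative values, not from the text).
lengths_m = [1.0, 2.0, 4.0, 10.0]

# Ratio transformation: new_value = a * old_value.
FEET_PER_METER = 3.28084
lengths_ft = [FEET_PER_METER * x for x in lengths_m]

# The two averages are different numbers, but converting the meter
# average to feet gives the feet average: they describe the same length.
mean_m = mean(lengths_m)
mean_ft = mean(lengths_ft)
same_length = abs(mean_m * FEET_PER_METER - mean_ft) < 1e-9

# Ordinal attribute coded as good=1, better=2, best=3, and the
# alternative order-preserving coding {0.5, 1, 10}.
relabel = {1: 0.5, 2: 1, 3: 10}
ratings = [1, 2, 3, 3, 3]
relabeled = [relabel[r] for r in ratings]

# The median commutes with the monotonic relabeling.
median_commutes = relabel[median(ratings)] == median(relabeled)

# A comparison of means does not survive the relabeling.
group_a, group_b = [1, 3], [2, 2]
means_equal_before = mean(group_a) == mean(group_b)
means_equal_after = (
    mean([relabel[r] for r in group_a]) == mean([relabel[r] for r in group_b])
)
```

Under these assumed values, `means_equal_before` is true while `means_equal_after` is false, which is why the mean is not listed as an operation for ordinal attributes.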

Example 2.5 (Temperature Scales). Temperature provides a good illustration of some of the concepts that have been described. First, temperature can be either an interval or a ratio attribute, depending on its measurement scale. When measured on the Kelvin scale, a temperature of 2° is, in a physically meaningful way, twice that of a temperature of 1°. This is not true when temperature is measured on either the Celsius or Fahrenheit scales, because, physically, a temperature of 1° Fahrenheit (Celsius) is not much different than a temperature of 2° Fahrenheit (Celsius). The problem is that the zero points of the Fahrenheit and Celsius scales are, in a physical sense, arbitrary, and therefore, the ratio of two Celsius or Fahrenheit temperatures is not physically meaningful.
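A short sketch of the temperature example, using the standard conversion formulas; the two sample temperatures and the variable names are assumptions chosen for illustration. The ratio of the same two temperatures changes with the scale when the zero point is arbitrary, while the difference changes only by the size of the unit.

```python
def celsius_to_kelvin(c):
    """Shift the zero point to absolute zero; the unit size is unchanged."""
    return c + 273.15

def celsius_to_fahrenheit(c):
    """Interval transformation: new_value = a * old_value + b."""
    return c * 9 / 5 + 32

# Two hypothetical temperatures, given in Celsius.
cold_c, warm_c = 10.0, 20.0

# Ratios disagree between Celsius and Fahrenheit, so neither is meaningful.
ratio_celsius = warm_c / cold_c                      # 2.0
ratio_fahrenheit = (
    celsius_to_fahrenheit(warm_c) / celsius_to_fahrenheit(cold_c)
)                                                    # 68 / 50 = 1.36
# On the Kelvin scale the ratio is physically meaningful (about 1.035).
ratio_kelvin = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cold_c)

# Differences are meaningful on both interval scales: they differ only
# by the unit factor 9/5.
diff_celsius = warm_c - cold_c
diff_fahrenheit = celsius_to_fahrenheit(warm_c) - celsius_to_fahrenheit(cold_c)
```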

28 Chapter 2 Data

Describing Attributes by the Number of Values

An independent way of distinguishing between attributes is by the number of values they can take.

Discrete A discrete attribute has a finite or countably infinite set of values. Such attributes can be categorical, such as zip codes or ID numbers, or numeric, such as counts. Discrete attributes are often represented using integer variables. Binary attributes are a special case of discrete attributes and assume only two values, e.g., true/false, yes/no, male/female, or 0/1. Binary attributes are often represented as Boolean variables, or as integer variables that only take the values 0 or 1.

Continuous A continuous attribute is one whose values are real numbers. Examples include attributes such as temperature, height, or weight. Continuous attributes are typically represented as floating-point variables. Practically, real values can only be measured and represented with limited precision.

In theory, any of the measurement scale types (nominal, ordinal, interval, and ratio) could be combined with any of the types based on the number of attribute values (binary, discrete, and continuous). However, some combinations occur only infrequently or do not make much sense. For instance, it is difficult to think of a realistic data set that contains a continuous binary attribute. Typically, nominal and ordinal attributes are binary or discrete, while interval and ratio attributes are continuous. However, count attributes, which are discrete, are also ratio attributes.

Asymmetric Attributes

For asymmetric attributes, only presence (a non-zero attribute value) is regarded as important. Consider a data set where each object is a student and each attribute records whether or not a student took a particular course at a university. For a specific student, an attribute has a value of 1 if the student took the course associated with that attribute and a value of 0 otherwise. Because students take only a small fraction of all available courses, most of the values in such a data set would be 0. Therefore, it is more meaningful and more efficient to focus on the non-zero values. To illustrate, if students are compared on the basis of the courses they don’t take, then most students would seem very similar, at least if the number of courses is large. Binary attributes where only non-zero values are important are called asymmetric binary attributes. This type of attribute is particularly important for association analysis, which is discussed in Chapter 6. It is also possible to have discrete or continuous asymmetric features. For instance, if the number of credits associated with each course is recorded, then the resulting data set will consist of asymmetric discrete or continuous attributes.
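The student and course example can be made concrete. The sketch below is illustrative only: the catalog size, the two students, and the function names are assumptions. It compares two students who share no courses, first by counting every matching attribute (including the 0-0 matches for courses neither took) and then by counting only shared non-zero values.

```python
# Hypothetical catalog of 1000 courses; each student took only four.
NUM_COURSES = 1000
alice_courses = {3, 17, 256, 900}
bob_courses = {5, 42, 300, 512}   # no course in common with Alice

def to_binary_vector(courses_taken, n=NUM_COURSES):
    """One attribute per course: 1 if taken, 0 otherwise."""
    return [1 if i in courses_taken else 0 for i in range(n)]

def fraction_of_matching_attributes(x, y):
    """Treats 0-0 and 1-1 agreements alike (symmetric view)."""
    return sum(a == b for a, b in zip(x, y)) / len(x)

def shared_nonzero_count(x, y):
    """Counts only attributes where both objects are non-zero."""
    return sum(a == 1 and b == 1 for a, b in zip(x, y))

alice = to_binary_vector(alice_courses)
bob = to_binary_vector(bob_courses)

symmetric_match = fraction_of_matching_attributes(alice, bob)  # 992 of 1000 agree
shared_courses = shared_nonzero_count(alice, bob)              # 0
```

Counting the courses neither student took makes them look 99.2% alike, even though they have no course in common; focusing on non-zero values avoids this.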

2.1.2 Types of Data Sets

There are many types of data sets, and as the field of data mining develops and matures, a greater variety of data sets become available for analysis. In this section, we describe some of the most common types. For convenience, we have grouped the types of data sets into three groups: record data, graph-based data, and ordered data. These categories do not cover all possibilities and other groupings are certainly possible.

General Characteristics of Data Sets

Before providing details of specific kinds of data sets, we discuss three characteristics that apply to many data sets and have a significant impact on the data mining techniques that are used: dimensionality, sparsity, and resolution.

Dimensionality The dimensionality of a data set is the number of attributes that the objects in the data set possess. Data with a small number of dimensions tends to be qualitatively different than moderate or high-dimensional data. Indeed, the difficulties associated with analyzing high-dimensional data are sometimes referred to as the curse of dimensionality. Because of this, an important motivation in preprocessing the data is dimensionality reduction. These issues are discussed in more depth later in this chapter and in Appendix B.

Sparsity For some data sets, such as those with asymmetric features, most attributes of an object have values of 0; in many cases, fewer than 1% of the entries are non-zero. In practical terms, sparsity is an advantage because usually only the non-zero values need to be stored and manipulated. This results in significant savings with respect to computation time and storage. Furthermore, some data mining algorithms work well only for sparse data.
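One common way to exploit sparsity is to keep only the non-zero entries, keyed by attribute index. The following is a minimal sketch under that assumption; the dictionary layout, the helper names, and the sample vectors are illustrative and not taken from the text.

```python
def to_sparse(dense):
    """Keep only the non-zero entries as {attribute_index: value}."""
    return {i: v for i, v in enumerate(dense) if v != 0}

def sparse_dot(x, y):
    """Dot product that touches only the stored (non-zero) entries."""
    if len(x) > len(y):
        x, y = y, x            # iterate over the smaller dictionary
    return sum(v * y[i] for i, v in x.items() if i in y)

# Two hypothetical objects with 1000 attributes, almost all of them 0.
dense_a = [0] * 1000
dense_a[3], dense_a[500] = 2, 1
dense_b = [0] * 1000
dense_b[3], dense_b[999] = 4, 7

sparse_a = to_sparse(dense_a)   # {3: 2, 500: 1}
sparse_b = to_sparse(dense_b)   # {3: 4, 999: 7}

dot_sparse = sparse_dot(sparse_a, sparse_b)              # 8
dot_dense = sum(a * b for a, b in zip(dense_a, dense_b)) # same value, 1000 steps
```

Here two stored entries per object replace 1000 stored values, and the dot product inspects two entries instead of 1000.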

Resolution It is frequently possible to obtain data at different levels of resolution, and often the properties of the data are different at different resolutions. For instance, the surface of the Earth seems very uneven at a resolution of a few meters, but is relatively smooth at a resolution of tens of kilometers. The patterns in the data also depend on the level of resolution. If the resolution is too fine, a pattern may not be visible or may be buried in noise; if the resolution is too coarse, the pattern may disappear. For example, variations in atmospheric pressure on a scale of hours reflect the movement of storms and other weather systems. On a scale of months, such phenomena are not detectable.
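The effect of resolution can be imitated with synthetic data. The sketch below is an illustration only: the pressure series is an artificial sine wave with an assumed 48-hour cycle, not real weather data, and the function names are hypothetical. Averaging the hourly values over 30-day blocks removes the short-term variation entirely.

```python
import math

HOURS_PER_DAY = 24
DAYS = 360

# Synthetic hourly pressure: a 48-hour oscillation around 1013 (assumed values).
pressure = [
    1013.0 + 10.0 * math.sin(2 * math.pi * h / 48)
    for h in range(HOURS_PER_DAY * DAYS)
]

def block_means(values, block_size):
    """Coarsen the resolution by averaging consecutive blocks."""
    return [
        sum(values[i:i + block_size]) / block_size
        for i in range(0, len(values), block_size)
    ]

def spread(values):
    """Difference between the largest and smallest value."""
    return max(values) - min(values)

hourly_spread = spread(pressure)                      # close to 20
monthly = block_means(pressure, HOURS_PER_DAY * 30)   # 12 values
monthly_spread = spread(monthly)                      # close to 0
```

At hourly resolution the series swings by about 20 units; at monthly resolution the same series looks constant, so the pattern is no longer detectable.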

Record Data

Much data mining work assumes that the data set is a collection of records (data objects), each of which consists of a fixed set of data fields (attributes). See Figure 2.2(a). For the most basic form of record data, there is no explicit relationship among records or data fields, and every record (object) has the same set of attributes. Record data is usually stored either in flat files or in relational databases. Relational databases are certainly more than a collection of records, but data mining often does not use any of the additional information available in a relational database. Rather, the database serves as a convenient place to find records. Different types of record data are described below and are illustrated in Figure 2.2.

Transaction or Market Basket Data Transaction data is a special type of record data, where each record (transaction) involves a set of items. Consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items. This type of data is called market basket data because the items in each record are the products in a person’s “market basket.” Transaction data is a collection of sets of items, but it can be viewed as a set of records whose fields are asymmetric attributes. Most often, the attributes are binary, indicating whether or not an item was purchased, but more generally, the attributes can be discrete or continuous, such as the number of items purchased or the amount spent on those items. Figure 2.2(b) shows a sample transaction data set. Each row represents the purchases of a particular customer at a particular time.
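The two views of transaction data, a collection of item sets and a set of records with asymmetric binary fields, can be converted into each other. This is a minimal sketch with made-up baskets; the item names and variable names are assumptions for illustration.

```python
# Hypothetical market baskets, one set of items per transaction.
transactions = [
    {"bread", "milk"},
    {"beer", "diapers", "bread"},
    {"milk", "diapers", "beer", "cola"},
]

# One binary attribute per distinct item, in a fixed (sorted) order.
items = sorted(set().union(*transactions))
# ['beer', 'bread', 'cola', 'diapers', 'milk']

# Record view: 1 if the item was purchased in that transaction, else 0.
records = [
    [1 if item in basket else 0 for item in items]
    for basket in transactions
]
# [[0, 1, 0, 0, 1],
#  [1, 1, 0, 1, 0],
#  [1, 0, 1, 1, 1]]
```

With a realistic number of distinct items, almost every entry of `records` would be 0, which is why the attributes are treated as asymmetric.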

(a) Record data. (b) Transaction data. (c) Data matrix. (d) Document-term matrix.

Figure 2.2. Different variations of record data.

The Data Matrix If the data objects in a collection of data all have the same fixed set of numeric attributes, then the data objects can be thought of as points (vectors) in a multidimensional space, where each dimension represents a distinct attribute describing the object. A set of such data objects can be interpreted as an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute. (A representation that has data objects as columns and attributes as rows is also fine.) This matrix is called a data matrix or a pattern matrix. A data matrix is a variation of record data, but because it consists of numeric attributes, standard matrix operations can be applied to transform and manipulate the data. Therefore, the data matrix is the standard data format for most statistical data. Figure 2.2(c) shows a sample data matrix.
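A data matrix can be sketched with plain nested lists. The values below are made up, and the choice of operations (transposing and centering each attribute on its mean) is only one example of the standard matrix manipulations the text refers to.

```python
# Hypothetical data matrix: m = 2 objects (rows), n = 3 numeric attributes (columns).
data = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
]
m, n = len(data), len(data[0])

# The alternative layout: attributes as rows, objects as columns.
transposed = [list(column) for column in zip(*data)]
# [[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]]

# Mean of each attribute, then center every object on those means.
column_means = [sum(column) / m for column in transposed]   # [2.5, 3.5, 4.5]
centered = [
    [value - mu for value, mu in zip(row, column_means)]
    for row in data
]
# [[-1.5, -1.5, -1.5], [1.5, 1.5, 1.5]]
```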

The Sparse Data Matrix A sparse data matrix is a special case of a data matrix in which the attributes are of the same type and are asymmetric; i.e., only non-zero values are important. Transaction data is an example of a sparse data matrix that has only 0-1 entries. Another common example is document data. In particular, if the order of the terms (words) in a document is ignored, then a document can be represented as a term vector, where each term is a component (attribute) of the vector and the value of each component is the number of times the corresponding term occurs in the document. This representation of a collection of documents is often called a document-term matrix. Figure 2.2(d) shows a sample document-term matrix. The documents are the rows of this matrix, while the terms are the columns. In practice, only the non-zero entries of sparse data matrices are stored.
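A document-term matrix can be built by counting terms and ignoring their order. The sketch below uses three invented documents and simple whitespace splitting as the tokenizer; both are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical documents (word order is ignored after counting).
documents = [
    "data mining finds patterns in data",
    "graph data and record data",
    "mining graph patterns",
]

# Sparse form: per document, only the terms that actually occur.
term_counts = [Counter(doc.split()) for doc in documents]

# Column order of the full matrix: the sorted vocabulary.
vocabulary = sorted(set().union(*term_counts))
# ['and', 'data', 'finds', 'graph', 'in', 'mining', 'patterns', 'record']

# Dense document-term matrix: rows are documents, columns are terms.
# A Counter returns 0 for terms that do not occur in a document.
doc_term_matrix = [[counts[term] for term in vocabulary] for counts in term_counts]
# [[0, 2, 1, 0, 1, 1, 1, 0],
#  [1, 2, 0, 1, 0, 0, 0, 1],
#  [0, 0, 0, 1, 0, 1, 1, 0]]
```

The `term_counts` list is the form that would normally be stored, since it holds only the non-zero entries.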

Graph-Based Data

A graph can sometimes be a convenient and powerful representation for data. We consider two specific cases: (1) the graph captures relationships among data objects and (2) the data objects themselves are represented as graphs.

Data with Relationships among Objects The relationships among objects frequently convey important information. In such cases, the data is often represented as a graph. In particular, the data objects are mapped to nodes of the graph, while the relationships among objects are captured by the links between objects and link properties, such as direction and weight. Consider Web pages on the World Wide Web, which contain both text and links to other pages. In order to process search queries, Web search engines collect and process Web pages to extract their contents. It is well known, however, that the links to and from each page provide a great deal of information about the relevance of a Web page to a query, and thus, must also be taken into consideration. Figure 2.3(a) shows a set of linked Web pages.
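Linked pages map naturally onto a directed graph stored as an adjacency list. The sketch below uses invented page names and counts incoming links per page; the in-link count is shown only as a simple example of information carried by links, not as the way any search engine ranks pages.

```python
# Hypothetical Web pages: each page maps to the pages it links to (directed links).
links = {
    "home": ["about", "papers"],
    "about": ["home"],
    "papers": ["home", "datasets"],
    "datasets": [],
}

# Count the links pointing to each page.
in_degree = {page: 0 for page in links}
for source, targets in links.items():
    for target in targets:
        in_degree[target] += 1
# {'home': 2, 'about': 1, 'papers': 1, 'datasets': 1}

# The page with the most incoming links.
most_linked = max(in_degree, key=in_degree.get)   # 'home'
```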

Data with Objects That Are Graphs If objects have structure, that is, the objects contain subobjects that have relationships, then such objects are frequently represented as graphs. For example, the structure of chemical compounds can be represented by a graph, where the nodes are atoms and the links between nodes are chemical bonds. Figure 2.3(b) shows a ball-and-stick diagram of the chemical compound benzene, which contains atoms of carbon (black) and hydrogen (gray). A graph representation makes it possible to determine which substructures occur frequently in a set of compounds and to ascertain whether the presence of any of these substructures is associated with the presence or absence of certain chemical properties, such as melting point or heat of formation. Substructure mining, which is a branch of data mining that analyzes such data, is considered in Section 7.5.
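Benzene can be written down as a small undirected graph: six carbon atoms in a ring, each bonded to one hydrogen atom. The sketch below is a simplification that ignores bond order; the atom labels and the helper function are assumptions for illustration.

```python
# Atoms (nodes), labeled by element. Labels such as "C1" are illustrative.
atoms = {f"C{i}": "C" for i in range(1, 7)}
atoms.update({f"H{i}": "H" for i in range(1, 7)})

# Bonds (links): the carbon ring plus one hydrogen per carbon.
bonds = []
for i in range(1, 7):
    next_carbon = i % 6 + 1            # C6 closes the ring back to C1
    bonds.append((f"C{i}", f"C{next_carbon}"))
    bonds.append((f"C{i}", f"H{i}"))

# Undirected adjacency sets.
adjacency = {atom: set() for atom in atoms}
for u, v in bonds:
    adjacency[u].add(v)
    adjacency[v].add(u)

def count_bonds(element_a, element_b):
    """Number of bonds joining the two given elements (a tiny substructure query)."""
    return sum(1 for u, v in bonds if {atoms[u], atoms[v]} == {element_a, element_b})

carbon_hydrogen_bonds = count_bonds("C", "H")   # 6
carbon_carbon_bonds = count_bonds("C", "C")     # 6
carbon_degrees = {len(adjacency[a]) for a in atoms if atoms[a] == "C"}   # {3}
```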

(a) Linked Web pages. (b) Benzene molecule.

Figure 2.3. Different variations of graph data.

Ordered Data

For some types of data, the attributes have relationships that involve order in time or space. Different types of ordered data are described next and are shown in Figure 2.4.

Sequential Data Sequential data, also referred to as temporal data, can be thought of as an extension of record data, where each record has a time associated with it. Consider a retail transaction data set that also stores the time at which the transaction took place. This time information makes it possible to find patterns such as “candy sales peak before Halloween.” A time can also be associated with each attribute. For example, each record could be the purchase history of a customer, with a listing of items purchased at different times. Using this information, it is possible to find patterns such as “people who buy DVD players tend to buy DVDs in the period immediately following the purchase.”
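Once each record carries a time, order-dependent patterns can be checked directly. The sketch below uses invented purchase records and a hypothetical helper that finds customers who bought one item at some point after first buying another.

```python
# Hypothetical timestamped purchases: (time, customer, item).
purchases = [
    (1, "C1", "DVD player"),
    (2, "C2", "candy"),
    (3, "C1", "DVD"),
    (4, "C3", "DVD"),
    (5, "C3", "DVD player"),
]

def bought_after(records, first_item, later_item):
    """Customers who bought later_item after their first purchase of first_item."""
    first_purchase_time = {}
    followers = set()
    for time, customer, item in sorted(records):      # process in time order
        if item == first_item:
            first_purchase_time.setdefault(customer, time)
        elif item == later_item and customer in first_purchase_time:
            followers.add(customer)
    return followers

dvd_followers = bought_after(purchases, "DVD player", "DVD")   # {'C1'}
```

Customer C3 bought a DVD before the player, so only C1 matches; without the time field the two customers would be indistinguishable.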

Figure 2.4(a) shows an example of sequential transaction data. There are five different times: t1, t2, t3, t4, and t5; three different customers: C1,
