Usama Fayyad, “Mining Databases: Towards Algorithms for Knowledge Discovery”, Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, vol. 21, no. 1, March 1998
Time   Customer   Items Purchased
t1     C1         A, B
t2     C3         A, C
t2     C1         C, D
t3     C2         A, D
t4     C2         E
t5     C1         A, E
34 Chapter 2 Data
(a) Sequential transaction data.
GGTTCCGCCTTCAGCCCCGCGCC CGCAGGGCCCGCCCCGCGCCGTC GAGAAGGGC CCGCCTGGCGGGCG GGGGGAGGCGGGGCCGCCCGAGC CCAACCGAGT ECGACCAGGTGCC CCCTCTGCT CGGCCTAGACCTGA GCTCATTAGGCGGCAGCGGACAG GC CAAGTAGAAEAEG CGAAGCGC TGGGCTGCCTGCTGCGACCAGGG
(b) Genomic sequence data.
[Plot: Minneapolis Average Monthly Temperature (1982-1993); horizontal axis: year, ticks 1983 to 1994]
(c) Temperature time series. (d) Spatial temperature data.
Figure 2.4. Different variations of ordered data.
C2, and C3; and five different items A, B, C, D, and E. In the top table, each row corresponds to the items purchased at a particular time by each customer. For instance, at time t3, customer C2 purchased items A and D. In the bottom table, the same information is displayed, but each row corresponds to a particular customer. Each row contains information on each transaction involving the customer, where a transaction is considered to be a set of items and the time at which those items were purchased. For example, customer C3 bought items A and C at time t2.
Customer   Time and Items Purchased
C1         (t1: A, B) (t2: C, D) (t5: A, E)
C2         (t3: A, D) (t4: E)
C3         (t2: A, C)
2 .1 Types of Data 35
Sequence Data Sequence data consists of a data set that is a sequence of individual entities, such as a sequence of words or letters. It is quite similar to sequential data, except that there are no time stamps; instead, there are positions in an ordered sequence. For example, the genetic information of plants and animals can be represented in the form of sequences of nucleotides that are known as genes. Many of the problems associated with genetic sequence data involve predicting similarities in the structure and function of genes from similarities in nucleotide sequences. Figure 2.4(b) shows a section of the human genetic code expressed using the four nucleotides from which all DNA is constructed: A, T, G, and C.
Time Series Data Time series data is a special type of sequential data in which each record is a time series, i.e., a series of measurements taken over time. For example, a financial data set might contain objects that are time series of the daily prices of various stocks. As another example, consider Figure 2.4(c), which shows a time series of the average monthly temperature for Minneapolis during the years 1982 to 1994. When working with temporal data, it is important to consider temporal autocorrelation; i.e., if two measurements are close in time, then the values of those measurements are often very similar.
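Temporal autocorrelation can be checked numerically. The following is a minimal sketch, not a prescribed method: the function name `autocorrelation` and the synthetic temperature values are assumptions made for illustration, and the series is an idealized seasonal cycle rather than the Minneapolis data of Figure 2.4(c).

```python
import math

def autocorrelation(series, lag=1):
    """Sample autocorrelation of a sequence at the given lag."""
    n = len(series)
    mean = sum(series) / n
    # Sum of squared deviations (proportional to the variance).
    var = sum((x - mean) ** 2 for x in series)
    # Sum of products of deviations for pairs that are `lag` steps apart.
    cov = sum((series[i] - mean) * (series[i + lag] - mean)
              for i in range(n - lag))
    return cov / var

# Hypothetical monthly temperatures: a smooth seasonal cycle over three years.
temps = [10 + 15 * math.sin(2 * math.pi * m / 12) for m in range(36)]

lag1 = autocorrelation(temps, lag=1)  # neighboring months
lag6 = autocorrelation(temps, lag=6)  # months half a year apart
print(f"lag 1: {lag1:.2f}, lag 6: {lag6:.2f}")
```

For this idealized series, measurements one month apart are strongly positively correlated, whereas measurements six months apart are negatively correlated, which is the pattern expected of seasonal data.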
Spatial Data Some objects have spatial attributes, such as positions or areas, as well as other types of attributes. An example of spatial data is weather data (precipitation, temperature, pressure) that is collected for a variety of geographical locations. An important aspect of spatial data is spatial autocorrelation; i.e., objects that are physically close tend to be similar in other ways as well. Thus, two points on the Earth that are close to each other usually have similar values for temperature and rainfall.
Important examples of spatial data are the science and engineering data sets that are the result of measurements or model output taken at regularly or irregularly distributed points on a two- or three-dimensional grid or mesh. For instance, Earth science data sets record the temperature or pressure measured at points (grid cells) on latitude-longitude spherical grids of various resolutions, e.g., 1° by 1°. (See Figure 2.4(d).) As another example, in the simulation of the flow of a gas, the speed and direction of flow can be recorded for each grid point in the simulation.
Handling Non-Record Data
Most data mining algorithms are designed for record data or its variations, such as transaction data and data matrices. Record-oriented techniques can be applied to non-record data by extracting features from data objects and using these features to create a record corresponding to each object. Consider the chemical structure data that was described earlier. Given a set of common substructures, each compound can be represented as a record with binary attributes that indicate whether a compound contains a specific substructure. Such a representation is actually a transaction data set, where the transactions are the compounds and the items are the substructures.
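The conversion of chemical structure data into records can be sketched as follows. The substructure names, the compound names, and the helper `to_record` are all hypothetical, and the sketch assumes that the substructures present in each compound have already been identified.

```python
# Hypothetical vocabulary of common substructures (the "items").
substructures = ["benzene_ring", "hydroxyl", "carboxyl"]

# Hypothetical compounds (the "transactions"), each described by the
# set of substructures it was found to contain.
compounds = {
    "compound_1": {"benzene_ring", "hydroxyl"},
    "compound_2": {"carboxyl"},
    "compound_3": {"benzene_ring", "hydroxyl", "carboxyl"},
}

def to_record(found, vocabulary):
    """Return one binary attribute per substructure in the vocabulary."""
    return [1 if s in found else 0 for s in vocabulary]

records = {name: to_record(found, substructures)
           for name, found in compounds.items()}

for name, record in records.items():
    print(name, record)
```

Each compound becomes a fixed-length record, so any technique that accepts record data with binary attributes can be applied to the result.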
In some cases, it is easy to represent the data in a record format, but this type of representation does not capture all the information in the data. Consider spatio-temporal data consisting of a time series from each point on a spatial grid. This data is often stored in a data matrix, where each row represents a location and each column represents a particular point in time. However, such a representation does not explicitly capture the time relationships that are present among attributes and the spatial relationships that exist among objects. This does not mean that such a representation is inappropriate, but rather that these relationships must be taken into consideration during the analysis. For example, it would not be a good idea to use a data mining technique that assumes the attributes are statistically independent of one another.
2.2 Data Quality
Data mining applications are often applied to data that was collected for another purpose, or for future, but unspecified applications. For that reason, data mining cannot usually take advantage of the significant benefits of “addressing quality issues at the source.” In contrast, much of statistics deals with the design of experiments or surveys that achieve a prespecified level of data quality. Because preventing data quality problems is typically not an option, data mining focuses on (1) the detection and correction of data quality problems and (2) the use of algorithms that can tolerate poor data quality. The first step, detection and correction, is often called data cleaning.
The following sections discuss specific aspects of data quality. The focus is on measurement and data collection issues, although some application-related issues are also discussed.
2.2.1 Measurement and Data Collection Issues
It is unrealistic to expect that data will be perfect. There may be problems due to human error, limitations of measuring devices, or flaws in the data collection process. Values or even entire data objects may be missing. In other cases, there may be spurious or duplicate objects; i.e., multiple data objects that all correspond to a single “real” object. For example, there might be two different records for a person who has recently lived at two different addresses. Even if all the data is present and “looks fine,” there may be inconsistencies: a person has a height of 2 meters, but weighs only 2 kilograms.
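Simple checks can surface both kinds of problem. The sketch below is illustrative only. The records and field names are invented. Treating name plus birth date as the identity of a person is an assumption, and so are the body mass index bounds used as a plausibility rule. A real application would choose its own matching key and thresholds.

```python
from collections import defaultdict

# Hypothetical person records; the first two describe the same person
# at two different addresses.
people = [
    {"name": "A. Smith", "born": "1980-04-02", "address": "12 Oak St",
     "height_m": 1.80, "weight_kg": 75.0},
    {"name": "A. Smith", "born": "1980-04-02", "address": "9 Elm Ave",
     "height_m": 1.80, "weight_kg": 75.0},
    {"name": "B. Jones", "born": "1975-11-30", "address": "3 Pine Rd",
     "height_m": 2.00, "weight_kg": 2.0},
]

def find_duplicates(records):
    """Group record indices by an assumed identity key (name, birth date)."""
    groups = defaultdict(list)
    for i, r in enumerate(records):
        groups[(r["name"], r["born"])].append(i)
    return {key: idx for key, idx in groups.items() if len(idx) > 1}

def find_inconsistent(records, bmi_min=10.0, bmi_max=80.0):
    """Flag records whose height and weight imply an implausible BMI.

    The bounds are assumptions chosen for illustration.
    """
    flagged = []
    for i, r in enumerate(records):
        bmi = r["weight_kg"] / r["height_m"] ** 2
        if not bmi_min <= bmi <= bmi_max:
            flagged.append(i)
    return flagged

duplicates = find_duplicates(people)
inconsistent = find_inconsistent(people)
print("possible duplicates:", duplicates)
print("inconsistent records:", inconsistent)
```

Neither check proves that a record is wrong. The duplicate candidates may be two distinct people, so such output is best treated as a list of records to review.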
In the next few sections, we focus on aspects of data quality that are related to data measurement and collection. We begin with a definition of measurement and data collection errors and then consider a variety of problems that involve measurement error: noise, artifacts, bias, precision, and accuracy. We conclude by discussing data quality issues that may involve both measurement and data collection problems: outliers, missing and inconsistent values, and duplicate data.
Measurement and Data Collection Errors
The term measurement error refers to any problem resulting from the measurement process. A common problem is that the value recorded differs from the true value to some extent. For continuous attributes, the numerical difference of the measured and true value is called the error. The term data collection error refers to errors such as omitting data objects or attribute values, or inappropriately including a data object. For example, a study of animals of a certain species might include animals of a related species that are similar in appearance to the species of interest. Both measurement errors and data collection errors can be either systematic or random.
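For a continuous attribute, the error and its systematic and random parts can be illustrated with a small calculation. The reference value and the repeated measurements below are hypothetical. Using the mean error as an estimate of the systematic component and the standard deviation of the errors as a summary of the random component is one common choice, not the only one.

```python
import statistics

# Hypothetical setup: a reference object whose true value is known,
# measured five times with the same device.
true_value = 1.000
measured = [1.015, 1.013, 1.016, 1.014, 1.017]

# Error = measured value minus true value.
errors = [m - true_value for m in measured]

# Every reading is shifted in the same direction, which suggests a
# systematic component; the mean error estimates its size.
systematic = statistics.mean(errors)

# The scatter of the errors around their mean reflects the random component.
random_spread = statistics.stdev(errors)

print(f"mean error: {systematic:.4f}")
print(f"spread of errors: {random_spread:.4f}")
```

In this invented example the mean error is about ten times larger than the spread, so a constant offset, rather than random fluctuation, accounts for most of the error.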
We will only consider general types of errors. Within particular domains, there are certain types of data errors that are commonplace, and there often exist well-developed techniques for detecting and/or correcting these errors. For example, keyboard errors are common when data is entered manually, and as a result, many data entry programs have techniques for detecting and, with human intervention, correcting such errors.
Noise and Artifacts
Noise is the random component of a measurement error. It may involve the distortion of a value or the addition of spurious objects. Figure 2.5 shows a time series before and after it has been disrupted by random noise. If a bit
(a) Time series. (b) Time series with noise.
Figure 2.5. Noise in a time series context.
(a) Three groups of points. (b) With noise points (+) added.
Figure 2.6. Noise in a spatial context.
more noise were added to the time series, its shape would be lost. Figure 2.6 shows a set of data points before and after some noise points (indicated by ‘+’s) have been added. Notice that some of the noise points are intermixed with the non-noise points.
The term noise is often used in connection with data that has a spatial or temporal component. In such cases, techniques from signal or image processing can frequently be used to reduce noise and thus help to discover patterns (signals) that might be “lost in the noise.” Nonetheless, the elimination of noise is frequently difficult, and much work in data mining focuses on devising robust algorithms that produce acceptable results even when noise is present.
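As one illustration of noise reduction, the sketch below applies a centered moving average to a synthetic noisy time series. The moving average is chosen here only because it is simple; the text does not single out any particular technique. The signal, the noise level, the window size, and the function names are all assumptions.

```python
import math
import random

def moving_average(values, window=5):
    """Centered moving average; the window is truncated at the ends."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed

def mean_squared_error(a, b):
    """Average squared difference between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

random.seed(0)  # fixed seed so the run is repeatable

# Synthetic slowly varying signal plus Gaussian noise (both assumed).
clean = [math.sin(2 * math.pi * t / 100) for t in range(200)]
noisy = [c + random.gauss(0, 0.5) for c in clean]
smoothed = moving_average(noisy, window=5)

mse_noisy = mean_squared_error(noisy, clean)
mse_smoothed = mean_squared_error(smoothed, clean)
print(f"error before smoothing: {mse_noisy:.3f}")
print(f"error after smoothing:  {mse_smoothed:.3f}")
```

Smoothing works well here because the signal changes slowly relative to the window. A wider window removes more noise but also flattens genuine short-lived features, which is one reason noise elimination is difficult in practice.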