Business Point-of-sale data collection technologies (bar code scanners, radio frequency identification (RFID), and smart card technology) have allowed retailers to collect up-to-the-minute data about customer purchases at the checkout counters of their stores. Retailers can utilize this information, along with other business-critical data such as Web logs from e-commerce Web sites and customer service records from call centers, to help them better understand the needs of their customers and make more informed business decisions.
Data mining techniques can be used to support a wide range of business intelligence applications such as customer profiling, targeted marketing, workflow management, store layout, and fraud detection. They can also help retailers
2 Chapter 1 Introduction
answer important business questions such as “Who are the most profitable customers?” “What products can be cross-sold or up-sold?” and “What is the revenue outlook of the company for next year?” Some of these questions motivated the creation of association analysis (Chapters 6 and 7), a new data analysis technique.
Medicine, Science, and Engineering Researchers in medicine, science, and engineering are rapidly accumulating data that is key to important new discoveries. For example, as an important step toward improving our understanding of the Earth’s climate system, NASA has deployed a series of Earth-orbiting satellites that continuously generate global observations of the land surface, oceans, and atmosphere. However, because of the size and spatio-temporal nature of the data, traditional methods are often not suitable for analyzing these data sets. Techniques developed in data mining can aid Earth scientists in answering questions such as “What is the relationship between global warming and the frequency and intensity of ecosystem disturbances such as droughts and hurricanes?” “How are land surface precipitation and temperature affected by ocean surface temperature?” and “How well can we predict the beginning and end of the growing season for a region?”
As another example, researchers in molecular biology hope to use the large amounts of genomic data currently being gathered to better understand the structure and function of genes. In the past, traditional methods in molecular biology allowed scientists to study only a few genes at a time in a given experiment. Recent breakthroughs in microarray technology have enabled scientists to compare the behavior of thousands of genes under various situations. Such comparisons can help determine the function of each gene and perhaps isolate the genes responsible for certain diseases. However, the noisy and high-dimensional nature of the data requires new types of data analysis. In addition to analyzing gene array data, data mining can also be used to address other important biological challenges such as protein structure prediction, multiple sequence alignment, the modeling of biochemical pathways, and phylogenetics.
1.1 What Is Data Mining?
Data mining is the process of automatically discovering useful information in large data repositories. Data mining techniques are deployed to scour large databases in order to find novel and useful patterns that might otherwise remain unknown. They also provide capabilities to predict the outcome of a
future observation, such as predicting whether a newly arrived customer will spend more than $100 at a department store.
Not all information discovery tasks are considered to be data mining. For example, looking up individual records using a database management system or finding particular Web pages via a query to an Internet search engine are tasks related to the area of information retrieval. Although such tasks are important and may involve the use of sophisticated algorithms and data structures, they rely on traditional computer science techniques and obvious features of the data to create index structures for efficiently organizing and retrieving information. Nonetheless, data mining techniques have been used to enhance information retrieval systems.
Data Mining and Knowledge Discovery
Data mining is an integral part of knowledge discovery in databases (KDD), which is the overall process of converting raw data into useful information, as shown in Figure 1.1. This process consists of a series of transformation steps, from data preprocessing to postprocessing of data mining results.
Figure 1.1. The process of knowledge discovery in databases (KDD).
The input data can be stored in a variety of formats (flat files, spreadsheets, or relational tables) and may reside in a centralized data repository or be distributed across multiple sites. The purpose of preprocessing is to transform the raw input data into an appropriate format for subsequent analysis. The steps involved in data preprocessing include fusing data from multiple sources, cleaning data to remove noise and duplicate observations, and selecting records and features that are relevant to the data mining task at hand. Because of the many ways data can be collected and stored, data
preprocessing is perhaps the most laborious and time-consuming step in the overall knowledge discovery process.
“Closing the loop” is the phrase often used to refer to the process of integrating data mining results into decision support systems. For example, in business applications, the insights offered by data mining results can be integrated with campaign management tools so that effective marketing promotions can be conducted and tested. Such integration requires a postprocessing step that ensures that only valid and useful results are incorporated into the decision support system. An example of postprocessing is visualization (see Chapter 3), which allows analysts to explore the data and the data mining results from a variety of viewpoints. Statistical measures or hypothesis testing methods can also be applied during postprocessing to eliminate spurious data mining results.
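The stages of the KDD process described above (preprocessing, data mining, postprocessing) can be illustrated with a small sketch. This is only a toy illustration under stated assumptions: the function names `preprocess`, `mine`, and `postprocess`, the record fields, and the frequency-count "mining" step are all hypothetical stand-ins chosen for brevity, not methods prescribed by the text.

```python
from collections import Counter

def preprocess(sources, keep_fields):
    """Fuse several sources, clean the records, and select relevant fields."""
    seen = set()
    cleaned = []
    for source in sources:                      # fuse: walk every source
        for record in source:
            # select: keep only the fields relevant to the task
            row = tuple(record.get(f) for f in keep_fields)
            if None in row:                     # clean: drop incomplete records
                continue
            if row in seen:                     # clean: drop duplicate observations
                continue
            seen.add(row)
            cleaned.append(dict(zip(keep_fields, row)))
    return cleaned

def mine(records, field):
    """Stand-in for a mining step: count how often each value occurs."""
    return Counter(r[field] for r in records)

def postprocess(pattern_counts, min_count):
    """Keep only patterns that occur at least min_count times."""
    return {p: c for p, c in pattern_counts.items() if c >= min_count}

# Hypothetical input data from two stores.
store_a = [
    {"customer": "c1", "item": "bread", "note": "x"},
    {"customer": "c2", "item": "milk"},
    {"customer": "c1", "item": "bread"},        # duplicate after field selection
]
store_b = [
    {"customer": "c3", "item": "bread"},
    {"customer": "c4"},                         # missing item, dropped in cleaning
]

records = preprocess([store_a, store_b], keep_fields=("customer", "item"))
patterns = postprocess(mine(records, "item"), min_count=2)
print(patterns)
```

In a real system the postprocessing filter would more likely be a statistical test or a visualization step, as the text notes; the simple count threshold here only marks where such a check would sit.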
1.2 Motivating Challenges
As mentioned earlier, traditional data analysis techniques have often encountered practical difficulties in meeting the challenges posed by new data sets. The following are some of the specific challenges that motivated the development of data mining.
Scalability Because of advances in data generation and collection, data sets with sizes of gigabytes, terabytes, or even petabytes are becoming common. If data mining algorithms are to handle these massive data sets, then they must be scalable. Many data mining algorithms employ special search strategies to handle exponential search problems. Scalability may also require the implementation of novel data structures to access individual records in an efficient manner. For instance, out-of-core algorithms may be necessary when processing data sets that cannot fit into main memory. Scalability can also be improved by using sampling or developing parallel and distributed algorithms.
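Two of the ideas mentioned above, out-of-core processing and sampling, can be sketched briefly. The code below is a minimal illustration, not a production technique: `streaming_mean` reads a data stream one chunk at a time so that the full data set never has to sit in memory, and `reservoir_sample` draws a fixed-size uniform sample in a single pass (the classic reservoir sampling scheme). The function names, chunk size, and seed are assumptions made for this example.

```python
import random
from itertools import islice

def chunks(iterable, size):
    """Yield successive lists of at most `size` items from any iterable."""
    it = iter(iterable)
    while True:
        block = list(islice(it, size))
        if not block:
            return
        yield block

def streaming_mean(values, chunk_size=1000):
    """Compute a mean while holding only one chunk in memory at a time."""
    total, count = 0.0, 0
    for block in chunks(values, chunk_size):
        total += sum(block)
        count += len(block)
    return total / count if count else float("nan")

def reservoir_sample(values, k, seed=0):
    """Draw a uniform random sample of k items in one pass over the data."""
    rng = random.Random(seed)
    sample = []
    for i, v in enumerate(values):
        if i < k:
            sample.append(v)            # fill the reservoir first
        else:
            j = rng.randint(0, i)       # inclusive on both ends
            if j < k:
                sample[j] = v           # replace with probability k / (i + 1)
    return sample

# A generator stands in for a file too large to load at once.
stream = (x for x in range(1, 100001))
mean = streaming_mean(stream, chunk_size=4096)
sample = reservoir_sample(range(1, 100001), k=10)
print(mean, sample)
```

A mining algorithm could then be run on `sample` instead of the full data set, trading some accuracy for a large reduction in memory and running time.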
High Dimensionality It is now common to encounter data sets with hundreds or thousands of attributes instead of the handful common a few decades ago. In bioinformatics, progress in microarray technology has produced gene expression data involving thousands of features. Data sets with temporal or spatial components also tend to have high dimensionality. For example, consider a data set that contains measurements of temperature at various locations. If the temperature measurements are taken repeatedly for an extended period, the number of dimensions (features) increases in proportion to
the number of measurements taken. Traditional data analysis techniques that were developed for low-dimensional data often do not work well for such high-dimensional data. Also, for some data analysis algorithms, the computational complexity increases rapidly as the dimensionality (the number of features) increases.
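The temperature example can be made concrete with a short sketch. It assumes, purely for illustration, that each location is treated as one data object and each time step as one feature; the helper `to_feature_vectors` and the synthetic readings are hypothetical.

```python
from collections import defaultdict

def to_feature_vectors(readings):
    """Group (location, time_index, temperature) readings into one
    feature vector per location, ordered by time index."""
    by_location = defaultdict(dict)
    for location, t, temp in readings:
        by_location[location][t] = temp
    return {
        loc: [series[t] for t in sorted(series)]
        for loc, series in by_location.items()
    }

# Synthetic hourly readings for one week at two locations.
hours = 24 * 7
readings = [(loc, t, 15.0 + 0.1 * t) for loc in ("A", "B") for t in range(hours)]

vectors = to_feature_vectors(readings)
print(len(vectors), "objects,", len(vectors["A"]), "features each")
```

Under this representation, one week of hourly readings already yields 168 features per location, and a full year would yield 8,760, while the number of data objects stays fixed at the number of locations.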
Heterogeneous and Complex Data Traditional data analysis methods often deal with data sets containing attributes of the same type, either continuous or categorical. As the role of data mining in business, science, medicine, and other fields has grown, so has the need for techniques that can handle heterogeneous attributes. Recent years have also seen the emergence of more complex data objects. Examples of such non-traditional types of data include collections of Web pages containing semi-structured text and hyperlinks; DNA data with sequential and three-dimensional structure; and climate data that consists of time series measurements (temperature, pressure, etc.) at various locations on the Earth’s surface. Techniques developed for mining such complex objects should take into consideration relationships in the data, such as temporal and spatial autocorrelation, graph connectivity, and parent-child relationships between the elements in semi-structured text and XML documents.
Data Ownership and Distribution Sometimes, the data needed for an analysis is not stored in one location or owned by one organization. Instead, the data is geographically distributed among resources belonging to multiple entities. This requires the development of distributed data mining techniques. The key challenges faced by distributed data mining algorithms include (1) how to reduce the amount of communication needed to perform the distributed computation, (2) how to effectively consolidate the data mining results obtained from multiple sources, and (3) how to address data security issues.
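Challenges (1) and (2) can be illustrated with a small sketch in which each site sends only a compact summary of its data rather than the raw records, and a coordinator consolidates the summaries. The `Summary` class, the function names, and the choice of mean and variance as the statistic of interest are assumptions made for this example; the sketch does not address challenge (3), and the sum-of-squares formula it uses can lose precision on very large or poorly scaled data.

```python
from dataclasses import dataclass

@dataclass
class Summary:
    """Local statistics a site shares instead of its raw records."""
    n: int
    total: float
    total_sq: float

def summarize(values):
    """Run at each site: reduce local data to three numbers."""
    return Summary(len(values), sum(values), sum(v * v for v in values))

def merge(summaries):
    """Run at the coordinator: consolidate the per-site summaries."""
    return Summary(
        n=sum(s.n for s in summaries),
        total=sum(s.total for s in summaries),
        total_sq=sum(s.total_sq for s in summaries),
    )

def mean_and_variance(s):
    """Global mean and population variance from a merged summary."""
    mean = s.total / s.n
    variance = s.total_sq / s.n - mean * mean
    return mean, variance

# Hypothetical data held at two separate sites.
site_a = [2.0, 4.0, 4.0, 4.0]
site_b = [5.0, 5.0, 7.0, 9.0]

combined = merge([summarize(site_a), summarize(site_b)])
mean, variance = mean_and_variance(combined)
print(mean, variance)
```

Each site transmits three numbers regardless of how many records it holds, so communication cost depends on the number of sites rather than on the size of the data.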