
Non-traditional Analysis. The traditional statistical approach is based on a hypothesize-and-test paradigm. In other words, a hypothesis is proposed, an experiment is designed to gather the data, and then the data is analyzed with respect to the hypothesis. Unfortunately, this process is extremely labor-intensive. Current data analysis tasks often require the generation and evaluation of thousands of hypotheses, and consequently, the development of some data mining techniques has been motivated by the desire to automate the process of hypothesis generation and evaluation. Furthermore, the data sets analyzed in data mining are typically not the result of a carefully designed




6 Chapter 1 Introduction

experiment and often represent opportunistic samples of the data, rather than random samples. Also, the data sets frequently involve non-traditional types of data and data distributions.

1.3 The Origins of Data Mining

Brought together by the goal of meeting the challenges of the previous section, researchers from different disciplines began to focus on developing more efficient and scalable tools that could handle diverse types of data. This work, which culminated in the field of data mining, built upon the methodology and algorithms that researchers had previously used. In particular, data mining draws upon ideas, such as (1) sampling, estimation, and hypothesis testing from statistics and (2) search algorithms, modeling techniques, and learning theories from artificial intelligence, pattern recognition, and machine learning. Data mining has also been quick to adopt ideas from other areas, including optimization, evolutionary computing, information theory, signal processing, visualization, and information retrieval.

A number of other areas also play key supporting roles. In particular, database systems are needed to provide support for efficient storage, indexing, and query processing. Techniques from high performance (parallel) computing are often important in addressing the massive size of some data sets. Distributed techniques can also help address the issue of size and are essential when the data cannot be gathered in one location. Figure 1.2 shows the relationship of data mining to other areas.

Figure 1.2. Data mining as a confluence of many disciplines.




1.4 Data Mining Tasks

Data mining tasks are generally divided into two major categories:

Predictive tasks. The objective of these tasks is to predict the value of a particular attribute based on the values of other attributes. The attribute to be predicted is commonly known as the target or dependent variable, while the attributes used for making the prediction are known as the explanatory or independent variables.

Descriptive tasks. Here, the objective is to derive patterns (correlations, trends, clusters, trajectories, and anomalies) that summarize the underlying relationships in data. Descriptive data mining tasks are often exploratory in nature and frequently require postprocessing techniques to validate and explain the results.

Figure 1.3 illustrates four of the core data mining tasks that are described in the remainder of this book.

Figure 1.3. Four of the core data mining tasks.




Predictive modeling refers to the task of building a model for the target variable as a function of the explanatory variables. There are two types of predictive modeling tasks: classification, which is used for discrete target variables, and regression, which is used for continuous target variables. For example, predicting whether a Web user will make a purchase at an online bookstore is a classification task because the target variable is binary-valued. On the other hand, forecasting the future price of a stock is a regression task because price is a continuous-valued attribute. The goal of both tasks is to learn a model that minimizes the error between the predicted and true values of the target variable. Predictive modeling can be used to identify customers that will respond to a marketing campaign, predict disturbances in the Earth's ecosystem, or judge whether a patient has a particular disease based on the results of medical tests.
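To make "the error between the predicted and true values" concrete, the sketch below computes one common error measure for each kind of task. The choice of measures (misclassification rate for classification, mean squared error for regression) is an assumption made for illustration, since the text does not name a specific one, and the function names and sample values are hypothetical.

```python
def misclassification_rate(y_true, y_pred):
    """Fraction of discrete predictions that differ from the true labels."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average squared difference between continuous predictions and true values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Classification: did a Web user make a purchase? (made-up labels)
purchased_true = [True, False, False, True]
purchased_pred = [True, True, False, True]
print(misclassification_rate(purchased_true, purchased_pred))  # one of four is wrong

# Regression: stock price forecasts (made-up prices)
price_true = [10.0, 12.0, 11.0]
price_pred = [11.0, 12.0, 9.0]
print(mean_squared_error(price_true, price_pred))
```

A learning algorithm for either task would search for the model whose predictions make such a measure as small as possible on the available data.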

Example 1.1 (Predicting the Type of a Flower). Consider the task of predicting a species of flower based on the characteristics of the flower. In particular, consider classifying an Iris flower as to whether it belongs to one of the following three Iris species: Setosa, Versicolour, or Virginica. To perform this task, we need a data set containing the characteristics of various flowers of these three species. A data set with this type of information is the well-known Iris data set from the UCI Machine Learning Repository at http://www.ics.uci.edu/~mlearn. In addition to the species of a flower, this data set contains four other attributes: sepal width, sepal length, petal length, and petal width. (The Iris data set and its attributes are described further in Section 3.1.) Figure 1.4 shows a plot of petal width versus petal length for the 150 flowers in the Iris data set. Petal width is broken into the categories low, medium, and high, which correspond to the intervals [0, 0.75), [0.75, 1.75), [1.75, ∞), respectively. Also, petal length is broken into categories low, medium, and high, which correspond to the intervals [0, 2.5), [2.5, 5), [5, ∞), respectively. Based on these categories of petal width and length, the following rules can be derived:

Petal width low and petal length low implies Setosa.
Petal width medium and petal length medium implies Versicolour.
Petal width high and petal length high implies Virginica.

While these rules do not classify all the flowers, they do a good (but not perfect) job of classifying most of the flowers. Note that flowers from the Setosa species are well separated from the Versicolour and Virginica species with respect to petal width and length, but the latter two species overlap somewhat with respect to these attributes.
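The three rules of Example 1.1 can be written down directly as a small rule-based classifier. The following is a minimal sketch that uses the interval boundaries given in the example; the function names and the sample measurements are hypothetical, and returning None for flowers that no rule covers is a design choice made here, not something the text prescribes.

```python
from typing import Optional


def categorize(value_cm: float, low_upper: float, medium_upper: float) -> str:
    """Map a measurement to low / medium / high using half-open intervals."""
    if value_cm < 0:
        raise ValueError("measurements must be non-negative")
    if value_cm < low_upper:
        return "low"
    if value_cm < medium_upper:
        return "medium"
    return "high"


# The three rules from the example, keyed by (petal width, petal length) category.
RULES = {
    ("low", "low"): "Setosa",
    ("medium", "medium"): "Versicolour",
    ("high", "high"): "Virginica",
}


def classify_iris(petal_length_cm: float, petal_width_cm: float) -> Optional[str]:
    """Return the species implied by the rules, or None if no rule applies."""
    width = categorize(petal_width_cm, 0.75, 1.75)   # [0, 0.75), [0.75, 1.75), [1.75, inf)
    length = categorize(petal_length_cm, 2.5, 5.0)   # [0, 2.5), [2.5, 5), [5, inf)
    return RULES.get((width, length))


# Illustrative (made-up) measurements in centimeters.
print(classify_iris(1.4, 0.2))  # short, narrow petal
print(classify_iris(4.5, 1.5))  # medium petal
print(classify_iris(6.0, 2.5))  # long, wide petal
print(classify_iris(5.1, 1.5))  # long but medium-width petal: no rule applies
```

The last call shows why the rules "do not classify all the flowers": a flower whose width and length fall into different categories matches none of the three rules.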



[Figure 1.4: scatter plot of petal width versus petal length (cm) for the three species Setosa, Versicolour, and Virginica.]

Figure 1.4. Petal width versus petal length for 150 Iris flowers.

Association analysis is used to discover patterns that describe strongly associated features in the data. The discovered patterns are typically represented in the form of implication rules or feature subsets. Because of the exponential size of its search space, the goal of association analysis is to extract the most interesting patterns in an efficient manner. Useful applications of association analysis include finding groups of genes that have related functionality, identifying Web pages that are accessed together, or understanding the relationships between different elements of Earth's climate system.

Example 1.2 (Market Basket Analysis). The transactions shown in Table 1.1 illustrate point-of-sale data collected at the checkout counters of a grocery store. Association analysis can be applied to find items that are frequently bought together by customers. For example, we may discover the rule {Diapers} → {Milk}, which suggests that customers who buy diapers also tend to buy milk. This type of rule can be used to identify potential cross-selling opportunities among related items.
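One way to check how well the rule {Diapers} → {Milk} holds in the Table 1.1 transactions is to count how often the items occur together. The sketch below uses two measures, support and confidence, which this section does not define; they are introduced here as an assumption for illustration, and the function names are hypothetical. The item sets are copied from Table 1.1, with garbled entries read as Salmon, Eggs, and Milk.

```python
# Transactions from Table 1.1 (transaction ID -> set of items).
TRANSACTIONS = {
    1: {"Bread", "Butter", "Diapers", "Milk"},
    2: {"Coffee", "Sugar", "Cookies", "Salmon"},
    3: {"Bread", "Butter", "Coffee", "Diapers", "Milk", "Eggs"},
    4: {"Bread", "Butter", "Salmon", "Chicken"},
    5: {"Eggs", "Bread", "Butter"},
    6: {"Salmon", "Diapers", "Milk"},
    7: {"Bread", "Tea", "Sugar", "Eggs"},
    8: {"Coffee", "Sugar", "Chicken", "Eggs"},
    9: {"Bread", "Diapers", "Milk", "Salt"},
    10: {"Tea", "Eggs", "Cookies", "Diapers", "Milk"},
}


def count_containing(itemset, transactions=TRANSACTIONS):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for items in transactions.values() if itemset <= items)


def support(itemset, transactions=TRANSACTIONS):
    """Fraction of all transactions that contain the itemset."""
    return count_containing(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions=TRANSACTIONS):
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent."""
    antecedent_count = count_containing(antecedent, transactions)
    if antecedent_count == 0:
        raise ValueError("antecedent never occurs in the transactions")
    return count_containing(antecedent | consequent, transactions) / antecedent_count


print(support({"Diapers", "Milk"}))        # 0.5: five of the ten transactions
print(confidence({"Diapers"}, {"Milk"}))   # 1.0: every diaper purchase includes milk
print(confidence({"Bread"}, {"Butter"}))   # a weaker rule, for comparison
```

In this small data set every transaction that contains diapers also contains milk, which is why the rule stands out. Checking all candidate rules this way quickly becomes expensive, which is the search-space problem the text mentions.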

Table 1.1. Market basket data.

Transaction ID   Items
1                {Bread, Butter, Diapers, Milk}
2                {Coffee, Sugar, Cookies, Salmon}
3                {Bread, Butter, Coffee, Diapers, Milk, Eggs}
4                {Bread, Butter, Salmon, Chicken}
5                {Eggs, Bread, Butter}
6                {Salmon, Diapers, Milk}
7                {Bread, Tea, Sugar, Eggs}
8                {Coffee, Sugar, Chicken, Eggs}
9                {Bread, Diapers, Milk, Salt}
10               {Tea, Eggs, Cookies, Diapers, Milk}

Cluster analysis seeks to find groups of closely related observations so that observations that belong to the same cluster are more similar to each other than observations that belong to other clusters. Clustering has been used to group sets of related customers, find areas of the ocean that have a significant impact on the Earth's climate, and compress data.

Example 1.3 (Document Clustering). The collection of news articles shown in Table 1.2 can be grouped based on their respective topics. Each article is represented as a set of word-frequency pairs (w, c), where w is a word and c is the number of times the word appears in the article. There are two natural clusters in the data set. The first cluster consists of the first four articles, which correspond to news about the economy, while the second cluster contains the last four articles, which correspond to news about health care. A good clustering algorithm should be able to identify these two clusters based on the similarity between words that appear in the articles.

Table 1.2. Collection of news articles.

Article   Words
1         dollar: 1, industry: 4, country: 2, loan: 3, deal: 2, government: 2
2         machinery: 2, labor: 3, market: 4, industry: 2, work: 3, country: 1
3         job: 5, inflation: 3, rise: 2, jobless: 2, market: 3, country: 2, index: 3
4         domestic: 3, forecast: 2, gain: 1, market: 2, sale: 3, price: 2
5         patient: 4, symptom: 2, drug: 3, health: 2, clinic: 2, doctor: 2
6         pharmaceutical: 2, company: 3, drug: 2, vaccine: 1, flu: 3
7         death: 2, cancer: 4, drug: 3, public: 4, health: 3, director: 2
8         medical: 2, cost: 3, increase: 2, patient: 2, health: 3, care: 1
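The idea of grouping articles by "the similarity between words" can be tried out on the Table 1.2 data. The sketch below measures similarity with cosine similarity between word-frequency vectors and then groups articles that are linked, directly or through other articles, by a positive similarity. Both choices are assumptions made for illustration: the text does not name a similarity measure, and the grouping step is a deliberately simple stand-in for a real clustering algorithm. It works here only because the two topics share no words. All function names are hypothetical.

```python
import math

# Word-frequency pairs (w, c) from Table 1.2, keyed by article number.
ARTICLES = {
    1: {"dollar": 1, "industry": 4, "country": 2, "loan": 3, "deal": 2, "government": 2},
    2: {"machinery": 2, "labor": 3, "market": 4, "industry": 2, "work": 3, "country": 1},
    3: {"job": 5, "inflation": 3, "rise": 2, "jobless": 2, "market": 3, "country": 2, "index": 3},
    4: {"domestic": 3, "forecast": 2, "gain": 1, "market": 2, "sale": 3, "price": 2},
    5: {"patient": 4, "symptom": 2, "drug": 3, "health": 2, "clinic": 2, "doctor": 2},
    6: {"pharmaceutical": 2, "company": 3, "drug": 2, "vaccine": 1, "flu": 3},
    7: {"death": 2, "cancer": 4, "drug": 3, "public": 4, "health": 3, "director": 2},
    8: {"medical": 2, "cost": 3, "increase": 2, "patient": 2, "health": 3, "care": 1},
}


def cosine_similarity(a, b):
    """Cosine of the angle between two word-frequency vectors (dicts)."""
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)


def group_by_similarity(docs, threshold=0.0):
    """Group documents into connected components of the graph that links
    two documents whenever their cosine similarity exceeds threshold."""
    unassigned = set(docs)
    clusters = []
    while unassigned:
        seed = min(unassigned)
        unassigned.remove(seed)
        cluster, frontier = {seed}, [seed]
        while frontier:
            current = frontier.pop()
            linked = {d for d in unassigned
                      if cosine_similarity(docs[current], docs[d]) > threshold}
            unassigned -= linked
            cluster |= linked
            frontier.extend(linked)
        clusters.append(sorted(cluster))
    return clusters


print(cosine_similarity(ARTICLES[1], ARTICLES[5]))  # 0.0: no words in common
print(group_by_similarity(ARTICLES))                # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Note that some articles on the same topic, such as articles 1 and 4, also share no words. They end up in the same group only because each is similar to articles 2 or 3, which is one reason practical clustering algorithms look at more than pairwise word overlap.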



1.5 Scope and Organization of the Book 11
