$$Z = \frac{s(X) - \mathit{minsup}}{\sqrt{\mathit{minsup} \times (1 - \mathit{minsup})/N}} \tag{C.20}$$

728 Appendix C Probability and Statistics

$Z$ has a standard normal distribution with mean 0 and variance 1. The statistic essentially measures the difference between the observed support $s(X)$ and the $\mathit{minsup}$ threshold in units of standard deviations. Let $N = 10000$, $s(X) = 11\%$, and $\mathit{minsup} = 10\%$. The Z-statistic under the null hypothesis is $Z = (0.11 - 0.1)/\sqrt{0.09/10000} = 3.33$. From the probability table of a standard normal distribution, a one-sided test with $Z = 3.33$ corresponds to a $p$-value of $4.34 \times 10^{-4}$.

Suppose $\alpha = 0.001$ is the desired significance level. $\alpha$ controls the probability of falsely rejecting the null hypothesis even though the hypothesis is true (in the statistics literature, this is known as the Type 1 error). For example, an $\alpha$ value of 0.01 suggests that there is a one in a hundred chance the discovered pattern is spurious. At each significance level $\alpha$, there is a corresponding threshold $Z_\alpha$, such that when the $Z$ value of a pattern exceeds the threshold, the pattern is considered statistically significant. The threshold $Z_\alpha$ can be looked up in a probability table for the standard normal distribution. For example, the choice of $\alpha = 0.001$ sets up a rejection region with $Z_\alpha = 3.09$. Since $p < \alpha$, or equivalently, $Z > Z_\alpha$, the null hypothesis is rejected and the pattern is considered statistically interesting.
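The test above can be reproduced with a few lines of Python. This is a minimal sketch that uses the standard library's `statistics.NormalDist` in place of a printed probability table; the function name `support_z_test` and its arguments are illustrative and do not come from the text. Because the code uses the unrounded statistic ($Z \approx 3.3333$) instead of the table entry for 3.33, the computed $p$-value is slightly smaller than the tabulated $4.34 \times 10^{-4}$.

```python
from math import sqrt
from statistics import NormalDist


def support_z_test(support, minsup, n, alpha):
    """One-sided Z-test of an observed support against the minsup threshold.

    Returns the Z-statistic, its p-value, the rejection threshold Z_alpha,
    and whether the null hypothesis is rejected at level alpha.
    """
    # Z-statistic as in Equation C.20.
    z = (support - minsup) / sqrt(minsup * (1.0 - minsup) / n)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    p_value = 1.0 - std_normal.cdf(z)          # upper-tail probability
    z_alpha = std_normal.inv_cdf(1.0 - alpha)  # rejection threshold
    return z, p_value, z_alpha, z > z_alpha


# Values from the worked example: N = 10000, s(X) = 11%, minsup = 10%.
z, p_value, z_alpha, significant = support_z_test(0.11, 0.10, 10000, 0.001)
print(f"Z = {z:.2f}, p = {p_value:.2e}, Z_alpha = {z_alpha:.2f}, reject = {significant}")
```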

Regression

Regression is a predictive modeling technique where the target variable to be estimated is continuous. Examples of applications of regression include predicting a stock market index using other economic indicators, forecasting the amount of precipitation in a region based on characteristics of the jet stream, projecting the total sales of a company based on the amount spent for advertising, and estimating the age of a fossil according to the amount of carbon-14 left in the organic material.

D.1 Preliminaries

Let $D$ denote a data set that contains $N$ observations,

$$D = \{(\mathbf{x}_i, y_i) \mid i = 1, 2, \ldots, N\}.$$

Each $\mathbf{x}_i$ corresponds to the set of attributes of the $i$th observation (also known as the explanatory variables) and $y_i$ corresponds to the target (or response) variable. The explanatory attributes of a regression task can be either discrete or continuous.

Definition D.1 (Regression). Regression is the task of learning a target function $f$ that maps each attribute set $\mathbf{x}$ into a continuous-valued output $y$.

The goal of regression is to find a target function that can fit the input data with minimum error. The error function for a regression task can be

730 Appendix D Regression

expressed in terms of the sum of absolute or squared error:

$$\text{Absolute Error} = \sum_i |y_i - f(\mathbf{x}_i)| \tag{D.1}$$

$$\text{Squared Error} = \sum_i (y_i - f(\mathbf{x}_i))^2 \tag{D.2}$$
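As a small illustration of Equations D.1 and D.2, the following sketch evaluates both error functions for a candidate model. The data points and the model `f` are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def absolute_error(f, data):
    """Sum of absolute errors (Equation D.1) of model f over (x, y) pairs."""
    return sum(abs(y - f(x)) for x, y in data)


def squared_error(f, data):
    """Sum of squared errors (Equation D.2) of model f over (x, y) pairs."""
    return sum((y - f(x)) ** 2 for x, y in data)


# Hypothetical observations and a hypothetical candidate model f(x) = 2x.
data = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5)]
f = lambda x: 2.0 * x

# The residuals are 0, 0.5 and -0.5.
print(absolute_error(f, data), squared_error(f, data))
```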

Heat Flux: 6.3221, 6.0325, 5.7429, 5.5016, 5.2603, 5.1638, 5.0673, 4.9708, 4.8743, 4.7777, 4.7295, 4.633, [illegible]

Skin Temperature: 31.581, 31.618, 31.674, 31.712, 31.768, 31.825, 31.862, 31.919, 31.975, 32.013, 32.07, 32.126, 32.164

D.2 Simple Linear Regression

Consider the physiological data shown in Figure D.1. The data corresponds to measurements of heat flux and skin temperature of a person during sleep. Suppose we are interested in predicting the skin temperature of a person based on the heat flux measurements generated by a heat sensor. The two-dimensional scatter plot shows that there is a strong linear relationship between the two variables.

Heat Flux: [illegible], 10.617, 10.183, 9.7003, [illegible], 10.086, 9.459, [illegible], 7.6251, 7.1907, 7.046, 6.9494, 6.7081

Skin Temperature: [illegible], 31.021, 31.058, 31.095, 31.133, 31.188, 31.226, 31.263, 31.319, 31.356, 31.412, 31.468, 31.524

Heat Flux: 4.2951, 4.2469, 4.0056, 3.716, 3.523, 3.4265, 3.3782, 3.4265, 3.3752, 3.3299, 3.3299, 3.4265

Skin Temperature: 32.259, 32.296, 32.334, 32.391, 32.448, 32.505, 32.543, 32.6, 32.657, 32.696, 32.753, 32.791

Figure D.1. Measurements of heat flux and skin temperature of a person.


D.2.1 Least Square Method

Suppose we wish to fit the following linear model to the observed data:

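$$f(x) = \omega_1 x + \omega_0, \tag{D.3}$$

where $\omega_0$ and $\omega_1$ are parameters of the model and are called the regression coefficients. A standard approach for doing this is to apply the method of least squares, which attempts to find the parameters $(\omega_0, \omega_1)$ that minimize the sum of the squared error

$$\text{SSE} = \sum_{i=1}^{N} [y_i - f(x_i)]^2 = \sum_{i=1}^{N} [y_i - \omega_1 x_i - \omega_0]^2, \tag{D.4}$$

which is also known as the residual sum of squares.

This optimization problem can be solved by taking the partial derivatives of $E$ (the SSE) with respect to $\omega_0$ and $\omega_1$, setting them to zero, and solving the corresponding system of linear equations.

$$\frac{\partial E}{\partial \omega_0} = -2 \sum_{i=1}^{N} [y_i - \omega_1 x_i - \omega_0] = 0$$

$$\frac{\partial E}{\partial \omega_1} = -2 \sum_{i=1}^{N} [y_i - \omega_1 x_i - \omega_0]\, x_i = 0 \tag{D.5}$$

These equations can be summarized by the following matrix equation, which is also known as the normal equation:

$$\begin{pmatrix} N & \sum_i x_i \\ \sum_i x_i & \sum_i x_i^2 \end{pmatrix} \begin{pmatrix} \omega_0 \\ \omega_1 \end{pmatrix} = \begin{pmatrix} \sum_i y_i \\ \sum_i x_i y_i \end{pmatrix} \tag{D.6}$$

Since $\sum_i x_i = 229.9$, $\sum_i x_i^2 = 1569.2$, $\sum_i y_i = 1242.9$, and $\sum_i x_i y_i = 7279.7$, the normal equations can be solved to obtain the following estimates for the parameters.

$$\begin{pmatrix} \hat{\omega}_0 \\ \hat{\omega}_1 \end{pmatrix} = \begin{pmatrix} 39 & 229.9 \\ 229.9 & 1569.2 \end{pmatrix}^{-1} \begin{pmatrix} 1242.9 \\ 7279.7 \end{pmatrix} = \begin{pmatrix} 0.1881 & -0.0276 \\ -0.0276 & 0.0047 \end{pmatrix} \begin{pmatrix} 1242.9 \\ 7279.7 \end{pmatrix} = \begin{pmatrix} 33.1699 \\ -0.2208 \end{pmatrix}$$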
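The normal equation D.6 is a 2-by-2 linear system, so it can be solved directly. The sketch below does this with Cramer's rule in plain Python, using $N = 39$ and the rounded sums quoted in the text; the function name `solve_normal_equations` is illustrative. Because the inputs are rounded, the results agree with the estimates in the text only to about two decimal places.

```python
def solve_normal_equations(n, sum_x, sum_xx, sum_y, sum_xy):
    """Solve the 2x2 normal equation (D.6) for (w0, w1) by Cramer's rule."""
    det = n * sum_xx - sum_x ** 2
    if det == 0:
        raise ValueError("normal equations are singular (all x values equal)")
    w0 = (sum_xx * sum_y - sum_x * sum_xy) / det  # intercept
    w1 = (n * sum_xy - sum_x * sum_y) / det       # slope
    return w0, w1


# Rounded sums quoted in the text for the heat flux data (N = 39).
w0, w1 = solve_normal_equations(39, 229.9, 1569.2, 1242.9, 7279.7)
print(f"w0 = {w0:.2f}, w1 = {w1:.2f}")
```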


Thus, the linear model that best fits the data in terms of minimizing the SSE is

$$f(x) = 33.17 - 0.22x.$$

Figure D.2 shows the line corresponding to this model.

Figure D.2. A linear model that fits the data given in Figure D.1.

We can show that the general solution to the normal equations given in D.6 can be expressed as follows:

$$\hat{\omega}_0 = \bar{y} - \hat{\omega}_1 \bar{x}, \qquad \hat{\omega}_1 = \frac{\sigma_{xy}}{\sigma_{xx}}, \tag{D.7}$$

where $\bar{x} = \sum_i x_i / N$, $\bar{y} = \sum_i y_i / N$, and

$$\sigma_{xy} = \sum_i (x_i - \bar{x})(y_i - \bar{y}) \tag{D.8}$$

$$\sigma_{xx} = \sum_i (x_i - \bar{x})^2 \tag{D.9}$$

$$\sigma_{yy} = \sum_i (y_i - \bar{y})^2 \tag{D.10}$$


Thus, the linear model that results in the minimum squared error is given by

$$f(x) = \bar{y} + \frac{\sigma_{xy}}{\sigma_{xx}} [x - \bar{x}]. \tag{D.11}$$
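The closed-form solution translates directly into code. The following is a minimal sketch that computes the intercept and slope from the centered sums of products; the function name `fit_simple_linear` and the sample data are hypothetical, with the data generated exactly from $y = 2x + 1$ so the result is easy to check by hand.

```python
def fit_simple_linear(xs, ys):
    """Least squares fit of f(x) = w1 * x + w0 using the closed-form solution."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Centered sums of products and squares.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    if s_xx == 0:
        raise ValueError("all x values are equal; the slope is undefined")
    w1 = s_xy / s_xx
    w0 = y_bar - w1 * x_bar
    return w0, w1


# Hypothetical data lying exactly on the line y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
w0, w1 = fit_simple_linear(xs, ys)
print(w0, w1)
```

Working with centered sums avoids forming and inverting the 2-by-2 matrix of raw sums explicitly.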

In summary, the least squares method is a systematic approach to fit a linear model to the response variable $y$ by minimizing the squared error between the true and estimated values of $y$. Although the model is relatively simple, it seems to provide a reasonably accurate approximation because a linear model is the first-order Taylor series approximation for any function with continuous derivatives.

D.2.2 Analyzing Regression Errors

Some data sets may contain errors in their measurements of $\mathbf{x}$ and $y$. In addition, there may exist confounding factors that affect the response variable $y$ but are not included in the model specification. Because of this, the response variable $y$ in regression tasks can be non-deterministic, i.e., it may produce a different value even though the same attribute set $\mathbf{x}$ is provided.

We can model this type of situation using a probabilistic approach, where $y$ is treated as a random variable:

$$y = f(\mathbf{x}) + [y - f(\mathbf{x})] = f(\mathbf{x}) + \epsilon. \tag{D.12}$$

Both measurement errors and errors in model specification have been absorbed into a random noise term, $\epsilon$. The random noise present in data is typically assumed to be independent and follow a certain probability distribution.

For example, if the random noise comes from a normal distribution with zero mean and variance $\sigma^2$, then

$$P(\epsilon \mid \mathbf{x}, \Omega) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{[y - f(\mathbf{x}, \Omega)]^2}{2\sigma^2}\right) \tag{D.13}$$

$$\log[P(\epsilon \mid \mathbf{x}, \Omega)] = -\frac{1}{2\sigma^2} (y - f(\mathbf{x}, \Omega))^2 + \text{constant} \tag{D.14}$$

This analysis shows that minimizing the SSE, $[y - f(\mathbf{x}, \Omega)]^2$, implicitly assumes that the random noise follows a normal distribution. Furthermore, it can be shown that the constant model, $f(\mathbf{x}, \Omega) = c$, that best minimizes this type of error is the mean, i.e., $c = \bar{y}$.
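The link between the normal noise model and the SSE can be checked numerically. In the sketch below, the negative log-likelihood of a set of residuals under independent zero-mean Gaussian noise equals a constant plus SSE divided by $2\sigma^2$, so the model with the smaller SSE always has the higher likelihood. The residual values and the name `gaussian_nll` are hypothetical.

```python
from math import log, pi


def gaussian_nll(residuals, sigma):
    """Negative log-likelihood of residuals under i.i.d. N(0, sigma^2) noise."""
    n = len(residuals)
    sse = sum(r * r for r in residuals)
    # The first term does not depend on the model; the second is SSE / (2 sigma^2).
    return 0.5 * n * log(2.0 * pi * sigma ** 2) + sse / (2.0 * sigma ** 2)


# Hypothetical residuals y - f(x) of two candidate models on the same data.
residuals_a = [0.1, -0.2, 0.1]   # SSE = 0.06
residuals_b = [0.5, -0.5, 0.3]   # SSE = 0.59
nll_a = gaussian_nll(residuals_a, sigma=1.0)
nll_b = gaussian_nll(residuals_b, sigma=1.0)
print(nll_a < nll_b)
```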


Another typical probability model for noise uses the Laplacian distribution:

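$$P(\epsilon \mid \mathbf{x}, \Omega) = c \exp\left(-c\,|y - f(\mathbf{x}, \Omega)|\right) \tag{D.15}$$

$$\log[P(\epsilon \mid \mathbf{x}, \Omega)] = -c\,|y - f(\mathbf{x}, \Omega)| + \text{constant} \tag{D.16}$$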

(D.19)

This suggests that minimizing the absolute error $|y - f(\mathbf{x}, \Omega)|$ implicitly assumes that the random noise follows a Laplacian distribution. The best constant model for this case corresponds to $f(\mathbf{x}, \Omega) = \tilde{y}$, the median value of $y$.
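The two results about constant models can be checked with a brute-force search. The sketch below scans a grid of candidate constants over a small hypothetical sample containing one outlier, and confirms that the squared error is minimized at the mean while the absolute error is minimized at the median. The grid search is used only for illustration; it is not how these estimates would normally be computed.

```python
from statistics import mean, median


def sum_squared_error(c, ys):
    """Squared error of the constant model f(x) = c."""
    return sum((y - c) ** 2 for y in ys)


def sum_absolute_error(c, ys):
    """Absolute error of the constant model f(x) = c."""
    return sum(abs(y - c) for y in ys)


# Hypothetical sample with one large outlier.
ys = [1.0, 2.0, 3.0, 4.0, 100.0]
y_mean = mean(ys)
y_median = median(ys)

# Candidate constants 0.0, 0.1, ..., 100.0.
candidates = [i / 10 for i in range(0, 1001)]
best_sq = min(candidates, key=lambda c: sum_squared_error(c, ys))
best_abs = min(candidates, key=lambda c: sum_absolute_error(c, ys))
print(best_sq, y_mean, best_abs, y_median)
```

The outlier pulls the mean far away from the bulk of the data, while the median is unaffected, which is why the absolute error is often preferred when the noise has heavy tails.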

Besides the SSE given in Equation D.4, we can also define two other types of errors:

$$\text{SST} = \sum_i (y_i - \bar{y})^2 \tag{D.17}$$

$$\text{SSM} = \sum_i (f(x_i) - \bar{y})^2 \tag{D.18}$$