where SST is known as the total sum of squares and SSM is known as the regression sum of squares. SST represents the prediction error when the average value \bar{y} is used as an estimate for the response variable. SSM, on the other hand, represents the amount of this error that is eliminated by the regression model. The relationship among SST, SSE, and SSM is derived as follows:

SSE = \sum_i [y_i - f(x_i)]^2
    = \sum_i [y_i - \bar{y} + \bar{y} - f(x_i)]^2
    = \sum_i [y_i - \bar{y}]^2 + \sum_i [f(x_i) - \bar{y}]^2 - 2 \sum_i (y_i - \bar{y})(f(x_i) - \bar{y})
    = \sum_i [y_i - \bar{y}]^2 + \sum_i [f(x_i) - \bar{y}]^2 - 2 \sum_i \omega_1 (y_i - \bar{y})(x_i - \bar{x})
    = \sum_i [y_i - \bar{y}]^2 + \sum_i [f(x_i) - \bar{y}]^2 - 2 \sum_i \omega_1^2 (x_i - \bar{x})^2
    = \sum_i [y_i - \bar{y}]^2 - \sum_i [f(x_i) - \bar{y}]^2
    = SST - SSM

where we have applied the following relationships:

\bar{y} - f(x_i) = -\omega_1 (x_i - \bar{x})

\omega_1 \sum_i (x_i - \bar{x})^2 = \sum_i (x_i - \bar{x})(y_i - \bar{y}).


Thus, we can write SST = SSE + SSM.
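This decomposition can be verified numerically. Below is a minimal sketch with NumPy; the data values are made up for illustration, and the least-squares coefficients \omega_0, \omega_1 are computed as in Section D.2.1:

```python
import numpy as np

# Made-up data for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Least-squares fit of f(x) = w0 + w1*x.
w1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
w0 = y.mean() - w1 * x.mean()
f = w0 + w1 * x

SST = ((y - y.mean()) ** 2).sum()   # error of predicting with the mean
SSE = ((y - f) ** 2).sum()          # error of the regression model
SSM = ((f - y.mean()) ** 2).sum()   # variability explained by the model

assert np.isclose(SST, SSE + SSM)   # SST = SSE + SSM holds for the LS fit
```

Note that the identity relies on the fit being the least-squares solution; for an arbitrary line the cross term in the derivation does not vanish.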

D.2.3 Analyzing Goodness of Fit

One way to measure the goodness of the fit is by computing the following measure:

R^2 = \frac{SSM}{SST} = \frac{\sum_i [f(x_i) - \bar{y}]^2}{\sum_i [y_i - \bar{y}]^2}.   (D.20)

The R^2 (or coefficient of determination) for a regression model may range between 0 and 1. Its value is close to 1 if most of the variability observed in the response variable can be explained by the regression model.

R^2 is also related to the correlation coefficient, r, which measures the strength of the linear relationship between the explanatory and response variables:

r = \frac{\sigma_{xy}}{\sqrt{\sigma_{xx}\sigma_{yy}}}.   (D.21)

From Equations D.9, D.10, and D.11, we can write

R^2 = \frac{\sum_i [f(x_i) - \bar{y}]^2}{\sum_i [y_i - \bar{y}]^2}
    = \frac{\sum_i [\frac{\sigma_{xy}}{\sigma_{xx}}(x_i - \bar{x})]^2}{\sum_i [y_i - \bar{y}]^2}
    = \frac{\sigma_{xy}^2}{\sigma_{xx}^2} \times \frac{\sigma_{xx}}{\sigma_{yy}}
    = \frac{\sigma_{xy}^2}{\sigma_{xx}\sigma_{yy}}.   (D.22)

The above analysis shows that the correlation coefficient is equivalent to the square root of the coefficient of determination (except for its sign, which depends on the direction of the relationship, whether positive or negative).
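The relationship R^2 = r^2 can be checked numerically; a small sketch with NumPy (the data values are made up):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.9])

# Population covariances/variances, as in the sigma_xy, sigma_xx, sigma_yy notation.
sxy = ((x - x.mean()) * (y - y.mean())).mean()
sxx = ((x - x.mean()) ** 2).mean()
syy = ((y - y.mean()) ** 2).mean()

r = sxy / np.sqrt(sxx * syy)   # correlation coefficient (D.21)
R2 = sxy ** 2 / (sxx * syy)    # coefficient of determination (D.22)

assert np.isclose(R2, r ** 2)
assert np.isclose(r, np.corrcoef(x, y)[0, 1])
```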

It is worth noting that R^2 increases as we add more explanatory variables into the model. One way to correct for the number of explanatory variables added to the model is by using the following adjusted R^2 measure:

Adjusted R^2 = 1 - \left(\frac{N-1}{N-d-1}\right)(1 - R^2),   (D.23)

where N is the number of data points and d + 1 is the number of parameters of the regression model.
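A small sketch of Equation D.23, assuming as stated that the model has d explanatory variables (d + 1 parameters) and N data points; the R^2 values below are made up:

```python
def adjusted_r2(R2, N, d):
    """Adjusted R^2 for a model with d explanatory variables
    (d + 1 parameters) fitted to N data points."""
    return 1.0 - (N - 1) / (N - d - 1) * (1.0 - R2)

# A slightly higher raw R^2 obtained with many more variables
# can still score worse after the adjustment:
small_model = adjusted_r2(0.90, N=30, d=2)    # ~0.893
large_model = adjusted_r2(0.91, N=30, d=10)   # ~0.863
assert large_model < small_model
```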

D.3 Multivariate Linear Regression

The normal equations can be written in a more compact form using the following matrix notation. Let X = (1 x), where 1 = (1, 1, \ldots, 1)^T and x = (x_1, x_2, \ldots, x_N)^T. Then, we can show that

X^T X = \begin{pmatrix} 1^T 1 & 1^T x \\ x^T 1 & x^T x \end{pmatrix}
      = \begin{pmatrix} N & \sum_i x_i \\ \sum_i x_i & \sum_i x_i^2 \end{pmatrix},   (D.24)

which is equivalent to the left-hand side matrix of the normal equation. Similarly, if y = (y_1, y_2, \ldots, y_N)^T, we can show that

(1\ x)^T y = \begin{pmatrix} 1^T y \\ x^T y \end{pmatrix}
           = \begin{pmatrix} \sum_i y_i \\ \sum_i x_i y_i \end{pmatrix},   (D.25)

which is equivalent to the right-hand side matrix of the normal equation. Substituting Equations D.24 and D.25 into Equation D.6, we obtain the following matrix equation:

X^T X \Omega = X^T y,   (D.26)

where \Omega = (\omega_0, \omega_1)^T. We can solve for the parameters in \Omega as follows:

\Omega = (X^T X)^{-1} X^T y.   (D.27)

The above notation is useful because it allows us to extend the linear regression method to the multivariate case. More specifically, if the attribute set consists of d explanatory attributes (x_1, x_2, \ldots, x_d), X becomes an N \times (d+1) design matrix:

X = \begin{pmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1d} \\
1 & x_{21} & x_{22} & \cdots & x_{2d} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{N1} & x_{N2} & \cdots & x_{Nd}
\end{pmatrix},   (D.28)

while \Omega = (\omega_0, \omega_1, \ldots, \omega_d)^T is a (d+1)-dimensional vector. The parameters can be computed by solving the matrix equation given in Equation D.26.
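As a sketch of Equation D.27 in the multivariate case, with made-up data and NumPy. Note that solving the linear system of Equation D.26 directly is numerically preferable to forming the inverse:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 3
A = rng.normal(size=(N, d))               # d explanatory attributes
w_true = np.array([1.5, -2.0, 0.5, 3.0])  # [w0, w1, w2, w3], made up
y = w_true[0] + A @ w_true[1:] + 0.01 * rng.normal(size=N)

X = np.column_stack([np.ones(N), A])      # N x (d+1) design matrix
# Solve X^T X w = X^T y (Equation D.26) rather than computing (X^T X)^-1.
w = np.linalg.solve(X.T @ X, X.T @ y)

print(w)  # close to w_true, up to the small added noise
```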


D.4 Alternative Least-Square Regression Methods

The least squares method can also be used to find other types of regression models that minimize the SSE. More specifically, if the regression model is

y = f(\mathbf{x}, \Omega) + \epsilon   (D.29)
  = \omega_0 + \sum_i \omega_i g_i(\mathbf{x}) + \epsilon,   (D.30)

and the random noise \epsilon is normally distributed, then we can apply the same methodology as before to determine the parameter vector \Omega. The g_i's can be any type of basis functions, including polynomial, kernel, and other nonlinear functions.

For example, suppose \mathbf{x} is a two-dimensional feature vector and the regression model is a polynomial function of degree 2:

f(x_1, x_2, \Omega) = \omega_0 + \omega_1 x_1 + \omega_2 x_2 + \omega_3 x_1 x_2 + \omega_4 x_1^2 + \omega_5 x_2^2.   (D.31)

If we create the following design matrix:

X = \begin{pmatrix}
1 & x_{11} & x_{12} & x_{11}x_{12} & x_{11}^2 & x_{12}^2 \\
1 & x_{21} & x_{22} & x_{21}x_{22} & x_{21}^2 & x_{22}^2 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
1 & x_{N1} & x_{N2} & x_{N1}x_{N2} & x_{N1}^2 & x_{N2}^2
\end{pmatrix},   (D.32)

where x_{ij} is the jth attribute of the ith observation, then the regression problem becomes equivalent to solving Equation D.26. The least-square solution to the parameter vector \Omega is given by Equation D.27. By choosing the appropriate design matrix, we can extend this method to any type of basis functions.
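A minimal sketch of this approach with NumPy (the data and coefficients are made up): build the degree-2 design matrix of Equation D.32 and solve the resulting least-squares problem:

```python
import numpy as np

def quadratic_design(x1, x2):
    """Design matrix for f(x1, x2) = w0 + w1*x1 + w2*x2
    + w3*x1*x2 + w4*x1^2 + w5*x2^2 (Equation D.32)."""
    return np.column_stack(
        [np.ones_like(x1), x1, x2, x1 * x2, x1 ** 2, x2 ** 2])

rng = np.random.default_rng(1)
x1 = rng.uniform(-1, 1, 200)
x2 = rng.uniform(-1, 1, 200)
w_true = np.array([0.5, 1.0, -2.0, 3.0, 0.25, -1.5])  # made up
y = quadratic_design(x1, x2) @ w_true                 # no noise added

X = quadratic_design(x1, x2)
w, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares solution of D.26
print(w)  # recovers w_true
```

The same pattern works for any basis functions g_i: only the columns of the design matrix change.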

Optimization

Optimization is a methodology for finding the maximum or minimum value of a function. It is an important topic in data mining because there are many data mining tasks that can be cast as optimization problems. For example, the K-means clustering algorithm described in Section 8.2.1 seeks to find a set of clusters that minimizes the sum of the squared error (SSE). Similarly, the method of least squares presented in Section D.2.1 is designed to learn the regression coefficients that minimize the SSE of the model. This section presents a brief overview of the various techniques used to solve optimization problems.

E.1 Unconstrained Optimization

Suppose f(x) is a univariate function with continuous first-order and second-order derivatives. In an unconstrained optimization problem, the task is to locate the solution x^* that maximizes or minimizes f(x) without imposing any constraints on x^*. The solution x^*, which is known as a stationary point, can be found by taking the first derivative of f and setting it to zero:

\left.\frac{df}{dx}\right|_{x = x^*} = 0.

f(x^*) can take a maximum or minimum value depending on the second-order derivative of the function:

o x^* is a maximum stationary point if \frac{d^2 f}{dx^2} < 0 at x = x^*.

o x^* is a minimum stationary point if \frac{d^2 f}{dx^2} > 0 at x = x^*.

o x^* is a point of inflection when \frac{d^2 f}{dx^2} = 0 at x = x^*.

Figure E.1 illustrates an example of a function that contains all three station- ary points (maximum, minimum, and point of inflection).

Figure E.1. Stationary points of a function.
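These conditions can also be checked numerically with finite-difference approximations of the derivatives. A sketch (the example functions and tolerances are made up for illustration):

```python
def classify(f, x, h=1e-5):
    """Classify a stationary point x of f via finite-difference derivatives."""
    d1 = (f(x + h) - f(x - h)) / (2 * h)            # ~ f'(x)
    d2 = (f(x + h) - 2 * f(x) + f(x - h)) / h ** 2  # ~ f''(x)
    assert abs(d1) < 1e-6, "x is not a stationary point"
    if d2 > 1e-6:
        return "minimum"
    if d2 < -1e-6:
        return "maximum"
    return "possible point of inflection"

f = lambda x: x ** 3 - 3 * x   # stationary points at x = -1 and x = 1
print(classify(f, -1.0))       # maximum
print(classify(f, 1.0))        # minimum
print(classify(lambda x: x ** 3, 0.0))  # possible point of inflection
```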

This definition can be extended to a multivariate function, f(x_1, x_2, \ldots, x_d), where the condition for finding a stationary point \mathbf{x}^* = [x_1^*, x_2^*, \ldots, x_d^*]^T is

\left.\frac{\partial f}{\partial x_i}\right|_{\mathbf{x} = \mathbf{x}^*} = 0, \quad \forall i = 1, 2, \ldots, d.   (E.1)

However, unlike univariate functions, it is more difficult to determine whether \mathbf{x}^* corresponds to a maximum or minimum stationary point. The difficulty arises because we need to consider the partial derivatives \frac{\partial^2 f}{\partial x_i \partial x_j} for all possible pairs of i and j. The complete set of second-order partial derivatives is given by the Hessian matrix

H(\mathbf{x}) = \begin{pmatrix}
\frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_d} \\
\frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_d} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 f}{\partial x_d \partial x_1} & \frac{\partial^2 f}{\partial x_d \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_d^2}
\end{pmatrix}.   (E.2)

o A Hessian matrix H is positive definite if and only if \mathbf{x}^T H \mathbf{x} > 0 for any non-zero vector \mathbf{x}. If H(\mathbf{x}^*) is positive definite, then \mathbf{x}^* is a minimum stationary point.



o A Hessian is negative definite if and only if \mathbf{x}^T H \mathbf{x} < 0 for any non-zero vector \mathbf{x}. If H(\mathbf{x}^*) is negative definite, then \mathbf{x}^* is a maximum stationary point.

o A Hessian is indefinite if \mathbf{x}^T H \mathbf{x} is positive for some values of \mathbf{x} and negative for others. A stationary point with indefinite Hessian is a saddle point, which can have a minimum value in one direction, and a maximum value in another.
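For a symmetric Hessian, these sign conditions on \mathbf{x}^T H \mathbf{x} are equivalent to conditions on the eigenvalues of H (all positive, all negative, or mixed signs). A small sketch with NumPy; the example matrices are made up:

```python
import numpy as np

def classify_stationary(H):
    """Classify a stationary point from the eigenvalues of its (symmetric) Hessian."""
    eig = np.linalg.eigvalsh(H)
    if np.all(eig > 0):
        return "minimum"          # positive definite
    if np.all(eig < 0):
        return "maximum"          # negative definite
    if np.any(eig > 0) and np.any(eig < 0):
        return "saddle point"     # indefinite
    return "inconclusive"         # some zero eigenvalues

print(classify_stationary(np.array([[2.0, 0.0], [0.0, 3.0]])))    # minimum
print(classify_stationary(np.array([[-2.0, 0.0], [0.0, -1.0]])))  # maximum
print(classify_stationary(np.array([[2.0, 0.0], [0.0, -3.0]])))   # saddle point
```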

Example E.1. Suppose f(x, y) = 3x^2 + 2y^3 - 2xy. Figure E.2 shows a plot of this function. The conditions for finding the stationary points of this function are

\frac{\partial f}{\partial x} = 6x - 2y = 0
\frac{\partial f}{\partial y} = 6y^2 - 2x = 0,   (E.3)

whose solutions are x^* = y^* = 0 or x^* = 1/27, y^* = 1/9.

Figure E.2. Plot for the function f(x, y) = 3x^2 + 2y^3 - 2xy.

The Hessian of f is

H(x, y) = \begin{pmatrix} 6 & -2 \\ -2 & 12y \end{pmatrix}.

At x = y = 0,

H(0, 0) = \begin{pmatrix} 6 & -2 \\ -2 & 0 \end{pmatrix}.

Since [x\ y]\, H(0, 0)\, [x\ y]^T = 6x^2 - 4xy = 2x(3x - 2y), which can be either positive or negative, the Hessian is indefinite and (0, 0) is a saddle point.

At x = 1/27, y = 1/9,

H(1/27, 1/9) = \begin{pmatrix} 6 & -2 \\ -2 & 4/3 \end{pmatrix}.

Since [x\ y]\, H(1/27, 1/9)\, [x\ y]^T = 6x^2 - 4xy + 4y^2/3 = 6(x - y/3)^2 + 2y^2/3 > 0 for non-zero x and y, the Hessian is positive definite. Therefore, (1/27, 1/9) is a minimum stationary point. The minimum value of f is -0.0014.
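The calculations in Example E.1 can be verified numerically; a sketch with NumPy:

```python
import numpy as np

f = lambda x, y: 3 * x ** 2 + 2 * y ** 3 - 2 * x * y
grad = lambda x, y: np.array([6 * x - 2 * y, 6 * y ** 2 - 2 * x])
hess = lambda x, y: np.array([[6.0, -2.0], [-2.0, 12 * y]])

x_s, y_s = 1 / 27, 1 / 9
print(grad(x_s, y_s))                      # [0, 0] -> stationary point
print(np.linalg.eigvalsh(hess(x_s, y_s)))  # both positive -> minimum
print(f(x_s, y_s))                         # -1/729, about -0.0014
```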

E.1.1 Numerical Methods

The preceding approach works if Equation E.1 can be solved analytically for \mathbf{x}^*. In many cases, finding analytical solutions is a very difficult problem, thus necessitating the use of numerical methods to find approximate solutions. Some of the numerical methods for finding the minimum value of a function include golden search, Newton's method, and gradient descent search. While the techniques presented here are used to minimize the objective function f(\mathbf{x}), they are also applicable to maximization problems because a maximization problem can be easily turned into a minimization problem by converting the function f(\mathbf{x}) to -f(\mathbf{x}).

Golden Search  Consider the unimodal function illustrated in Figure E.3, where the minimum value is bracketed between a and b. The golden search method iteratively finds successively smaller brackets that contain the minimum value until the interval width is small enough to approximate the stationary point. To determine the smaller brackets, two additional points, c and d, are chosen so that the intervals (a, c, d) and (c, d, b) have equal width. Let c - a = b - d = \alpha(b - a) and d - c = \beta(b - a). Therefore,

1 = \frac{(b - d) + (d - c) + (c - a)}{b - a} = \alpha + \beta + \alpha,


Figure E.3. Example of a unimodal function.

or equivalently,

\beta = 1 - 2\alpha.   (E.4)

The widths are also chosen to obey the following condition so that a recursive procedure can be applied:

\frac{d - c}{b - c} = \frac{c - a}{b - a},   (E.5)

or equivalently,

\frac{\beta}{1 - \alpha} = \alpha.

Together, Equations E.4 and E.5 can be solved to yield \alpha = 0.382 and \beta = 0.236. By comparing f(c) with f(d), it is possible to detect whether the minimum value occurs in the interval (a, c, d) or (c, d, b). The interval that contains the minimum value is then recursively partitioned until the interval width is small enough to approximate the minimum value, as shown in Algorithm E.1. The golden search method makes no assumption about the function, other than that it must be continuous and unimodal within the initial bracket [a, b]. It converges linearly to the solution for the minimum value.
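Algorithm E.1 itself is not reproduced in this excerpt; the following Python sketch implements the bracketing procedure described above, using \alpha = (3 - \sqrt{5})/2 \approx 0.382 (the test function and tolerance are made up):

```python
def golden_search(f, a, b, tol=1e-8):
    """Golden-section search for the minimum of a unimodal f on [a, b]."""
    alpha = (3 - 5 ** 0.5) / 2       # ~0.382, from Equations E.4 and E.5
    c = a + alpha * (b - a)
    d = b - alpha * (b - a)
    while b - a > tol:
        if f(c) < f(d):              # minimum lies in (a, c, d)
            b, d = d, c              # old c becomes the new d
            c = a + alpha * (b - a)
        else:                        # minimum lies in (c, d, b)
            a, c = c, d              # old d becomes the new c
            d = b - alpha * (b - a)
    return (a + b) / 2

x_min = golden_search(lambda x: (x - 2.0) ** 2 + 1.0, 0.0, 5.0)
print(x_min)  # ~2.0
```

The choice \alpha = (3 - \sqrt{5})/2 is what makes one of the old interior points reusable in the next, smaller bracket, so each iteration needs only one new interval endpoint.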

Newton's Method  Newton's method is based on using a quadratic approximation to the function f(x). By using a Taylor series expansion of f around
