5.5 Support Vector Machine (SVM) 263
Because the Lagrange multipliers are unknown, we still cannot solve for w and b. If Definition 5.1 contains only equality instead of inequality constraints, then we can use the ly’ equations from equality constraints along with Equations 5.39 and 5.40 to find the feasible solutions for w, b, and ).;. Note that the Lagrange multipliers for equality constraints are free parameters that can take any values.
One way to handle the inequality constraints is to transform them into a set of equality constraints. This is possible as long as the Lagrange multipliers are restricted to be non-negative. Such transformation leads to the following constraints on the Lagrange multipliers, which are known as the Karush-Kuhn- T\rcker (KKT) conditions:
) ; ) 0 ,
\u[sn(* ‘xt * b) – 1] : O.
(5.41)
(5.42)
At first glance, it may seem that there are as many Lagrange multipli- ers as there are training instances. It turns out that many of the Lagrange multipliers become zero after applying the constraint given in trquation 5.42. The constraint states that the Lagrange multiplier Aa must be zero unless the training instance x6 satisfies the equation At(w .xt I b) : t. Such training instance, with )i ) 0, lies along the hyperplanes bi1 or b.i2 and is known as a support vector. taining instances that do not reside along these hyperplanes have )6:0. Equations 5.39 and5.42 also suggest that the parameters w and b, which define the decision boundary, depend only on the support vectors.
Solving the preceding optimization problem is still quite a daunting task because it involves a large number of parameters: w, b, and )a. The problem can be simplified by transforming the Lagrangian into a function of the La- grange multipliers only (this is known as the dual problem). To do this, we first substitute Equations 5.39 and 5.40 into Equation 5.38. This will lead to the following dual formulation of the optimization problem:
,A|I
Ln:1 ,^ ‘ – * Ls ; \a tu ix i . x j .- z-/ 2 z-.J i:1. i,i
(5.43)
The key differences between the dual and primary Lagrangians are as fol- lows:
1. The dual Lagrangian involves only the Lagrange multipliers and the training data, while the primary Lagrangian involves the Lagrange mul- tipliers as well as parameters of the decision boundary. Nevertheless, the solutions for both optimization problems are equivalent.
264 Chapter 5 Classification: Alternative Techniques
2. The quadratic term in Equation 5.43 has a negative sign, which means that the original minimization problem involving the primary Lagrangian, Lp, has turned into a maximization problem involving the dual La- grangian, ,L2.
For large data sets, the dual optimization problem can be solved using numerical techniques such as quadratic programming, a topic that is beyond the scope of this book. Once the );’s are found, we can use Equations 5.39 and 5.42 to obtain the feasible solutions for w and b. The decision boundary can be expressed as follows:
/ N \ ( f \ yn* t ‘ * )+b :0 . \ – /
(5.44)
b is obtained by solving Equation 5.42 for the support vectors. Because the )1’s are calculated numerically and can have numerical errors, the value computed for b may not be unique. Instead it depends on the support vector used in Equation 5.42. In practice, the average value for b is chosen to be the parameter of the decision boundary.
Example 5.5. Consider the two-dimensional data set shown in Figure 5.24, which contains eight training instances. Using quadratic programming, we can solve the optimization problem stated in Equation 5.43 to obtain the Lagrange multiplier .\6 for each training instance. The Lagrange multipliers are depicted in the last column of the table. Notice that only the fi.rst two instances have non-zero Lagrange multipliers. These instances correspond to the support vectors for this data set.
Let w : (1q,w2) and b denote the parameters of the decision boundary. Using Equation 5.39, we can solve for w1 and w2 in the following way:
wr : D\nroror :65.562L x 1x 0.3858+65.5621 x -1 x 0 ‘487r : -6 ’64. z
\ – ,w2 : ) \ ;Apn:65.562L x 1x 0.4687 +65.5621x -1 x0.611 : -9 ’32. 2
The bias term b can be computed using Equation 5.42 for each support vector:
6 ( t ) : 1 – w . x 1 : 1 – ( – 6 . 6 4 ) ( 0 . 3 8 5 3 ) – ( – 9 . 3 2 ) ( 0 . 4 6 8 7 ) : 7 . 9 3 0 0 .
6(z ) : -1 – w .x2 : -1 – ( -6 .64) (0 .4871) – ( -9 .32) (0 .611) : 7 .9289.
Averaging these values, we obtain b : 7.93. The decision boundary corre- sponding to these parameters is shown in Figure 5.24. r
X1 X2 v Lagrange Multiplier
0.3858 0.4871 0.9218 o.7382 0.1763 o.4057 0.9355 0.2146
0.4687 0 .611 0.4103 0.8936 0.0579 0.3529 0.8132 0.0099
1 -1 -1 -1
1 1
-1 1
65.5261 65.5261
0 0 0 0 0 0
5.5 Support Vector Machine (SVM) 265
1
0.9
0.8
0.7
0.6
s 0.5
0.4
0.3
o.2
0 .1
0
-6.64 x1 – 9.32 x2 + 7.93 = 0 tr tr
tr
o o
n
o 0 0.2 o.4 0.6 0.8 1
X1
Figure 5.24. Example of a linearly separable data set.
Once the parameters of the decision boundary are found, a test instance z is classified as follows:
/ N \
f ( r ) : s ien(w ‘z+b) : s ‘ i sn( f ^uru” t z+b) . \ d : l
It f (z): 1, then the test instance is classified as a positive class; otherwise, it is classified as a negative class.
266 Chapter 5 Classification: Alternative Techniques
5.5.3 Linear SVM: Nonseparable Case
Figure 5.25 shows a data set that is similar to Figure 5.22, except it has two new examples, P and Q. Although the decision boundary 81 misclassifies the new examples, while 82 classifies them correctly, this does not mean that .B2 is a better decision boundary than 81 because the new examples may correspond to noise in the training data. .B1 should still be preferred over 82 because it has a wider margin, and thus, is less susceptible to overfitting. However, the SVM formulation presented in the previous section constructs only decision boundaries that are mistake-free. This section examines how the formulation can be modified to learn a decision boundary that is tolerable to small training errors using a method known as the soft margin approach. More importantly, the method presented in this section allows SVM to construct a linear decision boundary even in situations where the classes are not linearly separable. To do this, the learning algorithm in SVM must consider the trade-off between the width of the margin and the number of training errors committed by the Iinear decision boundary.
f; ;;;;i” ro,, i,–‘o’,
Figure 5.25, Decision boundary of SVM for the nonseparable case.
margin tor e2’l
r r al ‘ ‘ , r r l ‘ \ , ,
l l f ‘ .
n
o\,,o o
oo o o
o
! t .
[‘?
OO
OO
o o
o
5.5 Support Vector Machine (SVM) 267
N X
-0.5 0 0.5 1 1.5 x1
Figure 5,26. Slack variables for nonseparable data.
While the original objective function given in Equation 5.37 is still appli- cable, the decision boundary .B1 no longer satisfies all the constraints given in Equation 5.36. The inequality constraints must therefore be relaxed to ac- commodate the’nonlinearly separable data. This can be done by introducing positive-valued slack variables ({) into the constraints of the optimization problem, as shown in the following equations:
v / ‘ x i +b>7 -€n
w .x i+b< -1+€ , (5.45)
where Vz : {6 > 0. To interpret the meaning of the slack variables {a, consider the diagram
shown in Figure 5.26. The circle P is one of the instances that violates the constraints given in Equation 5.35. Let w.x* b: -7 *{ denote a line that is parallel to the decision boundary and passes through the point P. It can be shown that the distance between this line and the hyperplane w’x * b: -L
is {/llwll. Thus, { provides an estimate of the error of the decision boundary on the training example P.
In principle, we can apply the same objective function as before and impose the conditions given in Equation 5.45 to find the decision boundary. However,
1 . 2
0.8
i f y 6 – I ,
i f y a : – 1 ,
w . x + b = 0
t r t r t r tr
t1 LI
tr
. w . x + b = – 1 + (
\
268 Chapter 5 Classification: Alternative Techniques
Figure 5.27. A decision boundary that has a wide margin but large training error.
since there are no constraints on the number of mistakes the decision boundary can make, the learning algorithm may find a decision boundary with a very wide margin but misclassifies many of the training examples, as shown in Figure 5.27. To avoid this problem, the objective function must be modified to penalize a decision boundary with large values of slack variables. The modified objective function is given by the following equation:
. l lw l l 2 N
, f (*) :#* c(D,to)*, i,:L
where C and k are user-specified parameters representing the penalty of mis- classifying the training instances. For the remainder of this section, we assume k : I to simplify the problem. The parameter C can be chosen based on the model’s performance on the validation set.
It follows that the Lagrangian for this constrained optimization problem can be written as follows:
. l /
Lp: | l lw l l2+c tF , O r l ” l l ‘ *
L \ 1
i :7
N ,^/ S – . . \ r . – ) S — LAt |a t (w.x i +b) – 1+€r i –
\u&, (5 .46) i : I i : l
where the first two terms are the objective function to be minimized, the third term represents the inequality constraints associated with the slack variables,
og \ o
o
o o
T \ \ \
l l I
\,I I
r \- ‘ . r
l l .
T I
, l l
a\,
t.,, I
I
I
I
P I
D . O Support Vector Machine (SVM) 269
and the last term is the result of the non-negativity requirements on the val-
ues of {,;’s. F\rrthermore, the inequality constraints can be transformed into
equality constraints using the following KKT conditions:
€r )0 , ) r20 , h )0 , l ; {sr(* . x; * b) – 1 +€,} : 0,
l t t€ t :0-
-^r s – .- ) A i U & i i : u 1 u j /-t
i :7
(5.47)
(5.48)
(5.4e)
Note that the Lagrange multiplier ,\; given in Equation 5.48 is non-vanishing
only if the training instance resides along the lines w’x, * b : il or has
€r > O. On the other hand, the Lagrange multipliers;.ri given in Equation 5.49
arezero for any training instances that are misclassified (i.e., having {z > 0).
Setting the first-order derivative of ,L with respect to w, b, and {; to zero
would result in the following equations: