which turns out to be identical to the dual Lagrangian for linearly separable
data (see Equation 5.40 on page 262). Nevertheless, the constraints imposed
on the Lagrange multipliers $\lambda_i$'s are slightly different from those in the linearly separable case. In the linearly separable case, the Lagrange multipliers must be non-negative, i.e., $\lambda_i \ge 0$. On the other hand, Equation 5.52 suggests that $\lambda_i$ should not exceed $C$ (since both $\mu_i$ and $\lambda_i$ are non-negative). Therefore, the Lagrange multipliers for nonlinearly separable data are restricted to $0 \le \lambda_i \le C$. The dual problem can then be solved numerically using quadratic programming techniques to obtain the Lagrange multipliers $\lambda_i$. These multipliers can then be substituted into Equation 5.50 and the KKT conditions to obtain the parameters of the decision boundary.
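To make the quadratic programming step concrete, the following is a minimal sketch (not the book's code) of solving the soft-margin dual with a general-purpose optimizer. It assumes a small numeric data set `X` with labels `y` in {-1, +1}, uses SciPy's SLSQP routine in place of a dedicated QP package, and recovers the weight vector and intercept in the manner described above; the function name and tolerances are illustrative.

```python
# Sketch: soft-margin dual SVM via a generic constrained optimizer (illustrative only).
import numpy as np
from scipy.optimize import minimize

def fit_soft_margin_dual(X, y, C=1.0):
    n = len(y)
    K = X @ X.T                                   # pairwise dot products (linear kernel)
    Q = (y[:, None] * y[None, :]) * K

    # Minimize the negated dual objective: -sum(lambda) + 0.5 * lambda^T Q lambda
    obj = lambda lam: -lam.sum() + 0.5 * lam @ Q @ lam
    grad = lambda lam: -np.ones(n) + Q @ lam

    cons = [{"type": "eq", "fun": lambda lam: lam @ y}]   # sum_i lambda_i * y_i = 0
    bounds = [(0.0, C)] * n                               # 0 <= lambda_i <= C
    lam = minimize(obj, np.zeros(n), jac=grad, bounds=bounds,
                   constraints=cons, method="SLSQP").x

    w = (lam * y) @ X                             # weight vector recovered from the multipliers
    on_margin = (lam > 1e-6) & (lam < C - 1e-6)   # support vectors exactly on the margin
    b = np.mean(y[on_margin] - X[on_margin] @ w)  # intercept from the KKT conditions
    return w, b, lam
```

For more than a few hundred training instances a specialized solver (such as SMO) would normally replace the generic optimizer, but the objective and the box constraints on the multipliers are the same.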
5.5.4 Nonlinear SVM
The SVM formulations described in the previous sections construct a linear decision boundary to separate the training examples into their respective classes. This section presents a methodology for applying SVM to data sets that have nonlinear decision boundaries. The trick here is to transform the data from its original coordinate space in $\mathbf{x}$ into a new space $\Phi(\mathbf{x})$ so that a linear decision boundary can be used to separate the instances in the transformed space. After doing the transformation, we can apply the methodology presented in the previous sections to find a linear decision boundary in the transformed space.
Attribute Transformation
To illustrate how attribute transformation can lead to a linear decision boundary, Figure 5.28(a) shows an example of a two-dimensional data set consisting of squares (classified as $y = 1$) and circles (classified as $y = -1$). The data set is generated in such a way that all the circles are clustered near the center of the diagram and all the squares are distributed farther away from the center. Instances of the data set can be classified using the following equation:
$$y(x_1, x_2) = \begin{cases} 1 & \text{if } \sqrt{(x_1 - 0.5)^2 + (x_2 - 0.5)^2} > 0.2, \\ -1 & \text{otherwise.} \end{cases} \tag{5.54}$$
The decision boundary for the data can therefore be written as follows:
$$\sqrt{(x_1 - 0.5)^2 + (x_2 - 0.5)^2} = 0.2,$$
which can be further simplified into the following quadratic equation:
$$x_1^2 - x_1 + x_2^2 - x_2 = -0.46.$$
Figure 5.28. Classifying data with a nonlinear decision boundary: (a) decision boundary in the original two-dimensional space; (b) decision boundary in the transformed space.
A nonlinear transformation $\Phi$ is needed to map the data from its original feature space into a new space where the decision boundary becomes linear. Suppose we choose the following transformation:
$$\Phi: (x_1, x_2) \rightarrow (x_1^2,\; x_2^2,\; \sqrt{2}x_1x_2,\; \sqrt{2}x_1,\; \sqrt{2}x_2,\; 1). \tag{5.55}$$
In the transformed space, we can find the parameters $\mathbf{w} = (w_0, w_1, \ldots, w_5)$ such that:
$$w_5 x_1^2 + w_4 x_2^2 + w_3 \sqrt{2}x_1x_2 + w_2 \sqrt{2}x_1 + w_1 \sqrt{2}x_2 + w_0 = 0.$$
For illustration purposes, let us plot the graph of $x_2^2 - x_2$ versus $x_1^2 - x_1$ for the previously given instances. Figure 5.28(b) shows that in the transformed space, all the circles are located in the lower right-hand side of the diagram. A linear decision boundary can therefore be constructed to separate the instances into their respective classes.
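A few lines of code can make this concrete. The sketch below assumes (the text does not state this exactly) that the points are drawn uniformly from the unit square and labeled by Equation 5.54; after mapping each point to $(x_1^2 - x_1, x_2^2 - x_2)$, the linear rule $u + v > -0.46$ reproduces the labels.

```python
# Sketch: the attribute transformation of Figure 5.28 (the data generation is an assumption).
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 2))
y = np.where(np.sqrt((X[:, 0] - 0.5)**2 + (X[:, 1] - 0.5)**2) > 0.2, 1, -1)

# Map (x1, x2) -> (x1^2 - x1, x2^2 - x2); the quadratic boundary becomes a straight line.
Z = np.column_stack([X[:, 0]**2 - X[:, 0], X[:, 1]**2 - X[:, 1]])
pred = np.where(Z[:, 0] + Z[:, 1] > -0.46, 1, -1)
print((pred == y).mean())   # 1.0: the transformed data are linearly separable
```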
One potential problem with this approach is that it may suffer from the curse of dimensionality problem often associated with high-dimensional data. We will show how nonlinear SVM avoids this problem (using a method known as the kernel trick) later in this section.
Learning a Nonlinear SVM Model
Although the attribute transformation approach seems promising, it raises several implementation issues. First, it is not clear what type of mapping
function should be used to ensure that a linear decision boundary can be constructed in the transformed space. One possibility is to transform the data into an infinite dimensional space, but such a high-dimensional space may not be that easy to work with. Second, even if the appropriate mapping function is known, solving the constrained optimization problem in the high-dimensional feature space is a computationally expensive task.
To illustrate these issues and examine the ways they can be addressed, let us assume that there is a suitable function, $\Phi(\mathbf{x})$, to transform a given data set. After the transformation, we need to construct a linear decision boundary that will separate the instances into their respective classes. The linear decision boundary in the transformed space has the following form: $\mathbf{w} \cdot \Phi(\mathbf{x}) + b = 0$.
Definition 5.2 (Nonlinear SVM). The learning task for a nonlinear SVM can be formalized as the following optimization problem:
$$\min_{\mathbf{w}} \frac{\|\mathbf{w}\|^2}{2}$$
$$\text{subject to } y_i(\mathbf{w} \cdot \Phi(\mathbf{x}_i) + b) \ge 1, \quad i = 1, 2, \ldots, N.$$
Note the similarity between the learning task of a nonlinear SVM and that of a linear SVM (see Definition 5.1 on page 262). The main difference is that, instead of using the original attributes $\mathbf{x}$, the learning task is performed on the transformed attributes $\Phi(\mathbf{x})$. Following the approach taken in Sections 5.5.2 and 5.5.3 for linear SVM, we may derive the following dual Lagrangian for the constrained optimization problem:
$$L_D = \sum_{i=1}^{n} \lambda_i - \frac{1}{2}\sum_{i,j} \lambda_i \lambda_j y_i y_j \, \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j) \tag{5.56}$$
Once the $\lambda_i$'s are found using quadratic programming techniques, the parameters $\mathbf{w}$ and $b$ can be derived using the following equations:
$$\mathbf{w} = \sum_{i} \lambda_i y_i \Phi(\mathbf{x}_i) \tag{5.57}$$
$$\lambda_i \Big\{ y_i \Big( \sum_{j} \lambda_j y_j \, \Phi(\mathbf{x}_j) \cdot \Phi(\mathbf{x}_i) + b \Big) - 1 \Big\} = 0, \tag{5.58}$$
which are analogous to Equations 5.39 and 5.40 for linear SVM. Finally, a test instance $\mathbf{z}$ can be classified using the following equation:
\begin{align}
f(\mathbf{z}) &= \text{sign}(\mathbf{w} \cdot \Phi(\mathbf{z}) + b) \tag{5.59}\\
&= \text{sign}\Big( \sum_{i=1}^{n} \lambda_i y_i \, \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{z}) + b \Big). \tag{5.60}
\end{align}
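As a compact illustration of how Equations 5.57 to 5.59 fit together, the sketch below (an assumption of this text, not the book's code) takes the multipliers and the explicitly transformed training instances as given and classifies one transformed test instance; the function name is hypothetical.

```python
# Sketch: applying Equations 5.57-5.59 once the multipliers lam are known (illustrative).
import numpy as np

def decision_rule(lam, y, Phi_X, Phi_z, tol=1e-6):
    """Phi_X: transformed training instances (n x D); Phi_z: one transformed test instance."""
    w = (lam * y) @ Phi_X                       # Equation 5.57
    sv = lam > tol                              # support vectors (nonzero multipliers)
    b = np.mean(y[sv] - Phi_X[sv] @ w)          # b from the KKT condition, Equation 5.58
    return np.sign(w @ Phi_z + b)               # Equation 5.59
```

The catch, discussed next, is that $\Phi(\mathbf{x})$ may be extremely high-dimensional, so forming the transformed instances explicitly can be impractical.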
Except for Equation 5.57, note that the rest of the computations (Equations 5.58 and 5.59) involve calculating the dot product (i.e., similarity) between pairs of vectors in the transformed space, $\Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j)$. Such computation can be quite cumbersome and may suffer from the curse of dimensionality problem. A breakthrough solution to this problem comes in the form of a method known as the kernel trick.
Kernel Trick
The dot product is often regarded as a measure of similarity between two input vectors. For example, the cosine similarity described in Section 2.4.5 on page 73 can be defined as the dot product between two vectors that are normalized to unit length. Analogously, the dot product $\Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j)$ can also be regarded as a measure of similarity between two instances, $\mathbf{x}_i$ and $\mathbf{x}_j$, in the transformed space.

The kernel trick is a method for computing similarity in the transformed space using the original attribute set. Consider the mapping function $\Phi$ given in Equation 5.55. The dot product between two input vectors $\mathbf{u}$ and $\mathbf{v}$ in the transformed space can be written as follows:
\begin{align}
\Phi(\mathbf{u}) \cdot \Phi(\mathbf{v}) &= (u_1^2, u_2^2, \sqrt{2}u_1u_2, \sqrt{2}u_1, \sqrt{2}u_2, 1) \cdot (v_1^2, v_2^2, \sqrt{2}v_1v_2, \sqrt{2}v_1, \sqrt{2}v_2, 1) \\
&= u_1^2 v_1^2 + u_2^2 v_2^2 + 2u_1u_2v_1v_2 + 2u_1v_1 + 2u_2v_2 + 1 \\
&= (\mathbf{u} \cdot \mathbf{v} + 1)^2.
\end{align}
This analysis shows that the dot product in the transformed space can be
expressed in terms of a similarity function in the original space:
$$K(\mathbf{u}, \mathbf{v}) = \Phi(\mathbf{u}) \cdot \Phi(\mathbf{v}) = (\mathbf{u} \cdot \mathbf{v} + 1)^2 \tag{5.61}$$
The similarity function, $K$, which is computed in the original attribute space, is known as the kernel function. The kernel trick helps to address some of the concerns about how to implement nonlinear SVM. First, we do not have to know the exact form of the mapping function $\Phi$ because the kernel
functions used in nonlinear SVM must satisfy a mathematical principle known as Mercer's theorem. This principle ensures that the kernel functions can always be expressed as the dot product between two input vectors in some high-dimensional space. The transformed space of the SVM kernels is called a reproducing kernel Hilbert space (RKHS). Second, computing the dot products using kernel functions is considerably cheaper than using the transformed attribute set $\Phi(\mathbf{x})$. Third, since the computations are performed in the original space, issues associated with the curse of dimensionality problem can be avoided.
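To see the kernel trick in action numerically, the short check below (a sketch, not part of the book) compares the explicit dot product under the transformation of Equation 5.55 with the kernel value of Equation 5.61 for an arbitrary pair of points.

```python
# Sketch: the polynomial kernel (u.v + 1)^2 equals a dot product in the transformed space.
import numpy as np

def phi(x):
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2)*x1*x2,
                     np.sqrt(2)*x1, np.sqrt(2)*x2, 1.0])    # the map of Equation 5.55

def poly_kernel(u, v):
    return (np.dot(u, v) + 1.0)**2                          # Equation 5.61

u, v = np.array([0.3, 0.7]), np.array([0.9, 0.1])
print(np.isclose(np.dot(phi(u), phi(v)), poly_kernel(u, v)))   # True
```

The kernel evaluation touches only the two original attributes, whereas the explicit map works in six dimensions; the gap grows quickly for higher-degree polynomials and higher-dimensional inputs.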
Figure 5.29 shows the nonlinear decision boundary obtained by SVM using the polynomial kernel function given in Equation 5.61. A test instance $\mathbf{z}$ is classified according to the following equation:
\begin{align}
f(\mathbf{z}) &= \text{sign}\Big( \sum_{i=1}^{n} \lambda_i y_i \, \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{z}) + b \Big) \\
&= \text{sign}\Big( \sum_{i=1}^{n} \lambda_i y_i \, K(\mathbf{x}_i, \mathbf{z}) + b \Big) \\
&= \text{sign}\Big( \sum_{i=1}^{n} \lambda_i y_i (\mathbf{x}_i \cdot \mathbf{z} + 1)^2 + b \Big), \tag{5.62}
\end{align}
where b is the parameter obtained using Equation 5.58. The decision boundary obtained by nonlinear SVM is quite close to the true decision boundary shown in Figure 5.28(a).
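For readers who want to reproduce a boundary like the one in Figure 5.29, the following sketch uses scikit-learn's SVC purely as a stand-in for the quadratic programming step; the library, its parameter names, and the specific settings (degree=2, gamma=1, coef0=1, which yield the kernel of Equation 5.61) are assumptions of this example, not part of the text. It also re-evaluates Equation 5.62 by hand from the fitted dual coefficients as a consistency check.

```python
# Sketch: nonlinear SVM with the (u.v + 1)^2 kernel via scikit-learn (illustrative settings).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 2))
y = np.where(np.sqrt((X[:, 0] - 0.5)**2 + (X[:, 1] - 0.5)**2) > 0.2, 1, -1)

clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=10.0).fit(X, y)

# Equation 5.62 by hand: dual_coef_ stores lambda_i * y_i for each support vector.
z = np.array([[0.5, 0.5]])                                   # a point near the center
score = clf.dual_coef_ @ (clf.support_vectors_ @ z.T + 1.0)**2 + clf.intercept_
print(np.sign(score), clf.predict(z))                        # both should indicate class -1
```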
Mercer's Theorem
The main requirement for the kernel function used in nonlinear SVM is that there must exist a corresponding transformation such that the kernel function computed for a pair of vectors is equivalent to the dot product between the vectors in the transformed space. This requirement can be formally stated in the form of Mercer’s theorem.
Theorem 5.1 (Mercer's Theorem). A kernel function $K$ can be expressed as
$$K(\mathbf{u}, \mathbf{v}) = \Phi(\mathbf{u}) \cdot \Phi(\mathbf{v})$$
if and only if, for any function $g(\mathbf{x})$ such that $\int g(\mathbf{x})^2 \, d\mathbf{x}$ is finite, then
$$\int K(\mathbf{x}, \mathbf{y}) \, g(\mathbf{x}) \, g(\mathbf{y}) \, d\mathbf{x} \, d\mathbf{y} \ge 0.$$
Figure 5.29. Decision boundary produced by a nonlinear SVM with polynomial kernel.
Kernel functions that satisfy Theorem 5.1 are called positive definite kernel functions. Examples of such functions are listed below:
$$K(\mathbf{x}, \mathbf{y}) = (\mathbf{x} \cdot \mathbf{y} + 1)^p$$
$$K(\mathbf{x}, \mathbf{y}) = e^{-\|\mathbf{x} - \mathbf{y}\|^2 / (2\sigma^2)}$$
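A quick numerical illustration of the positive definiteness requirement (a sketch and sanity check, not a proof): on any finite sample, the Gram matrix produced by the second kernel above should have no negative eigenvalues, which is the practical consequence of Mercer's theorem.

```python
# Sketch: checking that an RBF Gram matrix is positive semi-definite on a random sample.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
sigma = 1.0

sq_dists = ((X[:, None, :] - X[None, :, :])**2).sum(axis=-1)
K = np.exp(-sq_dists / (2 * sigma**2))          # K(x, y) = exp(-||x - y||^2 / (2 sigma^2))

eigvals = np.linalg.eigvalsh(K)                 # symmetric eigenvalue solver
print(eigvals.min() >= -1e-10)                  # True, up to numerical round-off
```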