Design Issues in ANN Learning

Before we train a neural network to learn a classification task, the following design issues must be considered.

1. The number of nodes in the input layer should be determined. Assign an input node to each numerical or binary input variable. If the input variable is categorical, we could either create one node for each categorical value or encode the k-ary variable using ⌈log₂ k⌉ input nodes.

2. The number of nodes in the output layer should be established. For a two-class problem, a single output node is sufficient; for a k-class problem, k output nodes are used.

3. The network topology (e.g., the number of hidden layers and hidden nodes, and feed-forward or recurrent network architecture) must be selected. Note that the target function representation depends on the weights of the links, the number of hidden nodes and hidden layers, biases in the nodes, and type of activation function. Finding the right topology is not an easy task. One way to do this is to start from a fully connected network with a sufficiently large number of nodes and hidden layers, and then repeat the model-building procedure with a smaller number of nodes. This approach can be very time consuming. Alternatively, instead of repeating the model-building procedure, we could remove some of the nodes and repeat the model evaluation procedure to select the right model complexity.

4. The weights and biases need to be initialized. Random assignments are usually acceptable.

5. Training examples with missing values should be removed or replaced with most likely values.
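The two input-encoding options in step 1 can be sketched as follows; the function names and the example attribute are illustrative, not from the text:

```python
import math

def one_hot(value, categories):
    # One input node per categorical value.
    return [1 if value == c else 0 for c in categories]

def binary_encode(value, categories):
    # Encode a k-ary variable using ceil(log2 k) input nodes.
    k = len(categories)
    width = max(1, math.ceil(math.log2(k)))
    index = categories.index(value)
    return [(index >> bit) & 1 for bit in reversed(range(width))]

colors = ["red", "green", "blue"]      # a k-ary attribute with k = 3
print(one_hot("green", colors))        # 3 input nodes: [0, 1, 0]
print(binary_encode("green", colors))  # ceil(log2 3) = 2 nodes: [0, 1]
```

The binary encoding uses fewer nodes, but it forces the network to learn that codes sharing bits do not necessarily correspond to similar categories.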

5.4.3 Characteristics of ANN

Following is a summary of the general characteristics of an artificial neural network:

1. Multilayer neural networks with at least one hidden layer are universal approximators; i.e., they can be used to approximate any target functions. Since an ANN has a very expressive hypothesis space, it is important to choose the appropriate network topology for a given problem to avoid model overfitting.

256 Chapter 5 Classification: Alternative Techniques

2. ANN can handle redundant features because the weights are automatically learned during the training step. The weights for redundant features tend to be very small.

3. Neural networks are quite sensitive to the presence of noise in the training data. One approach to handling noise is to use a validation set to determine the generalization error of the model. Another approach is to decrease the weight by some factor at each iteration.

4. The gradient descent method used for learning the weights of an ANN often converges to some local minimum. One way to escape from the local minimum is to add a momentum term to the weight update formula.

5. Training an ANN is a time consuming process, especially when the number of hidden nodes is large. Nevertheless, test examples can be classified rapidly.
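The momentum trick in item 4 (and the weight-decay idea in item 3) can be sketched in a single update rule; the learning rate, momentum, and decay values below are illustrative defaults, not prescribed by the text:

```python
def update_weights(w, gradient, velocity, lr=0.1, momentum=0.9, decay=0.0):
    # Momentum: each step reuses a fraction of the previous step,
    # which helps the search roll out of shallow local minima.
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, gradient)]
    # Decay: shrink the weights by some factor at each iteration.
    new_w = [(1.0 - decay) * wi + v for wi, v in zip(w, new_velocity)]
    return new_w, new_velocity

w, vel = [0.5, -0.2], [0.0, 0.0]
grad = [0.1, -0.3]
w, vel = update_weights(w, grad, vel)
print(w)  # the next call will reuse vel through the momentum term
```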

5.5 Support Vector Machine (SVM)

A classification technique that has received considerable attention is support vector machine (SVM). This technique has its roots in statistical learning theory and has shown promising empirical results in many practical applications, from handwritten digit recognition to text categorization. SVM also works very well with high-dimensional data and avoids the curse of dimensionality problem. Another unique aspect of this approach is that it represents the decision boundary using a subset of the training examples, known as the support vectors.

To illustrate the basic idea behind SVM, we first introduce the concept of a maximal margin hyperplane and explain the rationale of choosing such a hyperplane. We then describe how a linear SVM can be trained to explicitly look for this type of hyperplane in linearly separable data. We conclude by showing how the SVM methodology can be extended to non-linearly separable data.

5.5.1 Maximum Margin Hyperplanes

Figure 5.21 shows a plot of a data set containing examples that belong to two different classes, represented as squares and circles. The data set is also linearly separable; i.e., we can find a hyperplane such that all the squares reside on one side of the hyperplane and all the circles reside on the other side.

Figure 5.21. Possible decision boundaries for a linearly separable data set.

However, as shown in Figure 5.21, there are infinitely many such hyperplanes possible. Although their training errors are zero, there is no guarantee that the hyperplanes will perform equally well on previously unseen examples. The classifier must choose one of these hyperplanes to represent its decision boundary, based on how well they are expected to perform on test examples.

To get a clearer picture of how the different choices of hyperplanes affect the generalization errors, consider the two decision boundaries, $B_1$ and $B_2$, shown in Figure 5.22. Both decision boundaries can separate the training examples into their respective classes without committing any misclassification errors. Each decision boundary $B_i$ is associated with a pair of hyperplanes, denoted as $b_{i1}$ and $b_{i2}$, respectively. $b_{i1}$ is obtained by moving a parallel hyperplane away from the decision boundary until it touches the closest square(s), whereas $b_{i2}$ is obtained by moving the hyperplane until it touches the closest circle(s). The distance between these two hyperplanes is known as the margin of the classifier. From the diagram shown in Figure 5.22, notice that the margin for $B_1$ is considerably larger than that for $B_2$. In this example, $B_1$ turns out to be the maximum margin hyperplane of the training instances.

Rationale for Maximum Margin

Decision boundaries with large margins tend to have better generalization errors than those with small margins. Intuitively, if the margin is small, then any slight perturbations to the decision boundary can have quite a significant impact on its classification. Classifiers that produce decision boundaries with small margins are therefore more susceptible to model overfitting and tend to generalize poorly on previously unseen examples.

Figure 5.22. Margin of a decision boundary.

A more formal explanation relating the margin of a linear classifier to its generalization error is given by a statistical learning principle known as structural risk minimization (SRM). This principle provides an upper bound to the generalization error of a classifier ($R$) in terms of its training error ($R_e$), the number of training examples ($N$), and the model complexity, otherwise known as its capacity ($h$). More specifically, with a probability of $1 - \eta$, the generalization error of the classifier can be at worst

$$R \le R_e + \varphi\left(\frac{h}{N}, \frac{\log(\eta)}{N}\right), \qquad (5.27)$$

where $\varphi$ is a monotone increasing function of the capacity $h$. The preceding inequality may seem quite familiar to the readers because it resembles the equation given in Section 4.4.4 (on page 179) for the minimum description length (MDL) principle. In this regard, SRM is another way to express generalization error as a tradeoff between training error and model complexity.



The capacity of a linear model is inversely related to its margin. Models with small margins have higher capacities because they are more flexible and can fit many training sets, unlike models with large margins. However, according to the SRM principle, as the capacity increases, the generalization error bound will also increase. Therefore, it is desirable to design linear classifiers that maximize the margins of their decision boundaries in order to ensure that their worst-case generalization errors are minimized. One such classifier is the linear SVM, which is explained in the next section.

5.5.2 Linear SVM: Separable Case

A linear SVM is a classifier that searches for a hyperplane with the largest margin, which is why it is often known as a maximal margin classifier. To understand how SVM learns such a boundary, we begin with some preliminary discussion about the decision boundary and margin of a linear classifier.

Linear Decision Boundary

Consider a binary classification problem consisting of $N$ training examples. Each example is denoted by a tuple $(\mathbf{x}_i, y_i)$ $(i = 1, 2, \ldots, N)$, where $\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{id})^T$ corresponds to the attribute set for the $i$th example. By convention, let $y_i \in \{-1, 1\}$ denote its class label. The decision boundary of a linear classifier can be written in the following form:

$$\mathbf{w} \cdot \mathbf{x} + b = 0, \qquad (5.28)$$

where $\mathbf{w}$ and $b$ are parameters of the model.

Figure 5.23 shows a two-dimensional training set consisting of squares and circles. A decision boundary that bisects the training examples into their respective classes is illustrated with a solid line. Any example located along the decision boundary must satisfy Equation 5.28. For example, if $\mathbf{x}_a$ and $\mathbf{x}_b$ are two points located on the decision boundary, then

$$\mathbf{w} \cdot \mathbf{x}_a + b = 0,$$
$$\mathbf{w} \cdot \mathbf{x}_b + b = 0.$$

Subtracting the two equations will yield the following:

$$\mathbf{w} \cdot (\mathbf{x}_b - \mathbf{x}_a) = 0,$$


Figure 5.23. Decision boundary and margin of SVM.

where $\mathbf{x}_b - \mathbf{x}_a$ is a vector parallel to the decision boundary and is directed from $\mathbf{x}_a$ to $\mathbf{x}_b$. Since the dot product is zero, the direction for $\mathbf{w}$ must be perpendicular to the decision boundary, as shown in Figure 5.23.
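A quick numeric check of this orthogonality, with made-up values for $\mathbf{w}$ and $b$:

```python
# Two points on the (hypothetical) boundary w . x + b = 0.
w, b = [2.0, 1.0], -4.0
xa = [1.0, 2.0]   # w . xa + b = 2*1 + 1*2 - 4 = 0
xb = [2.0, 0.0]   # w . xb + b = 2*2 + 1*0 - 4 = 0

# Direction vector lying in the boundary, from xa to xb.
direction = [xb[0] - xa[0], xb[1] - xa[1]]

dot = sum(wi * di for wi, di in zip(w, direction))
print(dot)  # 0.0, so w is perpendicular to the boundary
```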

For any square $\mathbf{x}_s$ located above the decision boundary, we can show that

$$\mathbf{w} \cdot \mathbf{x}_s + b = k, \qquad (5.29)$$

where $k > 0$. Similarly, for any circle $\mathbf{x}_c$ located below the decision boundary, we can show that

$$\mathbf{w} \cdot \mathbf{x}_c + b = k', \qquad (5.30)$$

where $k' < 0$. If we label all the squares as class $+1$ and all the circles as class $-1$, then we can predict the class label $y$ for any test example $\mathbf{z}$ in the following way:

$$y = \begin{cases} +1, & \text{if } \mathbf{w} \cdot \mathbf{z} + b > 0; \\ -1, & \text{if } \mathbf{w} \cdot \mathbf{z} + b < 0. \end{cases} \qquad (5.31)$$
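A sketch of the prediction rule in Equation 5.31; the weight vector and bias are arbitrary illustrative values:

```python
def predict(w, b, z):
    # Equation 5.31: the sign of w . z + b decides the class.
    score = sum(wi * zi for wi, zi in zip(w, z)) + b
    return 1 if score > 0 else -1

w, b = [1.0, -1.0], 0.5
print(predict(w, b, [2.0, 0.0]))  # score = 2.5 > 0, so class +1
print(predict(w, b, [0.0, 2.0]))  # score = -1.5 < 0, so class -1
```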

Margin of a Linear Classifier

Consider the square and the circle that are closest to the decision boundary. Since the square is located above the decision boundary, it must satisfy Equation 5.29 for some positive value $k$, whereas the circle must satisfy Equation 5.30 for some negative value $k'$. We can rescale the parameters $\mathbf{w}$ and $b$ of the decision boundary so that the two parallel hyperplanes $b_{i1}$ and $b_{i2}$ can be expressed as follows:

$$b_{i1}: \mathbf{w} \cdot \mathbf{x} + b = 1, \qquad (5.32)$$
$$b_{i2}: \mathbf{w} \cdot \mathbf{x} + b = -1. \qquad (5.33)$$

The margin of the decision boundary is given by the distance between these two hyperplanes. To compute the margin, let $\mathbf{x}_1$ be a data point located on $b_{i1}$ and $\mathbf{x}_2$ be a data point on $b_{i2}$, as shown in Figure 5.23. Upon substituting these points into Equations 5.32 and 5.33, the margin $d$ can be computed by subtracting the second equation from the first equation:

$$\mathbf{w} \cdot (\mathbf{x}_1 - \mathbf{x}_2) = 2,$$
$$\|\mathbf{w}\| \times d = 2,$$
$$d = \frac{2}{\|\mathbf{w}\|}. \qquad (5.34)$$
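Equation 5.34 reduces the margin to a simple function of $\mathbf{w}$; a small numeric example with an illustrative weight vector:

```python
import math

w = [3.0, 4.0]                               # illustrative rescaled weights
norm = math.sqrt(sum(wi * wi for wi in w))   # ||w|| = 5.0
d = 2.0 / norm                               # margin d = 2 / ||w||
print(d)  # 0.4
```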

Learning a Linear SVM Model

The training phase of SVM involves estimating the parameters w and b of the decision boundary from the training data. The parameters must be chosen in such a way that the following two conditions are met:

$$\mathbf{w} \cdot \mathbf{x}_i + b \ge 1 \quad \text{if } y_i = 1,$$
$$\mathbf{w} \cdot \mathbf{x}_i + b \le -1 \quad \text{if } y_i = -1. \qquad (5.35)$$

These conditions impose the requirements that all training instances from class $y = 1$ (i.e., the squares) must be located on or above the hyperplane $\mathbf{w} \cdot \mathbf{x} + b = 1$, while those instances from class $y = -1$ (i.e., the circles) must be located on or below the hyperplane $\mathbf{w} \cdot \mathbf{x} + b = -1$. Both inequalities can be summarized in a more compact form as follows:

$$y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1, \quad i = 1, 2, \ldots, N. \qquad (5.36)$$

Although the preceding conditions are also applicable to any linear classifiers (including perceptrons), SVM imposes an additional requirement that the margin of its decision boundary must be maximal. Maximizing the margin, however, is equivalent to minimizing the following objective function:

$$f(\mathbf{w}) = \frac{\|\mathbf{w}\|^2}{2}. \qquad (5.37)$$
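Checking the constraints of Equation 5.36 and evaluating the objective of Equation 5.37 for a candidate solution can be sketched directly; the two-point data set and the candidate $(\mathbf{w}, b)$ are invented for illustration:

```python
def feasible(w, b, X, y):
    # Equation 5.36: every example must satisfy y_i * (w . x_i + b) >= 1.
    return all(
        yi * (sum(wi * xi for wi, xi in zip(w, x)) + b) >= 1
        for x, yi in zip(X, y)
    )

def objective(w):
    # Equation 5.37: f(w) = ||w||^2 / 2, to be minimized.
    return sum(wi * wi for wi in w) / 2.0

X = [[2.0, 2.0], [0.0, 0.0]]  # one square, one circle
y = [1, -1]
w, b = [1.0, 1.0], -2.0       # separates the two points
print(feasible(w, b, X, y))   # True
print(objective(w))           # 1.0
```

An SVM solver searches over all feasible $(\mathbf{w}, b)$ for the one with the smallest objective value, i.e., the largest margin.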


Definition 5.1 (Linear SVM: Separable Case). The learning task in SVM can be formalized as the following constrained optimization problem:

$$\min_{\mathbf{w}} \frac{\|\mathbf{w}\|^2}{2}$$
$$\text{subject to } y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1, \quad i = 1, 2, \ldots, N.$$

Since the objective function is quadratic and the constraints are linear in the parameters $\mathbf{w}$ and $b$, this is known as a convex optimization problem, which can be solved using the standard Lagrange multiplier method. Following is a brief sketch of the main ideas for solving the optimization problem. A more detailed discussion is given in Appendix E.

First, we must rewrite the objective function in a form that takes into account the constraints imposed on its solutions. The new objective function is known as the Lagrangian for the optimization problem:

$$L_P = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{N} \lambda_i \left( y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \right), \qquad (5.38)$$

where the parameters $\lambda_i$ are called the Lagrange multipliers. The first term in the Lagrangian is the same as the original objective function, while the second term captures the inequality constraints. To understand why the objective function must be modified, consider the original objective function given in Equation 5.37. It is easy to show that the function is minimized when $\mathbf{w} = 0$, a null vector whose components are all zeros. Such a solution, however, violates the constraints given in Definition 5.1 because there is no feasible solution for $b$. The solutions for $\mathbf{w}$ and $b$ are infeasible if they violate the inequality constraints; i.e., if $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1 < 0$. The Lagrangian given in Equation 5.38 incorporates this constraint by subtracting the term from its original objective function. Assuming that $\lambda_i \ge 0$, it is clear that any infeasible solution may only increase the value of the Lagrangian.

To minimize the Lagrangian, we must take the derivative of $L_P$ with respect to $\mathbf{w}$ and $b$ and set them to zero:

$$\frac{\partial L_P}{\partial \mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_{i=1}^{N} \lambda_i y_i \mathbf{x}_i, \qquad (5.39)$$
$$\frac{\partial L_P}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{N} \lambda_i y_i = 0. \qquad (5.40)$$
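Once the multipliers $\lambda_i$ are known, Equations 5.39 and 5.40 can be checked numerically; the multipliers and points below are invented for illustration, not a solved SVM:

```python
X = [[2.0, 2.0], [0.0, 0.0]]  # training points
y = [1, -1]                   # class labels
lam = [0.25, 0.25]            # hypothetical Lagrange multipliers

# Equation 5.39: w is a lambda-weighted sum of y_i * x_i.
w = [sum(l * yi * x[j] for l, yi, x in zip(lam, y, X)) for j in range(2)]

# Equation 5.40: the multipliers must balance across the two classes.
balance = sum(l * yi for l, yi in zip(lam, y))

print(w)        # [0.5, 0.5]
print(balance)  # 0.0
```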