5.4 Artificial Neural Network (ANN) Z4T
5.4.1 Perceptron
Consider the diagram shown in Figure 5.14. The table on the left shows a data set containing three boolean variables (*t, ,r, 13) and an output variable, gr, that takes on the value -1 if at least two of the three inputs arezero, and +1 if at least two of the inputs are greater than zero.
Input nodes
(a) Data set. (b) perceptron.
Figure 5.14. Modeling a boolean function using a perceptron.
Figure 5.14(b) illustrates a simple neural network architecture known as a perceptron. The perceptron consists of two types of nodes: input nodes, which are used to represent the input attributes, and an output node, which is used to represent the model output. The nodes in a neural network architecture are commonly known as neurons or units. In a perceptron, each input node is connected via a weighted link to the output node. The weighted link is used to emulate the strength of synaptic connection between neurons. As in biological neural systems, training a perceptron model amounts to adapting the weights of the links until they fit the input-output relationships of the underlying data.
A perceptron computes its output value, Q, by performing a weighted sum on its inputs, subtracting a bias factor t from the sum, and then examining the sign of the result. The model shown in Figure 5.14(b) has three input nodes, each of which has an identical weight of 0.3 to the output node and a bias factor of f : 0.4. The output computed by the model is
X1 X2 x3 v 1 1 1 1 0 0 0 0
0 0 1 1 0 1 1 0
0 1 0 1 “l
0 1 0
-1
1 1 1
-1 -1
1 -1
(a) Data set.
” I t , i f 0.311 * o.Jr2f 0.313 – 0.4 > o; t i -”
[ -1, i f o.Jq * 0.Jr2 * 0.313 – 0.4 < 0. (5 .21)
248 Chapter 5 Classification: Alternative Techniques
For example, if 11 : \, fi2 : L, rz : 0, then 0 : +1 because 0.32r * 0.312 + 0.313 – 0 .4 i s pos i t i ve . On the o ther hand, i f ry :0 , f r2 : L , f rB : 0 , then
A : -t because the weighted sum subtracted by the bias factor is negative. Note the difference between the input and output nodes of a perceptron.
An input node simply transmits the value it receives to the outgoing link with- out performing any transformation. The output node, on the other hand, is a mathematical device that computes the weighted sum of its inputs, subtracts the bias term, and then produces an output that depends on the sign of the resulting sum. More specifically, the output of a perceptron model can be expressed mathematicallv as follows:
0 : s i ‘gn(wdrd + wd- t .xd- t+ . . . + wzrz I wp1 – t ) , (5.22)
where u) r t l r2 t . . . ,11)d . a re the we igh ts o f the input l inks and 11 , r2 t . . . ) rdare the input attribute values. The sign function, which acts as an activation function for the output neuron, outputs a value *1 if its argument is positive
and -1 if its argument is negative. The perceptron model can be written in a more compact form as follows:
0 : s i ‘ g n l w a r a l w d , – L r d – r + . . . + w y r l l u t o r o f : s i g n ( w ‘ x ) , ( 5 . 2 3 )
where u)0: -t, frl: I, and w’x is the dot product between the weight vector w and the input attribute vector x.
Learning Perceptron Model
During the training phase of a perceptron model, the weight parameters w are adjusted until the outputs of the perceptron become consistent with the true outputs of training examples. A summary of the perceptron learning algorithm is given in Algorithm 5.4.
The key computation for this algorithm is the weight update formula given in Step 7 of the algorithm:
.\k+r) :.f, + \(ur – g[r))”ni, (5.24)
where tr(k) is the weight parameter associated with the i’h input link after the kth ileration, ) is a parameter known as the learning rate, and rii is the value of lhe jth attribute of the training example 4. The justification for the weight update formula is rather intuitive. Equation 5.24 shows that the new weight ?r(k+l) is a combination of the old weight tr(ft) and a term proportional
5.4 Artificial Neural Network (ANN) 249
Algorithm 5.4 Perceptron learning algorithm. t: Let D: {(xr, yt) | i : I,2,. . .,1[] be the set of training examples. 2: Initialize the weight vector with random values, w(0) 3: repeat 4: for each training example (xt,llt) € D do
5: Compute the predicted output yjk) 6: for each weight wi do
7: Update the weight, .\k+r) :.:*) + )(g, – g[r))*0,. 8: end for 9: end for
10: until stopping condition is met
to the prediction error, (g – i). If the prediction is correct, then the weight remains unchanged. Otherwise, it is modified in the following ways:
o If y : +1 and f – -1, then the prediction error is (a – 0) : 2. To compensate for the error, we need to increase the value of the predicted output by increasing the weights of all links with positive inputs and decreasing the weights of all links with negative inputs.
o I f at : -1 and 0: l I , then (g -0): -2. To compensate for the error, we need to decrease the value of the predicted output by decreasing the weights of all links with positive inputs and increasing the weights of all Iinks with negative inputs.
In the weight update formula, Iinks that contribute the most to the error term are the ones that require the largest adjustment. However, the weights should not be changed too drastically because the error term is computed only for the current training example. Otherwise, the adjustments made in earlier iterations will be undone. The learning rate ), a parameter whose value is between 0 and 1, can be used to control the amount of adjustments made in each iteration. If ,\ is close to 0, then the new weight is mostly influenced by the value of the old weight. On the other hand, if ) is close to 1, then the new weight is sensitive to the amount of adjustment performed in the current iteration. In some cases) an adaptive ) value can be used; initially, ,\ is moderately large during the first few iterations and then gradually decreases in subsequent iterations.
The perceptron model shown in Equation 5.23 is linear in its parameters w and attributes x. Because of this, the decision boundary of a perceptron, which is obtained by setting 0 : 0, is a linear hyperplane that separates the data into two classes, -1 and *1. Figure 5.15 shows the decision boundary
25O Chapter 5 Classification: Alternative Techniques
Figure 5.15. Perceptron decision boundary for the data given in Figure 5.14,
obtained by applying the perceptron learning algorithm to the data set given in Figure 5.14. The perceptron learning algorithm is guaranteed to converge to an optimal solution (as long as the learning rate is sufficiently small) for linearly separable classification problems. If the problem is not linearly separable, the algorithm fails to converge. Figure 5.16 shows an example of nonlinearly separable data given by the XOR function. Perceptron cannot find the right solution for this data because there is no linear hyperplane that can perfectly separate the training instances.
0.5 x1
1 . 5
1
Xe 0.5
0
-0 .5L -0.5 0.5
x1
Figure 5.16. XOR classification problem. No linear hyperplane can separate the two classes.
1 . 5
5.4 Artificial Neural Network (ANN) 251
5.4.2 Multilayer Artificial Neural Network
An artificial neural network has a more complex structure than that of a perceptron model. The additional complexities may arise in a number of ways:
1. The network may contain several intermediary layers between its input and output layers. Such intermediary layers are called hidden layers and the nodes embedded in these layers are called hidden nodes. The resulting structure is known as a multilayer neural network (see Fig- ure 5.17). In a feed-forward neural network, the nodes in one layer
Figure 5.17. Example of a multilayer feed{onrvard artificial neural network (ANN).
are connected only to the nodes in the next layer. The perceptron is a single-layer, feed-forward neural network because it has only one layer of nodes-the output layer that performs complex mathematical op- erations. fn a recurrent neural network, the links may connect nodes within the same layer or nodes from one layer to the previous layers.
2. The network may use types of activation functions other than the sign function. Examples of other activation functions include linear, sigmoid (logistic), and hyperbolic tangent functions, as shown in Figure 5.18. These activation functions allow the hidden and output nodes to produce output values that are nonlinear in their input parameters.
These additional complexities allow multilayer neural networks to model more complex relationships between the input and output variables. For ex-
252 Chapter 5 Classification: Alternative Techniques
Linear function Sigmoid tun
-0.5 0 0.s 1 -1 -05 0 0.5
Figure 5,18. Types of activation functions in artificial neural networks.
ample, consider the XOR problem described in the previous section. The in- stances can be classified using two hyperplanes that partition the input space into their respective classes, as shown in Figure 5.19(a). Because a percep-
tron can create only one hyperplane, it cannot find the optimal solution. This problem can be addressed using a two-layer, feed-forward neural network, as shown in Figure 5.19(b). Intuitively, we can think of each hidden node as a perceptron that tries to construct one of the two hyperplanes, while the out- put node simply combines the results of the perceptrons to yield the decision boundary shown in Figure 5.19(a).
To learn the weights of an ANN model, we need an efficient algorithm that converges to the right solution when a sufficient amount of training data is provided. One approach is to treat each hidden node or output node in the network as an independent perceptron unit and to apply the same weight update formula as Equation 5.24. Obviously, this approach will not work because we lack a priori, knowledge about the true outputs of the hidden nodes. This makes it difficult to determine the error term, (gr – f), associated
Sigmoid tunction
Sign function
-1 .5
5.4 Artificial Neural Network (ANN) 253
1 . 5
1
X z 0 5
0-0.5 0.5
x1
(a) Decision boundary. (b) Neural network topology.
Figure 5.19. A two-layer, feed{orward neural network for the XOR problem.
with each hidden node. A methodology for learning the weights of a neural network based on the gradient descent approach is presented next.
Learning the ANN Model
The goal of the ANN learning algorithm is to determine a set of weights w that minimize the total sum of souared errors:
1 ‘A{
E(w) : il,rr,-o)2. ; – l
(5.25)
Note that the sum of squared errors depends on w because the predicted class
f is a function of the weights assigned to the hidden and output nodes. Figure 5.20 shows an example of the error surface as a function of its two parameters, to1 and ‘u2. This type of error surface is typically encountered when Qi is a linear function of its parameters, w. If we replace Q : w ‘x into Equation 5.25, then the error function becomes quadratic in its parameters and a global minimum solution can be easily found.
In most cases) the output of an ANN is a nonlinear function of its param- eters because of the choice of its activation functions (e.g., sigmoid or tanh function). As a result, it is no longer straightforward to derive a solution for w that is guaranteed to be globally optimal. Greedy algorithms such as those based on the gradient descent method have been developed to efficiently solve the optimization problem. The weight update formula used by the gradient
254 Chapter 5 Classification: Alternative Techniques
E(w1,w2)
1 . 8
w 1 0 ‘ l
Figure 5,20. Enor sudace E(w1,w2) for a two-parameter model.
descent method can be written as follows:
1 . 6
1 . 4
1 . 2
1 I
(5.26)
where ) is the Iearning rate. The second term states that the weight should be increased in a direction that reduces the overall error term. However, because the error function is nonlinear, it is possible that the gradient descent method may get trapped in a local minimum.
The gradient descent method can be used to learn the weights of the out- put and hidden nodes ofa neural network. For hidden nodes, the computation is not trivial because it is difficult to assess their error Lerm,0Ef 0t17, without knowing what their output values should be. A technique known as back- propagation has been developed to address this problem. There are two phases in each iteration of the algorithm: the forward phase and the backward phase. During the forward phase, the weights obtained from the previous iter- ation are used to compute the output value of each neuron in the network. The computation progresses in the forward direction; i.e., outputs of the neurons at level k are computed prior to computing the outputs at level /c + 1. Dur- ing the backward phase, the weight update formula is applied in the reverse direction. In other words, the weights at level k + 7 arc updated before the weights at level k are updated. This back-propagation approach allows us to use the errors for neurons at layer k + t to estimate the errors for neurons at Iaver k.
.OE(w) W; +’ l I ; – A—^r r
dw.i