
Back-propagation algorithm

The back-propagation algorithm is described by David E. Rumelhart, Geoffrey E. Hinton & Ronald J. Williams in the famous paper "Learning representations by back-propagating errors".
You can download the original paper at http://www.nature.com/nature/journal/v323/n6088/pdf/323533a0.pdf. In the paper, the authors described a procedure called back-propagation. Here I will give just a short summary of the main concepts related to it.

This algorithm can be decomposed into the following steps (a short code sketch follows the list):

The network is first initialized by setting all of its weights to small random numbers.
Until convergence (low error or another stopping criterion) is reached, do:

  • Apply an input pattern and calculate the output (feed-forward computation).
  • Calculate the error of the output nodes (based on target − actual output).
  • Calculate the error of the hidden nodes from the error of the output nodes, which is propagated back to them.
  • Update all weights using the standard delta rule with the appropriate error term δ; this error term is used to change the weights in such a way that the error gets smaller.
  • Continue propagating the error back until the input layer is reached.
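
Here is a minimal sketch of these steps in Python with NumPy, for a network with one hidden layer and logistic (sigmoid) activations. The XOR training set, the network size, the learning rate, the epoch limit and the stopping threshold are all choices made up for this example (they are not taken from the paper), and the δ expressions used in the code anticipate the sigmoid-specific formulas derived further down the page.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy training set (XOR), used only to make the sketch runnable.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
gamma = 0.5                                  # learning rate
W1 = rng.uniform(-0.5, 0.5, size=(2, 4))     # input -> hidden weights (small random numbers)
b1 = rng.uniform(-0.5, 0.5, size=4)          # hidden biases
W2 = rng.uniform(-0.5, 0.5, size=(4, 1))     # hidden -> output weights
b2 = rng.uniform(-0.5, 0.5, size=1)          # output bias

for epoch in range(20000):                   # loop until convergence or the epoch limit
    # Feed-forward computation
    y_hidden = sigmoid(X @ W1 + b1)
    y_out = sigmoid(y_hidden @ W2 + b2)

    # Error term of the output nodes: delta = f'(net) * (target - actual output)
    delta_out = y_out * (1 - y_out) * (T - y_out)

    # Error term of the hidden nodes: the output error propagated back through W2
    delta_hidden = y_hidden * (1 - y_hidden) * (delta_out @ W2.T)

    # Delta-rule updates: w_kj += gamma * delta_j * y_k (biases act as weights on a constant input 1)
    W2 += gamma * (y_hidden.T @ delta_out)
    b2 += gamma * delta_out.sum(axis=0)
    W1 += gamma * (X.T @ delta_hidden)
    b1 += gamma * delta_hidden.sum(axis=0)

    E = 0.5 * np.sum((T - y_out) ** 2)       # sum-of-squares error over the output layer
    if E < 1e-3:                             # stopping criterion: low error
        break

print(epoch, E, y_out.ravel())

Each update adds γ·δ_j·y_k to a weight, which is the delta-rule increment −γ ∂E/∂w_kj written in terms of the error term δ_j defined below.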

The algorithm repeatedly adjusts the weights of the connections in the network so as to minimize a measure of the difference between the actual output of the network and the desired one.
This difference is computed by means of an error function, which is commonly given as the sum of the squares of the differences between the targets t_i and the actual node activations y_i of the output layer.
The algorithm looks for the minimum of the error function E in weight space using the method of gradient descent.
E is calculated by the network through composition of the node functions, so it is a continuous and differentiable function of the weights in the network.
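
Written out explicitly (the factor 1/2 in front is a common convention that simplifies the derivatives below; it is not stated in the text above), the error function is:

E = \frac{1}{2} \sum_i (t_i - y_i)^2
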
This method requires the computation of the gradient of the error function at each iteration step:

\nabla E = \left( \frac{\partial E}{\partial w_1}, \frac{\partial E}{\partial w_2}, \ldots, \frac{\partial E}{\partial w_n} \right)

where w_1, ..., w_n are all the weights of the network.

Using a method to compute this gradient, it is possible to adjust the network weights iteratively in order to find a minimum of the error function, where ∇E = 0.
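
As a toy illustration of the idea (a single linear unit y = w·x rather than a full network; the data and the learning rate are invented for the example), gradient descent on the sum-of-squares error looks like this:

import numpy as np

x = np.array([1.0, 2.0, 3.0])          # inputs of a single linear unit y = w * x
t = np.array([2.0, 4.0, 6.0])          # targets (generated with the "true" weight 2)

w = 0.0                                 # initial weight
gamma = 0.05                            # learning rate

for step in range(100):
    y = w * x                           # actual outputs
    grad = -np.sum((t - y) * x)         # dE/dw for E = 0.5 * sum((t - y)^2)
    w -= gamma * grad                   # move downhill, against the gradient
    if abs(grad) < 1e-6:                # stop near a minimum, where the gradient vanishes
        break

print(w)                                # converges to 2, the weight with dE/dw = 0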

In a way very similar to the delta rule, the back-propagation algorithm updates each weight with an increment that is calculated as follows (where γ is the learning rate):

\Delta w_{kj} = -\gamma \frac{\partial E}{\partial w_{kj}}

To update all the network weights we have to select the appropriate error term δ by distinguishing the output layer from hidden layers.
If we denote the back-propagated error at the j-th node by δ_j, defined as δ_j = −∂E/∂net_j where net_j = Σ_k w_kj y_k is the weighted input of node j, we can then express the partial derivative of E with respect to w_kj as:

\frac{\partial E}{\partial w_{kj}} = \frac{\partial E}{\partial \mathrm{net}_j} \, \frac{\partial \mathrm{net}_j}{\partial w_{kj}} = -\delta_j \, y_k

where y_k is the output of unit k.
For hidden units we need to see how the error caused by y_j has propagated into the activations of the next layer (indexed by i). Applying the chain rule once more, we have:

\delta_j = f'(\mathrm{net}_j) \sum_i w_{ji} \, \delta_i

Putting it all together, the error term is given by the following expressions:

\delta_j = f'(\mathrm{net}_j) \, (t_j - y_j) \quad \text{for nodes in the output layer}

\delta_j = f'(\mathrm{net}_j) \sum_i w_{ji} \, \delta_i \quad \text{for nodes in the hidden layers}
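
If the nodes use the logistic (sigmoid) activation f(net) = 1 / (1 + e^(−net)) (an assumption on my part, since the text above does not fix the activation function), the derivative takes the convenient form f'(net_j) = y_j (1 − y_j), and the two expressions become:

\delta_j = y_j (1 - y_j) (t_j - y_j) \quad \text{(output nodes)}

\delta_j = y_j (1 - y_j) \sum_i w_{ji} \, \delta_i \quad \text{(hidden nodes)}

These are the expressions used in the Python sketch near the top of the page.
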
For a weight connecting a node in layer k to a node in layer j, the change in weight at epoch n is given by:

\Delta w_{kj}(n) = \gamma \, \delta_j \, y_k + \eta \, \Delta w_{kj}(n-1)

where γ is the learning rate introduced above (a real value in the interval (0,1]), y_k is the activation of the node in layer k, n refers to the training epoch (the iteration number of the training loop), and η is the momentum rate. The first term is just the delta-rule increment −γ ∂E/∂w_kj written with the error term δ_j; the second term is the momentum.
The introduction of the momentum rate η attenuates oscillations in the iteration process: once the weights start moving in a particular direction in weight space, they tend to continue moving in that direction.
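
Here is a small sketch of how the momentum term can be implemented, on a toy one-dimensional error E = (w − 3)^2; the values of γ, η and the iteration count are picked only for illustration.

def update_with_momentum(w, dE_dw, dw_prev, gamma=0.1, eta=0.9):
    # dw(n) = -gamma * dE/dw + eta * dw(n-1), i.e. the delta-rule step plus the momentum term
    dw = -gamma * dE_dw + eta * dw_prev
    return w + dw, dw

w, dw_prev = 0.0, 0.0
for n in range(200):
    dE_dw = 2.0 * (w - 3.0)             # gradient of the toy error E = (w - 3)^2
    w, dw_prev = update_with_momentum(w, dE_dw, dw_prev)
print(w)                                 # approaches 3, the minimum, through damped oscillations

The previous change Δw(n−1) has to be remembered between epochs; in the network sketch above this would mean keeping one array of previous changes per weight matrix.
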
Despite the complexity of the previous formulas, the essence of the back-propagation algorithm is in the last three: the two expressions for the error term δ and the weight-change rule.
If you trust that they are correct, there is no need to know more.

Reference

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533–536.