The back-propagation algorithm was described by David E. Rumelhart, Geoffrey E. Hinton & Ronald J. Williams in the famous paper "Learning representations by back-propagating errors".
You can download the original paper at http://www.nature.com/nature/journal/v323/n6088/pdf/323533a0.pdf. In it, the authors describe a procedure called back-propagation. Here I will give just a short summary of the main concepts related to it.
This algorithm can be decomposed into the following steps:
The algorithm repeatedly adjusts the weights of the connections in the network so as to minimize a measure of the difference between the actual output of the network and the desired one.
This difference is computed by means of an error function, which is commonly given as the sum of the squares of the differences between the target values ti and the actual node activations yi of the output layer.
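The formula itself is missing from this copy; the standard sum-of-squares form (the 1/2 factor is conventional and cancels when differentiating) is:

```latex
E = \frac{1}{2} \sum_{i} (t_i - y_i)^2
```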
E is calculated by the network through composition of the node functions, so it is a continuous and differentiable function of the weights in the network.
This method requires computing the gradient of the error function at each iteration step. Using that gradient, it is possible to adjust the network weights iteratively in order to find a minimum of the error function, where ∇E = 0.
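As a toy illustration of this idea (not from the paper; the error function and learning rate below are made up for the example), here is gradient descent on a one-variable error function:

```python
# Minimal gradient-descent sketch on E(w) = (w - 3)^2,
# whose gradient is dE/dw = 2 * (w - 3).
def grad_E(w):
    return 2.0 * (w - 3.0)

w = 0.0        # initial weight
gamma = 0.1    # learning rate (illustrative value)
for _ in range(100):
    w -= gamma * grad_E(w)  # step against the gradient
```

After the loop, w sits very close to 3, the point where the gradient is zero.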
In a way very similar to the delta rule, the back-propagation algorithm updates each weight using an increment which is calculated as follows (where γ is the learning rate):
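The equation image is missing here; in standard notation the increment is the negative gradient scaled by γ:

```latex
\Delta w_{kj} = -\gamma \, \frac{\partial E}{\partial w_{kj}}
```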
If we denote the back-propagated error at the j-th node by δj, we can then express the partial derivative of E with respect to wkj as:
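The original expression does not survive in this copy. With δj taken as the negative derivative of E with respect to the net input of node j (a common sign convention, and an assumption on my part), it reads:

```latex
\frac{\partial E}{\partial w_{kj}} = -\,\delta_j \, y_k
```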
where yk is the output of unit k, the node feeding the connection.
So we have:
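Combining the gradient-descent step with the partial derivative above gives the familiar update (standard form; δj is the back-propagated error signal):

```latex
\Delta w_{kj} = \gamma \, \delta_j \, y_k
```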
Putting it all together, the error term δj is given by the following expressions:
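The expressions themselves are missing from this copy. Assuming sigmoid (logistic) node activations, whose derivative is yj(1 − yj), the standard forms are:

```latex
\delta_j =
\begin{cases}
(t_j - y_j)\, y_j (1 - y_j) & \text{if } j \text{ is an output node} \\[4pt]
y_j (1 - y_j) \sum_{l} w_{jl}\, \delta_l & \text{if } j \text{ is a hidden node}
\end{cases}
```

where the sum runs over the nodes l that node j feeds into — this is the sense in which the error is propagated backwards.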
For a weight connecting a node in layer k to a node in layer j, the change in weight is given by
where α is the learning rate, a real value in the interval (0,1], yk is the activation of the node in layer k, n is the training epoch (the iteration number in the training loop), and η is the momentum rate.
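Written out with the symbols just defined (the equation image is missing here; this is the standard momentum form):

```latex
\Delta w_{kj}(n) = \alpha \, \delta_j \, y_k + \eta \, \Delta w_{kj}(n-1)
```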
Introduction of the momentum rate η allows the attenuation of oscillations in the iteration process. With momentum, once the weights start moving in a particular direction in weight space, they tend to continue moving in that direction.
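The whole procedure can be sketched in a few dozen lines. The network below (a 2-3-1 sigmoid network trained on XOR) is my own minimal illustration, not code from the paper; the learning rate, momentum, seed, and epoch count are all made-up example values.

```python
import math
import random

random.seed(1)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

ALPHA = 0.5   # learning rate
ETA = 0.9     # momentum rate

H = 3  # number of hidden nodes
# w_hidden[j] = weights from the two inputs (plus a bias) into hidden node j
w_hidden = [[random.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(H)]
w_out = [random.uniform(-1.0, 1.0) for _ in range(H + 1)]  # hidden -> output (+ bias)
# Previous weight changes, kept for the momentum term
prev_hidden = [[0.0] * 3 for _ in range(H)]
prev_out = [0.0] * (H + 1)

data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.1 - 0.1, 1.0], 0.0)]
data[3] = ([1.0, 1.0], 0.0)  # XOR truth table

def forward(x):
    h = [sigmoid(w[0] * x[0] + w[1] * x[1] + w[2]) for w in w_hidden]
    y = sigmoid(sum(w_out[j] * h[j] for j in range(H)) + w_out[H])
    return h, y

def total_error():
    # Sum-of-squares error over the whole training set
    return 0.5 * sum((t - forward(x)[1]) ** 2 for x, t in data)

initial_error = total_error()

for epoch in range(5000):
    for x, t in data:
        h, y = forward(x)
        # Back-propagated errors; the sigmoid derivative is y * (1 - y)
        delta_out = (t - y) * y * (1.0 - y)
        delta_h = [h[j] * (1.0 - h[j]) * delta_out * w_out[j] for j in range(H)]
        # Output-layer update: change = alpha * delta * activation + eta * previous change
        for k, inp in enumerate(h + [1.0]):
            prev_out[k] = ALPHA * delta_out * inp + ETA * prev_out[k]
            w_out[k] += prev_out[k]
        # Hidden-layer update, same rule with the raw inputs as activations
        for j in range(H):
            for k, inp in enumerate(x + [1.0]):
                prev_hidden[j][k] = ALPHA * delta_h[j] * inp + ETA * prev_hidden[j][k]
                w_hidden[j][k] += prev_hidden[j][k]

final_error = total_error()
```

On most random initializations this drives the error close to zero and the network learns XOR; the exact numbers depend on the seed.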
Despite the complexity of the previous formulas, the essence of the back-propagation algorithm is in the last three.
If you trust that they are correct, there is no need to know more.