Reputation: 1248
I'm researching multilayer perceptrons, a kind of neural network. When reading about the backpropagation algorithm, I see that some authors suggest updating the weights immediately after computing all the errors for a specific layer, while other authors explain that we need to update the weights only after we have the errors for all layers. Which approach is correct?
1st Approach:
function void BackPropagate(){
    ComputeErrorsForOutputLayer();
    UpdateWeightsOutputLayer();
    ComputeErrorsForHiddenLayer();
    UpdateWeightsHiddenLayer();
}
2nd Approach:
function void BackPropagate(){
    ComputeErrorsForOutputLayer();
    ComputeErrorsForHiddenLayer();
    UpdateWeightsOutputLayer();
    UpdateWeightsHiddenLayer();
}
Thanks for everything.
Upvotes: 0
Views: 2063
Reputation: 717
Your question is a different one from choosing between batch and online backpropagation.
It is a legitimate question, and I think both approaches are workable. Over many epochs the two approaches behave almost the same, but the 2nd looks just a little better, even though everyone uses the 1st.
PS: The 2nd approach works only with online backpropagation.
Upvotes: -1
Reputation: 4275
@lejlot's answer is entirely correct:
Batch backpropagation
Update weights after all errors for all the input vectors are calculated.
Online backpropagation
Update weights after all errors for one input vector are calculated.
There is a third method, called stochastic backpropagation, which is really just online backpropagation with a randomly chosen training-pattern sequence.
On average, the batch backpropagation method is the fastest to converge, but also the most difficult to implement. See a simple comparison here.
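The three schedules differ only in when the weight update is applied. As a rough illustration, here is a minimal sketch on a single linear neuron trained with squared error; all names (`batch_epoch`, `online_epoch`, `stochastic_epoch`) are illustrative, not from the answer:

```python
# Sketch contrasting batch, online, and stochastic update schedules
# on one linear neuron with squared-error loss 0.5 * (w.x - y)^2.
import random

def grad(w, x, y):
    """Gradient of 0.5 * (w.x - y)^2 with respect to w."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [err * xi for xi in x]

def batch_epoch(w, data, lr=0.1):
    # Batch: accumulate gradients over ALL samples, then update once.
    total = [0.0] * len(w)
    for x, y in data:
        total = [t + gi for t, gi in zip(total, grad(w, x, y))]
    return [wi - lr * t / len(data) for wi, t in zip(w, total)]

def online_epoch(w, data, lr=0.1):
    # Online: update immediately after EACH sample, in the given order.
    for x, y in data:
        w = [wi - lr * gi for wi, gi in zip(w, grad(w, x, y))]
    return w

def stochastic_epoch(w, data, lr=0.1):
    # Stochastic: online backprop with a randomly shuffled sample order.
    shuffled = data[:]
    random.shuffle(shuffled)
    return online_epoch(w, shuffled, lr)
```

On a consistent linear problem all three converge to the same weights; what differs is when each gradient is applied, which is exactly the batch/online distinction above.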
Here you can see the mathematical equation for calculating the derivative of the error with respect to the weights (using the sigmoid activation):
O_i = the layer below    # ex: input layer
O_k = the current layer  # ex: hidden layer
O_o = the layer above    # ex: output layer
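In this notation, the textbook form of that derivative can be sketched as follows (a reconstruction of the standard sigmoid backpropagation formula, not the exact image from the post):

```latex
\frac{\partial E}{\partial w_{ik}} = \delta_k \, O_i,
\qquad
\delta_k = O_k \, (1 - O_k) \sum_{o} \delta_o \, w_{ko}
```

where $\delta_o = (O_o - t_o)\,O_o(1 - O_o)$ for output units with targets $t_o$. The sum over $o$ is what makes $\partial E / \partial w_{ik}$ depend on the weights $w_{ko}$ of the layer above.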
As you can see, the dE/dW depends on the weights of the layer above.
So you may not alter them before calculating the deltas for each layer.
Upvotes: 3
Reputation: 66795
I am pretty sure that you have misunderstood the concept here. The two possible strategies are:
Batch learning - update the weights only after the errors have been computed for all input vectors.
Online learning - update the weights after the errors have been computed for a single input vector.
This is completely different from what you have written. These two methods are sample-selection strategies, each with its pros and cons; due to its simplicity, the first is much more common in implementations.
Regarding your "methods", the second one is the only correct one. The process of "propagating" the error is just a computational simplification of computing the derivative of the error function, and the (basic) learning process is steepest descent. If you compute the derivative only for part of the dimensions (the output layer), perform a step in that direction, and then recalculate the error derivatives using the new values, you are no longer performing gradient descent. The only scenario in which the first method is acceptable is when the weight updates do not interfere with the error computation; then the order does not matter, because the two are independent.
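A minimal sketch of the correct ordering described above (compute all deltas with the old weights, then update every layer), on a hypothetical network with one sigmoid unit per layer; the names `W1`, `W2`, `backprop_step` are illustrative, not from the answer:

```python
# Sketch of the asker's 2nd approach: ALL deltas are computed from
# the OLD weights before any weight is updated.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def backprop_step(x, target, W1, W2, lr=0.5):
    # Forward pass: input -> hidden -> output (one unit per layer).
    h = sigmoid(W1 * x)
    o = sigmoid(W2 * h)

    # 1) Compute ALL deltas first, using the current (old) weights.
    delta_o = (o - target) * o * (1.0 - o)   # output-layer delta
    delta_h = delta_o * W2 * h * (1.0 - h)   # uses the OLD W2

    # 2) Only now update the weights of every layer.
    W2 -= lr * delta_o * h
    W1 -= lr * delta_h * x
    return W1, W2

def loss(x, target, W1, W2):
    o = sigmoid(W2 * sigmoid(W1 * x))
    return 0.5 * (o - target) ** 2
```

Moving the `W2` update above the `delta_h` line would reproduce the asker's 1st approach: `delta_h` would then be computed from an already-modified `W2`, so the combined step would no longer be a gradient-descent step.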
Upvotes: 5