Reputation: 2568
During backpropagation it appears to be assumed that any error created in a hidden layer affects only the layer one level above it (for example, see the derivation here, specifically equation 16). That is, when calculating dE/dy_j, the derivation states that it uses the chain rule, yet it only differentiates over the nodes with indices in I_j (i.e. only over the nodes one layer higher than y_j). Why are the higher layers ignored in this calculation? We could take the i+1 layer into account as well by considering that x_{i+1} = \sum_i w_{i,i+1} f(\sum_j w_{j,i} y_j), which clearly depends on y_j.
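Spelled out (just the chain rule applied to the expression above, with the inner summation index primed to avoid reusing j), differentiating with respect to y_j would give

\frac{\partial x_{i+1}}{\partial y_j} = \sum_i w_{i,i+1} \, f'\!\left(\sum_{j'} w_{j',i} \, y_{j'}\right) w_{j,i}

so the dependence on y_j clearly propagates through the higher layer as well.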
Upvotes: 1
Views: 58
Reputation: 19169
Higher layers aren't being ignored. In equation 16, the E in dE/dy_i is the error of the final output, so that gradient already includes the effects of all subsequent layers. That's the whole point of backpropagation: you start with the error at the output and compute the gradient with respect to the previous layer, then use that gradient to compute the gradient for the layer before that, and so on.
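To make that explicit, here is the recursion in your notation (where x_i = \sum_j w_{j,i} y_j and y_i = f(x_i)); this is just the standard chain-rule bookkeeping, not a quote of equation 16:

\frac{\partial E}{\partial y_j} = \sum_{i \in I_j} \frac{\partial E}{\partial x_i} \frac{\partial x_i}{\partial y_j} = \sum_{i \in I_j} \frac{\partial E}{\partial x_i} \, w_{j,i}, \qquad \frac{\partial E}{\partial x_i} = f'(x_i) \, \frac{\partial E}{\partial y_i}

The \partial E / \partial y_i on the right is computed by the same formula applied one layer up, so by the time you reach y_j it already carries the contributions of every layer above it.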
You could do what you are describing, but it would make for a much more complicated formulation. A convenient and efficient aspect of the backpropagation formulation is that, since you only ever need the error term of the subsequent layer, it doesn't matter whether you have a total of 3 layers or 4 or 50: you apply the same simple formula to each hidden layer, accumulating chain-rule terms as you work your way backward through the network.
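As a concrete (if simplified) illustration, here is a minimal NumPy sketch of that backward sweep for a small fully connected net with sigmoid units and squared error; the layer sizes and function names are my own, not anything from the linked derivation. Each layer's gradient is computed from nothing more than the delta of the layer just above it, yet every higher layer's influence is folded in because that delta was propagated down from the output error.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(Ws, y0):
    """Return the activations of every layer, starting with the input."""
    ys = [y0]
    for W in Ws:
        ys.append(sigmoid(W @ ys[-1]))   # y_i = f(x_i), with x_i = sum_j w_{j,i} y_j
    return ys

def backward(Ws, ys, target):
    """One backward sweep: dE/dW for every layer, E = 0.5 * ||y_out - target||^2."""
    grads = [None] * len(Ws)
    # delta_i = dE/dx_i at the output layer; f'(x) = y * (1 - y) for the sigmoid.
    delta = (ys[-1] - target) * ys[-1] * (1.0 - ys[-1])
    for k in reversed(range(len(Ws))):
        grads[k] = np.outer(delta, ys[k])        # dE/dw_{j,i} = delta_i * y_j
        if k > 0:
            # dE/dy for the layer below: only the next layer's delta is needed,
            # but it already encodes the error of everything above it.
            dE_dy = Ws[k].T @ delta
            delta = dE_dy * ys[k] * (1.0 - ys[k])
    return grads

# Hypothetical usage: a 3-4-2 network with random weights.
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
ys = forward(Ws, rng.standard_normal(3))
print([g.shape for g in backward(Ws, ys, target=np.array([0.0, 1.0]))])
# -> [(4, 3), (2, 4)]
```

Note that the loop body is identical for every layer; adding more hidden layers only adds more iterations, not new formulas.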
Upvotes: 2