Reputation: 14839
I am trying to implement a feed forward neural network in CUDA. So far, I've used Jeff Heaton's YouTube videos as a guide to infer the algorithms and implement them. I'm not clear on one thing:
In his Gradient Calculation video, Heaton explains how to get the node delta δ[i] for the output neurons. I then have to reuse it as δ[k] when working on the previous (hidden) layer, in order to calculate that layer's δ[i].
However, there is no mention of what happens when I have more than one node delta, or more than one outgoing weight from layer i to layer k. Similarly, it's unclear how I calculate the gradient for a specific weight: do I use a node delta, or the layer's delta?
For example, what happens if my output layer has 2 neurons? Do I sum the δ[k] for all nodes in layer k?
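In other words, with two output neurons (call them 1 and 2), would the sum for a hidden node i simply expand to w[1i] * δ[1] + w[2i] * δ[2], one term per outgoing weight of node i?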
The formula provided by Heaton:
f'( Σ ( weight[i] * input[i] ) ) * Σ ( w[ki] * δ[k] )
Appears to suggest that δ[k] represents the previous layer as a whole (and not just the output node), as shown in the video at 9:25.
In fact, two of the comments in that YouTube video also ask the same thing, but no satisfactory reply is given.
As far as I understand it, δ[k] represents the error of the entire following layer k that layer i connects to, and not just of the single node that the current node in layer i connects to?
EDIT
I've read a few papers and tutorials/lessons online, but the one that seems to somewhat answer my question can be found here. Specifically, the formula the blog's author uses is the same as the one used by Heaton, but he explains:
HLN = Hidden Layer Neuron,
LLN = Last Layer Neuron,
aO = actualOutput,
dE = deltaError
HLN.dE = (HLN.aO) x (1-HLN.aO) x (Sum of [LLN.dE x LLN to HLN connection weight])
This seems to imply that the formula actually is:
S[i] = Σ ( w[ji] * a[j] )
δ[i] = f'( S[i] ) * Σ ( w[ki] * δ[k] )
In words:
The sum of the previous layer's activation outputs, each multiplied by its connecting weight, is passed through the sigmoid derivative.
Then each outgoing weight w[ki] is multiplied by the corresponding δ[k] (i.e., w[ki] * δ[k]); those products are summed, and the sum is multiplied by the result of the sigmoid derivative.
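To check that I'm reading it correctly, here is a minimal host-side sketch of that interpretation (the array names, the flat weight layout and the sigmoidDerivative helper are my own, just for illustration; this is not Heaton's code):

    // Derivative of the logistic sigmoid, written in terms of the activation a = f(S):
    // f'(S) = a * (1 - a)
    static float sigmoidDerivative(float activation) {
        return activation * (1.0f - activation);
    }

    // Node deltas of a hidden layer, computed from the deltas of the next layer.
    // w[k * hiddenCount + i] holds the weight on the connection from hidden node i to next-layer node k.
    void hiddenLayerDeltas(const float* hiddenActivation, // f(S[i]) for each hidden node i
                           const float* nextDelta,        // δ[k] for each node k of the next layer
                           const float* w,                // weights hidden layer -> next layer
                           float*       hiddenDelta,      // output: δ[i]
                           int hiddenCount, int nextCount)
    {
        for (int i = 0; i < hiddenCount; ++i) {
            float weightedDeltaSum = 0.0f;
            for (int k = 0; k < nextCount; ++k) {      // sum over ALL nodes k of the next layer
                weightedDeltaSum += w[k * hiddenCount + i] * nextDelta[k];
            }
            hiddenDelta[i] = sigmoidDerivative(hiddenActivation[i]) * weightedDeltaSum;
        }
    }

So for a 2-neuron output layer, nextCount is 2 and the inner loop adds exactly two terms, one per outgoing weight of node i.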
I'd still like to hear from someone who has implemented a feed-forward neural network, if that is what happens.
Upvotes: 2
Views: 1628
Reputation: 10985
You're right. Probably the easiest way to formalize the update is:

δ_i = f'(x_i) * Σ_j ( w_ij * δ_j )

with f'(x) being the derivative of the activation function, x_i the output of the sending unit i and x_j that of the receiving unit j. So in words, you sum the deltas over all connections that the current unit has towards the previously visited layer (hence, backpropagation).
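Since you mentioned CUDA: purely as a sketch (the names and the flat weight layout, one row per receiving unit, are assumptions, not anything from your code), that computation maps naturally onto one thread per hidden unit:

    // One thread per hidden unit i:
    // delta[i] = f'(x_i) * sum over j of ( w_ij * delta[j] ),
    // with the logistic-sigmoid derivative written as x_i * (1 - x_i).
    __global__ void hiddenDeltaKernel(const float* __restrict__ hiddenOutput, // x_i of each hidden unit
                                      const float* __restrict__ nextDelta,    // delta_j of the receiving layer
                                      const float* __restrict__ weights,      // weights[j * hiddenCount + i]: i -> j
                                      float*       __restrict__ hiddenDelta,  // result: delta_i
                                      int hiddenCount, int nextCount)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= hiddenCount) return;

        float weightedDeltaSum = 0.0f;
        for (int j = 0; j < nextCount; ++j) {          // sum over all connections of unit i to the next layer
            weightedDeltaSum += weights[j * hiddenCount + i] * nextDelta[j];
        }
        float x = hiddenOutput[i];
        hiddenDelta[i] = x * (1.0f - x) * weightedDeltaSum;
    }

Launched with something like hiddenDeltaKernel<<<(hiddenCount + 255) / 256, 256>>>(...), each thread performs exactly the summation described above for its own unit.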
Upvotes: 3