Reputation: 129
I am trying to understand how backpropagation works mathematically and want to implement it in Python with numpy. I use a feedforward neural network with one hidden layer for my calculations, sigmoid as the activation function, and mean squared error as the error function. This is a screenshot of the result of my calculations, and the problem is that there is a bunch of matrices that I cannot multiply out completely because they don't have the same dimensions.
(In the screenshot, L is the output layer, L-1 is the hidden layer, L-2 is the input layer, W is a weight matrix, E is the error function, and lowercase a denotes activations.)
(In the code, the first layer has 28*28 nodes [because I am using the MNIST database of 0-9 digits as training data], the hidden layer has 15 nodes, and the output layer has 10 nodes.)
# ho stands for hidden_output
# ih stands for input_hidden
def train(self, input_, target):
    self.input_ = input_
    self.output = self.feedforward(self.input_)
    # Derivative of error with respect to weight between output layer and hidden layer
    delta_ho = (self.output - target) * sigmoid(np.dot(self.weights_ho, self.hidden), True) * self.hidden
    # Derivative of error with respect to weight between input layer and hidden layer
    delta_ih = (self.output - target) * sigmoid(np.dot(self.weights_ho, self.hidden), True) * self.weights_ho * sigmoid(np.dot(self.weights_ih, self.input_), True) * self.input_
    # Adjust weights
    self.weights_ho -= delta_ho
    self.weights_ih -= delta_ih
At the delta_ho = ... line, the dimensions of the matrices are (10x1 - 10x1) * (10x1) * (1x15), so how do I compute this? Thanks for any help!
Upvotes: 4
Views: 4408
Reputation: 1250
Here is a note from Stanford's CS231n: http://cs231n.github.io/optimization-2/.
For back-propagation with matrices/vectors, one thing to remember is that the gradient w.r.t. (with respect to) a variable (matrix or vector) always has the same shape as the variable.
For example, suppose the loss is l and the calculation of the loss contains a matrix multiplication C = A.dot(B), where A has shape (m, n) and B has shape (n, p) (hence C has shape (m, p)). The gradient w.r.t. C is dC, which also has shape (m, p). To obtain a matrix with the same shape as A using dC and B, the only valid product is dC.dot(B.T), which multiplies two matrices of shapes (m, p) and (p, n) to give dA, the gradient of the loss w.r.t. A. Similarly, the gradient of the loss w.r.t. B is dB = A.T.dot(dC).
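For concreteness, here is a minimal numpy sketch of that shape rule. The sizes m, n, p are arbitrary, and the upstream gradient dC is faked as an array of ones just to show the shapes:

import numpy as np

m, n, p = 4, 3, 2
A = np.random.randn(m, n)
B = np.random.randn(n, p)
C = A.dot(B)               # shape (m, p)

# Pretend the upstream gradient dL/dC is all ones; any (m, p) array would do.
dC = np.ones((m, p))

dA = dC.dot(B.T)           # shape (m, n), same as A
dB = A.T.dot(dC)           # shape (n, p), same as B

print(dA.shape, dB.shape)  # (4, 3) (3, 2)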
For any additional operation, such as the sigmoid, you chain the gradients backwards in the same way.
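Applied to the two-layer network in the question, a backward pass that follows this rule could look like the sketch below. This is only an illustration under some assumptions: column-vector activations of shapes (784, 1), (15, 1) and (10, 1), a sigmoid(x, True) that returns the sigmoid derivative, a hypothetical learning rate lr, and no bias terms (as in the original code).

import numpy as np

def sigmoid(x, deriv=False):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s) if deriv else s

# Hypothetical shapes matching the question: 784 inputs, 15 hidden, 10 outputs.
rng = np.random.default_rng(0)
input_     = rng.standard_normal((784, 1))
target     = rng.random((10, 1))
weights_ih = rng.standard_normal((15, 784))
weights_ho = rng.standard_normal((10, 15))
lr = 0.1  # hypothetical learning rate

# Forward pass
z_h    = weights_ih.dot(input_)    # (15, 1) pre-activation of hidden layer
hidden = sigmoid(z_h)              # (15, 1)
z_o    = weights_ho.dot(hidden)    # (10, 1) pre-activation of output layer
output = sigmoid(z_o)              # (10, 1)

# Backward pass: every gradient has the same shape as the variable it updates.
dz_o     = (output - target) * sigmoid(z_o, True)       # (10, 1)
delta_ho = dz_o.dot(hidden.T)                           # (10, 15), same as weights_ho
dz_h     = weights_ho.T.dot(dz_o) * sigmoid(z_h, True)  # (15, 1)
delta_ih = dz_h.dot(input_.T)                           # (15, 784), same as weights_ih

weights_ho -= lr * delta_ho
weights_ih -= lr * delta_ih

Note how the elementwise products (*) only ever combine vectors of the same shape, while the .dot with a transposed factor is what produces gradients with the same shapes as weights_ho and weights_ih.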
Upvotes: 5