Reputation: 4275
Weight layers:
(n_inputs+1, n_units_layer)-matrix
(n_units_layer+1, n_units_layer)-matrix
(n_units_layer+1, n_outputs)-matrix
Notes:
inputs --first_layer-> network_unit --second_layer-> output
weight_layers = [ layer1, layer2 ] # a list of layers as described above
input_values = [ [0,0], [1,1], [1,0], [0,1] ] # our test set (corresponds to XOR)
target_output = [ 0, 0, 1, 1 ] # what we want to train our net to output
output_layers = [] # output for the corresponding layers
for layer in weight_layers:
output <-- calculate the output # calculate the output from the current layer
output_layers <-- output # store the output from each layer
n_samples = input_values.shape[0]
n_outputs = target_output.shape[1]
error = ( output-target_output )/( n_samples*n_outputs )
""" calculate the gradient here """
The final implementation is available on GitHub.
Upvotes: 2
Views: 4554
Reputation: 3088
With Python and numpy that is easy.
You have two options: you can either compute the forward propagation for all num_instances instances at once with matrix operations, or loop over the instances one by one. I will now give some hints on how to implement option 1. I would suggest that you create a new class called Layer. It should have two functions:
forward:
    inputs:
        X: shape = [num_instances, num_inputs], inputs
        W: shape = [num_outputs, num_inputs], weights
        b: shape = [num_outputs], biases
        g: activation function
    outputs:
        Y: shape = [num_instances, num_outputs], outputs
backprop:
    inputs:
        dE/dY: shape = [num_instances, num_outputs], backpropagated gradient
        W: shape = [num_outputs, num_inputs], weights
        b: shape = [num_outputs], biases
        gd: function that calculates the derivative of g(A) = Y based on Y, i.e. gd(Y) = g'(A)
        Y: shape = [num_instances, num_outputs], outputs
        X: shape = [num_instances, num_inputs], inputs
    outputs:
        dE/dX: shape = [num_instances, num_inputs], will be backpropagated (dE/dY of the lower layer)
        dE/dW: shape = [num_outputs, num_inputs], accumulated derivative with respect to the weights
        dE/db: shape = [num_outputs], accumulated derivative with respect to the biases
The implementation is simple:
def forward(X, W, b, g):
A = X.dot(W.T) + b # will be broadcasted
Y = g(A)
return Y
def backprop(dEdY, W, b, gd, Y, X):
Deltas = gd(Y) * dEdY # element-wise multiplication
dEdX = Deltas.dot(W)
dEdW = Deltas.T.dot(X)
dEdb = Deltas.sum(axis=0)
return dEdX, dEdW, dEdb
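To see that backprop really matches forward, you can compare one analytic gradient entry against a central finite difference. Everything here (tanh as g, the sum-of-squared-errors cost, the random shapes) is an assumption made for the check, and the two functions are repeated so the snippet runs on its own:

```python
import numpy as np

# repeated here so the check is self-contained
def forward(X, W, b, g):
    return g(X.dot(W.T) + b)

def backprop(dEdY, W, b, gd, Y, X):
    Deltas = gd(Y) * dEdY  # element-wise multiplication
    return Deltas.dot(W), Deltas.T.dot(X), Deltas.sum(axis=0)

g = np.tanh                    # assumed activation
gd = lambda Y: 1.0 - Y ** 2    # tanh'(A) expressed through Y = tanh(A)

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))   # [num_instances, num_inputs]
W = rng.standard_normal((2, 3))   # [num_outputs, num_inputs]
b = rng.standard_normal(2)
T = rng.standard_normal((5, 2))

# sum of squared errors: E = 0.5 * sum((Y - T)**2), so dE/dY = Y - T
Y = forward(X, W, b, g)
_, dEdW, _ = backprop(Y - T, W, b, gd, Y, X)

# central finite difference for the entry W[0, 0]
eps = 1e-6
E = lambda W_: 0.5 * np.sum((forward(X, W_, b, g) - T) ** 2)
W_plus, W_minus = W.copy(), W.copy()
W_plus[0, 0] += eps
W_minus[0, 0] -= eps
numeric = (E(W_plus) - E(W_minus)) / (2 * eps)
# numeric and dEdW[0, 0] should agree to several decimal places
```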
The X of the first layer is taken from your dataset, and then you pass each Y as the X of the next layer in the forward pass.
The dE/dY of the output layer is computed as Y-T (this holds either for a softmax activation function with the cross-entropy error function, or for a linear activation function with the sum of squared errors), where Y is the output of the network (shape = [num_instances, num_outputs]) and T (shape = [num_instances, num_outputs]) is the desired output. Then you can backpropagate, i.e. the dE/dX of each layer is the dE/dY of the previous layer.
Now you can use the dE/dW and dE/db of each layer to update W and b.
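Putting the pieces together, here is a minimal sketch of a full training loop on XOR with plain batch gradient descent. The network size (3 hidden units), tanh activation, learning rate, and iteration count are all arbitrary choices of mine, and forward/backprop are repeated so the snippet is self-contained:

```python
import numpy as np

def forward(X, W, b, g):
    return g(X.dot(W.T) + b)

def backprop(dEdY, W, b, gd, Y, X):
    Deltas = gd(Y) * dEdY  # element-wise multiplication
    return Deltas.dot(W), Deltas.T.dot(X), Deltas.sum(axis=0)

g = np.tanh
gd = lambda Y: 1.0 - Y ** 2  # tanh'(A) expressed through Y

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
T = np.array([[0.], [1.], [1.], [0.]])  # XOR targets

rng = np.random.default_rng(0)
W1 = rng.standard_normal((3, 2))  # hidden layer, 3 units
b1 = np.zeros(3)
W2 = rng.standard_normal((1, 3))  # output layer
b2 = np.zeros(1)

def sse(Y):
    return 0.5 * np.sum((Y - T) ** 2)

initial_error = sse(forward(forward(X, W1, b1, g), W2, b2, g))

lr = 0.1
for _ in range(5000):
    H = forward(X, W1, b1, g)   # forward pass through both layers
    Y = forward(H, W2, b2, g)
    # backward pass: dE/dX of the output layer is dE/dY of the hidden layer
    dEdH, dEdW2, dEdb2 = backprop(Y - T, W2, b2, gd, Y, H)
    _, dEdW1, dEdb1 = backprop(dEdH, W1, b1, gd, H, X)
    W2 -= lr * dEdW2
    b2 -= lr * dEdb2
    W1 -= lr * dEdW1
    b1 -= lr * dEdb1

final_error = sse(forward(forward(X, W1, b1, g), W2, b2, g))
```

After training, the error should have dropped well below its starting value and the network output should approximate the XOR targets.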
Here is an example for C++: OpenANN.
Btw. you can compare the speed of instance-wise and batch-wise forward propagation:
In [1]: import timeit
In [2]: setup = """import numpy
...: W = numpy.random.rand(10, 5000)
...: X = numpy.random.rand(1000, 5000)"""
In [3]: timeit.timeit('[W.dot(x) for x in X]', setup=setup, number=10)
Out[3]: 0.5420958995819092
In [4]: timeit.timeit('X.dot(W.T)', setup=setup, number=10)
Out[4]: 0.22001314163208008
Upvotes: 2