Reputation: 7338
Following Andrew Trask's example, I want to implement a 3-layer neural network (1 input, 1 hidden, 1 output) with simple dropout, for binary classification.
If I include bias terms b1 and b2, then I would need to slightly modify Andrew's code as below.
import numpy as np
X = np.array([ [0,0,1],[0,1,1],[1,0,1],[1,1,1] ])
y = np.array([[0,1,1,0]]).T
alpha,hidden_dim,dropout_percent = (0.5,4,0.2)
synapse_0 = 2*np.random.random((X.shape[1],hidden_dim)) - 1
synapse_1 = 2*np.random.random((hidden_dim,1)) - 1
b1 = np.zeros(hidden_dim)
b2 = np.zeros(1)
for j in range(60000):
    # sigmoid activation function
    layer_1 = (1/(1+np.exp(-(np.dot(X,synapse_0) + b1))))
    # dropout
    layer_1 *= np.random.binomial([np.ones((len(X),hidden_dim))],1-dropout_percent)[0] * (1.0/(1-dropout_percent))
    layer_2 = 1/(1+np.exp(-(np.dot(layer_1,synapse_1) + b2)))
    # sigmoid derivative = s(x)(1-s(x))
    layer_2_delta = (layer_2 - y)*(layer_2*(1-layer_2))
    layer_1_delta = layer_2_delta.dot(synapse_1.T) * (layer_1 * (1-layer_1))
    synapse_1 -= (alpha * layer_1.T.dot(layer_2_delta))
    synapse_0 -= (alpha * X.T.dot(layer_1_delta))
    # the two bias updates below are where it fails (shape mismatch)
    b1 -= alpha*layer_1_delta
    b2 -= alpha*layer_2_delta
The problem is, of course, that with the code above the dimensions of b1 don't match the dimensions of layer_1_delta, and similarly for b2 and layer_2_delta.
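To make the mismatch concrete, here is a minimal sketch of the shapes involved. The deltas are random stand-ins with the same shapes as in the training loop (an assumption for illustration, not the real values), and the reduction over axis 0 is one plausible way to bring a delta down to the shape of its bias.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
hidden_dim = 4

b1 = np.zeros(hidden_dim)  # one bias per hidden unit: shape (4,)
b2 = np.zeros(1)           # one bias for the output unit: shape (1,)

# Stand-ins shaped like the deltas in the loop: one row per training example
layer_1_delta = rng.standard_normal((len(X), hidden_dim))  # shape (4, 4)
layer_2_delta = rng.standard_normal((len(X), 1))           # shape (4, 1)

try:
    # In-place update: the (4, 1) result cannot be written back into a (1,) array
    b2 -= 0.5 * layer_2_delta
    in_place_ok = True
except ValueError:
    in_place_ok = False

# Summing over the example axis yields one value per unit, matching the bias shape
reduced_1 = layer_1_delta.sum(axis=0)  # shape (4,)
reduced_2 = layer_2_delta.sum(axis=0)  # shape (1,)
print(in_place_ok, reduced_1.shape, reduced_2.shape)
```

The bias is shared by every training example, so each example contributes to its update, which is why the per-example rows need to be combined before the subtraction.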
I don't understand how the delta is calculated to update b1 and b2. According to Michael Nielsen's example, b1 and b2 should be updated by a delta, which in my code I believe to be layer_1_delta and layer_2_delta respectively.
What am I doing wrong here? Have I messed up the dimensionality of the deltas or of the biases? I feel it is the latter, because if I remove the biases from this code it works fine. Thanks in advance
Upvotes: 0
Views: 208
Reputation: 60065
First, I would rename the biases to b0 and b1 so that they correspond to synapse_0 and synapse_1, since that is where they belong. Each delta holds one row per training example, so it has to be summed over the examples before it can update a bias:
b1 -= alpha * 1.0 / m * np.sum(layer_2_delta, axis=0)
b0 -= alpha * 1.0 / m * np.sum(layer_1_delta, axis=0)
where m is the number of examples in the training set. Also, the dropout rate is much too high and actually hurts convergence. Putting it all together, the whole code is:
import numpy as np
X = np.array([ [0,0,1],[0,1,1],[1,0,1],[1,1,1] ])
m = X.shape[0]
y = np.array([[0,1,1,0]]).T
alpha,hidden_dim,dropout_percent = (0.5,4,0.02)
synapse_0 = 2*np.random.random((X.shape[1],hidden_dim)) - 1
synapse_1 = 2*np.random.random((hidden_dim,1)) - 1
b0 = np.zeros(hidden_dim)
b1 = np.zeros(1)
for j in range(10000):
    # sigmoid activation function
    layer_1 = (1/(1+np.exp(-(np.dot(X,synapse_0) + b0))))
    # dropout
    layer_1 *= np.random.binomial([np.ones((len(X),hidden_dim))],1-dropout_percent)[0] * (1.0/(1-dropout_percent))
    layer_2 = 1/(1+np.exp(-(np.dot(layer_1,synapse_1) + b1)))
    # sigmoid derivative = s(x)(1-s(x))
    layer_2_delta = (layer_2 - y)*(layer_2*(1-layer_2))
    layer_1_delta = layer_2_delta.dot(synapse_1.T) * (layer_1 * (1-layer_1))
    synapse_1 -= (alpha * layer_1.T.dot(layer_2_delta))
    synapse_0 -= (alpha * X.T.dot(layer_1_delta))
    # sum over the example axis so each unit gets its own bias update
    b1 -= alpha * 1.0 / m * np.sum(layer_2_delta, axis=0)
    b0 -= alpha * 1.0 / m * np.sum(layer_1_delta, axis=0)
print(layer_2)
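One way to sanity-check a bias update is to compare it against a finite-difference gradient. The sketch below does that under stated assumptions: dropout is switched off, the loss is taken to be the squared error 0.5 * sum((layer_2 - y)^2) (which is what the delta formula in the code corresponds to), and the helper names sigmoid, loss and numeric_grad are hypothetical, introduced only for this check.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(X, y, W0, b0, W1, b1):
    # Assumed loss: half the summed squared error, no dropout
    hidden = sigmoid(X @ W0 + b0)
    out = sigmoid(hidden @ W1 + b1)
    return 0.5 * np.sum((out - y) ** 2)

def numeric_grad(f, v, eps=1e-6):
    # Central finite differences, one coordinate at a time
    grad = np.zeros_like(v)
    for i in range(v.size):
        step = np.zeros_like(v)
        step[i] = eps
        grad[i] = (f(v + step) - f(v - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(1)
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([[0, 1, 1, 0]], dtype=float).T
W0 = 2 * rng.random((3, 4)) - 1
W1 = 2 * rng.random((4, 1)) - 1
b0 = 0.1 * rng.standard_normal(4)
b1 = 0.1 * rng.standard_normal(1)

# Analytic gradients, using the same delta formulas as the training loop
layer_1 = sigmoid(X @ W0 + b0)
layer_2 = sigmoid(layer_1 @ W1 + b1)
layer_2_delta = (layer_2 - y) * layer_2 * (1 - layer_2)
layer_1_delta = (layer_2_delta @ W1.T) * layer_1 * (1 - layer_1)
grad_b1 = layer_2_delta.sum(axis=0)  # shape (1,)
grad_b0 = layer_1_delta.sum(axis=0)  # shape (4,), one entry per hidden unit

num_b0 = numeric_grad(lambda b: loss(X, y, W0, b, W1, b1), b0)
num_b1 = numeric_grad(lambda b: loss(X, y, W0, b0, W1, b), b1)

b0_matches = np.allclose(grad_b0, num_b0, atol=1e-6)
b1_matches = np.allclose(grad_b1, num_b1, atol=1e-6)
print(b0_matches, b1_matches)
```

Dividing the summed delta by m, as in the code above, only rescales the step size for the biases relative to the weights; the direction of the update is the same.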
Upvotes: 1