Reputation: 17
I am trying to implement a neural network which has around 2000 inputs.
I have run some tests with the iris data set to check it, and it seems to work, but when I run my own tests it gives wrong results: most of the time it produces the same output for every sample. I suspect the problem is somehow related to the bias handling or the gradient update; maybe you can spot the error or give me some advice. Here is part of the code for the backpropagation process.
def backward_propagation(parameters, cache, X, Y):
    # weights
    W1 = parameters['W1']
    W2 = parameters['W2']
    # outputs after the activation functions
    A1 = cache['A1']
    A2 = cache['A2']
    # output layer
    dZ2 = A2 - Y
    dW2 = np.dot(dZ2, A1.T)
    db2 = np.sum(dZ2, axis=1, keepdims=True)
    # hidden layer; the activation derivative is taken as 1 - A1**2
    dZ1 = np.multiply(np.dot(W2.T, dZ2), 1 - np.power(A1, 2))
    dW1 = np.dot(dZ1, X.T)
    db1 = np.sum(dZ1, axis=1, keepdims=True)
    gradient = {"dW1": dW1,
                "db1": db1,
                "dW2": dW2,
                "db2": db2}
    return gradient
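The gradients are then applied with a plain gradient-descent step; roughly, it looks like the sketch below (simplified, the update_parameters name and learning_rate value just stand in for my actual code):

def update_parameters(parameters, gradient, learning_rate=0.01):
    # standard gradient descent on every weight matrix and bias vector
    for key in ('W1', 'b1', 'W2', 'b2'):
        parameters[key] = parameters[key] - learning_rate * gradient['d' + key]
    return parameters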
Upvotes: 1
Views: 128
Reputation: 934
It is very difficult to tell whether it is really working as it should without the prediction and forward-propagation functions.
With those we could see exactly what is being done and check whether the backpropagation is really correct.
You are not deriving the sigmoid function correctly, and I think you are not applying the chain rule correctly either.
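For reference, the derivative of the sigmoid written in terms of its output A is A * (1 - A); a quick numerical sketch of that identity (separate from your code):

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.linspace(-5, 5, 11)
A = sigmoid(x)
eps = 1e-6
numerical = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)  # central difference
analytical = A * (1 - A)                                       # sigmoid derivative
print(np.allclose(numerical, analytical))                      # True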
From what I see you are using this architecture: an input layer, one fully connected hidden layer with a sigmoid activation, and one fully connected output unit, also with a sigmoid activation.
The gradients would be obtained by applying the chain rule.
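Writing the forward pass as $fc_1 = X W_1 + b_1$, $A_1 = \sigma(fc_1)$, $fc_2 = A_1 W_2 + b_2$, $A_2 = \sigma(fc_2)$, and taking the squared-error loss $L = \tfrac{1}{2}\sum (A_2 - Y)^2$ (the same setup assumed in the check code further down), the gradients come out as:

$\frac{\partial L}{\partial A_2} = A_2 - Y$
$\frac{\partial L}{\partial fc_2} = \frac{\partial L}{\partial A_2} \odot A_2 \odot (1 - A_2)$
$\frac{\partial L}{\partial W_2} = A_1^\top \frac{\partial L}{\partial fc_2}, \qquad \frac{\partial L}{\partial b_2} = \sum_{\text{samples}} \frac{\partial L}{\partial fc_2}$
$\frac{\partial L}{\partial A_1} = \frac{\partial L}{\partial fc_2} \, W_2^\top$
$\frac{\partial L}{\partial fc_1} = \frac{\partial L}{\partial A_1} \odot A_1 \odot (1 - A_1)$
$\frac{\partial L}{\partial W_1} = X^\top \frac{\partial L}{\partial fc_1}, \qquad \frac{\partial L}{\partial b_1} = \sum_{\text{samples}} \frac{\partial L}{\partial fc_1}$

where $\odot$ is element-wise multiplication and $A \odot (1 - A)$ is the sigmoid derivative written in terms of its output.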
Translated into code, this looks like the following:
# assuming the samples are along the rows (axis 0)
W1 = parameters['W1']
W2 = parameters['W2']
# outputs after the activation functions
A1 = cache['A1']
A2 = cache['A2']
# output layer
dA2 = A2 - Y                # dL/dA2
dfc2 = dA2 * A2 * (1 - A2)  # dL/dfc2, sigmoid derivative A2*(1 - A2)
dW2 = np.dot(A1.T, dfc2)
db2 = np.sum(dfc2, axis=0)  # sum over the samples
# hidden layer
dA1 = np.dot(dfc2, W2.T)
dfc1 = dA1 * A1 * (1 - A1)  # sigmoid derivative again
dW1 = np.dot(X.T, dfc1)
db1 = np.sum(dfc1, axis=0)  # sum over the samples
gradient = {
    "dW1": dW1,
    "db1": db1,
    "dW2": dW2,
    "db2": db2
}
I checked it with the following code:
import numpy as np

# random toy parameters: 30 inputs, 10 hidden units, 1 output
W1 = np.random.rand(30, 10)
b1 = np.random.rand(10)
W2 = np.random.rand(10, 1)
b2 = np.random.rand(1)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

X = np.random.rand(100, 30)
Y = np.ones(shape=(100, 1))  # ...

for i in range(100000000):
    # forward pass
    fc1 = X.dot(W1) + b1
    A1 = sigmoid(fc1)
    fc2 = A1.dot(W2) + b2
    A2 = sigmoid(fc2)
    L = 0.5 * np.sum((A2 - Y)**2)  # squared-error loss; dL/dA2 = A2 - Y
    print(L)
    # backward pass
    dA2 = A2 - Y
    dfc2 = dA2 * A2 * (1 - A2)
    dW2 = np.dot(A1.T, dfc2)
    db2 = np.sum(dfc2, axis=0)
    dA1 = np.dot(dfc2, W2.T)
    dfc1 = dA1 * A1 * (1 - A1)
    dW1 = np.dot(X.T, dfc1)
    db1 = np.sum(dfc1, axis=0)
    gradient = {
        "dW1": dW1,
        "db1": db1,
        "dW2": dW2,
        "db2": db2
    }
    # gradient-descent update
    W1 -= 0.1 * dW1
    W2 -= 0.1 * dW2
    b1 -= 0.1 * db1
    b2 -= 0.1 * db2
If your last activation is a sigmoid, the output will be between 0 and 1. Keep in mind that this is normally used to represent a probability, and that cross-entropy is normally used as the loss in that case.
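As a sketch of that alternative (reusing A2 and Y from the loop above; eps is just a small constant added to avoid log(0)): with the binary cross-entropy loss, the gradient with respect to the pre-sigmoid values fc2 simplifies to A2 - Y, and the rest of the backward pass stays the same.

eps = 1e-12
# binary cross-entropy over the batch
L = -np.sum(Y * np.log(A2 + eps) + (1 - Y) * np.log(1 - A2 + eps))
# combined gradient of sigmoid + cross-entropy with respect to fc2
dfc2 = A2 - Y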
Upvotes: 0