Reputation: 1289
I created a neural network and attempted to train it; all was well until I added in a bias.
From what I gather, during training the bias adjusts to move the expected output up or down, and the weights tend towards values that help YHat emulate some function, so for a two-layer network:
output = tanh(tanh(X0W0 + b0)W1 + b1)
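To make the notation concrete, here is a minimal NumPy sketch of that forward pass (the shapes are my own illustration of a 1 -> 3 -> 1 network like the one printed further down, not my actual layer class):

import numpy as np

def forward(X, W0, b0, W1, b1):
    # hidden layer: affine transform followed by tanh
    H = np.tanh(X @ W0 + b0)
    # output layer: another affine transform followed by tanh
    return np.tanh(H @ W1 + b1)

# illustrative shapes: 4 samples, 1 input feature, 3 hidden units, 1 output
X = np.array([[0.0], [1.0], [2.0], [3.0]])
W0, b0 = np.random.randn(1, 3) * 0.1, np.zeros((1, 3))  # one bias per hidden unit
W1, b1 = np.random.randn(3, 1) * 0.1, np.zeros((1, 1))  # one bias for the output unit
YHat = forward(X, W0, b0, W1, b1)                        # shape (4, 1)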
In practice what I've found is that training drives all the weights W to near 0, and the biases b almost echo the trained output Y. This essentially makes the output work perfectly for the training data, but when you give it different kinds of data it will always give the same output.
This has caused quite some confusion. I know that the bias's role is to move the activation graph up or down, but when it comes to training it seems to make the entire purpose of the neural network irrelevant. Here is the code from my training method:
def train(self, X, Y, loss, epoch=10000):
    for i in range(epoch):
        # forward pass; track the summed error for plotting
        YHat = self.forward(X)
        loss.append(sum(Y - YHat))
        err = -(Y - YHat)
        # walk back through the layers, adjusting each one's weights and biases
        for l in self.__layers[::-1]:
            werr = np.sum(np.dot(l.localWGrad, err.T), axis=1)
            werr.shape = (l.height, 1)
            l.adjustWeights(werr)
            err = np.sum(err, axis=1)
            err.shape = (X.shape[0], 1)
            l.adjustBiases(err)
            # propagate the error down to the next layer
            err = np.multiply(err, l.localXGrad)
And here is the code for adjusting the weights and biases (note: epsilon is my training rate and lambda is the regularisation rate):
def adjustWeights(self, err):
    self.__weights = self.__weights - (err * self.__epsilon + self.__lambda * self.__weights)

def adjustBiases(self, err):
    a = np.sum(np.multiply(err, self.localPartialGrad), axis=1) * self.__epsilon
    a.shape = (err.shape[0], 1)
    self.__biases = self.__biases - a
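For comparison, here is what a textbook gradient-descent step with L2 weight decay would look like (a generic sketch, not my actual layer class; dW and db stand for the gradients of the loss with respect to the weights and biases):

def sgd_step(W, b, dW, db, epsilon, lambda_):
    # plain gradient descent; the L2 decay term only touches the weights
    W = W - epsilon * (dW + lambda_ * W)
    b = b - epsilon * db
    return W, b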
And here is the math I've done for this network.
Z0 = X0W0 + b0
X1 = relu(Z0)
Z1 = X1W1 + b1
X2 = relu(Z1)
a = Y - X2    # X2 is the network output, i.e. YHat
# Note the second part is for regularisation
loss = ((1/2)*(a^2)) + (lambda*(1/2)*(sum(W0^2) + sum(W1^2)))
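Or, the same loss as a small NumPy helper (a minimal sketch; lambda_ stands in for lambda, which is a reserved word in Python):

def regularised_loss(Y, X2, W0, W1, lambda_):
    # squared-error term plus the L2 penalty on both weight matrices
    data_term = 0.5 * np.sum((Y - X2) ** 2)
    reg_term = 0.5 * lambda_ * (np.sum(W0 ** 2) + np.sum(W1 ** 2))
    return data_term + reg_term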
And now the derivatives:
dloss/dW1 = -(Y - X2) * relu'(X1W1 + b1) * X1
dloss/dW0 = -(Y - X2) * relu'(X1W1 + b1) * W1 * relu'(X0W0 + b0) * X0
dloss/db1 = -(Y - X2) * relu'(X1W1 + b1)
dloss/db0 = -(Y - X2) * relu'(X1W1 + b1) * W1 * relu'(X0W0 + b0)
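For reference, here are those derivatives written out as a standalone NumPy backward pass (a generic sketch with my own helper names, not my layer class; it ignores the regularisation term and treats Y as the target):

def relu(z):
    return np.maximum(z, 0.0)

def relu_grad(z):
    # derivative of relu: 1 where the pre-activation is positive, else 0
    return (z > 0).astype(float)

def backward(X0, Y, W0, b0, W1, b1):
    # forward pass, keeping the intermediates
    Z0 = X0 @ W0 + b0                          # (n, h)
    X1 = relu(Z0)                              # (n, h)
    Z1 = X1 @ W1 + b1                          # (n, 1)
    X2 = relu(Z1)                              # (n, 1), the network output

    # backward pass for loss = 0.5 * sum((Y - X2)**2)
    delta1 = -(Y - X2) * relu_grad(Z1)         # dloss/dZ1, (n, 1)
    dW1 = X1.T @ delta1                        # dloss/dW1, (h, 1)
    db1 = delta1.sum(axis=0, keepdims=True)    # dloss/db1, (1, 1)

    delta0 = (delta1 @ W1.T) * relu_grad(Z0)   # dloss/dZ0, (n, h)
    dW0 = X0.T @ delta0                        # dloss/dW0, (1, h)
    db0 = delta0.sum(axis=0, keepdims=True)    # dloss/db0, (1, h)
    return dW0, db0, dW1, db1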
I'm guessing I'm doing something wrong, but I have no idea what it is. I tried training this network on the following inputs:
X = np.array([[0.0], [1.0], [2.0], [3.0]])
Xnorm = X / np.amax(X)
Y = np.array([[0.0], [2.0], [4.0], [6.0]])
Ynorm = Y / np.amax(Y)
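For reference, a quick print of what that scaling produces:

print(Xnorm.ravel())   # approx. [0.  0.3333  0.6667  1.]
print(Ynorm.ravel())   # approx. [0.  0.3333  0.6667  1.]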
And I get this as the output:
post training:
shape: (4, 1)
[[0. ]
[1.99799666]
[3.99070622]
[5.72358125]]
Expected:
[[0.]
[2.]
[4.]
[6.]]
Which seems great... until you forward something else:
shape: (4, 1)
[[2.]
[3.]
[4.]
[5.]]
Then I get:
shape: (4, 1)
[[0.58289512]
[2.59967085]
[4.31654068]
[5.74322541]]
Expected:
[[4.]
[6.]
[8.]
[10.]]
I thought "perhapse this is the evil 'Overfitting I've heard of" and decided to add in some regularisation, but even that doesn't really solve the issue, why would it when it makes sense from a logical perspective that it's faster, and more optimal to set the biases to equal the output and make the weights zero... Can someone explain what's going wrong in my thinking?
Here is the network structure post-training (note: if you multiply the output by the max of the training Y you will get the expected output):
===========================NeuralNetwork===========================
Layers:
===============Layer 0 :===============
Weights: (1, 3)
[[0.05539559 0.05539442 0.05539159]]
Biases: (4, 1)
[[0. ]
[0.22897166]
[0.56300199]
[1.30167665]]
==============\Layer 0 :===============
===============Layer 1 :===============
Weights: (3, 1)
[[0.29443245]
[0.29442639]
[0.29440642]]
Biases: (4, 1)
[[0. ]
[0.13199981]
[0.32762199]
[1.10023446]]
==============\Layer 1 :===============
==========================\NeuralNetwork===========================
The graph y = 2x has a y-intercept of 0 (it crosses the y-axis at the origin), and thus it would make sense for all the biases to be 0, as we aren't moving the graph up or down... right?
Thanks for reading this far!
edit:
Here is the loss graph: [loss graph image]
edit 2:
I just tried this with a single weight and output, and here is the network structure I got:
===========================NeuralNetwork===========================
Layers:
===============Layer 0 :===============
Weights: (1, 1)
[[0.47149317]]
Biases: (4, 1)
[[0. ]
[0.18813419]
[0.48377987]
[1.33644038]]
==============\Layer 0 :===============
==========================\NeuralNetwork===========================
and for this input:
shape: (4, 1)
[[2.]
[3.]
[4.]
[5.]]
I got this output:
shape: (4, 1)
[[4.41954787]
[5.53236625]
[5.89599366]
[5.99257962]]
when again it should be:
Expected:
[[4.]
[6.]
[8.]
[10.]]
Note that the problem with the biases persists; you would think that in this situation the weight would be 2 and the bias would be 0.
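As a sanity check on what the ideal fit would be, a plain least-squares line through the raw training data gives exactly that:

slope, intercept = np.polyfit(X.ravel(), Y.ravel(), 1)  # raw X and Y from above, not the normalised versions
print(slope, intercept)   # approx. 2.0 and 0.0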
Upvotes: 1
Views: 369
Reputation: 10139
Moved answer from OP's question
Turns out I never dealt with my training data properly. The input vector:
[[0.0], [1.0], [2.0], [3.0]]
was normalised: I divided this vector by the max value in the input, which was 3, and thus I got
[[0.0], [0.3333], [0.6666], [1.0]]
And for the target training vector Y I had
[[0.0], [2.0], [4.0], [6.0]]
and I foolishly decided to do the same with this vector, but with the max of Y, which was 6:
[[0.0], [0.333], [0.666], [1.0]]
So basically I was saying "hey network, mimic my input". This was my first error. The second error came as a result of more misunderstanding of the scaling.
For the training data this happened to work out: x = 1 normalises to 0.333, the network (which has learned to copy its input) outputs roughly 0.333, and multiplying by the max of the training Y (6) gives 0.333 * 6 = 2, the right answer. But if I try this again with a different set of data, say:
[[2.0], [3.0], [4.0], [5.0]]
then 2 normalises to 2/5 = 0.4, and the true relationship y = 2x means the scaled output should be 0.4 * 2 = 0.8, which only becomes the correct answer (4) if you multiply it back by 5, the max of this particular input set. In the real world we would have no way of knowing that 5 was the right factor to undo the scaling with, so I thought maybe it would be the max of the training Y, which was 6, and did 0.8 * 6 = 4.8 instead, which is wrong.
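(For what it's worth, the scaling itself could have been kept, as long as it is applied consistently: compute the scale factors once from the training data and reuse them for every new input and output. A hypothetical sketch, where net stands for the trained network and X_train / Y_train are the original training arrays:)

x_scale = np.amax(X_train)   # 3.0 for the training inputs above
y_scale = np.amax(Y_train)   # 6.0 for the training targets above

def predict(net, X_new):
    # scale new inputs with the training factor, then undo the
    # output scaling with the training factor for Y
    return net.forward(X_new / x_scale) * y_scale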
So that is where the strange behaviour of the biases and weights came from. After essentially getting rid of the normalisation, I was free to tweak the hyperparameters, and now for the base training data:
input:
X:
[[0.]
[1.]
[2.]
[3.]]
I get this output:
shape: (4, 1)
[[0.30926124]
[2.1030826 ]
[3.89690395]
[5.6907253 ]]
and for the extra testing data (not trained on):
shape: (4, 1)
[[2.]
[3.]
[4.]
[5.]]
I get this output:
shape: (4, 1)
[[3.89690395]
[5.6907253 ]
[7.48454666]
[9.27836801]]
So now I'm happy. I also changed my activation to a leaky ReLU, as it should fit a linear equation better (I think). I'm sure that with more testing data and more tweaking of the hyperparameters it would be a perfect fit. Thanks for the help everyone. Trying to explain my problem really put things into perspective.
Upvotes: 1