I'm in the early stages of understanding backpropagation
and I attempted to implement it myself.
The dataset I attempted to work with was the iris dataset of size (150, 4).
I'm only concerned with backpropagation, not gradient descent, so I'm just running my algorithm on one example to see whether I get a seemingly proper output.
My issue is computing the gradients for my initial weight matrix: I'm getting an error with the shapes.
I'd like my network to be something like this -
4 inputs, 8 hidden neurons, and 1 output neuron
My code is below. The error is on the last line: x has shape (4,1) and delta2 has shape (8,8), so I can't take the dot product. I just don't understand how I'm supposed to get a correct shape for delta2 if I'm following the algorithm correctly according to other sources.
from sklearn.datasets import load_iris
from keras.utils import to_categorical
import numpy as np
# LOAD DATA
data = load_iris()
X = data.data[:-20]
y = to_categorical(data.target[:-20])
# hold out only 20 test samples because the dataset is small
X_test = data.data[-20:]
y_test = to_categorical(data.target[-20:])
# INIT WEIGHTS - will try to add bias later on
h_neurons = 8  # hidden layer size
w1 = np.random.rand(np.shape(X)[1], h_neurons)
w2 = np.random.rand(h_neurons, 3)
def sigmoid(x, deriv=False):
    if deriv:
        return sigmoid(x)*(1-sigmoid(x))
    else:
        return 1/(1+np.exp(-x))
# Feed forward
x = X[1].reshape(4,1)
z1 = w1.T.dot(x) # need to transpose weight matrix
a1 = sigmoid(z1)
z2 = w2.T.dot(a1)
y_hat = sigmoid(z2,deriv=True) # output
# BACKPROP
y_ = y[1].reshape(3,1)
delta3 = np.multiply((y_hat - y_), sigmoid(z2, deriv=True))
dJdW2 = a1.dot(delta3) ## ERROR !!!
delta2 = np.dot(delta3, w2.T) * sigmoid(z1, deriv=True)
dJdW1 = np.dot(x.T, delta2) ## ERROR !!!
I thought I implemented backpropagation correctly, but apparently not. Can someone please point out where I went wrong?
I'm stuck; I've looked at various sources, and the code to compute dJdW (the derivative of the cost with respect to the weights) is roughly the same everywhere.
I think there are several problems in your code. Let's solve them step by step. First of all, here is the complete code:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
def sigmoid(x, deriv=False):
    if deriv:
        return sigmoid(x)*(1-sigmoid(x))
    else:
        return 1/(1+np.exp(-x))
data = load_iris()
X = data.data[:-20]
X = StandardScaler().fit_transform(X)
y = data.target[:-20]
y = y.reshape(-1,1)
w1 = np.random.rand(np.shape(X)[1], 8)
w2 = np.random.rand(8, 1)
z1 = np.dot(X, w1) #shape (130, 8)
a1 = sigmoid(z1)
z2 = np.dot(a1, w2) #shape (130,1)
y_hat = sigmoid(z2) # the output layer applies sigmoid itself, not its derivative
delta3 = ((y - y_hat) * sigmoid(z2, deriv=True)) #shape (130,1)
dJdW2 = a1.T.dot(delta3) #shape (8,1)
delta2 = np.dot(delta3, w2.T) * sigmoid(z1, deriv=True) #shape (130,8)
dJdW1 = np.dot(X.T, delta2) #shape (4,8)
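As a quick sanity check that every gradient matches the shape of its weight matrix, here is the same sequence of operations on synthetic data (numpy only, so it runs without sklearn; the random data is a stand-in for the scaled iris features, with the same shapes as the snippet above):

```python
import numpy as np

def sigmoid(x, deriv=False):
    if deriv:
        return sigmoid(x) * (1 - sigmoid(x))
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(0)
X = rng.standard_normal((130, 4))   # stand-in for the scaled iris features
y = rng.integers(0, 2, (130, 1))    # stand-in targets, shape (130, 1)

w1 = rng.random((4, 8))
w2 = rng.random((8, 1))

# forward pass
z1 = X @ w1                  # (130, 8)
a1 = sigmoid(z1)
z2 = a1 @ w2                 # (130, 1)
y_hat = sigmoid(z2)

# backward pass
delta3 = (y - y_hat) * sigmoid(z2, deriv=True)       # (130, 1)
dJdW2 = a1.T @ delta3                                # (8, 1) -- matches w2
delta2 = (delta3 @ w2.T) * sigmoid(z1, deriv=True)   # (130, 8)
dJdW1 = X.T @ delta2                                 # (4, 8) -- matches w1

assert dJdW1.shape == w1.shape
assert dJdW2.shape == w2.shape
```

Each gradient has exactly the shape of the weight matrix it updates, which is the invariant that was violated in the original code.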
So you have input X of shape (130, 4) and weights w1 of shape (4, 8); the result should have shape (130, 8). You compute it like this:
z1 = np.dot(X, w1)
a1 = sigmoid(z1)
Then you move from the hidden layer to the output layer, going from shape (130, 8) to shape (130, 1). And don't forget to apply the activation function to get y_hat:
z2 = np.dot(a1, w2)
y_hat = sigmoid(z2)
Now we can backpropagate. You had the right formula for the output delta (up to the sign convention; the complete code above uses y - y_hat rather than y_hat - y_, which only flips the direction of the update):
delta3 = (y - y_hat) * sigmoid(z2, deriv=True) #shape (130,1)
So you have delta3 of shape (130, 1) and a1 of shape (130, 8), and you need a gradient to update w2, so the result should have shape (8, 1):
dJdW2 = a1.T.dot(delta3) #shape (8,1)
In a similar way you get the value to update w1:
delta2 = np.dot(delta3, w2.T) * sigmoid(z1, deriv=True) #shape (130,8)
dJdW1 = np.dot(X.T, delta2) #shape (4,8)
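The question was only about backpropagation, but if you want to check that these gradients actually work, you can plug them into a plain gradient-descent loop. A minimal, self-contained sketch on synthetic data (the target rule, learning rate, and iteration count are arbitrary choices for illustration):

```python
import numpy as np

def sigmoid(x, deriv=False):
    if deriv:
        return sigmoid(x) * (1 - sigmoid(x))
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 4))
# hypothetical binary target: positive iff the first feature is positive
y = (X[:, :1] > 0).astype(float)   # shape (100, 1)

w1 = rng.random((4, 8))
w2 = rng.random((8, 1))
lr = 0.01                          # arbitrary learning rate

losses = []
for _ in range(500):
    # forward pass
    z1 = X @ w1
    a1 = sigmoid(z1)
    z2 = a1 @ w2
    y_hat = sigmoid(z2)
    losses.append(float(np.mean((y - y_hat) ** 2)))
    # backward pass, same formulas as above; with the (y - y_hat)
    # convention these are NEGATIVE gradients, hence the += updates
    delta3 = (y - y_hat) * sigmoid(z2, deriv=True)
    dJdW2 = a1.T @ delta3
    delta2 = (delta3 @ w2.T) * sigmoid(z1, deriv=True)
    dJdW1 = X.T @ delta2
    w1 += lr * dJdW1
    w2 += lr * dJdW2

print(losses[0], losses[-1])
```

If the gradients are right, the mean squared error printed at the end is lower than the initial one.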
So there it is. But I want to point out that you won't get good predictions on this dataset with such a model: the sigmoid's output ranges from 0 to 1, and the iris dataset has 3 classes. There are several ways to go: take only the data belonging to 2 classes, use a separate sigmoid output for each class, or use a softmax activation for the output layer.
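For the three-class case, the softmax option looks roughly like this. A minimal sketch with synthetic stand-in data; note that with one-hot targets and a cross-entropy loss, the output delta conveniently simplifies to y_hat - y:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def softmax(z):
    # subtract the row max for numerical stability
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.standard_normal((130, 4))        # stand-in for the scaled iris features
y = np.eye(3)[rng.integers(0, 3, 130)]   # one-hot targets, shape (130, 3)

w1 = rng.random((4, 8))
w2 = rng.random((8, 3))                  # 3 output units, one per class

# forward pass
a1 = sigmoid(X @ w1)
y_hat = softmax(a1 @ w2)                 # each row sums to 1

# backward pass; with softmax + cross-entropy the output delta is simply:
delta3 = y_hat - y                       # (130, 3)
dJdW2 = a1.T @ delta3                    # (8, 3)
delta2 = (delta3 @ w2.T) * a1 * (1 - a1) # (130, 8)
dJdW1 = X.T @ delta2                     # (4, 8)
```

The rest of the backward pass is unchanged; only the output layer's width and its delta differ from the sigmoid version.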