Reputation: 63
I am implementing my own neural network from scratch to test my understanding of the method. I thought things were going well, as the network managed to approximate the AND and XOR functions without issue, but it is having trouble learning to approximate a simple square function.
I have tried a variety of network configurations, with anywhere from 1 to 3 layers and 1 to 64 nodes per layer. I have varied the learning rate from 0.1 down to 1e-8 and implemented weight decay, as I thought some regularisation might provide insight into what went wrong. I have also implemented gradient checking, which is giving me conflicting answers: the difference varies greatly from run to run, ranging from a dreadful 0.6 to a fantastic 1e-10. I am using the leaky ReLU activation function and MSE as my cost function.
Could somebody help me spot what I am missing, or is this purely down to optimising the hyperparameters?
My code is as follows:
import matplotlib.pyplot as plt
import numpy as np
import Sub_Script as ss

# Create sample data set using X**2
X = np.expand_dims(np.linspace(0, 1, 201), axis=0)
y = X**2
plt.plot(X.T, y.T)

# Hyper-parameters
layer_dims = [1, 64, 1]
learning_rate = 0.000001
iterations = 50000
decay = 0.00000001
num_ex = y.shape[1]

# Initializations
num_layers = len(layer_dims)
weights = [None] + [np.random.randn(layer_dims[l], layer_dims[l-1])*np.sqrt(2/layer_dims[l-1]) for l in range(1, num_layers)]
biases = [None] + [np.zeros((layer_dims[l], 1)) for l in range(1, num_layers)]
dweights, dbiases, dw_approx, db_approx = ss.grad_check(weights, biases, num_layers, X, y, decay, num_ex)

# Main function: Iteration loop
for iter in range(iterations):
    # Main function: Forward Propagation
    z_values, acts = ss.forward_propagation(weights, biases, num_layers, X)
    dweights, dbiases = ss.backward_propagation(weights, biases, num_layers, z_values, acts, y)
    weights, biases = ss.update_paras(weights, biases, dweights, dbiases, learning_rate, decay, num_ex)
    if iter % (1000+1) == 0:
        print('Cost: ', ss.mse(acts[-1], y, weights, decay, num_ex))

# Gradient Checking
dweights, dbiases, dw_approx, db_approx = ss.grad_check(weights, biases, num_layers, X, y, decay, num_ex)

# Visualization
plt.plot(X.T, acts[-1].T)
With Sub_Script.py containing the neural network functions:
import numpy as np
import copy as cp

# Sub functions: forward and backward propagation, cost and activation functions

# Leaky ReLU Activation Function
def relu(x):
    return (x > 0) * x + (x < 0) * 0.01*x

# Leaky ReLU Activation Function Gradient
def relu_grad(x):
    return (x > 0) + (x < 0) * 0.01

# MSE Cost Function
def mse(prediction, actual, weights, decay, num_ex):
    return np.sum((actual - prediction) ** 2)/(actual.shape[1]) + (decay/(2*num_ex))*np.sum([np.sum(w) for w in weights[1:]])

# MSE Cost Function Gradient
def mse_grad(prediction, actual):
    return -2 * (actual - prediction)/(actual.shape[1])

# Forward Propagation
def forward_propagation(weights, biases, num_layers, act):
    acts = [[None] for i in range(num_layers)]
    z_values = [[None] for i in range(num_layers)]
    acts[0] = act
    for layer in range(1, num_layers):
        z_values[layer] = np.dot(weights[layer], acts[layer-1]) + biases[layer]
        acts[layer] = relu(z_values[layer])
    return z_values, acts

# Backward Propagation
def backward_propagation(weights, biases, num_layers, z_values, acts, y):
    dweights = [[None] for i in range(num_layers)]
    dbiases = [[None] for i in range(num_layers)]
    zgrad = mse_grad(acts[-1], y) * relu_grad(z_values[-1])
    dweights[-1] = np.dot(zgrad, acts[-2].T)
    dbiases[-1] = np.sum(zgrad, axis=1, keepdims=True)
    for layer in range(num_layers-2, 0, -1):
        zgrad = np.dot(weights[layer+1].T, zgrad) * relu_grad(z_values[layer])
        dweights[layer] = np.dot(zgrad, acts[layer-1].T)
        dbiases[layer] = np.sum(zgrad, axis=1, keepdims=True)
    return dweights, dbiases

# Update Parameters with Regularization
def update_paras(weights, biases, dweights, dbiases, learning_rate, decay, num_ex):
    weights = [None] + [w - learning_rate*(dw + (decay/num_ex)*w) for w, dw in zip(weights[1:], dweights[1:])]
    biases = [None] + [b - learning_rate*db for b, db in zip(biases[1:], dbiases[1:])]
    return weights, biases

# Gradient Checking
def grad_check(weights, biases, num_layers, X, y, decay, num_ex):
    z_values, acts = forward_propagation(weights, biases, num_layers, X)
    dweights, dbiases = backward_propagation(weights, biases, num_layers, z_values, acts, y)
    epsilon = 1e-7
    dw_approx = cp.deepcopy(weights)
    db_approx = cp.deepcopy(biases)
    for layer in range(1, num_layers):
        height = weights[layer].shape[0]
        width = weights[layer].shape[1]
        for i in range(height):
            for j in range(width):
                w_plus = cp.deepcopy(weights)
                w_plus[layer][i, j] += epsilon
                w_minus = cp.deepcopy(weights)
                w_minus[layer][i, j] -= epsilon
                _, temp_plus = forward_propagation(w_plus, biases, num_layers, X)
                cost_plus = mse(temp_plus[-1], y, w_plus, decay, num_ex)
                _, temp_minus = forward_propagation(w_minus, biases, num_layers, X)
                cost_minus = mse(temp_minus[-1], y, w_minus, decay, num_ex)
                dw_approx[layer][i, j] = (cost_plus - cost_minus)/(2*epsilon)
            b_plus = cp.deepcopy(biases)
            b_plus[layer][i, 0] += epsilon
            b_minus = cp.deepcopy(biases)
            b_minus[layer][i, 0] -= epsilon
            _, temp_plus = forward_propagation(weights, b_plus, num_layers, X)
            cost_plus = mse(temp_plus[-1], y, weights, decay, num_ex)
            _, temp_minus = forward_propagation(weights, b_minus, num_layers, X)
            cost_minus = mse(temp_minus[-1], y, weights, decay, num_ex)
            db_approx[layer][i, 0] = (cost_plus - cost_minus)/(2*epsilon)
    dweights_flat = [dw.flatten() for dw in dweights[1:]]
    dweights_flat = np.concatenate(dweights_flat, axis=None)
    dw_approx_flat = [dw.flatten() for dw in dw_approx[1:]]
    dw_approx_flat = np.concatenate(dw_approx_flat, axis=None)
    dbiases_flat = [db.flatten() for db in dbiases[1:]]
    dbiases_flat = np.concatenate(dbiases_flat, axis=None)
    db_approx_flat = [db.flatten() for db in db_approx[1:]]
    db_approx_flat = np.concatenate(db_approx_flat, axis=None)
    d_paras = np.concatenate([dweights_flat, dbiases_flat], axis=None)
    d_approx_paras = np.concatenate([dw_approx_flat, db_approx_flat], axis=None)
    difference = np.linalg.norm(d_paras - d_approx_paras)/(np.linalg.norm(d_paras) + np.linalg.norm(d_approx_paras))
    if difference > 2e-7:
        print("\033[93m" + "There is a mistake in the backward propagation! difference = " + str(difference) + "\033[0m")
    else:
        print("\033[92m" + "Your backward propagation works perfectly fine! difference = " + str(difference) + "\033[0m")
    return dweights, dbiases, dw_approx, db_approx
Edit: Corrected some old comments in the code to avoid confusion.
Edit 2: Thanks to @sid_508 for helping me find the main problem with my code! I also want to mention that I found a mistake in the way I had implemented the weight decay. After making the suggested change and removing the weight decay term entirely for now, the neural network works!
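For reference, one likely issue with the decay term as posted is that standard L2 weight decay penalises the squared weights, so the penalty in mse would use np.sum(w**2) rather than np.sum(w); that would also match the (decay/num_ex)*w term already used in update_paras. A minimal sketch of such a corrected cost, assuming the same signature and that numpy is imported as np (this is only my guess at the fix, not necessarily the exact mistake I made):

# Sketch: MSE cost with a standard L2 (squared-weight) decay term
def mse(prediction, actual, weights, decay, num_ex):
    data_term = np.sum((actual - prediction) ** 2) / actual.shape[1]
    # L2 penalty: sum of squared weights, scaled by decay/(2*num_ex)
    l2_term = (decay / (2 * num_ex)) * np.sum([np.sum(w ** 2) for w in weights[1:]])
    return data_term + l2_term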
Upvotes: 2
Views: 543
Reputation: 119
I ran your code and looked at the output it gave. The issue is that you use ReLU for the final layer as well, so you can't get a good fit; use no activation in the final layer and it should produce much better results.
The final-layer activation usually differs from the one you use in the hidden layers, and it depends on the kind of output you are going for. For continuous outputs use a linear activation (basically no activation); for classification use sigmoid/softmax.
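A minimal sketch of that change against the functions in Sub_Script.py, assuming numpy is imported as np and the relu, relu_grad and mse_grad helpers are in scope: keep leaky ReLU in the hidden layers but make the output layer linear, and since the derivative of the identity is 1, drop the relu_grad factor for the output layer in the backward pass. This is just one way to wire it in; the rest of the code stays as posted.

# Sketch only: linear (identity) output layer for regression
def forward_propagation(weights, biases, num_layers, act):
    acts = [[None] for i in range(num_layers)]
    z_values = [[None] for i in range(num_layers)]
    acts[0] = act
    for layer in range(1, num_layers):
        z_values[layer] = np.dot(weights[layer], acts[layer-1]) + biases[layer]
        if layer == num_layers - 1:
            acts[layer] = z_values[layer]        # linear output layer
        else:
            acts[layer] = relu(z_values[layer])  # leaky ReLU in hidden layers
    return z_values, acts

def backward_propagation(weights, biases, num_layers, z_values, acts, y):
    dweights = [[None] for i in range(num_layers)]
    dbiases = [[None] for i in range(num_layers)]
    # identity output: derivative is 1, so no relu_grad factor here
    zgrad = mse_grad(acts[-1], y)
    dweights[-1] = np.dot(zgrad, acts[-2].T)
    dbiases[-1] = np.sum(zgrad, axis=1, keepdims=True)
    for layer in range(num_layers-2, 0, -1):
        zgrad = np.dot(weights[layer+1].T, zgrad) * relu_grad(z_values[layer])
        dweights[layer] = np.dot(zgrad, acts[layer-1].T)
        dbiases[layer] = np.sum(zgrad, axis=1, keepdims=True)
    return dweights, dbiases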
Upvotes: 2