Reputation: 63
I am implementing my own neural network from scratch to test my understanding of the method. I thought things were going well, as the network managed to approximate the AND and XOR functions without issue, but it is having trouble learning to approximate a simple square function.
I have tried a variety of network configurations, with anywhere from 1 to 3 layers and 1 to 64 nodes per layer. I have varied the learning rate from 0.1 down to 1e-8 and implemented weight decay, as I thought some regularisation might provide insight into what went wrong. I have also implemented gradient checking, which is giving me conflicting answers: the difference varies greatly from run to run, ranging from a dreadful 0.6 to a fantastic 1e-10. I am using the leaky ReLU activation function and MSE as my cost function.
Could somebody help me spot what I am missing, or is this purely down to optimising the hyperparameters?
My code is as follows:
import matplotlib.pyplot as plt
import numpy as np
import Sub_Script as ss

# Create sample data set using X**2
X = np.expand_dims(np.linspace(0, 1, 201), axis=0)
y = X**2
plt.plot(X.T, y.T)

# Hyper-parameters
layer_dims = [1, 64, 1]
learning_rate = 0.000001
iterations = 50000
decay = 0.00000001
num_ex = y.shape[1]

# Initializations
num_layers = len(layer_dims)
weights = [None] + [np.random.randn(layer_dims[l], layer_dims[l-1])*np.sqrt(2/layer_dims[l-1]) for l in range(1, num_layers)]
biases = [None] + [np.zeros((layer_dims[l], 1)) for l in range(1, num_layers)]
dweights, dbiases, dw_approx, db_approx = ss.grad_check(weights, biases, num_layers, X, y, decay, num_ex)

# Main function: Iteration loop
for iter in range(iterations):
    # Main function: Forward Propagation
    z_values, acts = ss.forward_propagation(weights, biases, num_layers, X)
    dweights, dbiases = ss.backward_propagation(weights, biases, num_layers, z_values, acts, y)
    weights, biases = ss.update_paras(weights, biases, dweights, dbiases, learning_rate, decay, num_ex)
    if iter % (1000+1) == 0:
        print('Cost: ', ss.mse(acts[-1], y, weights, decay, num_ex))

# Gradient Checking
dweights, dbiases, dw_approx, db_approx = ss.grad_check(weights, biases, num_layers, X, y, decay, num_ex)

# Visualization
plt.plot(X.T, acts[-1].T)
With Sub_Script.py containing the neural network functions:
import numpy as np
import copy as cp

# Sub functions: forward and backward propagation, cost and activation functions

# Leaky ReLU Activation Function
def relu(x):
    return (x > 0) * x + (x < 0) * 0.01*x

# Leaky ReLU Activation Function Gradient
def relu_grad(x):
    return (x > 0) + (x < 0) * 0.01

# MSE Cost Function
def mse(prediction, actual, weights, decay, num_ex):
    return np.sum((actual - prediction) ** 2)/(actual.shape[1]) + (decay/(2*num_ex))*np.sum([np.sum(w) for w in weights[1:]])

# MSE Cost Function Gradient
def mse_grad(prediction, actual):
    return -2 * (actual - prediction)/(actual.shape[1])

# Forward Propagation
def forward_propagation(weights, biases, num_layers, act):
    acts = [[None] for i in range(num_layers)]
    z_values = [[None] for i in range(num_layers)]
    acts[0] = act
    for layer in range(1, num_layers):
        z_values[layer] = np.dot(weights[layer], acts[layer-1]) + biases[layer]
        acts[layer] = relu(z_values[layer])
    return z_values, acts

# Backward Propagation
def backward_propagation(weights, biases, num_layers, z_values, acts, y):
    dweights = [[None] for i in range(num_layers)]
    dbiases = [[None] for i in range(num_layers)]
    zgrad = mse_grad(acts[-1], y) * relu_grad(z_values[-1])
    dweights[-1] = np.dot(zgrad, acts[-2].T)
    dbiases[-1] = np.sum(zgrad, axis=1, keepdims=True)
    for layer in range(num_layers-2, 0, -1):
        zgrad = np.dot(weights[layer+1].T, zgrad) * relu_grad(z_values[layer])
        dweights[layer] = np.dot(zgrad, acts[layer-1].T)
        dbiases[layer] = np.sum(zgrad, axis=1, keepdims=True)
    return dweights, dbiases

# Update Parameters with Regularization
def update_paras(weights, biases, dweights, dbiases, learning_rate, decay, num_ex):
    weights = [None] + [w - learning_rate*(dw + (decay/num_ex)*w) for w, dw in zip(weights[1:], dweights[1:])]
    biases = [None] + [b - learning_rate*db for b, db in zip(biases[1:], dbiases[1:])]
    return weights, biases

# Gradient Checking
def grad_check(weights, biases, num_layers, X, y, decay, num_ex):
    z_values, acts = forward_propagation(weights, biases, num_layers, X)
    dweights, dbiases = backward_propagation(weights, biases, num_layers, z_values, acts, y)
    epsilon = 1e-7
    dw_approx = cp.deepcopy(weights)
    db_approx = cp.deepcopy(biases)
    for layer in range(1, num_layers):
        height = weights[layer].shape[0]
        width = weights[layer].shape[1]
        for i in range(height):
            for j in range(width):
                w_plus = cp.deepcopy(weights)
                w_plus[layer][i, j] += epsilon
                w_minus = cp.deepcopy(weights)
                w_minus[layer][i, j] -= epsilon
                _, temp_plus = forward_propagation(w_plus, biases, num_layers, X)
                cost_plus = mse(temp_plus[-1], y, w_plus, decay, num_ex)
                _, temp_minus = forward_propagation(w_minus, biases, num_layers, X)
                cost_minus = mse(temp_minus[-1], y, w_minus, decay, num_ex)
                dw_approx[layer][i, j] = (cost_plus - cost_minus)/(2*epsilon)
            b_plus = cp.deepcopy(biases)
            b_plus[layer][i, 0] += epsilon
            b_minus = cp.deepcopy(biases)
            b_minus[layer][i, 0] -= epsilon
            _, temp_plus = forward_propagation(weights, b_plus, num_layers, X)
            cost_plus = mse(temp_plus[-1], y, weights, decay, num_ex)
            _, temp_minus = forward_propagation(weights, b_minus, num_layers, X)
            cost_minus = mse(temp_minus[-1], y, weights, decay, num_ex)
            db_approx[layer][i, 0] = (cost_plus - cost_minus)/(2*epsilon)
    dweights_flat = [dw.flatten() for dw in dweights[1:]]
    dweights_flat = np.concatenate(dweights_flat, axis=None)
    dw_approx_flat = [dw.flatten() for dw in dw_approx[1:]]
    dw_approx_flat = np.concatenate(dw_approx_flat, axis=None)
    dbiases_flat = [db.flatten() for db in dbiases[1:]]
    dbiases_flat = np.concatenate(dbiases_flat, axis=None)
    db_approx_flat = [db.flatten() for db in db_approx[1:]]
    db_approx_flat = np.concatenate(db_approx_flat, axis=None)
    d_paras = np.concatenate([dweights_flat, dbiases_flat], axis=None)
    d_approx_paras = np.concatenate([dw_approx_flat, db_approx_flat], axis=None)
    difference = np.linalg.norm(d_paras - d_approx_paras)/(np.linalg.norm(d_paras) + np.linalg.norm(d_approx_paras))
    if difference > 2e-7:
        print("\033[93m" + "There is a mistake in the backward propagation! difference = " + str(difference) + "\033[0m")
    else:
        print("\033[92m" + "Your backward propagation works perfectly fine! difference = " + str(difference) + "\033[0m")
    return dweights, dbiases, dw_approx, db_approx
Edit: Corrected some old comments in the code to avoid confusion.
Edit 2: Thanks to @sid_508 for helping me find the main problem with my code! I also want to mention that I found a mistake in the way I had implemented the weight decay. After making the suggested change and removing the weight decay term entirely for now, the neural network works!
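For reference, one likely issue with the decay term as posted is that standard L2 weight decay penalises the squared weights, so the penalty in mse would use np.sum(w**2) rather than np.sum(w); that would also match the (decay/num_ex)*w term already used in update_paras. A minimal sketch of such a corrected cost, assuming the same signature and that numpy is imported as np (this is only my guess at the fix, not necessarily the exact mistake I made):

# Sketch: MSE cost with a standard L2 (squared-weight) decay term
def mse(prediction, actual, weights, decay, num_ex):
    data_term = np.sum((actual - prediction) ** 2) / actual.shape[1]
    # L2 penalty: sum of squared weights, scaled by decay/(2*num_ex)
    l2_term = (decay / (2 * num_ex)) * np.sum([np.sum(w ** 2) for w in weights[1:]])
    return data_term + l2_term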
Upvotes: 2
Views: 543
Reputation: 119
I ran your code and looked at the output it gave. The issue is that you use ReLU for the final layer as well, so you can't get a good fit; use no activation in the final layer and it should produce much better results.
The final-layer activation usually differs from the one you use in the hidden layers, and it depends on the kind of output you are going for. For continuous outputs use a linear activation (basically no activation); for classification use sigmoid/softmax.
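A minimal sketch of that change against the functions in Sub_Script.py, assuming numpy is imported as np and the relu, relu_grad and mse_grad helpers are in scope: keep leaky ReLU in the hidden layers but make the output layer linear, and since the derivative of the identity is 1, drop the relu_grad factor for the output layer in the backward pass. This is just one way to wire it in; the rest of the code stays as posted.

# Sketch only: linear (identity) output layer for regression
def forward_propagation(weights, biases, num_layers, act):
    acts = [[None] for i in range(num_layers)]
    z_values = [[None] for i in range(num_layers)]
    acts[0] = act
    for layer in range(1, num_layers):
        z_values[layer] = np.dot(weights[layer], acts[layer-1]) + biases[layer]
        if layer == num_layers - 1:
            acts[layer] = z_values[layer]        # linear output layer
        else:
            acts[layer] = relu(z_values[layer])  # leaky ReLU in hidden layers
    return z_values, acts

def backward_propagation(weights, biases, num_layers, z_values, acts, y):
    dweights = [[None] for i in range(num_layers)]
    dbiases = [[None] for i in range(num_layers)]
    # identity output: derivative is 1, so no relu_grad factor here
    zgrad = mse_grad(acts[-1], y)
    dweights[-1] = np.dot(zgrad, acts[-2].T)
    dbiases[-1] = np.sum(zgrad, axis=1, keepdims=True)
    for layer in range(num_layers-2, 0, -1):
        zgrad = np.dot(weights[layer+1].T, zgrad) * relu_grad(z_values[layer])
        dweights[layer] = np.dot(zgrad, acts[layer-1].T)
        dbiases[layer] = np.sum(zgrad, axis=1, keepdims=True)
    return dweights, dbiases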
Upvotes: 2