Solving XOR with 3 data points using Multi-Layered Perceptron

Question

The XOR problem is known to be solvable by a multi-layer perceptron: given all 4 boolean inputs and outputs, it trains and memorizes the weights needed to reproduce the I/O. E.g.

import numpy as np
np.random.seed(0)

def sigmoid(x): # Maps any real value into the range (0, 1).
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(sx):
    # See https://math.stackexchange.com/a/1225116
    return sx * (1 - sx)

# Error signal (truth - predicted); this sign is why the weight updates use += later on.
def cost(predicted, truth):
    return truth - predicted

xor_input = np.array([[0,0], [0,1], [1,0], [1,1]])
xor_output = np.array([[0,1,1,0]]).T

X = xor_input
Y = xor_output

# Define the shape of the weight vector.
num_data, input_dim = X.shape
# Let's set the dimensions for the intermediate layer.
hidden_dim = 5
# Initialize weights between the input layer and the hidden layer.
W1 = np.random.random((input_dim, hidden_dim))

# Define the shape of the output vector. 
output_dim = len(Y.T)
# Initialize weights between the hidden layer and the output layer.
W2 = np.random.random((hidden_dim, output_dim))

num_epochs = 10000
learning_rate = 1.0

for epoch_n in range(num_epochs):
    layer0 = X
    # Forward propagation.

    # Inside the perceptron, Step 2. 
    layer1 = sigmoid(np.dot(layer0, W1))
    layer2 = sigmoid(np.dot(layer1, W2))

    # Back propagation (Y -> layer2)

    # How much did we miss in the predictions?
    layer2_error = cost(layer2, Y)
    # In what direction is the target value?
    # Were we really close? If so, don't change too much.
    layer2_delta = layer2_error * sigmoid_derivative(layer2)


    # Back propagation (layer2 -> layer1)
    # How much did each layer1 value contribute to the layer2 error (according to the weights)?
    layer1_error = np.dot(layer2_delta, W2.T)
    layer1_delta = layer1_error * sigmoid_derivative(layer1)

    # update weights
    W2 +=  learning_rate * np.dot(layer1.T, layer2_delta)
    W1 +=  learning_rate * np.dot(layer0.T, layer1_delta)
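
The backward pass above calls `sigmoid_derivative` on `layer1` and `layer2`, which are already sigmoid outputs rather than pre-activations. A minimal sketch (not part of the original notebook) that checks `sx * (1 - sx)` against a central finite-difference estimate; the grid `x` and step `eps` are arbitrary choices made for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(sx):
    # Expects sx = sigmoid(x), i.e. the activation, not the raw input x.
    return sx * (1 - sx)

x = np.linspace(-4.0, 4.0, 9)  # arbitrary test points
eps = 1e-6                     # arbitrary finite-difference step

# Central finite-difference estimate of d(sigmoid)/dx.
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
# Analytic form used in the training loop, applied to the activations.
analytic = sigmoid_derivative(sigmoid(x))

max_abs_diff = float(np.max(np.abs(numeric - analytic)))
print(max_abs_diff)
```

If the two agree, the deltas in the loop are consistent with gradient descent on a squared error, given that `cost` returns `truth - predicted` and the updates use `+=`.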

We see that we've fully trained the network to memorize the outputs for XOR:

# On the training data
[int(prediction[0] > 0.5) for prediction in layer2]

[out]:

[0, 1, 1, 0]

If we re-feed the same inputs, we get the same output:

for x, y in zip(X, Y):
    layer1_prediction = sigmoid(np.dot(W1.T, x)) # Feed the (already seen) input through the trained W1.
    prediction = layer2_prediction = sigmoid(np.dot(W2.T, layer1_prediction)) # Feed the hidden activations through the trained W2.
    print(int(prediction[0] > 0.5), y)

[out]:

0 [0]
1 [1]
1 [1]
0 [0]
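
The row-by-row loop above computes `np.dot(W1.T, x)`, while the training loop computes `np.dot(layer0, W1)` for all rows at once. A small sketch suggesting the two forward passes agree; the weights here are random stand-ins with the same shapes as `W1` and `W2`, not the trained values:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(0)
# Stand-in weights (assumption): same shapes as W1 (2x5) and W2 (5x1) above,
# random values used only to compare the two code paths.
W1 = rng.random((2, 5))
W2 = rng.random((5, 1))
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Row by row, as in the prediction loop.
per_row = np.array(
    [sigmoid(np.dot(W2.T, sigmoid(np.dot(W1.T, x)))) for x in X]
)

# All rows at once, as in the training loop's forward pass.
batched = sigmoid(np.dot(sigmoid(np.dot(X, W1)), W2))

same = bool(np.allclose(per_row, batched))
print(same)
```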

But if we retrain the parameters (W1 and W2) without one of the data points, i.e.

xor_input = np.array([[0,0], [0,1], [1,0], [1,1]])
xor_output = np.array([[0,1,1,0]]).T

Let's drop the last row of data and use it as the unseen test point.

X = xor_input[:-1]
Y = xor_output[:-1]

And with the rest of the same code, regardless of how I change the hyperparameters, it's unable to learn the XOR function and reproduce the I/O.

for x, y in zip(xor_input, xor_output):
    layer1_prediction = sigmoid(np.dot(W1.T, x)) # Feed each input (including the held-out one) through the trained W1.
    prediction = layer2_prediction = sigmoid(np.dot(W2.T, layer1_prediction)) # Feed the hidden activations through the trained W2.
    print(int(prediction[0] > 0.5), y)

[out]:

0 [0]
1 [1]
1 [1]
1 [0]
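
To make the hold-out experiment easier to repeat, the training loop can be wrapped in a function and rerun with each row left out in turn. This is a sketch under assumptions: `train` and `predict` are hypothetical helper names, and the hyperparameters simply copy the values used above.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def train(X, Y, hidden_dim=5, num_epochs=10000, learning_rate=1.0, seed=0):
    """Hypothetical helper: the training loop above wrapped in a function."""
    rng = np.random.RandomState(seed)
    W1 = rng.random_sample((X.shape[1], hidden_dim))
    W2 = rng.random_sample((hidden_dim, Y.shape[1]))
    for _ in range(num_epochs):
        # Forward pass over all training rows.
        layer1 = sigmoid(np.dot(X, W1))
        layer2 = sigmoid(np.dot(layer1, W2))
        # Backward pass, same deltas as in the original loop.
        layer2_delta = (Y - layer2) * layer2 * (1 - layer2)
        layer1_delta = np.dot(layer2_delta, W2.T) * layer1 * (1 - layer1)
        W2 += learning_rate * np.dot(layer1.T, layer2_delta)
        W1 += learning_rate * np.dot(X.T, layer1_delta)
    return W1, W2

def predict(X, W1, W2):
    """Hypothetical helper: thresholded forward pass for every row of X."""
    return (sigmoid(np.dot(sigmoid(np.dot(X, W1)), W2)) > 0.5).astype(int).ravel()

xor_input = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
xor_output = np.array([[0, 1, 1, 0]]).T

results = []
for held_out in range(len(xor_input)):
    keep = np.arange(len(xor_input)) != held_out  # boolean mask of training rows
    W1, W2 = train(xor_input[keep], xor_output[keep])
    predictions = predict(xor_input, W1, W2)
    results.append((held_out, predictions))
    print(held_out, predictions, xor_output.ravel())
```

Each printed line shows the index of the held-out row, the predictions for all four inputs, and the true XOR outputs. Note that with the last row held out, the three remaining rows are also consistent with the OR function.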

Even if we shuffle the in-/output:

import random

# Shuffle the order of the inputs, keeping each input paired with its output.
_temp = list(zip(X, Y))
random.shuffle(_temp)
xor_input_shuff, xor_output_shuff = map(np.array, zip(*_temp))

We still can't train the XOR function fully:

for x, y in zip(xor_input, xor_output):
    layer1_prediction = sigmoid(np.dot(W1.T, x)) # Feed each input (including the held-out one) through the trained W1.
    prediction = layer2_prediction = sigmoid(np.dot(W2.T, layer1_prediction)) # Feed the hidden activations through the trained W2.
    print(x, int(prediction[0] > 0.5), y)

[out]:

[0 0] 1 [0]
[0 1] 1 [1]
[1 0] 1 [1]
[1 1] 0 [0]
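
One thing that may be worth checking about the shuffling step: the training loop above updates the weights from all rows at once, so each update is a sum over rows, and reordering the rows should only reorder the terms of that sum. A minimal sketch of that check, with random stand-in weights and a hypothetical helper `weight_updates` that mirrors one backward pass:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def weight_updates(X, Y, W1, W2):
    """Hypothetical helper: one full-batch backward pass, as in the loop above."""
    layer1 = sigmoid(np.dot(X, W1))
    layer2 = sigmoid(np.dot(layer1, W2))
    layer2_delta = (Y - layer2) * layer2 * (1 - layer2)
    layer1_delta = np.dot(layer2_delta, W2.T) * layer1 * (1 - layer1)
    return np.dot(X.T, layer1_delta), np.dot(layer1.T, layer2_delta)

rng = np.random.default_rng(0)
# Stand-in weights (assumption): same shapes as W1 and W2 above.
W1 = rng.random((2, 5))
W2 = rng.random((5, 1))
X = np.array([[0, 0], [0, 1], [1, 0]])
Y = np.array([[0, 1, 1]]).T

perm = rng.permutation(len(X))  # shuffle the rows of X and Y together
dW1, dW2 = weight_updates(X, Y, W1, W2)
dW1_shuff, dW2_shuff = weight_updates(X[perm], Y[perm], W1, W2)

order_invariant = bool(
    np.allclose(dW1, dW1_shuff) and np.allclose(dW2, dW2_shuff)
)
print(order_invariant)
```

This only compares a single update from identical starting weights; it says nothing about runs that start from different random initializations.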

So when the literature states that the multi-layered perceptron (a.k.a. basic deep learning) solves XOR, does it mean that it can fully learn and memorize the weights given the full set of inputs and outputs, but cannot generalize to the XOR problem when one of the data points is missing?

Here's the link to the Kaggle dataset where answerers can test the network for themselves: https://www.kaggle.com/alvations/xor-with-mlp/
