alvas

Reputation: 122260

Unable to Learn XOR Representation Using a 2-Layer Multi-Layer Perceptron (MLP)

Using a PyTorch nn.Sequential model, I'm unable to learn all four representations of the XOR boolean function:

import numpy as np

import torch
from torch import nn
from torch.autograd import Variable
from torch import FloatTensor
from torch import optim

use_cuda = torch.cuda.is_available()

X = xor_input = np.array([[0,0], [0,1], [1,0], [1,1]])
Y = xor_output = np.array([[0,1,1,0]]).T

# Converting the X to PyTorch-able data structure.
X_pt = Variable(FloatTensor(X))
X_pt = X_pt.cuda() if use_cuda else X_pt
# Converting the Y to PyTorch-able data structure.
Y_pt = Variable(FloatTensor(Y), requires_grad=False)
Y_pt = Y_pt.cuda() if use_cuda else Y_pt

input_dim = 2
output_dim = 1
hidden_dim = 5

model = nn.Sequential(nn.Linear(input_dim, hidden_dim),
                      nn.Linear(hidden_dim, output_dim),
                      nn.Sigmoid())
criterion = nn.L1Loss()
learning_rate = 0.03
optimizer = optim.SGD(model.parameters(), lr=learning_rate)
num_epochs = 10000

for _ in range(num_epochs):
    predictions = model(X_pt)
    loss_this_epoch = criterion(predictions, Y_pt)
    loss_this_epoch.backward()
    optimizer.step()
    print([int(_pred > 0.5) for _pred in predictions], list(map(int, Y_pt)), loss_this_epoch.data[0])

After learning:

for _x, _y in zip(X_pt, Y_pt):
    prediction = model(_x)
    print('Input:\t', list(map(int, _x)))
    print('Pred:\t', int(prediction))
    print('Output:\t', int(_y))
    print('######')

[out]:

Input:   [0, 0]
Pred:    0
Output:  0
######
Input:   [0, 1]
Pred:    1
Output:  1
######
Input:   [1, 0]
Pred:    0
Output:  1
######
Input:   [1, 1]
Pred:    0
Output:  0
######

I've tried running the same code with a couple of random seeds, but it never managed to learn all four XOR representations.

Without PyTorch, I could easily train a model with self-defined derivative functions and manually performed backpropagation; see https://www.kaggle.io/svf/2342536/635025ecf1de59b71ea4fa03eb84f9f9/results.html#After-some-enlightenment

Why doesn't the 2-layer MLP in PyTorch learn the XOR representation?


How is this PyTorch model:

hidden_dim = 5

model = nn.Sequential(nn.Linear(input_dim, hidden_dim),
                      nn.Linear(hidden_dim, output_dim),
                      nn.Sigmoid())

different from the one with hand-written derivatives and a manually written backpropagation and optimizer step from https://www.kaggle.com/alvations/xor-with-mlp ?

Aren't they both the same single-hidden-layer perceptron network?


Updated

Strangely, adding an nn.Sigmoid() between the nn.Linear layers didn't work:

hidden_dim = 5

model = nn.Sequential(nn.Linear(input_dim, hidden_dim),
                      nn.Sigmoid(),
                      nn.Linear(hidden_dim, output_dim),
                      nn.Sigmoid())
criterion = nn.L1Loss()
learning_rate = 0.03
optimizer = optim.SGD(model.parameters(), lr=learning_rate)
num_epochs = 10000

for _ in range(num_epochs):
    predictions = model(X_pt)
    loss_this_epoch = criterion(predictions, Y_pt)
    loss_this_epoch.backward()
    optimizer.step()

for _x, _y in zip(X_pt, Y_pt):
    prediction = model(_x)
    print('Input:\t', list(map(int, _x)))
    print('Pred:\t', int(prediction))
    print('Output:\t', int(_y))
    print('######')

[out]:

Input:   [0, 0]
Pred:    0
Output:  0
######
Input:   [0, 1]
Pred:    1
Output:  1
######
Input:   [1, 0]
Pred:    1
Output:  1
######
Input:   [1, 1]
Pred:    1
Output:  0
######

But adding nn.ReLU() did:

model = nn.Sequential(nn.Linear(input_dim, hidden_dim),
                      nn.ReLU(), 
                      nn.Linear(hidden_dim, output_dim),
                      nn.Sigmoid())

...
for _x, _y in zip(X_pt, Y_pt):
    prediction = model(_x)
    print('Input:\t', list(map(int, _x)))
    print('Pred:\t', int(prediction))
    print('Output:\t', int(_y))
    print('######')

[out]:

Input:   [0, 0]
Pred:    0
Output:  0
######
Input:   [0, 1]
Pred:    1
Output:  1
######
Input:   [1, 0]
Pred:    1
Output:  1
######
Input:   [1, 1]
Pred:    0
Output:  0
######

Isn't a sigmoid enough for the non-linear activation?

I understand that ReLU fits the task of boolean output, but shouldn't the sigmoid function produce the same or a similar effect?


UPDATED 2

Running the same training 100 times:

from collections import Counter 
import random
random.seed(100)

import torch
from torch import nn
from torch.autograd import Variable
from torch import FloatTensor
from torch import optim
use_cuda = torch.cuda.is_available()


all_results=[]

for _ in range(100):
    hidden_dim = 2

    model = nn.Sequential(nn.Linear(input_dim, hidden_dim),
                          nn.ReLU(), # Does the sigmoid have a built-in bias?
                          nn.Linear(hidden_dim, output_dim),
                          nn.Sigmoid())

    criterion = nn.MSELoss()
    learning_rate = 0.03
    optimizer = optim.SGD(model.parameters(), lr=learning_rate)
    num_epochs = 3000

    for _ in range(num_epochs):
        predictions = model(X_pt)
        loss_this_epoch = criterion(predictions, Y_pt)
        loss_this_epoch.backward()
        optimizer.step()
        ##print([float(_pred) for _pred in predictions], list(map(int, Y_pt)), loss_this_epoch.data[0])

    x_pred = [int(model(_x)) for _x in X_pt]
    y_truth = list([int(_y[0]) for _y in Y_pt])
    all_results.append([x_pred == y_truth, x_pred, loss_this_epoch.data[0]])


tf, outputsss, losses__ = zip(*all_results)
print(Counter(tf))

It only managed to learn the XOR representation 18 out of 100 times... -_-|||

Upvotes: 4

Views: 2214

Answers (4)

osipov

Reputation: 890

You are almost there with your 2nd update. Here's a notebook with a working solution: https://colab.research.google.com/github/osipov/edu/blob/master/misc/xor.ipynb

Your mistake is using a sigmoid after the last linear layer, which makes it difficult for the optimizer to converge to the 0 and 1 values expected in your training dataset. Recall that the sigmoid approaches 0 and 1 only at negative and positive infinity, respectively.
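
For intuition, a quick check of how slowly the sigmoid saturates (a throwaway snippet, not part of the fix below):

import torch
print(torch.sigmoid(torch.tensor([0., 2., 4., 6.])))
# tensor([0.5000, 0.8808, 0.9820, 0.9975]) -- the output only gets close to 1 for large inputs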

So, your implementation (assuming PyTorch 1.7) should be

import torch as pt
from torch.nn.functional import mse_loss
pt.manual_seed(33);

model = pt.nn.Sequential(
    pt.nn.Linear(2, 5),
    pt.nn.ReLU(),
    pt.nn.Linear(5, 1)
)

X = pt.tensor([[0, 0],
               [0, 1],
               [1, 0],
               [1, 1]], dtype=pt.float32)

y = pt.tensor([0, 1, 1, 0], dtype=pt.float32).reshape(X.shape[0], 1)

EPOCHS = 100

optimizer = pt.optim.Adam(model.parameters(), lr = 0.03)

for epoch in range(EPOCHS):
  #forward
  y_est = model(X)
  
  #compute mean squared error loss
  loss = mse_loss(y_est, y)

  #backprop the loss gradients
  loss.backward()

  #update the model weights using the gradients
  optimizer.step()

  #empty the gradients for the next iteration
  optimizer.zero_grad()

which after execution trains the model, so that

model(X).round().abs()

returns

tensor([[0.],
        [1.],
        [1.],
        [0.]], grad_fn=<AbsBackward>)

which is the correct output.

Upvotes: 2

xtof54

Reputation: 1383

With the sigmoid between the layers and at the end, the most important thing is to update the weights in a purely stochastic way, i.e., to update after every single sample and to pick a sample at random at every iteration.

When respecting this, and when using a large learning rate (around 1.0), I've observed that the model usually learns XOR fine with a standard 2-layer PyTorch implementation (2-2-1 layer sizes), standard weight initialization, and no regularization.
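
A minimal sketch of that recipe (per-sample updates on a 2-2-1 network with sigmoids and a learning rate of 1.0; the tensor names and seeds here are my own, and convergence still depends on initialization):

import random
import torch
from torch import nn, optim

torch.manual_seed(0)
random.seed(0)

X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
Y = torch.tensor([[0.], [1.], [1.], [0.]])

# 2-2-1 network with a sigmoid between the layers and at the output.
model = nn.Sequential(nn.Linear(2, 2),
                      nn.Sigmoid(),
                      nn.Linear(2, 1),
                      nn.Sigmoid())

criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=1.0)  # large learning rate, as suggested

for _ in range(10000):
    i = random.randrange(len(X))     # pick a single sample at random...
    prediction = model(X[i:i+1])     # ...and update on it alone (purely stochastic)
    loss = criterion(prediction, Y[i:i+1])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(model(X).round())              # typically converges to 0, 1, 1, 0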

Upvotes: -1

zbdx

Reputation: 1

Here are a few simple changes to your code that should help put you on a better path. I've used ReLU activation functions internally, but a sigmoid will also work if used correctly. Also, if you want to try the SGD optimizer, you may want to turn down the learning rate by an order of magnitude or so.

model = nn.Sequential(nn.Linear(input_dim, hidden_dim),    
                      nn.ReLU(),       
                      nn.Linear(hidden_dim, output_dim),
                      nn.Sigmoid())
if use_cuda:
  model.cuda()

criterion = nn.BCELoss()
#criterion = nn.L1Loss()
#learning_rate = 0.03
#optimizer = optim.SGD(model.parameters(), lr=learning_rate)
optimizer = optim.Adam(model.parameters())
num_epochs = 10000


for epoch in range(num_epochs):
    predictions = model(X_pt)
    loss_this_epoch = criterion(predictions, Y_pt)
    model.zero_grad()
    loss_this_epoch.backward()
    optimizer.step()
    if epoch%1000 == 0: 
      print([float(_pred) for _pred in predictions], list(map(int, Y_pt)), loss_this_epoch.data[0])
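
For a quick sanity check after training (a minimal sketch reusing the X_pt and Y_pt from the question), threshold the sigmoid outputs at 0.5 instead of truncating them with int():

# Hard 0/1 predictions: a sigmoid output above 0.5 counts as class 1.
x_pred = [int(model(_x) > 0.5) for _x in X_pt]
y_truth = [int(_y[0]) for _y in Y_pt]
print(x_pred, y_truth, x_pred == y_truth)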

Upvotes: -1

wesolyromek

Reputation: 650

It's because nn.Linear has no activation built in, so your model is effectively a linear classifier, and XOR is the canonical example of a problem that can't be solved using linear classifiers.
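
To see this concretely, here is a small sketch (tensor names are illustrative) checking numerically that two stacked Linear layers with no activation in between collapse into a single linear map:

import torch
from torch import nn

torch.manual_seed(0)

# Two stacked Linear layers with no activation in between...
stacked = nn.Sequential(nn.Linear(2, 5), nn.Linear(5, 1))

# ...are equivalent to one Linear layer with W = W2 @ W1 and b = W2 @ b1 + b2.
W1, b1 = stacked[0].weight, stacked[0].bias
W2, b2 = stacked[1].weight, stacked[1].bias
collapsed = nn.Linear(2, 1)
with torch.no_grad():
    collapsed.weight.copy_(W2 @ W1)
    collapsed.bias.copy_(W2 @ b1 + b2)

x = torch.rand(4, 2)
print(torch.allclose(stacked(x), collapsed(x)))  # True: both compute the same linear function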

Change this:

model = nn.Sequential(nn.Linear(input_dim, hidden_dim),
                      nn.Linear(hidden_dim, output_dim),
                      nn.Sigmoid())

to that:

model = nn.Sequential(nn.Linear(input_dim, hidden_dim),
                      nn.Sigmoid(),
                      nn.Linear(hidden_dim, output_dim),
                      nn.Sigmoid())

and only then will your model be equivalent to the one from the linked Kaggle notebook.

Upvotes: 5
