Reputation: 122260
Using a PyTorch nn.Sequential model, I'm unable to learn all four representations of the XOR boolean function:
import numpy as np
import torch
from torch import nn
from torch.autograd import Variable
from torch import FloatTensor
from torch import optim
use_cuda = torch.cuda.is_available()
X = xor_input = np.array([[0,0], [0,1], [1,0], [1,1]])
Y = xor_output = np.array([[0,1,1,0]]).T
# Converting the X to PyTorch-able data structure.
X_pt = Variable(FloatTensor(X))
X_pt = X_pt.cuda() if use_cuda else X_pt
# Converting the Y to PyTorch-able data structure.
Y_pt = Variable(FloatTensor(Y), requires_grad=False)
Y_pt = Y_pt.cuda() if use_cuda else Y_pt
input_dim = 2    # two boolean inputs
output_dim = 1   # one boolean output
hidden_dim = 5

model = nn.Sequential(nn.Linear(input_dim, hidden_dim),
                      nn.Linear(hidden_dim, output_dim),
                      nn.Sigmoid())

criterion = nn.L1Loss()
learning_rate = 0.03
optimizer = optim.SGD(model.parameters(), lr=learning_rate)
num_epochs = 10000

for _ in range(num_epochs):
    predictions = model(X_pt)
    loss_this_epoch = criterion(predictions, Y_pt)
    loss_this_epoch.backward()
    optimizer.step()
    print([int(_pred > 0.5) for _pred in predictions], list(map(int, Y_pt)), loss_this_epoch.data[0])
After learning:
for _x, _y in zip(X_pt, Y_pt):
    prediction = model(_x)
    print('Input:\t', list(map(int, _x)))
    print('Pred:\t', int(prediction))
    print('Output:\t', int(_y))
    print('######')
[out]:
Input: [0, 0]
Pred: 0
Output: 0
######
Input: [0, 1]
Pred: 1
Output: 1
######
Input: [1, 0]
Pred: 0
Output: 1
######
Input: [1, 1]
Pred: 0
Output: 0
######
I've tried running the same code with a couple of random seeds, but it never managed to learn all four XOR representations.
Without PyTorch, I could easily train a model with self-defined derivative functions and manually performed backpropagation; see https://www.kaggle.io/svf/2342536/635025ecf1de59b71ea4fa03eb84f9f9/results.html#After-some-enlightenment
Why is it that the 2-layer MLP in PyTorch didn't learn the XOR representation?
How is this model in PyTorch:

hidden_dim = 5
model = nn.Sequential(nn.Linear(input_dim, hidden_dim),
                      nn.Linear(hidden_dim, output_dim),
                      nn.Sigmoid())

different from the one that is hand-written with the derivatives and the manually written backpropagation and optimizer step from https://www.kaggle.com/alvations/xor-with-mlp ?
Aren't they the same single-hidden-layer perceptron network?
Strangely, adding a nn.Sigmoid() between the nn.Linear layers didn't work:
hidden_dim = 5

model = nn.Sequential(nn.Linear(input_dim, hidden_dim),
                      nn.Sigmoid(),
                      nn.Linear(hidden_dim, output_dim),
                      nn.Sigmoid())

criterion = nn.L1Loss()
learning_rate = 0.03
optimizer = optim.SGD(model.parameters(), lr=learning_rate)
num_epochs = 10000

for _ in range(num_epochs):
    predictions = model(X_pt)
    loss_this_epoch = criterion(predictions, Y_pt)
    loss_this_epoch.backward()
    optimizer.step()

for _x, _y in zip(X_pt, Y_pt):
    prediction = model(_x)
    print('Input:\t', list(map(int, _x)))
    print('Pred:\t', int(prediction))
    print('Output:\t', int(_y))
    print('######')
[out]:
Input: [0, 0]
Pred: 0
Output: 0
######
Input: [0, 1]
Pred: 1
Output: 1
######
Input: [1, 0]
Pred: 1
Output: 1
######
Input: [1, 1]
Pred: 1
Output: 0
######
But adding nn.ReLU() did:
model = nn.Sequential(nn.Linear(input_dim, hidden_dim),
                      nn.ReLU(),
                      nn.Linear(hidden_dim, output_dim),
                      nn.Sigmoid())

...

for _x, _y in zip(X_pt, Y_pt):
    prediction = model(_x)
    print('Input:\t', list(map(int, _x)))
    print('Pred:\t', int(prediction))
    print('Output:\t', int(_y))
    print('######')
[out]:
Input: [0, 0]
Pred: 0
Output: 0
######
Input: [0, 1]
Pred: 1
Output: 1
######
Input: [1, 0]
Pred: 1
Output: 1
######
Input: [1, 1]
Pred: 1
Output: 0
######
Isn't a sigmoid enough as the non-linear activation?
I understand that ReLU fits the task of boolean output, but shouldn't the Sigmoid function produce the same/similar effect?
Running the same training 100 times:
from collections import Counter
import random
random.seed(100)

import torch
from torch import nn
from torch.autograd import Variable
from torch import FloatTensor
from torch import optim
use_cuda = torch.cuda.is_available()

all_results = []

for _ in range(100):
    hidden_dim = 2
    model = nn.Sequential(nn.Linear(input_dim, hidden_dim),
                          nn.ReLU(),  # Does the sigmoid have a built-in bias?
                          nn.Linear(hidden_dim, output_dim),
                          nn.Sigmoid())
    criterion = nn.MSELoss()
    learning_rate = 0.03
    optimizer = optim.SGD(model.parameters(), lr=learning_rate)
    num_epochs = 3000
    for _ in range(num_epochs):
        predictions = model(X_pt)
        loss_this_epoch = criterion(predictions, Y_pt)
        loss_this_epoch.backward()
        optimizer.step()
        ##print([float(_pred) for _pred in predictions], list(map(int, Y_pt)), loss_this_epoch.data[0])

    x_pred = [int(model(_x)) for _x in X_pt]
    y_truth = list([int(_y[0]) for _y in Y_pt])
    all_results.append([x_pred == y_truth, x_pred, loss_this_epoch.data[0]])

tf, outputsss, losses__ = zip(*all_results)
print(Counter(tf))
It only managed to learn the XOR representation 18 out of 100 times... -_-|||
Upvotes: 4
Views: 2214
Reputation: 890
You are almost there with your 2nd update. Here's a notebook with a working solution: https://colab.research.google.com/github/osipov/edu/blob/master/misc/xor.ipynb
Your mistake is using a sigmoid after the last linear layer, which makes it difficult for the optimizer to converge to the 0 and 1 values expected in your training dataset. Recall that a sigmoid approaches 0 and 1 only at negative and positive infinity, respectively.
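For instance, here is a quick illustrative check (not from the linked notebook) of how large the pre-activation has to be before a sigmoid gets anywhere near its 0/1 targets:

import torch

# Sigmoid only approaches 0 and 1 asymptotically: to output ~0.99 the
# pre-activation already has to be about +4.6, and it never reaches 1 exactly,
# so with an MSE/L1 target of exactly 0 or 1 the weights must keep growing.
z = torch.tensor([-10.0, -4.6, 0.0, 4.6, 10.0])
print(torch.sigmoid(z))  # roughly [0.00005, 0.01, 0.5, 0.99, 0.99995]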
So, your implementation (assuming PyTorch 1.7) should be
import torch as pt
from torch.nn.functional import mse_loss

pt.manual_seed(33)

model = pt.nn.Sequential(
    pt.nn.Linear(2, 5),
    pt.nn.ReLU(),
    pt.nn.Linear(5, 1)
)

X = pt.tensor([[0, 0],
               [0, 1],
               [1, 0],
               [1, 1]], dtype=pt.float32)

y = pt.tensor([0, 1, 1, 0], dtype=pt.float32).reshape(X.shape[0], 1)

EPOCHS = 100
optimizer = pt.optim.Adam(model.parameters(), lr=0.03)

for epoch in range(EPOCHS):
    # forward
    y_est = model(X)
    # compute mean squared error loss
    loss = mse_loss(y_est, y)
    # backprop the loss gradients
    loss.backward()
    # update the model weights using the gradients
    optimizer.step()
    # empty the gradients for the next iteration
    optimizer.zero_grad()
which after execution trains the model, so that

model(X).round().abs()

returns

tensor([[0.],
        [1.],
        [1.],
        [0.]], grad_fn=<AbsBackward>)
which is the correct output.
Upvotes: 2
Reputation: 1383
With a sigmoid both between the layers and at the end, the most important thing is to update the weights in a purely stochastic way, i.e., update after every single sample and pick a sample at random at every iteration.
When respecting this, and when using a large learning rate (around 1.0), I've observed that the model usually learns XOR fine with a standard 2-layer PyTorch implementation (2-2-1 layer sizes), with standard weight initialization and no regularization.
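A minimal sketch of such a per-sample training loop, assuming MSE loss and the 2-2-1 sizes mentioned above (the seed and number of steps are illustrative, not prescriptive):

import torch
from torch import nn, optim

torch.manual_seed(0)

X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([[0.], [1.], [1.], [0.]])

# 2-2-1 network with a sigmoid between the layers and at the end.
model = nn.Sequential(nn.Linear(2, 2), nn.Sigmoid(),
                      nn.Linear(2, 1), nn.Sigmoid())
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=1.0)  # large learning rate

for step in range(10000):
    i = torch.randint(len(X), (1,)).item()  # pick one sample at random
    optimizer.zero_grad()
    loss = criterion(model(X[i:i+1]), y[i:i+1])  # update on this sample only
    loss.backward()
    optimizer.step()

print(model(X).round())  # ideally tensor([[0.], [1.], [1.], [0.]])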
Upvotes: -1
Reputation: 1
Here are a few simple changes to your code that should help put you on a better path. I've used ReLU activation functions internally, but a sigmoid will also work if used correctly. Also, if you want to try using the SGD optimizer, you may want to turn down the learning rate by an order of magnitude or so.
model = nn.Sequential(nn.Linear(input_dim, hidden_dim),
                      nn.ReLU(),
                      nn.Linear(hidden_dim, output_dim),
                      nn.Sigmoid())
if use_cuda:
    model.cuda()

criterion = nn.BCELoss()
#criterion = nn.L1Loss()

#learning_rate = 0.03
#optimizer = optim.SGD(model.parameters(), lr=learning_rate)
optimizer = optim.Adam(model.parameters())

num_epochs = 10000

for epoch in range(num_epochs):
    predictions = model(X_pt)
    loss_this_epoch = criterion(predictions, Y_pt)
    model.zero_grad()
    loss_this_epoch.backward()
    optimizer.step()
    if epoch % 1000 == 0:
        print([float(_pred) for _pred in predictions], list(map(int, Y_pt)), loss_this_epoch.data[0])
Upvotes: -1
Reputation: 650
It's because nn.Linear has no activation built in, so your model is effectively a linear classifier, and XOR is the canonical example of a problem that can't be solved using linear classifiers.
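As a quick sanity check of this point (a small sketch, not part of the original fix), two stacked nn.Linear layers with no activation in between compute exactly the same function as a single nn.Linear layer, so the hidden layer adds no expressive power:

import torch
from torch import nn

torch.manual_seed(0)

lin1 = nn.Linear(2, 5)
lin2 = nn.Linear(5, 1)
stacked = nn.Sequential(lin1, lin2)  # no activation in between

# Fold the two layers into one equivalent linear layer:
# W = W2 @ W1, b = W2 @ b1 + b2
merged = nn.Linear(2, 1)
with torch.no_grad():
    merged.weight.copy_(lin2.weight @ lin1.weight)
    merged.bias.copy_(lin2.weight @ lin1.bias + lin2.bias)

x = torch.rand(4, 2)
print(torch.allclose(stacked(x), merged(x), atol=1e-6))  # True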
Change this:
model = nn.Sequential(nn.Linear(input_dim, hidden_dim),
                      nn.Linear(hidden_dim, output_dim),
                      nn.Sigmoid())

to that:

model = nn.Sequential(nn.Linear(input_dim, hidden_dim),
                      nn.Sigmoid(),
                      nn.Linear(hidden_dim, output_dim),
                      nn.Sigmoid())
and only then will your model be equivalent to the one from the linked Kaggle notebook.
Upvotes: 5