Kun Hu

Reputation: 427

XOR neural network, the losses don't go down

I'm using MXNet to train an XOR neural network, but the loss doesn't go down; it stays above 0.5.

Below is my code (MXNet 1.1.0, Python 3.6, OS X El Capitan 10.11.6).

I tried two loss functions - squared loss (L2Loss) and sigmoid cross-entropy loss (SigmoidBCELoss) - and neither worked.

from mxnet import ndarray as nd
from mxnet import autograd
from mxnet import gluon
import matplotlib.pyplot as plt

X = nd.array([[0,0],[0,1],[1,0],[1,1]])
y = nd.array([0,1,1,0])
batch_size = 1
dataset = gluon.data.ArrayDataset(X, y)
data_iter = gluon.data.DataLoader(dataset, batch_size, shuffle=True)

plt.scatter(X[:, 1].asnumpy(),y.asnumpy())
plt.show()

net = gluon.nn.Sequential()
with net.name_scope():
    net.add(gluon.nn.Dense(2, activation="tanh"))  # hidden layer
    net.add(gluon.nn.Dense(1, activation="tanh"))  # output layer
net.initialize()

softmax_cross_entropy = gluon.loss.SigmoidBCELoss()#SigmoidBinaryCrossEntropyLoss()
square_loss = gluon.loss.L2Loss()
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.3})

train_losses = []

for epoch in range(100):
    train_loss = 0
    for data, label in data_iter:
        with autograd.record():
            output = net(data)
            loss = square_loss(output, label)
        loss.backward()
        trainer.step(batch_size)

        train_loss += nd.mean(loss).asscalar()
    train_losses.append(train_loss)

plt.plot(train_losses)
plt.show()

Upvotes: 3

Views: 338

Answers (1)

Kun Hu

Reputation: 427

I figured this out somewhere else, so I'm going to post the answer here.

Basically, there were two separate issues in my original code.

  1. Weight initialization. Notice that I used the default initialization

net.initialize()

which actually does

net.initialize(initializer.Uniform(scale=0.07))

Apparently these initial weights were too small, and the network could never move away from them. So the fix is

net.initialize(mx.init.Uniform(1))

After doing this, the network converged with either sigmoid or tanh activations when using L2Loss as the loss function, and it also worked with sigmoid and SigmoidBCELoss. However, it still didn't work with tanh and SigmoidBCELoss; that is fixed by the second item below.
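For reference, here is how that fix slots into the original model definition. This is just a sketch; note that mx.init.Uniform requires import mxnet as mx, which the original snippet didn't include:

import mxnet as mx
from mxnet import gluon

net = gluon.nn.Sequential()
with net.name_scope():
    net.add(gluon.nn.Dense(2, activation="tanh"))  # hidden layer
    net.add(gluon.nn.Dense(1, activation="tanh"))  # output layer
# wider uniform range than the default scale=0.07, so the tanh units
# don't start out stuck near zero
net.initialize(mx.init.Uniform(1))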

  2. Output activation vs. loss function. SigmoidBCELoss has to be used with the output layer in one of these two ways:

    2.1. Linear activation and SigmoidBCELoss(from_sigmoid=False);

    2.2. Non-linear activation and SigmoidBCELoss(from_sigmoid=True), where the output of the non-linear function falls into (0, 1) (e.g. sigmoid).

In my original code, when I used SigmoidBCELoss, I was using either all-sigmoid or all-tanh activations. So I just needed to change the activation in the output layer from tanh to sigmoid, and the network converged. I can still keep tanh in the hidden layer.
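Putting both fixes together, here is a minimal sketch of a version that should converge (tanh hidden layer, sigmoid output, SigmoidBCELoss(from_sigmoid=True) per scenario 2.2 above; the learning rate and epoch count are just the values from the original code):

import mxnet as mx
from mxnet import nd, autograd, gluon

X = nd.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = nd.array([0, 1, 1, 0])
batch_size = 1
data_iter = gluon.data.DataLoader(gluon.data.ArrayDataset(X, y),
                                  batch_size, shuffle=True)

net = gluon.nn.Sequential()
with net.name_scope():
    net.add(gluon.nn.Dense(2, activation="tanh"))     # hidden layer can stay tanh
    net.add(gluon.nn.Dense(1, activation="sigmoid"))  # output falls into (0, 1)
net.initialize(mx.init.Uniform(1))                    # fix 1: larger initial weights

loss_fn = gluon.loss.SigmoidBCELoss(from_sigmoid=True)  # fix 2: matches the sigmoid output
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.3})

for epoch in range(100):
    for data, label in data_iter:
        with autograd.record():
            loss = loss_fn(net(data), label)
        loss.backward()
        trainer.step(batch_size)

print(net(X))  # predictions should approach [0, 1, 1, 0]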

Hope this helps!

Upvotes: 1
