Reputation: 427
I'm using MXNet to train an XOR neural network, but the loss never goes down; it always stays above 0.5.
Below is my code, using MXNet 1.1.0 with Python 3.6 on OS X El Capitan 10.11.6.
I tried two loss functions - squared loss and sigmoid binary cross-entropy loss (SigmoidBCELoss) - and neither worked.
from mxnet import ndarray as nd
from mxnet import autograd
from mxnet import gluon
import matplotlib.pyplot as plt
X = nd.array([[0,0],[0,1],[1,0],[1,1]])
y = nd.array([0,1,1,0])
batch_size = 1
dataset = gluon.data.ArrayDataset(X, y)
data_iter = gluon.data.DataLoader(dataset, batch_size, shuffle=True)
plt.scatter(X[:, 1].asnumpy(),y.asnumpy())
plt.show()
net = gluon.nn.Sequential()
with net.name_scope():
    net.add(gluon.nn.Dense(2, activation="tanh"))
    net.add(gluon.nn.Dense(1, activation="tanh"))
net.initialize()
softmax_cross_entropy = gluon.loss.SigmoidBCELoss()#SigmoidBinaryCrossEntropyLoss()
square_loss = gluon.loss.L2Loss()
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.3})
train_losses = []
for epoch in range(100):
    train_loss = 0
    for data, label in data_iter:
        with autograd.record():
            output = net(data)
            loss = square_loss(output, label)
        loss.backward()
        trainer.step(batch_size)
        train_loss += nd.mean(loss).asscalar()
    train_losses.append(train_loss)
plt.plot(train_losses)
plt.show()
Upvotes: 3
Views: 338
Reputation: 427
I figured this out somewhere else, so I'm going to post the answer here.
Basically, there were two issues in my original code.

1. My original code called

net.initialize()

which actually does

net.initialize(initializer.Uniform(scale=0.07))

Apparently these initial weights were too small, and the network could never escape them. So the fix is

net.initialize(mx.init.Uniform(1))
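If you're curious, here's a quick way to see how narrow the default initialization actually is (just an illustrative check on the same net as in the question; the exact values vary per run):

import mxnet as mx
from mxnet import nd, gluon

# same architecture as in the question, but keep a handle on the first layer
dense0 = gluon.nn.Dense(2, activation="tanh")
net = gluon.nn.Sequential()
with net.name_scope():
    net.add(dense0)
    net.add(gluon.nn.Dense(1, activation="tanh"))

net.initialize()                                  # default: Uniform(scale=0.07)
net(nd.array([[0, 0], [0, 1], [1, 0], [1, 1]]))   # forward pass resolves the deferred shapes
print(dense0.weight.data())                       # every entry lies in (-0.07, 0.07)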
After fixing the initialization, the network could converge using sigmoid or tanh as the activation with L2Loss as the loss function, and it also worked with sigmoid and SigmoidBCELoss. However, it still didn't work with tanh and SigmoidBCELoss; that is fixed by the second item below.
2. SigmoidBCELoss has to be used in one of these two ways in the output layer:

2.1. a linear activation in the output layer with SigmoidBCELoss(from_sigmoid=False);

2.2. a non-linear activation whose output falls into (0, 1) (such as sigmoid) with SigmoidBCELoss(from_sigmoid=True).
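For concreteness, here is a minimal sketch of the two setups (the net_logits / net_probs names are just mine for illustration):

from mxnet import gluon

# 2.1: the last layer is linear (raw logits); the loss applies the sigmoid itself
net_logits = gluon.nn.Sequential()
with net_logits.name_scope():
    net_logits.add(gluon.nn.Dense(2, activation="tanh"))
    net_logits.add(gluon.nn.Dense(1))                       # no activation on the output
loss_21 = gluon.loss.SigmoidBCELoss(from_sigmoid=False)

# 2.2: the last layer already outputs values in (0, 1); the loss must not apply the sigmoid again
net_probs = gluon.nn.Sequential()
with net_probs.name_scope():
    net_probs.add(gluon.nn.Dense(2, activation="tanh"))
    net_probs.add(gluon.nn.Dense(1, activation="sigmoid"))
loss_22 = gluon.loss.SigmoidBCELoss(from_sigmoid=True)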
In my original code, when I used SigmoidBCELoss, I was using either all sigmoid or all tanh activations. So I just needed to change the activation in the output layer from tanh to sigmoid, and the network could converge. I can still keep tanh in the hidden layer.
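Putting both fixes together, this is roughly what my working version looks like (a sketch rather than my exact final script; the learning rate and epoch count are the ones from the question):

import mxnet as mx
from mxnet import nd, autograd, gluon

X = nd.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = nd.array([0, 1, 1, 0])

data_iter = gluon.data.DataLoader(gluon.data.ArrayDataset(X, y),
                                  batch_size=1, shuffle=True)

net = gluon.nn.Sequential()
with net.name_scope():
    net.add(gluon.nn.Dense(2, activation="tanh"))       # hidden layer can stay tanh
    net.add(gluon.nn.Dense(1, activation="sigmoid"))    # output now lies in (0, 1)

net.initialize(mx.init.Uniform(1))                       # fix 1: wider initial weights
loss_fn = gluon.loss.SigmoidBCELoss(from_sigmoid=True)   # fix 2: matches the sigmoid output
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.3})

for epoch in range(100):
    for data, label in data_iter:
        with autograd.record():
            loss = loss_fn(net(data), label)
        loss.backward()
        trainer.step(1)

print(net(X))   # should end up close to [0, 1, 1, 0]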
Hope this helps!
Upvotes: 1