blue-sky

Reputation: 53916

Keras giving unexpected output for simple binary classification

Here is a simple Keras neural network that attempts to map 1 -> 1 and 2 -> 0 (binary classification):

from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import History
from keras import optimizers

X = [[1], [2]]
Y = [[1], [0]]

history = History()

inputDim = len(X[0])
print('input dim', inputDim)

model = Sequential()
model.add(Dense(1, activation='sigmoid', input_dim=inputDim))
model.add(Dense(1, activation='sigmoid'))

sgd = optimizers.SGD(lr=0.009, decay=1e-10, momentum=0.9, nesterov=True)
model.compile(loss='binary_crossentropy', optimizer=sgd, metrics=['accuracy'])
model.fit(X, Y, validation_split=0.1, verbose=2, callbacks=[history],
          epochs=20, batch_size=32)

Using the SGD optimizer:

optimizers.SGD(lr=0.009, decay=1e-10, momentum=0.9, nesterov=True)

Output for epoch 20:

Epoch 20/20
0s - loss: 0.5973 - acc: 1.0000 - val_loss: 0.4559 - val_acc: 0.0000e+00

If I use the Adam optimizer:

sgd = optimizers.Adam(lr=0.009, decay=1e-10)

Output for epoch 20:

Epoch 20/20
0s - loss: 1.2140 - acc: 0.0000e+00 - val_loss: 0.2930 - val_acc: 1.0000

Switching between the Adam and SGD optimizers appears to swap the values of acc and val_acc. With Adam, val_acc = 1, but acc is 0. How can validation accuracy be at its maximum while training accuracy is at its minimum?

Upvotes: 1

Views: 350

Answers (1)

Marcin Możejko

Reputation: 40516

Using sigmoid after sigmoid is a really bad idea. E.g. this paper explains why sigmoid suffers from the so-called saturation problem, and stacking one sigmoid after another makes that saturation far worse. To understand why, notice that the output of the first layer always lies in the interval (0, 1). Since binary_crossentropy tries to push the second layer's pre-activation (a linear transformation of that bounded output) toward +/- inf, your network is forced to develop extremely large weights. This is probably what is causing the instability.
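A quick numerical sketch of this squeezing effect (plain NumPy; the specific values are only illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# The first sigmoid layer can only emit values in (0, 1),
# so with modest second-layer weights the final output is
# confined to a narrow band; reaching 0 or 1 needs huge weights.
h = sigmoid(np.array([-10.0, 0.0, 10.0]))  # first-layer outputs: ~0.00005, 0.5, ~0.99995
w, b = 1.0, 0.0                            # modest second-layer weight and bias
print(sigmoid(w * h + b))                  # everything lands in roughly (0.5, 0.73)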

In order to solve your problem, I would simply keep only one sigmoid layer, since your problem is linearly separable.
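A minimal sketch of that fix, reusing the hyperparameters from the question:

from keras.models import Sequential
from keras.layers import Dense
from keras import optimizers

# A single sigmoid unit is logistic regression, which is enough
# for the linearly separable mapping 1 -> 1, 2 -> 0.
model = Sequential()
model.add(Dense(1, activation='sigmoid', input_dim=1))

sgd = optimizers.SGD(lr=0.009, decay=1e-10, momentum=0.9, nesterov=True)
model.compile(loss='binary_crossentropy', optimizer=sgd, metrics=['accuracy'])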

UPDATE: As @Daniel mentioned, when you apply a validation split to a dataset containing only two examples, you end up with one example in the training set and the other in the validation set. That is what causes this weird behavior.
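For instance, dropping validation_split keeps both examples in the training set (a sketch, continuing from the single-layer model compiled above):

X = [[1], [2]]
Y = [[1], [0]]

# No validation_split: with only two examples, a 0.1 split would move
# one of them into validation and leave just one for training.
# batch_size=2 covers the whole dataset in each step.
model.fit(X, Y, verbose=2, epochs=20, batch_size=2)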

Upvotes: 1
