Reputation: 1519
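I'm trying to replicate some of the examples from Neural Networks and Deep Learning using Keras, but I'm having problems training a network based on the architecture from chapter 1. The aim is to classify handwritten digits from the MNIST dataset. The architecture (as built in the code below): 784 inputs, one fully connected hidden layer of 30 neurons, and a fully connected output layer of 10 neurons.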
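Hyper-parameters (as set in the code below): learning rate 3.0, mini-batch size 10, 30 epochs, stochastic gradient descent with a mean squared error cost.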
My code:
import keras  # needed for keras.utils.to_categorical below
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD
from keras.initializers import RandomNormal
# import data
(x_train, y_train), (x_test, y_test) = mnist.load_data()
# input image dimensions
img_rows, img_cols = 28, 28
x_train = x_train.reshape(x_train.shape[0], img_rows * img_cols)
x_test = x_test.reshape(x_test.shape[0], img_rows * img_cols)
input_shape = (img_rows * img_cols,)
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255
print('x_train shape:', x_train.shape)
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')
# convert class vectors to binary class matrices
num_classes = 10
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
print('y_train shape:', y_train.shape)
# Construct model
# 784 * 30 * 10
# Normal distribution for weights/biases
# Stochastic Gradient Descent optimizer
# Mean squared error loss (cost function)
model = Sequential()
layer1 = Dense(30,
               input_shape=input_shape,
               kernel_initializer=RandomNormal(stddev=1),
               bias_initializer=RandomNormal(stddev=1))
model.add(layer1)
layer2 = Dense(10,
               kernel_initializer=RandomNormal(stddev=1),
               bias_initializer=RandomNormal(stddev=1))
model.add(layer2)
print('Layer 1 input shape: ', layer1.input_shape)
print('Layer 1 output shape: ', layer1.output_shape)
print('Layer 2 input shape: ', layer2.input_shape)
print('Layer 2 output shape: ', layer2.output_shape)
model.summary()
model.compile(optimizer=SGD(lr=3.0),
              loss='mean_squared_error',
              metrics=['accuracy'])
# Train
model.fit(x_train,
          y_train,
          batch_size=10,
          epochs=30,
          verbose=2)
# Run on test data and output results
result = model.evaluate(x_test,
                        y_test,
                        verbose=1)
print('Test loss: ', result[0])
print('Test accuracy: ', result[1])
Output (Using Python 3.6 and the TensorFlow backend):
Using TensorFlow backend.
x_train shape: (60000, 784)
60000 train samples
10000 test samples
y_train shape: (60000, 10)
Layer 1 input shape: (None, 784)
Layer 1 output shape: (None, 30)
Layer 2 input shape: (None, 30)
Layer 2 output shape: (None, 10)
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_1 (Dense) (None, 30) 23550
_________________________________________________________________
dense_2 (Dense) (None, 10) 310
=================================================================
Total params: 23,860
Trainable params: 23,860
Non-trainable params: 0
_________________________________________________________________
Epoch 1/30
- 7s - loss: nan - acc: 0.0987
Epoch 2/30
- 7s - loss: nan - acc: 0.0987
(repeated for all 30 epochs)
Epoch 30/30
- 6s - loss: nan - acc: 0.0987
10000/10000 [==============================] - 0s 22us/step
Test loss: nan
Test accuracy: 0.098
As you can see, the network isn't learning at all, and I'm not sure why. The shapes look all right as far as I can tell. What am I doing that's preventing the network from learning?
(Incidentally, I know that cross-entropy loss and a softmax output layer would be better; however, from the linked book, they don't appear to be necessary. The book's manually implemented network in chapter 1 learns successfully; I'm trying to replicate that before moving on.)
Upvotes: 3
Views: 2814
Reputation: 60321
Choosing MSE as a loss function in a classification problem is indeed odd, and I am not sure the introductory nature of the exercise is a good justification, as claimed in the linked book chapter. Nevertheless:
Your learning rate lr, 3.0, is huge; try something no higher than 0.1, or even lower.
Your layers have no activation functions; try activation='sigmoid' at all layers (since you explicitly want to avoid softmax, even in the final layer).
The stddev=1 value you use in your initializers is again huge; try something in the range of 0.05 (the default value). Also, the standard practice is to initialize the biases to zeros.
It would probably be better to start with the Keras MNIST MLP example and adapt it to your learning needs (number of layers, activation functions, etc.).
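To illustrate why a learning rate of 3.0 can turn the loss into nan when a layer has no activation, here is a minimal sketch on a toy problem. It is not the MNIST network: the single linear unit, the function name train_linear_unit, and the starting values are all hypothetical, chosen only so the arithmetic is easy to follow.

```python
import math

def train_linear_unit(lr, steps=1000, x=1.0, target=1.0, w=0.5):
    """Run plain gradient descent on the squared error of one linear unit.

    Hypothetical toy problem: the unit computes w * x and the loss is
    (w * x - target) ** 2. Returns the final loss.
    """
    for _ in range(steps):
        error = w * x - target
        grad = 2.0 * error * x      # derivative of the squared error w.r.t. w
        w -= lr * grad              # gradient descent update
    error = w * x - target
    return error * error            # multiplication avoids OverflowError from **

loss_small_lr = train_linear_unit(lr=0.1)   # each step scales the error by 0.8
loss_large_lr = train_linear_unit(lr=3.0)   # each step scales the error by -5

print("lr=0.1 final loss:", loss_small_lr)
print("lr=3.0 final loss:", loss_large_lr)
print("lr=3.0 loss is nan:", math.isnan(loss_large_lr))
```

With the large step the weight overshoots further on every update until it overflows to infinity, and the next update computes inf - inf, which is nan. A bounded activation such as a sigmoid keeps the outputs and gradients in a limited range, which is presumably why the same learning rate is workable in the book's network.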
Upvotes: 2
Reputation: 6499
You need to specify the activation of each layer. Each layer should look something like this:
layer2 = Dense(10,
               activation='sigmoid',
               kernel_initializer=RandomNormal(stddev=1),
               bias_initializer=RandomNormal(stddev=1))
Notice that I specified the activation parameter here. Also, for the last layer, you should use activation="softmax", since you have multiple categories.
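As a rough illustration of the difference between the two output activations, the sketch below applies both to the same raw scores in plain Python. The logits values are made up for the example, and the helper functions are hand-written stand-ins rather than Keras internals.

```python
import math

def sigmoid(z):
    """Squash a single score into (0, 1), independently of the other scores."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    """Turn a list of scores into probabilities that sum to 1."""
    m = max(zs)                               # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                      # hypothetical raw outputs for 3 classes
sigmoid_out = [sigmoid(z) for z in logits]
softmax_out = softmax(logits)

print("sigmoid:", sigmoid_out, "sum =", sum(sigmoid_out))
print("softmax:", softmax_out, "sum =", sum(softmax_out))
```

The sigmoid outputs are each valid on their own but do not form a distribution over the classes, while the softmax outputs always sum to 1, which matches a problem where each image belongs to exactly one digit class.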
Another thing to consider is that classification (as opposed to regression) works best with a cross-entropy loss, so you might want to change the loss value in model.compile to loss='categorical_crossentropy'. However, this is not necessary, and you will still get a result using a mean_squared_error loss.
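To make the comparison between the two losses concrete, here is a small sketch that evaluates both on a single one-hot target. The predictions are invented for illustration, and the two functions are hand-written versions of the usual definitions (mean of squared differences, and minus the log of the probability assigned to the true class), not calls into Keras.

```python
import math

def mean_squared_error(target, pred):
    """Average squared difference over the output units."""
    return sum((t - p) ** 2 for t, p in zip(target, pred)) / len(target)

def categorical_crossentropy(target, pred):
    """Negative log-probability assigned to the true class (one-hot target)."""
    return -sum(t * math.log(p) for t, p in zip(target, pred) if t > 0)

target = [0.0, 1.0, 0.0]             # the true class is index 1
good_pred = [0.05, 0.90, 0.05]       # hypothetical confident, correct prediction
bad_pred = [0.90, 0.05, 0.05]        # hypothetical confident, wrong prediction

mse_good = mean_squared_error(target, good_pred)
mse_bad = mean_squared_error(target, bad_pred)
cce_good = categorical_crossentropy(target, good_pred)
cce_bad = categorical_crossentropy(target, bad_pred)

print("MSE            good/bad:", mse_good, mse_bad)
print("Cross-entropy  good/bad:", cce_good, cce_bad)
```

Both losses rank the wrong prediction as worse, which is why training with mean squared error can still work. The cross-entropy grows without bound as the probability of the true class approaches zero, whereas the squared error per output is capped at 1, so cross-entropy penalizes confident mistakes more heavily.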
If you still get a nan value for the loss, you can try changing the learning rate for SGD.
I got a test accuracy of 0.9425 using the script you show by only changing the activation of the first layer to sigmoid and of the second layer to softmax.
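For reference, a sketch of a model that combines the suggestions from both answers might look like the following. This is an assumption-laden variant, not the exact script behind the 0.9425 figure: the helper name build_model and the learning rate of 0.1 are choices made here, the default initializers replace RandomNormal(stddev=1), and the learning_rate argument assumes a reasonably recent Keras (older releases spell it lr).

```python
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD

def build_model(input_dim=784, hidden_units=30, num_classes=10, learning_rate=0.1):
    """Build a 784-30-10 network with explicit activations (hypothetical helper)."""
    model = Sequential()
    # Hidden layer: sigmoid activation, default weight and bias initializers
    model.add(Dense(hidden_units, activation='sigmoid', input_shape=(input_dim,)))
    # Output layer: softmax gives one probability per digit class
    model.add(Dense(num_classes, activation='softmax'))
    model.compile(optimizer=SGD(learning_rate=learning_rate),
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model

model = build_model()
model.summary()
```

The data loading, model.fit, and model.evaluate calls from the question can be reused unchanged with this model. To stay closer to the book, loss='mean_squared_error' and a sigmoid output layer are drop-in alternatives.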
Upvotes: 3