Ronan Venkat

Reputation: 355

tf.keras predictions are bad while evaluation is good

I'm programming a model in tf.keras, and running model.evaluate() on the training set usually yields ~96% accuracy. My evaluation on the test set is usually close, about 93%. However, when I predict manually, the model is usually inaccurate. This is my code:

import tensorflow as tf
from tensorflow.keras import layers
import numpy as np
import pandas as pd

!git clone https://github.com/DanorRon/data
%cd data
!ls

batch_size = 100
epochs = 15
alpha = 0.001
lambda_ = 0.001
h1 = 50

train = pd.read_csv('/content/data/mnist_train.csv.zip')
test = pd.read_csv('/content/data/mnist_test.csv.zip')

train = train.loc['1':'5000', :]
test = test.loc['1':'2000', :]

train = train.sample(frac=1).reset_index(drop=True)
test = test.sample(frac=1).reset_index(drop=True)

x_train = train.loc[:, '1x1':'28x28']
y_train = train.loc[:, 'label']

x_test = test.loc[:, '1x1':'28x28']
y_test = test.loc[:, 'label']

x_train = x_train.values
y_train = y_train.values

x_test = x_test.values
y_test = y_test.values

nb_classes = 10
targets = y_train.reshape(-1)
y_train_onehot = np.eye(nb_classes)[targets]

nb_classes = 10
targets = y_test.reshape(-1)
y_test_onehot = np.eye(nb_classes)[targets]

model = tf.keras.Sequential()
model.add(layers.Dense(784, input_shape=(784,), kernel_initializer='random_uniform', bias_initializer='zeros'))
model.add(layers.Dense(h1, activation='relu', kernel_regularizer=tf.keras.regularizers.l2(lambda_), kernel_initializer='random_uniform', bias_initializer='zeros'))
model.add(layers.Dense(10, activation='softmax', kernel_regularizer=tf.keras.regularizers.l2(lambda_), kernel_initializer='random_uniform', bias_initializer='zeros'))

model.compile(optimizer='SGD',
             loss = 'mse',
             metrics = ['categorical_accuracy'])

model.fit(x_train, y_train_onehot, epochs=epochs, batch_size=batch_size)

model.evaluate(x_test, y_test_onehot, batch_size=batch_size)

prediction = model.predict_classes(x_test)
print(prediction)

print(y_test[1:])

I've heard that a lot of the time when people have this problem, it's just a problem with data input. But I can't see any problem with that here since it almost always predicts wrongly (about as much as you would expect if it was random). How do I fix this problem?

Edit: Here are the specific results:

Last training step:

Epoch 15/15
49999/49999 [==============================] - 3s 70us/sample - loss: 0.0309 - categorical_accuracy: 0.9615

Evaluation output:

2000/2000 [==============================] - 0s 54us/sample - loss: 0.0352 - categorical_accuracy: 0.9310
[0.03524150168523192, 0.931]

Output from model.predict_classes:

[9 9 0 ... 5 0 5]

Output from print(y_test):

[9 0 0 7 6 8 5 1 3 2 4 1 4 5 8 4 9 2 4]

Upvotes: 3

Views: 3094

Answers (1)

desertnaut

Reputation: 60319

First thing is, your loss function is wrong: you are in a multi-class classification setting, and you are using a loss function suitable for regression and not classification (MSE).

Change your model compilation to:

model.compile(loss='categorical_crossentropy',
              optimizer='SGD',
              metrics=['accuracy'])
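
As a side note, since your labels are plain integers before the manual one-hot encoding, you could also drop the np.eye step entirely and use the sparse variant of the loss; a minimal sketch (my addition), assuming y_train and y_test still hold the raw integer labels:

model.compile(loss='sparse_categorical_crossentropy',
              optimizer='SGD',
              metrics=['accuracy'])

# Integer labels go in directly; no one-hot encoding needed
model.fit(x_train, y_train, epochs=epochs, batch_size=batch_size)
model.evaluate(x_test, y_test, batch_size=batch_size)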

See the Keras MNIST MLP example for corroboration, and my own answer to What function defines accuracy in Keras when the loss is mean squared error (MSE)? for more details (although here you actually have the inverse problem, i.e. a regression loss in a classification setting).

Moreover, it is not clear whether the MNIST variant you are using comes already normalized; if not, you should normalize the pixel values yourself:

x_train = x_train.values/255
x_test = x_test.values/255
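
If you are unsure, a quick sanity check (my own addition) is to print the pixel range before scaling:

# Sketch: if the max is 255 the data is raw pixel intensities and needs the division;
# if it is already around 1.0, skip it
print(np.asarray(x_train).min(), np.asarray(x_train).max())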

It is also not clear why you ask for a 784-unit layer, since this is actually the second layer of your NN (the first is implicitly set by the input_shape argument - see Keras Sequential model input layer), and it certainly does not need to contain one unit for each one of your 784 input features.
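
A minimal sketch of that point (the width of 128 below is just an arbitrary placeholder, not a tuned value):

# Sketch: the input layer is defined implicitly by input_shape;
# the first Dense layer is a hidden layer and can have any width
model = tf.keras.Sequential()
model.add(layers.Dense(128, activation='relu', input_shape=(784,)))
model.add(layers.Dense(10, activation='softmax'))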

UPDATE (after comments):

But why is MSE meaningless for classification?

This is a theoretical issue, not exactly appropriate for SO; roughly speaking, it is for the same reason we don't use linear regression for classification - we use logistic regression, the actual difference between the two approaches being exactly the loss function. Andrew Ng, in his popular Machine Learning course at Coursera, explains this nicely - see his Lecture 6.1 - Logistic Regression | Classification at Youtube (explanation starts at ~ 3:00), as well as section 4.2 Why Not Linear Regression [for classification]? of the (highly recommended and freely available) textbook An Introduction to Statistical Learning by Hastie, Tibshirani and coworkers.
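
As a rough numerical illustration (my own addition, not from the references above): MSE stays bounded no matter how confidently wrong a prediction is, whereas cross-entropy grows without bound as the probability assigned to the true class goes to zero, so it gives a much stronger corrective signal exactly where it matters for classification:

import numpy as np

y_true = np.array([0., 1., 0.])  # one-hot target, true class is index 1

preds = {
    'confidently correct': np.array([0.05, 0.90, 0.05]),
    'confidently wrong':   np.array([0.90, 0.05, 0.05]),
}

for name, y_pred in preds.items():
    mse = np.mean((y_true - y_pred) ** 2)   # bounded regardless of how wrong
    ce = -np.sum(y_true * np.log(y_pred))   # blows up as p(true class) -> 0
    print(f'{name:20s}  MSE={mse:.3f}  cross-entropy={ce:.3f}')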

And MSE does give a high accuracy, so why doesn't that matter?

Nowadays, almost anything you throw at MNIST will "work", which of course neither makes it correct nor a good approach for more demanding datasets...

UPDATE 2:

whenever I run with crossentropy, the accuracy just flutters around at ~10%

Sorry, cannot reproduce the behavior... Taking the Keras MNIST MLP example with a simplified version of your model, i.e.:

model = Sequential()
model.add(Dense(784, activation='linear', input_shape=(784,)))
model.add(Dense(50, activation='relu'))
model.add(Dense(num_classes, activation='softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer=SGD(),
              metrics=['accuracy'])

we easily end up with a ~ 92% validation accuracy after only 5 epochs:

history = model.fit(x_train, y_train,
                    batch_size=128,
                    epochs=5,
                    verbose=1,
                    validation_data=(x_test, y_test))

Train on 60000 samples, validate on 10000 samples
Epoch 1/10
60000/60000 [==============================] - 4s - loss: 0.8974 - acc: 0.7801 - val_loss: 0.4650 - val_acc: 0.8823
Epoch 2/10
60000/60000 [==============================] - 4s - loss: 0.4236 - acc: 0.8868 - val_loss: 0.3582 - val_acc: 0.9034
Epoch 3/10
60000/60000 [==============================] - 4s - loss: 0.3572 - acc: 0.9009 - val_loss: 0.3228 - val_acc: 0.9099
Epoch 4/10
60000/60000 [==============================] - 4s - loss: 0.3263 - acc: 0.9082 - val_loss: 0.3024 - val_acc: 0.9156
Epoch 5/10
60000/60000 [==============================] - 4s - loss: 0.3061 - acc: 0.9132 - val_loss: 0.2845 - val_acc: 0.9196

Notice the activation='linear' of the first Dense layer, which is equivalent to not specifying any activation, as in your case (and, as I said, practically everything you throw at MNIST will "work")...
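
For clarity, these two layer definitions behave identically, since Dense defaults to a linear (identity) activation:

# Equivalent layers: Dense uses a linear (identity) activation by default
layers.Dense(784, input_shape=(784,))
layers.Dense(784, activation='linear', input_shape=(784,))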

Final advice: Try modifying your model as:

model = tf.keras.Sequential()
model.add(layers.Dense(784, activation='relu', input_shape=(784,)))
model.add(layers.Dense(h1, activation='relu'))
model.add(layers.Dense(10, activation='softmax'))

in order to use the better (and default) 'glorot_uniform' initializer, and remove the kernel_regularizer arguments (they may be the cause of the issue - always start simple!)...
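
Once you have retrained with these changes, you can confirm that manual predictions and evaluate() actually agree; a minimal sketch (my addition), assuming the model was compiled with categorical_crossentropy and fit on y_train_onehot as above (np.argmax over model.predict gives the same result as predict_classes):

# Compare evaluate() against manual predictions on the same test set
loss, acc = model.evaluate(x_test, y_test_onehot, batch_size=batch_size)

pred_probs = model.predict(x_test)              # shape (n_samples, 10)
pred_classes = np.argmax(pred_probs, axis=1)    # equivalent to predict_classes
manual_acc = np.mean(pred_classes == y_test)

print(f'evaluate accuracy: {acc:.3f}  manual prediction accuracy: {manual_acc:.3f}')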

Upvotes: 2
