Reputation: 3392
I need some help understanding how accuracy is calculated when fitting a model in Keras. This is a sample training history of the model:
Train on 340 samples, validate on 60 samples
Epoch 1/100
340/340 [==============================] - 5s 13ms/step - loss: 0.8081 - acc: 0.7559 - val_loss: 0.1393 - val_acc: 1.0000
Epoch 2/100
340/340 [==============================] - 3s 9ms/step - loss: 0.7815 - acc: 0.7647 - val_loss: 0.1367 - val_acc: 1.0000
Epoch 3/100
340/340 [==============================] - 3s 10ms/step - loss: 0.8042 - acc: 0.7706 - val_loss: 0.1370 - val_acc: 1.0000
...
Epoch 25/100
340/340 [==============================] - 3s 9ms/step - loss: 0.6006 - acc: 0.8029 - val_loss: 0.2418 - val_acc: 0.9333
Epoch 26/100
340/340 [==============================] - 3s 9ms/step - loss: 0.5799 - acc: 0.8235 - val_loss: 0.3004 - val_acc: 0.8833
So, validation accuracy is 1 in the first epochs? How can the validation accuracy be better than the training accuracy?
These are figures showing all accuracy and loss values:
Then I use sklearn metrics to evaluate final results:
from sklearn import metrics

def evaluate(predicted_outcome, expected_outcome):
    f1_score = metrics.f1_score(expected_outcome, predicted_outcome, average='weighted')
    balanced_accuracy_score = metrics.balanced_accuracy_score(expected_outcome, predicted_outcome)
    print('****************************')
    print('| MODEL PERFORMANCE REPORT |')
    print('****************************')
    print('Average F1 score = {:0.2f}.'.format(f1_score))
    print('Balanced accuracy score = {:0.2f}.'.format(balanced_accuracy_score))
    print('Confusion matrix')
    print(metrics.confusion_matrix(expected_outcome, predicted_outcome))
    print('Other metrics')
    print(metrics.classification_report(expected_outcome, predicted_outcome))
I get this output (as you can see, the results are terrible):
****************************
| MODEL PERFORMANCE REPORT |
****************************
Average F1 score = 0.25.
Balanced accuracy score = 0.32.
Confusion matrix
[[ 7 24 2 40]
[ 11 70 4 269]
[ 0 0 0 48]
[ 0 0 0 6]]
Other metrics
              precision    recall  f1-score   support

           0       0.39      0.10      0.15        73
           1       0.74      0.20      0.31       354
           2       0.00      0.00      0.00        48
           3       0.02      1.00      0.03         6

   micro avg       0.17      0.17      0.17       481
   macro avg       0.29      0.32      0.12       481
weighted avg       0.61      0.17      0.25       481
Why are the accuracy and loss values reported by the Keras fit function so different from the values of the sklearn metrics?
This is my model, in case it helps:
from keras.models import Sequential
from keras.layers import LSTM, Dropout, Flatten, Dense

model = Sequential()
model.add(LSTM(units=100,                 # dimensionality of the hidden state
               return_sequences=True,
               input_shape=(timestamps, nb_features),
               dropout=0.2,
               recurrent_dropout=0.2))
model.add(Dropout(0.2))
model.add(Flatten())
model.add(Dense(units=nb_classes, activation='softmax'))
model.compile(loss='categorical_crossentropy',
              metrics=['accuracy'],
              optimizer='adadelta')
Input data dimensions:
400 train sequences
481 test sequences
X_train shape: (400, 20, 17)
X_test shape: (481, 20, 17)
y_train shape: (400, 4)
y_test shape: (481, 4)
This is how I apply sklearn metrics:
import numpy as np

testPredict = model.predict(np.array(X_test))
y_test = np.argmax(y_test.values, axis=1)  # one-hot labels to class indices
y_pred = np.argmax(testPredict, axis=1)
evaluate(y_pred, y_test)
It looks like I am missing something.
Upvotes: 4
Views: 2190
Reputation: 60321
You sound a little confused.
To start with, you are comparing apples to oranges: the validation accuracy reported by Keras on a 60-sample set (notice the first informative message printed by Keras, Train on 340 samples, validate on 60 samples) versus the test accuracy reported by scikit-learn on your 481-sample test set.
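To make the comparison apples-to-apples, you could evaluate the Keras model on the same 481-sample test set; a minimal sketch, assuming y_test_onehot is a hypothetical name for the original one-hot label array (your snippet later overwrites y_test with class indices):

test_loss, test_acc = model.evaluate(np.array(X_test), y_test_onehot)  # y_test_onehot: the original one-hot labels (assumed name)
print('Keras test accuracy = {:0.2f}'.format(test_acc))  # directly comparable to the sklearn report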
Second, your validation set of only 60 samples is way too small; with samples that small, wild fluctuations in the computed metrics, such as the ones you report, are not at all unexpected (there is a reason we need validation and test sets of sufficient size, not only training ones).
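To see just how coarse a 60-sample metric is: each validation sample is worth 1/60 ≈ 1.7% of accuracy, and indeed the val_acc values in your log are plain fractions of 60:

print(56 / 60)  # 0.9333... -> your val_acc at epoch 25
print(53 / 60)  # 0.8833... -> your val_acc at epoch 26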
Third, your training/validation/test division is quite unusual, to say the least: standard practice calls for allocations of roughly 70/15/15 per cent or similar, while you are using 38/7/55 per cent (i.e. 340/60/481 samples)... A sketch of a more conventional split follows.
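For illustration, here is one way to do such a split with scikit-learn; this is just a sketch, assuming X and y are NumPy arrays holding all 881 sequences and one-hot labels together (stratifying on the class indices keeps the class proportions similar across the three sets):

from sklearn.model_selection import train_test_split

labels = y.argmax(axis=1)  # class indices, used only for stratification
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, stratify=labels, random_state=42)  # 70% train
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp.argmax(axis=1),
    random_state=42)  # 15% validation, 15% test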
Lastly, and without knowing the details of your data, it may very well be the case that only 340 samples are not enough to fit an LSTM model such as yours for a good 4-class classification task.
To begin with, use a more appropriate allocation of your data into training/validation/test sets, and be sure you compare apples to apples...
PS: In similar questions, you should also include your model.fit() part.
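For reference, a sketch of what such a call might look like; batch_size is purely an assumption here, but validation_split=0.15 would reproduce your Train on 340 samples, validate on 60 samples message (0.15 * 400 = 60):

history = model.fit(X_train, y_train,
                    epochs=100,
                    batch_size=32,          # assumption, not shown in the question
                    validation_split=0.15)  # holds out 60 of the 400 samples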
Upvotes: 3