Reputation: 1074
I am struggling with different metrics while evaluating neural networks. My investigations showed that Keras (version 1.2.2) calculates different values for specific metrics (via the evaluate function) than sklearn's classification_report does.
Specifically, the values for 'precision' (i.e. Keras's 'precision' != sklearn's 'precision') and 'recall' (i.e. Keras's 'recall' != sklearn's 'recall') differ. For the working example below the differences seem random, but when evaluating bigger networks, Keras's 'precision' (almost) equals sklearn's 'recall', while the two 'recall' metrics differ clearly.
I appreciate your help!
from __future__ import print_function
import numpy as np
np.random.seed(1337) # for reproducibility
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import Convolution2D, MaxPooling2D
from keras.utils import np_utils # numpy utils for to_categorical()
from keras import backend as K # abstract backend API (to write code that works with both Theano and TensorFlow)
from sklearn.metrics import classification_report
batch_size = 128
nb_classes = 10
nb_epoch = 30
# input image dimensions
img_rows, img_cols = 28, 28
# number of convolutional filters to use
nb_filters = 32
# size of pooling area for max pooling
pool_size = (2, 2)
# convolution kernel size
kernel_size = (3, 3)
# the data, shuffled and split between train and test sets
(X_train, y_train), (X_test, y_test) = mnist.load_data()
if K.image_dim_ordering() == 'th':
    X_train = X_train.reshape(X_train.shape[0], 1, img_rows, img_cols)
    X_test = X_test.reshape(X_test.shape[0], 1, img_rows, img_cols)
    input_shape = (1, img_rows, img_cols)
else:
    X_train = X_train.reshape(X_train.shape[0], img_rows, img_cols, 1)
    X_test = X_test.reshape(X_test.shape[0], img_rows, img_cols, 1)
    input_shape = (img_rows, img_cols, 1)
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255 # range [0,1]
X_test /= 255 # range [0,1]
print('X_train shape:', X_train.shape)
print(X_train.shape[0], 'train samples')
print(X_test.shape[0], 'test samples')
# convert class vectors to binary class matrices
Y_train = np_utils.to_categorical(y_train, nb_classes) # necessary for use of categorical_crossentropy
Y_test = np_utils.to_categorical(y_test, nb_classes) # necessary for use of categorical_crossentropy
# create model
model = Sequential()
model.add(Convolution2D(nb_filters, kernel_size[0], kernel_size[1],
                        border_mode='valid',
                        input_shape=input_shape))
model.add(Activation('relu'))
model.add(Convolution2D(nb_filters, kernel_size[0], kernel_size[1]))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=pool_size))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(nb_classes))
model.add(Activation('softmax'))
# configure model
model.compile(loss='categorical_crossentropy',
              optimizer='adadelta',
              metrics=['accuracy', 'precision', 'recall'])
# train model
model.fit(X_train, Y_train, batch_size=batch_size, nb_epoch=nb_epoch,
          verbose=1, validation_data=(X_test, Y_test))
# evaluate model with keras
score = model.evaluate(X_test, Y_test, verbose=0)
print('Test score:', score[0])
print('Test accuracy:', score[1])
print('Test precision:', score[2])
print('Test recall:', score[3])
# evaluate model with sklearn
predictions_last_epoch = model.predict(X_test, batch_size=batch_size, verbose=1)
target_names = ['class 0', 'class 1', 'class 2', 'class 3', 'class 4',
                'class 5', 'class 6', 'class 7', 'class 8', 'class 9']
predicted_classes = np.argmax(predictions_last_epoch, axis=1)
print('\n')
print(classification_report(y_test, predicted_classes,
                            target_names=target_names, digits=6))
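For reference, this is roughly how Keras 1.x defines these metrics (a sketch paraphrased from keras/metrics.py of that era; please verify against your installed version). Note that both are computed globally over the rounded (i.e. 0.5-thresholded) predictions of a batch and then averaged over batches, not per class:
def precision(y_true, y_pred):
    # global precision over one batch: TP / all predicted positives
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    return true_positives / (predicted_positives + K.epsilon())

def recall(y_true, y_pred):
    # global recall over one batch: TP / all actual positives
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    return true_positives / (possible_positives + K.epsilon())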
EDIT:
The output of the script given above:
Test score: 0.0271549037314
Test accuracy: 0.9916
Test precision: 0.992290322304
Test recall: 0.9908
9728/10000 [============================>.] - ETA: 0s
             precision    recall  f1-score   support

    class 0   0.987867  0.996939  0.992382       980
    class 1   0.993860  0.998238  0.996044      1135
    class 2   0.990329  0.992248  0.991288      1032
    class 3   0.991115  0.994059  0.992585      1010
    class 4   0.994882  0.989817  0.992343       982
    class 5   0.991041  0.992152  0.991597       892
    class 6   0.993678  0.984342  0.988988       958
    class 7   0.992180  0.987354  0.989761      1028
    class 8   0.989754  0.991786  0.990769       974
    class 9   0.991054  0.988107  0.989578      1009

avg / total   0.991607  0.991600  0.991597     10000
For another model:
val/test loss: 0.231304548573
val/test categorical_accuracy: **0.978500002956**
val/test precision: *0.995103668976*
val/test recall: 0.941900001907
val/test fbeta_score: 0.967675107574
val/test mean_squared_error: 0.0064611148566
10000/10000 [==============================] - 0s
             precision    recall  f1-score   support

    class 0   0.989605  0.971429  0.980433       980
    class 1   0.985153  0.993833  0.989474      1135
    class 2   0.988154  0.969961  0.978973      1032
    class 3   0.981373  0.991089  0.986207      1010
    class 4   0.968907  0.983707  0.976251       982
    class 5   0.997633  0.945067  0.970639       892
    class 6   0.995690  0.964509  0.979852       958
    class 7   0.987230  0.977626  0.982405      1028
    class 8   0.945205  0.991786  0.967936       974
    class 9   0.951429  0.990089  0.970374      1009

avg / total   *0.978964*  **0.978500**  0.978522     10000
Definition of desired metrics (for model.compile):
metrics=['categorical_accuracy', 'precision', 'recall', 'fbeta_score', 'mean_squared_error']
model.compile(loss='categorical_crossentropy',
              optimizer='sgd',
              metrics=metrics)
Output of model.metrics_names:
['loss', 'categorical_accuracy', 'precision', 'recall', 'fbeta_score', 'mean_squared_error']
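As a sanity check (my assumption about what Keras computes, not an official recipe), the global thresholded 'precision' can be approximated with numpy, continuing from the script above; it will only roughly match model.evaluate, since Keras averages the metric per batch rather than over the whole test set:
probs = model.predict(X_test, batch_size=batch_size)
y_pred_binary = np.round(np.clip(probs, 0, 1))  # 0.5 threshold, as in Keras' metric
true_positives = np.sum(Y_test * y_pred_binary)
predicted_positives = np.sum(y_pred_binary)
print('global precision:', true_positives / (predicted_positives + 1e-07))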
Upvotes: 2
Views: 5918
Reputation: 2138
Yes, it is different, due to the fact that the avg / total row of the sklearn classification report is a weighted average based on the support: each class's metric is weighted by the number of true samples of that class, rather than all classes counting equally.
Experiment with:
from sklearn.metrics import classification_report
y_true = [0, 1,2,1]
y_pred = [0, 0,2,0]
target_names = ['class 0', 'class 1', 'class 2']
print(classification_report(y_true, y_pred, target_names=target_names))
Gives you:

             precision    recall  f1-score   support

    class 0       0.33      1.00      0.50         1
    class 1       0.00      0.00      0.00         2
    class 2       1.00      1.00      1.00         1

avg / total       0.33      0.50      0.38     **4**
However, the plain (macro) average would be (0.33 + 0.00 + 1.00) / 3 ≈ 0.443, whereas, as the support column suggests, sklearn returns the support-weighted average (0.33 * 1 + 0.00 * 2 + 1.00 * 1) / 4 = 0.3325.
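You can confirm the two averaging schemes directly with sklearn's precision_score and its average parameter (a quick check, not part of the original question):
from sklearn.metrics import precision_score

y_true = [0, 1, 2, 1]
y_pred = [0, 0, 2, 0]
# unweighted (macro) mean over classes: (1/3 + 0 + 1) / 3 ~= 0.4444
print(precision_score(y_true, y_pred, average='macro'))
# support-weighted mean, i.e. the report's avg / total row:
# (1/3 * 1 + 0 * 2 + 1 * 1) / 4 ~= 0.3333 (0.3325 when computed
# from the rounded per-class values shown in the report)
print(precision_score(y_true, y_pred, average='weighted'))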
Upvotes: 2