D.Laupheimer

Reputation: 1074

Comparing metrics of Keras with metrics of sklearn.classification_report

I am struggling with different metrics while evaluating neural networks. My investigations showed that Keras (version 1.2.2) calculates different values for specific metrics (using the evaluate function) compared to sklearn's classification_report.

Specifically, the values for 'precision' and 'recall' differ (i.e. Keras 'precision' != sklearn 'precision', and likewise for 'recall'). For the following working example the differences seem random, but when evaluating bigger networks, Keras' 'precision' (almost) equals sklearn's 'recall', whereas the two 'recall' values differ clearly.

I appreciate your help!

from __future__ import print_function 
import numpy as np
np.random.seed(1337)  # for reproducibility

from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import Convolution2D, MaxPooling2D
from keras.utils import np_utils # numpy utils for to_categorical()
from keras import backend as K  # abstract backend API (in order to generate compatible code for Theano and Tf)
from sklearn.metrics import classification_report

batch_size = 128
nb_classes = 10
nb_epoch = 30

# input image dimensions
img_rows, img_cols = 28, 28
# number of convolutional filters to use
nb_filters = 32
# size of pooling area for max pooling
pool_size = (2, 2)
# convolution kernel size
kernel_size = (3, 3)

# the data, shuffled and split between train and test sets
(X_train, y_train), (X_test, y_test) = mnist.load_data()

if K.image_dim_ordering() == 'th':
    X_train = X_train.reshape(X_train.shape[0], 1, img_rows, img_cols)
    X_test = X_test.reshape(X_test.shape[0], 1, img_rows, img_cols)
    input_shape = (1, img_rows, img_cols)
else:
    X_train = X_train.reshape(X_train.shape[0], img_rows, img_cols, 1)
    X_test = X_test.reshape(X_test.shape[0], img_rows, img_cols, 1)
    input_shape = (img_rows, img_cols, 1)

X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255 # range [0,1]
X_test /= 255 # range [0,1]
print('X_train shape:', X_train.shape)
print(X_train.shape[0], 'train samples')
print(X_test.shape[0], 'test samples')

# convert class vectors to binary class matrices
Y_train = np_utils.to_categorical(y_train, nb_classes) # necessary for use of categorical_crossentropy 
Y_test = np_utils.to_categorical(y_test, nb_classes) # necessary for use of categorical_crossentropy 

# create model
model = Sequential()

model.add(Convolution2D(nb_filters, kernel_size[0], kernel_size[1],
                        border_mode='valid',
                        input_shape=input_shape))
model.add(Activation('relu'))
model.add(Convolution2D(nb_filters, kernel_size[0], kernel_size[1]))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=pool_size))
model.add(Dropout(0.25))

model.add(Flatten())
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(nb_classes))
model.add(Activation('softmax'))

# configure model
model.compile(loss='categorical_crossentropy',
              optimizer='adadelta',
              metrics=['accuracy', 'precision', 'recall'])

# train model
model.fit(X_train, Y_train, batch_size=batch_size, nb_epoch=nb_epoch,
          verbose=1, validation_data=(X_test, Y_test))

# evaluate model with keras
score = model.evaluate(X_test, Y_test, verbose=0)
print('Test score:', score[0])
print('Test accuracy:', score[1])
print('Test precision:', score[2])
print('Test recall:', score[3])

# evaluate model with sklearn
predictions_last_epoch = model.predict(X_test, batch_size=batch_size, verbose=1)
target_names = ['class 0', 'class 1', 'class 2', 'class 3', 'class 4', 
                    'class 5', 'class 6', 'class 7', 'class 8', 'class 9']

predicted_classes = np.argmax(predictions_last_epoch, axis=1)
print('\n')
print(classification_report(y_test, predicted_classes, 
        target_names=target_names, digits = 6))

EDIT

The output of the script given above:

Test score: 0.0271549037314
Test accuracy: 0.9916
Test precision: 0.992290322304
Test recall: 0.9908


9728/10000 [============================>.] - ETA: 0s

         precision    recall  f1-score   support

class 0   0.987867  0.996939  0.992382       980
class 1   0.993860  0.998238  0.996044      1135
class 2   0.990329  0.992248  0.991288      1032
class 3   0.991115  0.994059  0.992585      1010
class 4   0.994882  0.989817  0.992343       982
class 5   0.991041  0.992152  0.991597       892
class 6   0.993678  0.984342  0.988988       958
class 7   0.992180  0.987354  0.989761      1028
class 8   0.989754  0.991786  0.990769       974
class 9   0.991054  0.988107  0.989578      1009

avg / total   0.991607  0.991600  0.991597     10000

For another model:

val/test loss: 0.231304548573
val/test categorical_accuracy: **0.978500002956**
val/test precision: *0.995103668976*
val/test recall: 0.941900001907
val/test fbeta_score: 0.967675107574
val/test mean_squared_error: 0.0064611148566
10000/10000 [==============================] - 0s     


         precision    recall  f1-score   support

class 0   0.989605  0.971429  0.980433       980
class 1   0.985153  0.993833  0.989474      1135
class 2   0.988154  0.969961  0.978973      1032
class 3   0.981373  0.991089  0.986207      1010
class 4   0.968907  0.983707  0.976251       982
class 5   0.997633  0.945067  0.970639       892
class 6   0.995690  0.964509  0.979852       958
class 7   0.987230  0.977626  0.982405      1028
class 8   0.945205  0.991786  0.967936       974
class 9   0.951429  0.990089  0.970374      1009

avg / total   *0.978964*  **0.978500**  0.978522     10000

Definition of desired metrics (for model.compile):

metrics=['categorical_accuracy', 'precision', 'recall', 'fbeta_score', 'mean_squared_error']

model.compile(loss='categorical_crossentropy',
            optimizer='sgd',
            metrics=metrics)

Output of model.metrics_names:

['loss', 'categorical_accuracy', 'precision', 'recall', 'fbeta_score', 'mean_squared_error']
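
For reference, a minimal sketch of how I compare the averaging conventions directly on the same predictions (this snippet is an addition of mine, not part of the original script; it reuses y_test and predicted_classes from above and sklearn's precision_score/recall_score):

# Sketch (assumption): print macro vs. support-weighted averages of the
# sklearn metrics, to compare against the single numbers from model.evaluate().
from sklearn.metrics import precision_score, recall_score

for avg in ('macro', 'weighted'):
    p = precision_score(y_test, predicted_classes, average=avg)
    r = recall_score(y_test, predicted_classes, average=avg)
    print('%s average: precision=%.6f, recall=%.6f' % (avg, p, r))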

Upvotes: 2

Views: 5918

Answers (1)

layser

Reputation: 2138

Yes, the values differ because sklearn's classification report gives you the weighted average based on the support of each class (the number of true instances per class).

Experiment with:

from sklearn.metrics import classification_report
y_true = [0, 1,2,1]
y_pred = [0, 0,2,0]
target_names = ['class 0', 'class 1', 'class 2']
print(classification_report(y_true, y_pred, target_names=target_names))

Gives you:

                 precision    recall  f1-score   support

    class 0       0.33      1.00      0.50         1
    class 1       0.00      0.00      0.00         2
    class 2       1.00      1.00      1.00         1

avg / total       0.33      0.50      0.38         **4**

However, the unweighted (macro) average of the precisions would be (0.33 + 0.00 + 1.00)/3 = 0.44(3), whereas, as the support column suggests, sklearn returns the support-weighted average (0.33*1 + 0.00*2 + 1.00*1)/4 = 0.3325.
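
A quick way to verify this (a sketch, using sklearn.metrics.precision_score, which is not shown in the question) is to compute both averages explicitly:

from sklearn.metrics import precision_score

y_true = [0, 1, 2, 1]
y_pred = [0, 0, 2, 0]

# unweighted (macro) mean over classes: (0.33 + 0.00 + 1.00) / 3
print(precision_score(y_true, y_pred, average='macro'))     # ~0.4433
# support-weighted mean, as in the 'avg / total' row:
# (0.33*1 + 0.00*2 + 1.00*1) / 4
print(precision_score(y_true, y_pred, average='weighted'))  # ~0.3325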

Upvotes: 2
