kawa biker

Reputation: 11

Strongly different accuracy values from model.evaluate(test_set) and from sklearn's classification_report

I'm experimenting on Colab with image classification on 32x32 pixel images; I have 248 pics for training and 62 for testing (far too few, I know, but it's just for experimenting...). There are only two classes and I get the data as follows:

train_datagen = tf.keras.preprocessing.image.ImageDataGenerator(
                rescale=1./255,
                shear_range=0.2,
                zoom_range=0.2,
                horizontal_flip=True)
training_set = train_datagen.flow_from_directory(
               'training_set', target_size=(32,32),
               class_mode='binary')

test_datagen = tf.keras.preprocessing.image.ImageDataGenerator(
                rescale=1./255)
test_set = test_datagen.flow_from_directory(
               'test_set', target_size=(32,32),
               class_mode='binary')
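
(flow_from_directory infers the two classes from the sub-folder names, so the directory layout is roughly like this; the class folder names here are just placeholders:)

training_set/
    class_a/   <- images of the first class
    class_b/   <- images of the second class
test_set/
    class_a/
    class_b/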

My current CNN architecture is this:

cnn = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(64, 3, activation='relu', input_shape=[32,32,3]),
    tf.keras.layers.AveragePooling2D(2),
    tf.keras.layers.Conv2D(64, 3, activation='relu'),
    tf.keras.layers.AveragePooling2D(2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.5),   
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.5),   
    tf.keras.layers.Dense(1, activation='sigmoid'),
])

and for compiling:

cnn.compile(optimizer='adam',loss='binary_crossentropy',
           metrics=['accuracy'])

training:

hist = cnn.fit(x=training_set, validation_data=test_set, epochs=30)

after 30 epochs, the model gives:

Epoch 30/30 8/8 [==============================] - 1s 168ms/step - loss: 0.4237 - accuracy: 0.8347 - val_loss: 0.5812 - val_accuracy: 0.7419

I evaluated on the test data:

cnn.evaluate(test_set)

which gave me:

2/2 [==============================] - 0s 80ms/step - loss: 0.5812 - accuracy: 0.7419

[0.5812247395515442, 0.7419354915618896]

This would be nice for such a small dataset, but checking the results with a classification report from sklearn gives a much lower value (which turns out to be the correct one) of only 0.48 accuracy. To get this value, I did

predictions = cnn.predict(test_set)

I transformed the probability values in predictions into 0 or 1 (threshold 0.5) to get the predicted labels; that step looked roughly like this:
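
# turn the sigmoid probabilities returned by predict into hard 0/1 labels
predicted_labels = (predictions > 0.5).astype(int).ravel()

I then compared these predicted labels with the correct labels of the test data in the classification_report: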

from sklearn.metrics import confusion_matrix, classification_report
print(classification_report(test_labels, predicted_labels))

The report showed:

              precision    recall  f1-score   support

           0       0.48      0.52      0.50        31
           1       0.48      0.45      0.47        31

    accuracy                           0.48        62
   macro avg       0.48      0.48      0.48        62
weighted avg       0.48      0.48      0.48        62

So why can't the model.evaluate(...) function calculate the correct accuracy, or put differently: what exactly does this evaluate function calculate? What is the meaning of this number 0.7419?

Does anybody have an idea about this problem?

Upvotes: 1

Views: 756

Answers (2)

Jide

Reputation: 29

You can define a new test generator, but this time set shuffle to False.

import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

new_test_datagen = ImageDataGenerator(rescale=1./255)
new_test_generator = new_test_datagen.flow_from_directory(test_dir,
                                  target_size=(150,150),
                                  shuffle=False,
                                  batch_size=32,
                                  seed=None)

# Display classification report and accuracy score for a softmax classifier
from sklearn.metrics import classification_report, accuracy_score
softmax_y_true = new_test_generator.classes                  # labels in (unshuffled) file order
softmax_y_pred = model.predict(new_test_generator)
softmax_y_pred = np.argmax(softmax_y_pred, axis=1)           # pick the most probable class per sample

print("Accuracy: {0}".format(accuracy_score(softmax_y_true, softmax_y_pred)))

Upvotes: 0

kawa biker

Reputation: 11

I've found the well-hidden reason for this problem. It lies in the order of getting the list of all test_labels (the ground truth) and of running the predictions on the test data with model.predict(test_set).

I found that the method predict(test_set) mixes up (reshuffles) the content of test_set!

So I saved the labels of the test_set BEFORE calling predict(test_set), and now I get a perfect match between the accuracy in my classification_report and the accuracy from evaluate(test_set) / the val_accuracy.

I also ran predict on each single object in test_set and calculated the accuracy myself, and this accuracy also matched the val_accuracy from the last epoch.

By the way: the method evaluate(test_set) also mixes up the content of test_set! So one has to be very careful when extracting data from test_set "manually".
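
For reference, a minimal sketch that avoids the ordering problem altogether, following the shuffle=False idea from the other answer (safe_test_set is just a placeholder name for a fresh, non-shuffling generator):

from sklearn.metrics import classification_report

safe_test_set = test_datagen.flow_from_directory(
               'test_set', target_size=(32,32),
               class_mode='binary', shuffle=False)

test_labels = safe_test_set.classes                         # ground truth in file order
predictions = cnn.predict(safe_test_set)                    # predictions in the same (unshuffled) order
predicted_labels = (predictions > 0.5).astype(int).ravel()  # threshold the sigmoid outputs

print(classification_report(test_labels, predicted_labels))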

Upvotes: 0
