Keras discrepancy between .evaluate and .predict

Question

I know this question has been asked before, but I have tried all of their solutions and nothing is working for me.

My Problem:

I am running a CNN to classify some images, a typical task, nothing too crazy. I compile my model as follows:

model.compile(optimizer = keras.optimizers.Adam(learning_rate = exp_learning_rate),
              loss = tf.keras.losses.SparseCategoricalCrossentropy(),
              metrics = ['accuracy'])

I fit this on my training dataset, and evaluated on my validation dataset as follows:

history = model.fit(train_dataset, validation_data = validation_dataset, epochs = 5)

And then I evaluated on a separate test set as follows:

model.evaluate(test_dataset)

Which resulted in this:

4/4 [==============================] - 30s 7s/step - loss: 1.7180 - accuracy: 0.8627

However, when I run:

model.predict(test_dataset)

I have the following confusion matrix output:

This clearly isn't 86% accuracy like the .evaluate method tells me. In fact, it's actually 35.39% accuracy. To make sure it wasn't an issue with my testing dataset, I had my model predict on my training and validation datasets, and I still got a similar percentage (~30%), even though my training and validation accuracy reached 96% and 87%, respectively, during fitting.
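For reference, the accuracy implied by a confusion matrix is the sum of its diagonal divided by the sum of all its entries. A minimal sketch with made-up counts (the matrix below is hypothetical, not the one from this run):

```python
import numpy as np

# Hypothetical 3-class confusion matrix (rows = true class, columns = predicted class)
cf = np.array([
    [12, 10, 11],
    [ 9, 13, 12],
    [11, 10, 12],
])

# Correct predictions sit on the diagonal
accuracy = np.trace(cf) / cf.sum()
print(f"accuracy = {accuracy:.4f}")
```

A matrix with roughly equal counts in every cell, as here, is what chance-level guessing over three balanced classes would look like.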

Question:

Why are .predict and .evaluate giving different results? What's happening there? It seems that when I call .predict, the model is not using any of the weights I trained during fitting (in fact, given that there are 3 classes, this output is no better than blindly guessing each label). Are the weights from fitting not being carried over to prediction? My loss function is correct (I label-encoded my data, as TensorFlow expects for sparse_categorical_crossentropy), and when I pass 'accuracy', Keras picks the accuracy variant that corresponds to my loss function. All of this should be consistent, so why is there such a discrepancy between the results of .evaluate and .predict? Which one should I trust?

My Attempts to Fix My Issue:

I thought maybe the sparse categorical cross entropy wasn't right, so I one-hot encoded my target labels and used the categorical_crossentropy loss instead. I still have the EXACT same issue as above.
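That the switch changed nothing is arguably expected: the two losses compute the same quantity from differently encoded labels. A minimal NumPy sketch with made-up probabilities (it re-implements the formulas by hand rather than calling Keras, and ignores numerical safeguards such as clipping):

```python
import numpy as np

# Hypothetical softmax outputs for 4 samples over 3 classes
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.6, 0.3, 0.1],
])
int_labels = np.array([0, 1, 2, 1])   # label-encoded targets
one_hot = np.eye(3)[int_labels]       # the same targets, one-hot encoded

# Sparse form: pick the probability of the true class by index
sparse_ce = -np.log(probs[np.arange(len(int_labels)), int_labels]).mean()

# Categorical form: weight the log-probabilities by the one-hot row
categorical_ce = -(one_hot * np.log(probs)).sum(axis=1).mean()

# Accuracy depends only on the argmax, so the label encoding cannot change it
accuracy = (probs.argmax(axis=1) == int_labels).mean()

print(sparse_ce, categorical_ce, accuracy)
```

Since both encodings lead to the same loss and the same accuracy, a mismatch between .evaluate and .predict has to come from somewhere else in the pipeline.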

Concerns:

If .evaluate is incorrect, doesn't that mean my training and validation accuracy during fitting are inaccurate as well? Don't those use the .evaluate method too? If that's the case, what can I trust? The loss isn't a good indication of whether my model is doing well, because it is well known that minimal loss does not imply good accuracy (although the converse is usually true, depending on what standard of "good" we're using). How do I gauge my model's effectiveness if my accuracy metrics aren't correct? I don't know what to look at anymore, because I have no other way to tell whether my model is learning. If someone could please help me understand what is happening, I would appreciate it so much. I'm so frustrated.
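To illustrate the point that loss and accuracy can move independently, here is a small sketch with hypothetical numbers. It treats a sample as correct when the true class receives a probability above 0.5, which is a simplification for this example only:

```python
import numpy as np

def mean_cross_entropy(p_true_class):
    """Mean negative log-likelihood, given the probability assigned to the true class."""
    return float(-np.log(p_true_class).mean())

# Hypothetical probabilities assigned to the TRUE class for 8 samples each
model_a = np.array([0.99, 0.99, 0.99, 0.99, 0.99, 0.99, 0.99, 1e-6])  # one very confident miss
model_b = np.array([0.55, 0.55, 0.55, 0.55, 0.55, 0.55, 0.45, 0.45])  # hesitant everywhere

# Simplified correctness rule (assumption): true class probability above 0.5
acc_a, acc_b = (model_a > 0.5).mean(), (model_b > 0.5).mean()
loss_a, loss_b = mean_cross_entropy(model_a), mean_cross_entropy(model_b)

print(f"model A: accuracy {acc_a:.3f}, loss {loss_a:.3f}")
print(f"model B: accuracy {acc_b:.3f}, loss {loss_b:.3f}")
```

Model A is more accurate yet has the higher loss, because a single confidently wrong prediction dominates the mean. This only shows that a high loss next to a high accuracy is possible; it does not establish that this is what happened in the run above.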

Edit: (10-28-2021: 12:26 AM)

Ok, so I'll provide some more code to really troubleshoot this.

I originally preprocessed my data as such:

image_size = (256, 256)
batch_size = 16

train_ds = keras.preprocessing.image_dataset_from_directory(
    directory = image_directory,
    label_mode = 'categorical',
    shuffle = True,
    validation_split = 0.2,
    subset = 'training',
    seed = 24,
    batch_size = batch_size
)

val_ds = keras.preprocessing.image_dataset_from_directory(
    directory = image_directory,
    label_mode = 'categorical',
    shuffle = True,
    validation_split = 0.2,
    subset = 'validation',
    seed = 24,
    batch_size = batch_size
)

Here image_directory is a string containing the path to my images. As the documentation describes, image_dataset_from_directory returns a tf.data.Dataset object that yields batches of the respective (training or validation) data.

I imported the VGG16 architecture to do my classification so I called the respective preprocessing function for VGG16 as follows:

preprocess_input = tf.keras.applications.vgg16.preprocess_input

train_ds = train_ds.map(lambda x, y: (preprocess_input(x), y))

val_ds = val_ds.map(lambda x, y: (preprocess_input(x), y))

This transformed the images into something that was suitable as input for VGG16. Then, in my last processing steps, I did the following validation/test split:

val_batches = tf.data.experimental.cardinality(val_ds)
test_dataset = val_ds.take(val_batches // 3)
validation_dataset = val_ds.skip(val_batches // 3)

Then I proceeded to prefetch my data:

AUTOTUNE = tf.data.AUTOTUNE

train_dataset = train_ds.prefetch(buffer_size = AUTOTUNE)
validation_dataset = validation_dataset.prefetch(buffer_size = AUTOTUNE)
test_dataset = test_dataset.prefetch(buffer_size = AUTOTUNE)

The Problem:

The problem occurs in the pipeline above. I'm still not sure whether .evaluate is a true indicator of my model's accuracy, but I noticed that .evaluate and .predict always agree when my neural network is a keras.Sequential() model. What I suspect (correct me if I'm wrong) is that VGG16, when imported from the keras.applications API, is actually NOT a keras.Sequential() model, and that this is why the .predict and .evaluate results don't agree when I feed my data straight into the model. I was going to post this as an answer, but I have neither the knowledge nor the research to confirm that any of it is correct, so it stays an edit for now. Someone please chime in, because I like learning about things I know little to nothing about.

In the end, I worked around my problem by using ImageDataGenerator() instead of image_dataset_from_directory() as follows:

train_datagen = ImageDataGenerator(
    preprocessing_function = preprocess_input,
    width_shift_range = 0.2,
    height_shift_range = 0.2,
    shear_range = 0.2,
    zoom_range = 0.2,
    horizontal_flip = True
)

val_datagen = ImageDataGenerator(
    preprocessing_function = preprocess_input
)


train_ds = train_datagen.flow_from_directory(
    train_image_directory,
    target_size = (224, 224),
    batch_size = 16,
    seed = 24,
    shuffle = True,
    classes = ['class1', 'class2', 'class3'],
    class_mode = 'categorical'
)

test_ds = val_datagen.flow_from_directory(
    test_image_directory,
    target_size = (224, 224),
    batch_size = 16,
    seed = 24,
    shuffle = False,
    classes = ['class1', 'class2', 'class3'],
    class_mode = 'categorical'
)

(NOTE: I based this on the following page from TensorFlow's documentation: https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/image/ImageDataGenerator#flow_from_directory)

This completes all the preprocessing for me. Now the accuracy reported by model.evaluate(test_ds) matches the accuracy I compute from the output of model.predict(test_ds). After some minor processing of the prediction output, I use the following code for my confusion matrix:

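import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

# shuffle = False above keeps predictions in the same order as test_ds.classes
Y_pred = model.predict(test_ds)
y_pred = np.argmax(Y_pred, axis = 1)

class_names = list(test_ds.class_indices.keys())  # assumed definition of class_names
cf = confusion_matrix(test_ds.classes, y_pred)
sns.heatmap(cf, annot = True, xticklabels = class_names,
            yticklabels = class_names)
plt.title('Performance of Model on Testing Set')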

This eliminates the discrepancy between the confusion matrix and the result of model.evaluate(test_ds).

The Takeaway:

If you're loading images into a classification model and your loss and accuracy agree with each other, but your predictions disagree with those metrics, try a different preprocessing pipeline. I usually load my images with the image_dataset_from_directory() method for all my keras.Sequential() models. For the VGG16 model, however, which I suspect is not a Sequential() model, using ImageDataGenerator(...).flow_from_directory(...) gave the model input in a format for which its predictions are consistent with the performance metrics.

TLDR I didn't answer any of my original questions, but I found a workaround. Sorry if this is spam in any way. As is the nature of most Stack Overflow posts, I hope my turmoil in the last few hours helps someone way in the future.

Mario · Accepted Answer

I had the same problem, and the odd behaviour persisted even with ImageDataGenerator.

But I think the problem is the shuffle flag of the validation set.

You changed that from here:

 val_ds = keras.preprocessing.image_dataset_from_directory(
     directory = image_directory,
     label_mode = 'categorical',
     shuffle = True,
     validation_split = 0.2,
     subset = 'validation',
     seed = 24,
     batch_size = batch_size
 )

To here:

 test_ds = val_datagen.flow_from_directory(
     test_image_directory,
     target_size = (224, 224),
     batch_size = 16,
     seed = 24,
     shuffle = False,
     classes = ['class1', 'class2', 'class3'],
     class_mode = 'categorical'
 )
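One way the shuffle flag could produce exactly this symptom, assuming the true labels for the confusion matrix were collected by iterating the shuffled dataset separately from the model.predict call (the question does not show that step): a shuffled dataset yields a new order on each pass, so predictions from one pass no longer line up with labels from another. The sketch below simulates this in plain Python. The helper names and the "perfect model" are hypothetical stand-ins, not Keras APIs:

```python
import random

def shuffled_batches(samples, batch_size, rng):
    """Yield batches in a fresh random order on every call, like a reshuffling dataset."""
    order = list(samples)
    rng.shuffle(order)
    for start in range(0, len(order), batch_size):
        yield order[start:start + batch_size]

def perfect_model(features):
    # Hypothetical stand-in for a trained classifier that is always right:
    # each feature already equals its label
    return list(features)

def accuracy(preds, labels):
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

rng = random.Random(24)
samples = [(i % 3, i % 3) for i in range(300)]  # (feature, label) pairs, 3 classes

# Two passes: predictions come from one iteration, labels from another
preds = [p for batch in shuffled_batches(samples, 16, rng)
         for p in perfect_model([x for x, _ in batch])]
labels = [y for batch in shuffled_batches(samples, 16, rng) for _, y in batch]
two_pass_acc = accuracy(preds, labels)

# One pass: predictions and labels are taken from the same batches
preds_same, labels_same = [], []
for batch in shuffled_batches(samples, 16, rng):
    preds_same.extend(perfect_model([x for x, _ in batch]))
    labels_same.extend(y for _, y in batch)
one_pass_acc = accuracy(preds_same, labels_same)

print(f"two passes: {two_pass_acc:.2f}, one pass: {one_pass_acc:.2f}")
```

Even though the simulated model never makes a mistake, the two-pass accuracy should land near chance for three balanced classes, while the one-pass accuracy is 1.0. Under the same assumption, the practical options would be to set shuffle = False for the dataset you predict on, or to collect labels and predictions inside a single loop over the dataset.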
