PartialOrder

Reputation: 2960

TF accuracy score and confusion matrix disagree. Is TensorFlow shuffling data on each access of BatchDataset?

The accuracy reported by model.evaluate() is very different from the accuracy calculated from the Sklearn or TF confusion matrix.

import numpy as np
import tensorflow as tf
from sklearn.metrics import confusion_matrix
...

training_data, validation_data, testing_data = load_img_datasets()
# These ^ are tensorflow.python.data.ops.dataset_ops.BatchDataset

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = create_model(INPUT_SHAPE, NUM_CATEGORIES)
    optimizer = tf.keras.optimizers.Adam()
    metrics = ['accuracy']
    model.compile(loss='categorical_crossentropy',
                  optimizer=optimizer,
                  metrics=metrics)

history = model.fit(training_data, epochs=epochs,
                    validation_data=validation_data)

testing_data.shuffle(len(testing_data), reshuffle_each_iteration=False)
# I think this ^ is preventing additional shuffles on access

loss, accuracy = model.evaluate(testing_data)
print(f"Accuracy: {(accuracy * 100):.2f}%")
# Prints 
# Accuracy: 78.7%

y_hat = model.predict(testing_data)
y_test = np.concatenate([y for x, y in testing_data], axis=0)
c_matrix = confusion_matrix(np.argmax(y_test, axis=-1),
                            np.argmax(y_hat, axis=-1))
print(c_matrix)
# Prints result that does not agree:
# Confusion matrix:
#[[ 72 111  54  15  69]
# [ 82 100  44  16  78]
# [ 64 114  52  21  69]
# [ 71 106  54  21  68]
# [ 79 101  51  25  64]]
# Accuracy calculated from CM = 19.3%
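
The 19.3% figure is the sum of the confusion-matrix diagonal over the total number of test samples:

cm_accuracy = np.trace(c_matrix) / c_matrix.sum()
print(f"Accuracy from CM: {cm_accuracy * 100:.2f}%")
# Prints: Accuracy from CM: 19.30%  (309 / 1601)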

At first, I thought that TensorFlow was shuffling testing_data on each access, so I added testing_data.shuffle(len(testing_data), reshuffle_each_iteration=False), but the results still do not agree.
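
Note: tf.data.Dataset.shuffle returns a new dataset rather than modifying testing_data in place, so for reshuffle_each_iteration=False to take effect the result would have to be reassigned, roughly:

# shuffle() is not in-place; keep the returned dataset
testing_data = testing_data.shuffle(len(testing_data),
                                    reshuffle_each_iteration=False)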

I have also tried the TF confusion matrix:

y_hat = model.predict(testing_data)
y_test = np.concatenate([y for x, y in testing_data], axis=0)
true_class = tf.argmax(y_test, 1)
predicted_class = tf.argmax(y_hat, 1)
cm = tf.math.confusion_matrix(true_class, predicted_class, NUM_CATEGORIES)
print(cm)

...with a similar result.
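
A direct way to test the reshuffling hypothesis (not in my original code) would be to compare the label order across two consecutive passes over testing_data:

# If the dataset reshuffles on each iteration, the label order will
# differ between two passes.
first_pass = np.concatenate([np.argmax(y, axis=-1) for _, y in testing_data])
second_pass = np.concatenate([np.argmax(y, axis=-1) for _, y in testing_data])
print("Same label order on both passes:", np.array_equal(first_pass, second_pass))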

Obviously the predicted labels must be compared with the correct labels. What am I doing wrong?

Upvotes: 1

Views: 864

Answers (1)

Frightera

Reputation: 5079

I could not find the source, but it seems like TensorFlow is still shuffling the test data under the hood. You can iterate over the dataset to collect the predictions and the true classes in matching order:

predicted_classes = np.array([])
true_classes = np.array([])

# Collect predictions and labels in a single pass over the dataset so
# they stay aligned batch by batch.
for x, y in testing_data:
    predicted_classes = np.concatenate([predicted_classes,
                                        np.argmax(model(x), axis=-1)])
    true_classes = np.concatenate([true_classes,
                                   np.argmax(y.numpy(), axis=-1)])

model(x) is used here for faster execution. From the docs:

Computation is done in batches. This method is designed for performance in large scale inputs. For small amount of inputs that fit in one batch, directly using __call__ is recommended for faster execution, e.g., model(x)

If it does not work, you can try model.predict(x) instead.
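
Once the two arrays are collected in the same pass, the confusion matrix and accuracy computed from them should line up with what model.evaluate() reports, e.g. using the sklearn import from the question:

from sklearn.metrics import confusion_matrix

true_classes = true_classes.astype(int)
predicted_classes = predicted_classes.astype(int)

print(confusion_matrix(true_classes, predicted_classes))
print(f"Accuracy: {(true_classes == predicted_classes).mean() * 100:.2f}%")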

Upvotes: 3
