Trauer

Reputation: 2091

Training the same model on the same data yielding extremely different test accuracy

I am getting very inconsistent test accuracies from my model but cannot figure out why.

I was trying to benchmark some TensorFlow/Keras code and noticed my results were unreliable. Not the timing, but the test accuracy of the model. On some runs the model would reach a test accuracy of 0.65, and on others only 0.35. Same architecture, same optimizer, trained on the same dataset, same everything.

I tried to remove most sources of randomness: using the same NumPy RNG seed, the same TensorFlow RNG seed, resetting the TensorFlow backend between runs, using the same seed for shuffling the dataset, etc. The issue persists.
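Concretely, the reset I do between runs looks roughly like this (a simplified sketch, not my exact benchmarking code):

import numpy as np
import tensorflow as tf

def reset_between_runs(seed=42):
    # Clear the Keras/TensorFlow session state left over from the previous run
    tf.keras.backend.clear_session()
    # Seed the NumPy and TensorFlow global generators
    np.random.seed(seed)
    tf.random.set_seed(seed)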

Can someone please help me figure out what I am doing wrong?

Here is a reproducible example on Google Colab, but the issue also happens locally with different versions of Python, TensorFlow, and CUDA/cuDNN.

And here is the dataset I am using. It is just CIFAR-10 with the training partition split into train and validation.

edit: It appears the issue is caused by some interaction between .shuffle and .batch, specifically their parameters buffer_size, reshuffle_each_iteration, and drop_remainder. The following combination appears to avoid the problem (but incurs high memory usage):

dataset = (dataset
           .shuffle(buffer_size=dataset.cardinality().numpy(),
                    reshuffle_each_iteration=False,
                    seed=rng_seed)
           .batch(batch_size=batch_size,
                  drop_remainder=True))

Upvotes: 3

Views: 3208

Answers (2)

Little Train

Reputation: 902

For speed, I fixed the number of epochs to 1 in all of the following experiments.

The most brute-force solution is to add the following two lines before building the model; it requires no other changes and still supports randomness (you can pick any seed):

tf.keras.utils.set_random_seed(some_seed)
tf.config.experimental.enable_op_determinism()

i.e.:

def why_u_inconsistent():
    # tf.keras.backend.clear_session()
    # tf.random.set_seed(42)
    # np.random.seed(42)
    tf.keras.utils.set_random_seed(0)
    tf.config.experimental.enable_op_determinism()
    train, val, test = get_train_val_test()
    model = get_model()
    model.fit(train, validation_data=val, epochs=1, verbose=0)
    scores = model.evaluate(test, verbose=0)
    for name, value in zip(model.metrics_names, scores):
        print(f"test {name}: {value}")

The output is (you'll be satisfied: every run is exactly identical):

run: 0
Found 42500 files belonging to 10 classes.
Found 7500 files belonging to 10 classes.
Found 10000 files belonging to 10 classes.
test loss: 3.872633934020996
test accuracy: 0.1818999946117401
=====
run: 1
Found 42500 files belonging to 10 classes.
Found 7500 files belonging to 10 classes.
Found 10000 files belonging to 10 classes.
test loss: 3.872633934020996
test accuracy: 0.1818999946117401
=====
run: 2
Found 42500 files belonging to 10 classes.
Found 7500 files belonging to 10 classes.
Found 10000 files belonging to 10 classes.
test loss: 3.872633934020996
test accuracy: 0.1818999946117401
=====
run: 3
Found 42500 files belonging to 10 classes.
Found 7500 files belonging to 10 classes.
Found 10000 files belonging to 10 classes.
test loss: 3.872633934020996
test accuracy: 0.1818999946117401
=====
run: 4
Found 42500 files belonging to 10 classes.
Found 7500 files belonging to 10 classes.
Found 10000 files belonging to 10 classes.
test loss: 3.872633934020996
test accuracy: 0.1818999946117401

P.S. When setting random seeds, we should use tf.keras.utils.set_random_seed, which covers random, numpy.random, and tensorflow.random. These are three separate generator instances: their seeds do not interact with each other, but all three influence our code because we inevitably use all of them. They can of course be seeded separately; tf.keras.utils.set_random_seed is just a convenient shortcut. Unfortunately, your original code did not seed the built-in random module.
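In other words, the one-liner is roughly equivalent to seeding the three generators yourself:

import random
import numpy as np
import tensorflow as tf

some_seed = 0

# What tf.keras.utils.set_random_seed(some_seed) does, spelled out:
random.seed(some_seed)         # Python's built-in generator (missing in the original code)
np.random.seed(some_seed)      # NumPy's legacy global generator
tf.random.set_seed(some_seed)  # TensorFlow's global generator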

Then I checked whether the data pipeline's output is identical across runs. I wrote each batch's reduce_mean to a CSV file and compared the outputs of different run_nr values: there was no difference at all. So we can narrow this down to a determinism problem, as I have explained here.
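The check itself can be done with something like this (a sketch of the idea, not the exact script I used):

import csv
import tensorflow as tf

def dump_batch_means(dataset, path):
    # Write one row per batch: the mean of the image tensor,
    # so the outputs of two runs can be diffed cheaply.
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for images, labels in dataset:
            writer.writerow([float(tf.reduce_mean(tf.cast(images, tf.float32)).numpy())])

# e.g. dump_batch_means(train, f"train_means_run{run_nr}.csv")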

Coming back to your problem, there is a better explanation: two key factors, momentum and learning_rate, make your algorithm suffer from this determinism problem. As long as we do not use tf.config.experimental.enable_op_determinism, any algorithm is uncertain in the last few bits of precision (this may be caused by CUDA or the TensorFloat-32 dtype), but momentum accumulates that uncertainty and the learning rate amplifies it. If we use an SGD optimizer without momentum (not that you must use it), like this:

...
    optimizer = tf.keras.optimizers.SGD(0.001)
    model.compile(
        optimizer=optimizer,
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
...
def why_u_inconsistent(index):
    tf.keras.utils.set_random_seed(0)
    # tf.config.experimental.enable_op_determinism()
    train, val, test = get_train_val_test()
    
    model = get_model()
    model.fit(train, validation_data=val, epochs=1, verbose=0)
    scores = model.evaluate(test, verbose=0)
    for name, value in zip(model.metrics_names, scores):
        print(f"test {name}: {value}")
...

the output after 1 epoch will look like the following, where the results are very close but not identical:

run: 0
Found 42500 files belonging to 10 classes.
Found 7500 files belonging to 10 classes.
Found 10000 files belonging to 10 classes.
test loss: 2.20281720161438
test accuracy: 0.1808999925851822
=====
run: 1
Found 42500 files belonging to 10 classes.
Found 7500 files belonging to 10 classes.
Found 10000 files belonging to 10 classes.
test loss: 2.202397346496582
test accuracy: 0.1809999942779541
=====
run: 2
Found 42500 files belonging to 10 classes.
Found 7500 files belonging to 10 classes.
Found 10000 files belonging to 10 classes.
test loss: 2.2010912895202637
test accuracy: 0.18019999563694
=====
run: 3
Found 42500 files belonging to 10 classes.
Found 7500 files belonging to 10 classes.
Found 10000 files belonging to 10 classes.
test loss: 2.200165271759033
test accuracy: 0.18230000138282776
=====
run: 4
Found 42500 files belonging to 10 classes.
Found 7500 files belonging to 10 classes.
Found 10000 files belonging to 10 classes.
test loss: 2.2025880813598633
test accuracy: 0.18019999563694
=====

So let's draw a better conclusion:

  1. Using tf.keras.utils.set_random_seed(some_seed) to control all random behavior is necessary.

  2. Tune the algorithm as much as possible to mitigate the impact of floating-point precision, i.e., keep it from producing or accumulating small numerical errors. There are many adjustable knobs: not only the optimizer and its parameters, but also things such as batch_size, gradient clipping, gradient norm, etc. (see the sketch after this list).

  3. Use tf.config.experimental.enable_op_determinism if you want completely consistent, reproducible results and can accept its side effect (slower execution).
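As one illustration of point 2 (a sketch with illustrative values, not the questioner's actual setup): Keras optimizers accept gradient clipping directly, e.g. clipnorm, which limits how large each update can get and so slows the growth of tiny numerical differences:

import tensorflow as tf

# Plain SGD (no momentum) with per-variable gradient clipping by norm;
# the learning_rate and clipnorm values here are illustrative.
optimizer = tf.keras.optimizers.SGD(learning_rate=0.001, clipnorm=1.0)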

Upvotes: 4

Fabian

Reputation: 802

It seems you have two problems that are causing the different results on each run:

  1. Using GPU
  2. Using tf.data.AUTOTUNE

Solution for 1: Switch your Colab runtime's hardware accelerator to "None" (i.e., run on CPU)

[Screenshots: Colab "Change runtime type" dialog with the hardware accelerator set to "None"]

Solution for 2: Change the tf.data.AUTOTUNE in your return statement to a fixed prefetch buffer size (e.g. 128):

return (
    train.prefetch(128),
    val.prefetch(128),
    test.prefetch(128),
)

You can also see my adaptation of your code here.

Upvotes: 1
