Reputation: 2091
I am getting very inconsistent test accuracies from my model but cannot figure out why.
I was trying to benchmark some TensorFlow/Keras stuff and noticed my results were unreliable. Not the timing, but the test accuracy of the model. On some runs, the model would achieve a test accuracy = 0.65, and sometimes it would achieve only 0.35. Same architecture, same optimizer trained on the same dataset, same everything.
I tried to remove most sources of randomness, such as using the same numpy rng seed, TensorFlow rng seed, resetting the TensorFlow backend between runs, and using the same seed for shuffling the dataset, etc. The issue persists.
Can someone please help me figure out what I am doing wrong?
Here is a reproducible example on Google Colab, but the issue also happens locally with different versions of Python, TensorFlow, and CUDA/cuDNN.
And here is the dataset I am using. It is just CIFAR-10 with the training partition split into train and validation.
Edit:
It appears the issue is caused by some interaction between .shuffle and .batch, more specifically their parameters buffer_size, reshuffle_each_iteration, and drop_remainder.
The following combination appears to avoid the problem (at the cost of high memory usage):
dataset = (
    dataset.shuffle(buffer_size=dataset.cardinality().numpy(),
                    reshuffle_each_iteration=False,
                    seed=rng_seed)
           .batch(batch_size=batch_size,
                  drop_remainder=True)
)
Upvotes: 3
Views: 3208
Reputation: 902
For speed, I fixed the number of epochs to 1 in all of the following experiments.
The most heavy-handed solution is to add the following two lines before you build the model. It needs no other changes and also covers Python's random module:
tf.keras.utils.set_random_seed(some_seed)
tf.config.experimental.enable_op_determinism()
i.e.:
def why_u_inconsistent():
    # tf.keras.backend.clear_session()
    # tf.random.set_seed(42)
    # np.random.seed(42)
    tf.keras.utils.set_random_seed(0)
    tf.config.experimental.enable_op_determinism()
    train, val, test = get_train_val_test()
    model = get_model()
    model.fit(train, validation_data=val, epochs=1, verbose=0)
    scores = model.evaluate(test, verbose=0)
    for name, value in zip(model.metrics_names, scores):
        print(f"test {name}: {value}")
The output is exactly the same on every run:
run: 0
Found 42500 files belonging to 10 classes.
Found 7500 files belonging to 10 classes.
Found 10000 files belonging to 10 classes.
test loss: 3.872633934020996
test accuracy: 0.1818999946117401
=====
run: 1
Found 42500 files belonging to 10 classes.
Found 7500 files belonging to 10 classes.
Found 10000 files belonging to 10 classes.
test loss: 3.872633934020996
test accuracy: 0.1818999946117401
=====
run: 2
Found 42500 files belonging to 10 classes.
Found 7500 files belonging to 10 classes.
Found 10000 files belonging to 10 classes.
test loss: 3.872633934020996
test accuracy: 0.1818999946117401
=====
run: 3
Found 42500 files belonging to 10 classes.
Found 7500 files belonging to 10 classes.
Found 10000 files belonging to 10 classes.
test loss: 3.872633934020996
test accuracy: 0.1818999946117401
=====
run: 4
Found 42500 files belonging to 10 classes.
Found 7500 files belonging to 10 classes.
Found 10000 files belonging to 10 classes.
test loss: 3.872633934020996
test accuracy: 0.1818999946117401
P.S. When setting the random seed, use tf.keras.utils.set_random_seed, which seeds random, numpy.random, and tensorflow.random. These are three separate generators whose seeds do not interact with each other, but all of them affect our code because we inevitably use them. They can of course be seeded separately; tf.keras.utils.set_random_seed is just a convenient shortcut. Unfortunately, your original code did not set the seed of the built-in random module.
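As a rough illustration of seeding the three generators separately, here is a minimal sketch. The helper name seed_everything is hypothetical, and the NumPy and TensorFlow parts are skipped if those packages are not installed.

```python
import random


def seed_everything(seed: int) -> None:
    """Seed Python, NumPy, and TensorFlow generators one by one (hypothetical helper)."""
    random.seed(seed)  # built-in random module
    try:
        import numpy as np
        np.random.seed(seed)  # NumPy's global generator
    except ImportError:
        pass
    try:
        import tensorflow as tf
        tf.random.set_seed(seed)  # TensorFlow's global seed
    except ImportError:
        pass


seed_everything(0)
first = random.random()
seed_everything(0)
second = random.random()
print(first == second)  # True: re-seeding replays the same sequence
```

Seeding only one of the three leaves the other two untouched, which is why forgetting the built-in random module can still cause run-to-run differences.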
Then I checked whether the data pipeline's output is the same for every run_nr. I wrote each batch's reduce_mean to a csv file to compare the output of each run_nr, and found no difference at all. So we can classify the problem as a determinism problem, which I have explained here.
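A check of this kind could look roughly like the sketch below. It is only an illustration: plain lists of floats stand in for the real batches, the CSV is kept in memory rather than written to a file, and the helper names are hypothetical. With a real tf.data pipeline, each value would come from something like tf.reduce_mean(batch).numpy().

```python
import csv
import io
from statistics import fmean
from typing import List, Optional, Sequence


def batch_means(batches: Sequence[Sequence[float]]) -> List[float]:
    """One summary number per batch (stand-in for tf.reduce_mean)."""
    return [fmean(b) for b in batches]


def to_csv(values: Sequence[float]) -> str:
    """Serialize (batch index, mean) rows; repr keeps full float precision."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for i, value in enumerate(values):
        writer.writerow([i, repr(value)])
    return buf.getvalue()


def first_mismatch(csv_a: str, csv_b: str) -> Optional[int]:
    """Return the index of the first differing row, or None if identical."""
    rows_a = list(csv.reader(io.StringIO(csv_a)))
    rows_b = list(csv.reader(io.StringIO(csv_b)))
    for i, (a, b) in enumerate(zip(rows_a, rows_b)):
        if a != b:
            return i
    if len(rows_a) != len(rows_b):
        return min(len(rows_a), len(rows_b))
    return None


# Made-up data: runs 0 and 1 see identical batches, run 2 differs in batch 1.
run0 = [[1.0, 2.0], [3.0, 4.0]]
run1 = [[1.0, 2.0], [3.0, 4.0]]
run2 = [[1.0, 2.0], [4.0, 3.5]]

print(first_mismatch(to_csv(batch_means(run0)), to_csv(batch_means(run1))))  # None
print(first_mismatch(to_csv(batch_means(run0)), to_csv(batch_means(run2))))  # 1
```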
As for your specific problem, I found a better explanation: two key factors, momentum and learning_rate, make your algorithm suffer from the determinism problem. Unless we use tf.config.experimental.enable_op_determinism, results can differ in the last few bits of precision (this may be caused by CUDA or the TensorFloat-32 dtype), but momentum accumulates that error and learning_rate amplifies it. If we use an SGD optimizer without momentum (not that you must use it), like this:
...
optimizer = tf.keras.optimizers.SGD(0.001)
model.compile(
    optimizer=optimizer,
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
...
def why_u_inconsistent(index):
    tf.keras.utils.set_random_seed(0)
    # tf.config.experimental.enable_op_determinism()
    train, val, test = get_train_val_test()
    model = get_model()
    model.fit(train, validation_data=val, epochs=1, verbose=0)
    scores = model.evaluate(test, verbose=0)
    for name, value in zip(model.metrics_names, scores):
        print(f"test {name}: {value}")
...
The output after 1 epoch looks like the following, where the results are very close:
run: 0
Found 42500 files belonging to 10 classes.
Found 7500 files belonging to 10 classes.
Found 10000 files belonging to 10 classes.
test loss: 2.20281720161438
test accuracy: 0.1808999925851822
=====
run: 1
Found 42500 files belonging to 10 classes.
Found 7500 files belonging to 10 classes.
Found 10000 files belonging to 10 classes.
test loss: 2.202397346496582
test accuracy: 0.1809999942779541
=====
run: 2
Found 42500 files belonging to 10 classes.
Found 7500 files belonging to 10 classes.
Found 10000 files belonging to 10 classes.
test loss: 2.2010912895202637
test accuracy: 0.18019999563694
=====
run: 3
Found 42500 files belonging to 10 classes.
Found 7500 files belonging to 10 classes.
Found 10000 files belonging to 10 classes.
test loss: 2.200165271759033
test accuracy: 0.18230000138282776
=====
run: 4
Found 42500 files belonging to 10 classes.
Found 7500 files belonging to 10 classes.
Found 10000 files belonging to 10 classes.
test loss: 2.2025880813598633
test accuracy: 0.18019999563694
=====
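The claim that momentum accumulates a small per-step error and the learning rate scales it can be illustrated with a toy calculation. The sketch below is a deliberately simplified model, not a simulation of the actual network: it assumes the same tiny gradient discrepancy eps on every step (a worst case, since real rounding errors vary in sign) and ignores any feedback from the loss surface. The function name drift and all numbers are made up for illustration.

```python
def drift(lr: float, momentum: float, eps: float, steps: int) -> float:
    """Accumulated weight discrepancy caused by a constant gradient error eps.

    Uses the usual momentum update: velocity = momentum * velocity + gradient,
    weight -= lr * velocity, applied only to the error term.
    """
    velocity = 0.0
    weight_error = 0.0
    for _ in range(steps):
        velocity = momentum * velocity + eps  # momentum keeps re-adding old errors
        weight_error += lr * velocity         # learning rate scales each contribution
    return weight_error


plain = drift(lr=0.001, momentum=0.0, eps=1e-7, steps=1000)
with_momentum = drift(lr=0.001, momentum=0.9, eps=1e-7, steps=1000)
doubled_lr = drift(lr=0.002, momentum=0.9, eps=1e-7, steps=1000)

print(with_momentum / plain)       # close to 1 / (1 - 0.9) = 10
print(doubled_lr / with_momentum)  # close to 2
```

In this toy model, momentum 0.9 makes the accumulated discrepancy roughly ten times larger than plain SGD, and doubling the learning rate doubles it, which is consistent with the closer results observed above once momentum is removed.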
So let's draw a better conclusion:
Use tf.keras.utils.set_random_seed(some_seed) to control all random behavior; this is required in any case.
Tune the algorithm as much as possible to reduce its sensitivity to numerical precision. That is, keep the algorithm from producing or accumulating small precision errors. There are many adjustable knobs, not only the optimizer or the optimizer's parameters, but also other things such as batch_size, gradient_clip, gradient_norm, etc.
Use tf.config.experimental.enable_op_determinism if you want completely consistent, reproducible results and can accept its side effect (slower execution).
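For readers wondering where the "small errors in accuracy" come from in the first place: floating-point addition is not associative, so summing the same numbers in a different order (as parallel hardware may do from run to run) can change the last bits of the result. The following plain-Python snippet shows the effect with ordinary 64-bit floats; it only illustrates the general mechanism and does not reproduce what any particular GPU kernel does.

```python
# Grouping the same three numbers differently changes the last bit.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)      # False
print(abs(left - right))  # on the order of 1e-16

# With values of very different magnitude the effect is much larger.
big_first = (1e16 + 1.0) - 1e16  # the 1.0 is lost when added to 1e16
big_last = (1e16 - 1e16) + 1.0   # the 1.0 survives
print(big_first, big_last)       # 0.0 1.0
```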
Upvotes: 4
Reputation: 802
It seems you have two problems that cause the different results on each run:
tf.data.AUTOTUNE
Solution for 1: Switch the hardware accelerator of your Colab runtime to "None".
Solution for 2:
Change the prefetch argument in your return statement from tf.data.AUTOTUNE to a fixed buffer size (e.g. 128):
train.prefetch(128),
val.prefetch(128),
test.prefetch(128),
You can also see my adaptation of your code here.
Upvotes: 1