Mr. NLP

Reputation: 1001

Impact of data shuffling on results reproducibility in Pytorch

Pytorch dataloader class has the following constructor:

DataLoader(dataset, batch_size=1, shuffle=False, sampler=None,
           batch_sampler=None, num_workers=0, collate_fn=None,
           pin_memory=False, drop_last=False, timeout=0,
           worker_init_fn=None)

When shuffle is set to True, the data is reshuffled at every epoch. Shuffling the order in which examples are fed to the classifier is helpful so that batches between epochs do not look alike. Doing so eventually makes the model more robust.
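A small sketch of this behavior (assuming PyTorch is installed; the dataset here is just a toy tensor of indices): with shuffle=True the loader draws a fresh permutation of the dataset each epoch, so the batch contents differ between epochs while every example is still visited exactly once.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset of 8 examples, each example is just its own index.
dataset = TensorDataset(torch.arange(8))

# shuffle=True -> a new random permutation is drawn at the start of each epoch.
loader = DataLoader(dataset, batch_size=4, shuffle=True)

for epoch in range(2):
    # Concatenate the batches to see the order the examples were served in.
    order = torch.cat([batch[0] for batch in loader]).tolist()
    print(f"epoch {epoch}: {order}")
```

Each printed list is a permutation of 0..7, and the two epochs will typically show different orders.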

However, what I'm unable to understand is: with shuffle=True, can we still get the same accuracy value across different runs?

Upvotes: 1

Views: 1107

Answers (1)

Shai

Reputation: 114816

The main algorithm/principle deep learning is based on is weight optimization using stochastic gradient descent (and its variants). Being a stochastic algorithm, you cannot expect to get exactly the same results if you run it multiple times. In fact, you should see some variation, but the results should be "roughly the same".

If you need to have exactly the same results when running your algorithm multiple times, you should look into reproducibility of results - which is a very delicate subject.

In summary:
1. If you do not shuffle at all, you will have perfect reproducibility, but the resulting accuracy is expected to be very low.
2. If you randomly shuffle (what most of the world does), you should expect slightly different accuracy values for each run, but they should all be significantly higher than the values of (1) "no shuffle".
3. If you follow the guidelines of reproducible results, you should have the exact same accuracy values for each run and they should be close to the values of (2) "shuffle".
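Option (3) can be sketched as follows. This is an assumption-laden recipe, not an official one: it seeds Python, and PyTorch, and passes a seeded torch.Generator to the DataLoader so the shuffle order itself repeats across runs (the generator argument exists in recent PyTorch versions; the constructor shown in the question predates it). Full determinism on GPU needs additional cuDNN flags not shown here.

```python
import random
import torch
from torch.utils.data import DataLoader, TensorDataset

def seed_everything(seed: int) -> None:
    # Seed the common sources of randomness. For CUDA you would also call
    # torch.cuda.manual_seed_all(seed) and set deterministic cuDNN flags.
    random.seed(seed)
    torch.manual_seed(seed)

def make_loader(seed: int) -> DataLoader:
    # A seeded generator makes the shuffle permutation reproducible.
    g = torch.Generator()
    g.manual_seed(seed)
    dataset = TensorDataset(torch.arange(10))
    return DataLoader(dataset, batch_size=5, shuffle=True, generator=g)

seed_everything(42)
order_a = torch.cat([b[0] for b in make_loader(0)]).tolist()
order_b = torch.cat([b[0] for b in make_loader(0)]).tolist()
print(order_a == order_b)  # same seed -> same shuffle order
```

With the same seed the two loaders serve the examples in an identical (but still shuffled) order, which is exactly the "reproducible yet shuffled" setup described in point (3).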

Upvotes: 2

Related Questions