Reputation: 1004
Am I understanding correctly that data augmentation in an object classification task should only be done on the training set?
If so, how do you implement 10-fold cross-validation with augmented data? Is the augmented data created anew each time the test fold changes (i.e. 10 times)?
Bonus question: can you direct me to a resource that shows how to do this in Tensorflow?
Upvotes: 1
Views: 2559
Reputation: 1363
Yes, your understanding is correct. Validation data is there to give you an idea of how your model behaves on real, unseen examples (i.e. the test data), so you should keep it realistic and not distort it with augmentation.
Now to 10-fold cross-validation: engineering considerations kick in. Is the augmentation computationally expensive? Then you can pre-compute the augmented data and, for each fold, pick original + augmented samples for training and original-only samples for validation. Do you want vast amounts of augmented data, and/or is the augmentation cheap? Then do it on the fly, e.g. as part of fetching samples from the dataset.
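Here is a minimal sketch of the pre-computed variant using scikit-learn's KFold (the augment function, the array names X and y, and the toy shapes are placeholders for your own pipeline, not a prescribed API):

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy stand-in data: 100 RGB images of 32x32 with 10 classes (assumption).
X = np.random.rand(100, 32, 32, 3).astype("float32")
y = np.random.randint(0, 10, size=100)

def augment(images, labels):
    # Placeholder augmentation: a horizontal flip along the width axis.
    # Substitute your own transforms (crops, rotations, colour jitter, ...).
    return images[:, :, ::-1, :], labels

kf = KFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, val_idx in kf.split(X):
    X_train, y_train = X[train_idx], y[train_idx]
    X_val, y_val = X[val_idx], y[val_idx]  # validation stays un-augmented

    # Train on original + augmented samples from this fold only, so no
    # augmented copy of a validation image leaks into the training set.
    X_aug, y_aug = augment(X_train, y_train)
    X_train = np.concatenate([X_train, X_aug])
    y_train = np.concatenate([y_train, y_aug])
    # ... fit a fresh model on (X_train, y_train), evaluate on (X_val, y_val)
```

Note that augmenting after splitting, rather than before, is what prevents leakage: if you augmented first, a flipped copy of a validation image could end up in the training folds.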
I cannot help you with the TF bonus question, but there is a nice example of putting things together in PyTorch.
Upvotes: 3
Reputation: 370
Data augmentation is usually done to help the model generalize better to test/real-world data. For many practical applications the data is divided into train/validation/test splits. Data can be augmented in the train and validation sets; there is no point in augmenting the test set.
For cross-validation, check the KFold class from the sklearn library (sklearn.model_selection), which operates on NumPy arrays. Its split() method yields train/validation index arrays that you can use to slice your data before passing it to model.fit() in TensorFlow.
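A minimal sketch of how the two could fit together, assuming TF 2.x where Keras preprocessing layers such as RandomFlip are available (the toy data, model architecture, and epoch count are placeholders). Because these layers are only active in training mode, the validation split passed to model.fit() is left un-augmented automatically:

```python
import numpy as np
import tensorflow as tf
from sklearn.model_selection import KFold

# Toy stand-in data: 100 RGB images of 32x32 with 10 classes (assumption).
X = np.random.rand(100, 32, 32, 3).astype("float32")
y = np.random.randint(0, 10, size=100)

def build_model():
    return tf.keras.Sequential([
        # Random augmentation on the fly; inactive at validation/inference time.
        tf.keras.layers.RandomFlip("horizontal"),
        tf.keras.layers.RandomRotation(0.1),
        tf.keras.layers.Conv2D(16, 3, activation="relu"),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

kf = KFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, val_idx in kf.split(X):
    model = build_model()  # a fresh model for every fold
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(X[train_idx], y[train_idx],
              validation_data=(X[val_idx], y[val_idx]),
              epochs=1, verbose=0)  # bump epochs for real training
```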
Upvotes: 0