Reputation: 1004
Am I understanding correctly that data augmentation in an object classification task should only be done on the training set?
If so, how do you implement 10-fold cross-validation with augmented data? Is the augmented data created anew each time the test fold changes (i.e. 10 times)?
Bonus question: can you direct me to a resource that shows how to do this in Tensorflow?
Upvotes: 1
Views: 2559
Reputation: 1363
Yes, your understanding is correct. Validation data is there to give you an idea of how your model behaves on real, unseen examples (i.e. the test data), so you should keep it realistic and not distort it with augmentation.
Now to 10-fold cross-validation: engineering considerations kick in. Is the augmentation computationally expensive? Then you can pre-compute the augmented data and, for each fold, pick original + augmented samples for training and original-only samples for validation. Do you want vast amounts of augmented data, and/or is the augmentation cheap? Then do it on the fly, e.g. as part of fetching samples from the dataset.
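Here is a minimal sketch of the pre-computed variant using scikit-learn's KFold (the augment function, the array names X and y, and the toy shapes are placeholders for your own pipeline, not a prescribed API):

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy stand-in data: 100 RGB images of 32x32 with 10 classes (assumption).
X = np.random.rand(100, 32, 32, 3).astype("float32")
y = np.random.randint(0, 10, size=100)

def augment(images, labels):
    # Placeholder augmentation: a horizontal flip along the width axis.
    # Substitute your own transforms (crops, rotations, colour jitter, ...).
    return images[:, :, ::-1, :], labels

kf = KFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, val_idx in kf.split(X):
    X_train, y_train = X[train_idx], y[train_idx]
    X_val, y_val = X[val_idx], y[val_idx]  # validation stays un-augmented

    # Train on original + augmented samples from this fold only, so no
    # augmented copy of a validation image leaks into the training set.
    X_aug, y_aug = augment(X_train, y_train)
    X_train = np.concatenate([X_train, X_aug])
    y_train = np.concatenate([y_train, y_aug])
    # ... fit a fresh model on (X_train, y_train), evaluate on (X_val, y_val)
```

Note that augmenting after splitting, rather than before, is what prevents leakage: if you augmented first, a flipped copy of a validation image could end up in the training folds.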
I cannot help you with the TF bonus question, but there is a nice example of putting things together in PyTorch.
Upvotes: 3
Reputation: 370
Data augmentation is usually done to help the model generalize better to test/real-world data. For many practical applications the data is divided into train/validation/test splits. Data can be augmented in the train and validation sets; there is no point in augmenting the test set.
For cross-validation, check the KFold class from the sklearn library (sklearn.model_selection), which operates on NumPy arrays. Its split() method yields train/validation index arrays that you can use to slice your data before passing it to model.fit() in TensorFlow.
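A minimal sketch of how the two could fit together, assuming TF 2.x where Keras preprocessing layers such as RandomFlip are available (the toy data, model architecture, and epoch count are placeholders). Because these layers are only active in training mode, the validation split passed to model.fit() is left un-augmented automatically:

```python
import numpy as np
import tensorflow as tf
from sklearn.model_selection import KFold

# Toy stand-in data: 100 RGB images of 32x32 with 10 classes (assumption).
X = np.random.rand(100, 32, 32, 3).astype("float32")
y = np.random.randint(0, 10, size=100)

def build_model():
    return tf.keras.Sequential([
        # Random augmentation on the fly; inactive at validation/inference time.
        tf.keras.layers.RandomFlip("horizontal"),
        tf.keras.layers.RandomRotation(0.1),
        tf.keras.layers.Conv2D(16, 3, activation="relu"),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

kf = KFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, val_idx in kf.split(X):
    model = build_model()  # a fresh model for every fold
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(X[train_idx], y[train_idx],
              validation_data=(X[val_idx], y[val_idx]),
              epochs=1, verbose=0)  # bump epochs for real training
```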
Upvotes: 0