Data augmentation before splitting

Question

For my exam on data crunching, we've received a small Simpsons dataset of 4 characters (Bart, Homer, Lisa, Marge) to build a convolutional neural network around. However, the dataset contains only a small number of images: around 2200 to split into train and test sets.

Since I'm very new to neural networks and deep learning: is it acceptable to augment my data (I'm rotating the images X degrees 9 times) and then split it using sklearn's train_test_split function?

Since making this change, I'm getting a training and test accuracy of around 95% after 50 epochs with my current model. Since that's more than I expected, I started questioning whether augmenting the test data as well is generally accepted, or whether it leads to a biased or wrong result in the end.

So:

a) Can you augment your data before splitting it with sklearn's train_test_split without distorting your results?

b) If my method is wrong, what's another method I could try out?

Thanks in advance!

OddNorg · Accepted Answer

One should augment the data after the train/test split. For this to work correctly, one needs to make sure to augment only the data in the train split.

If one augments the data before splitting the dataset, small variations of the same original image will likely end up in both the train and the test set. The measured test accuracy will then overestimate how well the network generalizes (and the network might be over-fitting as well, among other issues).

A good way to avoid this pitfall is to augment the data only after the original dataset has been split.
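The split-then-augment order can be sketched as follows. This is only a minimal illustration under stated assumptions: the random arrays are placeholders for the real images, rotate_copies is a hypothetical helper, and rotations by multiples of 90 degrees stand in for the "X degrees" rotation from the question.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def rotate_copies(images, labels):
    """Return the originals plus 90/180/270-degree rotated copies.

    Hypothetical helper; expects images shaped (N, H, W, C) with H == W.
    """
    rotated = [np.rot90(images, k=k, axes=(1, 2)) for k in range(4)]
    # Labels are repeated in the same order the rotated blocks are stacked.
    return np.concatenate(rotated), np.tile(labels, 4)


# Placeholder data standing in for the real dataset (assumption).
rng = np.random.default_rng(0)
images = rng.random((200, 32, 32, 3))
labels = rng.integers(0, 4, size=200)  # 4 classes, one per character

# 1) Split the ORIGINAL images first.
X_train, X_test, y_train, y_test = train_test_split(
    images, labels, test_size=0.2, stratify=labels, random_state=0
)

# 2) Augment the train split only; the test split stays untouched.
X_train_aug, y_train_aug = rotate_copies(X_train, y_train)

print(X_train_aug.shape, X_test.shape)
```

With this order, no rotated variant of a test image can appear in the training data, so the test accuracy is measured on images the network has not seen in any form.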

Many libraries implement Python generators that randomly apply one or more image modifications to augment the data. These might include:

Image rotation
Image shearing
Image zoom (cropping and re-scaling)
Adding noise
Small shift in hue
Image shifting
Image padding
Image blurring
Image embossing
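A generator of this kind could look roughly like the sketch below. It is not taken from any particular library: augment and augmented_batches are hypothetical names, only two of the listed modifications (shifting and adding noise) are implemented, and images are assumed to be float arrays of shape (H, W, C) with values in [0, 1].

```python
import numpy as np


def augment(image, rng):
    """Randomly shift and/or add noise to one (H, W, C) image in [0, 1]."""
    out = image
    if rng.random() < 0.5:
        # Shift by up to 4 pixels per axis. np.roll wraps pixels around the
        # border; a real pipeline would more likely pad or crop instead.
        dy, dx = rng.integers(-4, 5, size=2)
        out = np.roll(out, shift=(dy, dx), axis=(0, 1))
    if rng.random() < 0.5:
        # Add a small amount of Gaussian noise.
        out = out + rng.normal(0.0, 0.02, size=out.shape)
    return np.clip(out, 0.0, 1.0)


def augmented_batches(images, labels, batch_size, seed=0):
    """Yield endless (batch, labels) pairs of randomly augmented images."""
    rng = np.random.default_rng(seed)
    while True:
        idx = rng.choice(len(images), size=batch_size, replace=False)
        batch = np.stack([augment(images[i], rng) for i in idx])
        yield batch, labels[idx]
```

Such a generator would be fed the train split only. Because the modifications are drawn fresh for every batch, the network rarely sees exactly the same image twice, without the augmented copies having to be stored.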

This GitHub library has a good overview of classical image augmentation techniques: https://github.com/aleju/imgaug (I have not used this library, so I cannot endorse its speed or implementation quality, but the overview in its README.md seems quite comprehensive.)

Some neural network libraries already ship utilities for this. For example, Keras has methods for image preprocessing: https://keras.io/preprocessing/image/
