mayuresh_sa

Reputation: 159

Data shuffling for Image Classification

I want to develop a CNN model to identify 24 hand signs in American Sign Language. I created a custom dataset that contains 3000 images for each hand sign, i.e. 72000 images in the entire dataset.

For training the model, I will be using an 80-20 dataset split (2400 images per hand sign in the training set and 600 images per hand sign in the validation set).

My question is: should I randomly shuffle the images when creating the dataset? And why?

In my previous experience, shuffling led to the validation loss being lower than the training loss and the validation accuracy being higher than the training accuracy. Check this link.

Upvotes: 1

Views: 2575

Answers (2)

desertnaut

Reputation: 60318

Random shuffling of the data is a standard procedure in all machine learning pipelines, and image classification is no exception; its purpose is to break possible biases introduced during data preparation - e.g. putting all the cat images first and then all the dog ones in a cat/dog classification dataset.

Take for example the famous iris dataset:

from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
y
# result:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

As you can clearly see, the dataset has been prepared in such a way that the first 50 samples all have label 0, the next 50 label 1, and the last 50 label 2. Try to perform a 5-fold cross-validation on such a dataset without shuffling and you'll find most of your folds containing only a single label; try a 3-fold CV, and every one of your folds will include exactly one label, as the snippet below demonstrates. Bad... BTW, it's not just a theoretical possibility, it has actually happened.
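To see this concretely, here is a quick check with scikit-learn's KFold, reusing X and y from the snippet above (the random_state is arbitrary):

from sklearn.model_selection import KFold
import numpy as np

# 3-fold CV without shuffling: each test fold holds exactly one label
for _, test_idx in KFold(n_splits=3).split(X):
    print(np.unique(y[test_idx]))
# result:
# [0]
# [1]
# [2]

# 3-fold CV with shuffling: every test fold now mixes all three labels
for _, test_idx in KFold(n_splits=3, shuffle=True, random_state=42).split(X):
    print(np.unique(y[test_idx]))
# result:
# [0 1 2]
# [0 1 2]
# [0 1 2]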

Even if no such bias exists, shuffling never hurts, so we always do it just to be on the safe side (you never know...).
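In scikit-learn, for instance, shuffling the features and labels in unison is a one-liner (again reusing X and y from above):

from sklearn.utils import shuffle

# shuffle samples and labels together, keeping them aligned
X, y = shuffle(X, y, random_state=42)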

In my previous experience, shuffling led to the validation loss being lower than the training loss and the validation accuracy being higher than the training accuracy. Check this link.

As noted in the answer there, it is highly unlikely that this was due to the shuffling. Data shuffling is nothing sophisticated - essentially, it is just the equivalent of shuffling a deck of cards; it may once have happened that you insisted on "better" shuffling and subsequently ended up with a straight flush, but obviously that was not due to the "better" shuffling of the cards.

Upvotes: 2

Bilguun

Reputation: 389

Here are my two cents on the topic.

First of all, make sure to extract a test set that has an equal number of samples for each hand sign (hand sign #1 - 500 samples, hand sign #2 - 500 samples, and so on). I believe this is referred to as stratified sampling.

When it comes to the training set, there is no harm in shuffling the entire set. However, when splitting it into training and validation sets, make sure that the validation set is representative of the test set; a stratified split, as sketched below, is one way to achieve this.
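As a rough sketch of what I mean, using scikit-learn's train_test_split with its stratify option (the array shapes here are placeholders - substitute your own data):

import numpy as np
from sklearn.model_selection import train_test_split

# stand-in arrays: 72000 samples, 24 classes, 3000 samples per class
images = np.zeros((72000, 1), dtype=np.uint8)
labels = np.repeat(np.arange(24), 3000)

# hold out a stratified test set: every class keeps the same proportion
X_trainval, X_test, y_trainval, y_test = train_test_split(
    images, labels, test_size=0.2, stratify=labels, random_state=42)

# split the remainder into training and validation sets, again stratified,
# so the validation set mirrors the class balance of the test set
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.2, stratify=y_trainval, random_state=42)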

One of my personal experiences with shuffling: after splitting the training set into training and validation sets, the validation set turned out to be very easy to predict, so I saw good learning metrics during training. However, the model's performance on the test set was horrible.

Upvotes: -1
