Reputation: 169
I am using the ImageDataGenerator class of Keras to load my samples into a neural network for a binary classification problem.
I have 5000 positive and 5000 negative training images, and likewise 5000 positive and 5000 negative validation images. My batch size is 64.
My problem is that when I use the code below to load the data and train, my accuracy hovers around 65-67%. However, if I set shuffle = False, it hovers around 98-100% after 2-3 epochs.
Why is there such a big performance gain, and how does shuffling play a part in it?
Also, I noticed that each batch generated by flow_from_directory contains images from only one class. Would putting both positive and negative samples in a batch give a more realistic measure of accuracy?
# data augmentation configuration we will use for training
train_datagen = ImageDataGenerator(preprocessing_function=HPF)
# data augmentation configuration we will use for testing
test_datagen = ImageDataGenerator(preprocessing_function=HPF)
# generator, for train data
train_generator = train_datagen.flow_from_directory(
    './data/train',  # this is the target directory
    target_size=(512, 512),  # all images will be resized to 512x512
    batch_size=batch_size,
    color_mode='grayscale',
    shuffle=True,
    class_mode='binary')  # since we use binary_crossentropy loss, we need binary labels
# this is a similar generator, for validation data
validation_generator = test_datagen.flow_from_directory(
    './data/validation',
    target_size=(512, 512),
    color_mode='grayscale',
    batch_size=batch_size,
    shuffle=True,
    class_mode='binary')
What does shuffle = True actually do? The images in each batch are still either all positive or all negative.
The batches can be printed using the below code:
for i in train_generator:
    idx = (train_generator.batch_index - 1) * train_generator.batch_size
    # note: filenames is in directory order, so this slice only matches the actual batch contents when shuffle=False
    print(train_generator.filenames[idx : idx + train_generator.batch_size])
    if train_generator.batch_index == 0:  # the generator loops forever; stop after one epoch
        break
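The effect of shuffling can be seen without Keras at all. flow_from_directory serves files in directory order (all of one class folder, then the other), so without shuffling each 64-image batch is single-class; shuffle=True draws a fresh random permutation of all sample indices each epoch, which mixes the classes within batches. A minimal numpy simulation of this, with labels 0/1 standing in for the two class folders:

```python
import numpy as np

# 5000 negatives followed by 5000 positives, mimicking directory order
labels = np.array([0] * 5000 + [1] * 5000)
batch_size = 64

# Without shuffling: batches are taken sequentially, so nearly all are single-class
unshuffled_batches = [labels[i:i + batch_size]
                      for i in range(0, len(labels), batch_size)]
single_class = sum(len(set(b)) == 1 for b in unshuffled_batches)
print(f"{single_class}/{len(unshuffled_batches)} unshuffled batches are single-class")  # 156/157

# With shuffling: a random permutation of all indices is drawn, so batches mix classes
rng = np.random.default_rng(0)
shuffled = labels[rng.permutation(len(labels))]
mixed = sum(len(set(shuffled[i:i + batch_size])) == 2
            for i in range(0, len(labels), batch_size))
print(f"{mixed}/{len(unshuffled_batches)} shuffled batches contain both classes")
```

Only the batch straddling the class boundary (and, rarely, the short final batch) breaks the pattern in the unshuffled case.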
Upvotes: 0
Views: 2485
Reputation: 24591
I think the most probable cause is that you have some correlation between training and validation batches. For example, when the training batches are all positive, the validation batches are too. This could be aggravated by the fact that your training and validation sets have the same length: the correlation could persist across epochs.
In any case I would rather trust the performance obtained with random shuffling. Any significant departure from those figures without shuffling indicates correlation in your data, and is a reminder of why random shuffling is needed and so commonly used.
EDIT
Here is a possible correlation in the data as you described it. Your training and validation sets each have 10000 images. The first 5000 of both sets are (or could be) all positive. So early in training, your net learns to label every sample as positive no matter what. The validation samples seen at that point are also positive, so this fits. Then training moves on to the negative images; your net adapts and labels every sample as negative no matter what. This again matches the validation samples, which happen to be negative at that point, and you end up with good validation scores.
One way to convince yourself that the real performance of unshuffled training is poor is to validate on the entire validation set, not on a single validation batch -- as I assume you are doing.
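This failure mode can be sketched in a few lines, without Keras. Below, a degenerate stand-in "model" (hypothetical, not your network) that has just adapted to the last training class it saw scores perfectly on one in-phase, single-class validation batch, but only 50% on the full balanced validation set:

```python
import numpy as np

# Balanced, unshuffled validation labels: 5000 positives then 5000 negatives
val_labels = np.array([1] * 5000 + [0] * 5000)

# A degenerate "model" that predicts positive for everything,
# as if it had just been trained on a long run of positive batches
def predict(n):
    return np.ones(n, dtype=int)

# Accuracy on a single all-positive validation batch looks perfect...
batch = val_labels[:64]
batch_acc = (predict(64) == batch).mean()
print(f"single-batch accuracy: {batch_acc:.2f}")  # 1.00

# ...but accuracy on the full validation set reveals the truth
full_acc = (predict(len(val_labels)) == val_labels).mean()
print(f"full-set accuracy: {full_acc:.2f}")  # 0.50
```

In Keras terms, this corresponds to evaluating with the whole validation generator rather than reading the per-batch metric.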
Upvotes: 1