Reputation: 169
I am using the ImageDataGenerator class of Keras to load my samples into a neural network for a binary classification problem.
I have 5000 positive and 5000 negative training images, and likewise 5000 positive and 5000 negative validation images. My batch size is 64.
My problem is that when I use the code below to load the data and train, my accuracy hovers around 65-67%. However, if I set shuffle = False, it hovers around 98-100% after 2-3 epochs.
Why is there such a big performance gain, and how does shuffling play a part in it?
Also, I noticed that each batch generated by flow_from_directory contains images from only one class. Would putting both positive and negative samples in a batch give a more realistic measure of accuracy?
# data augmentation configuration we will use for training
train_datagen = ImageDataGenerator(preprocessing_function=HPF)
# data augmentation configuration we will use for testing
test_datagen = ImageDataGenerator(preprocessing_function=HPF)
# generator, for train data
train_generator = train_datagen.flow_from_directory(
    './data/train',  # this is the target directory
    target_size=(512, 512),  # all images will be resized to 512x512
    batch_size=batch_size,
    color_mode='grayscale',
    shuffle=True,
    class_mode='binary')  # since we use binary_crossentropy loss, we need binary labels
# this is a similar generator, for validation data
validation_generator = test_datagen.flow_from_directory(
    './data/validation',
    target_size=(512, 512),
    color_mode='grayscale',
    batch_size=batch_size,
    shuffle=True,
    class_mode='binary')
What does shuffle = True actually do? The images in each batch are still either all positive or all negative.
The batches can be printed using the below code:
for i in train_generator:
    idx = (train_generator.batch_index - 1) * train_generator.batch_size
    # note: filenames is in directory order, so this slice only matches the actual batch contents when shuffle=False
    print(train_generator.filenames[idx : idx + train_generator.batch_size])
    if train_generator.batch_index == 0:  # the generator loops forever; stop after one epoch
        break
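The effect of shuffling can be seen without Keras at all. flow_from_directory serves files in directory order (all of one class folder, then the other), so without shuffling each 64-image batch is single-class; shuffle=True draws a fresh random permutation of all sample indices each epoch, which mixes the classes within batches. A minimal numpy simulation of this, with labels 0/1 standing in for the two class folders:

```python
import numpy as np

# 5000 negatives followed by 5000 positives, mimicking directory order
labels = np.array([0] * 5000 + [1] * 5000)
batch_size = 64

# Without shuffling: batches are taken sequentially, so nearly all are single-class
unshuffled_batches = [labels[i:i + batch_size]
                      for i in range(0, len(labels), batch_size)]
single_class = sum(len(set(b)) == 1 for b in unshuffled_batches)
print(f"{single_class}/{len(unshuffled_batches)} unshuffled batches are single-class")  # 156/157

# With shuffling: a random permutation of all indices is drawn, so batches mix classes
rng = np.random.default_rng(0)
shuffled = labels[rng.permutation(len(labels))]
mixed = sum(len(set(shuffled[i:i + batch_size])) == 2
            for i in range(0, len(labels), batch_size))
print(f"{mixed}/{len(unshuffled_batches)} shuffled batches contain both classes")
```

Only the batch straddling the class boundary (and, rarely, the short final batch) breaks the pattern in the unshuffled case.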
Upvotes: 0
Views: 2485
Reputation: 24591
I think the most probable cause is that you have some correlation between training and validation batches. For example, when the training batches are all positive, the validation batches are too. This could be aggravated by the fact that your training and validation sets have the same length: the correlation could persist across epochs.
In any case I would rather trust the performance obtained with random shuffling. Any significant departure from those figures without shuffling indicates correlation in your data, and is a reminder of why random shuffling is needed and so commonly used.
EDIT
Here is a possible correlation in the data as you described it. Your training and validation sets each have 10000 images. The first 5000 of both sets are (or could be) all positive. So early in training, your net learns to label every sample as positive no matter what. The validation samples seen at that point are also positive, so this fits. Then training moves on to the negative images; your net adapts and labels every sample as negative no matter what. This again matches the validation samples, which happen to be negative at that point, and you end up with good validation scores.
One way to convince yourself that the real performance of unshuffled training is poor is to validate on the entire validation set, not on a single validation batch -- as I assume you are doing.
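This failure mode can be sketched in a few lines, without Keras. Below, a degenerate stand-in "model" (hypothetical, not your network) that has just adapted to the last training class it saw scores perfectly on one in-phase, single-class validation batch, but only 50% on the full balanced validation set:

```python
import numpy as np

# Balanced, unshuffled validation labels: 5000 positives then 5000 negatives
val_labels = np.array([1] * 5000 + [0] * 5000)

# A degenerate "model" that predicts positive for everything,
# as if it had just been trained on a long run of positive batches
def predict(n):
    return np.ones(n, dtype=int)

# Accuracy on a single all-positive validation batch looks perfect...
batch = val_labels[:64]
batch_acc = (predict(64) == batch).mean()
print(f"single-batch accuracy: {batch_acc:.2f}")  # 1.00

# ...but accuracy on the full validation set reveals the truth
full_acc = (predict(len(val_labels)) == val_labels).mean()
print(f"full-set accuracy: {full_acc:.2f}")  # 0.50
```

In Keras terms, this corresponds to evaluating with the whole validation generator rather than reading the per-batch metric.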
Upvotes: 1