Panda

Reputation: 79

Keras fit_generator() with a generator that extends Sequence returns more samples than the total

I am training a neural network with Keras. Because of the size of the dataset, I need to use a generator and the fit_generator() method. I am following this tutorial:

https://stanford.edu/~shervine/blog/keras-how-to-generate-data-on-the-fly

However, I prepared a small example to check the samples being fed to the network at each epoch, and it seems that the number is higher than the total number of samples.

import numpy as np
import keras

class DataGenerator(keras.utils.Sequence):
    'Generates data for Keras'
    def __init__(self, files, batch_size=2, dim=(160, 160), n_channels=3,
                 n_classes=2, shuffle=False):
        'Initialization'
        self.dim = dim
        self.files = files
        self.batch_size = batch_size
        self.n_channels = n_channels
        self.n_classes = n_classes
        self.shuffle = shuffle
        self.on_epoch_end()

    def __len__(self):
        'Denotes the number of batches per epoch'
        n_batches = int(np.floor(len(self.files) / self.batch_size))
        print("Number of batches per epoch:", n_batches)
        return n_batches

    def __getitem__(self, index):
        'Generate one batch of data'
        # Generate indexes of the batch
        indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]

        # Find list of IDs
        files_temp = [self.files[k] for k in indexes]

        # Generate data
        X, y = self.__data_generation(files_temp)

        return X, y


    def on_epoch_end(self):
        'Updates indexes after each epoch'
        self.indexes = np.arange(len(self.files))
        if self.shuffle:
            np.random.shuffle(self.indexes)


    def __data_generation(self, files_temp):
        'Generates data containing batch_size samples' # X : (n_samples, *dim, n_channels)
        # Initialization
        X = np.empty((self.batch_size, *self.dim, self.n_channels))
        y = np.empty((self.batch_size), dtype=int)

        # Generate data
        for i, ID in enumerate(files_temp):
            # Store sample
            X[i,] = read_image(ID)

            # Store class
            y[i] = get_label(ID)

        return X, keras.utils.to_categorical(y, num_classes=self.n_classes)


...

params = {'dim': (160, 160),
          'batch_size': 2,
          'n_classes': 2,
          'n_channels': 3,
          'shuffle': True}


gen_train = DataGenerator(files, **params)
model.fit_generator(gen_train,
                    steps_per_epoch=ceil(num_samples_train / batch_size),
                    validation_data=None,
                    epochs=1, verbose=1,
                    callbacks=[tensorboard])

Here, read_image and get_label are my methods for loading the data. Both of them print the image being loaded, and I see more prints than I expect. For example, with:

num_samples = 10
batch_size = 2

Steps per epoch will be equal to 5, and that is what the Keras progress bar shows, but I get more images than that (which I know because of the print inside the method).

I tried debugging and found that __getitem__ is called more than 5 times! The first five calls receive indexes between 0 and 4 (as expected), but then I get a repeated index and more data being loaded.

Any idea why this is happening? I've debugged down to data_utils.py in Keras but can't find the exact place where the index is passed to __getitem__. Everything inside __getitem__ seems to be working fine.
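
To isolate the behaviour from my actual data pipeline, a stripped-down Sequence along these lines (the class name and dummy shapes are mine, just for illustration) counts the calls without doing any real I/O:

import numpy as np
import keras

class CountingSequence(keras.utils.Sequence):
    'Counts how often Keras calls __getitem__'
    def __init__(self, num_samples=10, batch_size=2):
        self.num_samples = num_samples
        self.batch_size = batch_size
        self.calls = 0

    def __len__(self):
        return self.num_samples // self.batch_size

    def __getitem__(self, index):
        self.calls += 1
        print("call", self.calls, "-> index", index)
        # Dummy batch with the same shapes as my real generator
        X = np.zeros((self.batch_size, 160, 160, 3))
        y = keras.utils.to_categorical(np.zeros(self.batch_size, dtype=int),
                                       num_classes=2)
        return X, y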

Upvotes: 1

Views: 1139

Answers (1)

Dr. Snoopy

Reputation: 56357

This is normal: with steps_per_epoch = 5, your __getitem__ will be called 5 times per epoch, so training for more than one epoch means it is called more than just 5 times in total (with epochs = 3, for example, you would see 15 calls).

Also note that there is parallelism involved: Keras automatically runs your Sequence in another thread or process (depending on configuration), so batches may be requested ahead of time and out of the expected order. This is also normal.
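
If you want to rule the prefetching in or out, here is a quick sketch (assuming Keras 2.x, where workers=0 runs the generator on the main thread and max_queue_size controls how many batches are buffered ahead of time):

# With workers=0 the Sequence runs on the main thread, so __getitem__
# is called exactly steps_per_epoch * epochs times, in order.
model.fit_generator(gen_train, epochs=1, verbose=1, workers=0)

# With the defaults (workers=1, max_queue_size=10), a background worker
# prefetches batches into a queue, so extra __getitem__ calls beyond what
# the progress bar shows are expected.
model.fit_generator(gen_train, epochs=1, verbose=1,
                    workers=1, max_queue_size=10)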

Upvotes: 1
