python_beginner

Reputation: 83

Data generator class reads batches out of sequence and is called more times than the number of batches

I have a large, non-image CSV file that I want to read in batches and feed to model.fit_generator. I have written a DataGenerator(keras.utils.all_utils.Sequence) class with the following methods:

I instantiate a training and a validation instance of the class and then call model.fit_generator with the correct batch size and other parameters. I noticed that __len__, __getitem__, and __data_generation get called more times than required, and out of sequence, throwing off my modeling accuracy. For example, if my CSV has 350 rows in total and I set my batch size to 50, __getitem__ and __data_generation should be called 7 times each per epoch for the training and validation data sets. Instead, I see that the batches are not read in sequential order and __getitem__ and __data_generation are called more than 7 times. I have set shuffle=False.
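For reference, here is a small sketch (plain Python, no Keras) of how the batch indices should map to CSV rows for the numbers above; note that pandas' nrows argument to read_csv is a row count, not an end offset:

```python
total_rows = 350          # rows in the CSV, as in the example above
rows_per_batch = 50

num_batches = total_rows // rows_per_batch
print(num_batches)        # 7

# Mapping from batch index to the rows each __getitem__ call should read:
# (index, rows to skip, rows to read)
for index in range(num_batches):
    skip_rows = index * rows_per_batch
    print(index, skip_rows, rows_per_batch)
```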

I am using the following:

Here's my sample code:

   class DataGenerator(keras.utils.all_utils.Sequence):

       def __init__(self, file_name, rows_per_batch=50, shuffle=False):
           self.rows_per_batch = rows_per_batch
           self.shuffle = shuffle
           self.file_name = file_name
           reader = csv.reader(open(self.file_name, 'r'))
           self.lines = list(reader)
           # build the index list before the first shuffle
           self.indexes = np.arange(len(self.lines))
           self.on_epoch_end()

       def __len__(self):
           return len(self.lines) // self.rows_per_batch

       def __getitem__(self, index):
           skip_rows = index * self.rows_per_batch
           # nrows is a row count for pd.read_csv, not an end offset
           nrows = self.rows_per_batch
           X, y = self.__data_generation(self.file_name, skip_rows, nrows)

           return X, y

       def on_epoch_end(self):
           if self.shuffle:
               np.random.shuffle(self.indexes)

       def __data_generation(self, file_name, skip_rows, nrows):

           df = pd.read_csv(file_name, header=None, skiprows=skip_rows, nrows=nrows)

           < do some data processing here >

           return X, y

   # main program
   file_name = <some filename>
   rows_per_batch = <some number>

   a = DataGenerator(file_name, rows_per_batch, shuffle=False)
   b = DataGenerator(file_name, rows_per_batch, shuffle=False)

   # Train model on dataset
   model.fit_generator(generator=a,
                       validation_data=b,
                       use_multiprocessing=False,
                       shuffle=False,
                       epochs=10,
                       workers=6)
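One likely source of the out-of-order calls is workers=6: Keras pulls batches from a Sequence through several worker threads, so __getitem__ can be invoked concurrently and not in index order, even though the batches themselves are still delivered in order. A minimal stdlib-only sketch of that behavior (no Keras involved; getitem here is a stand-in for Sequence.__getitem__):

```python
from concurrent.futures import ThreadPoolExecutor
import threading

call_log = []
lock = threading.Lock()

def getitem(index):
    # stand-in for Sequence.__getitem__: record the call, return the "batch"
    with lock:
        call_log.append(index)
    return index

# Several workers may interleave the calls, but map() still yields
# the results in index order, mirroring Keras' ordered enqueuing.
with ThreadPoolExecutor(max_workers=6) as pool:
    batches = list(pool.map(getitem, range(7)))

print(batches)                              # [0, 1, 2, 3, 4, 5, 6]
print(sorted(call_log) == list(range(7)))   # True: each index called once
```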

Thanks, in advance, for any help or suggestions.

Upvotes: 0

Views: 770

Answers (1)

Jean-Marc Volle

Reputation: 3333

The function def __len__(self): must report the number of batches that the sequence will generate.

The function def __getitem__(self, index): returns the index-th batch.

The training/testing framework will call __getitem__ with index in the range [0, len(dg)-1], where dg is an instance of your DataGenerator.

So the implementation of def __len__(self): should be:

   import math

   def __len__(self):
       return math.ceil(len(self.lines) / self.rows_per_batch)
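A quick sanity check of the ceiling division, using the 350 rows from the question and a hypothetical 360-row file with an uneven last batch:

```python
import math

rows_per_batch = 50
for total_rows in (350, 360):
    print(total_rows, math.ceil(total_rows / rows_per_batch))
# 350 -> 7 batches; 360 -> 8 batches (the last one holds only 10 rows)
```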

Upvotes: 1
