Maystro

Reputation: 2955

Keras difference between generator and sequence

I'm using a deep CNN+LSTM network to perform classification on a dataset of 1D signals. I'm using Keras 2.2.4 backed by TensorFlow 1.12.0. Since I have a large dataset and limited resources, I'm using a generator to load the data into memory during the training phase. First, I tried this generator:

import random

def data_generator(batch_size, preproc, type, x, y):
    num_examples = len(x)
    examples = zip(x, y)
    # Sort by signal length so each batch holds similarly sized examples
    examples = sorted(examples, key=lambda x: x[0].shape[0])
    end = num_examples - batch_size + 1
    batches = [examples[i:i + batch_size] for i in range(0, end, batch_size)]

    random.shuffle(batches)
    while True:
        for batch in batches:
            x, y = zip(*batch)
            yield preproc.process(x, y)

Using the above method, I'm able to launch training with a mini-batch size of up to 30 samples at a time. However, this kind of method does not guarantee that the network will train only once on each sample per epoch. Considering this comment from the Keras website:

Sequence is a safer way to do multiprocessing. This structure guarantees that the network will only train once on each sample per epoch which is not the case with generators.

I've tried another way of loading data using the following class:

import numpy as np
from keras.utils import Sequence

class Data_Gen(Sequence):

    def __init__(self, batch_size, preproc, type, x_set, y_set):
        self.x, self.y = np.array(x_set), np.array(y_set)
        self.batch_size = batch_size
        self.indices = np.arange(self.x.shape[0])
        np.random.shuffle(self.indices)
        self.type = type
        self.preproc = preproc

    def __len__(self):
        return int(np.ceil(self.x.shape[0] / self.batch_size))

    def __getitem__(self, idx):
        inds = self.indices[idx * self.batch_size:(idx + 1) * self.batch_size]
        batch_x = self.x[inds]
        batch_y = self.y[inds]
        return self.preproc.process(batch_x, batch_y)

    def on_epoch_end(self):
        np.random.shuffle(self.indices)

I can confirm that using this method the network trains once on each sample per epoch, but this time, when I put more than 7 samples in the mini-batch, I get an out-of-memory error:

OP_REQUIRES failed at random_op.cc: 202: Resource exhausted: OOM when allocating tensor with shape...............

I can confirm that I'm using the same model architecture, configuration, and machine for both tests. I'm wondering why there would be a difference between these two ways of loading data.
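For comparison, the per-batch memory footprint of the two loaders can be inspected with something like the sketch below (batch_mb is a hypothetical helper, and it assumes preproc.process returns a pair of NumPy arrays):

import numpy as np

def batch_mb(batch):
    # Rough footprint of one batch in megabytes, assuming NumPy arrays
    x_batch, y_batch = batch
    return (np.asarray(x_batch).nbytes + np.asarray(y_batch).nbytes) / 1024 ** 2

gen = data_generator(batch_size, preproc, 'Train', *train)
for _ in range(5):
    print('generator batch: {:.1f} MB'.format(batch_mb(next(gen))))

seq = Data_Gen(batch_size, preproc, 'Train', *train)
for i in range(min(5, len(seq))):
    print('Sequence batch: {:.1f} MB'.format(batch_mb(seq[i])))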

Please don't hesitate to ask for more details in case needed.

Thanks in advance.

EDITED:

Here is the code I'm using to fit the model:

reduce_lr = keras.callbacks.ReduceLROnPlateau(
    factor=0.1,
    patience=2,
    min_lr=params["learning_rate"])

checkpointer = keras.callbacks.ModelCheckpoint(
    filepath=str(get_filename_for_saving(save_dir)),
    save_best_only=False)

batch_size = params.get("batch_size", 32)

path = './logs/run-{0}'.format(datetime.now().strftime("%b %d %Y %H:%M:%S"))
tensorboard = keras.callbacks.TensorBoard(log_dir=path, histogram_freq=0,
                                          write_graph=True, write_images=False)
if index == 0:
    print(model.summary())
    print("Model memory needed for batchsize {0} : {1} Gb".format(
        batch_size, get_model_memory_usage(batch_size, model)))

if params.get("generator", False):
    train_gen = load.data_generator(batch_size, preproc, 'Train', *train)
    dev_gen = load.data_generator(batch_size, preproc, 'Dev', *dev)
    valid_metrics = Metrics(dev_gen, len(dev[0]) // batch_size, batch_size)
    model.fit_generator(
        train_gen,
        steps_per_epoch=len(train[0]) // batch_size + 1 if len(train[0]) % batch_size != 0 else len(train[0]) // batch_size,
        epochs=MAX_EPOCHS,
        validation_data=dev_gen,
        validation_steps=len(dev[0]) // batch_size + 1 if len(dev[0]) % batch_size != 0 else len(dev[0]) // batch_size,
        callbacks=[valid_metrics, MyCallback(), checkpointer, reduce_lr, tensorboard])

    # train_gen = load.Data_Gen(batch_size, preproc, 'Train', *train)
    # dev_gen = load.Data_Gen(batch_size, preproc, 'Dev', *dev)
    # model.fit_generator(
    #     train_gen,
    #     epochs=MAX_EPOCHS,
    #     validation_data=dev_gen,
    #     callbacks=[valid_metrics, MyCallback(), checkpointer, reduce_lr, tensorboard])
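The steps_per_epoch / validation_steps expressions above are just ceiling divisions; a more compact equivalent, reusing the int(np.ceil(...)) idiom from __len__, would be:

steps_per_epoch = int(np.ceil(len(train[0]) / batch_size))
validation_steps = int(np.ceil(len(dev[0]) / batch_size))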

Upvotes: 37

Views: 4616

Answers (1)

Gaslight Deceive Subvert

Reputation: 20374

Those methods are roughly the same. It is correct to subclass Sequence when your dataset doesn't fit in memory, but you shouldn't run any preprocessing in any of the class's methods, because it will be re-executed on every epoch, wasting a lot of computing resources.
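For example, the expensive work can be moved into __init__ so it runs exactly once; a minimal sketch, assuming the preprocessed dataset fits in memory and that preproc.process accepts the full dataset (PrecomputedGen is a hypothetical name):

import numpy as np
from keras.utils import Sequence

class PrecomputedGen(Sequence):
    def __init__(self, batch_size, preproc, x_set, y_set):
        # Expensive preprocessing happens exactly once, up front
        self.x, self.y = preproc.process(x_set, y_set)
        self.x, self.y = np.asarray(self.x), np.asarray(self.y)
        self.batch_size = batch_size
        self.indices = np.arange(len(self.x))
        np.random.shuffle(self.indices)

    def __len__(self):
        return int(np.ceil(len(self.x) / self.batch_size))

    def __getitem__(self, idx):
        # Only cheap slicing per batch, no preprocessing here
        inds = self.indices[idx * self.batch_size:(idx + 1) * self.batch_size]
        return self.x[inds], self.y[inds]

    def on_epoch_end(self):
        np.random.shuffle(self.indices)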

It is probably also simpler to shuffle the samples themselves rather than their indices, like this:

from random import shuffle

import numpy as np
from keras.utils import Sequence

class DataGen(Sequence):
    def __init__(self, batch_size, preproc, type, x_set, y_set):
        self.samples = list(zip(x_set, y_set))
        self.batch_size = batch_size
        shuffle(self.samples)
        self.type = type
        self.preproc = preproc

    def __len__(self):
        return int(np.ceil(len(self.samples) / self.batch_size))

    def __getitem__(self, i):
        batch = self.samples[i * self.batch_size:(i + 1) * self.batch_size]
        # zip(*batch) unpacks the (x, y) pairs back into an x tuple and a y tuple
        return self.preproc.process(*zip(*batch))

    def on_epoch_end(self):
        shuffle(self.samples)
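Because fit_generator accepts Sequence objects directly in Keras 2.2.4, steps_per_epoch and validation_steps can be omitted (they are inferred from __len__) and worker processes can be enabled safely; a sketch (the worker count of 4 is an arbitrary assumption):

train_gen = DataGen(batch_size, preproc, 'Train', *train)
dev_gen = DataGen(batch_size, preproc, 'Dev', *dev)

model.fit_generator(
    train_gen,
    epochs=MAX_EPOCHS,
    validation_data=dev_gen,
    workers=4,                  # tune for your machine
    use_multiprocessing=True,   # safe with a Sequence, per the Keras docs
    callbacks=[checkpointer, reduce_lr, tensorboard])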

I think it is impossible to say why you run out of memory without knowing more about your data. My guess is that your preproc function is doing something wrong. You can debug it by running:

for e in DataGen(batch_size, preproc, 'Train', *train):
    print(e)
for e in DataGen(batch_size, preproc, 'Dev', *dev):
    print(e)

If the problem is in the preprocessing, you will most likely run out of memory there as well.

Upvotes: 1
