jogan

Reputation: 85

How to work with two different generators while feeding input to a neural network in Keras?

I have a huge dataset. My usual approach with such a dataset is to split it into multiple tiny datasets stored as numpy archives and use a generator to iterate over them. Are there any alternatives to this? I also want to incorporate random run-time image augmentations with the Keras image preprocessing module, which is itself a generator-type function. How do I streamline these two generator processes? The link for the Keras image augmentation module is below. https://keras.io/preprocessing/image/

My current data flow generator is as follows:

import os
import numpy as np

def dat_loader(path, batch_size):
    while True:
        # Walk the directory tree of tiny .npz archives
        for root, _, files in os.walk(path):
            for file in files:
                file_path = os.path.join(root, file)
                archive = np.load(file_path)
                img = archive['images']
                truth = archive['truth']
                del archive
                num_batches = len(truth) // batch_size
                img = np.array_split(img, num_batches)
                truth = np.array_split(truth, num_batches)
                while truth:
                    batch_img = img.pop()
                    batch_truth = truth.pop()
                    yield batch_img, batch_truth
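
Roughly, what I have in mind is wrapping the loader above and applying an ImageDataGenerator transform to every batch it yields. The sketch below is untested and only illustrates the chaining I am asking about; the augmentation parameters are arbitrary and it assumes ImageDataGenerator.random_transform from the module linked above:

from keras.preprocessing.image import ImageDataGenerator

# Arbitrary augmentation settings, just for illustration
augmenter = ImageDataGenerator(rotation_range=15, horizontal_flip=True)

def augmented_loader(path, batch_size):
    # Wrap the plain loader and augment each batch on the fly
    for batch_img, batch_truth in dat_loader(path, batch_size):
        batch_img = np.stack([augmenter.random_transform(x)
                              for x in batch_img])
        yield batch_img, batch_truth

Is something along these lines the right way to combine the two, or is there a cleaner pattern?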

Upvotes: 0

Views: 31

Answers (1)

nuric

Reputation: 11225

One way to handle really large datasets is to use memory-mapped files that dynamically load the required data at runtime. NumPy has memmap, which creates an array backed by a file on disk; the file can be massive (I once had one for a pre-processed version of offline Wikipedia and it was okay) but doesn't necessarily live in your RAM. Any changes get flushed back to the file when needed or when the object is garbage collected. Here is an example:

import numpy as np
# Create or load a memory mapped array, can contain your huge dataset
nrows, ncols = 1000000, 100
f = np.memmap('memmapped.dat', dtype=np.float32,
              mode='w+', shape=(nrows, ncols))

# Use it like a normal array but it will be slower as it might
# access the disk along the way.
for i in range(ncols):
    f[:, i] = np.random.rand(nrows)

This snippet is from the online tutorial. Note this is just a potential solution; for your dataset and usage there might be better alternatives.
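
To connect this to your generator, here is a minimal sketch (the shapes and file names are made up, and it assumes the images and labels have already been written into two memmap files) that slices batches straight out of the mapped arrays and yields them just like your current loader:

import numpy as np

# Made-up shapes and file names, purely for illustration
n_samples, height, width, channels = 100000, 64, 64, 3
images = np.memmap('images.dat', dtype=np.float32, mode='r',
                   shape=(n_samples, height, width, channels))
truth = np.memmap('truth.dat', dtype=np.float32, mode='r',
                  shape=(n_samples,))

def memmap_loader(batch_size):
    while True:
        for start in range(0, n_samples, batch_size):
            stop = start + batch_size
            # np.asarray pulls just this slice from disk into memory
            yield np.asarray(images[start:stop]), np.asarray(truth[start:stop])

The batches it yields are ordinary in-memory arrays, so they can go through the same Keras augmentation step as any other batch.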

Upvotes: 1
