Reputation: 85
I have a huge dataset. My usual approach with such a dataset is to split it into multiple tiny datasets using NumPy archives and use a generator to work through them. Are there any other alternatives to this? I also want to incorporate random run-time image augmentations with the Keras image preprocessing module, which is also a generator-type function. How do I streamline these two generator processes? The link for the Keras image augmentation module is below. https://keras.io/preprocessing/image/
My current data flow generator is as follows:
import os
import numpy as np

def dat_loader(path, batch_size):
    while True:
        for dirpath, subdirs, files in os.walk(path):
            for fname in files:
                file_path = os.path.join(dirpath, fname)
                archive = np.load(file_path)
                img = archive['images']
                truth = archive['truth']
                del archive
                num_batches = len(truth) // batch_size
                img = np.array_split(img, num_batches)
                truth = np.array_split(truth, num_batches)
                while truth:
                    batch_img = img.pop()
                    batch_truth = truth.pop()
                    yield batch_img, batch_truth
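For illustration, here is a rough sketch of how I imagine chaining the two generators, applying a Keras ImageDataGenerator to each batch before yielding it (the augmentation settings are arbitrary and I am assuming truth holds per-image labels, so I am not sure this is the right approach):

from keras.preprocessing.image import ImageDataGenerator

# Example augmentation settings; the actual transforms would differ
augmenter = ImageDataGenerator(rotation_range=15, horizontal_flip=True)

def augmented_loader(path, batch_size):
    for batch_img, batch_truth in dat_loader(path, batch_size):
        # flow() returns an iterator; taking one step gives the current
        # batch with random transformations applied to each image
        flow = augmenter.flow(batch_img, batch_truth,
                              batch_size=len(batch_img), shuffle=False)
        yield next(flow)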
Upvotes: 0
Views: 31
Reputation: 11225
One way to handle really large datasets is to use memory-mapped files that dynamically load the required data at runtime. NumPy has memmap, which creates an array mapped to a file on disk; the file can be massive (I once had one for a pre-processed version of offline Wikipedia and it was okay) but doesn't necessarily live in your RAM. Any changes get flushed back to the file when needed or when the object is garbage collected. Here is an example:
import numpy as np
# Create or load a memory mapped array, can contain your huge dataset
nrows, ncols = 1000000, 100
f = np.memmap('memmapped.dat', dtype=np.float32,
              mode='w+', shape=(nrows, ncols))
# Use it like a normal array but it will be slower as it might
# access the disk along the way.
for i in range(ncols):
    f[:, i] = np.random.rand(nrows)
The snippet is from the online tutorial. Note that this is just a potential solution; for your dataset and usage there might be better alternatives.
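To connect this to batch-wise training, the same file can later be opened read-only and sliced inside a generator, so only the rows that are touched get read from disk. A minimal sketch, assuming the shape and dtype used above (the generator name is just a placeholder):

import numpy as np

# Re-open the existing file without loading it into RAM; dtype and
# shape must match what was used when the file was written.
data = np.memmap('memmapped.dat', dtype=np.float32,
                 mode='r', shape=(nrows, ncols))

def batch_generator(arr, batch_size):
    while True:
        for start in range(0, len(arr), batch_size):
            # Copy each slice into a regular in-memory array per batch
            yield np.array(arr[start:start + batch_size])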
Upvotes: 1