Reputation: 73
I am currently implementing a machine learning model that uses a rather heavy representation of the data.
My dataset is composed of images. Each image is encoded as a (224, 224, 103)
matrix, which makes the whole dataset very large. I store these matrices on disk and load them during training.
What I am doing right now is using mini-batches of 8 images and loading the .npy
files for those 8 images from disk throughout the entire training process. This is slow, but it works.
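Roughly, my per-batch loading looks like the sketch below (simplified; `file_paths` and `labels` stand in for my actual bookkeeping):

```python
import numpy as np

# Simplified sketch of the current per-batch loading;
# file_paths and labels are placeholders for the real dataset index.
def batch_generator(file_paths, labels, batch_size=8):
    for start in range(0, len(file_paths), batch_size):
        paths = file_paths[start:start + batch_size]
        x = np.stack([np.load(p) for p in paths])   # shape (batch, 224, 224, 103)
        y = np.array(labels[start:start + batch_size])
        yield x, y
```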
Is there a more efficient way to do this using Keras/TensorFlow (which is what I'm using to build my model)? Unfortunately, I couldn't find much about a data loader that would let me do this.
Thanks in advance.
Upvotes: 1
Views: 1487
Reputation: 1625
You have several options to do this.
I will assume that the transformations you apply to the images to get the final (224, 224, 103)
matrix are very expensive, and that it is therefore not desirable to do that pre-processing during data loading. If that is not the case, you might benefit from reading the tutorial on image processing.
I suggest you use a Python generator to read the data and tf.data
to build a pipeline that feeds these .npy
files to your model. The basic idea is very simple: you use a wrapper to ingest data from a generator that reads the files as needed. The relevant documentation and examples are here.
Now, once you get that working, I think it would be a good idea to optimize your pipeline, especially if you plan to train on multiple GPUs or multiple machines.
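For example, one optimization worth trying (still a sketch under the same placeholder names) is to build the dataset from the file paths themselves and load each .npy file inside a mapped function, so the reads can run in parallel and be prefetched while the GPU is busy:

```python
import numpy as np
import tensorflow as tf

# Loads a single .npy file; called through tf.numpy_function, so `path`
# arrives as a bytes object.
def load_npy(path):
    return np.load(path.decode()).astype(np.float32)

def make_parallel_dataset(file_paths, labels, batch_size=8):
    ds = tf.data.Dataset.from_tensor_slices((file_paths, labels))
    ds = ds.shuffle(len(file_paths))
    # Run the disk reads in parallel and declare the static shape afterwards.
    ds = ds.map(
        lambda path, label: (
            tf.ensure_shape(
                tf.numpy_function(load_npy, [path], tf.float32),
                (224, 224, 103),
            ),
            label,
        ),
        num_parallel_calls=tf.data.AUTOTUNE,
    )
    # Prefetch so the next batch is being read while the current one trains.
    return ds.batch(batch_size).prefetch(tf.data.AUTOTUNE)
```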
Upvotes: 1