Zohair

Reputation: 505

Efficient way of importing images in Python

I have a dataset of about 22,000 images (roughly 900 MB in total) and I wanted to import it into Python to train a CNN.

I use the following code to import it and save it all in an array called X :

import numpy as np
import scipy.misc as sm

for i in range(start, end):
    # build the path from the dataSet lookup table and read the image
    imageLink = "./dataSet/" + str(dataSet[i, 0]) + "/" + str(dataSet[i, 1])
    image = sm.imread(imageLink)
    # append the image to the running array X
    X = np.append(X, image, axis=0)

There are a few issues with this:

  1. It's incredibly slow. Thirty minutes of run time imports only about 1,000 images into Python, and it gets slower as the number of images grows.

  2. It takes up a lot of RAM. Importing about 2,000 images consumes roughly 16 GB of RAM (my machine has only 16 GB, so I end up using swap, which I suppose makes it even slower).

The images are all sized 640 × 480.

Am I doing something wrong, or is this normal? Is there a better or faster method to import images?

Thank you.

Upvotes: 1

Views: 674

Answers (1)

Pavel

Reputation: 7552

Here are some general recommendations for this type of task:

  1. Upgrade to a fast SSD if you haven't already. Whatever the processing pipeline does, fast storage is crucial.
  2. Don't load the whole data set into memory. Build a batching mechanism that loads e.g. 100 files at a time, processes them, and frees the memory for the next batch (see the first sketch after this list).
  3. Use a second thread to build the next batch while the current one is being processed (second sketch below).
  4. Introduce a separate pre-processing step that converts the JPEG images read through imread into NumPy arrays, with all required normalization already applied. Store the arrays to disk so that your main training process only needs to read them back with numpy.fromfile() (third sketch).
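For point 2, here's a minimal sketch of a batch loader. It reuses the dataSet array and directory layout from your question and assumes every image is 640 × 480 RGB, as you stated. The key difference from your loop is that each batch is preallocated with np.empty instead of grown with np.append, so nothing gets copied over and over and peak memory is one batch (about 88 MB at this size), not the whole data set:

import numpy as np
import scipy.misc as sm

def batch_generator(dataSet, batch_size=100):
    # Yield fixed-size batches instead of one ever-growing array.
    # Assumes 640x480 RGB images, per the question.
    n = len(dataSet)
    for start in range(0, n, batch_size):
        end = min(start + batch_size, n)
        # preallocate one block per batch; no repeated copying
        batch = np.empty((end - start, 480, 640, 3), dtype=np.uint8)
        for j, i in enumerate(range(start, end)):
            imageLink = "./dataSet/" + str(dataSet[i, 0]) + "/" + str(dataSet[i, 1])
            batch[j] = sm.imread(imageLink)
        yield batch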
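Point 3 can be layered on top with a background thread and a bounded queue; this is a framework-agnostic sketch, and the function name is just illustrative. The queue's maxsize caps how far the loader can run ahead, so prefetching never re-creates the memory problem:

import queue
import threading

def prefetching(batches, max_prefetch=2):
    # Load the next batch on a worker thread while the caller
    # is still processing the current one.
    q = queue.Queue(maxsize=max_prefetch)
    sentinel = object()  # marks the end of the stream

    def producer():
        for batch in batches:
            q.put(batch)  # blocks if the consumer falls behind
        q.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        batch = q.get()
        if batch is sentinel:
            return
        yield batch

You'd then iterate with something like for batch in prefetching(batch_generator(dataSet)): and run your training step on each batch.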
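And a sketch of the pre-processing step from point 4. The file name, the float32 normalization, and the helper names are all illustrative; the pairing that matters is ndarray.tofile() for writing raw bytes and numpy.fromfile() for reading them back, which skips the JPEG decode entirely at training time. Be aware that raw float32 pixels take far more disk space than JPEGs (about 3.7 MB per 640 × 480 image), which is the trade-off for fast reads:

import numpy as np
import scipy.misc as sm

HEIGHT, WIDTH, CHANNELS = 480, 640, 3  # per the question
FLOATS_PER_IMAGE = HEIGHT * WIDTH * CHANNELS

def preprocess(dataSet, out_path="train.f32"):
    # One-off step: decode every JPEG, normalize to float32 in [0, 1],
    # and append the raw bytes to a single binary file.
    with open(out_path, "wb") as f:
        for i in range(len(dataSet)):
            imageLink = "./dataSet/" + str(dataSet[i, 0]) + "/" + str(dataSet[i, 1])
            image = sm.imread(imageLink).astype(np.float32) / 255.0
            image.tofile(f)  # raw bytes, no header

def load_batch(path, batch_index, batch_size=100):
    # Training-time read: seek to the batch and pull it in with
    # np.fromfile -- no image decoding at all.
    with open(path, "rb") as f:
        f.seek(batch_index * batch_size * FLOATS_PER_IMAGE * 4)  # float32 = 4 bytes
        flat = np.fromfile(f, dtype=np.float32, count=batch_size * FLOATS_PER_IMAGE)
    return flat.reshape(-1, HEIGHT, WIDTH, CHANNELS)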

Upvotes: 2
