Efficient way to make h5py file with memory constraint

Question

Let's say I have images laid out like this:

root
|___dog
|    |___img1.jpg
|    |___img2.jpg
|    |___...
|    
|___cat
|___...

I want to convert these image files into an h5py (HDF5) file.

First, I tried reading all the image files into memory and then writing them to an h5 file.

import os
import numpy as np
import h5py
import PIL.Image as Image

data_path = 'data.h5'  # output file (placeholder name; not given in the original post)
data_x = []  # every resized image, all held in memory at once
data_y = []  # integer class label for each image

label_list = os.listdir('root')
for i, label in enumerate(label_list):
    files = os.listdir(os.path.join('root', label))
    for filename in files:
        img = Image.open(os.path.join('root', label, filename))
        ow, oh = 128, 128
        img = img.resize((ow, oh), Image.BILINEAR)
        data_x.append(np.array(img).tolist())
        data_y.append(i)

# Write everything in one go once all images are loaded
datafile = h5py.File(data_path, 'w')
datafile.create_dataset("data_image", dtype='uint8', data=data_x)
datafile.create_dataset("data_label", dtype='int64', data=data_y)
datafile.close()

But I can't do it this way because of memory constraints (each folder has more than 200,000 images of size 224x224).

So, what is the best way to write these images to an h5 file?

kcw78 · Accepted Answer

HDF5/h5py dataset objects have a much smaller memory footprint than a NumPy array of the same size, because the data lives on disk and only the slices you read or write pass through memory. (That's one advantage of using HDF5.) You can create the HDF5 file and allocate the datasets BEFORE you start looping over the image files. Then you can process the images one at a time (read, resize, and write image 0, then image 1, etc.).
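
To illustrate the memory point, here is a minimal sketch. The scratch file name and the small first dimension are placeholders chosen for the example, not values from the question. It allocates a dataset in the file, writes one image-sized slice, and reads back only the slices it asks for.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical scratch file; any writable path would do.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.h5")

with h5py.File(demo_path, "w") as f:
    # The dataset lives in the file; no (1000, 224, 224) NumPy array is built in RAM.
    dset = f.create_dataset("data_image", (1000, 224, 224), dtype="uint8")
    # Write a single image-sized slice.
    dset[0, :, :] = np.full((224, 224), 7, dtype="uint8")

with h5py.File(demo_path, "r") as f:
    # Indexing reads only the requested slice from disk.
    first = f["data_image"][0]
    untouched = f["data_image"][1]
```

Here `first` holds the slice that was written, and `untouched` comes back as zeros, since nothing was written at that index.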

The code below creates the necessary datasets presized for 200,000 images; set that number to your total image count across all folders. The code logic is rearranged to work as I described, and the img_cnt variable tracks where each new image goes in the existing datasets. (Note: I think this works as written. However, without the data I couldn't test it, so it may need minor tweaking.) If you want to be able to resize the datasets later, add the maxshape=() parameter to the create_dataset() call.

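import os
import numpy as np
import h5py
import PIL.Image as Image

ow, oh = 224, 224  # target size; must match the dataset shape below

# Open HDF5 and create datasets in advance
# (data_path is the output file path, as in the question)
datafile = h5py.File(data_path, 'w')
datafile.create_dataset("data_image", (200000, oh, ow), dtype='uint8')
datafile.create_dataset("data_label", (200000,), dtype='int64')

label_list = os.listdir('root')
img_cnt = 0
for i, label in enumerate(label_list):
    files = os.listdir(os.path.join('root', label))
    for filename in files:
        img = Image.open(os.path.join('root', label, filename))
        # Assumption: single-channel storage. For RGB, use convert('RGB')
        # and add a trailing 3 to the dataset shape.
        img = img.convert('L').resize((ow, oh), Image.BILINEAR)
        # Write one image at a time; only this array is held in memory
        datafile["data_image"][img_cnt, :, :] = np.array(img)
        datafile["data_label"][img_cnt] = i
        img_cnt += 1

datafile.close()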
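
If the total image count is not known up front, the maxshape option mentioned above can be sketched roughly as follows. This is an illustration under stated assumptions, not a tested drop-in: `fake_images`, `out_path`, and the block size of 4 are hypothetical stand-ins, with the generator taking the place of the resized PIL images.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical output path for the example.
out_path = os.path.join(tempfile.mkdtemp(), "growable.h5")


def fake_images(n):
    """Stand-in for the resized images; yields (array, label) pairs."""
    for k in range(n):
        yield np.full((224, 224), k % 256, dtype="uint8"), k % 2


block = 4  # grow the datasets this many rows at a time (arbitrary choice)

with h5py.File(out_path, "w") as f:
    # maxshape=None on axis 0 lets that axis grow without limit.
    images = f.create_dataset(
        "data_image", (block, 224, 224), maxshape=(None, 224, 224),
        dtype="uint8", chunks=(1, 224, 224),
    )
    labels = f.create_dataset(
        "data_label", (block,), maxshape=(None,), dtype="int64"
    )

    count = 0
    for arr, lab in fake_images(10):
        if count == images.shape[0]:
            # Out of room: extend both datasets by one block.
            images.resize(count + block, axis=0)
            labels.resize(count + block, axis=0)
        images[count] = arr
        labels[count] = lab
        count += 1

    # Trim the unused tail so the shape matches the number of images written.
    images.resize(count, axis=0)
    labels.resize(count, axis=0)
    final_shape = images.shape
    last_pixel = int(images[count - 1, 0, 0])
```

Growing in blocks rather than one row per image keeps the number of resize calls small; the final trim means readers of the file never see unwritten rows.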
