Fast and efficient way of serializing and retrieving a large number of numpy arrays from HDF5 file

Question

I have a huge list of numpy arrays, specifically 113287, where each array is of shape 36 x 2048. In terms of memory, this amounts to 32 Gigabytes.

As of now, I have serialized these arrays as a giant HDF5 file. Now, the problem is that retrieving individual arrays from this hdf5 file takes excruciatingly long time (north of 10 mins) for each access.

How can I speed this up? This is very important for my implementation since I have to index into this list several thousand times for feeding into Deep Neural Networks.

Here's how I index into hdf5 file:

In [1]: import h5py
In [2]: hf = h5py.File('train_ids.hdf5', 'r')

In [5]: list(hf.keys())[0]
Out[5]: 'img_feats'

In [6]: group_key = list(hf.keys())[0]

In [7]: hf[group_key]
Out[7]: 


# this is where it takes very very long time
In [8]: list(hf[group_key])[-1].shape
Out[8]: (36, 2048)

Any ideas where I can speed things up? Is there any other way of serializing these arrays for faster access?

Note: I'm using a Python list since I want the order to be preserved (i.e. to retrieve in the same order as I put it when I created the hdf5 file)

hpaulj · Accepted Answer

According to Out[7], "img_feats" is a large 3d array. (113287, 36, 2048) shape.

Define ds as the dataset (doesn't load anything):

ds = hf[group_key]

x = ds[0]    # should be a (36, 2048) array

arr = ds[:]   # should load the whole dataset into memory.
arr = ds[:n]   # load a subset, slice

According to h5py-reading-writing-data :

HDF5 datasets re-use the NumPy slicing syntax to read and write to the file. Slice specifications are translated directly to HDF5 “hyperslab” selections, and are a fast and efficient way to access data in the file.

I don't see any point in wrapping that in list(); that is, in splitting the 3d array in a list of 113287 2d arrays. There's a clean mapping between 3d datasets on the HDF5 file and numpy arrays.

h5py-fancy-indexing warns that fancy indexing of a dataset is slower. That is, seeking to load, say [1, 1000, 3000, 6000] subarrays of that large dataset.

You might want to experiment with writing and reading some smaller datasets if working with this large one is too confusing.

Fast and efficient way of serializing and retrieving a large number of numpy arrays from HDF5 file

Answers (2)

Related Questions