kmario23

Reputation: 61365

Fast and efficient way of serializing and retrieving a large number of numpy arrays from HDF5 file

I have a huge list of numpy arrays, specifically 113287 of them, where each array has shape 36 x 2048. In terms of memory, this amounts to about 32 gigabytes.
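For reference, the ~32 GB figure follows from the array shape and the float32 dtype (shown as "<f4" below):

num_arrays = 113287
size_bytes = num_arrays * 36 * 2048 * 4      # float32 = 4 bytes per value
print(size_bytes / 1024**3)                  # ~31.1 GiB, i.e. roughly 32 GB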

As of now, I have serialized these arrays into a single giant HDF5 file. The problem is that retrieving individual arrays from this hdf5 file takes an excruciatingly long time (north of 10 minutes) per access.

How can I speed this up? This is very important for my implementation since I have to index into this list several thousand times for feeding into Deep Neural Networks.

Here's how I index into hdf5 file:

In [1]: import h5py
In [2]: hf = h5py.File('train_ids.hdf5', 'r')

In [5]: list(hf.keys())[0]
Out[5]: 'img_feats'

In [6]: group_key = list(hf.keys())[0]

In [7]: hf[group_key]
Out[7]: <HDF5 dataset "img_feats": shape (113287, 36, 2048), type "<f4">


# this is where it takes very very long time
In [8]: list(hf[group_key])[-1].shape
Out[8]: (36, 2048)

Any ideas how I can speed this up? Is there any other way of serializing these arrays for faster access?

Note: I'm using a Python list since I want the order to be preserved (i.e. I want to retrieve the arrays in the same order as I put them in when I created the hdf5 file).

Upvotes: 2

Views: 2115

Answers (2)

hpaulj

Reputation: 231385

According to Out[7], "img_feats" is a large 3d dataset with shape (113287, 36, 2048).

Define ds as the dataset (doesn't load anything):

ds = hf[group_key]

x = ds[0]    # should be a (36, 2048) array

arr = ds[:]   # should load the whole dataset into memory.
arr = ds[:n]   # load a subset, slice 

According to h5py-reading-writing-data:

HDF5 datasets re-use the NumPy slicing syntax to read and write to the file. Slice specifications are translated directly to HDF5 “hyperslab” selections, and are a fast and efficient way to access data in the file.

I don't see any point in wrapping that in list(); that is, in splitting the 3d array into a list of 113287 2d arrays. There's a clean mapping between a 3d dataset in the HDF5 file and a numpy array.
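For instance, a minimal sketch (same file and dataset names as in the question):

import h5py

hf = h5py.File('train_ids.hdf5', 'r')
ds = hf['img_feats']          # no data is read here

# fast: reads only the last (36, 2048) sub-array from disk
last = ds[-1]

# slow: list(ds) iterates over the whole first axis and pulls
# all 113287 sub-arrays into memory before taking [-1]
# last = list(ds)[-1]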

h5py-fancy-indexing warns that fancy indexing of a dataset is slower. That is, loading, say, subarrays [1, 1000, 3000, 6000] of that large dataset.
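A small illustration (the index list is hypothetical):

# fancy indexing of a dataset is slower than a plain slice
subset = ds[[1, 1000, 3000, 6000]]     # shape (4, 36, 2048)

# a contiguous slice reads one hyperslab and is much faster
block = ds[1000:1004]                  # shape (4, 36, 2048)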

You might want to experiment with writing and reading some smaller datasets if working with this large one is too confusing.
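For example, a throwaway experiment along those lines (file name and sizes are arbitrary):

import numpy as np
import h5py

# write a small test dataset: 1000 samples instead of 113287
small = np.random.rand(1000, 36, 2048).astype('f4')
with h5py.File('small_test.hdf5', 'w') as f:
    f.create_dataset('img_feats', data=small)

# read it back with different access patterns and compare timings
with h5py.File('small_test.hdf5', 'r') as f:
    ds = f['img_feats']
    x = ds[-1]        # one sample
    part = ds[:100]   # a contiguous block
    arr = ds[:]       # the whole dataset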

Upvotes: 2

cvanelteren

Reputation: 1703

One way would be to put each sample into its own group and index directly into those. I am thinking the list() conversion takes so long because it tries to load the entire dataset into memory (which it has to read from disk). Re-organizing the h5 file such that

  • group
    • sample
      • 36 x 2048

may help the indexing speed; see the sketch below.
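A minimal sketch of that layout, assuming the original arrays are available in a list called samples (whether it actually beats slicing one big dataset is worth benchmarking):

import h5py

# write: one small dataset per sample, keyed by its index so order is preserved
with h5py.File('train_ids_grouped.hdf5', 'w') as f:
    grp = f.create_group('img_feats')
    for i, arr in enumerate(samples):          # samples: the 113287 (36, 2048) arrays
        grp.create_dataset(str(i), data=arr)

# read: pull out a single sample without touching the rest
with h5py.File('train_ids_grouped.hdf5', 'r') as f:
    sample = f['img_feats']['42'][:]           # loads only that (36, 2048) array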

Upvotes: 1
