J S

Reputation: 75

Increasing reading speed for h5py

I'm having a minor issue with Python's h5py package. I'm working with a very large dataset (ca. 250k small image fragments) stored in an HDF5 file as an array with the dimensions (num_images x color_channels x width x height).

This dataset is randomly split into training and validation data. Consequently, I need to read out random elements of the data when training my classifier.

I've made the (to me) bizarre discovery that loading the entire dataset (all 250k images) is MUCH faster than reading out a specific subset of it. Specifically, reading the entire array as:

data = h5py.File("filename.h5", "r")["images"][:]

is about a factor of 5 faster than reading out only a random, non-sequential subset of the images (25k images):

indices = [3, 23, 31, 105, 106, 674, ...]
data = h5py.File("filename.h5", "r")["images"][indices, :, :, :]

Is this by design? Is it due to compression of the hdf5 file?

Upvotes: 4

Views: 7625

Answers (1)

hpaulj

Reputation: 231395

http://docs.h5py.org/en/latest/high/dataset.html#fancy-indexing

A subset of the NumPy fancy-indexing syntax is supported. Use this with caution, as the underlying HDF5 mechanisms may have different performance than you expect.

Very long lists (> 1000 elements) may produce poor performance

Advanced indexing requires reading a block of data, then skipping some distance, reading another block, and so on. If the data is all in memory, as in an ndarray's data buffer, that can be done relatively fast, though still slower than reading the same number of bytes in one contiguous block. When the data is in a file, you additionally pay for the file seeks between the block reads.
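For a rough sense of where the time goes, here is a minimal timing sketch; the file name, dataset name and 25k subset size come from the question, everything else (shapes, RNG seed) is assumed:

import time
import numpy as np
import h5py

with h5py.File("filename.h5", "r") as f:
    dset = f["images"]  # shape: (num_images, channels, width, height)
    rng = np.random.default_rng(0)
    # 25k random rows; h5py's fancy indexing wants them in increasing order
    indices = np.sort(rng.choice(dset.shape[0], size=25_000, replace=False))

    # One contiguous read of the whole dataset, then index the in-memory array
    t0 = time.perf_counter()
    subset_a = dset[:][indices]
    t1 = time.perf_counter()

    # Fancy indexing on the dataset itself: many small seek-and-read operations
    subset_b = dset[indices]
    t2 = time.perf_counter()

    print(f"read all then index: {t1 - t0:.2f} s, fancy index: {t2 - t1:.2f} s")
    assert np.array_equal(subset_a, subset_b)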

Also if you are using chunking and compression:

Chunking has performance implications. It’s recommended to keep the total size of your chunks between 10 KiB and 1 MiB, larger for larger datasets. Also keep in mind that when any element in a chunk is accessed, the entire chunk is read from disk.
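If random access to single images is the dominant pattern, it may be worth chunking the dataset so that one chunk holds exactly one image, so that a random read decompresses exactly one chunk. A sketch of the dataset creation, with made-up dimensions and dtype:

import h5py

num_images, channels, width, height = 250_000, 3, 32, 32  # assumed shape

with h5py.File("filename.h5", "w") as f:
    f.create_dataset(
        "images",
        shape=(num_images, channels, width, height),
        dtype="uint8",
        chunks=(1, channels, width, height),  # one image per chunk
        compression="gzip",                   # optional; decompression is per chunk
    )

Note that a 3 KiB chunk is below the 10 KiB guideline quoted above, so this trades some bulk-read throughput for cheaper random access.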

I wonder if saving the images as individual datasets would improve performance. You'd then retrieve them by name rather than by first-dimension index. You'd have to join them into a 4D array yourself, but I suspect h5py has to do that anyway (it will have read them individually either way). A minimal sketch of that idea follows (file name, dataset naming scheme and shapes are all assumptions):
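import numpy as np
import h5py

# Write each image as its own dataset, addressed by name
with h5py.File("images_by_name.h5", "w") as f:
    for i, img in enumerate(np.zeros((100, 3, 32, 32), dtype=np.uint8)):
        f.create_dataset(f"img_{i:06d}", data=img)

# Read a random subset by name and join into a 4D array
indices = [3, 23, 31]
with h5py.File("images_by_name.h5", "r") as f:
    batch = np.stack([f[f"img_{i:06d}"][:] for i in indices])

print(batch.shape)  # (3, 3, 32, 32)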

Upvotes: 3
