Reputation: 5227
I'm a bit confused here:
As far as I understand, h5py's .value attribute reads an entire dataset and dumps it into an array, which is slow and discouraged (and should generally be replaced by [()]). The correct way is to use numpy-esque slicing.
However, I'm getting puzzling results (with h5py 2.2.1):
>>> import h5py
>>> import numpy as np
>>> file = h5py.File("test.hdf5",'w')
# Just fill a test file with a numpy array test dataset
>>> file["test"] = np.arange(0,300000)
# This is TERRIBLY slow?!
>>> file["test"][range(0,300000)]
array([ 0, 1, 2, ..., 299997, 299998, 299999])
# This is fast
>>> file["test"].value[range(0,300000)]
array([ 0, 1, 2, ..., 299997, 299998, 299999])
# This is also fast
>>> file["test"].value[np.arange(0,300000)]
array([ 0, 1, 2, ..., 299997, 299998, 299999])
# This crashes
>>> file["test"][np.arange(0,300000)]
I guess that my dataset is so small that .value doesn't hinder performance significantly, but how can the first option be that slow?
What is the preferred version here?
Thanks!
UPDATE
It seems that I wasn't clear enough, sorry. I do know that .value copies the whole dataset into memory while slicing only retrieves the appropriate subpart. What I'm wondering is why slicing in the file is slower than copying the whole array and then slicing in memory.
I always thought hdf5/h5py was implemented specifically so that slicing subparts would always be the fastest.
Upvotes: 18
Views: 17794
Reputation: 741
For fast slicing with h5py, stick to the "plain-vanilla" slice notation:
file['test'][0:300000]
or, for example, reading every other element:
file['test'][0:300000:2]
Simple slicing (slice objects and single integer indices) should be very fast, as it translates directly into HDF5 hyperslab selections.
The expression file['test'][range(300000)] invokes h5py's version of "fancy indexing", namely, indexing via an explicit list of indices. There's no native way to do this in HDF5, so h5py implements a (slower) method in Python, which unfortunately has abysmal performance when the list has more than about 1000 elements. The same applies to file['test'][np.arange(300000)], which is interpreted in the same way.
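If the index list is really just an evenly spaced range, as in the question, one possible workaround is to turn it back into a plain slice before indexing the dataset. The helper below is only a sketch: the name indices_to_slice is made up for illustration and is not part of h5py, and it handles only increasing, non-negative, evenly spaced indices.

```python
def indices_to_slice(indices):
    """Return an equivalent slice if `indices` is an increasing, evenly
    spaced sequence of non-negative integers; otherwise return None.

    Hypothetical helper for illustration, not an h5py function.
    """
    idx = [int(i) for i in indices]  # also accepts NumPy integer arrays
    if not idx or idx[0] < 0:
        return None
    if len(idx) == 1:
        return slice(idx[0], idx[0] + 1)
    step = idx[1] - idx[0]
    if step <= 0:
        return None
    # Every consecutive pair must be separated by the same step.
    for a, b in zip(idx, idx[1:]):
        if b - a != step:
            return None
    return slice(idx[0], idx[-1] + 1, step)
```

With the file from the question, this would be used as sl = indices_to_slice(range(0, 300000)) followed by file['test'][sl] when sl is not None. When the helper returns None, the indices are irregular and some other strategy is needed, for example reading a larger slice and indexing the resulting array in memory.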
See also:
[1] http://docs.h5py.org/en/latest/high/dataset.html#fancy-indexing
[2] https://github.com/h5py/h5py/issues/293
Upvotes: 31
Reputation: 488
The .value attribute copies the data into memory as a numpy array. Try comparing type(file["test"]) with type(file["test"].value): the former should be an HDF5 dataset, the latter a numpy array.
I'm not familiar enough with the h5py or HDF5 internals to tell you exactly why certain dataset operations are slow; but the reason those two are different is that in one case you're slicing a numpy array in memory, and in the other slicing an HDF5 dataset from disk.
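The difference can be checked with a small sketch like the one below. It assumes h5py and numpy are installed, uses an in-memory file (driver="core" with backing_store=False) under an arbitrary file name so nothing is written to disk, and reads with [()] instead of .value, since .value was deprecated and is no longer available in h5py 3.x.

```python
import h5py
import numpy as np

# In-memory HDF5 file; the file name is arbitrary and nothing is saved.
with h5py.File("type_demo.h5", "w", driver="core", backing_store=False) as f:
    f["test"] = np.arange(0, 300000)

    dset = f["test"]      # h5py dataset object: a handle, no data read yet
    whole = dset[()]      # reads the entire dataset into a numpy array
    part = dset[10:20]    # reads only elements 10..19 from the file

    dset_type = type(dset)
    whole_type = type(whole)
    part_list = part.tolist()

print(dset_type, whole_type, part_list)
```

Indexing whole afterwards is ordinary numpy indexing in memory, while indexing dset goes through h5py and HDF5 each time.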
Upvotes: 4
Reputation: 380
Based on the title of your post, the 'correct' way to slice array datasets is to use the built-in slice notation.
All of your examples would be equivalent to file["test"][:]
[:] selects all elements in the array.
More information about slicing notation can be found here: http://docs.scipy.org/doc/numpy/reference/arrays.indexing.html
I use HDF5 with Python often and have never had to use .value. Note that myarr = file["test"] gives you an h5py dataset object rather than a copy of the data; indexing it, as in file["test"][:], is what reads the values into a numpy array.
Upvotes: 3