Reputation: 85
Is it possible to read a given set of rows from an hdf5 file without loading the whole file? I have quite big hdf5 files with loads of datasets, here is an example of what I had in mind to reduce time and memory usage:
#! /usr/bin/env python
import numpy as np
import h5py
infile = 'field1.87.hdf5'
f = h5py.File(infile,'r')
group = f['Data']
mdisk = group['mdisk'].value
val = 2.*pow(10.,10.)
ind = np.where(mdisk>val)[0]
m = group['mcold'][ind]
print m
Note that ind doesn't give consecutive rows but rather scattered ones.
The above code fails, but it follows the standard way of slicing an hdf5 dataset. The error message I get is:
Traceback (most recent call last):
File "./read_rows.py", line 17, in <module>
m = group['mcold'][ind]
File "/cosma/local/Python/2.7.3/lib/python2.7/site-packages/h5py-2.3.1-py2.7-linux-x86_64.egg/h5py/_hl/dataset.py", line 425, in __getitem__
selection = sel.select(self.shape, args, dsid=self.id)
File "/cosma/local/Python/2.7.3/lib/python2.7/site-packages/h5py-2.3.1-py2.7-linux-x86_64.egg/h5py/_hl/selections.py", line 71, in select
sel[arg]
File "/cosma/local/Python/2.7.3/lib/python2.7/site-packages/h5py-2.3.1-py2.7-linux-x86_64.egg/h5py/_hl/selections.py", line 209, in __getitem__
raise TypeError("PointSelection __getitem__ only works with bool arrays")
TypeError: PointSelection __getitem__ only works with bool arrays
Upvotes: 4
Views: 10272
Reputation: 231345
I have a sample h5py file with:
data = f['data']
# <HDF5 dataset "data": shape (3, 6), type "<i4">
# is arange(18).reshape(3,6)
ind=np.where(data[:]%2)[0]
# array([0, 0, 0, 1, 1, 1, 2, 2, 2], dtype=int32)
data[ind] # getitem only works with boolean arrays error
data[ind.tolist()] # can't read data (Dataset: Read failed) error
This last error is caused by the repeated values in the list.
Indexing with a list of unique values, however, works fine:
In [150]: data[[0,2]]
Out[150]:
array([[ 0, 1, 2, 3, 4, 5],
[12, 13, 14, 15, 16, 17]])
In [151]: data[:,[0,3,5]]
Out[151]:
array([[ 0, 3, 5],
[ 6, 9, 11],
[12, 15, 17]])
So does an array with the proper dimension slicing:
In [157]: data[ind[[0,3,6]],:]
Out[157]:
array([[ 0, 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10, 11],
[12, 13, 14, 15, 16, 17]])
In [165]: f['data'][:2,np.array([0,3,5])]
Out[165]:
array([[ 0, 3, 5],
[ 6, 9, 11]])
In [166]: f['data'][[0,1],np.array([0,3,5])]
# error about only one indexing array allowed
So if the indexing is right (unique values that match the array dimensions, with at most one indexing list or array), it should work.
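If the row indices may contain repeats, one way to satisfy the "unique values" condition is to read each distinct row once and then expand the result in memory. The following is only a minimal sketch: read_rows is a hypothetical helper (not part of h5py), and a plain NumPy array stands in for the dataset so the snippet runs without a file. It assumes the dataset accepts a list of increasing, unique row numbers, as in the data[[0,2]] example above.

```python
import numpy as np

def read_rows(dset, ind):
    """Return the rows of `dset` listed in `ind` (any order, repeats allowed).

    Hypothetical helper, not an h5py function. `dset` can be anything that
    accepts a list of increasing, unique row numbers as its first index.
    """
    ind = np.asarray(ind)
    # np.unique returns the sorted distinct values, plus the positions
    # needed to rebuild `ind` from them.
    uniq, inverse = np.unique(ind, return_inverse=True)
    rows = dset[uniq.tolist()]   # single read with a sorted, duplicate-free list
    return rows[inverse]         # expand back to the requested order in memory

# Stand-in for the (3, 6) dataset used above.
data = np.arange(18).reshape(3, 6)
ind = np.where(data % 2)[0]      # contains each row number three times
out = read_rows(data, ind)
print(out.shape)
```

With a real file, the call would presumably look like read_rows(f['data'], ind); only the distinct rows are requested from the dataset, and the duplication happens on the NumPy side.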
My simple example doesn't test how much of the array is loaded. The documentation sounds as though elements are selected from the file without loading the whole array into memory.
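Applied to the layout in the question, the error message itself hints at an alternative: pass the boolean mask directly instead of the integer positions from np.where. Below is a rough sketch under stated assumptions. It builds a small throwaway file with made-up values (the file name example.hdf5 and the numbers are mine; only the 'Data', 'mdisk' and 'mcold' names come from the question), and it has not been run against the h5py 2.3.1 install from the traceback.

```python
import os
import tempfile

import h5py
import numpy as np

# Throwaway file that mimics the question's layout; the values are invented.
path = os.path.join(tempfile.mkdtemp(), "example.hdf5")
with h5py.File(path, "w") as f:
    g = f.create_group("Data")
    g.create_dataset("mdisk", data=np.array([1e10, 3e10, 5e9, 4e10, 2.5e10]))
    g.create_dataset("mcold", data=np.array([10.0, 11.0, 12.0, 13.0, 14.0]))

val = 2.0e10
with h5py.File(path, "r") as f:
    group = f["Data"]
    mdisk = group["mdisk"][:]      # the column used for the cut is read in full
    mask = mdisk > val             # boolean array with the dataset's shape
    m = group["mcold"][mask]       # boolean mask selection
    rows = np.flatnonzero(mask)    # increasing, unique row numbers
    m_list = group["mcold"][rows.tolist()]   # same selection as a plain list

print(m)
```

Both forms should return the same scattered rows; the list form is the one demonstrated earlier in this answer, and the mask form is what the "only works with bool arrays" message points to.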
Upvotes: 5