VGP

Reputation: 85

h5py: how to read selected rows of an hdf5 file?

Is it possible to read a given set of rows from an hdf5 file without loading the whole file? I have quite big hdf5 files with lots of datasets. Here is an example of what I had in mind to reduce time and memory usage:

#! /usr/bin/env python

import numpy as np
import h5py

infile = 'field1.87.hdf5'
f = h5py.File(infile,'r')
group = f['Data']

mdisk = group['mdisk'].value

val = 2.*pow(10.,10.)
ind = np.where(mdisk>val)[0]

m = group['mcold'][ind]
print m

ind doesn't give consecutive rows but rather scattered ones.

The above code fails, but it follows the standard way of slicing an hdf5 dataset. The error message I get is:

Traceback (most recent call last):
  File "./read_rows.py", line 17, in <module>
    m = group['mcold'][ind]
  File "/cosma/local/Python/2.7.3/lib/python2.7/site-packages/h5py-2.3.1-py2.7-linux-x86_64.egg/h5py/_hl/dataset.py", line 425, in __getitem__
    selection = sel.select(self.shape, args, dsid=self.id)
  File "/cosma/local/Python/2.7.3/lib/python2.7/site-packages/h5py-2.3.1-py2.7-linux-x86_64.egg/h5py/_hl/selections.py", line 71, in select
    sel[arg]
  File "/cosma/local/Python/2.7.3/lib/python2.7/site-packages/h5py-2.3.1-py2.7-linux-x86_64.egg/h5py/_hl/selections.py", line 209, in __getitem__
    raise TypeError("PointSelection __getitem__ only works with bool arrays")
TypeError: PointSelection __getitem__ only works with bool arrays

Upvotes: 4

Views: 10272

Answers (1)

hpaulj

Reputation: 231345

I have a sample h5py file with:

data = f['data']
#  <HDF5 dataset "data": shape (3, 6), type "<i4">
# is arange(18).reshape(3,6)
ind=np.where(data[:]%2)[0]
# array([0, 0, 0, 1, 1, 1, 2, 2, 2], dtype=int32)
data[ind]           # error: __getitem__ only works with bool arrays
data[ind.tolist()]  # error: can't read data (Dataset: Read failed)

This last error is caused by repeated values in the list.

Indexing with a list of unique values, however, works fine:

In [150]: data[[0,2]]
Out[150]: 
array([[ 0,  1,  2,  3,  4,  5],
       [12, 13, 14, 15, 16, 17]])

In [151]: data[:,[0,3,5]]
Out[151]: 
array([[ 0,  3,  5],
       [ 6,  9, 11],
       [12, 15, 17]])

So does indexing with a NumPy array, provided its values are unique and the remaining dimensions are sliced:

In [157]: data[ind[[0,3,6]],:]
Out[157]: 
array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17]])
In [165]: f['data'][:2,np.array([0,3,5])]
Out[165]: 
array([[ 0,  3,  5],
       [ 6,  9, 11]])
In [166]: f['data'][[0,1],np.array([0,3,5])]  
# error: only one indexing array is allowed

So if the indexing is right (unique values, at most one index list or array, and a shape that matches the dataset's dimensions), it should work.
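
If the index array does contain repeats, as `ind` does in my example, one possible workaround is to read each distinct row once and rebuild the repeats in memory. This is only a sketch: `read_rows` is a hypothetical helper of mine, not part of h5py, and the demo file is a throwaway copy of the 3x6 example above.

```python
import os
import tempfile

import h5py
import numpy as np


def read_rows(dset, ind):
    """Read rows `ind` (any order, repeats allowed) from an h5py dataset.

    Hypothetical helper, not an h5py function.
    """
    ind = np.asarray(ind).ravel()
    if ind.size == 0:
        return dset[:0]
    # h5py wants increasing, non-repeating indices, so read each row once...
    uniq, inverse = np.unique(ind, return_inverse=True)
    rows = dset[uniq.tolist()]
    # ...then restore the requested order and repeats in memory.
    return rows[inverse]


# Throwaway file with the same contents as the example dataset.
path = os.path.join(tempfile.mkdtemp(), "demo.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("data", data=np.arange(18).reshape(3, 6))

with h5py.File(path, "r") as f:
    data = f["data"]
    ind = np.where(data[:] % 2)[0]  # row indices, with repeats
    result = read_rows(data, ind)
```

Only the distinct rows are requested from the file; the expansion back to nine rows happens on the in-memory array.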

My simple example doesn't test how much of the array is loaded. The documentation suggests that elements are selected from the file without loading the whole array into memory.
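
Applied to the layout in the question, that would look roughly like the sketch below. The file name and the values in it are made up for the demo; only the `Data`, `mdisk` and `mcold` names come from the question. I use `[...]` instead of `.value`, which was removed in h5py 3.0, and `np.flatnonzero` so the indices are increasing and free of repeats.

```python
import os
import tempfile

import h5py
import numpy as np

# Throwaway file mimicking the layout in the question (values are made up).
path = os.path.join(tempfile.mkdtemp(), "field_demo.hdf5")
with h5py.File(path, "w") as f:
    grp = f.create_group("Data")
    grp.create_dataset("mdisk", data=np.array([1e10, 3e10, 5e9, 4e10]))
    grp.create_dataset("mcold", data=np.array([1.0, 2.0, 3.0, 4.0]))

with h5py.File(path, "r") as f:
    grp = f["Data"]
    mdisk = grp["mdisk"][...]           # reads only this one dataset
    ind = np.flatnonzero(mdisk > 2e10)  # increasing indices, no repeats
    # Guard the empty case rather than rely on how h5py treats an empty list.
    m = grp["mcold"][ind.tolist()] if ind.size else grp["mcold"][:0]

print(m)
```

The `mdisk` dataset is still read in full to build the selection, but `mcold` is only asked for the matching rows.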

Upvotes: 5
