Reputation: 57
I created a .h5 file from a numpy array:
h5f = h5py.File('/data/debo/jetAnomaly/AtlasData/dijets/mergedRoot/miniTrees/JZ3W.h5', 'w')
h5f.create_dataset('JZ3WPpxpypz', data=all, compression="gzip")
HDF5 dataset "JZ3WPpxpypz": shape (19494500, 376), type "f8"
But I get a memory error when reading the .h5 file into a numpy array:
filename = '/data/debo/jetAnomaly/AtlasData/dijets/mergedRoot/miniTrees/JZ3W.h5'
h5 = h5py.File(filename,'r')
h5.keys()
[u'JZ3WPpxpypz']
data = h5['JZ3WPpxpypz']
If I try to view the array, it gives me a memory error:
data[:]
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
<ipython-input-33-629f56f97409> in <module>()
----> 1 data[:]
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
/home/debo/env_autoencoder/local/lib/python2.7/site-packages/h5py/_hl/dataset.pyc in __getitem__(self, args)
560 single_element = selection.mshape == ()
561 mshape = (1,) if single_element else selection.mshape
--> 562 arr = numpy.ndarray(mshape, new_dtype, order='C')
563
564 # HDF5 has a bug where if the memory shape has a different rank
MemoryError:
Is there a memory-efficient way to read a .h5 file into a numpy array?
Thanks, Debo.
Upvotes: 1
Views: 1582
Reputation: 7996
You don't need to call numpy.ndarray() to get an array.
Try this:
arr = h5['JZ3WPpxpypz'][()]
# or
arr = data[()]
Adding [()] returns the entire dataset as a numpy array. That is different from your data variable, which only references the HDF5 dataset on disk. Either method should give you an array with the same dtype and shape as the original array. You can also use numpy slicing operations to read subsets of the array.
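As a rough illustration of those slicing operations, here is a minimal sketch on a small stand-in file. The file name demo.h5, the dataset name "demo", and its shape are made up for the example and are not from your data; the same indexing should apply to your dataset.

```python
import os
import tempfile

import h5py
import numpy as np

# Build a small stand-in file. The dataset name "demo" and its (10, 6)
# shape are hypothetical, chosen only to keep the example fast.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "demo.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("demo",
                     data=np.arange(60, dtype="f8").reshape(10, 6),
                     compression="gzip")

with h5py.File(path, "r") as f:
    dset = f["demo"]           # h5py Dataset: nothing is read yet
    first_rows = dset[0:4]     # rows 0..3, all columns, as a numpy array
    two_cols = dset[:, 2:4]    # every row, but only columns 2 and 3
    one_value = dset[7, 5]     # a single element

print(type(first_rows), first_rows.shape, two_cols.shape, one_value)
```

Each indexing expression reads only the selected part of the dataset, so the memory needed scales with the selection rather than with the whole file.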
A clarification is in order. I overlooked that numpy.ndarray() was called inside h5py as part of evaluating data[:].
Here are type checks that show the difference between what the two calls return:
# check type for each variable:
data = h5['JZ3WPpxpypz']
print(type(data))
# versus
arr = data[()]
print(type(arr))
Output will look like this:
<class 'h5py._hl.dataset.Dataset'>
<class 'numpy.ndarray'>
In general, h5py dataset behavior is similar to numpy arrays (by design). However, they are not the same. When you tried to print the dataset contents with data[:], h5py tried to convert the dataset to a numpy array in the background with numpy.ndarray(). It would have worked if you had a smaller dataset or sufficient memory.
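To get a feel for how much memory "sufficient" means here, the uncompressed size can be estimated from the shape and dtype alone. This is a minimal sketch; the helper name estimated_nbytes is mine, and the numbers are the shape and "f8" type reported in the question.

```python
def estimated_nbytes(shape, itemsize):
    """Bytes needed to hold an array of this shape in memory, uncompressed."""
    count = 1
    for dim in shape:
        count *= dim
    return count * itemsize


# Shape and dtype reported in the question: (19494500, 376), "f8" (8 bytes).
# With an open h5py dataset these would come from data.shape and
# data.dtype.itemsize, neither of which reads the data itself.
needed = estimated_nbytes((19494500, 376), 8)
print("%.1f GiB" % (needed / 2.0**30))
```

That works out to about 54.6 GiB. The gzip compression only shrinks the file on disk; the numpy array is always held uncompressed.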
My takeaway: arr = h5['JZ3WPpxpypz'][()] goes through the same Dataset.__getitem__ code as data[:], so it also allocates a numpy array for the whole dataset and needs just as much memory.
When you have very large datasets, you may run into situations where you can't create an array with arr = h5f['dataset'][()] because the dataset is too large to fit into memory as a numpy array. When this occurs, you can create the h5py dataset object, then read subsets of the data with slicing notation, like this trivial example:
data = h5['JZ3WPpxpypz']
arr1 = data[0:100000]
arr2 = data[100000:200000]
# etc
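If the end goal is a reduction rather than the full array, the same slicing can be put in a loop so that only one block is in memory at a time. A minimal sketch, assuming a 2-D dataset; block_ranges and column_sums are hypothetical helper names, and the column sum is just a stand-in for whatever per-block processing you need.

```python
def block_ranges(n_rows, block_rows):
    """Yield (start, stop) row ranges that cover 0..n_rows in order."""
    for start in range(0, n_rows, block_rows):
        yield start, min(start + block_rows, n_rows)


def column_sums(path, name, block_rows=100000):
    """Sum each column of a 2-D dataset while holding one block in memory."""
    import h5py
    import numpy as np

    with h5py.File(path, "r") as f:
        dset = f[name]
        totals = np.zeros(dset.shape[1], dtype=np.float64)
        for start, stop in block_ranges(dset.shape[0], block_rows):
            block = dset[start:stop]    # only this slice becomes an ndarray
            totals += block.sum(axis=0)
    return totals
```

For your file this would be called as column_sums(filename, 'JZ3WPpxpypz'). With 376 float64 columns, a block of 100000 rows is roughly 0.3 GB, so block_rows can be tuned to the memory you have.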
Upvotes: 2