Reputation: 7807
I have a number of hdf5 files, each of which have a single dataset. The datasets are too large to hold in RAM. I would like to combine these files into a single file containing all datasets separately (i.e. not to concatenate the datasets into a single dataset).
One way to do this is to create an hdf5 file and then copy the datasets one by one. This would be slow and complicated because it would need to be a buffered copy.
Is there a simpler way to do this? It seems like there should be, since it is essentially just creating a container file.
I am using python/h5py.
Upvotes: 31
Views: 44194
Reputation: 21
To use Python (and not IPython) and h5copy to merge HDF5 files, we can build on GM's answer:
import h5py
import os

d_names = os.listdir(os.getcwd())
d_struct = {}  # Here we will store the database structure
for i in d_names:
    f = h5py.File(i, 'r+')
    d_struct[i] = list(f.keys())  # copy the keys before closing the file
    f.close()

for i in d_names:
    for j in d_struct[i]:
        os.system('h5copy -i %s -o output.h5 -s %s -d %s' % (i, j, j))
Upvotes: 2
Reputation: 5471
This is actually one of the use-cases of HDF5. If you just want to be able to access all the datasets from a single file, and don't care how they're actually stored on disk, you can use external links. From the HDF5 website:
External links allow a group to include objects in another HDF5 file and enable the library to access those objects as if they are in the current file. In this manner, a group may appear to directly contain datasets, named datatypes, and even groups that are actually in a different file. This feature is implemented via a suite of functions that create and manage the links, define and retrieve paths to external objects, and interpret link names:
myfile = h5py.File('foo.hdf5','a')
myfile['ext link'] = h5py.ExternalLink("otherfile.hdf5", "/path/to/resource")
Be careful: when opening myfile, you should open it with 'a' if it is an existing file. If you open it with 'w', it will erase its contents.
This would be very much faster than copying all the datasets into a new file. I don't know how fast access to otherfile.hdf5 would be, but operating on all the datasets would be transparent - that is, h5py would see all the datasets as residing in foo.hdf5.
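For the many-files case in the question, a minimal sketch might look like the following; the input file pattern and the internal dataset path "/dataset" are assumptions you would adjust to your own files:
import glob
import h5py

# Build a container file whose entries are external links to the
# original files; no data is copied.
with h5py.File('container.hdf5', 'a') as container:
    for fname in glob.glob('data_*.hdf5'):  # hypothetical input files
        # Link name derived from the file name; "/dataset" is assumed to be
        # the path of the single dataset inside each input file.
        container[fname.rsplit('.', 1)[0]] = h5py.ExternalLink(fname, "/dataset")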
Upvotes: 36
Reputation: 1832
One solution is to use the h5py interface to the low-level H5Ocopy function of the HDF5 API, in particular the h5py.h5o.copy function:
In [1]: import h5py as h5
In [2]: hf1 = h5.File("f1.h5")
In [3]: hf2 = h5.File("f2.h5")
In [4]: hf1.create_dataset("val", data=35)
Out[4]: <HDF5 dataset "val": shape (), type "<i8">
In [5]: hf1.create_group("g1")
Out[5]: <HDF5 group "/g1" (0 members)>
In [6]: hf1.get("g1").create_dataset("val2", data="Thing")
Out[6]: <HDF5 dataset "val2": shape (), type "|O8">
In [7]: hf1.flush()
In [8]: h5.h5o.copy(hf1.id, "g1", hf2.id, "newg1")
In [9]: h5.h5o.copy(hf1.id, "val", hf2.id, "newval")
In [10]: hf2.values()
Out[10]: [<HDF5 group "/newg1" (1 members)>, <HDF5 dataset "newval": shape (), type "<i8">]
In [11]: hf2.get("newval").value
Out[11]: 35
In [12]: hf2.get("newg1").values()
Out[12]: [<HDF5 dataset "val2": shape (), type "|O8">]
In [13]: hf2.get("newg1").get("val2").value
Out[13]: 'Thing'
The above was generated with h5py version 2.0.1-2+b1 and iPython version 0.13.1-2+deb7u1 atop Python version 2.7.3-4+deb7u1 from a more-or-less vanilla install of Debian Wheezy. The files f1.h5 and f2.h5 did not exist prior to executing the above. Note that, per salotz, for Python 3 the dataset/group names need to be bytes (e.g., b"val"), not str.
The hf1.flush() in command [7] is crucial, as the low-level interface apparently will always draw from the version of the .h5 file stored on disk, not that cached in memory. Copying datasets to/from groups not at the root of a File can be achieved by supplying the ID of that group using, e.g., hf1.get("g1").id.
Note that h5py.h5o.copy will fail with an exception (no clobber) if an object of the indicated name already exists in the destination location.
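As a rough sketch of how this could serve the question's use case (the file names, the group-per-file naming, and the root-level layout of the inputs are assumptions), one might write:
import h5py

# Hypothetical input file names; each is assumed to contain one or more
# root-level objects to be copied into combined.h5.
sources = ["D1.h5", "D2.h5"]

with h5py.File("combined.h5", "a") as dest:
    for fname in sources:
        grp = dest.create_group(fname[:-3])  # one group per input file
        with h5py.File(fname, "r") as src:
            for name in src:
                # The low-level copy expects bytes object names on Python 3.
                h5py.h5o.copy(src.id, name.encode(), grp.id, name.encode())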
Upvotes: 17
Reputation: 4609
To update on this: HDF5 version 1.10 comes with a new feature that might be useful in this context, called "Virtual Datasets".
Here you can find a brief tutorial and some explanations:
Virtual Datasets.
Here is more complete and detailed documentation for the feature:
Virtual Datasets extra doc.
And here is the merged pull request to include the virtual datasets API into h5py:
h5py Virtual Datasets PR, but I don't know whether it's already available in the current h5py version or will come later.
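For reference, a minimal sketch of what the h5py virtual dataset API looks like in recent h5py versions (2.9 and later); the file names, the dataset name "data", and the shapes are placeholders, and note that this stitches the sources into a single virtual dataset rather than keeping them as separate datasets:
import h5py

# Hypothetical: four source files, each holding a dataset "data" of shape (100,).
layout = h5py.VirtualLayout(shape=(4, 100), dtype="i4")
for i in range(4):
    layout[i] = h5py.VirtualSource("file%d.h5" % i, "data", shape=(100,))

with h5py.File("VDS.h5", "w") as f:
    # The virtual dataset reads its data from the source files on demand.
    f.create_virtual_dataset("combined", layout, fillvalue=-1)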
Upvotes: 2
Reputation: 22449
I usually use IPython and the h5copy tool together; this is much faster compared to a pure Python solution. Once h5copy is installed:
# PLEASE NOTE: THIS IS IPYTHON CONSOLE CODE, NOT PURE PYTHON
import h5py
# for every dataset Dn.h5 you want to merge into Output.h5
f = h5py.File('D1.h5', 'r+')  # file to be merged
h5_keys = list(f.keys())      # get the keys (you can remove the keys you don't use)
f.close()                     # close the file
for i in h5_keys:
    !h5copy -i 'D1.h5' -o 'Output.h5' -s {i} -d {i}
To completely automate the process, supposing you are working in the folder where the files to be merged are stored:
import os
import h5py

d_names = os.listdir(os.getcwd())
d_struct = {}  # Here we will store the database structure
for i in d_names:
    f = h5py.File(i, 'r+')
    d_struct[i] = list(f.keys())
    f.close()

# A) Copy every dataset into the root of output.h5 (no per-file groups)
for i in d_names:
    for j in d_struct[i]:
        !h5copy -i '{i}' -o 'output.h5' -s {j} -d {j}
If you want to keep the previous datasets separate inside output.h5, you have to create the group first using the -p flag:
# B) Create a new group in the output.h5 file for every input .h5 file
for i in d_names:
    dataset = d_struct[i][0]
    newgroup = '%s/%s' % (i[:-3], dataset)
    !h5copy -i '{i}' -o 'output.h5' -s {dataset} -d {newgroup} -p
    for j in d_struct[i][1:]:
        newgroup = '%s/%s' % (i[:-3], j)
        !h5copy -i '{i}' -o 'output.h5' -s {j} -d {newgroup}
Upvotes: 2