Reputation: 2046
I have selected specific hdf5 datasets and want to copy them to a new hdf5 file. I could find some tutorials on copying between two files, but what if you have just created a new file and you want to copy datasets to the file? I thought the way below would work, but it doesn't. Are there any simple ways to do this?
>>> dic_oldDataset['old_dataset']
<HDF5 dataset "old_dataset": shape (333217,), type "|V14">
>>> new_file = h5py.File('new_file.h5', 'a')
>>> new_file.create_group('new_group')
>>> new_file['new_group']['new_dataset'] = dic_oldDataset['old_dataset']
RuntimeError: Unable to create link (interfile hard links are not allowed)
Upvotes: 5
Views: 17256
Reputation: 1069
Use the copy method of the Group class from h5py.
This works on groups and datasets. It is recursive (it can do deep and shallow copies), and it has options for attributes, symbolic links and references.
with h5py.File('destFile.h5', 'w') as f_dest:
    with h5py.File('srcFile.h5', 'r') as f_src:
        f_src.copy(f_src["/path/to/DataSet"], f_dest["/another/path"], "DataSet")
(The file object is also the root group.)
"An HDF5 file is organized as a rooted, directed graph" (source).
HDF5 groups (including the root group) and data sets are related to each other as "locations" (in the C API most functions take a loc_id, which identifies a group or data set). These locations are the nodes of the graph; paths describe arcs through the graph to a node. copy takes a source and destination location, not specifically a group or dataset, so it can be applied to both. The source and destination do not need to be in the same file.
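Applied to the original question, that means you can copy straight from the open source file into a freshly created file. A minimal sketch (the file, group and dataset names below are illustrative, not from the question's actual data):

```python
import h5py
import numpy as np

# Build a source file to copy from (file/dataset names are illustrative).
with h5py.File('srcFile.h5', 'w') as f_src:
    f_src.create_dataset('old_dataset', data=np.arange(10))

# The destination can be a brand-new file: the File object is itself the
# root group, so it (or any group inside it) works as the destination.
with h5py.File('srcFile.h5', 'r') as f_src, \
        h5py.File('new_file.h5', 'w') as f_dest:
    f_dest.create_group('new_group')
    f_src.copy('old_dataset', f_dest['new_group'], name='new_dataset')

with h5py.File('new_file.h5', 'r') as f:
    print(f['new_group/new_dataset'][:])
```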
Attributes are stored in the header of the group or data set they are associated with, so the attributes also belong to that "location". It follows that copying a group or dataset will include all attributes associated with that "location". However, you can turn this off.
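A small sketch of that switch, using h5py's without_attrs keyword to copy (the file and dataset names are illustrative):

```python
import h5py
import numpy as np

with h5py.File('attrs_demo.h5', 'w') as f:
    ds = f.create_dataset('ds', data=np.zeros(3))
    ds.attrs['units'] = 'metres'

    # By default the attributes travel with the copied object...
    f.copy('ds', f, name='ds_with_attrs')
    # ...but without_attrs=True suppresses them.
    f.copy('ds', f, name='ds_no_attrs', without_attrs=True)

    print(dict(f['ds_with_attrs'].attrs))
    print(dict(f['ds_no_attrs'].attrs))
```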
copy offers settings for references, also called object pointers. Object pointers are a data type in HDF5, H5T_STD_REF_OBJ, analogous to an integer type such as H5T_STD_I32BE (source), and can be stored in attributes or data sets. References can point to whole objects or to regions within a data set. copy only seems to cover object references. Does it break with data set region references (H5T_STD_REF_DSETREG)?
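For the whole-object case, a minimal sketch of how object references look in h5py (names are illustrative): a dataset's .ref attribute yields an object reference that can be stored in an attribute and later dereferenced through the file object.

```python
import h5py
import numpy as np

with h5py.File('refs_demo.h5', 'w') as f:
    target = f.create_dataset('target', data=np.arange(5))
    # Store an object reference (H5T_STD_REF_OBJ-typed) in an attribute.
    f.attrs['pointer'] = target.ref

with h5py.File('refs_demo.h5', 'r') as f:
    ref = f.attrs['pointer']
    # Dereference by indexing the file with the reference.
    print(f[ref][:])
```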
The "locations" taken by the C API are one level of abstraction, which explains why the copy function works on individual datasets. Look at the figure again: it is the edges that are labelled, not the nodes. Under the hood, HDF5 objects are the targets of links; each link (edge) has a name, while the objects (nodes) do not have names. There are two types of links: hard links and symbolic links. Every HDF5 object must have at least one hard link, and hard links can only target objects within their own file. When a hard link is created the object's reference count increases by one; symbolic links do not affect the reference count. Symbolic links may point to objects within the same file (soft) or to objects in other files (external). copy offers options to expand soft and external symbolic links.
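A sketch of the soft-link case (names are illustrative): a SoftLink stores a path, not the object, and expand_soft=True tells copy to replace such links with copies of their targets.

```python
import h5py
import numpy as np

with h5py.File('links_demo.h5', 'w') as f:
    grp = f.create_group('grp')
    grp.create_dataset('real', data=np.ones(4))
    # 'alias' is a soft link: it stores the path '/grp/real', not the data.
    grp['alias'] = h5py.SoftLink('/grp/real')

with h5py.File('links_demo.h5', 'r') as f_src, \
        h5py.File('links_copy.h5', 'w') as f_dest:
    # Without expand_soft, the copied 'alias' would remain a soft link to
    # /grp/real, which does not exist in the destination file; with
    # expand_soft=True the link is replaced by a copy of its target.
    f_src.copy('grp', f_dest, name='expanded', expand_soft=True)

with h5py.File('links_copy.h5', 'r') as f:
    print(f['expanded/alias'][:])
```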
This explains the error message (below) and suggests an alternative to copying your dataset: since hard links cannot cross files, an external symbolic link could give access to a data set in another file.
RuntimeError: Unable to create link (interfile hard links are not allowed)
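A sketch of that alternative (file and dataset names are illustrative): an ExternalLink makes a data set in another file appear at a path in this file, without duplicating any data.

```python
import h5py
import numpy as np

# A file holding the actual data.
with h5py.File('data_file.h5', 'w') as f:
    f.create_dataset('old_dataset', data=np.arange(3))

# A second file that merely links to it.
with h5py.File('view_file.h5', 'w') as f:
    f['linked_dataset'] = h5py.ExternalLink('data_file.h5', '/old_dataset')

with h5py.File('view_file.h5', 'r') as f:
    print(f['linked_dataset'][:])  # reads through the link
```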
Upvotes: 14
Reputation: 8006
Answer 1 (using h5py):
This creates a simple structured array to populate the first dataset in the first file. The data is then read from that dataset and copied to the second file via my_array.
import h5py, numpy as np

arr = np.array([(1, 'a'), (2, 'b')],
               dtype=[('foo', int), ('bar', 'S1')])
print(arr.dtype)

h5file1 = h5py.File('test1.h5', 'w')
h5file1.create_dataset('/ex_group1/ex_ds1', data=arr)
print(h5file1)

my_array = h5file1['/ex_group1/ex_ds1']
h5file2 = h5py.File('test2.h5', 'w')
h5file2.create_dataset('/exgroup2/ex_ds2', data=my_array)
print(h5file2)

h5file1.close()
h5file2.close()
Upvotes: 4
Reputation: 8006
Answer 2 (using pytables):
This follows the same process as above with PyTables functions. It creates the same simple structured array to populate the first dataset in the first file. The data is then read from that dataset and copied to the second file via my_array.
import tables, numpy as np

arr = np.array([(1, 'a'), (2, 'b')],
               dtype=[('foo', int), ('bar', 'S1')])
print(arr.dtype)

h5file1 = tables.open_file('test1.h5', mode='w', title='Test file')
my_group = h5file1.create_group('/', 'ex_group1', 'Example Group')
my_table = h5file1.create_table(my_group, 'ex_ds1', None, 'Example dataset', obj=arr)
print(h5file1)

my_array = my_table.read()
h5file2 = tables.open_file('test2.h5', mode='w', title='Test file')
h5file2.create_table('/exgroup2', 'ex_ds2', createparents=True, obj=my_array)
print(h5file2)

h5file1.close()
h5file2.close()
Upvotes: 1