maynull

Reputation: 2046

How to copy a dataset object to a different hdf5 file using pytables or h5py?

I have selected specific hdf5 datasets and want to copy them to a new hdf5 file. I could find some tutorials on copying between two files, but what if you have just created a new file and you want to copy datasets to the file? I thought the way below would work, but it doesn't. Are there any simple ways to do this?

>>> dic_oldDataset['old_dataset']
<HDF5 dataset "old_dataset": shape (333217,), type "|V14">

>>> new_file = h5py.File('new_file.h5', 'a')
>>> new_file.create_group('new_group')

>>> new_file['new_group']['new_dataset'] = dic_oldDataset['old_dataset']


RuntimeError: Unable to create link (interfile hard links are not allowed)

Upvotes: 5

Views: 17256

Answers (3)

Harry de winton

Reputation: 1069

Answer 3

Use the copy method of the group class from h5py.

TL;DR

This works on groups and datasets, is recursive (it can do deep and shallow copies), and has options controlling attributes, symbolic links and references.

with h5py.File('destFile.h5', 'w') as f_dest:
    with h5py.File('srcFile.h5', 'r') as f_src:
        f_src.copy(f_src["/path/to/DataSet"], f_dest["/another/path"], "DataSet")

(The file object is also the root group.)

Locations in HDF5

"An HDF5 file is organized as a rooted, directed graph" (source). HDF5 groups (including the root group) and data sets are related to each other as "locations" (in the C API most functions take a loc_id which identifes a group or data set). These locations are the nodes on the graph, paths describe arcs through the graph to a node. copy takes a source and destination location, not specifically a group or dataset, so it can be applied to both. The source and destination do not need to be in the same file.

(Figure: HDF5 file structure example)

Attributes

Attributes are stored within the header of the group or dataset they are associated with, so the attributes are also associated with that "location". It follows that copying a group or dataset will include all attributes associated with that "location". However, you can turn this off.
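As a sketch (names invented for illustration): h5py's copy exposes this switch as the without_attrs keyword, which drops attributes from the copied object.

```python
import h5py
import numpy as np

with h5py.File('attrs_demo.h5', 'w') as f:
    ds = f.create_dataset('ds', data=np.ones(3))
    ds.attrs['scale'] = 2.0
    f.create_group('copies')
    # Default: attributes are copied along with the dataset.
    f.copy('ds', f['copies'], name='with_attrs')
    # without_attrs=True copies the data but not the attributes.
    f.copy('ds', f['copies'], name='no_attrs', without_attrs=True)
```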

References

copy offers settings for references, also called object pointers. Object pointers are a data type in HDF5, H5T_STD_REF_OBJ, much like an integer type such as H5T_STD_I32BE (source), and can be stored in attributes or datasets. References can point to whole objects or to regions within a dataset. copy only seems to cover object references; it is unclear whether it breaks on dataset-region references (H5T_STD_REF_DSETREG).

(Figure: object pointer)

Symbolic links

The "locations" taken by the C API are one level of abstraction, which explains why the copy function works on individual datasets. Look at the figure again: it is the edges that are labelled, not the nodes. Under the hood, HDF5 objects are the targets of links; each link (edge) has a name, while the objects (nodes) do not have names. There are two types of links: hard links and symbolic links. All HDF5 objects must have at least one hard link, and hard links can only target objects within their own file. When a hard link is created, the target's reference count increases by one; symbolic links do not affect the reference count. Symbolic links may point to objects within the file (soft) or to objects in other files (external). copy offers options to expand soft and external symbolic links.

This explains the error message (below) and suggests an alternative to copying your dataset: a soft or external link could provide access to a dataset in another file.
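A sketch of that alternative (file and dataset names invented for illustration): instead of copying the data, store an external link in the new file that resolves to the dataset in the old file whenever it is accessed.

```python
import h5py
import numpy as np

# A source file containing the dataset we want to expose elsewhere.
with h5py.File('data.h5', 'w') as f:
    f.create_dataset('big_ds', data=np.arange(10))

# The new file holds only a link, not a copy of the data.
with h5py.File('view.h5', 'w') as f:
    f['ds_link'] = h5py.ExternalLink('data.h5', '/big_ds')

# Reading through the link transparently opens the other file.
with h5py.File('view.h5', 'r') as f:
    print(f['ds_link'][()].sum())  # → 45
```

The trade-off is that the link is resolved at access time, so data.h5 must remain present (here, reachable by its relative path) for the link to work.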

RuntimeError: Unable to create link (interfile hard links are not allowed)

Upvotes: 14

kcw78

Reputation: 8006

Answer 1 (using h5py):
This creates a simple structured array to populate the first dataset in the first file. The data is then read from that dataset and copied to the second file using my_array.

import h5py
import numpy as np

# Create a simple structured array for the first dataset.
arr = np.array([(1, 'a'), (2, 'b')],
               dtype=[('foo', int), ('bar', 'S1')])
print(arr.dtype)

h5file1 = h5py.File('test1.h5', 'w')
h5file1.create_dataset('/ex_group1/ex_ds1', data=arr)
print(h5file1)

my_array = h5file1['/ex_group1/ex_ds1']

# Read from the first file's dataset and write to the second file.
h5file2 = h5py.File('test2.h5', 'w')
h5file2.create_dataset('/exgroup2/ex_ds2', data=my_array)
print(h5file2)

h5file1.close()
h5file2.close()

Upvotes: 4

kcw78

Reputation: 8006

Answer 2 (using pytables):
This follows the same process as above with pytables functions. It creates the same simple structured array to populate the first dataset in the first file. The data is then read from that dataset and copied to the second file using my_array.

import tables
import numpy as np

# Create the same simple structured array for the first table.
arr = np.array([(1, 'a'), (2, 'b')],
               dtype=[('foo', int), ('bar', 'S1')])
print(arr.dtype)

h5file1 = tables.open_file('test1.h5', mode='w', title='Test file')
my_group = h5file1.create_group('/', 'ex_group1', 'Example Group')
my_table = h5file1.create_table(my_group, 'ex_ds1', None, 'Example dataset', obj=arr)
print(h5file1)

my_array = my_table.read()

# Write the data read from the first file into the second file.
h5file2 = tables.open_file('test2.h5', mode='w', title='Test file')
h5file2.create_table('/exgroup2', 'ex_ds2', createparents=True, obj=my_array)
print(h5file2)

h5file1.close()
h5file2.close()

Upvotes: 1
