Reputation: 21
I am a novice at coding. Can someone help with a script in Python using h5py that reads all the directories and sub-directories and merges multiple h5 files into a single h5 file?
Upvotes: 1
Views: 4983
Reputation: 5965
What you need is a list of all datasets in the file. The notion of a recursive function is what is needed here: extract all datasets from a group, and whenever one of the members turns out to be a group itself, recursively do the same thing until all datasets are found. For example:
/
|- dataset1
|- group1
|  |- dataset2
|  |- dataset3
|- dataset4
In pseudo-code, your function would look like:

def getdatasets(key, file):
    out = []
    for name in file[key]:
        path = join(key, name)
        if file[path] is dataset:
            out += [path]
        else:
            out += getdatasets(path, file)
    return out
For our example:

/dataset1 is a dataset: add its path to the output, giving

    out = ['/dataset1']

/group1 is not a dataset: call getdatasets('/group1', file)

    /group1/dataset2 is a dataset: add its path, giving

        nested_out = ['/group1/dataset2']

    /group1/dataset3 is a dataset: add its path, giving

        nested_out = ['/group1/dataset2', '/group1/dataset3']

    This nested result is added to what we already had:

    out = ['/dataset1', '/group1/dataset2', '/group1/dataset3']

/dataset4 is a dataset: add its path to the output, giving

    out = ['/dataset1', '/group1/dataset2', '/group1/dataset3', '/dataset4']
This list can then be used to copy all data to another file. To make a simple clone, you could do the following:
import h5py
import numpy as np

# function to return a list of paths to each dataset
def getdatasets(key, archive):

    if key[-1] != '/': key += '/'

    out = []

    for name in archive[key]:

        path = key + name

        if isinstance(archive[path], h5py.Dataset):
            out += [path]
        else:
            out += getdatasets(path, archive)

    return out

# open HDF5 files
data     = h5py.File('old.hdf5', 'r')
new_data = h5py.File('new.hdf5', 'w')

# read the paths of all datasets in the old HDF5 file
datasets = getdatasets('/', data)

# get the group names from the list of datasets
groups = list(set([i[::-1].split('/', 1)[1][::-1] for i in datasets]))
groups = [i for i in groups if len(i) > 0]

# sort groups based on depth, so that parents are created before children
idx    = np.argsort(np.array([len(i.split('/')) for i in groups]))
groups = [groups[i] for i in idx]

# create all groups that contain a dataset that will be copied
for group in groups:
    new_data.create_group(group)

# copy datasets
for path in datasets:

    # - get group name
    group = path[::-1].split('/', 1)[1][::-1]

    # - minimum group name
    if len(group) == 0: group = '/'

    # - copy data
    data.copy(path, new_data[group])
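As an aside, h5py also ships a built-in recursive traversal, Group.visititems, which can collect the same list of dataset paths without a hand-written recursion. A minimal sketch (the function name getdatasets_visit is just an illustration):

import h5py

# collect absolute paths of all datasets using h5py's own traversal
def getdatasets_visit(archive):
    out = []
    def collect(name, obj):
        # 'name' is the path relative to the starting group, without a leading '/'
        if isinstance(obj, h5py.Dataset):
            out.append('/' + name)
    archive.visititems(collect)
    return out

visititems calls the callback once for every group and dataset below the starting group, so the result is equivalent to getdatasets('/', archive) above.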
Further customizations are, of course, possible depending on what you want. Since you describe combining several files, in that case you would have to open the output file in append mode:
new_data = h5py.File('new.hdf5','a')
and probably add something to the path, for example a group per input file, so that identically named datasets from different files do not collide.
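For example, a minimal merge sketch reusing getdatasets from above. The location 'data/' and the per-file group names 'file0', 'file1', ... are assumptions for illustration; adapt them to your directory layout:

import glob
import h5py

# hypothetical: find all .h5 files in 'data/' and its sub-directories
sources = sorted(glob.glob('data/**/*.h5', recursive=True))

with h5py.File('new.hdf5', 'a') as new_data:
    for i, fname in enumerate(sources):
        with h5py.File(fname, 'r') as data:
            for path in getdatasets('/', data):
                # prefix each path with a per-file group ('file0', 'file1', ...)
                # so that datasets with the same name cannot collide
                group = 'file%d' % i + path.rsplit('/', 1)[0]
                new_data.require_group(group)
                data.copy(path, new_data[group])

require_group creates the group (including intermediate groups) only if it does not exist yet, which removes the need for the depth-sorted group creation in the clone example.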
Upvotes: 1