Reputation:
I decided to store my data in HDF5 using its hierarchical structure instead of relying on the filesystem. Unfortunately, I'm having performance issues.
My data is organized as follows: I have about 70 top-level groups, one per date, and each of them contains roughly 8000 datasets. I would like to see a list of the number of datasets per day:
for date in hdf5.keys():
    print(len(hdf5[date]))
I'm finding it frustrating that this takes 2+ seconds per iteration.
Also, I have two different HDF5 files with the above layout, and the larger one is much slower at this.
What am I doing wrong?
Upvotes: 2
Views: 4387
Reputation: 571
Try creating the file with the libver='latest' flag:
f = h5py.File('name.hdf5', 'w', libver='latest')
This is much faster if you have a lot of datasets per group or a lot of attributes per dataset, because the newer file format can store group links and attributes in a more efficient (dense) layout.
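If your existing files were written with the default format, one option is to copy each date group into a new file created with libver='latest', so groups holding thousands of datasets use the newer storage. This is a minimal sketch only: the file names are placeholders, and repacking by copying is my assumption, not something stated above.

import h5py

src_path = 'old_layout.hdf5'   # placeholder path for the existing file
dst_path = 'new_layout.hdf5'   # placeholder path for the repacked file

# Copy every top-level date group into a file created with the latest format.
with h5py.File(src_path, 'r') as src, h5py.File(dst_path, 'w', libver='latest') as dst:
    for date in src.keys():
        src.copy(date, dst)

# Counting datasets per day on the repacked file:
with h5py.File(dst_path, 'r') as f:
    for date in f.keys():
        print(date, len(f[date]))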
Upvotes: 1