Reputation: 115
I'm trying to create an HDF5 file where each dataset is a 90x18 numpy array. I'm looking to create 2704332 datasets in total, for an approximate final file size of 40 GB.
with h5py.File('allDaysData.h5', 'w') as hf:
    for x in list:
        start = datetime.datetime.now()
        hf.create_dataset(x, data=currentData)
        end = datetime.datetime.now()
        print(end - start)
When running this, the create_dataset call takes no longer than 0.0004 seconds at the beginning. Once the file reaches around 6 GB, it abruptly jumps to taking 0.08 seconds per dataset.
Is there some sort of limit on the number of datasets in HDF5 files?
Upvotes: 1
Views: 1642
Reputation: 20224
In this answer, you can see that the performance of create_dataset decreases as the number of iterations grows. Since h5py stores the data in a special structure, I think the slowdown is because h5py needs more and more time to index the datasets.
There are two solutions. One is to pass the keyword argument libver='latest' when creating the file. It improves performance significantly, even though the generated file will be incompatible with older HDF5 versions. The other is to aggregate your arrays into larger datasets, for example by combining every 1024 arrays into one, as sketched below.
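A minimal sketch of both approaches, assuming the 90x18 arrays are already in a Python list (the file names, the dataset naming scheme, and the group size of 1024 are illustrative, not taken from the question):

import numpy as np
import h5py

arrays = [np.random.rand(90, 18) for _ in range(4096)]  # stand-in for the real data

# Option 1: request the latest file format, which uses a faster link/index layout.
with h5py.File('allDaysData_latest.h5', 'w', libver='latest') as hf:
    for i, arr in enumerate(arrays):
        hf.create_dataset(str(i), data=arr)

# Option 2: pack every 1024 arrays into one (1024, 90, 18) dataset, so the file
# holds far fewer objects and the internal index stays small.
chunk = 1024
with h5py.File('allDaysData_aggregated.h5', 'w') as hf:
    for start in range(0, len(arrays), chunk):
        block = np.stack(arrays[start:start + chunk])
        hf.create_dataset('block_{}'.format(start // chunk), data=block)

With the second approach you trade a little bookkeeping (mapping an array index to its block and row) for far fewer objects in the file.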
Upvotes: 4