Ryan Halabi

Reputation: 115

h5py create_dataset loop slow

I'm trying to create an HDF5 file where each dataset is a 90x18 numpy array. I'm looking to create 2,704,332 total datasets in the file, with an approximate final size of 40 GB.

import datetime
import h5py

with h5py.File('allDaysData.h5', 'w') as hf:
    for x in list:  # 'list' holds the dataset names; 'currentData' is the 90x18 array
        start = datetime.datetime.now()
        hf.create_dataset(x, data=currentData)
        end = datetime.datetime.now()
        print(end - start)

When running this, the create_dataset call takes no longer than 0.0004 seconds at the beginning. Once the file reaches around 6 GB, it abruptly switches to taking 0.08 seconds per dataset.

Is there some sort of limit on the number of datasets an HDF5 file can hold?

Upvotes: 1

Views: 1642

Answers (1)

Sraw

Reputation: 20224

There is a related answer.

In that answer, you can see that the performance of create_dataset decreases as the number of iterations grows. Since h5py stores the data in a special structure, I think the slowdown is because h5py needs more time to index the growing number of datasets.

There are two solutions. One is to pass the keyword argument libver='latest'. It improves performance significantly, even though files generated this way are incompatible with older HDF5 versions. The second is to aggregate your arrays into larger datasets, for example combining every 1024 arrays into one. A sketch of both approaches is below.
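A minimal sketch of both suggestions, assuming placeholder names: the sample_keys list and the make_array() helper stand in for your own dataset names and 90x18 arrays, and the file names are hypothetical.

import numpy as np
import h5py

sample_keys = [f"day_{i}" for i in range(4096)]  # hypothetical dataset names

def make_array(key):
    # Hypothetical stand-in for your real 90x18 array lookup.
    return np.random.rand(90, 18)

# Option 1: newer file format. Metadata handling is faster, but the file
# may not open with older HDF5 readers.
with h5py.File('allDaysData_latest.h5', 'w', libver='latest') as hf:
    for key in sample_keys:
        hf.create_dataset(key, data=make_array(key))

# Option 2: aggregate many small arrays into one larger dataset, e.g. 1024
# 90x18 arrays per dataset, so the file holds far fewer objects to index.
chunk = 1024
with h5py.File('allDaysData_aggregated.h5', 'w') as hf:
    for start in range(0, len(sample_keys), chunk):
        block = np.stack([make_array(k) for k in sample_keys[start:start + chunk]])
        hf.create_dataset(f'block_{start // chunk}', data=block)  # shape (<=1024, 90, 18)

With the aggregated layout you would index into a block (e.g. hf['block_0'][i]) instead of looking up one dataset per array.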

Upvotes: 4
