Reputation: 4183
I've created an HDF5 file with 1000 groups, each containing 1300 uint8
arrays of varying lengths (each individual array has a fixed size). Keys are strings of ~10 characters. I'm not doing anything tricky while saving (no chunking, compression, etc.; the data is already compressed).
Iterating over all keys is extremely slow the first time I run a script, but it speeds up significantly the second time (same script, different process launched later), so I suspect there is some caching involved. After a while performance drops back to the terrible level until I've waited it out again.
Is there a way to store the data to alleviate this problem? Or can I read it differently somehow?
Simplified code to save:
import h5py
import numpy as np

with h5py.File('my_dataset.hdf5', 'w') as fp:
    for k0 in keys0:                     # ~1000 top-level groups
        group = fp.create_group(k0)
        for k1, v1 in get_items(k0):     # ~1300 arrays per group
            group.create_dataset(k1, data=np.array(v1, dtype=np.uint8))
Simplified key accessing code:
import h5py

n = 0
with h5py.File('my_dataset.hdf5', 'r') as fp:
    keys0 = fp.keys()
    for k0 in keys0:
        group = fp[k0]
        n += len(tuple(group.keys()))    # count the datasets in each group
If I track the progress of this script during a 'slow phase', it takes almost a second for each iteration. However, if I kill it after, say, 100 steps, then the next time I run the script the first 100 steps take < 1sec to run total, then performance drops back to a crawl.
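For reference, a per-group timing loop along these lines (a sketch with the same access pattern, not the exact progress tracking I use) is enough to see the effect:

import time
import h5py

with h5py.File('my_dataset.hdf5', 'r') as fp:
    for k0 in fp.keys():
        t0 = time.perf_counter()
        n_sub = len(fp[k0].keys())       # forces the group's link metadata to be read
        dt = time.perf_counter() - t0
        print(f'{k0}: {n_sub} datasets, {dt:.3f} s')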
Upvotes: 2
Views: 611
Reputation: 4183
While I'm still unsure why the original layout is so slow, I've found a workaround: merge each sub-group's arrays into a single variable-length dataset, with the sub-keys stored alongside in a second dataset:
import h5py
import numpy as np

with h5py.File('my_dataset.hdf5', 'w') as fp:
    for k0 in keys0:
        subkeys = get_subkeys(k0)
        nk = len(subkeys)
        group = fp.create_group(k0)
        # one variable-length uint8 dataset holds all arrays for this group
        data = group.create_dataset(
            'data', shape=(nk,),
            dtype=h5py.special_dtype(vlen=np.dtype(np.uint8)))
        # fixed-length byte strings for the sub-keys (~10 chars, so S32 is plenty)
        keys = group.create_dataset('keys', shape=(nk,), dtype='S32')
        for i, (k1, v1) in enumerate(get_items(k0)):
            keys[i] = k1
            data[i] = v1
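Reading it back then only touches two datasets per group instead of ~1300 separate objects. A minimal sketch of the lookup side (same names and layout as above):

import h5py

with h5py.File('my_dataset.hdf5', 'r') as fp:
    for k0 in fp.keys():
        group = fp[k0]
        subkeys = [k.decode() for k in group['keys'][:]]   # fixed-length bytes -> str
        arrays = group['data'][:]                          # one vlen uint8 array per entry
        # e.g. fetch the array stored under the first sub-key
        first = arrays[subkeys.index(subkeys[0])]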
Upvotes: 1