bzm3r

Reputation: 4615

How to efficiently set up HDF5 files that will contain an unknown amount of data?

I have a simulation that can run for an arbitrarily long time. To store its output, I naively create a resizable dataset in an HDF5 file and keep appending data to it as it is produced, as demonstrated in this toy example:

import contextlib
import os
import time
import numpy as np
import h5py

num_timepoints = 18000
num_vertices = 16
num_info = 38
output_size = 10

t0 = "A:\\t0.hdf5"

with contextlib.suppress(FileNotFoundError):
    os.remove(t0)

st = time.time()

with h5py.File(t0, "a") as f:
    dset = f.create_dataset("test", (0, num_vertices, num_info), maxshape=(None, num_vertices, num_info))

for n in range(num_timepoints // output_size):
    chunk = np.random.rand(output_size, num_vertices, num_info)
    with h5py.File(t0, "a") as f:
        dset = f["test"]

        orig_index = dset.shape[0]

        dset.resize(dset.shape[0] + chunk.shape[0], axis=0)
        dset[orig_index:, :, :] = chunk

et = time.time()

print("test0: time taken: {} s, size: {} kB".format(np.round(et - st, 2), int(os.path.getsize(t0))/1000))

Note that the test data is similar in size to the data I get from the simulation on average (in the worst case, I might have 2 to 3 times as many time points as in this test).

The output of this test is:

test0: time taken: 2.02 s, size: 46332.856 kB

Compare this output with a test that provides the data size up front:

t1 = "A:\\t1.hdf5"

with contextlib.suppress(FileNotFoundError):
    os.remove(t1)

st = time.time()

data = np.random.rand(num_timepoints, num_vertices, num_info)
with h5py.File(t1, "a") as f:
    dset = f.create_dataset("test", data.shape)
    dset = data

et = time.time()

print("test1: time taken: {} s, size: {} kB".format(np.round(et - st, 2), int(os.path.getsize(t1))/1000))

Which has as output:

test1: time taken: 0.09 s, size: 1.4 kB

If I choose output_size (which reflects how large a chunk of data I get from the simulation at once) to be 1, then test0 takes around 40 seconds and creates an approximately 700 MB file!

Clearly, test0 is using a very naive and inefficient method. How may I improve upon it? My full test code is:

import contextlib
import os
import time
import numpy as np
import h5py

# =================================================

num_timepoints = 18000
num_vertices = 16
num_info = 38
output_size = 10

t0 = "A:\\t0.hdf5"

with contextlib.suppress(FileNotFoundError):
    os.remove(t0)

st = time.time()

with h5py.File(t0, "a") as f:
    dset = f.create_dataset("test", (0, num_vertices, num_info), maxshape=(None, num_vertices, num_info))

for n in range(num_timepoints // output_size):
    chunk = np.random.rand(output_size, num_vertices, num_info)
    with h5py.File(t0, "a") as f:
        dset = f["test"]

        orig_index = dset.shape[0]

        dset.resize(dset.shape[0] + chunk.shape[0], axis=0)
        dset[orig_index:, :, :] = chunk

et = time.time()

print("test0: time taken: {} s, size: {} kB".format(np.round(et - st, 2), int(os.path.getsize(t0))/1000))

# =================================================

t1 = "A:\\t1.hdf5"

with contextlib.suppress(FileNotFoundError):
    os.remove(t1)

st = time.time()

data = np.random.rand(num_timepoints, num_vertices, num_info)
with h5py.File(t1, "a") as f:
    dset = f.create_dataset("test", data.shape)
    dset = data

et = time.time()

print("test1: time taken: {} s, size: {} kB".format(np.round(et - st, 2), int(os.path.getsize(t1))/1000))

# =================================================

print("Done.")

Upvotes: 3

Views: 691

Answers (1)

Thomas K

Reputation: 40400

Here are some things I found that can easily improve the performance. First, don't close and reopen the file to write each chunk:

with h5py.File(t0, "a") as f:
    dset = f["test"]
    for n in range(num_timepoints // output_size):
        chunk = np.random.rand(output_size, num_vertices, num_info)

        orig_index = dset.shape[0]
        dset.resize(dset.shape[0] + chunk.shape[0], axis=0)
        dset[orig_index:, :, :] = chunk

This takes it from ~2 seconds to ~0.9 seconds.

Second, h5py guesses a rather strange chunk shape for your dataset, (128, 4, 10) when I tried it. You can manually specify the shape of the chunks you'll be adding:

with h5py.File(t0, "a") as f:
    dset = f.create_dataset("test", (0, num_vertices, num_info),
                            maxshape=(None, num_vertices, num_info),
                            chunks=(output_size, num_vertices, num_info),
                           )

In this example, I don't get much of a speedup (maybe 0.9 seconds down to 0.8), but it's worth looking at; it may make a bigger difference depending on your data shape and your storage.
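
Incidentally, to check what chunk layout a dataset actually ended up with, you can read its chunks attribute; a quick check on the file from the example above:

with h5py.File(t0, "r") as f:
    dset = f["test"]
    print("chunk shape:", dset.chunks)  # a tuple such as (128, 4, 10), or None if contiguous
    print("dataset shape:", dset.shape)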

Finally, if I write a bigger chunk at once (output_size = 100), I see the same performance as (or better than) the all-at-once example, around 0.5 seconds (once the all-at-once example is fixed to actually write the data; see my comment).
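
The problem in the all-at-once example is that dset = data only rebinds the Python name and never writes anything into the file, which is why t1 stays tiny. One way to actually store the array:

data = np.random.rand(num_timepoints, num_vertices, num_info)
with h5py.File(t1, "a") as f:
    dset = f.create_dataset("test", data.shape)
    dset[...] = data  # write into the dataset rather than rebinding the name
    # or simply: f.create_dataset("test", data=data)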

Of course, you don't want to change what your simulation is doing just to make the writing faster. But if this speedup is important, you could write some code to batch up the data from the simulation and periodically write a bigger chunk to HDF5. The drawback is that you may lose some data if your simulation crashes.
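
A minimal sketch of that batching idea, assuming the resizable dataset from above already exists (the flush_every value, the flush helper, and the random chunk standing in for one simulation output are all just placeholders):

buffer = []
flush_every = 100  # number of simulation outputs to accumulate before writing

def flush(dset, buffer):
    # append all buffered outputs to the dataset in one resize + one write
    block = np.concatenate(buffer)
    start = dset.shape[0]
    dset.resize(start + block.shape[0], axis=0)
    dset[start:, :, :] = block

with h5py.File(t0, "a") as f:
    dset = f["test"]
    for n in range(num_timepoints // output_size):
        buffer.append(np.random.rand(output_size, num_vertices, num_info))
        if len(buffer) == flush_every:
            flush(dset, buffer)
            buffer = []
    if buffer:  # write whatever is left when the simulation ends
        flush(dset, buffer)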

You could also look at resizing in bigger chunks less often (e.g. resize to add 100, then do 10 writes of 10 rows each before resizing again). EDIT: I tried this, and it doesn't actually seem to improve the timings.
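
For reference, a rough sketch of that idea, with grow_by chosen arbitrarily (it should be at least as large as one chunk), trimming off any unused rows at the end:

grow_by = 100  # rows added per resize; should be >= output_size

with h5py.File(t0, "a") as f:
    dset = f["test"]
    rows_written = dset.shape[0]
    for n in range(num_timepoints // output_size):
        chunk = np.random.rand(output_size, num_vertices, num_info)
        if rows_written + chunk.shape[0] > dset.shape[0]:
            dset.resize(dset.shape[0] + grow_by, axis=0)  # grow in bigger steps
        dset[rows_written:rows_written + chunk.shape[0], :, :] = chunk
        rows_written += chunk.shape[0]
    dset.resize(rows_written, axis=0)  # drop any over-allocated rows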

Upvotes: 3
