benjimin

Reputation: 4890

Can I write to an HDF5 file from multiple processes/threads?

Does HDF5 support parallel writes to the same file, from different threads or from different processes? Alternatively, does HDF5 support non-blocking writes?

If so, is this also supported by NetCDF4, and by the Python bindings for either?

I am writing an application where I want different CPU cores to concurrently compute output intended for non-overlapping tiles of a very large output array. (Later I will want to read sections from it as a single array, without needing my own driver to manage indexing many separate files, and ideally without the additional IO task of rearranging it on disk.)

Upvotes: 6

Views: 7816

Answers (2)

benjimin

Reputation: 4890

Not trivially, but there are various potential workarounds.

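The ordinary HDF5 library is not thread-safe by default, and even its thread-safe build serialises library calls behind a global lock, so it apparently does not even support concurrent reading of different files by multiple threads. Consequently NetCDF4 (which is built on HDF5), and the Python bindings for either, will not support parallel writing.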

If the output file is pre-initialised with chunking and compression disabled (so that there is no chunk index to update), then concurrent non-overlapping writes to the same file by separate processes might work in principle, but this is unverified.

More recent versions of HDF5 (1.10 onwards) support virtual datasets. Each process would write its output to a different file, and afterward a new container file would be created, consisting of references to the individual data files (but otherwise able to be read like a normal HDF5 file).
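The virtual dataset approach could look roughly like the sketch below. It assumes h5py 2.9 or later built against HDF5 1.10 or later, plus NumPy. The tile size, grid size, file names and the `write_tile`, `build_container` and `demo` helpers are all made up for illustration, and the tiles are written in a plain loop here where a real application would call `write_tile` from separate worker processes.

```python
import os
import tempfile

import h5py
import numpy as np

TILE = (4, 6)  # shape of each tile (illustrative)
GRID = (2, 3)  # number of tiles along each axis (illustrative)


def write_tile(directory, row, col):
    """Write one tile to its own file; each worker process would call this."""
    path = os.path.join(directory, f"tile_{row}_{col}.h5")
    # Placeholder result: fill the tile with its own sequence number.
    data = np.full(TILE, row * GRID[1] + col, dtype="f4")
    with h5py.File(path, "w") as f:
        f.create_dataset("data", data=data)
    return path


def build_container(directory, tile_paths):
    """Create a container file whose 'data' dataset maps onto the tile files."""
    full_shape = (GRID[0] * TILE[0], GRID[1] * TILE[1])
    layout = h5py.VirtualLayout(shape=full_shape, dtype="f4")
    for (row, col), path in tile_paths.items():
        source = h5py.VirtualSource(path, "data", shape=TILE)
        r0, c0 = row * TILE[0], col * TILE[1]
        layout[r0:r0 + TILE[0], c0:c0 + TILE[1]] = source
    container = os.path.join(directory, "container.h5")
    with h5py.File(container, "w", libver="latest") as f:
        # Regions with no source file would read back as the fill value.
        f.create_virtual_dataset("data", layout, fillvalue=np.nan)
    return container


def demo():
    """Write all tiles, build the container, and read it back as one array."""
    with tempfile.TemporaryDirectory() as directory:
        paths = {(r, c): write_tile(directory, r, c)
                 for r in range(GRID[0]) for c in range(GRID[1])}
        container = build_container(directory, paths)
        with h5py.File(container, "r") as f:
            return f["data"][...]
```

Reading `f["data"]` from the container then behaves like reading one large array, while the data itself stays in the per-tile files, which must remain reachable at the paths recorded in the container.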

There exists a "Parallel HDF5" library for MPI. Although MPI might otherwise seem like overkill, it would have advantages if scaling up later to multiple machines.

If writing output is not a performance bottleneck, a multithreaded application could probably implement one output thread (utilising some form of queue data-structure).
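A minimal sketch of the single-output-thread idea, using only the Python standard library. The names `writer_loop`, `compute_tile` and `run` are hypothetical, and a plain dict stands in for the HDF5 dataset; in a real application the marked line would be something like a slice assignment into an open dataset.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

SENTINEL = None  # tells the writer thread that no more tiles are coming


def writer_loop(tile_queue, output):
    """Only this thread touches `output`, so writes never run concurrently."""
    while True:
        item = tile_queue.get()
        if item is SENTINEL:
            break
        index, values = item
        output[index] = values  # stand-in for e.g. dataset[slices] = values


def compute_tile(index, tile_queue):
    """Worker: compute a tile and hand it to the writer via the queue."""
    values = [index * 10 + k for k in range(3)]  # placeholder computation
    tile_queue.put((index, values))


def run(n_tiles=4):
    # Bounded queue: workers block if the writer falls behind.
    tile_queue = queue.Queue(maxsize=8)
    output = {}
    writer = threading.Thread(target=writer_loop, args=(tile_queue, output))
    writer.start()
    with ThreadPoolExecutor(max_workers=4) as pool:
        for i in range(n_tiles):
            pool.submit(compute_tile, i, tile_queue)
    # Leaving the `with` block waits for all workers, so it is safe to stop.
    tile_queue.put(SENTINEL)
    writer.join()
    return output
```

Because all file access happens in one thread, this sidesteps the library's lack of thread safety, at the cost of serialising the output step.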

[Edit:] Another option is to use the Zarr format instead, which places each chunk in a separate file (an approach which future versions of HDF currently seem likely to adopt).

Upvotes: 6

John Readey

Reputation: 561

If you are running in AWS, check out HDF Cloud: https://www.hdfgroup.org/solutions/hdf-cloud.

This is a service that enables multiple-reader/multiple-writer workflows and is largely feature-compatible with the HDF5 library. The client SDK doesn't support non-blocking writes, but of course if you are using the REST API directly you could do non-blocking I/O just like you would with any HTTP-based service.

Upvotes: 0
