Reputation: 647
I have some data in memory that I want to store in an HDF file.
My data are not huge (<100 MB, so they fit in memory very comfortably), so for performance it seems to make sense to keep them there. At the same time, I also want to store them on disk. It is not critical that the two copies are always exactly in sync, as long as both are valid (i.e. not corrupted) and I can trigger a synchronization manually.
I could just keep my data in a separate container in memory and shovel them into an HDF object on demand. If possible, I would like to avoid writing this layer: it would require me to keep track of which parts have been changed and selectively update those. I was hoping HDF would take care of that for me.
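To make it concrete, something like this is the layer I would rather not have to write myself (a hypothetical sketch, all names made up):

```python
import h5py
import numpy as np

class SyncedStore:
    """In-memory container that tracks dirty keys and writes only
    those to an HDF5 file when a sync is triggered manually."""

    def __init__(self, path):
        self._path = path
        self._data = {}      # the in-memory copy of the data
        self._dirty = set()  # keys changed since the last sync

    def __setitem__(self, key, array):
        self._data[key] = array
        self._dirty.add(key)

    def __getitem__(self, key):
        return self._data[key]

    def sync(self):
        # Selectively update only the parts that have changed.
        with h5py.File(self._path, "a") as f:
            for key in self._dirty:
                if key in f:
                    del f[key]  # replace the stale dataset
                f[key] = self._data[key]
        self._dirty.clear()

store = SyncedStore("data.h5")
store["temperature"] = np.random.rand(100)
store.sync()  # manually triggered synchronization
```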
I know about the driver='core' with backing store functionality, but AFAICT it only syncs the backing store when closing the file. I can flush the file, but does that guarantee that the object is written to storage?
From looking at the HDF5 source code, it seems that the answer is yes. But I'd like to hear a confirmation.
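For reference, this is the setup I am talking about (h5py; the file name is made up):

```python
import h5py
import numpy as np

# driver='core' keeps the whole file image in memory;
# backing_store=True writes that image out to data.h5 on disk.
f = h5py.File("data.h5", "w", driver="core", backing_store=True)
f["measurements"] = np.random.rand(1000, 100)  # well under 100 MB

f.flush()  # does this push the in-memory image to the backing store?
f.close()  # closing definitely writes the backing store
```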
Bonus question: Is driver='core' actually faster than normal filesystem back-ends? What do I need to look out for?
Upvotes: 2
Views: 1300
Reputation: 564
If you need consistency and want to avoid corrupted HDF5 files, you may like to do the following (sketched below):
1) Use a write-ahead log: append a log record describing what is added/updated each time, without writing to the HDF5 file at that moment.
2) Periodically, or when you need to shut down, replay the logs, applying them one by one to the HDF5 file.
3) If your process goes down during 1), you won't lose data: after the next startup, just replay the logs and write them to the HDF5 file.
4) If your process goes down during 2), you won't lose data either: just remove the corrupted HDF5 file, replay the logs, and write it again.
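A minimal sketch of this scheme, assuming h5py and a JSON-lines log keyed by dataset name (file names and record format are made up):

```python
import json
import os

import h5py
import numpy as np

LOG_PATH = "updates.log"  # hypothetical paths
HDF5_PATH = "data.h5"

def log_update(name, array):
    """Step 1: append one update record to the write-ahead log."""
    record = {"name": name, "data": array.tolist()}
    with open(LOG_PATH, "a") as log:
        log.write(json.dumps(record) + "\n")
        log.flush()
        os.fsync(log.fileno())  # make the record itself durable

def replay_log():
    """Steps 2-4: apply all logged updates to the HDF5 file,
    then discard the log after a clean close."""
    if not os.path.exists(LOG_PATH):
        return
    with h5py.File(HDF5_PATH, "a") as f:
        with open(LOG_PATH) as log:
            for line in log:
                record = json.loads(line)
                if record["name"] in f:
                    del f[record["name"]]  # replace the old dataset
                f[record["name"]] = np.asarray(record["data"])
    os.remove(LOG_PATH)  # safe only after the file closed cleanly

log_update("temperature", np.random.rand(100))
replay_log()
```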
Upvotes: 0
Reputation: 7293
What the H5Fflush command does is request that the file system transfer all buffers to the file.
The documentation has a specific note about it:
HDF5 does not possess full control over buffering. H5Fflush flushes the internal HDF5 buffers then asks the operating system (the OS) to flush the system buffers for the open files. After that, the OS is responsible for ensuring that the data is actually flushed to disk.
In practice, I have noticed that most of the time I can read the data from an HDF5 file that has been flushed (even if the process was subsequently killed), but this is not guaranteed by HDF5: there is no safety in relying on the flush operation to leave you a valid HDF5 file, as further operations (on the metadata, for instance) can corrupt the file if the process is then interrupted. You have to close the file completely to have this consistency.
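To illustrate the difference (a sketch; the file name is made up):

```python
import h5py
import numpy as np

with h5py.File("data.h5", "w") as f:
    f["block"] = np.zeros((1000, 1000))
    f.flush()  # HDF5 buffers -> OS buffers; the OS decides when the
               # data actually reaches the disk
    # If the process is killed here, the file is often readable, but
    # later metadata operations could still leave it inconsistent.
# Only the clean close at the end of the `with` block guarantees a
# consistent file.
```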
Upvotes: 1