Reputation: 647
I have some data in memory that I want to store in an HDF file.
My data are not huge (<100 MB, so they fit in memory very comfortably), so for performance it seems to make sense to keep them there. At the same time, I also want to store them on disk. It is not critical that the two copies are always exactly in sync, as long as both are valid (i.e. not corrupted) and I can trigger a synchronization manually.
I could just keep my data in a separate container in memory and shovel them into an HDF object on demand. If possible, I would like to avoid writing this layer: it would require me to keep track of which parts have been changed and selectively update those. I was hoping HDF would take care of that for me.
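To make it concrete, something like this is the layer I would rather not have to write myself (a hypothetical sketch, all names made up):

```python
import h5py
import numpy as np

class SyncedStore:
    """In-memory container that tracks dirty keys and writes only
    those to an HDF5 file when a sync is triggered manually."""

    def __init__(self, path):
        self._path = path
        self._data = {}      # the in-memory copy of the data
        self._dirty = set()  # keys changed since the last sync

    def __setitem__(self, key, array):
        self._data[key] = array
        self._dirty.add(key)

    def __getitem__(self, key):
        return self._data[key]

    def sync(self):
        # Selectively update only the parts that have changed.
        with h5py.File(self._path, "a") as f:
            for key in self._dirty:
                if key in f:
                    del f[key]  # replace the stale dataset
                f[key] = self._data[key]
        self._dirty.clear()

store = SyncedStore("data.h5")
store["temperature"] = np.random.rand(100)
store.sync()  # manually triggered synchronization
```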
I know about the driver='core' with backing store functionality, but AFAICT it only syncs the backing store when closing the file. I can flush the file, but does that guarantee that the object is written to storage?
From looking at the HDF5 source code, it seems that the answer is yes. But I'd like to hear a confirmation.
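For reference, this is the setup I am talking about (h5py; the file name is made up):

```python
import h5py
import numpy as np

# driver='core' keeps the whole file image in memory;
# backing_store=True writes that image out to data.h5 on disk.
f = h5py.File("data.h5", "w", driver="core", backing_store=True)
f["measurements"] = np.random.rand(1000, 100)  # well under 100 MB

f.flush()  # does this push the in-memory image to the backing store?
f.close()  # closing definitely writes the backing store
```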
Bonus question: Is driver='core' actually faster than normal filesystem back-ends? What do I need to look out for?
Upvotes: 2
Views: 1300
Reputation: 564
If you need consistency and want to avoid corrupted HDF5 files, you may like to do the following (sketched below):
1) Use a write-ahead log: append a log record describing what is added/updated each time, without writing to the HDF5 file at that moment.
2) Periodically, or when you need to shut down, replay the logs, applying them one by one to the HDF5 file.
3) If your process goes down during 1), you won't lose data: after the next startup, just replay the logs and write them to the HDF5 file.
4) If your process goes down during 2), you won't lose data either: just remove the corrupted HDF5 file, replay the logs, and write it again.
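A minimal sketch of this scheme, assuming h5py and a JSON-lines log keyed by dataset name (file names and record format are made up):

```python
import json
import os

import h5py
import numpy as np

LOG_PATH = "updates.log"  # hypothetical paths
HDF5_PATH = "data.h5"

def log_update(name, array):
    """Step 1: append one update record to the write-ahead log."""
    record = {"name": name, "data": array.tolist()}
    with open(LOG_PATH, "a") as log:
        log.write(json.dumps(record) + "\n")
        log.flush()
        os.fsync(log.fileno())  # make the record itself durable

def replay_log():
    """Steps 2-4: apply all logged updates to the HDF5 file,
    then discard the log after a clean close."""
    if not os.path.exists(LOG_PATH):
        return
    with h5py.File(HDF5_PATH, "a") as f:
        with open(LOG_PATH) as log:
            for line in log:
                record = json.loads(line)
                if record["name"] in f:
                    del f[record["name"]]  # replace the old dataset
                f[record["name"]] = np.asarray(record["data"])
    os.remove(LOG_PATH)  # safe only after the file closed cleanly

log_update("temperature", np.random.rand(100))
replay_log()
```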
Upvotes: 0
Reputation: 7293
What the H5Fflush command does is request that the file system transfer all buffers to the file.
The documentation has a specific note about it:
HDF5 does not possess full control over buffering. H5Fflush flushes the internal HDF5 buffers then asks the operating system (the OS) to flush the system buffers for the open files. After that, the OS is responsible for ensuring that the data is actually flushed to disk.
In practice, I have noticed that most of the time I can read the data from an HDF5 file that has been flushed (even if the process was subsequently killed), but this is not guaranteed by HDF5: there is no safety in relying on the flush operation to leave you a valid HDF5 file, as further operations (on the metadata, for instance) can corrupt the file if the process is then interrupted. You have to close the file completely to have this consistency.
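To illustrate the difference (a sketch; the file name is made up):

```python
import h5py
import numpy as np

with h5py.File("data.h5", "w") as f:
    f["block"] = np.zeros((1000, 1000))
    f.flush()  # HDF5 buffers -> OS buffers; the OS decides when the
               # data actually reaches the disk
    # If the process is killed here, the file is often readable, but
    # later metadata operations could still leave it inconsistent.
# Only the clean close at the end of the `with` block guarantees a
# consistent file.
```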
Upvotes: 1