Live-analysis of simulation data using pytables / hdf5

Question

I am working on some CFD simulations with C/CUDA and Python. At the moment the workflow goes like this:

1. Start a simulation written in pure C/CUDA.
2. Write the output to a binary file.
3. Reopen the files with Python (using numpy.fromfile) and do some analysis.
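
For reference, the read-back step of this workflow can be sketched as below. The file name, grid shape and dtype are placeholders, not taken from the actual simulation, and the writer is emulated from Python instead of C. A raw binary dump stores none of this information, so the reader has to know it in advance.

```python
import os
import tempfile

import numpy as np

# Hypothetical grid size and dtype; a raw binary dump records neither.
NX, NY = 4, 3
DTYPE = np.float32


def write_raw(path, field):
    """Stand-in for the C/CUDA writer: dump the array as raw bytes."""
    field.astype(DTYPE).tofile(path)


def read_raw(path, shape):
    """Read the flat dump back and restore the shape, which must be known."""
    flat = np.fromfile(path, dtype=DTYPE)
    return flat.reshape(shape)


path = os.path.join(tempfile.mkdtemp(), "step_0000.bin")  # placeholder name
original = np.arange(NX * NY, dtype=DTYPE).reshape(NY, NX)
write_raw(path, original)
restored = read_raw(path, (NY, NX))
```

Shape, dtype and any physical parameters live outside the file here, which is the bookkeeping that a self-describing format such as HDF5 is meant to remove.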

Since I have a lot of data and also some metadata, I thought it would be better to switch to the HDF5 file format. So my idea was something like this:

1. Create the initial-conditions data for my simulations using PyTables.
2. Reopen the file and write to the datasets from C using the standard HDF5 library.
3. Reopen the files using PyTables for analysis.
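
A minimal sketch of these three steps with PyTables might look as follows. The dataset names, the attribute `dt` and the grid size are made up for illustration, and the C writer in the middle step is emulated from Python. Each step opens and closes the file on its own, so nothing here is concurrent.

```python
import os
import tempfile

import numpy as np
import tables

path = os.path.join(tempfile.mkdtemp(), "sim.h5")  # placeholder file name
nx = 8  # hypothetical number of grid points

# Step 1: create the file with initial conditions and metadata.
with tables.open_file(path, mode="w", title="CFD run") as h5:
    h5.create_array("/", "initial_velocity", np.zeros(nx), "u at t=0")
    h5.root._v_attrs.dt = 1e-3  # metadata stored as a file-level attribute
    # Extendable array: one row per saved time step, growing along axis 0.
    h5.create_earray("/", "velocity", atom=tables.Float64Atom(), shape=(0, nx))

# Step 2 would happen in C; it is emulated here by appending one time step.
with tables.open_file(path, mode="a") as h5:
    h5.root.velocity.append(np.ones((1, nx)))

# Step 3: reopen read-only for analysis.
with tables.open_file(path, mode="r") as h5:
    dt = h5.root._v_attrs.dt
    steps = h5.root.velocity.read()
```

An extendable array is used for the time series because the number of saved steps is usually not known when the file is created.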

I would really like to do some live analysis of the data, i.e. write from the C program to HDF5 and read directly from Python using PyTables. This would be pretty useful, but I am not sure how well PyTables supports it.

Since I have never worked with PyTables or HDF5, it would be good to know whether this is a good approach or whether there are pitfalls.

weatherfrog · Accepted Answer

I think it is a reasonable approach, but there is indeed a pitfall. The HDF5 C library is not thread-safe, and it does not coordinate concurrent access from separate processes either (there is a "parallel" version, more on this later). That means your scenario does not work out of the box: one process writing data to a file while another process is reading (not necessarily the same dataset) can result in a corrupted file. To make it work, you must either:

1. implement file locking, making sure that no process is reading while the file is being written to, or
2. serialize access to the file by delegating reads/writes to a distinguished process. You must then communicate with this process through some IPC technique (Unix domain sockets, ...). Of course, this might affect performance because data is being copied back and forth.
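
The file-locking option can be sketched with advisory locks. This assumes a Unix-like system and that every program touching the data cooperates by locking the same sidecar lock file; the lock file name is hypothetical, and the C writer would have to take the same lock, for example with flock(2). A plain binary file stands in for the HDF5 file here. With real HDF5 files, the file should be opened and closed (or at least flushed) inside the locked region, so that the other side does not work with stale cached state.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def file_lock(lock_path, exclusive):
    """Hold an advisory flock on lock_path for the duration of the block."""
    mode = fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH
    with open(lock_path, "a") as lock_file:
        fcntl.flock(lock_file, mode)  # blocks until the lock is granted
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


workdir = tempfile.mkdtemp()
data_path = os.path.join(workdir, "sim.dat")  # stands in for the HDF5 file
lock_path = data_path + ".lock"               # hypothetical sidecar lock file

# Writer side: take the exclusive lock, write, close the file, then release.
with file_lock(lock_path, exclusive=True):
    with open(data_path, "ab") as f:
        f.write(b"step 0\n")

# Reader side: a shared lock admits several readers but excludes the writer.
with file_lock(lock_path, exclusive=False):
    with open(data_path, "rb") as f:
        contents = f.read()
```

Advisory locks only protect against programs that also ask for the lock, so a reader that skips `file_lock` can still observe a half-written file.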

Recently, the HDF Group published an MPI-based parallel version of HDF5, which makes concurrent read/write access possible (cf. http://www.hdfgroup.org/HDF5/PHDF5/). It was created for use cases like yours.

To my knowledge, PyTables does not provide any bindings to parallel HDF5. You should use h5py instead, which provides very user-friendly bindings to parallel HDF5. See the examples on this website: http://docs.h5py.org/en/2.3/mpi.html

Unfortunately, parallel HDF5 has a major drawback: to date, it does not support writing compressed datasets (reading is possible, though). Cf. http://www.hdfgroup.org/hdf5-quest.html#p5comp
