Reputation: 520
I have a Python program that accepts a stream of data via UDP at a rate of ~1000 Hz. A typical stream lasts ~15 minutes and consists of ~10 channels, each a stream of doubles, booleans, or vectors of size 3, each with a timestamp.
Currently, every iteration (so 1000 times a second) it writes a line with all the values to a CSV file.
To limit the size of the files, I want to change the format to HDF5 and write the data with h5py.
In short, it should look like this:
import threading

class StoreData(threading.Thread):
    def __init__(self):
        super().__init__()
        # placeholder: open the HDF5 file for writing
        self.f = open_hdf5_file_as_write()

    def run(self):
        while True:
            # returns True every ~0.001 seconds
            if self.new_values_available():
                vals = self.get_new_vals()
                # What is best to do with the vals here?
But I'm stuck on two questions.
What is the best structure for the HDF5 file? Is it best to store the streams in different groups, or just as different datasets in the same group?
How should I write the data? Do I resize the datasets to add one value every iteration? Do I buffer the data locally and append a chunk of n values per stream every n iterations, or do I keep everything in a pandas table and write it just once at the end?
Answering one of the two questions would already be a big help!
Upvotes: 1
Views: 1880
Reputation: 8046
Both are good questions. I can't give a precise answer without knowing more about your data and workflows. (Note: The HDF Group has a good overview you might want to review here: Introduction to HDF5. It is a good place to learn the possibilities of schema design.) Here are things I would consider in a "thought experiment":
The best structure:
With HDF5, you can define any schema you want (within limits), so the best structure (schema) is the one that works best with your data and processes.
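For illustration only, here is a minimal h5py sketch of one possible schema: a single group with one resizable dataset per channel. The channel names, dtypes, and chunk sizes are hypothetical, not taken from your data.

import h5py

# Hypothetical layout: one group, one resizable dataset per channel.
# Names, dtypes, and chunk sizes are made up for illustration.
with h5py.File("stream.h5", "w") as f:
    grp = f.create_group("channels")
    # Scalar double channel: each row is (timestamp, value).
    grp.create_dataset("temperature", shape=(0, 2), maxshape=(None, 2),
                       dtype="f8", chunks=(1000, 2))
    # Boolean channel: parallel timestamp/value datasets in a subgroup.
    grp.create_dataset("valve_open/timestamp", shape=(0,),
                       maxshape=(None,), dtype="f8", chunks=(1000,))
    grp.create_dataset("valve_open/value", shape=(0,),
                       maxshape=(None,), dtype="u1", chunks=(1000,))
    # Vector-of-3 channel: each row is (timestamp, x, y, z).
    grp.create_dataset("position", shape=(0, 4), maxshape=(None, 4),
                       dtype="f8", chunks=(1000, 4))

Whether you split channels into more groups (e.g., by type) is mostly a matter of navigation and taste; HDF5 handles either layout equally well.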
How should I write the data?
There are several Python packages that can write HDF5 data. I am familiar with PyTables (aka tables) and h5py. (Pandas can also create HDF5 files, but I have no experience to share.) Both packages have similar capabilities, with some differences, and both support the HDF5 features you need (resizable datasets, homogeneous and/or heterogeneous data). h5py attempts to map the HDF5 feature set to NumPy as closely as possible. PyTables has an abstraction layer on top of HDF5 and NumPy, with advanced indexing capabilities to quickly perform in-kernel data queries. (Also, I found PyTables I/O to be slightly faster than h5py.) For those reasons, I prefer PyTables, but I am equally comfortable with h5py.
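As a hedged sketch of the PyTables Table abstraction I mentioned (the column names are hypothetical, not your actual channels), a heterogeneous table packs one sample per row with mixed types:

import tables as tb

# Hypothetical compound row: one table row per sample, mixing types.
class Sample(tb.IsDescription):
    timestamp = tb.Float64Col()
    temperature = tb.Float64Col()
    valve_open = tb.BoolCol()
    position = tb.Float64Col(shape=(3,))

with tb.open_file("stream.h5", "w") as f:
    table = f.create_table("/", "samples", Sample, "1 kHz stream")
    row = table.row
    # Append one sample; in your case this would happen in the run() loop.
    row["timestamp"] = 0.001
    row["temperature"] = 21.5
    row["valve_open"] = True
    row["position"] = (0.0, 1.0, 2.0)
    row.append()
    table.flush()

h5py gives you the same resizable behavior with maxshape=(None, ...) datasets, as in the schema sketch above.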
How often should I write: every 1 or N iterations, or once at the end?
This is a trade-off of available RAM vs. required I/O performance vs. coding complexity. There is an I/O "time cost" with each write to the file, so the fastest process is to save all data in RAM and write it at the end. That means you need enough memory to hold a 15-minute datastream, and I suspect memory requirements will drive this decision. The good news: PyTables and h5py both support any of these methods.
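To make the middle option (buffer locally, write every n iterations) concrete, here is a minimal h5py sketch; the buffer size and the single 2-D dataset are assumptions for illustration, not a prescription:

import h5py
import numpy as np

BUF_ROWS = 1000  # illustrative: flush once per second at 1 kHz

class ChunkedWriter:
    """Buffer rows in RAM, then append them to a resizable dataset in blocks."""

    def __init__(self, filename, n_cols):
        self.f = h5py.File(filename, "w")
        self.dset = self.f.create_dataset(
            "data", shape=(0, n_cols), maxshape=(None, n_cols),
            dtype="f8", chunks=(BUF_ROWS, n_cols))
        self.buf = np.empty((BUF_ROWS, n_cols), dtype="f8")
        self.count = 0

    def add_row(self, row):
        self.buf[self.count] = row
        self.count += 1
        if self.count == BUF_ROWS:
            self.flush()

    def flush(self):
        if self.count == 0:
            return
        n = self.dset.shape[0]
        self.dset.resize(n + self.count, axis=0)  # one resize per block
        self.dset[n:n + self.count] = self.buf[:self.count]
        self.count = 0

    def close(self):
        self.flush()  # write any partial buffer before closing
        self.f.close()

Matching the dataset chunk size to the flush size (here both 1000 rows) keeps each write aligned with one HDF5 chunk, which is generally the efficient case.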
Upvotes: 3