DwightFromTheOffice

Reputation: 520

Storing a data stream in an HDF5 file using Python

I have a Python program that accepts a stream of data via UDP at a rate of roughly 1000 Hz. A typical stream lasts about 15 minutes and consists of roughly 10 channels, each a stream of doubles, booleans, or vectors of size 3, together with a timestamp.

Currently, on every iteration (so 1000 times per second), it writes a line with all the values to a CSV file.

To limit the size of the files, I want to change the format to HDF5 and write the data with h5py.

In short, it should look like this:

import threading

class StoreData(threading.Thread):

    def __init__(self):
        super().__init__()
        # placeholder: open the HDF5 file in write mode (e.g. with h5py)
        self.f = open_hdf5_file_as_write()

    def run(self):
        while True:
            # returns True roughly every 0.001 seconds
            if self.new_values_available():
                vals = self.get_new_vals()
                # What is the best thing to do with vals here?

But I have stumbled upon two questions.

  1. What is the best structure for the HDF5 file? Is it best to store the streams in different groups, or just as different datasets in the same group?

  2. How should I write the data? Do I expand the datasets by one entry every iteration using a resize? Do I buffer data locally and append a chunk of n values per stream every n iterations, or do I keep everything in a pandas table and write it just once at the end?

Answering 1 of the 2 questions would already be a big help!

Upvotes: 1

Views: 1880

Answers (1)

kcw78

Reputation: 8046

Both are good questions. I can't give a precise answer without knowing more about your data and workflows. (Note: The HDF Group has a good overview you might want to review here: Introduction to HDF5. It is a good place to learn the possibilities with schema design.) Here are things I would consider in a "thought experiment":

The best structure:
With HDF5, you can define any schema you want (within limits), so the best structure (schema) is the one that works best with your data and processes.

  • Since you have an existing CSV file format, the simplest approach is creating an equivalent NumPy dtype and referencing it to create a recarray that holds the data. This would mimic your current data organization (a minimal sketch of such a dtype follows this list). If you want to get fancier, here are other considerations:
  • Your datatypes: are they homogeneous (all floats or all ints), or heterogeneous (a mix of floats, ints and strings)? You have more options if they are all the same. However, HDF5 also supports mixed types as compound data.
  • Organization: How are you going to use the data? A properly designed schema will help you avoid data gymnastics in the future. Is it advantageous (to you) to save everything in 1 dataset, or to distribute across different datasets/groups? Think of data organized in folders and files on your computer. HDF5 Groups are your folders and the datasets are your files.
  • Convenience of working with the data: similar to organization. How easy or hard is it to write vs. read? It might be easier to write the data as you receive it - but is that a convenient format when you want to process it?
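
For example, a compound NumPy dtype mirroring one CSV row could look like the sketch below (the channel names and types are assumptions, since I don't know your exact columns):

import numpy as np

# hypothetical layout of one sample: adjust names/types to your channels
row_dtype = np.dtype([("timestamp", "f8"),
                      ("ch_double", "f8"),
                      ("ch_bool",   "?"),
                      ("ch_vec3",   "f8", (3,))])

# a recarray buffer for 1000 samples; fields are addressable by name
buf = np.recarray(1000, dtype=row_dtype)
buf[0] = (0.0, 1.23, True, (0.0, 0.0, 0.0))
print(buf.timestamp[0], buf.ch_vec3[0])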

How should I write the data?
There are several Python packages that can write HDF5 data. I am familiar with PyTables (aka tables) and h5py. (Pandas can also create HDF5 files, but I have no experience to share.) Both packages have similar capabilities, and some differences. Both support HDF5 features you need (resizeable datasets, homogeneous and/or heterogeneous data). h5py attempts to map the HDF5 feature set to NumPy as closely as possible. PyTables has an abstraction layer on top of HDF5 and NumPy, with advanced indexing capabilities to quickly perform in-kernel data queries. (Also, I found PyTables I/O is slightly faster than h5py.) For those reasons, I prefer PyTables, but I am equally comfortable with h5py.
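
To give an idea of the PyTables style, here is a minimal sketch that defines a table and appends one row (again, the column names and file name are assumptions):

import numpy as np
import tables as tb

# hypothetical row description mirroring the CSV columns
class Sample(tb.IsDescription):
    timestamp = tb.Float64Col()
    ch_double = tb.Float64Col()
    ch_bool   = tb.BoolCol()
    ch_vec3   = tb.Float64Col(shape=(3,))

with tb.open_file("stream.h5", mode="w") as h5f:
    table = h5f.create_table("/", "samples", Sample, title="UDP stream")
    row = table.row
    row["timestamp"] = 0.001          # placeholder values
    row["ch_double"] = 1.23
    row["ch_bool"] = True
    row["ch_vec3"] = np.zeros(3)
    row.append()                      # appending is cheap; flush periodically
    table.flush()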

How often should I write: every 1 or N iterations, or once at the end?
This is a trade-off of available RAM vs required I/O performance vs coding complexity. There is an I/O "time cost" with each write to the file. So, the fastest process is to save all data in RAM and write at the end. That means you need enough memory to hold a 15 minute datastream. I suspect memory requirements will drive this decision. The good news: PyTables and h5py will support any of these methods.
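
If you go the buffered route with h5py, a minimal sketch could look like this (the buffer size, dataset name, and field names are assumptions; the loop stands in for your acquisition thread):

import numpy as np
import h5py

sample_dtype = np.dtype([("timestamp", "f8"),
                         ("ch_double", "f8"),
                         ("ch_bool",   "?"),
                         ("ch_vec3",   "f8", (3,))])

BUFFER_ROWS = 1000  # at ~1000 Hz this flushes roughly once per second

with h5py.File("stream.h5", "w") as f:
    # resizable, chunked dataset; the chunk size matches the write block
    dset = f.create_dataset("samples", shape=(0,), maxshape=(None,),
                            dtype=sample_dtype, chunks=(BUFFER_ROWS,))
    buf = np.zeros(BUFFER_ROWS, dtype=sample_dtype)
    count = 0
    for i in range(2500):            # stand-in for the acquisition loop
        buf[count] = (i * 0.001, float(i), i % 2 == 0, (0.0, 1.0, 2.0))
        count += 1
        if count == BUFFER_ROWS:     # buffer full: append one chunk to the file
            start = dset.shape[0]
            dset.resize(start + count, axis=0)
            dset[start:start + count] = buf[:count]
            count = 0
    if count:                        # flush whatever is left at the end
        start = dset.shape[0]
        dset.resize(start + count, axis=0)
        dset[start:start + count] = buf[:count]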

Upvotes: 3
