Reputation: 189676
I have two streams of data that have limited duration (typically 1-60 seconds) and I want to store them in a compressed data file for later retrieval. Right now I am using HDF5, but I've heard about Parquet and want to give it a try.
Stream 1:
The data is arriving as a series of records, approximately 2500 records per second. Each record is a tuple (timestamp, tag, data) with the following sizes:
Stream 2:
The data is arriving as a series of records, approximately 100000 records per second. Each record is a tuple (timestamp, index, value) with the following sizes:
Can I do this with Apache Parquet? I am totally new to this and can't seem to find the right documentation; I found documentation about reading/writing entire tables, but in my case I need to incrementally write to the tables in batches of some number of rows (depending on how large a buffer I want to use).
I am interested in both Java and Python and can explore in either, but I'm more fluent in Python.
I found this page for pyarrow: https://arrow.apache.org/docs/python/parquet.html --- it talks about row groups, ParquetWriter, and read_row_group(), but I can't tell if it supports my use case.
Any suggestions?
Upvotes: 3
Views: 7311
Reputation: 4195
We came up with a novel solution for this (code is OSS):
The core of the approach is to “transpose” incoming streaming data from a row-oriented to a column-oriented structure using an intermediate scratch file stored on disk rather than in memory. The write happens in two passes: the first pass spills incoming records into the column-oriented scratch file as they arrive, and the second pass assembles the final Parquet file from it column by column. Splitting the work this way keeps memory usage bounded regardless of stream length while preserving data integrity and write performance.
detailed writeup here: https://estuary.dev/memory-efficient-streaming-parquet/
Upvotes: 1