Jason S

Reputation: 189676

Streaming data into Apache Parquet files?

I have two streams of data that have limited duration (typically 1-60 seconds) and I want to store them in a compressed data file for later retrieval. Right now I am using HDF5, but I've heard about Parquet and want to give it a try.

Stream 1:

The data is arriving as a series of records, approximately 2500 records per second. Each record is a tuple (timestamp, tag, data) with the following sizes:

Stream 2:

The data is arriving as a series of records, approximately 100000 records per second. Each record is a tuple (timestamp, index, value) with the following sizes:

Can I do this with Apache Parquet? I am totally new to this and can't seem to find the right documentation; I've found documentation about reading/writing entire tables, but in my case I need to write to the tables incrementally, in batches of some number of rows (depending on how large a buffer I want to use).

I am interested in both Java and Python and can explore in either, but I'm more fluent in Python.

I found this page for pyarrow: https://arrow.apache.org/docs/python/parquet.html --- it talks about row groups and ParquetWriter and read_row_group() but I can't tell if it supports my use case.
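For reference, the kind of pattern I'm hoping is supported looks roughly like this (untested sketch based on the docs; the column types are just placeholders since I haven't settled on them, and incoming_records() stands in for my actual stream):

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Placeholder schema for stream 1; the real types depend on the record sizes.
    schema = pa.schema([("timestamp", pa.int64()),
                        ("tag", pa.int32()),
                        ("data", pa.binary())])

    writer = pq.ParquetWriter("stream1.parquet", schema)
    buffer = {"timestamp": [], "tag": [], "data": []}

    def flush():
        # Each flush becomes one row group in the output file.
        if buffer["timestamp"]:
            writer.write_table(pa.Table.from_pydict(buffer, schema=schema))
            for col in buffer.values():
                col.clear()

    for timestamp, tag, data in incoming_records():  # stand-in for the real stream
        buffer["timestamp"].append(timestamp)
        buffer["tag"].append(tag)
        buffer["data"].append(data)
        if len(buffer["timestamp"]) >= 10_000:  # buffer size chosen arbitrarily here
            flush()

    flush()
    writer.close()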

Any suggestions?

Upvotes: 3

Views: 7311

Answers (1)

danthelion

Reputation: 4195

We came up with a novel solution for this (code is OSS):

The core of the approach is to "transpose" the incoming stream from a row-oriented to a column-oriented layout using an intermediate scratch file kept on disk rather than in memory. That transposition is what makes the streaming write memory-efficient.

The write itself happens in two passes: the first pass spills incoming records to the columnar scratch file, and the second pass assembles the final Parquet file from it. Splitting the work this way and operating column-wise keeps memory usage bounded while preserving data integrity and throughput, even for large datasets.
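A rough sketch of the general idea (illustrative only, not the actual OSS code; the scratch-file layout, names, and types here are simplified):

    import struct
    import pyarrow as pa
    import pyarrow.parquet as pq

    ROW_GROUP_SIZE = 100_000

    def write_stream_two_pass(records, out_path, scratch_prefix="scratch"):
        # Pass 1: transpose incoming (timestamp, index, value) rows into
        # per-column scratch files on disk, so memory stays bounded.
        cols = ["timestamp", "index", "value"]
        fmts = {"timestamp": "<q", "index": "<q", "value": "<d"}  # int64, int64, float64
        sinks = {c: open(f"{scratch_prefix}_{c}.bin", "wb") for c in cols}
        n_rows = 0
        for ts, idx, val in records:
            sinks["timestamp"].write(struct.pack(fmts["timestamp"], ts))
            sinks["index"].write(struct.pack(fmts["index"], idx))
            sinks["value"].write(struct.pack(fmts["value"], val))
            n_rows += 1
        for f in sinks.values():
            f.close()

        # Pass 2: read the scratch files back one row group at a time and
        # append each group to the Parquet file, so only ROW_GROUP_SIZE rows
        # are ever held in memory.
        schema = pa.schema([("timestamp", pa.int64()),
                            ("index", pa.int64()),
                            ("value", pa.float64())])
        sources = {c: open(f"{scratch_prefix}_{c}.bin", "rb") for c in cols}
        writer = pq.ParquetWriter(out_path, schema)
        remaining = n_rows
        while remaining > 0:
            n = min(ROW_GROUP_SIZE, remaining)
            arrays = []
            for c in cols:
                raw = sources[c].read(n * struct.calcsize(fmts[c]))
                values = [v[0] for v in struct.iter_unpack(fmts[c], raw)]
                arrays.append(pa.array(values, type=schema.field(c).type))
            writer.write_table(pa.Table.from_arrays(arrays, schema=schema))
            remaining -= n
        writer.close()
        for f in sources.values():
            f.close()

The second pass still writes ordinary row groups, but because each column already lives in its own scratch file, only one row group's worth of data is held in memory at any time.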

A detailed write-up is here: https://estuary.dev/memory-efficient-streaming-parquet/

Upvotes: 1
