I have essentially row-oriented/streaming data (Netflow) coming into my C++ application and I want to write the data to Parquet-gzip files.
Looking at the sample reader-writer.cc program in the parquet-cpp project, it seems that I can only feed the data to parquet-cpp in a columnar way:
constexpr int NUM_ROWS_PER_ROW_GROUP = 500;
...
// Append a RowGroup with a specific number of rows.
parquet::RowGroupWriter* rg_writer = file_writer->AppendRowGroup(NUM_ROWS_PER_ROW_GROUP);
// Write the Bool column
for (int i = 0; i < NUM_ROWS_PER_ROW_GROUP; i++) {
    bool_writer->WriteBatch(1, nullptr, nullptr, &value);
}
// Write the Int32 column
...
// Write the ... column
This seems to imply that I will need to buffer NUM_ROWS_PER_ROW_GROUP rows myself, then loop over them, to transfer them to parquet-cpp one column at a time. I'm hoping there is a better way, as this seems inefficient, since the data will need to be copied twice: once into my buffers, then again when feeding the data into parquet-cpp one column at a time.
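For concreteness, the buffered approach I'm hoping to avoid would look roughly like the sketch below, modeled on reader-writer.cc (the FlowRecord struct, batch size, and file name are placeholders I made up):
struct FlowRecord { int32_t sip; int32_t dip; };  // placeholder row type

int main() {
    // First copy: buffer one row group's worth of incoming rows.
    std::vector<FlowRecord> rows;
    for (int32_t i = 0; i < 500; i++) rows.push_back({i, i * 2});

    // Schema: two required INT32 columns.
    parquet::schema::NodeVector fields;
    fields.push_back(parquet::schema::PrimitiveNode::Make(
        "sip", parquet::Repetition::REQUIRED, parquet::Type::INT32));
    fields.push_back(parquet::schema::PrimitiveNode::Make(
        "dip", parquet::Repetition::REQUIRED, parquet::Type::INT32));
    auto schema = std::static_pointer_cast<parquet::schema::GroupNode>(
        parquet::schema::GroupNode::Make("schema", parquet::Repetition::REQUIRED, fields));

    std::shared_ptr<arrow::io::FileOutputStream> out_file;
    arrow::io::FileOutputStream::Open("buffered.parquet", &out_file);
    auto file_writer = parquet::ParquetFileWriter::Open(out_file, schema);

    // Second copy: loop over the buffered rows once per column.
    parquet::RowGroupWriter* rg_writer = file_writer->AppendRowGroup(rows.size());
    auto* sip_writer = static_cast<parquet::Int32Writer*>(rg_writer->NextColumn());
    for (const FlowRecord& r : rows) {
        sip_writer->WriteBatch(1, nullptr, nullptr, &r.sip);
    }
    auto* dip_writer = static_cast<parquet::Int32Writer*>(rg_writer->NextColumn());
    for (const FlowRecord& r : rows) {
        dip_writer->WriteBatch(1, nullptr, nullptr, &r.dip);
    }
    file_writer->Close();
}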
Is there a way to get each row's data into parquet-cpp without having to buffer a bunch of rows first? The Apache Arrow project (which parquet-cpp uses) has a tutorial that shows how to convert row-wise data into an Arrow table. For each row of input data, the code appends to each column builder:
for (const data_row& row : rows) {
    ARROW_RETURN_NOT_OK(id_builder.Append(row.id));
    ARROW_RETURN_NOT_OK(cost_builder.Append(row.cost));
}
I would like to do something like that with parquet-cpp. Is that possible?
Upvotes: 2
Views: 5749
I followed @xhochy's advice to use the Arrow APIs to populate an Arrow table as data arrives, and then write out the table using parquet-cpp's WriteTable() method. I set GZIP as the default compression, but specified SNAPPY for the second field.
#include <iostream>
#include "arrow/builder.h"
#include "arrow/table.h"
#include "arrow/io/file.h"
#include <parquet/arrow/writer.h>
#include <parquet/properties.h>

int main() {
    arrow::Int32Builder sip_builder(arrow::default_memory_pool());
    arrow::Int32Builder dip_builder(arrow::default_memory_pool());
    for (size_t i = 0; i < 1000; i++) {  // simulate row-oriented incoming data
        sip_builder.Append(i * 100);
        dip_builder.Append(i * 10 + i);
    }
    std::shared_ptr<arrow::Array> sip_array;
    sip_builder.Finish(&sip_array);
    std::shared_ptr<arrow::Array> dip_array;
    dip_builder.Finish(&dip_array);
    std::vector<std::shared_ptr<arrow::Field>> schema_definition = {
        arrow::field("sip", arrow::int32(), false /* don't allow null; makes field required */),
        arrow::field("dip", arrow::int32(), false)
    };
    auto schema = std::make_shared<arrow::Schema>(schema_definition);
    std::shared_ptr<arrow::Table> arrow_table;
    MakeTable(schema, {sip_array, dip_array}, &arrow_table);
    std::shared_ptr<arrow::io::FileOutputStream> file_output_stream;
    arrow::io::FileOutputStream::Open("test.parquet", &file_output_stream);
    parquet::WriterProperties::Builder props_builder;
    props_builder.compression(parquet::Compression::GZIP);            // default for all columns
    props_builder.compression("dip", parquet::Compression::SNAPPY);   // per-column override
    auto props = props_builder.build();
    parquet::arrow::WriteTable(*arrow_table, ::arrow::default_memory_pool(),
                               file_output_stream, sip_array->length(), props);
    std::cout << "done" << std::endl;
}
$ g++ -std=c++11 -I/opt/parquet-cpp/build/release/include -lparquet -larrow arrow-test.cc; ./a.out
done
$ /opt/parquet-cpp/build/release/parquet_reader --only-metadata test.parquet
File Name: test.parquet
Version: 0
Created By: parquet-cpp version 1.2.1-SNAPSHOT
Total rows: 1000
Number of RowGroups: 1 <<----------
Number of Real Columns: 2
Number of Columns: 2
Number of Selected Columns: 2
Column 0: sip (INT32)
Column 1: dip (INT32)
--- Row Group 0 ---
--- Total Bytes 8425 ---
Rows: 1000
Column 0, Values: 1000, Null Values: 0, Distinct Values: 0
Max: 99900, Min: 0
Compression: GZIP, Encodings: PLAIN_DICTIONARY PLAIN RLE
Uncompressed Size: 5306, Compressed Size: 3109
Column 1, Values: 1000, Null Values: 0, Distinct Values: 0
Max: 10989, Min: 0
Compression: SNAPPY, Encodings: PLAIN_DICTIONARY PLAIN RLE
Uncompressed Size: 5306, Compressed Size: 5316
The code above writes out one row group for the entire table/file. Depending on how many rows of data you have, this might not be ideal, as too many rows can result in a "fallback to plain encoding" (see Ryan Blue's presentation, slides 31-34). To write multiple row groups per table/file, set the chunk_size argument smaller (below I divide by 2 to get two row groups per table/file):
parquet::arrow::WriteTable(*arrow_table, ::arrow::default_memory_pool(),
                           file_output_stream, sip_array->length() / 2, props);
This is still not ideal. All of the data for a file must be buffered/stored in the Arrow table before calling parquet::arrow::WriteTable(), since that function opens and closes the file. I want to write multiple row groups per file, but I only want to buffer/store one or two row groups' worth of data at a time in memory. The following code accomplishes that. It is based on the code in parquet/arrow/writer.cc:
#include <parquet/util/memory.h>
...
auto arrow_output_stream = std::make_shared<parquet::ArrowOutputStream>(file_output_stream);
std::unique_ptr<parquet::arrow::FileWriter> writer;
parquet::arrow::FileWriter::Open(*(arrow_table->schema()), ::arrow::default_memory_pool(),
                                 arrow_output_stream, props,
                                 parquet::arrow::default_arrow_writer_properties(), &writer);
// write two row groups for the first table
writer->WriteTable(*arrow_table, sip_array->length() / 2);
// ... code here would generate a new table ...
// for now, we'll just write out the same table again, to
// simulate writing more data to the same file, this
// time as one row group
writer->WriteTable(*arrow_table, sip_array->length());
writer->Close();
(The output below is from a run with 1,000,000 rows per table rather than the 1,000 in the snippet above, hence the 2,000,000 total rows.)
$ /opt/parquet-cpp/build/release/parquet_reader --only-metadata test.parquet
File Name: test.parquet
Version: 0
Created By: parquet-cpp version 1.2.1-SNAPSHOT
Total rows: 2000000
Number of RowGroups: 3 <<--------
...
--- Row Group 0 ---
--- Total Bytes 2627115 ---
Rows: 500000
...
--- Row Group 1 ---
--- Total Bytes 2626873 ---
Rows: 500000
...
--- Row Group 2 ---
--- Total Bytes 4176371 ---
Rows: 1000000
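For completeness, the streaming pattern I ended up with looks roughly like the sketch below. MoreDataAvailable(), NextSip(), and NextDip() are hypothetical stand-ins for the incoming Netflow feed; schema, props, and arrow_output_stream come from the snippets above, and Statuses are unchecked for brevity:
std::unique_ptr<parquet::arrow::FileWriter> writer;
parquet::arrow::FileWriter::Open(*schema, ::arrow::default_memory_pool(),
                                 arrow_output_stream, props,
                                 parquet::arrow::default_arrow_writer_properties(), &writer);
arrow::Int32Builder sip_builder(arrow::default_memory_pool());
arrow::Int32Builder dip_builder(arrow::default_memory_pool());
while (MoreDataAvailable()) {                        // hypothetical input predicate
    // Buffer only one row group's worth of rows at a time.
    for (int64_t i = 0; i < 500000 && MoreDataAvailable(); i++) {
        sip_builder.Append(NextSip());               // hypothetical per-row accessors
        dip_builder.Append(NextDip());
    }
    std::shared_ptr<arrow::Array> sip_array, dip_array;
    sip_builder.Finish(&sip_array);                  // Finish() resets the builder for reuse
    dip_builder.Finish(&dip_array);
    std::shared_ptr<arrow::Table> batch;
    MakeTable(schema, {sip_array, dip_array}, &batch);
    writer->WriteTable(*batch, sip_array->length()); // one row group per batch
}
writer->Close();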
Upvotes: 4
You will never be able to avoid buffering entirely, since the data has to be transformed from a row-wise to a columnar representation. The best path at the time of writing is to construct Apache Arrow tables that are then fed into parquet-cpp.
parquet-cpp provides special Arrow APIs that can directly operate on these tables, mostly without any additional data copies. You can find the API in parquet/arrow/reader.h and parquet/arrow/writer.h.
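For the read side, a minimal sketch against that API (the file name is assumed, and Statuses are unchecked for brevity):
#include <arrow/io/file.h>
#include <arrow/table.h>
#include <parquet/arrow/reader.h>

int main() {
    // Open the Parquet file and read it back as a single Arrow table.
    std::shared_ptr<arrow::io::ReadableFile> infile;
    arrow::io::ReadableFile::Open("test.parquet", &infile);
    std::unique_ptr<parquet::arrow::FileReader> reader;
    parquet::arrow::OpenFile(infile, ::arrow::default_memory_pool(), &reader);
    std::shared_ptr<arrow::Table> table;
    reader->ReadTable(&table);
}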
There is an optimal, but yet-to-be-implemented, solution that could save some additional bytes of buffering. While it might save you some memory, several of its steps still need to be implemented by someone (feel free to contribute them or ask for help on implementing them), so you are probably fine with using the Apache Arrow based API.
Upvotes: 6