tSchema

Reputation: 203

How do I stream parquet using pyarrow?

I'm trying to read a large dataset of Parquet files piece by piece, do some operation on each piece, and then move on to the next one without holding everything in memory. I need to do this because the entire dataset doesn't fit into memory. Previously I used ParquetDataset, and I'm aware of RecordBatchStreamReader, but I'm not sure how to combine them.

How can I use PyArrow to do this?

Upvotes: 3

Views: 2830

Answers (2)

megaserg

Reputation: 35

Streaming Parquet files has been possible since Arrow v3.0.0 (early 2021) via ParquetFile.iter_batches: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetFile.html#pyarrow.parquet.ParquetFile.iter_batches
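
A minimal sketch of how that looks (the file path and batch size here are placeholder choices):

    import pyarrow.parquet as pq

    # Opening a ParquetFile is lazy; no row data is read yet.
    pf = pq.ParquetFile("example.parquet")  # placeholder path

    # Stream the file in chunks of up to 65,536 rows each.
    for batch in pf.iter_batches(batch_size=65536):
        # batch is a pyarrow.RecordBatch; process it and let it go out of
        # scope before the next chunk is materialized.
        print(batch.num_rows)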

Upvotes: 1

Wes McKinney

Reputation: 105521

At the moment, the Parquet APIs only support complete reads of individual files, so we can only limit reads at the granularity of a single file. We would like to create an implementation of arrow::RecordBatchReader (the streaming data interface) that reads from Parquet files, see https://issues.apache.org/jira/browse/ARROW-1012. Patches would be welcome.
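
Until that exists, one way to bound memory is to read the dataset one file at a time and release each table before moving on. A rough sketch, assuming a hypothetical data/*.parquet layout:

    import glob
    import pyarrow.parquet as pq

    # Hypothetical layout: a directory of Parquet files making up the dataset.
    for path in sorted(glob.glob("data/*.parquet")):
        table = pq.read_table(path)  # reads this one file completely
        # ... do some operation on `table` ...
        del table  # drop the reference so its memory can be reclaimed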

Upvotes: 4
