Reputation: 203
I'm trying to read a large dataset of Parquet files piece by piece, do some operation on each piece, and then move on to the next one without holding them all in memory. I need to do this because the entire dataset doesn't fit into memory. Previously I used ParquetDataset, and I'm aware of RecordBatchStreamReader, but I'm not sure how to combine them.
How can I use PyArrow to do this?
Upvotes: 3
Views: 2830
Reputation: 35
Streaming Parquet files has been available since Arrow 3.0.0 (early 2021) via ParquetFile.iter_batches: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetFile.html#pyarrow.parquet.ParquetFile.iter_batches
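A minimal sketch of what that can look like (the file path and batch size here are just placeholders, not from the question):

```python
import pyarrow.parquet as pq

# Open a single Parquet file and stream it in record batches instead of
# loading the whole file into memory at once.
parquet_file = pq.ParquetFile("data/part-0000.parquet")  # hypothetical path

for batch in parquet_file.iter_batches(batch_size=65_536):
    # `batch` is a pyarrow.RecordBatch; process it and let it go out of
    # scope before the next batch is read.
    df = batch.to_pandas()
    # ... do some operation on df ...
```

For a dataset made of many files, you can loop over the files and apply the same pattern to each one, so memory use stays bounded by a single batch.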
Upvotes: 1
Reputation: 105521
At the moment, the Parquet APIs only support complete reads of individual files, so we can only limit reads at the granularity of a single file. We would like to create an implementation of arrow::RecordBatchReader
(the streaming data interface) that reads from Parquet files; see https://issues.apache.org/jira/browse/ARROW-1012. Patches would be welcome.
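Until that exists, a rough sketch of working at single-file granularity (the directory path and glob pattern are assumptions for illustration) might look like:

```python
import glob
import pyarrow.parquet as pq

# Process a directory of Parquet files one file at a time, so peak memory
# is bounded by the size of the largest single file rather than the dataset.
for path in sorted(glob.glob("data/*.parquet")):  # hypothetical layout
    table = pq.read_table(path)  # each file is still read completely
    # ... do some operation on `table`, then move on; the previous table
    # can be garbage-collected before the next file is read ...
```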
Upvotes: 4