Reputation: 16064
For example, pandas's read_csv
has a chunksize
argument which makes read_csv
return an iterator over the CSV file so we can read it in chunks.
The Parquet format stores the data in chunks, but there isn't a documented way to read it in chunks like read_csv
.
Is there a way to read parquet files in chunks?
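For reference, this is the read_csv behavior the question describes: a minimal sketch, using an in-memory CSV as a stand-in for a large file on disk.

```python
import io

import pandas as pd

# Stand-in for a large CSV file on disk.
csv_data = io.StringIO("a,b\n1,2\n3,4\n5,6\n7,8\n")

# With chunksize set, read_csv returns an iterator of DataFrames
# instead of loading the whole file into memory at once.
total_rows = 0
for chunk in pd.read_csv(csv_data, chunksize=2):
    total_rows += len(chunk)

print(total_rows)  # 4
```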
Upvotes: 34
Views: 47487
Reputation: 541
If your parquet file was not created with row groups, the read_row_group method doesn't seem to work (there is only one group!).
However, if your parquet file is partitioned as a directory of parquet files, you can use the fastparquet engine (which only works on individual files) to read the files one by one, then concatenate them in pandas (or get the values and concatenate the ndarrays):
import pandas as pd
from glob import glob

files = sorted(glob('dat.parquet/part*'))

# Read the first file, then append the rest one at a time.
data = pd.read_parquet(files[0], engine='fastparquet')
for f in files[1:]:
    data = pd.concat([data, pd.read_parquet(f, engine='fastparquet')])
Upvotes: 12
Reputation: 1572
You can use iter_batches from pyarrow's ParquetFile; calling the to_pandas method on each batch gives you a pandas DataFrame.
Example:
import pyarrow.parquet as pq

parquet_file = pq.ParquetFile('example.parquet')
for batch in parquet_file.iter_batches():
    print("RecordBatch")
    batch_df = batch.to_pandas()
    print("batch_df:", batch_df)
Upvotes: 35
Reputation: 4260
You can't use a generator/iterator over a parquet file because it is a compressed file. You need to fully decompress it first.
Upvotes: -1
Reputation: 79
This is an old question, but the following worked for me if you want to read all chunks in one line without using concat:
pd.read_parquet("chunks_*", engine="fastparquet")
or if you want to read specific chunks you can try:
pd.read_parquet("chunks_[1-2]*", engine="fastparquet")
(This way you will read only the first two chunks; it is also not necessary to specify an engine.)
Upvotes: 2
Reputation: 1718
I'm not sure if one can do it directly from pandas, but pyarrow exposes read_row_group. The resulting Table should be convertible to a pandas DataFrame with to_pandas.
As of pyarrow 3.0 there is now an iter_batches method that can be used.
Upvotes: 7