Reputation: 16064
For example, pandas's read_csv
has a chunksize
argument which makes read_csv
return an iterator over the CSV file so we can read it in chunks.
The Parquet format stores the data in chunks, but there isn't a documented way to read it in chunks like read_csv
.
Is there a way to read parquet files in chunks?
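For reference, this is the read_csv behavior the question describes: a minimal sketch, using an in-memory CSV as a stand-in for a large file on disk.

```python
import io

import pandas as pd

# Stand-in for a large CSV file on disk.
csv_data = io.StringIO("a,b\n1,2\n3,4\n5,6\n7,8\n")

# With chunksize set, read_csv returns an iterator of DataFrames
# instead of loading the whole file into memory at once.
total_rows = 0
for chunk in pd.read_csv(csv_data, chunksize=2):
    total_rows += len(chunk)

print(total_rows)  # 4
```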
Upvotes: 34
Views: 47487
Reputation: 541
If your parquet file was not created with row groups, the read_row_group method doesn't seem to work (there is only one group!).
However, if your parquet file is partitioned as a directory of parquet files, you can use the fastparquet engine (which only works on individual files) to read the files one by one, then concatenate them in pandas (or get the values and concatenate the ndarrays):
import pandas as pd
from glob import glob

files = sorted(glob('dat.parquet/part*'))

# Read the first file, then append the rest one at a time.
data = pd.read_parquet(files[0], engine='fastparquet')
for f in files[1:]:
    data = pd.concat([data, pd.read_parquet(f, engine='fastparquet')])
Upvotes: 12
Reputation: 1572
You can use iter_batches from pyarrow's ParquetFile; calling the to_pandas method on each batch gives you a pandas DataFrame.
Example:
import pyarrow.parquet as pq

parquet_file = pq.ParquetFile('example.parquet')
for batch in parquet_file.iter_batches():
    print("RecordBatch")
    batch_df = batch.to_pandas()
    print("batch_df:", batch_df)
Upvotes: 35
Reputation: 4260
You can't use a generator/iterator over a parquet file because it is a compressed file. You need to fully decompress it first.
Upvotes: -1
Reputation: 79
This is an old question, but the following worked for me if you want to read all chunks in one line without using concat:
pd.read_parquet("chunks_*", engine="fastparquet")
or if you want to read specific chunks you can try:
pd.read_parquet("chunks_[1-2]*", engine="fastparquet")
(This way you will read only the first two chunks; it is also not necessary to specify an engine.)
Upvotes: 2
Reputation: 1718
I'm not sure if one can do it directly from pandas, but pyarrow exposes read_row_group. The resulting Table should be convertible to a pandas DataFrame with to_pandas.
As of pyarrow 3.0 there is now an iter_batches method that can be used.
Upvotes: 7