314159

Reputation: 81

How to read a defined list of Parquet files from S3 using PyArrow?

I need to incrementally load data into Pandas from Parquet files stored in S3. I'm trying to use PyArrow for this but am not having any luck.

Reading an entire directory of Parquet files into Pandas works just fine:

import s3fs
import pyarrow.parquet as pq
import pandas as pd

fs = s3fs.S3FileSystem(key=mykey, secret=mysecret)
p_dataset = pq.ParquetDataset('s3://mys3bucket/directory', filesystem=fs)

df = p_dataset.read().to_pandas()

But when I try to load a single Parquet file I get an error:

fs = s3fs.S3FileSystem(key=mykey, secret=mysecret)
p_dataset = pq.ParquetDataset('s3://mys3bucket/directory/1_0_00000000000000014012',
                              filesystem=fs)

df = p_dataset.read().to_pandas()

This throws an error:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-179-3d01b32c60f7> in <module>()
     15 p_dataset = pq.ParquetDataset(
     16     's3://mys3bucket/directory/1_0_00000000000000014012',
---> 17                       filesystem=fs)
     18 
     19 table2.to_pandas()

C:\User\Anaconda3\lib\site-packages\pyarrow\parquet.py in __init__(self, path_or_paths, filesystem, schema, metadata, split_row_groups, validate_schema, filters, metadata_nthreads)
    880 
    881         if validate_schema:
--> 882             self.validate_schemas()
    883 
    884         if filters is not None:

C:\User\Anaconda3\lib\site-packages\pyarrow\parquet.py in validate_schemas(self)
    893                 self.schema = self.common_metadata.schema
    894             else:
--> 895                 self.schema = self.pieces[0].get_metadata(open_file).schema
    896         elif self.schema is None:
    897             self.schema = self.metadata.schema

IndexError: list index out of range

I'd appreciate any help with this error.

Ideally I need to append all the new data added to S3 since the previous run of this script to the Pandas dataframe, so I was thinking of passing a list of filenames to ParquetDataset (as in the sketch below). Is there a better way to achieve this? Thanks
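Something like this is what I had in mind (untested sketch; the second file name is just a placeholder following the same pattern):

import s3fs
import pyarrow.parquet as pq

fs = s3fs.S3FileSystem(key=mykey, secret=mysecret)

# ParquetDataset also accepts a list of file paths, so the idea is to
# pass only the files added since the last run:
new_files = [
    's3://mys3bucket/directory/1_0_00000000000000014012',
    's3://mys3bucket/directory/1_0_00000000000000014013',  # placeholder
]
df = pq.ParquetDataset(new_files, filesystem=fs).read().to_pandas()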

Upvotes: 2

Views: 2837

Answers (2)

Vincent Claes

Reputation: 4768

For Python 3.6+, AWS has a library called aws-data-wrangler (awswrangler) that helps with the integration between Pandas, S3, and Parquet.

To install it:

pip install awswrangler

To read a single Parquet file from S3 using awswrangler 1.x.x and above:

import awswrangler as wr
df = wr.s3.read_parquet(path="s3://my_bucket/path/to/data_folder/my-file.parquet")

To read all the Parquet files under an S3 prefix as a single dataset:

import awswrangler as wr
df = wr.s3.read_parquet(path="s3://my_bucket/path/to/data_folder/", dataset=True)

By setting dataset=True, awswrangler reads all the individual Parquet files below the S3 key.
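read_parquet also accepts a list of S3 object paths rather than a prefix, which matches the "defined list of files" part of the question. A minimal sketch (the file names below are placeholders):

import awswrangler as wr

# Placeholder names; substitute whichever new files were added since the last run.
paths = [
    "s3://my_bucket/path/to/data_folder/file-1.parquet",
    "s3://my_bucket/path/to/data_folder/file-2.parquet",
]
df = wr.s3.read_parquet(path=paths)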

Upvotes: 0

Wes McKinney

Reputation: 105481

You want to use pq.read_table (pass a file path or an open file handle) instead of pq.ParquetDataset (which expects a directory). HTH
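A minimal sketch of that approach, reusing the question's bucket, credentials, and file name (the list of new files is assumed to come from elsewhere):

import pandas as pd
import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem(key=mykey, secret=mysecret)

# Read each file through an open file handle and concatenate the results.
new_files = ['mys3bucket/directory/1_0_00000000000000014012']  # placeholder list
frames = []
for path in new_files:
    with fs.open(path, 'rb') as f:
        frames.append(pq.read_table(f).to_pandas())

df = pd.concat(frames, ignore_index=True)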

Upvotes: 1
