Reputation: 81
I need to incrementally load data into Pandas from Parquet files stored in S3. I'm trying to use PyArrow for this but not having any luck.
Reading an entire directory of Parquet files into Pandas works just fine:
import s3fs
import pyarrow.parquet as pq
import pandas as pd
fs = s3fs.S3FileSystem(key=mykey, secret=mysecret)
p_dataset = pq.ParquetDataset('s3://mys3bucket/directory', filesystem=fs)
df = p_dataset.read().to_pandas()
But when I try to load a single Parquet file I get an error:
fs = s3fs.S3FileSystem(key=mykey, secret=mysecret)
p_dataset = pq.ParquetDataset('s3://mys3bucket/directory/1_0_00000000000000014012', filesystem=fs)
df = p_dataset.read().to_pandas()
This throws the following error:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-179-3d01b32c60f7> in <module>()
15 p_dataset = pq.ParquetDataset(
16 's3://mys3bucket/directory/1_0_00000000000000014012',
---> 17 filesystem=fs)
18
19 table2.to_pandas()
C:\User\Anaconda3\lib\site-packages\pyarrow\parquet.py in __init__(self, path_or_paths, filesystem, schema, metadata, split_row_groups, validate_schema, filters, metadata_nthreads)
880
881 if validate_schema:
--> 882 self.validate_schemas()
883
884 if filters is not None:
C:\User\Anaconda3\lib\site-packages\pyarrow\parquet.py in validate_schemas(self)
893 self.schema = self.common_metadata.schema
894 else:
--> 895 self.schema = self.pieces[0].get_metadata(open_file).schema
896 elif self.schema is None:
897 self.schema = self.metadata.schema
IndexError: list index out of range
Would appreciate any help with this error.
Ideally I need to append all new data added to S3 (since the previous time I ran this script) to the Pandas dataframe, so I was thinking I could pass a list of filenames to ParquetDataset, something like the sketch below. Is there a better way to achieve this? Thanks
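A rough, untested sketch of that idea (assuming ParquetDataset accepts a list of file paths via its path_or_paths argument; new_keys is a hypothetical list of the keys added since the last run):
import s3fs
import pyarrow.parquet as pq
import pandas as pd
fs = s3fs.S3FileSystem(key=mykey, secret=mysecret)
# hypothetical: however the new keys since the last run are determined
new_keys = ['mys3bucket/directory/1_0_00000000000000014012']
p_dataset = pq.ParquetDataset(new_keys, filesystem=fs)
new_df = p_dataset.read().to_pandas()
# append the new rows to the existing dataframe
df = pd.concat([df, new_df], ignore_index=True)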
Upvotes: 2
Views: 2837
Reputation: 4768
For Python 3.6+, AWS has a library called aws-data-wrangler (imported as awswrangler) that helps with the integration between Pandas, S3 and Parquet.
To install it, do:
pip install awswrangler
To read a single Parquet file from S3 using awswrangler 1.x.x and above, do:
import awswrangler as wr
df = wr.s3.read_parquet(path="s3://my_bucket/path/to/data_folder/my-file.parquet")
To read all Parquet files under an S3 prefix, do:
import awswrangler as wr
df = wr.s3.read_parquet(path="s3://my_bucket/path/to/data_folder/", dataset=True)
By setting dataset=True, awswrangler will read all the individual Parquet files below the S3 key.
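For the incremental-load part of the question, wr.s3.read_parquet also accepts last_modified_begin / last_modified_end datetime filters (assuming a reasonably recent awswrangler version), so only objects added since the last run are read. A sketch, where last_run is a timestamp your script tracks itself:
import datetime
import awswrangler as wr
# only read objects written to S3 after the last run (timezone-aware datetime)
last_run = datetime.datetime(2020, 1, 1, tzinfo=datetime.timezone.utc)
df = wr.s3.read_parquet(
    path="s3://my_bucket/path/to/data_folder/",
    dataset=True,
    last_modified_begin=last_run,
)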
Upvotes: 0
Reputation: 105481
You want to use pq.read_table (which takes a file path or file handle) for a single file instead of pq.ParquetDataset (which takes a directory). HTH
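For example, a minimal sketch reusing the s3fs setup from the question (the file is opened through s3fs and the handle passed to pq.read_table; new_keys is a placeholder for whatever list of new files you build):
import s3fs
import pyarrow.parquet as pq
import pandas as pd
fs = s3fs.S3FileSystem(key=mykey, secret=mysecret)
# single file: open it via s3fs and pass the file handle to read_table
with fs.open('mys3bucket/directory/1_0_00000000000000014012', 'rb') as f:
    df = pq.read_table(f).to_pandas()
# several files: read each one and concatenate the resulting frames
new_keys = ['mys3bucket/directory/1_0_00000000000000014012']  # placeholder list
frames = []
for key in new_keys:
    with fs.open(key, 'rb') as f:
        frames.append(pq.read_table(f).to_pandas())
df = pd.concat(frames, ignore_index=True)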
Upvotes: 1