michaelgbj

Reputation: 294

modin.pandas.read_parquet() fails with KeyError: 'ETag' when reading a partitioned parquet dataset from S3

I created a DataFrame with pandas and used to_parquet(...) to write it directly to S3.

The call was:

df.to_parquet('s3://bucket/fn.parquet', compression='gzip', engine='fastparquet', partition_cols=['col1'])

When I read it back with pandas.read_parquet(url), the DataFrame loads fine.

But when I use modin.pandas.read_parquet(url), I get the following error:

 File "/home/mguo/anaconda3/envs/testenv/lib/python3.7/site-packages/s3fs/core.py", line 1779, in __init__
    self.req_kw["IfMatch"] = self.details["ETag"]
KeyError: 'ETag'

My versions are:

python==3.7.3
pandas==1.2.4
modin==0.10.0
s3fs==2021.6.0

Upvotes: 4

Views: 2736

Answers (1)

Mahesh Vashishtha

Reputation: 176

An issue on the Modin GitHub tracked support for reading partitioned files with read_parquet, which is what you are trying to do here. A pull request on the Modin GitHub added that feature and resolved the issue. If you upgrade to Modin 0.12.0 (the latest version at the time of writing), you should be able to read partitioned parquet files without the ETag KeyError.
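To confirm which side of the fix your environment is on, you can compare the installed Modin version against 0.12.0 before calling read_parquet. This is a minimal sketch, assuming the fix is present from 0.12.0 onward as described in this answer; `version_tuple` and `supports_partitioned_parquet` are hypothetical helper names for illustration, not part of Modin.

```python
import re

# Version named in this answer as containing the partitioned-parquet fix.
MIN_MODIN = "0.12.0"


def version_tuple(version):
    """Convert a version string such as '0.10.0' into a tuple of ints."""
    numbers = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            # Stop at the first piece with no leading digits (e.g. a git hash).
            break
        numbers.append(int(match.group()))
    return tuple(numbers)


def supports_partitioned_parquet(installed, minimum=MIN_MODIN):
    """Return True if the installed version is at least the minimum version."""
    return version_tuple(installed) >= version_tuple(minimum)


if __name__ == "__main__":
    try:
        import modin
    except ImportError:
        print("modin is not installed")
    else:
        if supports_partitioned_parquet(modin.__version__):
            print("modin", modin.__version__, "should include the fix")
        else:
            print("modin", modin.__version__, "predates the fix; try upgrading")
```

With the versions from the question, `supports_partitioned_parquet("0.10.0")` is False, so upgrading (for example with `pip install --upgrade modin`) would be the next step before retrying `modin.pandas.read_parquet(url)`.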

Upvotes: 1
