Reputation: 13128
I tried to read parquet from s3 like this:
import dask.dataframe as dd

s3_path = "s3://my_bucket/my_table"
times = dd.read_parquet(
    s3_path,
    storage_options={
        "client_kwargs": {
            "endpoint_url": bucket_endpoint_url,
        },
        "profile_name": bucket_profile,
    },
    engine='pyarrow',
)
It takes a very long time just to create the dask dataframe. No computation is performed on the dataframe yet. I traced the code and it looks like it is spending the time in pyarrow.parquet.validate_schema().
My parquet table has lots of files in it (~2000 files), and it takes 543 sec on my laptop just to create the dataframe, because it checks the schema of every single parquet file. Is there a way to disable schema validation?
Thanks,
Upvotes: 4
Views: 1720
Reputation: 57271
Currently, if there is no metadata file and you're using the PyArrow backend, Dask probably sends a request to read metadata from each individual partition on S3. This is quite slow.
Dask's dataframe parquet reader is being rewritten now to help address this. Until then, you might consider using fastparquet together with the ignore_divisions keyword (or something like that), or checking back in a month or two.
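A minimal sketch of the fastparquet suggestion, reusing the variables from your question (s3_path, bucket_endpoint_url, bucket_profile); the exact keyword for skipping the per-file checks may differ depending on your Dask version, so treat this as an outline rather than the definitive call:

import dask.dataframe as dd

# Same call as in the question, but with the fastparquet engine.
# fastparquet can read a single _metadata file when one exists,
# avoiding a metadata request per partition on S3.
times = dd.read_parquet(
    s3_path,
    storage_options={
        "client_kwargs": {"endpoint_url": bucket_endpoint_url},
        "profile_name": bucket_profile,
    },
    engine="fastparquet",
)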
Upvotes: 2