Sean Nguyen

Reputation: 13128

dask.dataframe.read_parquet takes too long

I tried to read parquet from s3 like this:

import dask.dataframe as dd

s3_path = "s3://my_bucket/my_table"
times = dd.read_parquet(
    s3_path,
    storage_options={
        "client_kwargs": {
            "endpoint_url": bucket_endpoint_url,
        },
        "profile_name": bucket_profile,
    },
    engine="pyarrow",
)

It takes a very long time just to create the Dask dataframe. No computation is performed on this dataframe yet. I traced the code and it looks like it is spending the time in pyarrow.parquet.validate_schema()

My parquet table has lots of files in it (~2000 files), and it takes 543 seconds on my laptop just to create the dataframe. It appears to be checking the schema of every parquet file. Is there a way to disable schema validation?

Thanks,

Upvotes: 4

Views: 1720

Answers (1)

MRocklin

Reputation: 57271

Currently, if there is no `_metadata` file and you're using the PyArrow backend, Dask is probably sending a request to read the metadata from each of the individual partitions on S3. This is quite slow.
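If you control the dataset, one possible workaround (a sketch, not something stated in this answer) is to consolidate the per-file footers into a single `_metadata` file once, so that later reads only need to fetch one small object instead of ~2000 footers. The snippet below assumes `fastparquet` and `s3fs` are installed and that `fastparquet.writer.merge` is available in your version; `bucket_endpoint_url` and `bucket_profile` are the variables from the question (newer s3fs releases call the keyword `profile` rather than `profile_name`).

    import s3fs
    from fastparquet import writer

    # Open the bucket with the same credentials used in the question
    fs = s3fs.S3FileSystem(
        profile_name=bucket_profile,
        client_kwargs={"endpoint_url": bucket_endpoint_url},
    )

    # List the individual parquet files that make up the table
    file_list = fs.glob("my_bucket/my_table/*.parquet")

    # Combine their footers into a single _metadata file in the table root
    writer.merge(file_list, open_with=fs.open)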

Dask's dataframe parquet reader is being rewritten now to help address this. Until then, you might consider using fastparquet together with the ignore_divisions keyword (or something like that), or check back in a month or two.
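A minimal sketch of that switch, reusing the path and storage options from the question (the fastparquet engine will read a consolidated `_metadata` file if one exists, rather than opening every partition):

    import dask.dataframe as dd

    times = dd.read_parquet(
        "s3://my_bucket/my_table",
        engine="fastparquet",
        storage_options={
            "client_kwargs": {"endpoint_url": bucket_endpoint_url},
            "profile_name": bucket_profile,
        },
    )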

Upvotes: 2
