Sean Nguyen

Reputation: 13128

dask.dataframe.read_parquet takes too long

I tried to read parquet from s3 like this:

import dask.dataframe as dd

s3_path = "s3://my_bucket/my_table"
times = dd.read_parquet(
    s3_path,
    storage_options={
        "client_kwargs": {
            "endpoint_url": bucket_endpoint_url,
        },
        "profile_name": bucket_profile,
    },
    engine="pyarrow",
)

It takes a very long time just to create the Dask dataframe. No computation is performed on this dataframe yet. I traced the code and it looks like it is spending the time in pyarrow.parquet.validate_schema()

My parquet table has lots of files in it (~2000 files), and it takes 543 seconds on my laptop just to create the dataframe. It appears to be checking the schema of every parquet file. Is there a way to disable schema validation?

Thanks,

Upvotes: 4

Views: 1720

Answers (1)

MRocklin

Reputation: 57271

Currently, if there is no `_metadata` file and you're using the PyArrow backend, Dask is probably sending a request to read the metadata from each of the individual partitions on S3. This is quite slow.
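If you control the dataset, one possible workaround (a sketch, not something stated in this answer) is to consolidate the per-file footers into a single `_metadata` file once, so that later reads only need to fetch one small object instead of ~2000 footers. The snippet below assumes `fastparquet` and `s3fs` are installed and that `fastparquet.writer.merge` is available in your version; `bucket_endpoint_url` and `bucket_profile` are the variables from the question (newer s3fs releases call the keyword `profile` rather than `profile_name`).

    import s3fs
    from fastparquet import writer

    # Open the bucket with the same credentials used in the question
    fs = s3fs.S3FileSystem(
        profile_name=bucket_profile,
        client_kwargs={"endpoint_url": bucket_endpoint_url},
    )

    # List the individual parquet files that make up the table
    file_list = fs.glob("my_bucket/my_table/*.parquet")

    # Combine their footers into a single _metadata file in the table root
    writer.merge(file_list, open_with=fs.open)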

Dask's dataframe parquet reader is being rewritten now to help address this. Until then, you might consider using fastparquet together with the ignore_divisions keyword (or something like that), or check back in a month or two.
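A minimal sketch of that switch, reusing the path and storage options from the question (the fastparquet engine will read a consolidated `_metadata` file if one exists, rather than opening every partition):

    import dask.dataframe as dd

    times = dd.read_parquet(
        "s3://my_bucket/my_table",
        engine="fastparquet",
        storage_options={
            "client_kwargs": {"endpoint_url": bucket_endpoint_url},
            "profile_name": bucket_profile,
        },
    )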

Upvotes: 2
