Polars scan s3 multi-part parquet files

Question

I have a multipart partitioned parquet on s3. Each partition contains multiple parquet files. The below code narrows in on a single partition which may contain somewhere around 30 parquet files. When I use scan_parquet on a s3 address that includes *.parquet wildcard, it only looks at the first file in the partition. I verified this with the count of customers. It has the count from just the first file in the partition. Is there a way that it can scan across files?

import polars as pl

s3_loc = "s3://some_bucket/some_parquet/some_partion=123/*.parquet"
df = pl.scan_parquet(s3_loc)
cus_count = df.select(pl.count('customers')).collect()

If I leave off the *.parquet from the s3 address then I get the following error.

exceptions.ArrowErrorException: ExternalFormat("File out of specification: A parquet file must containt a header and footer with at least 12 bytes")

Dean MacGregor · Accepted Answer

New Answer

polars can natively load files from AWS, Azure, GCP, or plain old http and no longer uses fsspec (very much, if at all). Instead, it uses the object_store under the hood. The syntax to use it is.

pl.scan_parquet(
    "s3://some_bucket/some_parquet/some_partion=123/*.parquet", 
    storage_options= dict_of_credentials)

The dict_of_credentials might be a bit different than what you normally feed to initiate an s3fs.S3FileSystem().

Additionally, you can exclude storage_options if you've got your credentials as env variables but the env variables that work for s3fs might not be exactly what object_store is looking for. For instance, this shows what env variables it will look for.

Old Answer using pyarrow.dataset

It looks like from the user guide on multiple files that to do so requires a loop creating many lazy dfs that you then combine together.

Another approach is to use the scan_ds function which takes a pyarrow dataset object.

import polars as pl
import s3fs
import pyarrow.dataset as ds
fs = s3fs.S3FileSystem()
# you can also make a file system with anything fsspec supports
# S3FileSystem is just a wrapper for fsspec
s3_loc = "s3://some_bucket/some_parquet/some_partion=123"
myds = ds.dataset(s3_loc, filesystem=fs)
lazy_df = pl.scan_pyarrow_dataset(myds)
cus_count = lazy_df.select(pl.count('customers')).collect()

Polars scan s3 multi-part parquet files

Answers (2)

New Answer

Old Answer using pyarrow.dataset

Related Questions