Miguel Silva

Reputation: 1

Python awswrangler performance under large number of partitions

I need to store/fetch data using two hierarchy levels, date and class. So when I upload data to S3 as part of the ETL pipeline, I'm using awswrangler's to_parquet function with partition_cols=["date", "class"]. To fetch data from the S3 bucket, I'm using the read_parquet function with partition_filter=filter_func, where filter_func is similar to lambda x: x["date"] in date_list and x["class"] in class_list.
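For reference, a minimal sketch of that setup, assuming a dataset-style layout; the bucket path, DataFrame contents, and filter values are placeholders, not taken from my actual pipeline:

    import awswrangler as wr
    import pandas as pd

    # Placeholder data and bucket path, for illustration only.
    df = pd.DataFrame({
        "date": ["2023-01-01", "2023-01-01", "2023-01-02"],
        "class": ["a", "b", "a"],
        "value": [1.0, 2.0, 3.0],
    })
    path = "s3://my-bucket/my-dataset/"

    # Write: one S3 prefix per (date, class) combination, so 60k-70k
    # classes produce 60k-70k prefixes under each date.
    wr.s3.to_parquet(df=df, path=path, dataset=True,
                     partition_cols=["date", "class"])

    # Read: awswrangler calls the filter once per partition, passing
    # the partition values as a dict of strings.
    date_list = {"2023-01-01"}
    class_list = {"a"}
    filter_func = lambda x: x["date"] in date_list and x["class"] in class_list
    out = wr.s3.read_parquet(path=path, dataset=True,
                             partition_filter=filter_func)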

My issue is that class can take a large number of distinct values (60k to 70k), so writing to and reading from S3 take much longer than I expected, to the point where it is infeasible for my application. This has left me wondering if there is a more efficient way to implement these read and write operations.

One example would be keeping only the date partition and filtering the resulting pandas DataFrame after reading the Parquet files (see the sketch below). Is there any other alternative?
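A sketch of that alternative, using the same placeholder data and path as above: partition by date only, so S3 holds tens of prefixes rather than tens of thousands, push the date filter down via partition_filter, and apply the class filter in memory after the read:

    import awswrangler as wr
    import pandas as pd

    # Same placeholder data and path as in the sketch above.
    df = pd.DataFrame({
        "date": ["2023-01-01", "2023-01-01", "2023-01-02"],
        "class": ["a", "b", "a"],
        "value": [1.0, 2.0, 3.0],
    })
    path = "s3://my-bucket/my-dataset/"

    # Write with a single partition column: far fewer S3 objects and
    # prefix listings per date.
    wr.s3.to_parquet(df=df, path=path, dataset=True,
                     partition_cols=["date"])

    date_list = {"2023-01-01"}
    class_list = {"a"}

    # Partition pruning handles date; the class filter then runs in
    # pandas on the loaded DataFrame.
    out = wr.s3.read_parquet(path=path, dataset=True,
                             partition_filter=lambda x: x["date"] in date_list)
    out = out[out["class"].isin(class_list)]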

Upvotes: 0

Views: 124

Answers (0)
