Reputation: 35
In PySpark, I'm writing a very large DataFrame into 5,000,000+ partitions, to use them essentially as a hashed database where any single partition can be read in O(1) time.
df.repartition(*cols).write.partitionBy(*cols).mode("overwrite").parquet(s3_path_prefix)
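For context, a single partition is then read back by pointing Spark at the corresponding partition directory, roughly like the sketch below (the column name and value are placeholders for whatever cols contains):

# Hedged sketch: O(1)-style lookup of one partition by path prefix.
# "key_col" and "some_value" stand in for the actual partition column
# and the lookup key; only that partition's files are scanned.
single_partition_df = spark.read.parquet(f"{s3_path_prefix}/key_col=some_value")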
The total AWS S3 cost of that operation was around $450. (This is a daily operation, so the cost is incurred every day.)
This means around 19 S3 requests were made per partition.
19 operations seems like a lot for a single partition.
(_SUCCESS, _committed, _started, part-00000 files are created per partition)
Is there a way to decrease the number of S3 calls made per partition?
Or, better yet, is there a cheaper way to achieve my original goal (saving a large PySpark DataFrame so that a single partition can be read in O(1)) without using S3?
Upvotes: 0
Views: 361
Reputation: 13480
If you switch to an S3A committer, marginally fewer S3 calls are made, especially on Hadoop 3.3.6+, where all the nominal safety checks can be switched off.
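To illustrate, enabling the "magic" S3A committer from PySpark looks roughly like the sketch below; this assumes the spark-hadoop-cloud module and a recent hadoop-aws are on the classpath, and the exact keys can vary between versions:

# Hedged sketch: switching to the S3A "magic" committer.
# Key names follow the Hadoop S3A committer documentation; verify them
# against the Spark/Hadoop versions actually deployed.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.s3a.committer.name", "magic")
    .config("spark.hadoop.fs.s3a.committer.magic.enabled", "true")
    .config("spark.sql.sources.commitProtocolClass",
            "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
    .config("spark.sql.parquet.output.committer.class",
            "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
    .getOrCreate()
)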
You can turn off a lot of DELETE calls with:
spark.hadoop.fs.s3a.directory.marker.retention keep
See the docs for backwards compatibility: https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/directory_markers.html
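Applied when building the session, that setting looks roughly like this:

# Hedged sketch: keep directory markers so the client skips the DELETE
# calls it would otherwise issue for parent "directories". Check the
# linked docs before enabling this with older Hadoop clients.
spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.s3a.directory.marker.retention", "keep")
    .getOrCreate()
)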
Otherwise: switch to a table format to save a lot of directory scan time/cost in query planning.
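As one hedged example of what that could look like with Apache Iceberg (the catalog and table names are made up, and an Iceberg catalog has to be configured separately):

# Hedged sketch: writing to an Iceberg table so query planning reads table
# metadata instead of listing millions of S3 "directories".
# "my_catalog.db.hashed_lookup" and "key_col" are placeholders.
from pyspark.sql import functions as F

(df.writeTo("my_catalog.db.hashed_lookup")
   .partitionedBy(F.col("key_col"))
   .createOrReplace())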
Be aware that the S3A code is optimised for performance over S3 I/O cost, as cluster time generally dominates. But we do try to cut out I/O where we can, since it adds time. The most recent Hadoop versions and any "unsafe" switches are useful here...
Upvotes: 0