이준서

Reputation: 35

How to decrease number of calls made to s3 when writing parquet files in spark?

In PySpark, I'm writing a very large dataframe into 5,000,000+ partitions, essentially to use them like a hashed database whose individual partitions I can access in O(1) time.

df.repartition(*cols).write.partitionBy(*cols).mode("overwrite").parquet(s3_path_prefix)

The total AWS S3 cost of that operation was around $450. (This is a daily operation, so the cost is incurred once every day.)

This means around 19 S3 requests were made per partition.

19 operations seems like a lot for a single partition.
(_SUCCESS, _committed, _started, part-00000 files are created per partition)
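
As a rough sanity check of that per-partition figure (the per-request price below is an assumed ballpark based on the published S3 Standard PUT/COPY/POST/LIST rate, not a number from my bill):

# back-of-envelope estimate; adjust the request price for your region
partitions = 5_000_000
daily_cost_usd = 450.0
price_per_1000_requests_usd = 0.005
total_requests = daily_cost_usd / price_per_1000_requests_usd * 1000  # ~90 million
print(total_requests / partitions)  # roughly 18-19 requests per partition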

Is there a way to decrease the number of S3 calls made per partition?

Or better yet, is there a cheaper way to achieve my original goal (saving a large PySpark dataframe in a way that a single partition can be read in O(1)) without using S3?

Upvotes: 0

Views: 361

Answers (1)

stevel

Reputation: 13480

If you switch to an S3A committer then marginally fewer S3 calls are made, especially on Hadoop 3.3.6+, where most of the nominal safety checks can be skipped.
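
For example, roughly how you would enable the magic committer from PySpark (a sketch, not a drop-in recipe; it assumes the hadoop-aws and spark-hadoop-cloud modules are on the classpath, and the app name is made up):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3a-magic-committer")  # name is just for illustration
    # route Spark's commit protocol through the S3A committer factory
    .config("spark.sql.sources.commitProtocolClass",
            "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
    .config("spark.sql.parquet.output.committer.class",
            "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
    # pick the "magic" committer: tasks write straight to the destination
    # via multipart uploads, so there is no rename/copy phase at job commit
    .config("spark.hadoop.fs.s3a.committer.name", "magic")
    .config("spark.hadoop.fs.s3a.committer.magic.enabled", "true")
    .getOrCreate()
)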

You can turn off a lot of DELETE calls with:

spark.hadoop.fs.s3a.directory.marker.retention keep

See the docs for the backwards-compatibility implications: https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/directory_markers.html
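
As a sketch, setting it at session build time (it could equally go in spark-defaults.conf or, as fs.s3a.directory.marker.retention, in core-site.xml):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # "keep" skips the DELETE calls that would otherwise clean up the
    # parent "directory marker" objects after files are written under them
    .config("spark.hadoop.fs.s3a.directory.marker.retention", "keep")
    .getOrCreate()
)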

Otherwise: switch to a table format (Apache Iceberg, Delta Lake, Hudi, ...) to save a lot of directory-scan time/cost in query planning.
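
For instance, a sketch of the same write as a Delta table (assumes the delta-spark package is available; cols and s3_path_prefix are the variables from the question):

# same dataframe, written as a Delta table instead of raw parquet;
# readers then use the table's transaction log rather than listing
# millions of S3 "directories" at query planning time
(
    df.repartition(*cols)
      .write.format("delta")
      .partitionBy(*cols)
      .mode("overwrite")
      .save(s3_path_prefix)
)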

Be aware that the S3A code is optimised for performance rather than for S3 IO cost, as cluster time generally dominates. But we do try to cut out IO when we can, as it also adds time. The most recent Hadoop versions and any "unsafe" switches are useful here...

Upvotes: 0
