Reputation: 35
In PySpark, I'm writing a very large DataFrame into 5,000,000+ partitions, to use them essentially as a hashed database where any single partition can be read in O(1) time.
df.repartition(*cols).write.partitionBy(*cols).mode("overwrite").parquet(s3_path_prefix)
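For context, a single partition is then read back by pointing Spark at the corresponding partition directory, roughly like the sketch below (the column name and value are placeholders for whatever cols contains):

# Hedged sketch: O(1)-style lookup of one partition by path prefix.
# "key_col" and "some_value" stand in for the actual partition column
# and the lookup key; only that partition's files are scanned.
single_partition_df = spark.read.parquet(f"{s3_path_prefix}/key_col=some_value")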
The total AWS S3 cost of that operation was around $450. (This is a daily operation, so the cost is incurred every day.)
This means around 19 S3 requests were made per partition.
19 operations seems like a lot for a single partition.
(_SUCCESS, _committed, _started, part-00000 files are created per partition)
Is there a way to decrease the number of S3 calls made per partition?
Or, better yet, is there a cheaper way to achieve my original goal (saving a large PySpark DataFrame so that a single partition can be read in O(1)) without using S3?
Upvotes: 0
Views: 361
Reputation: 13480
If you switch to an S3A committer, marginally fewer S3 calls are made, especially on Hadoop 3.3.6+, where all the nominal safety checks can be switched off.
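To illustrate, enabling the "magic" S3A committer from PySpark looks roughly like the sketch below; this assumes the spark-hadoop-cloud module and a recent hadoop-aws are on the classpath, and the exact keys can vary between versions:

# Hedged sketch: switching to the S3A "magic" committer.
# Key names follow the Hadoop S3A committer documentation; verify them
# against the Spark/Hadoop versions actually deployed.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.s3a.committer.name", "magic")
    .config("spark.hadoop.fs.s3a.committer.magic.enabled", "true")
    .config("spark.sql.sources.commitProtocolClass",
            "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
    .config("spark.sql.parquet.output.committer.class",
            "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
    .getOrCreate()
)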
You can turn off a lot of DELETE calls with:
spark.hadoop.fs.s3a.directory.marker.retention keep
See the docs for backwards compatibility: https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/directory_markers.html
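Applied when building the session, that setting looks roughly like this:

# Hedged sketch: keep directory markers so the client skips the DELETE
# calls it would otherwise issue for parent "directories". Check the
# linked docs before enabling this with older Hadoop clients.
spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.s3a.directory.marker.retention", "keep")
    .getOrCreate()
)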
Otherwise: switch to a table format to save a lot of directory scan time/cost in query planning.
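As one hedged example of what that could look like with Apache Iceberg (the catalog and table names are made up, and an Iceberg catalog has to be configured separately):

# Hedged sketch: writing to an Iceberg table so query planning reads table
# metadata instead of listing millions of S3 "directories".
# "my_catalog.db.hashed_lookup" and "key_col" are placeholders.
from pyspark.sql import functions as F

(df.writeTo("my_catalog.db.hashed_lookup")
   .partitionedBy(F.col("key_col"))
   .createOrReplace())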
Be aware that the S3A code is optimised for performance over S3 I/O cost, as cluster time generally dominates. But we do try to cut out I/O where we can, since it adds time. The most recent Hadoop versions and any "unsafe" switches are useful here...
Upvotes: 0