scalactic

Reputation: 31

Spark InsertInto Slow Renaming in AWS S3

When using insertInto on AWS S3, the data is first written to a .spark-staging folder and then moved (a copy-and-delete that emulates a rename, since S3 has no native rename) to the final location in batches of 10 files at a time. Moving the staged data 10 files at a time is slow.

Is there any way to increase its parallelism?
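For context, this is roughly the write pattern that produces the staging-then-rename behavior; a minimal sketch, assuming a Hive table named events partitioned by date (the table name and partition column are placeholders, not from the question):

// dynamic partition overwrite routes writes through a staging dir, then renames
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

ds.write
  .mode("overwrite")
  .insertInto("events") // files land in .spark-staging, then get moved to the final partitions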

Upvotes: 0

Views: 35

Answers (1)

Grish

Reputation: 341

That is an issue with the S3 committer; check the documentation:

https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-s3-optimized-commit-protocol.html
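As a hedged aside: on EMR, the EMRFS S3-optimized committer is governed by a Spark setting documented by AWS (the key below comes from the AWS docs, not from this answer, and applies only to EMR); a quick way to check whether it is active:

// defaults to "true" on EMR 5.20.0+, letting Parquet writes skip the slow rename path
spark.conf.get("spark.sql.parquet.fs.optimized.committer.optimization-enabled", "unset")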

You should try to avoid using .option("partitionOverwriteMode", "dynamic").

If your data is partitioned by date, for example, include the date in the output path.

So instead of:

ds.write.mode("overwrite").option("partitionOverwriteMode", "dynamic").partitionBy("date").parquet("s3://basepath/")

do this:

ds.write.mode("overwrite").parquet(s"s3://basepath/date=$someDate/")
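One consequence worth noting: when partitions are written to explicit paths like this, a later read only recovers the date column if Spark is told where partition discovery starts; a small sketch using the standard basePath read option, assuming the same layout:

val df = spark.read
  .option("basePath", "s3://basepath/")      // treat date=... directories as partition columns
  .parquet(s"s3://basepath/date=$someDate/") // df still contains the date column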

Upvotes: 1
