Reputation: 5745
I have an RDD with ~150K elements, and writing it to S3 takes about 5 hours. I want to speed it up. I've tried adjusting the Spark parameters, and I'm currently using the following settings:
--executor-memory 19g --num-executors 17 --executor-cores 5 --driver-memory 3g
this is how I write to S3:
rdd.repartition(1).write.partitionBy('id').save(path=s3_path, mode='overwrite', format='json')
However, all the tweaks to the Spark parameters suggested by various online resources have only shaved minutes off the process. I'm looking for ideas on what to try to decrease the run time by 50%.
Upvotes: 0
Views: 719
Reputation: 216
The problem is the repartition(1). By doing that you force Spark to use only one core for writing: you put all of your data on a single executor of your cluster and leave the other 16 executors without data or tasks.
You can use repartition if you want to balance the sizes of your final output files, but make sure you use enough cores to speed up the writing.
In your example, you have up to 17 * 5 = 85 cores available for writing. You could repartition to 32, for example, as a trade-off between speed and the number of output files.
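As a minimal sketch (keeping the question's variable name, which must actually be a DataFrame since .write is a DataFrame API; 32 is only an illustrative value, not a tuned number):
rdd.repartition(32).write.partitionBy('id').save(path=s3_path, mode='overwrite', format='json')
With up to 85 task slots available, anything roughly between 32 and 85 partitions keeps the cluster busy while still producing a manageable number of output files per 'id' directory.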
Upvotes: 2