Reputation: 137
I currently have a Spark cluster of 1 Driver and 2 Workers on version 2.4.5.
I would like to optimize parallelism further to get better throughput when loading and processing data. While doing this, I often see these messages on the console:
WARN scheduler.TaskSetManager: Stage contains a task of very large size (728 KB). The maximum recommended task size is 100 KB.
How does this work? I am fairly new to Spark but understand the basics of it. I would like to know how to optimize this, but I'm not sure whether it involves configuring the workers to have more executors and in that way get more parallelism, or whether I need to partition my DataFrames with either the coalesce or repartition functions.
Thank you guys in advance!
Upvotes: 0
Views: 992
Reputation: 18108
The general gist here is that you need to repartition to get more, but smaller, partitions, so as to get more parallelism and higher throughput. The 728 KB is the serialized size of a task in your stage; Spark warns when that exceeds the recommended 100 KB. I hit this myself sometimes when I first started out with Scala and Spark.
I cannot see your code, so I will leave it at this, but searching here on SO also points to a lack of parallelism as the usual cause. In all honesty, it is quite a well-known issue.
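As a minimal sketch of what I mean (the input/output paths and the partition multiplier are just illustrative assumptions, not something from your setup):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical session and paths, for illustration only.
val spark = SparkSession.builder.appName("repartition-example").getOrCreate()

val df = spark.read.parquet("/data/events") // assumed input path

// More, smaller partitions -> more parallel tasks and smaller task payloads.
// A common rule of thumb is 2-4 partitions per available core.
val cores = spark.sparkContext.defaultParallelism
val repartitioned = df.repartition(cores * 3)

repartitioned.write.parquet("/data/events_out") // assumed output path
```

Note that `coalesce` only merges partitions downward and avoids a full shuffle, so it cannot increase parallelism; `repartition` triggers a shuffle but can raise the partition count.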
Upvotes: 1