Reputation: 12920
I am using Databricks with Azure, so I don't have a way to provide the number of executors and memory per executor.
Let's consider I have the following configuration.
Now, if I set my shuffle partitions to 10 (less than the total number of cores, 40), what would happen?
Will it create a total of 10 executors, one per node, with each executor occupying all the cores and all the memory?
Upvotes: 0
Views: 1001
Reputation: 27373
If you don't use dynamic allocation, you will end up leaving most cores unused during execution. Think of it this way: you have 40 "slots" available for computation, but only 10 tasks to process, so 30 "slots" will be empty (just idle).
I have to add that the above is a very simplified picture. In reality, you can have multiple stages running in parallel, so depending on your query you may still have all 40 cores utilized (see e.g. Does stages in an application run parallel in spark?).
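A quick way to see the effect is to check the partition count after a shuffle. This is only a minimal sketch (the toy DataFrame and the key expression are made up; it assumes an existing SparkSession named spark, as in a Databricks notebook):

import spark.implicits._

// with only 10 shuffle partitions, the post-shuffle stage has 10 tasks,
// so on a 40-core cluster at most 10 cores work on that stage at a time
spark.conf.set("spark.sql.shuffle.partitions", "10")

val df = (1 to 1000000).toDF("id")
val counts = df.groupBy(($"id" % 40).as("key")).count()

println(counts.rdd.getNumPartitions)  // 10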
Note also that spark.sql.shuffle.partitions is not the only parameter which determines the number of tasks/partitions. You can end up with a different number of partitions if you modify your query using repartition, e.g. when using:
df
  .repartition(100, $"key")
  .groupBy($"key").count
your value of spark.sql.shuffle.partitions=10 will be overridden by 100 in this exchange step.
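To make that concrete, here is a small sketch you can run to confirm it (assumes an existing SparkSession spark; the DataFrame is just a toy example with a key column):

import spark.implicits._

val df = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("key", "value")

// the exchange introduced by repartition(100, $"key") drives the partitioning,
// so the aggregation runs over 100 partitions, not the configured 10
val result = df
  .repartition(100, $"key")
  .groupBy($"key").count

println(result.rdd.getNumPartitions)  // 100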
Upvotes: 1
Reputation: 530
What you're describing as an expectation is called dynamic allocation in Spark. You can provide a minimum and maximum number of executors, and the framework will then scale depending on the number of partitions. https://spark.apache.org/docs/latest/configuration.html#dynamic-allocation
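As a rough sketch of what those settings look like when you build the SparkSession yourself (the min/max values below are arbitrary examples; on Databricks the equivalent is autoscaling configured on the cluster itself):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("dynamic-allocation-sketch")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.minExecutors", "2")
  .config("spark.dynamicAllocation.maxExecutors", "10")
  .config("spark.shuffle.service.enabled", "true") // an external shuffle service is typically required
  .getOrCreate()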
But with only 10 partitions on a 100 GB file you will get OutOfMemoryErrors.
Upvotes: 0