Reputation: 12920
I am using Databricks with Azure, so I don't have a way to provide the number of executors and memory per executor.
Let's consider I have the following configuration.
Now, if I set my shuffle partitions to 10 (less than the total number of cores, 40), what would happen?
Will it create a total of 10 executors, one per node, with each executor occupying all the cores and all the memory?
Upvotes: 0
Views: 1001
Reputation: 27373
If you don't use dynamic allocation, you will end up leaving most cores unused during execution. Think of it this way: you have 40 "slots" available for computation, but only 10 tasks to process, so 30 "slots" will be empty (just idle).
I have to add that the above is a very simplified picture. In reality, you can have multiple stages running in parallel, so depending on your query you may still have all 40 cores utilized (see e.g. Does stages in an application run parallel in spark?).
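A quick way to see the effect is to check the partition count after a shuffle. This is only a minimal sketch (the toy DataFrame and the key expression are made up; it assumes an existing SparkSession named spark, as in a Databricks notebook):

import spark.implicits._

// with only 10 shuffle partitions, the post-shuffle stage has 10 tasks,
// so on a 40-core cluster at most 10 cores work on that stage at a time
spark.conf.set("spark.sql.shuffle.partitions", "10")

val df = (1 to 1000000).toDF("id")
val counts = df.groupBy(($"id" % 40).as("key")).count()

println(counts.rdd.getNumPartitions)  // 10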
Note also that spark.sql.shuffle.partitions is not the only parameter which determines the number of tasks/partitions. You can end up with a different number of partitions if you modify your query using repartition, e.g. when using:
df
  .repartition(100, $"key")
  .groupBy($"key").count
your value of spark.sql.shuffle.partitions=10 will be overridden by 100 in this exchange step.
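To make that concrete, here is a small sketch you can run to confirm it (assumes an existing SparkSession spark; the DataFrame is just a toy example with a key column):

import spark.implicits._

val df = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("key", "value")

// the exchange introduced by repartition(100, $"key") drives the partitioning,
// so the aggregation runs over 100 partitions, not the configured 10
val result = df
  .repartition(100, $"key")
  .groupBy($"key").count

println(result.rdd.getNumPartitions)  // 100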
Upvotes: 1
Reputation: 530
What you're describing as an expectation is called dynamic allocation in Spark. You can provide a minimum and maximum number of executors, and the framework will then scale depending on the number of partitions. https://spark.apache.org/docs/latest/configuration.html#dynamic-allocation
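As a rough sketch of what those settings look like when you build the SparkSession yourself (the min/max values below are arbitrary examples; on Databricks the equivalent is autoscaling configured on the cluster itself):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("dynamic-allocation-sketch")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.minExecutors", "2")
  .config("spark.dynamicAllocation.maxExecutors", "10")
  .config("spark.shuffle.service.enabled", "true") // an external shuffle service is typically required
  .getOrCreate()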
But with only 10 partitions on a 100 GB file you will get OutOfMemoryErrors.
Upvotes: 0