Reputation: 91
We have a cluster of 4 nodes with the characteristics above:
Our Spark jobs take a long time to process. How could we optimize this time, knowing that our jobs run from RStudio and that a lot of memory is still left unused?
Upvotes: 0
Views: 541
Reputation: 1057
To add more context to the answer above, I would like to explain how to set the parameters --num-executors, --executor-memory and --executor-cores appropriately.
The following answer covers the 3 main aspects mentioned in the title: number of executors, executor memory and number of cores. There may be other parameters, such as driver memory, that I do not address in this answer.
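As a quick preview before walking through the numbers, all three flags are passed on the spark-submit command line. A minimal sketch, where the resource values and my-app.jar are placeholders rather than the numbers derived below:

# Sketch only: placeholder values, shown to illustrate where the flags go.
spark-submit \
  --master yarn \
  --num-executors 10 \
  --executor-cores 5 \
  --executor-memory 10g \
  my-app.jar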
Case 1 Hardware: 6 nodes, each with 16 cores and 64 GB RAM
Each executor is a JVM instance, so we can have multiple executors on a single node.
First, 1 core and 1 GB of RAM are needed for the OS and Hadoop daemons, so 15 cores and 63 GB of RAM are available on each node.
Let's go through how to choose these parameters one by one.
Number of cores:
Number of cores = the number of concurrent tasks an executor can run.
So we might think that more concurrent tasks per executor will give better performance.
But it has been observed that any application with more than 5 concurrent tasks per executor performs poorly (HDFS client throughput, in particular, degrades), so stick to 5.
This number comes from an executor's ability to run tasks concurrently, not from how many cores the system has. So the number 5 stays the same even if you have double the cores (32) in the CPU.
Number of executors:
Coming back to the next step: with 5 cores per executor and 15 available cores per node (CPU), we come to 3 executors per node.
So with 6 nodes and 3 executors per node, we get 18 executors. Out of these 18 we need 1 executor (Java process) for the YARN Application Master (AM), which leaves 17 executors.
This 17 is the number we give to Spark via --num-executors when running from the spark-submit shell command.
Memory for each executor:
From the step above, we have 3 executors per node, and the available RAM is 63 GB.
So the memory for each executor is 63/3 = 21 GB.
However, a small memory overhead is also needed to determine the full memory request to YARN for each executor. The formula for that overhead is max(384 MB, 0.07 * spark.executor.memory).
Calculating that overhead: 0.07 * 21 GB (the 21 GB computed above as 63/3) = 1.47 GB.
Since 1.47 GB > 384 MB, the overhead is 1.47 GB. Subtracting it from the 21 GB gives 21 - 1.47 ≈ 19 GB.
So executor memory is roughly 19 GB.
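Note that the overhead does not have to be left to the default formula; it can also be set explicitly. A sketch only: the property is spark.yarn.executor.memoryOverhead in older Spark releases (spark.executor.memoryOverhead in newer ones), the value is in MB, and 2048 here is purely illustrative:

# Sketch: explicitly request a per-executor overhead from YARN (value in MB).
# 2048 is an illustrative number, not one derived from this cluster.
spark-submit \
  --master yarn \
  --executor-memory 19g \
  --conf spark.yarn.executor.memoryOverhead=2048 \
  my-app.jar

As a sanity check on the 19 GB figure: a 19 GB heap plus max(384 MB, 0.07 * 19 GB) ≈ 1.33 GB of overhead means YARN is asked for roughly 20.3 GB per executor, so 3 such executors still fit within the 63 GB available on each node.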
Final numbers: 17 executors in total, 5 cores per executor, and 19 GB of memory per executor.
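Plugging the derived numbers into a spark-submit call would look roughly like this; my-app.jar and the YARN master setting are placeholders, not part of the original answer:

# Sketch using the numbers derived above; my-app.jar is a placeholder application.
spark-submit \
  --master yarn \
  --num-executors 17 \
  --executor-cores 5 \
  --executor-memory 19g \
  my-app.jar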
Assigning resources to the Spark jobs in the cluster this way speeds them up by using the available resources efficiently.
Upvotes: 1
Reputation: 273
I recommend you have a look at these parameters (a usage sketch follows the list):
--num-executors: controls how many executors will be allocated
--executor-memory: RAM for each executor
--executor-cores: cores for each executor
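The same settings can also be passed as Spark properties instead of dedicated flags, which is convenient when the job is not launched directly through spark-submit (for example when connecting from a front end such as RStudio). A sketch with illustrative values only:

# spark.executor.instances, spark.executor.cores and spark.executor.memory
# are the property equivalents of the three flags above; values are illustrative.
spark-submit \
  --master yarn \
  --conf spark.executor.instances=4 \
  --conf spark.executor.cores=4 \
  --conf spark.executor.memory=8g \
  my-app.jar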
Upvotes: 0