user2523485

Reputation:

Apache Spark tuning in a distributed environment

I'd like to maximize the performance of my Hadoop cluster in a distributed environment (using Apache Spark on YARN), and I'm following the hints in a Cloudera blog post that assumes this configuration:

6 nodes, 16 cores/node, 64 GB RAM/node

and the proposed solution is: --num-executors 17 --executor-cores 5 --executor-memory 19G

But I don't understand why they use 17 executors (in other words, about 3 executors per node).

Our configuration is instead:

8 nodes, 8 cores/node, 8 GB RAM/node

What is the best solution?

Upvotes: 0

Views: 222

Answers (1)

Dan Ciborowski - MSFT

Reputation: 7207

Your RAM is pretty low. I would expect it to be higher.

But let's start with what we have: 8 nodes and 8 cores per node. To determine the maximum number of executors, take nodes * (cores - 1) = 8 * 7 = 56; we subtract 1 core from each node for management (OS and Hadoop daemons).

So I would start off with 56 executors, 1 core per executor, and 1 GB of RAM each.
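As a rough sketch, that starting point would look something like this on the command line (your_app.jar is just a placeholder for the actual application):

# starting point: many thin executors (1 core, 1 GB each)
spark-submit \
  --master yarn \
  --num-executors 56 \
  --executor-cores 1 \
  --executor-memory 1G \
  your_app.jar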

If you have out-of-memory issues, double the RAM, halve the executors, and double the cores: 28 executors, 2 cores per executor, 2 GB of RAM each. But your effective maximum will be lower, because each executor must fit onto a single node: with 7 usable cores per node you can fit only 3 two-core executors per node, so you will get a total of 24 allocated containers max.
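Again just a sketch of that step, with the container math spelled out in the comments (my arithmetic, following the reasoning above):

# per node: 8 cores - 1 for management = 7 usable cores
# 7 usable cores / 2 cores per executor = 3 executors per node
# 8 nodes * 3 executors per node = 24 containers max, even if 28 are requested
spark-submit \
  --master yarn \
  --num-executors 28 \
  --executor-cores 2 \
  --executor-memory 2G \
  your_app.jar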

Next I would try 3 cores before 4 cores: with 3-core executors you can fit 2 executors on each node (using 6 of the 7 usable cores), while with 4-core executors only 1 fits per node, which gives you the same number of executors as using 7 cores.
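A sketch of the 3-core variant; note the 3G memory figure is my own assumption (two executors sharing a node's 8 GB, with some headroom left), since no memory value is given for this case:

# per node: 7 usable cores / 3 cores per executor = 2 executors per node
# 8 nodes * 2 executors per node = 16 executors
# 3G per executor is an assumed value, not a recommendation from above
spark-submit \
  --master yarn \
  --num-executors 16 \
  --executor-cores 3 \
  --executor-memory 3G \
  your_app.jar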

Or, you can skip right to one fat executor per node: 8 executors, 7 cores, 7 GB of RAM each (you want to leave some memory for the rest of the cluster).
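And the fat-executor variant as a sketch:

# one fat executor per node: 8 nodes * 1 executor = 8 executors
spark-submit \
  --master yarn \
  --num-executors 8 \
  --executor-cores 7 \
  --executor-memory 7G \
  your_app.jar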

I also found that if CPU scheduling was disabled, YARN was overriding my cores setting: every container stayed at 1 vcore, no matter what I configured. Other settings must also be changed to turn CPU scheduling on; in particular, set the following (typically in capacity-scheduler.xml):

yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DominantResourceCalculator

Upvotes: 1
