D. Müller

Reputation: 3426

Spark on YARN resource manager: Relation between YARN Containers and Spark Executors

I'm new to Spark on YARN and don't understand the relation between the YARN Containers and the Spark Executors. I tried out the following configuration, based on the results of the yarn-utils.py script, that can be used to find optimal cluster configuration.

The Hadoop cluster (HDP 2.4) I'm working on has 12 cores, 64 GB of RAM, and 4 disks per node, with HBase installed.

So I ran python yarn-utils.py -c 12 -m 64 -d 4 -k True (c = cores, m = memory in GB, d = disks, k = HBase installed) and got the following result:

 Using cores=12 memory=64GB disks=4 hbase=True
 Profile: cores=12 memory=49152MB reserved=16GB usableMem=48GB disks=4
 Num Container=8
 Container Ram=6144MB
 Used Ram=48GB
 Unused Ram=16GB
 yarn.scheduler.minimum-allocation-mb=6144
 yarn.scheduler.maximum-allocation-mb=49152
 yarn.nodemanager.resource.memory-mb=49152
 mapreduce.map.memory.mb=6144
 mapreduce.map.java.opts=-Xmx4915m
 mapreduce.reduce.memory.mb=6144
 mapreduce.reduce.java.opts=-Xmx4915m
 yarn.app.mapreduce.am.resource.mb=6144
 yarn.app.mapreduce.am.command-opts=-Xmx4915m
 mapreduce.task.io.sort.mb=2457

I applied these settings via the Ambari interface and restarted the cluster. The values also roughly match what I had calculated manually beforehand.
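
That manual calculation was essentially the following shell arithmetic (the ~80% heap-to-container ratio behind the -Xmx values is the common rule of thumb, so take that factor as an assumption rather than something the script documents):

 # Back-of-the-envelope check of the yarn-utils.py output
 TOTAL_MB=$((64 * 1024))               # 64 GB physical RAM per node
 RESERVED_MB=$((16 * 1024))            # reserved for OS + HBase
 USABLE_MB=$((TOTAL_MB - RESERVED_MB)) # 49152 MB usable
 CONTAINERS=8                          # "Num Container" from the script output
 CONTAINER_MB=$((USABLE_MB / CONTAINERS))   # 6144 MB per container
 HEAP_MB=$((CONTAINER_MB * 8 / 10))         # ~4915 MB, matches -Xmx4915m
 echo "container=${CONTAINER_MB}MB heap=${HEAP_MB}MB"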

I now have a few questions about the relation between the YARN containers and the Spark executors.

I found the post What is a container in YARN?, but it didn't really help, as it doesn't describe the relation to the executors.

Can someone help answer one or more of these questions?

Upvotes: 19

Views: 12363

Answers (2)

neverGiveUp

Reputation: 1

When running Spark on YARN, each Spark executor runs as a YARN container. Where MapReduce schedules a container and starts a JVM for each task, Spark hosts multiple tasks within the same container. This approach enables several orders of magnitude faster task startup time.

In YARN, each application instance has an ApplicationMaster process, which is the first container started for that application.
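
As a minimal illustration (the flag values and the jar name here are placeholders, not recommendations), a submission like the following makes YARN start one container per requested executor, plus one container for the ApplicationMaster:

 # Placeholder example: YARN starts 4 executor containers + 1 AM container.
 # Each executor container runs up to --executor-cores tasks concurrently,
 # which is where the faster task startup comes from.
 spark-submit \
   --master yarn \
   --deploy-mode cluster \
   --num-executors 4 \
   --executor-cores 2 \
   --executor-memory 4g \
   my-app.jar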

Upvotes: 0

D. Müller

Reputation: 3426

I will report my insights here step by step:

  • The first important thing is this fact (source: this Cloudera documentation):

    When running Spark on YARN, each Spark executor runs as a YARN container. [...]

  • This means the number of containers will always equal the number of executors created by a Spark application, e.g. via the --num-executors parameter of spark-submit.

  • Governed by yarn.scheduler.minimum-allocation-mb, every container always allocates at least this amount of memory. This means that if --executor-memory is set to e.g. only 1g while yarn.scheduler.minimum-allocation-mb is e.g. 6g, the container ends up much bigger than the Spark application needs (see the sketch after this list).

  • Conversely, if --executor-memory is set to something higher than the yarn.scheduler.minimum-allocation-mb value, e.g. 12g, the container will allocate more memory, but only if the requested amount is smaller than or equal to the yarn.scheduler.maximum-allocation-mb value.

  • The value of yarn.nodemanager.resource.memory-mb determines how much memory can be allocated in total by all containers on one host.

=> So lowering yarn.scheduler.minimum-allocation-mb allows you to run smaller containers, e.g. for smaller executors (otherwise memory would be wasted).

=> Raising yarn.scheduler.maximum-allocation-mb (e.g. up to yarn.nodemanager.resource.memory-mb) allows you to define bigger executors (more memory is allocated if requested, e.g. via the --executor-memory parameter).
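
To make the two memory cases above concrete, here is a sketch (the jar name and executor counts are placeholders; note also that YARN sizes containers from --executor-memory plus Spark's executor memory overhead):

 # Case 1: request far below yarn.scheduler.minimum-allocation-mb=6144.
 # Each executor container is still rounded up to 6144 MB -- mostly wasted.
 spark-submit --master yarn --num-executors 4 --executor-memory 1g my-app.jar

 # Case 2: request above the minimum. Granted as long as the request
 # (12g + overhead) stays <= yarn.scheduler.maximum-allocation-mb (49152 MB here).
 spark-submit --master yarn --num-executors 2 --executor-memory 12g my-app.jar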

Upvotes: 40
