Martin Studer

Reputation: 2321

YARN cluster mode reduces number of executor instances

I'm provisioning a Google Cloud Dataproc cluster in the following way:

    gcloud dataproc clusters create spark --async --image-version 1.2 \
      --master-machine-type n1-standard-1 --master-boot-disk-size 10 \
      --worker-machine-type n1-highmem-8 --num-workers 4 --worker-boot-disk-size 10 \
      --num-worker-local-ssds 1

Launching a Spark application in yarn-cluster mode with

will only ever launch 3 executor instances instead of the requested 4, effectively "wasting" a full worker node, which appears to run nothing but the driver. Setting spark.executor.cores=7 to "reserve" a core on a worker node for the driver does not help either.

What configuration is required to be able to run the driver in yarn-cluster mode alongside executor processes, making optimal use of the available resources?

Upvotes: 0

Views: 549

Answers (1)

Angus Davis

Reputation: 2683

An n1-highmem-8 worker on Dataproc 1.2 is configured with 40960m allocatable per YARN NodeManager. Instructing Spark to use 36g of heap memory per executor adds 3.6g of memoryOverhead (0.1 * heap memory by default), and YARN rounds the resulting request up to its allocation increment, so each executor container occupies the full 40960m.
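The executor sizing above can be sketched in shell; the 1024m rounding increment is an assumption about yarn.scheduler.minimum-allocation-mb on this image:

```shell
heap_mb=$((36 * 1024))                 # spark.executor.memory=36g -> 36864m
overhead_mb=$((heap_mb / 10))          # default overhead: max(384m, 0.10 * heap) -> 3686m
request_mb=$((heap_mb + overhead_mb))  # 40550m requested from YARN
inc=1024                               # assumed yarn.scheduler.minimum-allocation-mb
alloc_mb=$(( (request_mb + inc - 1) / inc * inc ))
echo "$alloc_mb"                       # 40960m: the entire NodeManager capacity
```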

The driver will use 1g of heap plus 384m of memoryOverhead (the minimum value), which YARN allocates as a 2g container. Because the driver always launches before the executors, its memory is allocated first. Once three executors have been placed on the remaining nodes, the fourth 40960m executor request finds no node with that much memory free, and in particular no executor container can be placed on the node already hosting the driver.
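The same arithmetic for the driver container, under the same assumed 1024m allocation increment, shows why the driver's node can no longer fit a 40960m executor:

```shell
driver_heap_mb=1024                                  # spark.driver.memory=1g
driver_overhead_mb=384                               # memoryOverhead floor value
request_mb=$((driver_heap_mb + driver_overhead_mb))  # 1408m requested
inc=1024                                             # assumed yarn.scheduler.minimum-allocation-mb
alloc_mb=$(( (request_mb + inc - 1) / inc * inc ))   # rounds up to 2048m (2g)
free_mb=$((40960 - alloc_mb))
echo "$free_mb"                                      # 38912m left beside the driver, short of 40960m
```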

Using spark.executor.memory=34g will allow the driver and an executor to run on the same node: 34g of heap plus its 3.4g overhead rounds up to 38912m, which is exactly what remains on the driver's node after its 2g container.
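A submission along these lines would apply the fix; the deploy flags and application jar are illustrative placeholders, since the original command was not shown:

```shell
# Hypothetical spark-submit; only the executor memory setting is the actual fix.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.executor.memory=34g \
  --conf spark.executor.instances=4 \
  my-app.jar   # placeholder application jar
```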

Upvotes: 2
