Daniel Martínez

Reputation: 145

Is there a way to specify all three resource properties (executor instances, cores and memory) in Spark on YARN (Dataproc)?

I'm trying to set up a small Dataproc Spark cluster of 3 workers (2 regular and one preemptible), but I'm running into problems.

Specifically, I've been struggling to find a way to give Spark application submitters the freedom to specify the number of executors while still being able to control how many cores are assigned to each of them.

The Dataproc image configures YARN and Spark with a set of defaults.
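The two defaults that matter here are presumably the following (inferred from the behavior described below; treat them as an assumption rather than an exhaustive list):

capacity-scheduler:yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator
spark:spark.dynamicAllocation.enabled=true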

With these defaults the requested number of cores is not taken into account (the container-to-vCore ratio is always 1:1), because DefaultResourceCalculator only considers memory. In any case, when configured this way the number of executors is honored (by setting spark.dynamicAllocation.enabled=false and spark.executor.instances=<num> as properties in gcloud dataproc jobs submit).

So I changed the resource calculator to DominantResourceCalculator, and now it honors the requested cores, but I'm no longer able to specify the number of executors, whether or not I disable Spark dynamic allocation.

It might also be of interest that the default YARN queue is limited by configuration (in capacity-scheduler.xml) to 30% of capacity, the remaining 70% being assigned to another, not-yet-used queue. My understanding is that neither the Capacity nor the Fair scheduler limits resource allocation for uncontended job submissions as long as maximum-capacity is kept at 100. In any case, for the sake of clarity, these are the properties set during cluster creation:

capacity-scheduler:yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DominantResourceCalculator
capacity-scheduler:yarn.scheduler.capacity.root.queues=default,online
capacity-scheduler:yarn.scheduler.capacity.root.default.capacity=30
capacity-scheduler:yarn.scheduler.capacity.root.online.capacity=70
capacity-scheduler:yarn.scheduler.capacity.root.online.user-limit-factor=1
capacity-scheduler:yarn.scheduler.capacity.root.online.maximum-capacity=100
capacity-scheduler:yarn.scheduler.capacity.root.online.state=RUNNING
capacity-scheduler:yarn.scheduler.capacity.root.online.acl_submit_applications=*
capacity-scheduler:yarn.scheduler.capacity.root.online.acl_administer_queue=*
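For context, these properties are passed at cluster creation, roughly like this (the cluster name and worker counts are placeholders, and only the resource-calculator property is spelled out; note that gcloud needs its alternate-delimiter syntax, e.g. --properties ^#^key=value#key=value, whenever a property value itself contains a comma, as yarn.scheduler.capacity.root.queues=default,online does):

gcloud dataproc clusters create my-cluster \
    --num-workers 2 \
    --num-preemptible-workers 1 \
    --properties capacity-scheduler:yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DominantResourceCalculator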

The job submission is done by means of the gcloud tool, and the queue used is the default one.

E.g., the following properties set when executing gcloud dataproc jobs submit:

--properties spark.dynamicAllocation.enabled=false,spark.executor.memory=5g,spark.executor.instances=3
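Spelled out as a full command, the submission looks roughly like this (the cluster name, main class and jar path are placeholders):

gcloud dataproc jobs submit spark \
    --cluster my-cluster \
    --class com.example.MyJob \
    --jars gs://my-bucket/my-job.jar \
    --properties spark.dynamicAllocation.enabled=false,spark.executor.memory=5g,spark.executor.instances=3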

end up in the following assignment:

(Screenshot: YARN console showing the resulting resource assignment)

Is there a way to configure YARN so that it accepts both?

EDITED to specify queue setup

Upvotes: 0

Views: 629

Answers (1)

George

Reputation: 1516

You may try setting a higher value, such as 2, for yarn.scheduler.capacity.root.online.user-limit-factor in place of the current value of 1. This setting lets a single user consume up to twice the queue's configured capacity; since you keep the maximum capacity at 100%, that doubling can actually take effect.
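Concretely, that would mean setting the following cluster property (shown for the online queue, which the property above targets; the analogous key for the default queue would be yarn.scheduler.capacity.root.default.user-limit-factor):

capacity-scheduler:yarn.scheduler.capacity.root.online.user-limit-factor=2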

Upvotes: 1
