Reputation: 664
On Google Dataproc, I was wondering how the Spark settings are determined. In my case, I am running a 3-node n1-standard-4 cluster, and the automatically generated spark-defaults.conf looks like this:
# User-supplied properties.
#Fri Dec 16 12:01:47 UTC 2016
spark.yarn.am.memoryOverhead=558
spark.executor.memory=5586m
spark.executor.cores=2
spark.driver.memory=3840m
spark.yarn.executor.memoryOverhead=558
spark.driver.maxResultSize=1920m
spark.yarn.am.memory=5586m
I am wondering why the config is set that way, and especially why spark.yarn.am.memory is so high. As far as I understand, this setting only takes effect in client mode, where the driver runs on the submitting machine (the master). Also, the AM is "only" in charge of requesting resources for the worker processes and coordinating them, so why should am.memory be that high? In my scenario, this default actually means I can only launch one Spark job in client mode, because there is simply no RAM left anywhere in the cluster for a second AM. (This is what I observed and why I looked into the config in the first place.)
So, again, my question: how does the Dataproc startup script decide how to set these values, what is the rationale behind it, and why should am.memory specifically be that high?
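To make the packing problem concrete, here is a rough sketch of the arithmetic as I understand it. The per-node NodeManager memory (12288 MB on an n1-standard-4) and the 1-master/2-worker layout are my assumptions, not values read from the cluster:

```python
# Illustrative arithmetic only; NODEMANAGER_MB and WORKERS are assumptions.
NODEMANAGER_MB = 12288          # assumed YARN memory per worker node (n1-standard-4)
WORKERS = 2                     # assumed: 3-node cluster = 1 master + 2 workers

# From the generated spark-defaults.conf above:
am_container = 5586 + 558       # spark.yarn.am.memory + spark.yarn.am.memoryOverhead
executor_container = 5586 + 558 # spark.executor.memory + spark.yarn.executor.memoryOverhead

cluster_mb = WORKERS * NODEMANAGER_MB        # 24576 MB of YARN memory in total
slots = cluster_mb // am_container           # 6144 MB containers that fit cluster-wide

# One client-mode job takes 1 AM + up to 3 executors = all 4 slots,
# leaving no room for a second job's 6144 MB AppMaster.
print(slots)  # 4
```

Under these assumptions, a single job can consume every container slot in the cluster, which would match what I observed.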
Upvotes: 2
Views: 1961
Reputation: 1349
By default, Dataproc gives both Spark AppMasters and executors half of the memory available to each NodeManager (regardless of the size of the node).
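A minimal sketch of that arithmetic. The 12288 MB NodeManager figure for an n1-standard-4 and the roughly 10% heap/overhead split are assumptions on my part, but they happen to reproduce the posted values exactly:

```python
import math

# Assumed yarn.nodemanager.resource.memory-mb on an n1-standard-4 (15 GB VM).
NODEMANAGER_MB = 12288

container_mb = NODEMANAGER_MB // 2       # half of each NodeManager -> 6144 MB per container

# Assumed split: ~10% of the container is reserved as off-heap overhead.
heap_mb = math.ceil(container_mb / 1.1)  # -> spark.executor.memory / spark.yarn.am.memory
overhead_mb = container_mb - heap_mb     # -> spark.yarn.*.memoryOverhead

print(heap_mb, overhead_mb)  # 5586 558
```

Note that heap plus overhead sums back to exactly half the NodeManager memory, which is why an AM container and an executor container are the same size.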
Why the AppMaster is that large is a good question. The only real answer is to support YARN cluster mode on small VMs. Dataproc is also optimized for single-tenant ephemeral clusters, so shrinking the AppMaster wouldn't help much if there weren't other small containers to pack alongside it.
The Dataproc team is working on improving the default configs in future image versions. If you have suggestions, you are more than welcome to reach out at [email protected].
Upvotes: 3