Reputation: 664
On Google Dataproc, I was wondering how the Spark settings are determined. In my case, I am running a 3-node n1-standard-4 cluster, and the automatically generated spark-defaults.conf looks like this:
# User-supplied properties.
#Fri Dec 16 12:01:47 UTC 2016
spark.yarn.am.memoryOverhead=558
spark.executor.memory=5586m
spark.executor.cores=2
spark.driver.memory=3840m
spark.yarn.executor.memoryOverhead=558
spark.driver.maxResultSize=1920m
spark.yarn.am.memory=5586m
I am wondering why the config is set that way, and especially why spark.yarn.am.memory is so high. As far as I understand, this setting only takes effect in client mode, where the driver runs on the submitting machine (the master). Also, the AM is "only" in charge of requesting resources for the worker processes and coordinating them, so why should am.memory be that high? In my scenario, this default actually means I can only launch one Spark job in client mode, because there is simply no RAM left anywhere in the cluster for a second AM. (This is what I observed and why I looked into the config in the first place.)
So, again, my question: how does the Dataproc startup script decide how to set these values, what is the rationale behind it, and why should am.memory specifically be that high?
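To make the packing problem concrete, here is a rough sketch of the arithmetic as I understand it. The per-node NodeManager memory (12288 MB on an n1-standard-4) and the 1-master/2-worker layout are my assumptions, not values read from the cluster:

```python
# Illustrative arithmetic only; NODEMANAGER_MB and WORKERS are assumptions.
NODEMANAGER_MB = 12288          # assumed YARN memory per worker node (n1-standard-4)
WORKERS = 2                     # assumed: 3-node cluster = 1 master + 2 workers

# From the generated spark-defaults.conf above:
am_container = 5586 + 558       # spark.yarn.am.memory + spark.yarn.am.memoryOverhead
executor_container = 5586 + 558 # spark.executor.memory + spark.yarn.executor.memoryOverhead

cluster_mb = WORKERS * NODEMANAGER_MB        # 24576 MB of YARN memory in total
slots = cluster_mb // am_container           # 6144 MB containers that fit cluster-wide

# One client-mode job takes 1 AM + up to 3 executors = all 4 slots,
# leaving no room for a second job's 6144 MB AppMaster.
print(slots)  # 4
```

Under these assumptions, a single job can consume every container slot in the cluster, which would match what I observed.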
Upvotes: 2
Views: 1961
Reputation: 1349
By default, Dataproc gives both Spark AppMasters and executors half of the memory available to each NodeManager (regardless of the size of the node).
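A minimal sketch of that arithmetic. The 12288 MB NodeManager figure for an n1-standard-4 and the roughly 10% heap/overhead split are assumptions on my part, but they happen to reproduce the posted values exactly:

```python
import math

# Assumed yarn.nodemanager.resource.memory-mb on an n1-standard-4 (15 GB VM).
NODEMANAGER_MB = 12288

container_mb = NODEMANAGER_MB // 2       # half of each NodeManager -> 6144 MB per container

# Assumed split: ~10% of the container is reserved as off-heap overhead.
heap_mb = math.ceil(container_mb / 1.1)  # -> spark.executor.memory / spark.yarn.am.memory
overhead_mb = container_mb - heap_mb     # -> spark.yarn.*.memoryOverhead

print(heap_mb, overhead_mb)  # 5586 558
```

Note that heap plus overhead sums back to exactly half the NodeManager memory, which is why an AM container and an executor container are the same size.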
Why the AppMaster is that large is a good question. The only real answer is to support YARN cluster mode on small VMs. Dataproc is also optimized for single-tenant ephemeral clusters, so shrinking the AppMaster wouldn't help much if there weren't other small containers to pack alongside it.
The Dataproc team is working on improving the default configs in future image versions. If you have suggestions, you are more than welcome to reach out at [email protected].
Upvotes: 3