user2280549

Reputation: 1234

SparkR - override default parameters in spark.conf

I am using SparkR (Spark 2.0.0, YARN) on a cluster with the following configuration: 5 machines (24 cores + 200 GB RAM each). I wanted to run sparkR.session() with additional arguments so that only a percentage of the total resources is assigned to my job:

if(Sys.getenv("SPARK_HOME") == "") Sys.setenv(SPARK_HOME = "/...")

library(SparkR, lib.loc = file.path(Sys.getenv('SPARK_HOME'), "R", "lib"))

sparkR.session(master = "spark://host:7077",
               appName = "SparkR",
               sparkHome = Sys.getenv("SPARK_HOME"),
               sparkConfig = list(spark.driver.memory = "2g"
                                 ,spark.executor.memory = "20g"
                                 ,spark.executor.cores = "4"
                                 ,spark.executor.instances = "10"),
               enableHiveSupport = TRUE)

The weird thing is that the parameters seem to be passed to the SparkContext, but at the same time I end up with executors that together use 100% of the cluster's resources (in this example 5 * 24 cores = 120 cores available; 120 / 4 cores per executor = 30 executors instead of the requested 10).

I tried creating another spark-defaults.conf with no default parameters assigned (so the only defaults are the ones listed in the Spark documentation, which should be easy to override) by setting:

if(Sys.getenv("SPARK_CONF_DIR") == "") Sys.setenv(SPARK_CONF_DIR = "/...")

Again, when I look at the Spark UI at http://driver-node:4040, the total number of executors isn't correct ("Executors" tab), while at the same time all the config parameters in the "Environment" tab are exactly the ones I provided in the R script.
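The same check can also be done from R rather than the UI; a minimal sketch, assuming a SparkR version that provides sparkR.conf() (newer 2.x releases; otherwise the "Environment" tab is the place to look):

# Read back the runtime configuration as the driver sees it; the keys are
# the ones set via sparkConfig above.
sparkR.conf("spark.driver.memory")
sparkR.conf("spark.executor.memory")
sparkR.conf("spark.executor.cores")
sparkR.conf("spark.executor.instances")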

Does anyone know what the reason might be? Is the problem in the R API or some infrastructure-/cluster-specific issue (like YARN settings)?

Upvotes: 1

Views: 1622

Answers (1)

dtsbg

Reputation: 23

I found that you have to use spark.driver.extraJavaOptions, e.g.:

spark <- sparkR.session(master = "yarn",
                        sparkConfig = list(
                          spark.driver.memory = "2g",
                          spark.driver.extraJavaOptions = paste0(
                            "-Dhive.metastore.uris=", Sys.getenv("HIVE_METASTORE_URIS"),
                            " -Dspark.executor.instances=", Sys.getenv("SPARK_EXECUTORS"),
                            " -Dspark.executor.cores=", Sys.getenv("SPARK_CORES"))
                        ))
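The HIVE_METASTORE_URIS, SPARK_EXECUTORS and SPARK_CORES variables above are ordinary environment variables of my own, not something SparkR defines; set them before building the session, for example (the values below are placeholders):

# Plain environment variables referenced by the snippet above;
# values are placeholders for illustration only.
Sys.setenv(HIVE_METASTORE_URIS = "thrift://metastore-host:9083",
           SPARK_EXECUTORS = "10",
           SPARK_CORES = "4")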

Alternatively, you can change the spark-submit arguments via SPARKR_SUBMIT_ARGS, e.g.:

Sys.setenv("SPARKR_SUBMIT_ARGS"="--master yarn --driver-memory 10g sparkr-shell")

Upvotes: 2
