user2280549

Reputation: 1234

SparkR - override default parameters in spark.conf

I am using SparkR (Spark 2.0.0, YARN) on a cluster with the following configuration: 5 machines (24 cores + 200 GB RAM each). I wanted to run sparkR.session() with additional arguments so that only a percentage of the total resources is assigned to my job:

if(Sys.getenv("SPARK_HOME") == "") Sys.setenv(SPARK_HOME = "/...")

library(SparkR, lib.loc = file.path(Sys.getenv('SPARK_HOME'), "R", "lib"))

sparkR.session(master = "spark://host:7077",
               appName = "SparkR",
               sparkHome = Sys.getenv("SPARK_HOME"),
               sparkConfig = list(spark.driver.memory = "2g"
                                 ,spark.executor.memory = "20g"
                                 ,spark.executor.cores = "4"
                                 ,spark.executor.instances = "10"),
               enableHiveSupport = TRUE)

The weird thing is that the parameters seem to be passed to the SparkContext, but at the same time I end up with executors that together use 100% of the cluster's resources (in this example 5 * 24 cores = 120 cores available; 120 / 4 cores per executor = 30 executors instead of the requested 10).

I tried creating another spark-defaults.conf with no default parameters assigned (so the only defaults are the ones listed in the Spark documentation, which should be easy to override) by setting:

if(Sys.getenv("SPARK_CONF_DIR") == "") Sys.setenv(SPARK_CONF_DIR = "/...")

Again, when I look at the Spark UI at http://driver-node:4040, the total number of executors isn't correct ("Executors" tab), while at the same time all the config parameters in the "Environment" tab are exactly the ones I provided in the R script.
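The same check can also be done from R rather than the UI; a minimal sketch, assuming a SparkR version that provides sparkR.conf() (newer 2.x releases; otherwise the "Environment" tab is the place to look):

# Read back the runtime configuration as the driver sees it; the keys are
# the ones set via sparkConfig above.
sparkR.conf("spark.driver.memory")
sparkR.conf("spark.executor.memory")
sparkR.conf("spark.executor.cores")
sparkR.conf("spark.executor.instances")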

Does anyone know what the reason might be? Is the problem in the R API or some infrastructure-/cluster-specific issue (like YARN settings)?

Upvotes: 1

Views: 1622

Answers (1)

dtsbg

Reputation: 23

I found that you have to use spark.driver.extraJavaOptions, e.g.:

spark <- sparkR.session(master = "yarn",
                        sparkConfig = list(
                          spark.driver.memory = "2g",
                          spark.driver.extraJavaOptions = paste0(
                            "-Dhive.metastore.uris=", Sys.getenv("HIVE_METASTORE_URIS"),
                            " -Dspark.executor.instances=", Sys.getenv("SPARK_EXECUTORS"),
                            " -Dspark.executor.cores=", Sys.getenv("SPARK_CORES"))
                        ))
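The HIVE_METASTORE_URIS, SPARK_EXECUTORS and SPARK_CORES variables above are ordinary environment variables of my own, not something SparkR defines; set them before building the session, for example (the values below are placeholders):

# Plain environment variables referenced by the snippet above;
# values are placeholders for illustration only.
Sys.setenv(HIVE_METASTORE_URIS = "thrift://metastore-host:9083",
           SPARK_EXECUTORS = "10",
           SPARK_CORES = "4")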

Alternatively, you can change the spark-submit arguments via SPARKR_SUBMIT_ARGS, e.g.:

Sys.setenv("SPARKR_SUBMIT_ARGS"="--master yarn --driver-memory 10g sparkr-shell")

Upvotes: 2
