Ray Chase

Reputation: 43

Setting Spark configuration through environment variable, command line arguments or code?

I'm learning Spark these days, but I'm a little confused by Spark configuration. AFAIK, there are at least three ways to configure it:

  1. Environment variables, http://spark.apache.org/docs/latest/spark-standalone.html
  2. Command line arguments, like ./bin/spark-submit --class <main-class> --master xxx --deploy-mode xxx --conf key=value
  3. In code, e.g. via a SparkConf object in Scala/Java.

Why are there so many ways to do it, what are the differences? Is there a best practice for this?

Upvotes: 2

Views: 7422

Answers (3)

KrisP

Reputation: 1216

To answer your question directly:

  • You use configuration in source code when you expect an important parameter never to change and not to be hardware-dependent, e.g. conf.set("spark.eventLog.enabled", "true") (although, arguably, you might leave that particular one out of source code; it could just as well go in the properties file, the third option here).

  • You use command-line options for parameters that change from run to run, e.g. driver-memory or executor-cores. You expect these to change depending on which hardware you run on (or while tuning), so such settings shouldn't live in your source code.

  • You use a properties file for settings that don't change often, e.g. if you always run your app on the same hardware you might define spark.driver.memory there (a template is in the conf directory of your $SPARK_HOME). A sketch of this split follows the list.
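
For instance, a minimal sketch of that split (the class name, jar, and values are illustrative, not prescriptive). Stable, hardware-bound settings go in $SPARK_HOME/conf/spark-defaults.conf:

# spark-defaults.conf: settings that rarely change
spark.driver.memory      4g
spark.eventLog.enabled   true

while per-run tunables stay on the command line:

./bin/spark-submit --class com.example.MyApp --executor-cores 4 myApp.jar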

Upvotes: 3

Chris Fregly

Reputation: 1530

A couple of rules that I follow:

1) Avoid any of the SPARK_CAPITAL_LETTER_SHOUTING_AT_YOU config params from spark-env.sh, as they don't seem to work in some cases.

2) Prefer, instead, the spark.nice.and.calm.lower.case config params from spark-defaults.conf.

3) For anything non-obvious or job-specific, create a script and pass the settings explicitly as --conf spark.config.param=value on the spark-submit call to highlight them. A sketch of such a script follows.
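
A minimal sketch of such a wrapper script (the script name, class, and values are made up for illustration):

#!/usr/bin/env bash
# submit-myjob.sh: keeps the job-specific settings visible in one place
./bin/spark-submit \
  --class com.example.MyJob \
  --master yarn \
  --conf spark.sql.shuffle.partitions=400 \
  --conf spark.speculation=true \
  myJob.jar "$@"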

Upvotes: 1

Joe Widen

Reputation: 2448

Spark resolves configuration settings according to a precedence hierarchy. You've already figured out that there are many ways to set configs, which is confusing. Here is the order Spark uses, highest priority first:

  1. Set in code on the SparkConf or SparkContext
  2. Passed in at runtime on the command line
  3. Read from a config file specified with --properties-file at runtime
  4. Spark environment defaults (spark-defaults.conf and spark-env.sh)

As an example, let's create a simple Spark application:

import org.apache.spark.{SparkConf, SparkContext}

// the name set here sits at the top of the precedence hierarchy
val conf = new SparkConf()
conf.setAppName("InCodeApp")
val sc = new SparkContext(conf)

If you were to run this application and try to override the app name set in the code:

spark-submit --name "CLI App" myApp.jar

the application name would still be "InCodeApp", because the value set in code takes precedence.
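
You can read the winning value back from the context to confirm this (a quick sanity check, not part of the original example):

// prints "InCodeApp": the in-code setting outranks --name
println(sc.getConf.get("spark.app.name"))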

Because of this hierarchy, I've found it best to leave most properties to be set at the command line, with the exception of configurations that should never change (like enabling speculative execution or Kryo serialization).
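
For example, the never-changing settings could be pinned in code (values here are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

// settings that should be identical on every run are pinned in code
val conf = new SparkConf()
  .set("spark.speculation", "true")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)

with everything hardware- or run-dependent passed at submit time:

spark-submit --driver-memory 4g --executor-cores 2 myApp.jar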

Upvotes: 1
