yatu

Reputation: 88236

Set spark context configuration prioritizing spark-submit

I'm building a dockerized Spark application that is run through an entrypoint.sh file, which in turn runs spark-submit:

#!/bin/bash

export SPARK_DIST_CLASSPATH=$(hadoop classpath):$HADOOP_HOME/share/hadoop/*
export _JAVA_OPTIONS="-Xms2g -Xmx8g -XX:MaxPermSize=8g"

spark-submit \
    --master local \
    --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:///job/log4j.properties" \
    --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:///job/log4j.properties"\
    --files "/job/log4j.properties" \
     main.py --train_path $1 --test_path $2

As you can see, I'm setting part of the configuration via the spark-submit --conf options.

If in the spark-submit statement I set a configuration parameter that is also set in main.py via:

SparkConf().set(option, value)

then set takes priority over spark-submit, so for any option configured both ways, only the value set via SparkConf().set prevails (see other question).
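For instance, a minimal sketch of this precedence (spark.driver.memory and the values here are just illustrative, not options from my job):

from pyspark import SparkConf, SparkContext

# Suppose spark-submit was launched with --conf spark.driver.memory=4g;
# the explicit set() below still wins once the context is created
conf = SparkConf().set("spark.driver.memory", "2g")
sc = SparkContext(conf=conf)
print(sc.getConf().get("spark.driver.memory"))  # prints 2g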

What I've been trying to achieve is to control the Spark configuration while prioritizing what is set in spark-submit. There is a method that looks relevant, SparkConf.setIfMissing, but I'm not sure I'm using it properly.

What I've tried is to instantiate a SparkConf() object and set the configuration using the aforementioned method, as SparkConf().setIfMissing(option, value). But it's not working: it overrides whatever is set in spark-submit. My guess is that until the context is initialized, you can't retrieve what has been set via spark-submit.
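For reference, a minimal sketch of the failing approach (spark.executor.memory is just an illustrative option):

from pyspark import SparkConf, SparkContext

# Before any SparkContext exists, this SparkConf cannot see the
# properties passed through spark-submit, so setIfMissing treats
# every option as missing and sets it, overriding spark-submit
conf = SparkConf()
conf.setIfMissing("spark.executor.memory", "4g")
sc = SparkContext(conf=conf)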

So I'm unsure how to use SparkConf.setIfMissing for this case, if this is its purpose to begin with. Otherwise, is there some other approach to accomplish this behaviour? Would appreciate any help with this.

Upvotes: 1

Views: 2002

Answers (1)

yatu

Reputation: 88236

I managed to solve it: retrieving all parameters set via spark-submit, stopping the SparkContext, and then creating a new context did it. The steps are as follows:

  • Initialize SparkContext
  • Retrieve all previously set configuration via sc.getConf() and stop the previous context using sc.stop()
  • Set all remaining configuration on that conf using conf.setIfMissing() and create a new context with it: SparkContext(conf=conf)

The last step makes it possible to prioritize the configuration set via spark-submit: only parameters that were not already set there are filled in by this method. In code, that would be:

from pyspark import SparkContext

config = my_config_dict

# Load the spark-submit configuration by initializing a context,
# grab it, then stop that context
sc = SparkContext()
conf = sc.getConf()
sc.stop()

# Only fill in options that spark-submit did not already set
for option, value in config.items():
    conf.setIfMissing(option, value)
sc = SparkContext(conf=conf)
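As a quick sanity check (spark.executor.memory here is just an illustrative option name), you can confirm that a value passed via spark-submit survived the rebuild:

# An option passed with --conf on spark-submit should keep its
# submitted value after the context is recreated
print(sc.getConf().get("spark.executor.memory", "not set"))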

Upvotes: 1
