I'm building a Dockerized Spark application, which will be run through an entrypoint.sh file that in turn runs spark-submit:
#!/bin/bash
# Build the classpath from the Hadoop installation and raise JVM memory limits.
export SPARK_DIST_CLASSPATH=$(hadoop classpath):$HADOOP_HOME/share/hadoop/*
export _JAVA_OPTIONS="-Xms2g -Xmx8g -XX:MaxPermSize=8g"
spark-submit \
  --master local \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:///job/log4j.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:///job/log4j.properties" \
  --files "/job/log4j.properties" \
  main.py --train_path "$1" --test_path "$2"
As you can see, I'm setting part of the configuration via the spark-submit --conf options.
If in the spark-submit statement I set a configuration parameter that is also set in main.py via SparkConf().set(option, value), the set call has priority over spark-submit, so for any configuration set in both ways, only what is set using SparkConf().set prevails (see other question).
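To make the problem concrete, here is a minimal sketch of that behaviour; spark.foo is just a placeholder key standing in for a real option:
from pyspark import SparkConf, SparkContext

# Suppose spark-submit was called with --conf spark.foo=from-submit.
conf = SparkConf().set("spark.foo", "from-code")
sc = SparkContext(conf=conf)
print(sc.getConf().get("spark.foo"))  # prints "from-code": set() wins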
What I've been trying to achieve is to control the Spark configuration while prioritizing what is set in spark-submit. There seems to be a method for this, SparkConf.setIfMissing, but I'm not sure I'm using it properly.
What I've tried is to instantiate a SparkConf() object and set the configuration using that method, as SparkConf().setIfMissing(option, value). But it's not working: it overrides whatever is set in spark-submit. My guess is that until the context is initialized, you can't retrieve what has been set via spark-submit.
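What I tried looks roughly like this, again with spark.foo as a placeholder key:
from pyspark import SparkConf, SparkContext

conf = SparkConf()  # fresh object, created before any context exists
conf.setIfMissing("spark.foo", "from-code")
sc = SparkContext(conf=conf)
# In my runs this printed "from-code" even when spark-submit passed
# --conf spark.foo=from-submit, i.e. the submitted value was overridden.
print(sc.getConf().get("spark.foo"))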
So I'm unsure how to use SparkConf.setIfMissing for this case, if that is its purpose to begin with. Otherwise, is there some other approach to accomplish this behaviour? I would appreciate any help with this.
Upvotes: 1
Views: 2002
I managed to solve it. Stopping the SparkContext, retrieving all parameters set via spark-submit, and then creating a new context did it. The steps are as follows:
1. Initialize a SparkContext.
2. Retrieve its configuration via sc.getConf() and stop the previous context using sc.stop().
3. Set any missing options with SparkConf().setIfMissing() and create a new context with the new configuration, SparkContext(conf=conf).
The last step makes the configuration set via spark-submit take priority: only parameters that have not been previously set are set through this method.
In code, that would be:
from pyspark import SparkContext

# my_config_dict maps configuration options to their desired default values
sc = SparkContext()           # picks up everything set via spark-submit
conf = sc.getConf()           # retrieve the submitted configuration
sc.stop()                     # stop the initial context

# only set options that spark-submit did not already provide
for option, value in my_config_dict.items():
    conf.setIfMissing(option, value)

sc = SparkContext(conf=conf)  # new context, spark-submit values preserved
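As a quick sanity check, you can read the values back from the new context to confirm which ones won (just a verification loop, not part of the fix):
for option in my_config_dict:
    print(option, sc.getConf().get(option))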
Upvotes: 1