Reputation: 51
I was going through this Apache Spark documentation, and it mentions that:
When running Spark on YARN in cluster mode, environment variables need to be set using the spark.yarn.appMasterEnv.[EnvironmentVariableName] property in your conf/spark-defaults.conf file.
I am running my EMR cluster on AWS Data Pipeline. I wanted to know where I have to edit this conf file. Also, if I create my own custom conf file and specify it as part of --configurations (in the spark-submit), will that solve my use case?
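For reference, an entry of the kind the documentation describes would look like the following in spark-defaults.conf (SOME_ENV_VAR and some_value are placeholders):
spark.yarn.appMasterEnv.SOME_ENV_VAR  some_value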
Upvotes: 4
Views: 10223
Reputation: 85
For future reference, you could directly pass the environment variable when creating the EMR cluster using the Configurations parameter, as described in the docs here.
Specifically, the spark-defaults file can be modified by passing a configuration JSON as follows:
{
    'Classification': 'spark-defaults',
    'Properties': {
        'spark.yarn.appMasterEnv.[EnvironmentVariableName]': 'some_value',
        'spark.executorEnv.[EnvironmentVariableName]': 'some_other_value'
    }
},
Here, spark.yarn.appMasterEnv.[EnvironmentVariableName] is used to pass a variable to the application master in cluster mode on YARN (here), and spark.executorEnv.[EnvironmentVariableName] is used to pass a variable to the executor process (here).
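If you create the cluster from the AWS CLI instead of an SDK, the same JSON can be passed with the --configurations flag of aws emr create-cluster. A minimal sketch (the cluster name, release label, instance settings, and variable names are all placeholders):
aws emr create-cluster \
  --name "my-cluster" \
  --release-label emr-5.30.0 \
  --applications Name=Spark \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --configurations '[{
    "Classification": "spark-defaults",
    "Properties": {
      "spark.yarn.appMasterEnv.SOME_ENV_VAR": "some_value",
      "spark.executorEnv.SOME_ENV_VAR": "some_other_value"
    }
  }]'
The --configurations flag also accepts a file:// path, which is easier to maintain for longer configuration lists.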
Upvotes: 1
Reputation: 2360
One way to do it is the following. (The tricky part is that you might need to set up the environment variables on both the driver and the executors.)
spark-submit \
--driver-memory 2g \
--executor-memory 4g \
--conf spark.executor.instances=4 \
--conf spark.driver.extraJavaOptions="-DENV_KEY=ENV_VALUE" \
--conf spark.executor.extraJavaOptions="-DENV_KEY=ENV_VALUE" \
--master yarn \
--deploy-mode cluster \
--class com.industry.class.name \
assembly-jar.jar
I have tested it on EMR in client mode, but it should work in cluster mode as well.
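Note that -D options set JVM system properties (read with System.getProperty), not true environment variables. If your code reads the value with System.getenv instead, a sketch of the equivalent spark-submit flags would be (ENV_KEY and ENV_VALUE are placeholders, as above):
spark-submit \
  --conf spark.yarn.appMasterEnv.ENV_KEY=ENV_VALUE \
  --conf spark.executorEnv.ENV_KEY=ENV_VALUE \
  --master yarn \
  --deploy-mode cluster \
  --class com.industry.class.name \
  assembly-jar.jar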
Upvotes: 3