Reputation: 97
When we want to connect to HDFS from Spark, we just set HADOOP_CONF_DIR instead of passing a variety of arguments to SparkConf:
export HADOOP_CONF_DIR=/etc/hadoop/conf
/usr/hdp/current/spark-client/bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 1G \
  --num-executors 3 \
  /usr/hdp/current/spark-client/lib/spark-examples*.jar 100
How does Spark handle HADOOP_CONF_DIR? How are these configuration files passed to Hadoop?
Upvotes: 1
Views: 9147
Reputation: 2072
1. HADOOP_CONF_DIR & spark-env.sh
When running Spark on YARN, you need to add the following line to spark-env.sh:
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
Note: verify that $HADOOP_HOME/etc/hadoop is the correct path in your environment, and that spark-env.sh also exports HADOOP_HOME.
Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client side) configuration files for the Hadoop cluster. These configs are used to write to HDFS and connect to the YARN ResourceManager. The configuration contained in this directory will be distributed to the YARN cluster so that all containers used by the application use the same configuration.
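To see where these settings come from, you can inspect the client-side XML files directly; a minimal sketch, assuming the /etc/hadoop/conf path from the question:

export HADOOP_CONF_DIR=/etc/hadoop/conf
# core-site.xml supplies fs.defaultFS, the HDFS endpoint Spark writes to
grep -A1 'fs.defaultFS' "$HADOOP_CONF_DIR/core-site.xml"
# yarn-site.xml supplies the ResourceManager address spark-submit contacts
grep -A1 'yarn.resourcemanager.address' "$HADOOP_CONF_DIR/yarn-site.xml"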
2. spark-defaults.conf
All your memory-related configs will be in the spark-defaults.conf file.
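For example, a typical conf/spark-defaults.conf might contain entries like the following (values are illustrative, not recommendations):

spark.executor.memory    1g
spark.driver.memory      1g
spark.yarn.am.memory     512m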
When running Spark on YARN in cluster mode, environment variables need to be set using the spark.yarn.appMasterEnv.[EnvironmentVariableName] property in your conf/spark-defaults.conf file. Environment variables that are set in spark-env.sh will not be reflected in the YARN Application Master process in cluster mode. See the YARN-related Spark Properties for more information.
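For example, to expose an environment variable to the Application Master in cluster mode, pass the property with the submit command (or put the same key/value in conf/spark-defaults.conf); MY_APP_ENV here is a hypothetical variable name:

/usr/hdp/current/spark-client/bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.appMasterEnv.MY_APP_ENV=production \
  /usr/hdp/current/spark-client/lib/spark-examples*.jar 100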
3. Configs overridden by the cluster manager
The Spark documentation clearly states that if you are running under the YARN cluster manager, it will override the spark-env.sh setting. Check the yarn-env.sh or yarn-site.xml file for the local-dir setting.
"this will be overridden by SPARK_LOCAL_DIRS (Standalone, Mesos) or LOCAL_DIRS (YARN) environment variables set by the cluster manager." source - https://spark.apache.org/docs/2.3.1/configuration.html
Upvotes: 2