Reputation: 1
We are building an Airflow server on an EC2 instance that communicates with an EMR cluster to run Spark jobs. We are trying to submit a BashOperator DAG that runs a spark-submit command for a simple wordcount application. Here is our spark-submit command:
./spark-submit --deploy-mode client --verbose --master yarn wordcount.py s3://bucket/inputwordcount.txt s3://bucket/outputbucket/ ;
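For context, the DAG is essentially just a BashOperator wrapping that command. A trimmed-down sketch (the dag_id, start date, and task_id are placeholders; on Airflow 1.10 the import would be airflow.operators.bash_operator instead):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Trimmed-down version of our DAG; dag_id and dates are placeholders.
with DAG(
    dag_id="wordcount_spark_submit",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    wordcount = BashOperator(
        task_id="spark_submit_wordcount",
        bash_command=(
            "./spark-submit --deploy-mode client --verbose --master yarn "
            "wordcount.py s3://bucket/inputwordcount.txt s3://bucket/outputbucket/"
        ),
    )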
We're getting the following error: Exception in thread "main" org.apache.spark.SparkException: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
So far we've set HADOOP_CONF_DIR and YARN_CONF_DIR to /etc/hadoop/ in the .bashrc on our EC2 instance, and we've copied spark-env.sh from the EMR cluster to /etc/hadoop/ on the EC2 instance.
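Specifically, the .bashrc on the EC2 instance contains:

export HADOOP_CONF_DIR=/etc/hadoop/
export YARN_CONF_DIR=/etc/hadoop/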
We aren't sure which files we are supposed to copy into the HADOOP_CONF_DIR/YARN_CONF_DIR directory on the EC2 instance so that spark-submit sends the job to the EMR cluster running Spark. Has anyone had experience configuring a server to send Spark commands to a remote cluster? We would appreciate the help!
Upvotes: 0
Views: 1645
Reputation: 1461
I think the issue is that you are running spark-submit on the EC2 machine itself. I would suggest creating the EMR cluster with a corresponding step (or adding a step to an existing cluster) so the job runs on the cluster. Here is an example from the Airflow repo itself.
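A rough sketch of that approach against an already-running cluster (import paths differ across Amazon provider versions; the cluster id, connection id, and S3 paths are placeholders). With command-runner.jar, spark-submit runs on the EMR master node, so no Hadoop/YARN config is needed on the Airflow machine:

from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import EmrAddStepsOperator
from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

# Placeholder cluster id; in practice this often comes from an
# EmrCreateJobFlowOperator via XCom.
JOB_FLOW_ID = "j-XXXXXXXXXXXXX"

SPARK_STEPS = [
    {
        "Name": "wordcount",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            # command-runner.jar executes the command on the master node.
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                "s3://bucket/wordcount.py",
                "s3://bucket/inputwordcount.txt",
                "s3://bucket/outputbucket/",
            ],
        },
    }
]

with DAG(
    dag_id="emr_wordcount_step",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    add_step = EmrAddStepsOperator(
        task_id="add_wordcount_step",
        job_flow_id=JOB_FLOW_ID,
        steps=SPARK_STEPS,
        aws_conn_id="aws_default",
    )

    # Wait for the step to finish; EmrAddStepsOperator pushes the
    # list of step ids to XCom.
    watch_step = EmrStepSensor(
        task_id="watch_wordcount_step",
        job_flow_id=JOB_FLOW_ID,
        step_id="{{ task_instance.xcom_pull(task_ids='add_wordcount_step')[0] }}",
        aws_conn_id="aws_default",
    )

    add_step >> watch_step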
Or, if you prefer using BashOperator, you should use the AWS CLI, namely the aws emr command.
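Something along these lines (again a sketch; the cluster id and S3 paths are placeholders, and the script must already be in S3 since it runs on the EMR master, not on the Airflow machine):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Adds a Spark step to an existing cluster via the AWS CLI.
ADD_STEP_CMD = (
    "aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps "
    "Type=Spark,Name=wordcount,ActionOnFailure=CONTINUE,"
    "Args=[--deploy-mode,cluster,s3://bucket/wordcount.py,"
    "s3://bucket/inputwordcount.txt,s3://bucket/outputbucket/]"
)

with DAG(
    dag_id="emr_wordcount_cli",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    add_step = BashOperator(
        task_id="add_wordcount_step_via_cli",
        bash_command=ADD_STEP_CMD,
    )

Note that aws emr add-steps returns as soon as the step is queued; if the task needs to block until the step finishes, follow it with aws emr wait step-complete (or poll aws emr describe-step), or use the EmrStepSensor approach above.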
Upvotes: -1