AsapYAMLGang

Reputation: 1

Configuring spark-submit to a remote AWS EMR cluster

We are building an Airflow server on an EC2 instance that communicates with an EMR cluster to run Spark jobs. We are trying to submit a BashOperator DAG that runs a spark-submit command for a simple wordcount application. Here is our spark-submit command:

./spark-submit --deploy-mode client --verbose --master yarn wordcount.py s3://bucket/inputwordcount.txt s3://bucket/outputbucket/ ;
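For reference, the DAG is roughly like the sketch below (the DAG name, schedule, and the path to spark-submit are placeholders, and the import path is the Airflow 1.x one):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator  # Airflow 1.x import path

    dag = DAG(
        dag_id="wordcount_spark_submit",
        start_date=datetime(2019, 1, 1),
        schedule_interval=None,
    )

    # Runs the same spark-submit command shown above from the EC2 instance
    submit_wordcount = BashOperator(
        task_id="submit_wordcount",
        bash_command=(
            "/usr/local/spark/bin/spark-submit --deploy-mode client --verbose --master yarn "
            "wordcount.py s3://bucket/inputwordcount.txt s3://bucket/outputbucket/"
        ),
        dag=dag,
    )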

We're getting the following error: Exception in thread "main" org.apache.spark.SparkException: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.

So far we've set HADOOP_CONF_DIR and YARN_CONF_DIR to /etc/hadoop/ in our .bashrc on the EC2 instance, and we have copied spark-env.sh from the EMR cluster to /etc/hadoop/ on the EC2 instance.

We aren't too sure which files we are supposed to copy into the HADOOP_CONF_DIR/YARN_CONF_DIR directory on the EC2 instance for the spark-submit command to send the job to the EMR cluster running Spark. Has anyone had experience configuring a server to send Spark commands to a remote cluster? We would appreciate the help!

Upvotes: 0

Views: 1645

Answers (1)

gorros

Reputation: 1461

I think the issue is that you are running spark-submit on the EC2 machine. I would suggest creating the EMR cluster with a corresponding step; there is an example of this in the Airflow repo itself. Or, if you prefer using BashOperator, you should use the AWS CLI, namely the aws emr command.
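For illustration, a rough sketch of the step-based approach against an already running cluster (the cluster id, AWS connection id, and S3 paths are placeholders; the import paths are the Airflow 1.x contrib ones and live in the Amazon provider package in newer versions):

    from datetime import datetime

    from airflow import DAG
    from airflow.contrib.operators.emr_add_steps_operator import EmrAddStepsOperator
    from airflow.contrib.sensors.emr_step_sensor import EmrStepSensor

    # The Spark step runs on the EMR cluster itself via command-runner.jar,
    # so no Hadoop/YARN configuration is needed on the Airflow machine.
    SPARK_STEPS = [
        {
            "Name": "wordcount",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "--deploy-mode", "cluster",
                    "s3://bucket/wordcount.py",
                    "s3://bucket/inputwordcount.txt",
                    "s3://bucket/outputbucket/",
                ],
            },
        }
    ]

    dag = DAG(
        dag_id="wordcount_emr_step",
        start_date=datetime(2019, 1, 1),
        schedule_interval=None,
    )

    add_step = EmrAddStepsOperator(
        task_id="add_step",
        job_flow_id="j-XXXXXXXXXXXXX",  # placeholder: id of the running EMR cluster
        aws_conn_id="aws_default",
        steps=SPARK_STEPS,
        dag=dag,
    )

    # Waits until the submitted step finishes before the DAG run completes
    watch_step = EmrStepSensor(
        task_id="watch_step",
        job_flow_id="j-XXXXXXXXXXXXX",
        step_id="{{ task_instance.xcom_pull(task_ids='add_step', key='return_value')[0] }}",
        aws_conn_id="aws_default",
        dag=dag,
    )

    add_step >> watch_step

If you stay with BashOperator, the equivalent task would wrap an aws emr add-steps call along these lines (same placeholders, attached to the same dag object as above):

    from airflow.operators.bash_operator import BashOperator

    submit_step = BashOperator(
        task_id="submit_step",
        bash_command=(
            "aws emr add-steps --cluster-id j-XXXXXXXXXXXXX "
            "--steps Type=Spark,Name=wordcount,ActionOnFailure=CONTINUE,"
            "Args=[--deploy-mode,cluster,s3://bucket/wordcount.py,"
            "s3://bucket/inputwordcount.txt,s3://bucket/outputbucket/]"
        ),
        dag=dag,  # reuses the DAG object from the sketch above
    )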

Upvotes: -1
