Reputation: 45
I have an EC2 instance and an EMR cluster. I want to run Spark jobs on EMR using Airflow. Where would Airflow need to be installed for this?
I am considering using the SparkSubmitOperator for this. What arguments should I provide when creating the Airflow task?
Upvotes: 3
Views: 1132
Reputation: 20445
You will be installing Airflow on EC2, and I would suggest running a containerized version of it. See this answer.
For submitting Spark jobs, you will want the EmrAddStepsOperator from Airflow rather than the SparkSubmitOperator: the latter needs Spark binaries and connectivity to the cluster on the Airflow machine itself, whereas EmrAddStepsOperator submits the job through the EMR Steps API. You provide it with a step that runs spark-submit.
(Note: if you are starting the cluster from the script, you will need the EmrCreateJobFlowOperator as well; see details here. A rough sketch follows.)
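In case you do create the cluster from the DAG, a minimal job-flow sketch might look like the one below. Every value here is a placeholder assumption: the release label, instance type and count, and IAM role names must match your own account, and the import path differs on older Airflow versions.

from airflow.providers.amazon.aws.operators.emr import EmrCreateJobFlowOperator

# Placeholder cluster spec; adjust everything to your environment.
JOB_FLOW_OVERRIDES = {
    'Name': 'airflow-spark-cluster',
    'ReleaseLabel': 'emr-5.29.0',
    'Applications': [{'Name': 'Spark'}],
    'Instances': {
        'InstanceGroups': [
            {
                'Name': 'Master node',
                'Market': 'ON_DEMAND',
                'InstanceRole': 'MASTER',
                'InstanceType': 'm5.xlarge',
                'InstanceCount': 1,
            },
        ],
        # Keep the cluster alive so steps can be added after creation.
        'KeepJobFlowAliveWhenNoSteps': True,
        'TerminationProtected': False,
    },
    'JobFlowRole': 'EMR_EC2_DefaultRole',
    'ServiceRole': 'EMR_DefaultRole',
}

create_cluster = EmrCreateJobFlowOperator(
    task_id='create_emr_cluster',
    job_flow_overrides=JOB_FLOW_OVERRIDES,
)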
A typical spark-submit step will look something like this:
spark_submit_step = [
    {
        'Name': 'Run Spark',
        'ActionOnFailure': 'TERMINATE_CLUSTER',
        'HadoopJarStep': {
            # command-runner.jar lets an EMR step run an arbitrary command,
            # here spark-submit.
            'Jar': 'command-runner.jar',
            'Args': [
                'spark-submit',
                '--jars',
                '/emr/instance-controller/lib/bootstrap-actions/1/spark-iforest-2.4.0.jar,/home/hadoop/mysql-connector-java-5.1.47.jar',
                '--py-files',
                '/home/hadoop/mysqlConnect.py',
                '/home/hadoop/main.py',
                # Placeholders for any arguments your main.py expects.
                'custom_argument_1',
                'custom_argument_2',
                'custom_argument_3',
            ]
        }
    }
]
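To wire this step into a DAG, pass it to EmrAddStepsOperator and watch its completion with EmrStepSensor. Below is a minimal sketch assuming an already-running cluster whose job flow ID you know, the default AWS connection, and the amazon provider package (import paths differ on older Airflow versions). If you created the cluster with EmrCreateJobFlowOperator instead, pull the job flow ID from that task's XCom.

from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import EmrAddStepsOperator
from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

JOB_FLOW_ID = 'j-XXXXXXXXXXXXX'  # placeholder: your EMR cluster ID

with DAG(
    dag_id='emr_spark_submit',
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Submit the spark-submit step defined above to the cluster.
    add_step = EmrAddStepsOperator(
        task_id='add_spark_step',
        job_flow_id=JOB_FLOW_ID,
        steps=spark_submit_step,
    )

    # Wait for the step to finish; EmrAddStepsOperator pushes the list of
    # step IDs to XCom, so we pull the first one here.
    watch_step = EmrStepSensor(
        task_id='watch_spark_step',
        job_flow_id=JOB_FLOW_ID,
        step_id="{{ task_instance.xcom_pull(task_ids='add_spark_step', key='return_value')[0] }}",
    )

    add_step >> watch_step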
Upvotes: 2