Shiva Reddy

Reputation: 45

Running Spark jobs on EMR using Airflow

I have an EC2 instance and an EMR cluster. I want to run Spark jobs on EMR using Airflow. Where would Airflow need to be installed for this?

I am considering using the SparkSubmitOperator for this. What arguments should I provide when creating the Airflow task?

Upvotes: 3

Views: 1132

Answers (1)

A.B

Reputation: 20445

You will be installing Airflow on the EC2 instance, and I suggest running a containerized version of it. See this answer.

For submitting Spark jobs, you will need the EmrAddStepsOperator from Airflow, and you will need to provide it a step that runs spark-submit (a DAG sketch follows the step definition below).

(Note: if you are starting the cluster from the DAG itself, you will need to use EmrCreateJobFlowOperator as well; see details here.)

A typical submit step will look something like this:

spark_submit_step = [
    {
        'Name': 'Run Spark',
        'ActionOnFailure': 'TERMINATE_CLUSTER',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',  # runs arbitrary commands on the master node
            'Args': [
                'spark-submit',
                '--jars',
                '/emr/instance-controller/lib/bootstrap-actions/1/spark-iforest-2.4.0.jar,/home/hadoop/mysql-connector-java-5.1.47.jar',
                '--py-files',
                '/home/hadoop/mysqlConnect.py',
                '/home/hadoop/main.py',       # the application entry point
                'custom_argument',            # any arguments your script expects
                'another_custom_argument',
            ],
        },
    },
]
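To wire that step into a DAG, the tasks might look like the following. This is a minimal sketch, not the only way to do it: the dag_id, job_flow_id, and connection id are placeholders, the import paths assume the Amazon provider package (on Airflow 1.10 the same operators live under airflow.contrib.operators instead), and the EmrStepSensor is optional but useful for blocking until the step finishes.

from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import EmrAddStepsOperator
from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

with DAG(
    dag_id='emr_spark_submit',            # placeholder name
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,               # trigger manually
    catchup=False,
) as dag:

    # Submit the step defined above to an already-running cluster.
    # If you create the cluster in the DAG with EmrCreateJobFlowOperator,
    # pull its job flow id from XCom here instead of hard-coding it.
    add_step = EmrAddStepsOperator(
        task_id='add_spark_step',
        job_flow_id='j-XXXXXXXXXXXXX',    # placeholder: your cluster id
        steps=spark_submit_step,
        aws_conn_id='aws_default',        # connection with EMR permissions
    )

    # Block until the submitted step succeeds or fails; the step id is
    # what EmrAddStepsOperator pushed to XCom when it added the step.
    watch_step = EmrStepSensor(
        task_id='watch_spark_step',
        job_flow_id='j-XXXXXXXXXXXXX',
        step_id="{{ task_instance.xcom_pull(task_ids='add_spark_step', key='return_value')[0] }}",
        aws_conn_id='aws_default',
    )

    add_step >> watch_step

The sensor's step_id is a templated field, so the XCom pull is resolved at runtime, after EmrAddStepsOperator has actually submitted the step.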

Upvotes: 2
