aandy

Reputation: 21

What is the best Airflow architecture for AWS EMR clusters?

I have an AWS EMR cluster with 1 master node, 30 core nodes, and some auto-scaled task nodes. Currently, hundreds of Hive and MySQL jobs are run on the cluster by Oozie, and I want to migrate some of them to Airflow. From my research, it seems that every DAG has to be present on every node and that an Airflow worker has to be installed on every node. However, my DAGs will be updated frequently and new DAGs will be added frequently, while the cluster has about 100 nodes, some of them auto-scaled. Also, as you know, only the master node has the Hive/MySQL clients installed on an EMR cluster. So I am confused: what Airflow architecture should I use for my EMR cluster?

Upvotes: 0

Views: 829

Answers (1)

jhnclvr

Reputation: 9497

Airflow worker nodes are not the same as EMR nodes.

In a typical setup, a Celery worker ("Airflow worker node") reads from a queue of jobs and executes them using the appropriate operator (in this case probably a SparkSubmitOperator, or possibly an SSHOperator).

Celery workers would not run on your EMR nodes as those are dedicated to running Hadoop jobs.

Celery workers would likely run on EC2 instances outside of your EMR cluster.

One common solution to having the same DAGs on every Celery worker is to put the DAGs on network storage (like EFS) and mount the network drive on each Celery worker EC2 instance.
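With that setup, each worker's `airflow.cfg` simply points `dags_folder` at the shared mount, so updating a DAG file on EFS updates it everywhere at once (the mount path below is just an example):

```ini
[core]
# Shared EFS mount, identical on the scheduler and every Celery worker
dags_folder = /mnt/efs/airflow/dags
```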

Upvotes: 1
