Reputation: 1051
I have two workers and three tasks.
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# default_args is assumed to be defined above
dag = DAG('dummy_for_testing', default_args=default_args)

t1 = BashOperator(
    task_id='print_task1',
    bash_command='task1.py',
    dag=dag)

t2 = BashOperator(
    task_id='print_task2',
    bash_command='task2.py',
    dag=dag)

t3 = BashOperator(
    task_id='print_task3',
    bash_command='task3.py',
    dag=dag)

t1 >> t2 >> t3
Let's say I am performing these tasks (t1, t2, t3) on a particular file. Currently everything runs on one worker, but I want to set up a second worker that takes the output of the first task and performs t2 and then t3, so that queue1 can start t1 on the next file. How can I make this work with two workers? I was thinking of using queues, but I couldn't figure out how to make queue2 wait until task t1 in queue1 has finished.
Upvotes: 0
Views: 900
Reputation: 3257
You shouldn't have to do anything other than start both workers; they will pick up tasks as they become available, within the concurrency/parallelism constraints defined in your config.
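For reference, these are the kinds of settings that control those constraints. A minimal airflow.cfg sketch; option names and defaults vary between Airflow versions, so check your own config:
[core]
# maximum number of task instances running concurrently across the whole installation
parallelism = 32
# maximum number of task instances allowed to run concurrently within a single DAG
dag_concurrency = 16

[celery]
# number of task slots each Celery worker offers
worker_concurrency = 16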
In the example you gave, the tasks might run entirely on worker 1, entirely on worker 2, or on a mixture of both. This is because t2 won't start until t1 has completed. In the time between t1 completing and t2 starting, both workers will be idle (assuming you don't have other DAGs running), and one of them will win the race to reserve the t2 task.
If you need specific tasks to run on particular workers (say, because one or more workers have more resources available, or special hardware), you can specify the queue at the task level. The queue won't make a difference to the order in which tasks run, as the Airflow scheduler will ensure a task doesn't run until all of its upstream tasks have run successfully.
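As a rough sketch (the queue name high_memory is just a placeholder, not something from your setup):
t3 = BashOperator(
    task_id='print_task3',
    bash_command='task3.py',
    queue='high_memory',  # only workers subscribed to this queue will pick up t3
    dag=dag)
With the CeleryExecutor you then start a worker that listens on that queue, e.g. airflow worker -q high_memory on Airflow 1.x (on Airflow 2 the equivalent is airflow celery worker -q high_memory). Tasks without an explicit queue go to the default queue from your config.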
Upvotes: 4