Rafay
Rafay

Reputation: 6188

Architecturing Airflow DAG that needs contextual throttling

I should be able to create a single node Group1 that caters to the throttling and also have

I have tried explaining this in the following diagram: enter image description here

How do I implement such a hierarchy in Airflow for a Spring Boot Java application? Is it possible to design this kind of DAG using Airflow constructs and dynamically tell Java application how many tables it can extract at a time. For instance, if all workers except Worker1 are finished, Worker1 can now use all 5 threads available while everything else will proceed to step2.

Upvotes: 6

Views: 582

Answers (1)

Dave
Dave

Reputation: 2059

These constraints cannot be modeled as a directed acyclic graph, and thus cannot implemented in airflow exactly as described. However, they can be modeled as queues, and thus could be implemented with a job queue framework. Here are your two options:

Implement suboptimally as airflow DAG:

from airflow.models import DAG
from airflow.operators.subdag_operator import SubDagOperator
# Executors that inherit from BaseExecutor take a parallelism parameter
from wherever import SomeExecutor, SomeOperator

# Table load jobs are done with parallelism 5
load_tables = SubDagOperator(subdag=DAG("load_tables"), executor=SomeExecutor(parallelism=5))

# Each table load must be it's own job, or must be split into sets of tables of predetermined size, such that num_tables_per_job * parallelism = 5
for table in tables:
    load_table = SomeOperator(task_id=f"load_table_{table}", dag=load_tables)

# Jobs done afterwards are done with higher parallelism
afterwards = SubDagOperator(
    subdag=DAG("afterwards"), executor=SomeExecutor(parallelism=high_parallelism)
)

for job in jobs:
    afterward_job = SomeOperator(task_id=f"job_{job}", dag=afterwards)

# After _all_ table load jobs are complete, start the jobs that should be done afterwards

load_tables > afterwards

The suboptimal aspect here, is that, for the first half of the DAG, the cluster will be underutilized by higher_parallelism - 5.

Implement optimally with job queue:

# This is pseudocode, but could be easily adapted to a framework like Celery

# You need two queues
# The table load queue should be initialized with the job items
table_load_queue = Queue(initialize_with_tables)
# The queue for jobs to do afterwards starts empty
afterwards_queue = Queue()

def worker():

    # Work while there's at least one item in either queue
    while not table_load_queue.empty() or not afterwards_queue.empty():
        working_on_table_load = [worker.is_working_table_load for worker in scheduler.active()]

        # Work table loads if we haven't reached capacity, otherwise work the jobs afterwards
        if sum(working_on_table_load) < 5:
            is_working_table_load = True
            task = table_load_queue.dequeue()
        else
            is_working_table_load = False
            task = afterwards_queue.dequeue()

        if task:
            after = work(task)
            if is_working_table_load:

                # After working a table load, create the job to work afterwards
                afterwards_queue.enqueue(after)

# Use all the parallelism available
scheduler.start(worker, num_workers=high_parallelism)

Using this approach, the cluster won't be underutilized.

Upvotes: 1

Related Questions