Dev

Reputation: 13753

Granularity of tasks in Airflow

For each main task there are several helper steps: fetching/saving properties from a file or db, validations, audits. These helper steps are not time-consuming.

One sample DAG flow:

fetch_data >> actual_processing >> validation >> save_data >> audit
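
For reference, a minimal runnable sketch of that flow (import paths assume Airflow 2.x; the *_fn callables and the dag_id are hypothetical placeholders):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def fetch_data_fn():
        pass  # e.g. fetch properties from file/db

    def actual_processing_fn():
        pass  # the actual, time-consuming work

    def validation_fn():
        pass  # cheap validations

    def save_data_fn():
        pass  # save results to file/db

    def audit_fn():
        pass  # audit bookkeeping

    with DAG(
        dag_id="sample_pipeline",  # hypothetical name
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        fetch_data = PythonOperator(task_id="fetch_data", python_callable=fetch_data_fn)
        actual_processing = PythonOperator(task_id="actual_processing", python_callable=actual_processing_fn)
        validation = PythonOperator(task_id="validation", python_callable=validation_fn)
        save_data = PythonOperator(task_id="save_data", python_callable=save_data_fn)
        audit = PythonOperator(task_id="audit", python_callable=audit_fn)

        fetch_data >> actual_processing >> validation >> save_data >> audit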

What's the recommendation in this scenario?

What's the overhead of an Airflow task, assuming there are enough resources?

Upvotes: 3

Views: 673

Answers (1)

y2k-shubham

Reputation: 11607

Question-1

What's the recommendation in this scenario

Always try to keep as much as possible in a single task (preferably fat tasks that run for several minutes rather than lean tasks that run for a few seconds), in order to (not an exhaustive list; see the sketch after this list):

  • 1. minimize scheduling latency

  • 2. minimize load on scheduler / webserver / SQLAlchemy backend db.
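
As a sketch of this advice applied to the DAG from the question (all helper functions here are hypothetical stand-ins), the fast helper steps can be folded into the task that does the real work, so five lean tasks collapse into one fat one:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def fetch_data():
        pass  # fast helper: read properties from file/db

    def actual_processing(data):
        return data  # the actual, time-consuming work

    def validate(result):
        pass  # fast helper: validations

    def save_data(result):
        pass  # fast helper: save results

    def audit(result):
        pass  # fast helper: audits

    def run_pipeline():
        # one fat task doing the whole flow, instead of five lean tasks
        data = fetch_data()
        result = actual_processing(data)
        validate(result)
        save_data(result)
        audit(result)

    with DAG(
        dag_id="sample_pipeline_fat",  # hypothetical name
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        pipeline = PythonOperator(task_id="pipeline", python_callable=run_pipeline)

The scheduler now tracks one task instance per run instead of five, at the cost of coarser retries and less per-step visibility in the UI.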


Exceptions to this rule could be (not an exhaustive list):

  • 1. when idempotency dictates that you must break your task into smaller steps, to prevent wasteful re-computation / broken workflows, as stated in the Using Operators doc:

An operator represents a single, ideally idempotent, task

  • 2. peculiar cases, such as when you are using pools to limit load on an external resource: in this case, each operation that touches that external resource has to be modelled as a separate task in order to enforce the load restriction via pools (see the sketch below)
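
A sketch of the pools case (the pool name "external_db" and the callables are assumptions; the pool itself must already exist, created e.g. via the UI or the airflow pools CLI): each operation that touches the external resource is its own task carrying the pool argument, so Airflow caps how many of them run concurrently.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def read_from_db():
        pass  # hypothetical: queries the shared external database

    def write_to_db():
        pass  # hypothetical: writes results back

    with DAG(
        dag_id="pooled_pipeline",  # hypothetical name
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        read = PythonOperator(
            task_id="read_from_db",
            python_callable=read_from_db,
            pool="external_db",  # at most <pool slots> such tasks run at once
        )
        write = PythonOperator(
            task_id="write_to_db",
            python_callable=write_to_db,
            pool="external_db",
        )
        read >> write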

Question-2

What's the overhead of an Airflow task, assuming there are enough resources?

While I can't give a technically precise answer here, do understand that Airflow's scheduler essentially works on a poll-based approach:

  • at every heartbeat (usually ~ 20-30 s), it scans the meta-db and the DagBag to determine the list of tasks that are ready to run, for example
    • a scheduled task whose upstream tasks have all run
    • an up_for_retry task whose retry_delay has expired

From the old docs

The Airflow scheduler monitors all tasks and all DAGs, and triggers the task instances whose dependencies have been met. Behind the scenes, it monitors and stays in sync with a folder for all DAG objects it may contain, and periodically (every minute or so) inspects active tasks to see whether they can be triggered.

  • this means that having more tasks (as well as more connections / dependencies between them) will increase the workload of the scheduler (more checks to be evaluated)
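
If you want to check the polling intervals on your own installation, the relevant settings can be read programmatically; a sketch (the key names below are from the [scheduler] section of airflow.cfg, but defaults differ across Airflow versions, so treat the ~ 20-30 s figure above as approximate):

    from airflow.configuration import conf

    # how often the scheduler heartbeats and evaluates runnable tasks
    print(conf.get("scheduler", "scheduler_heartbeat_sec"))
    # minimum interval between re-parses of each DAG file into the DagBag
    print(conf.get("scheduler", "min_file_process_interval"))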

Suggested reads

For all these issues with running a massive number of fast/small tasks, we require fast distributed task management, that does not require previous resource allocation (as Airflow does), as each ETL task needs very few resources, and allows tasks to be executed one after the other immediately.

Upvotes: 2
