Reputation: 13753
For one main task, there are many helper tasks: fetching/saving properties from a file or DB, validations, audits. These helper methods are not time-consuming.
One sample DAG flow:
fetch_data >> actual_processing >> validation >> save_data >> audit
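For illustration, here is a minimal sketch of how that flow could be wired up as one task per step (the dag_id, operator choice and callables are placeholders, not from my actual setup):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder callables standing in for the quick helper steps.
def fetch_data(): ...
def actual_processing(): ...
def validation(): ...
def save_data(): ...
def audit(): ...

with DAG(
    dag_id="sample_flow",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    fetch = PythonOperator(task_id="fetch_data", python_callable=fetch_data)
    process = PythonOperator(task_id="actual_processing", python_callable=actual_processing)
    validate = PythonOperator(task_id="validation", python_callable=validation)
    save = PythonOperator(task_id="save_data", python_callable=save_data)
    audit_step = PythonOperator(task_id="audit", python_callable=audit)

    # Chain the tasks in the order shown above.
    fetch >> process >> validate >> save >> audit_step
```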
1. What's the recommendation in this scenario?
2. What's the overhead of an Airflow task, assuming there are enough resources?
Upvotes: 3
Views: 673
Reputation: 11607
Question-1
What's the recommendation in this scenario?
Always try to keep maximum stuff in a single task (and preferably have fat tasks that run for several minutes rather than lean tasks that run for a few seconds; a sketch follows this list) in order to (not an exhaustive list):
1. minimize scheduling latency
2. minimize load on the scheduler / webserver / SQLAlchemy backend db
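As a sketch of this fat-task approach (reusing the hypothetical steps from the question), all the cheap helper calls run inside a single callable, so the scheduler only has to handle one task per DAG run:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical helpers standing in for the quick steps from the question.
def fetch_data(): ...
def actual_processing(data): ...
def validation(result): ...
def save_data(result): ...
def audit(result): ...

def run_pipeline():
    # One "fat" task: every cheap helper step executes inside a single
    # task instance instead of being scheduled separately.
    data = fetch_data()
    result = actual_processing(data)
    validation(result)
    save_data(result)
    audit(result)

with DAG(
    dag_id="sample_flow_fat_task",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="run_pipeline", python_callable=run_pipeline)
```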
Exceptions to this rule could be (not an exhaustive list):
- the guideline that "An operator represents a single, ideally idempotent, task"
- pools to limit load on an external resource => in this case, each operation that touches that external resource has to be modelled as a separate task in order to enforce the load restriction via the pool (see the sketch below)
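A sketch of the pool exception, with a hypothetical pool name and tables: each operation hitting the external resource stays a separate task and is assigned to the pool, so Airflow caps how many of them run concurrently.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical callable that hits the shared external database.
def export_table(table): ...

# The pool itself is created out-of-band (UI or CLI), e.g.:
#   airflow pools set external_db_pool 2 "cap concurrent hits on the external DB"

with DAG(
    dag_id="pooled_exports",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    for table in ["orders", "customers", "payments"]:
        # Each external-resource operation is its own task so the pool
        # can limit how many of them run at the same time.
        PythonOperator(
            task_id=f"export_{table}",
            python_callable=export_table,
            op_args=[table],
            pool="external_db_pool",
        )
```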
Question-2
What's the overhead of an Airflow task, assuming there are enough resources?
While I can't provide a technically precise answer here, do understand that Airflow's scheduler essentially works on a poll-based approach: it repeatedly parses the DagBag to determine the list of tasks that are ready to run, e.g.
- a scheduled task whose upstream tasks have run
- an up_for_retry task whose retry_delay has expired
From the old docs:
The Airflow scheduler monitors all tasks and all DAGs, and triggers the task instances whose dependencies have been met. Behind the scenes, it monitors and stays in sync with a folder for all DAG objects it may contain, and periodically (every minute or so) inspects active tasks to see whether they can be triggered.
So more tasks (as well as more connections / dependencies between them) will increase the workload of the scheduler (more checks to be evaluated).
Suggested reads
For all these issues with running a massive number of fast/small tasks, we require fast distributed task management that does not require previous resource allocation (as Airflow does), as each ETL task needs very few resources, and allows tasks to be executed one after the other immediately.
Upvotes: 2