Amogh Antarkar

Reputation: 169

Apache Airflow - Maintain table for dag_ids with last run date?

Apache Airflow has MySQL tables such as dag, dag_run, and job that maintain metadata for DAGs, including DAG run times. However, letting external jobs query these production Airflow tables to check last-run completions might not be good design practice if the frequency and load of the reporting queries on these tables increase.

Another possible option is to add Python code to the Airflow DAGs to maintain a separate database table that saves the DAG ID and its run-time metadata on every DAG task run. This table would live outside of Airflow, and the DAG code would need updating to save metadata to the new table.

What would be a recommended way, or a better alternative design, for external reporting queries to check the last completed run time of Airflow DAG tasks?

Upvotes: 0

Views: 2367

Answers (1)

joebeeson

Reputation: 4366

If you're only querying the database periodically, there should be nothing wrong with exposing the Airflow database directly, preferably through a read-only account. Just keep an eye on how the database is holding up.
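For example, a reporting job could read last completed runs straight from the dag_run table. A minimal sketch, assuming a dedicated read-only MySQL account (the connection string is a placeholder, and dag_run column names can vary slightly between Airflow versions):

```python
from sqlalchemy import create_engine, text

# Placeholder credentials for a dedicated read-only account.
engine = create_engine("mysql://reporting_ro:PASSWORD@airflow-db/airflow")

# Last successful completion per DAG, straight from Airflow's dag_run table.
LAST_RUN_SQL = text("""
    SELECT dag_id, MAX(end_date) AS last_completed
    FROM dag_run
    WHERE state = 'success'
    GROUP BY dag_id
""")

with engine.connect() as conn:
    for dag_id, last_completed in conn.execute(LAST_RUN_SQL):
        print(dag_id, last_completed)
```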

If you need to hit it very often, you might want to copy the data to another database. Depending on the amount of "lag" you're willing to accept, you could simply query the Airflow database on an interval and write the state elsewhere; you could even use Airflow itself to do this for you.
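A sketch of that idea as an Airflow DAG: every 15 minutes it reads last-run state from the metadata database and copies it into a reporting database. The connection IDs (airflow_db, reporting_db) and the target table (dag_last_run) are assumptions, not anything Airflow provides:

```python
from datetime import datetime

from airflow import DAG
from airflow.hooks.mysql_hook import MySqlHook
from airflow.operators.python_operator import PythonOperator

def sync_last_runs():
    airflow_db = MySqlHook(mysql_conn_id="airflow_db")      # metadata DB, read-only
    reporting_db = MySqlHook(mysql_conn_id="reporting_db")  # hypothetical target DB

    rows = airflow_db.get_records(
        "SELECT dag_id, MAX(end_date) FROM dag_run "
        "WHERE state = 'success' GROUP BY dag_id"
    )
    for dag_id, last_completed in rows:
        # MySQL-specific upsert into the reporting table.
        reporting_db.run(
            "REPLACE INTO dag_last_run (dag_id, last_completed) VALUES (%s, %s)",
            parameters=(dag_id, last_completed),
        )

dag = DAG(
    "sync_dag_state",
    start_date=datetime(2018, 1, 1),
    schedule_interval="*/15 * * * *",
    catchup=False,
)

PythonOperator(task_id="sync_last_runs", python_callable=sync_last_runs, dag=dag)
```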

If you need real-time information, you may want to look at modifying your processes to add a final task that inserts a record into a database when the run completes.
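Something along these lines, where the connection ID (reporting_db) and table (dag_last_run) are again hypothetical:

```python
from datetime import datetime

from airflow import DAG
from airflow.hooks.mysql_hook import MySqlHook
from airflow.operators.python_operator import PythonOperator

def record_completion(**context):
    # Write this run's completion record to the external reporting table.
    MySqlHook(mysql_conn_id="reporting_db").run(
        "INSERT INTO dag_last_run (dag_id, execution_date, completed_at) "
        "VALUES (%s, %s, %s)",
        parameters=(
            context["dag"].dag_id,
            context["execution_date"],
            datetime.utcnow(),
        ),
    )

dag = DAG("my_dag", start_date=datetime(2018, 1, 1), schedule_interval="@daily")

record = PythonOperator(
    task_id="record_completion",
    python_callable=record_completion,
    provide_context=True,  # needed on Airflow 1.x to receive **context
    dag=dag,
)
# real_work >> record  # keep this as the last task so it only runs on success
```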

Upvotes: 1
