user1399487

Reputation: 113

How to robustly handle blips in the DAG folder

I'm running Airflow 2 (with Redis and Celery) with the DAGs folder set to `/drive1/myuser/airflow/dags`. This folder contains a symlink, `my_dags -> /drive2/myrepos/latest_deployment/mydags`, where `latest_deployment` is itself a symlink that is repointed to the newest version of the code every time code is deployed.
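To make the layout concrete, here is roughly how the chain resolves (paths as above; the release directory name is only illustrative):

```python
from pathlib import Path

# Rough illustration of the symlink chain described above.
dag_link = Path("/drive1/myuser/airflow/dags/my_dags")

print(dag_link.readlink())
# -> /drive2/myrepos/latest_deployment/mydags

print(dag_link.resolve(strict=True))
# -> /drive2/myrepos/<whatever release is current>/mydags

# While a deployment repoints latest_deployment, resolve(strict=True)
# can briefly raise FileNotFoundError -- presumably the moment at which
# the worker fails to import the DAG.
```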

It happens rarely, but sometimes when code is deployed I'll see an error in a worker like `Dag 'XXX' could not be found; either it does not exist or it failed to parse`. Then in the scheduler logs I'll see:

`Executor reports task instance XXX [queued] finished (failed) although the task says it's queued. (Info: None) Was the task killed externally?`

My presumption is that the code deployment causes the symlink to change, or briefly not exist, for a few seconds while the worker is trying to load the DAG to run a task. I have retries configured for tasks, so if the task simply retried this wouldn't be a problem: it would find the DAG on the next attempt. However, not only does the task not retry, the scheduler also hangs for the impacted DAGs and stops scheduling their tasks until I manually kill the impacted runs and clear their status. After clearing a hung run, the DAG runs and succeeds normally again.
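For reference, the retries I mention are configured roughly like this (the DAG and task ids and the exact values here are made up, not my real code):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

# Simplified sketch of how retries are set on my tasks.
default_args = {
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="example_dag",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
    catchup=False,
) as dag:
    BashOperator(task_id="example_task", bash_command="echo hello")
```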

How can I make Airflow tolerate a few transient seconds of the DAG folder disappearing or changing? I can't change the deployment system, and I can't turn Airflow off during deployments. Ideally Airflow would just retry the failed task, but even marking the entire DAG run as failed would be better than the current freezing.

Upvotes: 0

Views: 30

Answers (0)
