rich-iovanisci

Reputation: 21

Airflow DAG Serialization/DagBag Issues After Major Upgrade

We've recently upgraded our Airflow environment from 2.5.1 to 2.7.2.

Some users are seeing longer-than-usual delays before DAG file changes are reflected in the UI. This looks like a DAG serialization problem, but the DagFileProcessorManager (DFPM) logs seem okay.

We have the DFPM running as a standalone process, separate from the Scheduler but on the same EC2 instance. The Scheduler and DFPM run as separate systemd units on that instance, the Webserver runs on its own EC2 instance under its own systemd unit, and the metadata database is an external Postgres instance.

Sometimes in the UI, when a user selects their DAG from the DAGs menu, the DAG fails to load with a message like 'DAG "<dag_id>" seems to be missing from the DagBag'.

There are also ERROR logs in the Scheduler logs like 'DAG <dag_id> not found in serialized_dag table' and 'Couldn't find DAG <dag_id> in DAG bag or database!'.

We never faced issues like this before upgrading, nothing related to this jumped out during the Airflow database migration, and aside from what's mentioned above, nothing seems off in any of the Airflow components' logs.

We've tried running multiple DAG file processor processes. I was expecting that to reduce the turnaround time for DAG parsing and re-serialization, but it hasn't helped much; users are still reporting slow turnaround (hours, not minutes).

We were also expecting reserialization to resolve the DagBag errors, but that hasn't helped either.
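For reference, the reserialization attempt was essentially the following (assuming the stock Airflow CLI; the systemd unit names are ours and will differ per setup):

```
# Rebuild the serialized_dag table from the current DAG files
airflow dags reserialize

# Then bounce the components (our unit names; adjust to yours)
sudo systemctl restart airflow-scheduler airflow-dag-processor
```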

Hoping that someone with more intimate knowledge of how the Scheduler and Webserver build their respective DagBags, and of the Airflow serialization process generally, can point me in the right direction for further debugging.

Edit: Perhaps it's worth noting I've seen a bunch of somewhat similar posts, but most are at least a year or two old, and the solution was usually something as simple as restarting the Webserver and Scheduler. We've rolled our entire infrastructure stack dozens of times in the process of debugging.

Edit 2: Digging around the DB, I just noticed that the row count of the serialized_dag table dropped to zero and is now climbing back toward where it should be. I navigated to the UI while it was still low, and the DAG-missing-from-DagBag issue was occurring at exactly that time. It seems like some process is occasionally flushing out or overwriting that table. Is any Airflow process capable of doing that when it encounters an error? Would running multiple DFPMs overwrite the serialized_dag table? If so, is that a bug?
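In case it helps anyone else chasing this, here's roughly how I'm watching the table, run directly against the Postgres metadata DB (column names per the standard Airflow 2.x schema; adjust if yours differs):

```sql
-- Row count should roughly equal the number of active DAGs
SELECT COUNT(*) FROM serialized_dag;

-- How fresh are the serialized rows?
SELECT MIN(last_updated), MAX(last_updated) FROM serialized_dag;

-- Spot gaps: active, unpaused DAGs with no serialized row
SELECT d.dag_id
FROM dag d
LEFT JOIN serialized_dag sd ON sd.dag_id = d.dag_id
WHERE d.is_active AND NOT d.is_paused AND sd.dag_id IS NULL;
```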

Upvotes: 2

Views: 1072

Answers (1)

sktemkar

Reputation: 9

Increase dag_file_processor_timeout to a higher value, like 300. The default is 50 (seconds), which can time out once you have on the order of 300+ DAGs.
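That setting lives under [core] in airflow.cfg; a minimal sketch (or set the equivalent environment variable AIRFLOW__CORE__DAG_FILE_PROCESSOR_TIMEOUT=300):

```ini
[core]
# Seconds a DagFileProcessor may run before being timed out (default: 50)
dag_file_processor_timeout = 300
```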

Upvotes: 0
