Reputation: 1759
My team and I are on Airflow v2.1.0 using the Celery executor with Redis. Recently we've noticed that some jobs occasionally keep running until we kick them (many hours, sometimes days, basically until someone notices). We haven't spotted a particular pattern yet.
We also use DataDog and the statsd provider to collect and monitor the metrics produced by Airflow. Ideally we could set up a DataDog monitor for this, but there doesn't appear to be an obvious metric for this situation.
How can we detect and alarm on stuck jobs like this?
Upvotes: 2
Views: 663
Reputation: 1156
This issue is probably fixed by PR16550.
The problem arises when you restart the scheduler: any tasks that were scheduled or queued (but hadn't made it to the actual executor yet) end up in a state the scheduler can't start them from. They stay that way indefinitely without manual intervention; even restarting the scheduler again won't fix it. However, as you point out, you can still run them manually.
Upvotes: 1
Reputation: 1875
You can use Airflow's SLAs in combination with the sla_miss_callback parameter to call some service (we use Slack, for example).
From the docs:
An SLA, or a Service Level Agreement, is an expectation for the maximum time a Task should take. If a task takes longer than this to run, then it is visible in the "SLA Misses" part of the user interface, as well as going out in an email of all tasks that missed their SLA.
With that, you define an SLA for the tasks you want to monitor and provide an sla_miss_callback to get notified about those misses.
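A minimal sketch of how that can look (the DAG and task names are hypothetical, and the print stands in for whatever notification call you use, e.g. a Slack webhook or a DataDog event):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator


def sla_miss_alert(dag, task_list, blocking_task_list, slas, blocking_tis):
    # Called by the scheduler when one or more tasks miss their SLA.
    # Replace the print with a call to your alerting service.
    print(f"SLA missed in DAG {dag.dag_id}: {task_list}")


with DAG(
    dag_id="example_sla_dag",              # hypothetical DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
    sla_miss_callback=sla_miss_alert,      # DAG-level callback fired on SLA misses
) as dag:
    BashOperator(
        task_id="long_running_task",       # hypothetical task name
        bash_command="sleep 30",
        sla=timedelta(minutes=30),         # flag the task if it runs past 30 minutes
    )
```

Keep in mind that the SLA is measured relative to the scheduled start of the DAG run rather than the moment the task actually starts, so pick the timedelta with that in mind.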
Upvotes: 1