Reputation: 517
I'm having an issue with Airflow when running it on a 24xlarge machine on EC2.
I must note that the parallelism level is 256.
For some days, the DAG run finishes with the status 'failed' for two undetermined reasons:
Some tasks have the status 'upstream_failed', which is not true, because we can clearly see that all the previous steps were successful.
Other tasks have no status ('null'): they have not started yet, and they cause the DAG run to fail.
I must note that the logs for both of these tasks are empty.
And here are the task instance details for these cases:
Any solutions, please?
Upvotes: 6
Views: 7842
Reputation: 1038
I'm hoping you already got an answer / were able to move on. I've been stuck on this issue a few times in the last month, so I figured I would document what I ended up doing to solve the issue.
The other case where I've experienced the second condition ("Other tasks have no status ('null')") is when the task instance has changed, and specifically changed operator type.
Example: a task keeps the same task_id, but its operator class changes between versions of the DAG (say, a BashOperator swapped out for a PythonOperator).
As best I can piece together, what's happening is: the scheduler compares each existing task_instance record against the operator now defined for that task in the DAG. It looks for a task_instance with the correct operator type; not seeing it, it updates the associated database record(s) as state = 'removed'.
You can see tasks impacted by this process with the query:
SELECT *
FROM task_instance
WHERE state = 'removed'
It looks like there's been work on this issue for Airflow 1.10.
That being said, based on the commits I can find, I'm not 100% sure that it would resolve this issue. It seems like the general philosophy is still "when a DAG changes, you should increment / change the DAG name".
I don't love that solution, because it makes it hard to iterate on what is fundamentally one pipeline. The alternative I used was to follow (partially) the recommendations from Astronomer and "blow out" the DAG history. In order to do that, you need to clear the DAG's old records from the metadata database so the scheduler rebuilds them from the current definition.
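A minimal sketch of that cleanup, assuming direct access to the Airflow metadata database and 1.x-era table names ('my_dag' is a placeholder dag_id; pause the DAG and back up the database before deleting anything):
DELETE FROM task_instance WHERE dag_id = 'my_dag';
DELETE FROM dag_run WHERE dag_id = 'my_dag';
Once the old rows are gone, the scheduler recreates the task_instance records from the current DAG file, so the stored operator types match the code again.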
Upvotes: 3
Reputation: 4366
This can happen when the task status was manually changed (likely through the "Mark Success" option) or forced into a state (as in upstream_failed), and the task never receives a hostname value on the record and wouldn't have any logs or PID.
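If you want to confirm that from the metadata database, a query along these lines (a sketch, assuming the Airflow 1.x task_instance columns hostname and pid) lists task instances that never picked up a hostname or PID:
SELECT task_id, dag_id, execution_date, state, hostname, pid
FROM task_instance
WHERE (state = 'upstream_failed' OR state IS NULL)
  AND (hostname IS NULL OR hostname = '')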
Upvotes: 0