dstandish

Reputation: 2408

airflow cleared tasks not getting executed

Preamble

Yet another airflow tasks not getting executed question...

Everything was going more or less fine in my airflow experience up until this weekend when things really went downhill.

I have checked all the standard things e.g. as outlined in this helpful post.

I have reset the whole instance multiple times trying to get it working properly but I am totally losing the battle here.

Environment

The problem

Here's what happens in my troubleshooting infinite loop / recurring nightmare.

  1. I reset the metadata DB (or possibly the whole virtualenv and config etc) and re-enter connection information.
  2. Tasks will get executed once. They may succeed. If I missed something in the setup, a task may fail.
  3. When a task fails, it goes to the retry state.
  4. I fix the issue (e.g. a forgotten connection) and manually clear the task instance.
  5. Cleared task instances do not run; they just sit in a "none" state.
  6. Attempts to get the DAG running again fail.
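One way to see what the scheduler sees is to query the `task_instance` table in the metadata DB directly. Here is a minimal sketch against a mocked-up version of that table (the real schema has many more columns, and the real DB is whatever your `sql_alchemy_conn` points at; the dag/task names are made up):

```python
import sqlite3

# Mock of the metadata DB's task_instance table (heavily simplified schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE task_instance "
    "(dag_id TEXT, task_id TEXT, execution_date TEXT, state TEXT)"
)
conn.executemany(
    "INSERT INTO task_instance VALUES (?, ?, ?, ?)",
    [
        ("my_dag", "extract", "2019-04-01", "success"),
        ("my_dag", "load", "2019-04-01", None),    # cleared, now stuck
        ("my_dag", "report", "2019-04-01", None),  # never scheduled
    ],
)

# Count task instances by state; NULL in the DB comes back as Python None.
rows = conn.execute(
    "SELECT state, COUNT(*) FROM task_instance GROUP BY state"
).fetchall()
print(dict(rows))  # the None-state bucket is the stuck one
```

Running the equivalent query against the real metadata DB tells you whether the "stuck" instances are genuinely sitting in a NULL state or in something else like `scheduled` or `queued`.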

Before I started having this trouble, after I cleared a task instance, it would always very quickly get picked up and executed again.

But now, clearing the task instance usually results in the task instance getting stuck in a cleared state. It just sits there.

Worse, if I try failing the DAG and all instances, and manually triggering the DAG again, the task instances get created but stay in 'none' state. Restarting the scheduler doesn't help.

Other observation

This is probably a red herring, but one thing I have noticed only recently is that when I click on the icon representing the task instances stuck in the 'none' state, it takes me to a "task instances" view with the wrong filter applied: the filter is set to "string equals null".

But you need to switch it to "string empty yes" in order to have it actually return the stuck task instances.

I am assuming this is just an unrelated UI bug, a red herring as far as I am concerned, but I thought I'd mention it just in case.
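For what it's worth, the filter behavior matches standard SQL semantics: `state = NULL` never matches anything (a comparison with NULL is unknown, not true), whereas `state IS NULL` is the correct null test. A tiny sqlite demonstration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ti (task_id TEXT, state TEXT)")
conn.executemany("INSERT INTO ti VALUES (?, ?)",
                 [("a", "success"), ("b", None)])

# "= NULL" compares against unknown, so no row ever matches.
eq = conn.execute("SELECT task_id FROM ti WHERE state = NULL").fetchall()

# "IS NULL" is the proper null test and finds the stuck row.
is_null = conn.execute("SELECT task_id FROM ti WHERE state IS NULL").fetchall()

print(eq, is_null)  # [] [('b',)]
```

So a UI filter that translates to "string equals null" would indeed return nothing, while "string empty yes" presumably translates to `IS NULL`.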

Edit 1

I am noticing that there is some "null operator" thing going on: why is my operator null? I will look into it.

Edit 2

Is null a valid value for task instance state, or is this an indicator that something is wrong?

Is it legit to have a null task instance state?

Edit 3

More none stuff.

Here are some bits from the task instance details page. Lots of attributes are none:

Task Instance Details
Dependencies Blocking Task From Getting Scheduled
Dependency  Reason
Unknown All dependencies are met but the task instance is not running. In most cases this just means that the task will probably be scheduled soon unless:
- The scheduler is down or under heavy load
- The following configuration values may be limiting the number of queueable processes: parallelism, dag_concurrency, max_active_dag_runs_per_dag, non_pooled_task_slot_count
- This task instance already ran and had its state changed manually (e.g. cleared in the UI)

If this task instance does not start soon please contact your Airflow administrator for assistance.
Task Instance Attributes
Attribute   Value
duration    None
end_date    None
is_premature    False
job_id  None
operator    None
pid None
queued_dttm None
raw False
run_as_user None
start_date  None
state   None

Update

I may finally be on to something...

After my nightmarish, marathon, stuck-in-twilight-zone troubleshooting session, I threw my hands up and resolved to use docker containers instead of running natively. It was just too weird. Things were just not making sense. I needed to move to docker so that the environment could be completely controlled and reproduced.

So I started working on the docker setup based on puckel/docker-airflow. This was no trivial task either, because I decided to use environment variables for all parameters and connections. Not all hooks parse connection URIs the same way, so you have to be careful and look at the code and do some trial and error.
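On the connections-as-environment-variables point: Airflow reads connections from `AIRFLOW_CONN_<CONN_ID>` env vars as URIs, and the most common gotcha is special characters in the password, which must be percent-encoded or the hook will mis-parse the URI. A sketch with made-up credentials:

```python
from urllib.parse import quote

# Hypothetical credentials -- special characters in the password must be
# percent-encoded, or the URI components will not parse as you expect.
user, password = "etl_user", "p@ss/word"
host, port, schema = "db.example.com", 5432, "warehouse"

uri = f"postgres://{user}:{quote(password, safe='')}@{host}:{port}/{schema}"
print(uri)  # postgres://etl_user:p%40ss%2Fword@db.example.com:5432/warehouse

# Airflow would then pick this up as connection id "my_postgres" via:
#   export AIRFLOW_CONN_MY_POSTGRES='<the uri above>'
```

Even with proper encoding, it's worth the trial-and-error pass mentioned above, since individual hooks differ in how they interpret the `schema`/extra parts of the URI.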

So I did that, and I finally got my docker setup working locally. But when I went to build the image on my EC2 instance, I found that the disk was full, and in no small part it was full due to airflow logs.

So, my new theory is that lack of disk space may have had something to do with this. I am not sure if I will be able to find a smoking gun in the logs, but I will look.
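A quick way to check for this condition, sketched in Python (`~/airflow/logs` is the default `base_log_folder`; adjust the path if yours differs):

```python
import os
import shutil

# How full is the root volume?
total, used, free = shutil.disk_usage("/")
print(f"free: {free / 1e9:.1f} GB of {total / 1e9:.1f} GB")

# Which top-level log directories are eating the space?
log_root = os.path.expanduser("~/airflow/logs")
sizes = {}
for dirpath, _, filenames in os.walk(log_root):
    for name in filenames:
        path = os.path.join(dirpath, name)
        top = os.path.relpath(path, log_root).split(os.sep)[0]
        sizes[top] = sizes.get(top, 0) + os.path.getsize(path)

# Print the 20 largest offenders, biggest first.
for name, size in sorted(sizes.items(), key=lambda kv: -kv[1])[:20]:
    print(f"{size / 1e6:10.1f} MB  {name}")
```

A plain `df -h` plus `du -sh ~/airflow/logs/*` does the same job from the shell.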

Upvotes: 7

Views: 9014

Answers (1)

dstandish

Reputation: 2408

OK, I am closing this out and marking the presumptive root cause as: the server was out of space.

There were a number of contributing factors:

  1. My server did not have a lot of storage. Only 10GB. I did not realize it was so low. Resolution: add more space
  2. Logging in airflow 1.10.2 went a little crazy. An INFO log message was outputting Harvesting DAG parsing results every second or two, which resulted, eventually, in a large log file. Resolution: This is fixed in commit [AIRFLOW-3911] Change Harvesting DAG parsing results to DEBUG log level (#4729), which is in 1.10.3, but you can always fork and cherry pick if you are stuck on 1.10.2.
  3. Additionally, some of my scheduler / webserver interval params were set too low, which contributed to the multi-GB log files. I think this may have been partly due to changing airflow versions without correctly updating airflow.cfg. Solution: when upgrading (or changing versions), temporarily move airflow.cfg aside so that a cfg compatible with the new version is generated, then merge the two carefully. Another strategy is to rely only on environment variables, so that your config is always that of a fresh install and the only parameters in your env variables are overrides and, possibly, connections.
  4. Airflow may not log errors anywhere in this case; everything looked fine, except the scheduler was not queuing up jobs, or it would queue one or two and then just stop, without any error message. Solutions can include (1) adding out-of-space alarms on your cloud provider, and (2) figuring out how to ensure the scheduler raises a helpful exception in this case and contributing that to airflow.
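The env-var-only config strategy from point 3 can be sketched like this (the specific values are illustrative, not recommendations; `AIRFLOW__SECTION__KEY` is Airflow's standard env-var override convention):

```shell
# Overrides only -- everything else falls back to the freshly generated
# airflow.cfg defaults for whichever airflow version is installed.
export AIRFLOW__CORE__PARALLELISM=32
export AIRFLOW__CORE__DAG_CONCURRENCY=16
export AIRFLOW__SCHEDULER__MIN_FILE_PROCESS_INTERVAL=30

# Connections live in env vars too, so nothing sensitive sits in cfg files:
export AIRFLOW_CONN_MY_POSTGRES='postgres://user:pass@host:5432/db'
```

This way an upgrade never leaves you running a new version against a stale cfg; the defaults always match the installed version.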

Upvotes: 5
