tdebroc

Reputation: 1526

Amazon MWAA Airflow - Tasks container shut down / stop / killed without logs

We use Amazon MWAA Airflow; rarely, some tasks are marked as "FAILED" but there are no logs at all, as if the container had been shut down without notifying us.

I have found this link: https://cloud.google.com/composer/docs/how-to/using/troubleshooting-dags#task_fails_without_emitting_logs which explains this as OOM on the machine. But our tasks use almost no CPU or RAM: they only make one HTTP call to the AWS API, so they are very light.

On CloudWatch, I can see that no other tasks are launched on the same container (each DAG run starts by printing the container IP, so I can search for this IP across all tasks).
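
For reference, here is roughly what that IP-logging step looks like, as a minimal sketch assuming Airflow 2.x import paths (the DAG and task names are illustrative, not our real ones):

import socket
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def log_container_ip():
    # Print the worker container's IP so it appears in the task log on CloudWatch
    print("Worker container IP:", socket.gethostbyname(socket.gethostname()))


with DAG(
    dag_id="example_log_container_ip",  # illustrative name
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    PythonOperator(
        task_id="print_container_ip",
        python_callable=log_container_ip,
    )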

If someone has an idea, that would be great. Thanks!

Upvotes: 7

Views: 4556

Answers (3)

In addition to the accepted answer, here's a blog post explaining the problem: https://technical.thombedford.com/267

Also, as of the time of writing this answer, Amazon states the problem in the MWAA documentation as a note:

Amazon MWAA uses Apache Airflow metrics to determine when additional Celery Executor workers are needed, and as required increases the number of Fargate workers up to the value specified by max-workers. When that number is zero, Amazon MWAA removes additional workers, downscaling back to the min-workers value. For more information, see the following How it works section. When downscaling occurs, it is possible for new tasks to be scheduled. Furthermore, it's possible for workers that are set for deletion to pick up those tasks before the worker containers are removed. This period can last between two to five minutes, due to a combination of factors: the time it takes for the Apache Airflow metrics to be sent, the time to detect a steady state of zero tasks, and the time it takes the Fargate workers to be removed. If you use Amazon MWAA with periods of sustained workload, followed by periods of no workload, you will be unaffected by this limitation. However, if you have very intermittent workloads with repeated high usage, followed by zero tasks for approximately five minutes, you might be affected by this issue when tasks running on the downscaled workers are deleted and marked as failed.

https://docs.aws.amazon.com/mwaa/latest/userguide/mwaa-autoscaling.html

Upvotes: 1

Arne Huang

Reputation: 634

After trying various things, it turns out there is also a concurrency issue with their boto package, and the easiest way to solve it is to make things not concurrent.

So running their smallest instance size, whose scheduler has only 2 vCPUs (as per this), will not have this issue.

Another thing to try is setting celery.sync_parallelism = 1 (see the sketch below for one way to apply it on MWAA).

Both will solve random task failures without logs if you are running their medium or large instances.
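
Here is a rough sketch of applying that override on an MWAA environment with boto3 (the environment name is a placeholder, and MWAA expects configuration option values as strings):

import boto3

mwaa = boto3.client("mwaa")

# Apply the Airflow configuration override; MWAA rolls it out as an environment update.
mwaa.update_environment(
    Name="MyEnvironmentName",  # placeholder
    AirflowConfigurationOptions={
        "celery.sync_parallelism": "1",
    },
)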

Upvotes: 1

Alban M

Reputation: 351

MWAA uses ECS as a backend, and the way things work is that ECS autoscales the number of workers according to the number of tasks running in the cluster. For a small environment, each worker can handle 5 tasks by default. If there are more than 5 tasks, it scales out another worker, and so on.

We don't do any compute on Airflow (batch, long-running jobs); our DAGs are mainly API requests to other services, which means our DAGs run fast and are short-lived.

From time to time, we can spike to eight or more tasks for a very short period (a few seconds). In that case, autoscaling triggers a scale-out and adds worker(s) to the cluster. Then, since those tasks are only API requests, they execute very quickly and the number of tasks immediately drops back to 0, which triggers a scale-in (removal of worker(s)). If at that exact moment another task is scheduled, Airflow can end up running it on a container that is being removed, and your task gets killed in the middle without any notice (a race condition). You usually see incomplete logs when this happens.

The first workaround is to disable autoscaling by freezing the number of workers in the cluster. You can set the min and max to the appropriate number of workers, which will depend on your workload. Admittedly, we lose the elasticity of the service.

$ aws mwaa update-environment --name MyEnvironmentName --min-workers 2 --max-workers 2

Another solution suggested by AWS is to always have one dummy task running (an infinite loop) so that you never end up scaling in all your workers.
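
For illustration, here is a rough sketch of what such a keep-alive DAG could look like (the DAG/task names and the sleep interval are my own assumptions, not an AWS-provided example):

import time
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def keep_worker_busy():
    # Loop forever so at least one task is always running and the
    # worker count never scales all the way back in.
    while True:
        time.sleep(60)


with DAG(
    dag_id="keep_alive",  # illustrative name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@once",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="keep_worker_busy",
        python_callable=keep_worker_busy,
    )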

AWS told us they are working on a solution to improve the executor.

Upvotes: 9
