nordri

Reputation: 33

Scale In issue with One Task per Host strategy

We are trying to deploy our app in ECS using the strategy "One task per host" because we use host networking rather than Docker's.

We start with 1 task ~ 1 host, and as (let's say) CPU rises we see new container instances being added to the cluster as new tasks are deployed onto those instances.

Then, when the CPU drops, we see the scheduler begin to destroy instances and containers.

Everything is marvelous, but sometimes the scheduler destroys an instance and a task running on a different instance, so at some point our cluster ends up with 1 instance and 0 tasks. What we want is a way to always destroy the container instance together with the container running inside it.

I see "Termination policy" on Instances but there's nothing similiar with containers.

We are working with scaling groups for both container instances and containers, from 1 to 5, and a CPU-based metric, so as the CPU grows the instances and containers grow 1 to 1, and we want them to be destroyed in the same order.

Is that possible?

Upvotes: 1

Views: 574

Answers (1)

Patrick

Reputation: 3230

I know this is necro'ing this post somewhat, but let me provide an answer in case you're still looking, or in case someone finds this post with the same problem.

Yes, this is possible, but not with the normal "packing" strategies of ECS. ECS can essentially either binpack containers or round robin them - that's about the extent of its intelligence. It's also not smart enough to know that it should terminate containers on the oldest instance (which is usually the one terminated by an ASG when it scales in). So how to handle the issue?

Well, the trick is to let the ASG service handle the scaling completely. Create your scale-up rules in CloudWatch and have them trigger a Lambda function. Within that Lambda function, increase the desired count on both your ASG and your ECS service. Yes, ECS will attempt to place the task immediately (which will fail since you have no available instance with enough CPU/memory), but it will succeed as soon as the ASG finishes placing the instance.
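Here's a minimal sketch of that scale-up Lambda with boto3. The ASG, cluster, and service names are hypothetical placeholders; adjust them to your setup.

    import boto3

    autoscaling = boto3.client("autoscaling")
    ecs = boto3.client("ecs")

    ASG_NAME = "my-asg"        # hypothetical Auto Scaling group name
    CLUSTER = "my-cluster"     # hypothetical ECS cluster name
    SERVICE = "my-service"     # hypothetical ECS service name


    def handler(event, context):
        # Read the current desired capacity of the ASG.
        asg = autoscaling.describe_auto_scaling_groups(
            AutoScalingGroupNames=[ASG_NAME]
        )["AutoScalingGroups"][0]
        new_desired = asg["DesiredCapacity"] + 1

        # Grow the ASG by one instance...
        autoscaling.set_desired_capacity(
            AutoScalingGroupName=ASG_NAME,
            DesiredCapacity=new_desired,
        )

        # ...and grow the ECS service by one task. The new task stays
        # PENDING until the fresh instance registers with the cluster.
        ecs.update_service(
            cluster=CLUSTER,
            service=SERVICE,
            desiredCount=new_desired,
        )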

Scaling up is easy, scaling down is a little trickier. So you create your scale-down rule and have it remove an ASG instance when the CPU usage is low. You don't configure a scale-down rule on your ECS cluster. At all. When the ASG terminates an instance, it will fire a lifecycle hook saying "Instance id X is attempting to terminate," and you can listen for that notification with SQS (which you will configure to kick off a Lambda). When you catch that notification in Lambda, you use boto3 to get the list of tasks associated with that instance ID and stop those specific tasks.
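A sketch of that first half of the scale-down Lambda, assuming the lifecycle notification arrives via SQS and its message body carries an EC2InstanceId field; the cluster name is again a hypothetical placeholder.

    import json
    import boto3

    ecs = boto3.client("ecs")

    CLUSTER = "my-cluster"  # hypothetical ECS cluster name


    def handler(event, context):
        for record in event["Records"]:
            message = json.loads(record["body"])
            instance_id = message["EC2InstanceId"]

            # Resolve the EC2 instance ID to its ECS container instance ARN.
            container_instances = ecs.list_container_instances(
                cluster=CLUSTER,
                filter=f"ec2InstanceId == '{instance_id}'",
            )["containerInstanceArns"]

            # Stop every task still running on that instance.
            for ci_arn in container_instances:
                task_arns = ecs.list_tasks(
                    cluster=CLUSTER,
                    containerInstance=ci_arn,
                )["taskArns"]
                for task_arn in task_arns:
                    ecs.stop_task(
                        cluster=CLUSTER,
                        task=task_arn,
                        reason="Instance scaling in",
                    )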

As soon as that task is dead, you then set the desired count on your service to desired - 1, which will prevent the service from attempting to rebuild the task.
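And the second half of the same Lambda, continuing the sketch above: drop the service's desired count by one so ECS doesn't reschedule the stopped task, then (an assumption on my part, since completing the hook is the standard way to let the instance actually terminate) complete the lifecycle action.

    def finish_scale_in(message):
        # Current desired count of the ECS service.
        service = ecs.describe_services(
            cluster=CLUSTER, services=[SERVICE]  # SERVICE is a hypothetical name
        )["services"][0]

        # Reduce the desired count so the service does not rebuild the stopped task.
        ecs.update_service(
            cluster=CLUSTER,
            service=SERVICE,
            desiredCount=max(service["desiredCount"] - 1, 0),
        )

        # Tell the ASG it can go ahead and terminate the instance.
        autoscaling.complete_lifecycle_action(
            LifecycleHookName=message["LifecycleHookName"],
            AutoScalingGroupName=message["AutoScalingGroupName"],
            LifecycleActionToken=message["LifecycleActionToken"],
            LifecycleActionResult="CONTINUE",
        )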

And voila, now the task associated with your dying instance will always be the one that is terminated when you scale down.

Complicated? Yeah - the ASG is really meant to be naive about the cluster it's associated with. This will do what you want though.

It's possible you could simulate the same sort of approach with ECS controlling all the scaling instead of the ASG, but it will be more difficult because you'll have to poll for terminated containers instead of being able to listen for lifecycle hooks.

Upvotes: 1
