Neil Trodden
Neil Trodden

Reputation: 4728

How do I work out why an ECS health-check is failing?

Outline:

I have a very simple ECS container which listens on port 5000 and writes out HelloWorld, plus the hostname of the instance it is running on. I want to deploy many of these containers using ECS and load balance them just to really learn more about how this works. And it is working to a certain extent but my health check is failing (time out) which is causing the containers tasks to be bounced up and down.

Current configuration:

Routing

Each private subnet has a rule to 10.0.0.0/19, and a default route for 0.0.0.0/0 to the NAT instance in public subnet in the same AZ as it.

Each public subnet has the same 10.0.0.0/19 route and a default route for 0.0.0.0/0 to the internet gateway.

Security groups

My instances are in a group that allows egress to anywhere and ingress on ports 32768 - 65535 from the security group the ALB is in.

The ALB is in a security group that allows ingress on port 80 only but egress to the security group my ECS instances are in on any port/protocol

What happens

When I bring all this up, it actually works - I can take the public dns record of the ALB and refresh and I see responses coming back to me from my container app telling me the hostname. This is exactly what I want to achieve however, it fails the health check and the container is drained, and replaced - with another one that fails the health check. This continues in a cycle, I have never seen a single successful health check.

What I've tried

Summary

I can view my application correctly load-balanced across AZs and private subnets on both its main url '/' and its health check url '/health' through the load balancer, from the ECS instance itself, and by using the instances private IP and port from another machine within the public subnet the ALB is in. The ECS service just cannot see this health check once without timing out. What on earth could I be missing??

Upvotes: 14

Views: 19892

Answers (2)

Neil Trodden
Neil Trodden

Reputation: 4728

For any that follow, I managed to break the app in my container accidentally and it was throwing a 500 error. Crucially though, the health check started reporting this 500 error -> therefore it was NOT a network timeout. Which means that when the health-check contacts the end point in my app, it was not handling the response properly and this appears to be a problem related to Nancy (the api framework I was using) and Go which sometimes reports "Client.Timeout exceeded while awaiting headers" and I am sure ECS is interpreting this as a network time-out. I'm going to tcpdump the network traffic and see what the health-check is sending and Nancy is responding and compare that to a container that works. Perhaps there is a Nancy fix or maybe ECS needs to not be so fussy.

edit:

By simply updating all the nuget packages that my Nancy app was using to the latest available and suddenly everything started working!

Upvotes: 3

Polymath
Polymath

Reputation: 125

More questions than answers. but maybe they will take you in the right direction.

You say that you can access the container app via the ALB, but then the node fails the heath check. The ALB should not be allowing connection to the node until it's health check succeeds. So if you are connecting to the node via the ALB, then the ALB must have tested and decided it was healthy. Is it a different health check that is killing the node ?

Have you check CloudTrail to see if it has any clues about what is triggering the tear-down ? Is the tear down being triggered by the ALB or the auto scaling group? Might it be that auto scaling group has the wrong scale-in criteria ?

Good luck

Upvotes: 0

Related Questions