Reputation: 1462
(NOTE: I've modified my original post since I've now collected a little more data.)
Something just started happening today after several weeks of no issues and I can't think of anything I changed that would've caused this.
I have a Spring Boot app living behind an NGINX proxy (all Dockerized), all of which is in an AWS ECS Fargate cluster.
After deployment, I'm noticing that--randomly (as in, sometimes this doesn't happen)--a call to the services being served up by Spring Boot will 503 (behind the NGINX proxy). It seems to do this on every second deployment, with a subsequent deployment fixing the matter, i.e. calls to the server will succeed for awhile (maybe a few seconds; maybe a few minutes), but then stop.
I looked at the "HealthyHostCount" and I noticed that when I get the 503 and my main target group says it has no registered/healthy hosts in either AZ. I'm not sure what would cause the TG to deregister a target, especially since a subsequent deployment seems to "fix" the issue.
Any insight/help would be greatly appreciated.
Thanks in advance.
UPDATE It looks as though it seems to happen right after I "terminate original task set" from the Blue/Green deployment from CodeDeploy. I'm wondering if it's an AZ issue, i.e. I haven't specified enough tasks to run on them both.
Upvotes: 0
Views: 3674
Reputation: 11
If anyone else had this issue after creating a new environment for an existing app, make sure you configure your security groups. My environment was timing out (and had bad health) because my web pages did not have access to the database inbound rule on AWS.
You can see if this is your issue if you are able to connect to urls in your app that do not connect to the same web services/databases.
Upvotes: 0
Reputation: 1462
It turns out it was an issue with how I'd configured one of my target groups and/or the ALB, although what the exact issue was, I couldn't determine as I re-created the TGs and the service from scratch.
If I had to guess, I'd guess my TG wasn't configured with the right port, or the ALB had a bad rule/listener.
Upvotes: 0
Reputation: 3765
I think it likely your services are failing the health check on the TargetGroup.
When the TargetGroup determines that an existing target is unhealthy, it will unregister it, which will cause ECS to then launch a task to replace it. In the meantime you'll typically see Gateway errors like this.
On the ECS page, click your service, then click on the Events tab... if I am correct, here you'd see messages about why the task was stopped like 'unhealthy' or 'health check timeout'.
The cause can be myriad. An overloaded cluster (probably not the case with Fargate), not enough resources, a runaway request that eats all the memory or CPU.
You case smells like a slow startup issue. ECS web services have a 'Health Check Grace Period'... that is, a time to wait before health checks start. If your container takes too long to get started when first launched, the first couple of health checks may fail and cause the tasks to 'cycle'. A new deployment may be slower if your images are particularly large, because the ECS host has to download the image. Fargate, in general, is a bit slower than EC2 hosts because of overhead in how it sets up networking. If the slow startup is your problem, you can try increasing this grace period, but should also investigate how to speed your container startup time (decrease image size, other nginx tweaks to speed intialization, etc).
Upvotes: 1