Reputation: 797
We built an API app that is deployed as an ECS Fargate task (awsvpc network mode). It makes internal calls to an on-premises server. The task is hosted in a VPC with a private subnet and connects to the on-premises server via a transit gateway.
To simplify the scenario, we ran tests directly against the private IP of the task. When the test involved only a few calls, the task responded correctly. The test is just a simple POST call with little data going back and forth.
However, when we increased the number of virtual users to 260 and ramped up the test (i.e., a performance test), we observed that after about a minute, the internal calls from the task to the on-premises server started timing out. (The call usually takes only 3 seconds, and the timeout is set to 1 minute.)
We saw timeout errors in the app / container logs. While some requests were processed successfully, others failed due to timeouts.
For the requests that failed with timeout errors, we confirmed that the on-premises server never received them. We also confirmed that the on-premises server has no performance problems.
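Since the failed requests never reach the on-premises server, it may help to check whether the timeout happens during the TCP handshake rather than while waiting for a response. The following is only a sketch of such a probe, meant to be run from inside the task; `probe_connect` is a hypothetical helper, and the host and port passed to it would be the on-premises server's address, which is not shown here.

```python
import socket
import time


def probe_connect(host: str, port: int, timeout_s: float = 5.0) -> dict:
    """Attempt a single TCP handshake and report how it ended and how long it took."""
    started = time.monotonic()
    try:
        # create_connection returns only after the handshake completes.
        with socket.create_connection((host, port), timeout=timeout_s):
            outcome = "connected"
    except socket.timeout:
        # No SYN-ACK arrived within timeout_s.
        outcome = "connect_timeout"
    except OSError as exc:
        # Refused, unreachable, or out of local resources (e.g. ports).
        outcome = f"error: {exc.__class__.__name__}"
    elapsed = round(time.monotonic() - started, 3)
    return {"outcome": outcome, "elapsed_s": elapsed}
```

Running this in a loop while the load test is active would show whether new connections stall at connect time, which would point toward the network path or connection handling rather than the server's request processing.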
The app's CPU and memory usage remained within normal limits. Even after allocating additional CPU and memory, the issue persisted.
There are no firewall rules or policies on either the AWS or on-premises side that block the traffic.
We also scaled the ECS service up to a few more tasks and pointed the tests at the load balancer instead of a single container, but the issue persisted.
Some articles mention that a Fargate task has a limit on outbound TCP connections, but the AWS documentation does not mention one.
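One way to test the outbound connection theory from inside the container is to count the task's TCP sockets by state while the load test runs. The sketch below assumes a Linux container where `/proc/net/tcp` is readable (IPv4 only; `/proc/net/tcp6` has the same layout for IPv6). `count_tcp_states` is a hypothetical helper, and the optional `remote_port` filter would be the on-premises server's port.

```python
from collections import Counter
from typing import Optional

# State codes used in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_tcp_states(proc_net_tcp: str, remote_port: Optional[int] = None) -> Counter:
    """Count sockets per TCP state from the text of /proc/net/tcp."""
    counts: Counter = Counter()
    for line in proc_net_tcp.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) < 4:
            continue
        remote, state = fields[2], fields[3]
        # Addresses look like HEXIP:HEXPORT; the port is plain hexadecimal.
        port = int(remote.rsplit(":", 1)[1], 16)
        if remote_port is not None and port != remote_port:
            continue
        counts[TCP_STATES.get(state.upper(), state)] += 1
    return counts
```

For example, `count_tcp_states(open("/proc/net/tcp").read(), remote_port=443)` would summarize connections to a server on port 443. A large and growing `SYN_SENT` count would suggest handshakes are not completing, while a very large `TIME_WAIT` count would suggest the app opens a new connection per request and may be running short of ephemeral ports.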
I'm not sure whether this is an app configuration issue, an infrastructure issue, or a networking issue.
This is the egress config for the security group, which looks fine to me:
egress {
  from_port   = 0
  to_port     = 0
  protocol    = "-1"
  cidr_blocks = ["0.0.0.0/0"]
}
Could this be caused by an issue in the app code? Thanks.
This is the task definition in JSON (with real account IDs, ARNs, etc. removed):
{
  "taskDefinitionArn": "arn",
  "containerDefinitions": [
    {
      "name": "name",
      "image": "...",
      "cpu": 8192,
      "memory": 32768,
      "portMappings": [
        {
          "containerPort": 5004,
          "hostPort": 5004,
          "protocol": "tcp"
        }
      ],
      "essential": true,
      "environment": [
        ...
      ],
      "mountPoints": [],
      "volumesFrom": [],
      "secrets": [
        ...
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/mygroup",
          "awslogs-region": "us-west-2",
          "awslogs-stream-prefix": "ecs"
        }
      },
      "systemControls": []
    }
  ],
  "family": "myfamily",
  "taskRoleArn": "arn",
  "executionRoleArn": "roleArn",
  "networkMode": "awsvpc",
  "revision": 231,
  "volumes": [],
  "status": "ACTIVE",
  "requiresAttributes": [
    {
      "name": "com.amazonaws.ecs.capability.logging-driver.awslogs"
    },
    {
      "name": "ecs.capability.execution-role-awslogs"
    },
    {
      "name": "com.amazonaws.ecs.capability.ecr-auth"
    },
    {
      "name": "com.amazonaws.ecs.capability.docker-remote-api.1.19"
    },
    {
      "name": "ecs.capability.secrets.asm.environment-variables"
    },
    {
      "name": "ecs.capability.increased-task-cpu-limit"
    },
    {
      "name": "com.amazonaws.ecs.capability.task-iam-role"
    },
    {
      "name": "ecs.capability.execution-role-ecr-pull"
    },
    {
      "name": "com.amazonaws.ecs.capability.docker-remote-api.1.18"
    },
    {
      "name": "ecs.capability.task-eni"
    }
  ],
  "placementConstraints": [],
  "compatibilities": [
    "EC2",
    "FARGATE"
  ],
  "requiresCompatibilities": [
    "FARGATE"
  ],
  "cpu": "8192",
  "memory": "32768",
  "registeredAt": "2024-08-22T18:30:57.544Z",
  "deregisteredAt": "2024-08-22T20:23:45.607Z",
  "registeredBy": "...",
  "tags": [
    ....
  ]
}
Upvotes: 2
Views: 132