Georgi Koemdzhiev

Reputation: 11931

ECS Task fails with InsufficientFreeAddressesInSubnet error when running my State Machine

I have a state machine with a Map state that starts a lot of Fargate tasks (30+), all using a very similar task definition. The only differences between the tasks are the environment variables in the ContainerOverrides block.

Task Definition:

"CalculateTask": {
    "Type": "Task",
    "Resource": "arn:aws:states:::ecs:runTask.sync",
    "Retry": [
        {
            "ErrorEquals": [
                "States.ALL"
            ],
            "IntervalSeconds": 10,
            "MaxAttempts": 2,
            "BackoffRate": 1.5
        }
    ],
    "Parameters": {
        "LaunchType": "FARGATE",
        "Cluster": "arn:aws:ecs:region:111111111:cluster/cluster-name",
        "TaskDefinition": "arn:aws:ecs:region:111111111:task-definition/task-definition:44",
        "NetworkConfiguration": {
            "AwsvpcConfiguration": {
                "Subnets": [
                    "subnet-1111111111111111","subnet-2222222222222222","subnet-3333333333333333"
                ],
                ...
            }
        },
        "Overrides": {
            "ContainerOverrides": [
                {
                    "Name": "Phase-1-start",
                    "Environment": [
                        {
                            "Name": "COMMAND",
                            "Value": "calculateGas/Oil/PeakGas..."
                        }
                    ]
                }
            ]
        }
    }
}

When I run my state machine, the tasks keep failing with this StoppedReason:

"StopCode": "TaskFailedToStart",
    "StoppedAt": 1618584363236,
    "StoppedReason": "Unexpected EC2 error while attempting to Create Network Interface with public IP assignment 
    enabled in subnet 'subnet-2222222222222222': InsufficientFreeAddressesInSubnet",

I don't understand why this issue occurs; I am supplying three subnet IDs for ECS to choose from.
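For reference, a minimal boto3 sketch to check how many free addresses each of those subnets actually reports (the subnet IDs are the placeholders from above and the region is an assumption):

import boto3

# Hypothetical diagnostic: how many free IPs does each subnet have left?
# Every Fargate task using awsvpc networking consumes one ENI/IP from its subnet.
ec2 = boto3.client("ec2", region_name="eu-west-1")  # region is an assumption

response = ec2.describe_subnets(
    SubnetIds=[
        "subnet-1111111111111111",
        "subnet-2222222222222222",
        "subnet-3333333333333333",
    ]
)

for subnet in response["Subnets"]:
    print(subnet["SubnetId"], subnet["AvailableIpAddressCount"])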

Upvotes: 2

Views: 6569

Answers (1)

Eric Rizzi

Reputation: 33

I had the exact same issue. The root cause ended up being that the Fargate tasks I started with run_task were, for some reason, not terminating properly. They ended up in an "INACTIVE" state and hung around for months. Because they never terminated, they never released their IP addresses back to the subnet, so new tasks couldn't get an IP and failed to start.

To fix it, I had to do the following (a scripted version of the same cleanup is sketched after the list):

  1. Log into the AWS console
  2. Go to the ECS service
  3. Click on the Clusters page
  4. Click on the offending cluster (likely the one with a bunch of Running Tasks)
  5. Click on the Tasks tab
  6. Select all [INACTIVE] instances
  7. Click Stop to stop the tasks
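
For reference, here is a rough boto3 sketch of the same cleanup done programmatically. The cluster name, region, and the 30-day "stuck" threshold are assumptions; adjust the filter to however you identify tasks that should no longer be running.

import datetime

import boto3

CLUSTER = "cluster-name"               # assumption: your cluster name
REGION = "eu-west-1"                   # assumption: your region
MAX_AGE = datetime.timedelta(days=30)  # assumption: anything older is "stuck"

ecs = boto3.client("ecs", region_name=REGION)
now = datetime.datetime.now(datetime.timezone.utc)

# Walk every task the cluster still considers RUNNING and stop the ones that
# have clearly been hanging around too long (each one holds an ENI, i.e. an IP).
paginator = ecs.get_paginator("list_tasks")
for page in paginator.paginate(cluster=CLUSTER, desiredStatus="RUNNING"):
    if not page["taskArns"]:
        continue
    for task in ecs.describe_tasks(cluster=CLUSTER, tasks=page["taskArns"])["tasks"]:
        started_at = task.get("startedAt")
        if started_at and now - started_at > MAX_AGE:
            ecs.stop_task(
                cluster=CLUSTER,
                task=task["taskArn"],
                reason="Cleaning up stuck task holding a subnet IP",
            )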

In addition to cleaning up these inactive instances, I added some extra code/alarming to make sure that this issue wouldn't go undetected:

import logging
import time

import boto3

_LOGGER = logging.getLogger(__name__)

def invoke_fargate(cw_metrics, cluster_name, YOUR_ARGS_HERE):
    # get_aws_region() and cw_metrics are helpers from my own codebase.
    client = boto3.client("ecs", region_name=get_aws_region())
    response = client.run_task(YOUR_CODE_HERE)

    # Honestly not sure if this is required...better safe than sorry?
    _LOGGER.info("Starting to sleep to allow `run_task` a chance to kick off the container")
    time.sleep(30)

    # Check whether the task we just launched actually started.
    task_arn = response["tasks"][0]["taskArn"]
    description = client.describe_tasks(cluster=cluster_name, tasks=[task_arn])
    _LOGGER.info("%s", description)

    for status_dict in description["tasks"]:
        if status_dict.get("stopCode") in ["TaskFailedToStart"]:
            cw_metrics.trigger_alarm("FARGATE_INVOCATION_FAILED")
    _LOGGER.info("Done with Fargate invocation")

Upvotes: 1
