Why won't my AWS ECS service start my task?

Question

I having a problem with a new AWS Load balancer and AWS-ECS repository, cluster, and task I'm creating in AWS with Terraform. Everything is being created without errors. There are some IAM roles and certificates in a separate file. These are the relevant definitions here. What's happening is the ECS service is creating a task, but the task shuts down immediately after it starts. I am not seeing any logs in the Cloudwatch log group at all. In fact it's never even created.

It makes sense to me that this whole thing would fail to run when I first run the infrastructure, because the ECS repository is brand new and doesn't have any Docker image pushed to it. But I've pushed the image and the service never starts again. I would imagine it would infinitely loop trying to start a task after failing but it does not.

I have forced it to restart by destroying the service and then recreating it. That I would expect to work, given that there's now an image to run. It has the same behavior of the initial start up which is that the service creates one task which fails to start with no logs of why and then never runs a task again.

Does anyone know what's wrong with this or perhaps where I might be able to see an error?

locals {
    container_name = "tdweb-web-server-container"
}

resource "aws_lb" "web_server" {
  name = "tdweb-alb"
  internal = false
  load_balancer_type = "application"
  security_groups = [aws_security_group.lb_sg.id]
  subnets = [
    aws_subnet.subnet_a.id,
    aws_subnet.subnet_b.id,
    aws_subnet.subnet_c.id
  ]
}

resource "aws_security_group" "lb_sg" {
  name = "ALB Security Group"
  description = "Allows TLS inbound traffic"
  vpc_id = aws_vpc.main.id

  ingress {
    description = "TLS from VPC"
    from_port = 443
    to_port = 443
    protocol = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port = 0
    to_port = 0
    protocol = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_security_group" "web_server_service" {
  name = "Web Sever Service Security Group"
  description = "Allows HTTP inbound traffic"
  vpc_id = aws_vpc.main.id

  ingress {
    description = "HTTP from VPC"
    from_port = 80
    to_port = 80
    protocol = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port = 0
    to_port = 0
    protocol = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_alb_listener" "https" {
  load_balancer_arn = aws_lb.web_server.arn
  port = 443
  protocol = "HTTPS"
  ssl_policy = "ELBSecurityPolicy-2016-08"
  certificate_arn = aws_acm_certificate.main.arn
  
  default_action {
    target_group_arn = aws_lb_target_group.web_server.arn
    type = "forward"
  }
}

resource "random_string" "target_group_suffix" {
  length  = 4
  upper   = false
  special = false
}

resource "aws_lb_target_group" "web_server" {
  name = "web-server-target-group-${random_string.target_group_suffix.result}"
  port = 80
  protocol = "HTTP"  
    target_type = "ip"
  vpc_id = aws_vpc.main.id
    lifecycle {
    create_before_destroy = true
  }
}

resource "aws_iam_role" "web_server_task" {
  name = "tdweb-web-server-task-role"
  assume_role_policy = data.aws_iam_policy_document.web_server_task.json
}

data "aws_iam_policy_document" "web_server_task" {
  statement {
    actions = ["sts:AssumeRole"]
        principals {
            type = "Service"
            identifiers = ["ecs-tasks.amazonaws.com"]
        }
  }
}

resource "aws_iam_role_policy_attachment" "web_server_task" {
  for_each = toset([
    "arn:aws:iam::aws:policy/AmazonSQSFullAccess",
    "arn:aws:iam::aws:policy/AmazonS3FullAccess",
    "arn:aws:iam::aws:policy/AmazonDynamoDBFullAccess",
    "arn:aws:iam::aws:policy/AWSLambdaInvocation-DynamoDB"
  ])
  role = aws_iam_role.web_server_task.name
  policy_arn = each.value
}

resource "aws_ecr_repository" "web_server" {
  name = "tdweb-web-server-repository"
}

resource "aws_ecs_cluster" "web_server" {
  name = "tdweb-web-server-cluster"
}

resource "aws_ecs_task_definition" "web_server" {
  family = "task_definition_name"
  task_role_arn = aws_iam_role.web_server_task.arn
  execution_role_arn = aws_iam_role.ecs_task_execution.arn
  network_mode = "awsvpc"
  cpu = "1024"
  memory = "2048"
  requires_compatibilities = ["FARGATE"]
    container_definitions = <


Edit: To answer a comment, here's the VPC and subnets
resource "aws_vpc" "main" {
  cidr_block = "172.31.0.0/16"
}

resource "aws_subnet" "subnet_a" {
  vpc_id     = aws_vpc.main.id
  availability_zone = "us-east-1a"
  cidr_block = "172.31.0.0/20"
}

resource "aws_subnet" "subnet_b" {
  vpc_id     = aws_vpc.main.id
  availability_zone = "us-east-1b"
  cidr_block = "172.31.16.0/20"
}

resource "aws_subnet" "subnet_c" {
  vpc_id     = aws_vpc.main.id
  availability_zone = "us-east-1c"
  cidr_block = "172.31.32.0/20"
}

resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id
}

Edit: This is a somewhat enlightening update. I found this error not in the task logs but in the container logs within the task. Which I never knew was there.

Status reason CannotPullContainerError: Error response from daemon:
Get https://563407091361.dkr.ecr.us-east-1.amazonaws.com/v2/:
net/http: request canceled while waiting for connection
(Client.Timeout exceeded while awaiting headers)

It seems as though the service cannot pull the container from the ECR repo. I don't know how to fix this yet after doing some reading. I'm still looking around.

Marcin · Accepted Answer

Based on the comments, a likely issue is the lack of internet access in the subsets. This can be rectified as follows:

# Route table to connect to Internet Gateway

resource "aws_route_table" "public" {

  vpc_id = aws_vpc.main.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }
}

resource "aws_route_table_association" "subnet_public_a" {
  subnet_id      = aws_subnet.subnet_a.id
  route_table_id = aws_route_table.public.id
}

resource "aws_route_table_association" "subnet_public_b" {
  subnet_id      = aws_subnet.subnet_b.id
  route_table_id = aws_route_table.public.id
}


resource "aws_route_table_association" "subnet_public_c" {
  subnet_id      = aws_subnet.subnet_c.id
  route_table_id = aws_route_table.public.id
}

Also you can add depends_on to your aws_ecs_service so that it waits for these attachments to be completed.

A shorter alternative for the associations:

locals {
  subnets = [aws_subnet.subnet_a.id, 
             aws_subnet.subnet_b.id,
             aws_subnet.subnet_c.id]
}

resource "aws_route_table_association" "subnet_public_b" {

  count          = length(local.subnets)
  
  subnet_id      = local.subnets[count.index]
  route_table_id = aws_route_table.public.id
}

Why won't my AWS ECS service start my task?

Answers (1)

Related Questions

Why won&#39;t my AWS ECS service start my task?

Answers (1)

Related Questions

Why won't my AWS ECS service start my task?