Reputation: 1962
I'm having a problem with a new AWS load balancer, an ECR repository, and an ECS cluster, service, and task that I'm creating with Terraform. Everything is created without errors. There are some IAM roles and certificates in a separate file; the relevant definitions are below. What's happening is that the ECS service creates a task, but the task shuts down immediately after it starts. I'm not seeing any logs in the CloudWatch log group at all. In fact, it's never even created.
It makes sense to me that this whole thing would fail the first time I run the infrastructure, because the ECR repository is brand new and doesn't have any Docker image pushed to it. But I've since pushed the image, and the service never starts a task again. I would have imagined it would loop indefinitely, retrying the task after each failure, but it does not.
I have forced it to restart by destroying the service and recreating it, which I would expect to work given that there's now an image to run. But it shows the same behavior as the initial startup: the service creates one task, which fails to start with no logs explaining why, and then it never runs a task again.
Does anyone know what's wrong with this or perhaps where I might be able to see an error?
locals {
  container_name = "tdweb-web-server-container"
}

resource "aws_lb" "web_server" {
  name               = "tdweb-alb"
  internal           = false
  load_balancer_type = "application"
  security_groups    = [aws_security_group.lb_sg.id]
  subnets = [
    aws_subnet.subnet_a.id,
    aws_subnet.subnet_b.id,
    aws_subnet.subnet_c.id
  ]
}

resource "aws_security_group" "lb_sg" {
  name        = "ALB Security Group"
  description = "Allows TLS inbound traffic"
  vpc_id      = aws_vpc.main.id

  ingress {
    description = "TLS from VPC"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
resource "aws_security_group" "web_server_service" {
name = "Web Sever Service Security Group"
description = "Allows HTTP inbound traffic"
vpc_id = aws_vpc.main.id
ingress {
description = "HTTP from VPC"
from_port = 80
to_port = 80
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"]
}
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
}
resource "aws_alb_listener" "https" {
load_balancer_arn = aws_lb.web_server.arn
port = 443
protocol = "HTTPS"
ssl_policy = "ELBSecurityPolicy-2016-08"
certificate_arn = aws_acm_certificate.main.arn
default_action {
target_group_arn = aws_lb_target_group.web_server.arn
type = "forward"
}
}
resource "random_string" "target_group_suffix" {
length = 4
upper = false
special = false
}
resource "aws_lb_target_group" "web_server" {
name = "web-server-target-group-${random_string.target_group_suffix.result}"
port = 80
protocol = "HTTP"
target_type = "ip"
vpc_id = aws_vpc.main.id
lifecycle {
create_before_destroy = true
}
}
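(A side note while writing this up: the target group above relies on the default ALB health check, GET / on the traffic port. My understanding is that if the container doesn't answer with a 2xx there, the ALB marks targets unhealthy and ECS keeps stopping tasks, so something like this might eventually be needed, with the path adjusted to whatever the app actually serves:)

resource "aws_lb_target_group" "web_server" {
  # ... same arguments as above ...

  health_check {
    path    = "/"       # assumed endpoint; adjust to the app
    matcher = "200-399" # treat redirects as healthy too
  }
}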
resource "aws_iam_role" "web_server_task" {
name = "tdweb-web-server-task-role"
assume_role_policy = data.aws_iam_policy_document.web_server_task.json
}
data "aws_iam_policy_document" "web_server_task" {
statement {
actions = ["sts:AssumeRole"]
principals {
type = "Service"
identifiers = ["ecs-tasks.amazonaws.com"]
}
}
}
resource "aws_iam_role_policy_attachment" "web_server_task" {
for_each = toset([
"arn:aws:iam::aws:policy/AmazonSQSFullAccess",
"arn:aws:iam::aws:policy/AmazonS3FullAccess",
"arn:aws:iam::aws:policy/AmazonDynamoDBFullAccess",
"arn:aws:iam::aws:policy/AWSLambdaInvocation-DynamoDB"
])
role = aws_iam_role.web_server_task.name
policy_arn = each.value
}
resource "aws_ecr_repository" "web_server" {
name = "tdweb-web-server-repository"
}
resource "aws_ecs_cluster" "web_server" {
name = "tdweb-web-server-cluster"
}
resource "aws_ecs_task_definition" "web_server" {
family = "task_definition_name"
task_role_arn = aws_iam_role.web_server_task.arn
execution_role_arn = aws_iam_role.ecs_task_execution.arn
network_mode = "awsvpc"
cpu = "1024"
memory = "2048"
requires_compatibilities = ["FARGATE"]
container_definitions = <<DEFINITION
[
{
"name": "${local.container_name}",
"image": "${aws_ecr_repository.web_server.repository_url}:latest",
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "/ecs/tdweb-task",
"awslogs-region": "us-east-1",
"awslogs-stream-prefix": "ecs"
}
},
"portMappings": [
{
"hostPort": 80,
"protocol": "tcp",
"containerPort": 80
}
],
"cpu": 0,
"essential": true
}
]
DEFINITION
}
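(One thing I notice re-reading this: nothing above actually creates the /ecs/tdweb-task log group that the container definition points at, and as far as I know the awslogs driver doesn't create it on its own, which would explain why the group never appears. Presumably it needs to be declared too, something like:)

# Missing from the config above: the log group the awslogs driver writes to.
resource "aws_cloudwatch_log_group" "web_server" {
  name              = "/ecs/tdweb-task" # must match "awslogs-group" above
  retention_in_days = 30                # arbitrary choice; any retention works
}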
resource "aws_ecs_service" "web_server" {
name = "tdweb-web-server-service"
cluster = aws_ecs_cluster.web_server.id
launch_type = "FARGATE"
task_definition = aws_ecs_task_definition.web_server.arn
desired_count = 1
load_balancer {
target_group_arn = aws_lb_target_group.web_server.arn
container_name = local.container_name
container_port = 80
}
network_configuration {
subnets = [
aws_subnet.subnet_a.id,
aws_subnet.subnet_b.id,
aws_subnet.subnet_c.id
]
assign_public_ip = true
security_groups = [aws_security_group.web_server_service.id]
}
}
Edit: To answer a comment, here are the VPC and subnets:
resource "aws_vpc" "main" {
cidr_block = "172.31.0.0/16"
}
resource "aws_subnet" "subnet_a" {
vpc_id = aws_vpc.main.id
availability_zone = "us-east-1a"
cidr_block = "172.31.0.0/20"
}
resource "aws_subnet" "subnet_b" {
vpc_id = aws_vpc.main.id
availability_zone = "us-east-1b"
cidr_block = "172.31.16.0/20"
}
resource "aws_subnet" "subnet_c" {
vpc_id = aws_vpc.main.id
availability_zone = "us-east-1c"
cidr_block = "172.31.32.0/20"
}
resource "aws_internet_gateway" "main" {
vpc_id = aws_vpc.main.id
}
Edit: This is a somewhat enlightening update. I found this error not in the task logs but in the container details within the task, which I never knew were there:
Status reason CannotPullContainerError: Error response from daemon: Get https://563407091361.dkr.ecr.us-east-1.amazonaws.com/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
It seems as though the service cannot pull the container image from the ECR repository. I don't know how to fix this yet after doing some reading; I'm still looking around.
Upvotes: 2
Views: 2622
Reputation: 238299
Based on the comments, the likely issue is the lack of internet access in the subnets. This can be rectified as follows:
# Route table to connect to the Internet Gateway
resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }
}

resource "aws_route_table_association" "subnet_public_a" {
  subnet_id      = aws_subnet.subnet_a.id
  route_table_id = aws_route_table.public.id
}

resource "aws_route_table_association" "subnet_public_b" {
  subnet_id      = aws_subnet.subnet_b.id
  route_table_id = aws_route_table.public.id
}

resource "aws_route_table_association" "subnet_public_c" {
  subnet_id      = aws_subnet.subnet_c.id
  route_table_id = aws_route_table.public.id
}
Also, you can add depends_on to your aws_ecs_service so that it waits for these associations to be completed before launching tasks.
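A minimal sketch of that, reusing the service and association names from above (only the added argument is shown):

resource "aws_ecs_service" "web_server" {
  # ... all arguments as in the question ...

  # Wait until the public routes actually exist, so the first task
  # doesn't launch in a subnet that still has no internet access.
  depends_on = [
    aws_route_table_association.subnet_public_a,
    aws_route_table_association.subnet_public_b,
    aws_route_table_association.subnet_public_c,
  ]
}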
A shorter alternative for the associations:
locals {
  subnets = [
    aws_subnet.subnet_a.id,
    aws_subnet.subnet_b.id,
    aws_subnet.subnet_c.id
  ]
}

resource "aws_route_table_association" "subnet_public" {
  count          = length(local.subnets)
  subnet_id      = local.subnets[count.index]
  route_table_id = aws_route_table.public.id
}
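If you'd rather not depend on list order, a for_each version keyed by AZ letter is another option (a sketch; note it produces different resource addresses than the count version, so switching later would recreate the associations):

resource "aws_route_table_association" "subnet_public" {
  # Keyed by a stable name instead of a list index, so adding or
  # reordering subnets later doesn't churn the other associations.
  for_each = {
    a = aws_subnet.subnet_a.id
    b = aws_subnet.subnet_b.id
    c = aws_subnet.subnet_c.id
  }

  subnet_id      = each.value
  route_table_id = aws_route_table.public.id
}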
Upvotes: 1