x89

Reputation: 3490

Auto scaling for server-side GTM

I have set up server-side GTM using an ECS cluster and services as described here: https://aws-solutions-library-samples.github.io/advertising-marketing/using-google-tag-manager-for-server-side-website-analytics-on-aws.html

I use Snowbridge (by Snowplow) to send data from AWS Kinesis to GTM (which has a Snowplow client installed) using HTTP POST requests.

When the data volume is high, I occasionally get 502 errors from GTM. If I filter the data and reduce the amount being forwarded to GTM, the errors stop. What can I change on the GTM side so that high data volumes are handled reliably?

This is roughly what my GTM configuration looks like:

resource "aws_ecs_cluster" "gtm" {
  name = "gtm"
  setting {
    name  = "containerInsights"
    value = "enabled"
  }
}

resource "aws_ecs_task_definition" "PrimaryServerSideContainer" {
  family                   = "PrimaryServerSideContainer"
  network_mode             = "awsvpc"
  requires_compatibilities = ["FARGATE"]
  cpu                      = 2048
  memory                   = 4096
  execution_role_arn       = aws_iam_role.gtm_container_exec_role.arn
  task_role_arn            = aws_iam_role.gtm_container_role.arn
  runtime_platform {
    operating_system_family = "LINUX"
    cpu_architecture        = "X86_64"
  }
  container_definitions = <<TASK_DEFINITION
  [
  {
    "name": "primary",
    "image": "gcr.io/cloud-tagging-10302018/gtm-cloud-image",
    "environment": [
      {
        "name": "PORT",
        "value": "80"
      },
      {
        "name": "PREVIEW_SERVER_URL",
        "value": "${var.PREVIEW_SERVER_URL}"
      },
      {
        "name": "CONTAINER_CONFIG",
        "value": "${var.CONTAINER_CONFIG}"
      }
    ],
    "cpu": 2048,
    "memory": 4096,
    "essential": true,
    "logConfiguration": {
          "logDriver": "awslogs",
          "options": {
            "awslogs-group": "gtm-primary",
            "awslogs-create-group": "true",
            "awslogs-region": "eu-central-1",
            "awslogs-stream-prefix": "ecs"
          }
        },
    "portMappings" : [
        {
          "containerPort" : 80,
          "hostPort"      : 80
        }
      ]
  }
]
TASK_DEFINITION
}

resource "aws_ecs_task_definition" "PreviewContainer" {
  family                   = "PreviewContainer"
  network_mode             = "awsvpc"
  requires_compatibilities = ["FARGATE"]
  cpu                      = 2048
  memory                   = 4096
  execution_role_arn       = aws_iam_role.gtm_container_exec_role.arn
  task_role_arn            = aws_iam_role.gtm_container_role.arn
  runtime_platform {
    operating_system_family = "LINUX"
    cpu_architecture        = "X86_64"
  }
  container_definitions = <<TASK_DEFINITION
  [
  {
    "name": "preview",
    "image": "gcr.io/cloud-tagging-10302018/gtm-cloud-image",
    "environment": [
      {
        "name": "PORT",
        "value": "80"
      },
      {
        "name": "RUN_AS_PREVIEW_SERVER",
        "value": "true"
      },
      {
        "name": "CONTAINER_CONFIG",
        "value": "${var.CONTAINER_CONFIG}"
      }
    ],
    "cpu": 1024,
    "memory": 2048,
    "essential": true,
    "logConfiguration": {
          "logDriver": "awslogs",
          "options": {
            "awslogs-group": "gtm-preview",
            "awslogs-region": "eu-central-1",
            "awslogs-create-group": "true",
            "awslogs-stream-prefix": "ecs"
          }
        },
    "portMappings" : [
        {
          "containerPort" : 80,
          "hostPort"      : 80
        }
      ]
  }
]
TASK_DEFINITION
}

resource "aws_ecs_service" "PrimaryServerSideService" {
  name             = var.primary_service_name
  cluster          = aws_ecs_cluster.gtm.id
  task_definition  = aws_ecs_task_definition.PrimaryServerSideContainer.id
  desired_count    = var.primary_service_desired_count
  launch_type      = "FARGATE"
  platform_version = "LATEST"

  scheduling_strategy = "REPLICA"

  deployment_maximum_percent         = 200
  deployment_minimum_healthy_percent = 50

  network_configuration {
    assign_public_ip = true
    security_groups  = [aws_security_group.gtm-security-group.id]
    subnets          = module.vpc.private_subnet_ids
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.PrimaryServerSideTarget.arn
    container_name   = "primary"
    container_port   = 80
  }

  lifecycle {
    ignore_changes = [task_definition]
  }
}

resource "aws_ecs_service" "PreviewService" {
  name             = var.preview_service_name
  cluster          = aws_ecs_cluster.gtm.id
  task_definition  = aws_ecs_task_definition.PreviewContainer.id
  desired_count    = var.preview_service_desired_count
  launch_type      = "FARGATE"
  platform_version = "LATEST"

  scheduling_strategy = "REPLICA"

  network_configuration {
    assign_public_ip = true
    security_groups  = [aws_security_group.gtm-security-group.id]
    subnets          = module.vpc.private_subnet_ids
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.PreviewTarget.arn
    container_name   = "preview"
    container_port   = 80
  }

  lifecycle {
    ignore_changes = [task_definition]
  }
}

resource "aws_lb" "PrimaryServerSideLoadBalancer" {
  name               = "PrimaryServerSideLoadBalancer"
  internal           = false
  load_balancer_type = "application"
  security_groups    = [aws_security_group.gtm-security-group.id]
  subnets            = module.vpc.public_subnet_ids

  enable_deletion_protection = false
}

resource "aws_security_group" "gtm-security-group" {
  name        = "gtm-security-group"
  description = "Security Group that allows all traffic for GTM"

  vpc_id = module.vpc.vpc_id

  // Allow all inbound traffic for IPv4
  ingress {
    from_port   = 0
    to_port     = 65535
    protocol    = "tcp"         # All TCP traffic
    cidr_blocks = ["0.0.0.0/0"] # Allow all sources (IPv4)
  }

  // Allow all outbound traffic for IPv4
  egress {
    from_port   = 0
    to_port     = 65535
    protocol    = "tcp"         # All TCP traffic
    cidr_blocks = ["0.0.0.0/0"] # Allow all destinations (IPv4)
  }
}

resource "aws_lb_target_group" "PrimaryServerSideTarget" {
  name        = "PrimaryServerSideTarget"
  port        = 80
  protocol    = "HTTP"
  vpc_id      = module.vpc.vpc_id
  target_type = "ip"

  health_check {
    path = "/healthz"
  }
}

resource "aws_lb_listener" "primarylistener" {
  load_balancer_arn = aws_lb.PrimaryServerSideLoadBalancer.arn
  port              = "443"
  protocol          = "HTTPS"
  ssl_policy        = "ELBSecurityPolicy-2016-08"
  certificate_arn   = aws_acm_certificate.cert.arn

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.PrimaryServerSideTarget.arn
  }
}

// Public subnets

resource "aws_lb" "PreviewLoadBalancer" {
  name               = "PreviewLoadBalancer"
  internal           = false
  load_balancer_type = "application"
  security_groups    = [aws_security_group.gtm-security-group.id]
  subnets            = module.vpc.public_subnet_ids

  enable_deletion_protection = false
}

resource "aws_lb_listener" "previewlistener" {
  load_balancer_arn = aws_lb.PreviewLoadBalancer.arn
  port              = "443"
  protocol          = "HTTPS"
  ssl_policy        = "ELBSecurityPolicy-2016-08"
  certificate_arn   = aws_acm_certificate.cert.arn

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.PreviewTarget.arn
  }
}

resource "aws_lb_target_group" "PreviewTarget" {
  name        = "PreviewTarget"
  port        = 80
  protocol    = "HTTP"
  vpc_id      = module.vpc.vpc_id
  target_type = "ip"

  health_check {
    path = "/healthz"
  }
}

I tried to implement auto scaling like this and kept the thresholds pretty low, but despite that I still occasionally get 502 errors.

resource "aws_appautoscaling_target" "ecs_service_target" {
  max_capacity       = 15
  min_capacity       = 1
  resource_id        = "service/${aws_ecs_cluster.gtm.name}/${aws_ecs_service.PrimaryServerSideService.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"
}

resource "aws_appautoscaling_policy" "ecs_policy" {
  name               = "scale-cpu"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.ecs_service_target.resource_id
  scalable_dimension = aws_appautoscaling_target.ecs_service_target.scalable_dimension
  service_namespace  = aws_appautoscaling_target.ecs_service_target.service_namespace

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }

    target_value       = 40
    scale_in_cooldown  = 1
    scale_out_cooldown = 300
  }
  depends_on = [aws_appautoscaling_target.ecs_service_target]
}

Could it be that the problem is not with GTM but with the Snowplow client installed on GTM?

Upvotes: 0

Views: 288

Answers (1)

James G

Reputation: 2914

It doesn't look like you have any kind of inbound load balancing in front of your ECS service. The document you linked to outlines the deployment of an AWS ECS service that relies on the built-in AWS ECS load balancing.

Native ECS load balancing is not container-aware and does not route requests in any deterministic way (it is not round-robin, latency-aware, or connection-count-aware). It is therefore possible that a request lands on an already very busy container, which would lead to a timeout or other error.

Your idea of adding autoscaling is a good one, although it too has limitations. According to the AWS Auto Scaling documentation:

Amazon ECS sends metrics in 1-minute intervals to CloudWatch. Metrics are not available until the clusters and services send the metrics to CloudWatch, and you cannot create CloudWatch alarms for metrics that do not exist.

Therefore your cluster can sit under a higher load than you intend for up to a minute before the autoscaler identifies the condition and begins launching a new container to increase throughput.

You will still need to wait for the autoscaler to launch the container, and for it to become healthy, before it begins serving requests and contributing capacity. For these reasons it may be some time before you get the benefit of the newly scaled-out container, and very spiky load can mean the extra throughput never arrives fast enough.
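As an illustration of the "become healthy" step (my addition, not something this answer prescribes), the time before a new task starts receiving traffic is partly governed by the target group health check. Assuming the /healthz endpoint from the question, tightening the check on the existing PrimaryServerSideTarget target group would let the ALB mark new tasks healthy sooner; the values below are examples only:

resource "aws_lb_target_group" "PrimaryServerSideTarget" {
  name        = "PrimaryServerSideTarget"
  port        = 80
  protocol    = "HTTP"
  vpc_id      = module.vpc.vpc_id
  target_type = "ip"

  health_check {
    path              = "/healthz"
    interval          = 10 # check every 10 seconds instead of the 30-second default
    timeout           = 5  # must be shorter than the interval
    healthy_threshold = 2  # two passing checks before the task receives traffic
  }
}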

It is also important to note that the metric you are monitoring is ECSServiceAverageCPUUtilization, i.e. the average utilisation across all tasks. The average of containers running at 100%, 100%, 0% and 0% is 50%, even though two of those containers are completely saturated. This problem gets worse as the number of containers grows.
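One way to avoid scaling purely on an average CPU figure (my suggestion, not part of the original answer) is to also target-track request volume per task with the predefined ALBRequestCountPerTarget metric. A rough sketch, reusing the scaling target, load balancer and target group names from the question; the target_value is illustrative and needs tuning to your traffic:

resource "aws_appautoscaling_policy" "ecs_request_count_policy" {
  name               = "scale-request-count"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.ecs_service_target.resource_id
  scalable_dimension = aws_appautoscaling_target.ecs_service_target.scalable_dimension
  service_namespace  = aws_appautoscaling_target.ecs_service_target.service_namespace

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ALBRequestCountPerTarget"
      # ties the metric to this load balancer / target group pair
      resource_label = "${aws_lb.PrimaryServerSideLoadBalancer.arn_suffix}/${aws_lb_target_group.PrimaryServerSideTarget.arn_suffix}"
    }

    target_value       = 1000 # example: scale out once a task handles ~1000 requests per interval
    scale_in_cooldown  = 300
    scale_out_cooldown = 60
  }
}

Multiple target tracking policies can be attached to the same scalable target; Application Auto Scaling scales out if any of them calls for it and scales in only when all of them do.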

Due to the aforementioned delay in scaling, coupled with the tricky "average" CPU usage metric being monitored, it is very likely your system will choke on spiky load.

Most large-scale systems account for this by "over-provisioning": ensuring there is enough excess capacity to absorb an unexpected increase in load until the scaling can kick in. You could try increasing your desired count and see whether it reduces the incidence of these errors. Collecting detailed metrics from your containers, to see whether any of them hit very high CPU or RAM usage or elevated error rates, will also help you identify the conditions under which the errors occur and tune your scaling policies.
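If you want to experiment with over-provisioning, the simplest lever in the configuration from the question is the minimum capacity of the scaling target (together with the service's desired count). A minimal sketch, with purely illustrative numbers:

resource "aws_appautoscaling_target" "ecs_service_target" {
  # keep a buffer of always-running tasks so spikes are absorbed
  # while the autoscaler catches up (example values, not recommendations)
  min_capacity       = 4
  max_capacity       = 15
  resource_id        = "service/${aws_ecs_cluster.gtm.name}/${aws_ecs_service.PrimaryServerSideService.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"
}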

Another option is to use more sophisticated load balancing as described in the ECS documentation for adding Load Balancing to your service.

Upvotes: 2
