Reputation: 6043
I set a CloudWatch Alarm to add 1 capacity unit to EC2 autoscaling group when memory reservation is > 70%. The Alarm was triggered at the right moment, but it has since been in alarm for 16 hours+ with no change at all in the EC2 autoscaling group. What could possibly be going wrong?
Here's my ECS CloudFormation template:
ECSCluster:
Type: AWS::ECS::Cluster
Properties:
ClusterName: !Ref EnvironmentName
ECSAutoScalingGroup:
DependsOn: ECSCluster
Type: AWS::AutoScaling::AutoScalingGroup
Properties:
VPCZoneIdentifier: !Ref Subnets
LaunchConfigurationName: !Ref ECSLaunchConfiguration
MinSize: !Ref ClusterMinSize
MaxSize: !Ref ClusterMaxSize
DesiredCapacity: !Ref ClusterDesiredCapacity
CreationPolicy:
ResourceSignal:
Timeout: PT15M
UpdatePolicy:
AutoScalingRollingUpdate:
MinInstancesInService: 1
MaxBatchSize: 1
PauseTime: PT15M
SuspendProcesses:
- HealthCheck
- ReplaceUnhealthy
- AZRebalance
- AlarmNotification
- ScheduledActions
WaitOnResourceSignals: true
ScaleUpPolicy:
Type: AWS::AutoScaling::ScalingPolicy
Properties:
AdjustmentType: ChangeInCapacity
AutoScalingGroupName: !Ref ECSAutoScalingGroup
Cooldown: '1'
ScalingAdjustment: '1'
MemoryReservationAlarmHigh:
Type: AWS::CloudWatch::Alarm
Properties:
EvaluationPeriods: '2'
Statistic: Average
Threshold: '70'
AlarmDescription: Alarm if Cluster Memory Reservation is too high
Period: '60'
AlarmActions:
- Ref: ScaleUpPolicy
Namespace: AWS/ECS
Dimensions:
- Name: ClusterName
Value: !Ref ECSCluster
ComparisonOperator: GreaterThanThreshold
MetricName: MemoryReservation
ScaleDownPolicy:
Type: AWS::AutoScaling::ScalingPolicy
Properties:
AdjustmentType: ChangeInCapacity
AutoScalingGroupName: !Ref ECSAutoScalingGroup
Cooldown: '1'
ScalingAdjustment: '-1'
MemoryReservationAlarmLow:
Type: AWS::CloudWatch::Alarm
Properties:
EvaluationPeriods: '2'
Statistic: Average
Threshold: '30'
AlarmDescription: Alarm if Cluster Memory Reservation is too Low
Period: '60'
AlarmActions:
- Ref: ScaleDownPolicy
Namespace: AWS/ECS
Dimensions:
- Name: ClusterName
Value: !Ref ECSCluster
ComparisonOperator: LessThanThreshold
MetricName: MemoryReservation
ECSLaunchConfiguration:
Type: AWS::AutoScaling::LaunchConfiguration
Properties:
KeyName: !If [IsProd, !Ref 'AWS::NoValue', !Ref KeyName]
ImageId: !Ref ECSAMI
InstanceType: !Ref InstanceType
SecurityGroups:
- !Ref SecurityGroup
IamInstanceProfile: !Ref ECSInstanceProfile
UserData:
"Fn::Base64": !Sub |
#!/bin/bash
source /etc/profile.d/proxy.sh
yum install -y https://s3.amazonaws.com/ec2-downloads-windows/SSMAgent/latest/linux_amd64/amazon-ssm-agent.rpm
yum install -y https://s3.amazonaws.com/amazoncloudwatch-agent/amazon_linux/amd64/latest/amazon-cloudwatch-agent.rpm
yum install -y aws-cfn-bootstrap hibagent
cat >> /opt/aws/amazon-cloudwatch-agent/etc/common-config.toml <<EOF
[proxy]
http_proxy="${!http_proxy}"
https_proxy="${!https_proxy}"
no_proxy="${!no_proxy}"
EOF
/opt/aws/bin/cfn-init -v --region ${AWS::Region} --stack ${AWS::StackName} --resource ECSLaunchConfiguration
/opt/aws/bin/cfn-signal -e $? --region ${AWS::Region} --stack ${AWS::StackName} --resource ECSAutoScalingGroup
/usr/bin/enable-ec2-spot-hibernation
Metadata:
AWS::CloudFormation::Init:
config:
packages:
yum:
collectd: []
commands:
01_add_instance_to_cluster:
command: !Sub echo ECS_CLUSTER=${ECSCluster} >> /etc/ecs/ecs.config
02_enable_cloudwatch_agent:
command: !Sub /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c ssm:${ECSCloudWatchParameter} -s
files:
/etc/cfn/cfn-hup.conf:
mode: 000400
owner: root
group: root
content: !Sub |
[main]
stack=${AWS::StackId}
region=${AWS::Region}
/etc/cfn/hooks.d/cfn-auto-reloader.conf:
content: !Sub |
[cfn-auto-reloader-hook]
triggers=post.update
path=Resources.ECSLaunchConfiguration.Metadata.AWS::CloudFormation::Init
action=/opt/aws/bin/cfn-init -v --region ${AWS::Region} --stack ${AWS::StackName} --resource ECSLaunchConfiguration
services:
sysvinit:
cfn-hup:
enabled: true
ensureRunning: true
files:
- /etc/cfn/cfn-hup.conf
- /etc/cfn/hooks.d/cfn-auto-reloader.conf
# This IAM Role is attached to all of the ECS hosts. It is based on the default role
# published here:
# http://docs.aws.amazon.com/AmazonECS/latest/developerguide/instance_IAM_role.html
#
# You can add other IAM policy statements here to allow access from your ECS hosts
# to other AWS services. Please note that this role will be used by ALL containers
# running on the ECS host.
ECSRole:
Type: AWS::IAM::Role
Properties:
Path: /
RoleName: !Sub ${EnvironmentName}-ECSRole-${AWS::Region}
AssumeRolePolicyDocument: |
{
"Statement": [{
"Action": "sts:AssumeRole",
"Effect": "Allow",
"Principal": {
"Service": "ec2.amazonaws.com"
}
}]
}
ManagedPolicyArns:
- !Sub "arn:aws:iam::${AWS::AccountId}:policy/CSOPSRestrictionPolicy"
- !Sub "arn:aws:iam::${AWS::AccountId}:policy/HIPIAMRestrictionPolicy"
- !Sub "arn:aws:iam::${AWS::AccountId}:policy/HIPBasePolicy"
- arn:aws:iam::aws:policy/service-role/AmazonEC2RoleforSSM
- arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy
Policies:
- PolicyName: ecs-service
PolicyDocument: |
{
"Statement": [{
"Effect": "Allow",
"Action": [
"ecs:CreateCluster",
"ecs:DeregisterContainerInstance",
"ecs:DiscoverPollEndpoint",
"ecs:Poll",
"ecs:RegisterContainerInstance",
"ecs:StartTelemetrySession",
"ecs:Submit*",
"ecr:BatchCheckLayerAvailability",
"ecr:BatchGetImage",
"ecr:GetDownloadUrlForLayer",
"ecr:GetAuthorizationToken"
],
"Resource": "*"
}]
}
ECSInstanceProfile:
Type: AWS::IAM::InstanceProfile
Properties:
Path: /
Roles:
- !Ref ECSRole
ECSServiceAutoScalingRole:
Type: AWS::IAM::Role
Properties:
AssumeRolePolicyDocument:
Version: "2012-10-17"
Statement:
Action:
- "sts:AssumeRole"
Effect: Allow
Principal:
Service:
- application-autoscaling.amazonaws.com
Path: /
ManagedPolicyArns:
- !Sub "arn:aws:iam::${AWS::AccountId}:policy/CSOPSRestrictionPolicy"
- !Sub "arn:aws:iam::${AWS::AccountId}:policy/HIPIAMRestrictionPolicy"
- !Sub "arn:aws:iam::${AWS::AccountId}:policy/HIPBasePolicy"
Policies:
- PolicyName: ecs-service-autoscaling
PolicyDocument:
Statement:
Effect: Allow
Action:
- application-autoscaling:*
- cloudwatch:DescribeAlarms
- cloudwatch:PutMetricAlarm
- ecs:DescribeServices
- ecs:UpdateService
Resource: "*"
ECSCloudWatchParameter:
Type: AWS::SSM::Parameter
Properties:
Description: CloudWatch Log configs for ECS cluster
Name: !Sub AmazonCloudWatch-${ECSCluster}-ECS
Type: String
Value: !Sub |
{
"logs": {
"force_flush_interval": 5,
"logs_collected": {
"files": {
"collect_list": [
{
"file_path": "/var/log/messages",
"log_group_name": "${ECSCluster}/var/log/messages",
"log_stream_name": "{instance_id}",
"timestamp_format": "%b %d %H:%M:%S"
},
{
"file_path": "/var/log/dmesg",
"log_group_name": "${ECSCluster}/var/log/dmesg",
"log_stream_name": "{instance_id}"
},
{
"file_path": "/var/log/docker",
"log_group_name": "${ECSCluster}/var/log/docker",
"log_stream_name": "{instance_id}",
"timestamp_format": "%Y-%m-%dT%H:%M:%S.%f"
},
{
"file_path": "/var/log/ecs/ecs-init.log",
"log_group_name": "${ECSCluster}/var/log/ecs/ecs-init.log",
"log_stream_name": "{instance_id}",
"timestamp_format": "%Y-%m-%dT%H:%M:%SZ"
},
{
"file_path": "/var/log/ecs/ecs-agent.log.*",
"log_group_name": "${ECSCluster}/var/log/ecs/ecs-agent.log",
"log_stream_name": "{instance_id}",
"timestamp_format": "%Y-%m-%dT%H:%M:%SZ"
},
{
"file_path": "/var/log/ecs/audit.log",
"log_group_name": "${ECSCluster}/var/log/ecs/audit.log",
"log_stream_name": "{instance_id}",
"timestamp_format": "%Y-%m-%dT%H:%M:%SZ"
}
]
}
}
},
"metrics": {
"append_dimensions": {
"AutoScalingGroupName": "${!aws:AutoScalingGroupName}",
"InstanceId": "${!aws:InstanceId}",
"InstanceType": "${!aws:InstanceType}"
},
"metrics_collected": {
"collectd": {
"metrics_aggregation_interval": 60
},
"disk": {
"measurement": [
"used_percent"
],
"metrics_collection_interval": 60,
"resources": [
"/"
]
},
"mem": {
"measurement": [
"mem_used_percent"
],
"metrics_collection_interval": 60
},
"statsd": {
"metrics_aggregation_interval": 60,
"metrics_collection_interval": 10,
"service_address": ":8125"
}
}
}
}
ECSClusterParameter:
Type: AWS::SSM::Parameter
Properties:
Description: !Sub ${EnvironmentName} - ECS Cluster
Name: !Sub /${EnvironmentName}/ecs-cluster
Type: String
Value: !Ref ECSCluster
ECSServiceAutoScalingRoleParameter:
Type: AWS::SSM::Parameter
Properties:
Description: !Sub ${EnvironmentName} - ECS Service ASG Role
Name: !Sub /${EnvironmentName}/ecs-service-asg-role
Type: String
Value: !GetAtt ECSServiceAutoScalingRole.Arn
The Alarm activity history:
2019-12-26 11:40:54 Action Successfully executed action arn:aws:autoscaling:ap-southeast-2:031539715286:scalingPolicy:95e836b6-2f56-498d-b931-7ec4184bedc4:autoScalingGroupName/ECS-UEBZA8GAP8S7-ECSAutoScalingGroup-1BIBTJH5I50W9:policyName/ECS-UEBZA8GAP8S7-ScaleUpPolicy-17LUWE42DC7EO
2019-12-26 11:40:54 State update Alarm updated from OK to In alarm
Upvotes: 0
Views: 934
Reputation: 781
Make sure there aren't any processes suspended. Alarm notification means that incoming alarms won't trigger scaling policies. Launch means even if the desired goes up nothing will be launched
Other common issues that can cause this:
If you're using weights and increasing desired by 1, but the lowest weigh isn't 1, then it might never be able to scale.
Make sure there aren't any other scaling policies being triggered that might override this one
Check the activity history to make sure there aren't any healthcheck replacements constantly happening, since that would start a 5 minute cooldown (default since one isn't set on the ASG, only the scaling policy), and would block simple scaling policies
Make sure the desired isn't already at the Max
In addition to the alarm being triggered, make sure you see in the Alarm history that the autoscaling 'action' happened (The action actually happens every minute the alarm stays in the Alarm state, no mater what your evaluation settings, but only the first one gets posted to the Alarm history)
Check the ASG Activity history for launch failures, this is especially common if using spot instances, and the ASG will eventually enter a backoff state after enough failures. Any manual update to the group will reset this backoff
Upvotes: 1