Reputation: 1569
I have the following situation:
m4.large EC2 instance running RHEL6
I have a Lambda function that logs all EC2 state changes across the VPC as follows:
'use strict';
exports.handler = (event, context, callback) => {
    console.log('LogEC2InstanceStateChange');
    console.log('Received event:', JSON.stringify(event, null, 2));
    callback(null, 'Finished');
};
I also have another Lambda function, written in Java, that tries to start EC2 instances based on a schedule. It is a lot of code, but the core of it is something like this:
public void handleRequest(Object input, Context context) {
    final List<String> instancesToStart = getInstancesToStart(); // implementation not shown
    try {
        // toArray(new String[0]) yields a String[]; casting the no-arg toArray() result would throw ClassCastException
        StartInstancesRequest startRequest = new StartInstancesRequest()
                .withInstanceIds(instancesToStart.toArray(new String[0]));
        context.getLogger().log("StartInstancesRequest: " + startRequest.toString());
        StartInstancesResult res = ec2.startInstances(startRequest);
        context.getLogger().log("StartInstancesResult: " + res.toString());
    } catch (Exception e) {
        logException(e); // logs the stack trace string via context.getLogger().log
    }
}
The instancesToStart list is populated with instance IDs like i-0abcdef1234567890.
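A side note on converting instancesToStart, offered as a sketch rather than the actual scheduler code: if the IDs are handed over as a String[], casting the result of the no-argument List.toArray() fails at runtime for an ArrayList, because that method returns Object[]. Passing a typed array to toArray avoids this. The class and method names below (InstanceIdArrays, toIdArray, castFails) are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, for illustration only (not part of the scheduler).
class InstanceIdArrays {

    // Safe conversion: passing a typed array makes toArray return a String[].
    static String[] toIdArray(List<String> ids) {
        return ids.toArray(new String[0]);
    }

    // Returns true if the cast-based conversion throws for the given list.
    static boolean castFails(List<String> ids) {
        try {
            // The no-arg toArray() of an ArrayList returns Object[], so this cast throws.
            String[] unused = (String[]) ids.toArray();
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        List<String> ids = new ArrayList<>();
        ids.add("i-0abcdef1234567890");
        System.out.println(Arrays.toString(toIdArray(ids)));
        System.out.println("cast fails: " + castFails(ids));
    }
}
```

Since the worker logs further down show a populated request, the real code presumably already converts the list correctly; this only matters if the snippet is copied as written.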
I create the Lambda functions and all required IAM roles, etc. using CloudFormation. Here is the bit describing the role/permissions for the Java-based Lambda function that does the work:
Resources:
  EC2SchedulerRole:
    Type: 'AWS::IAM::Role'
    Properties:
      AssumeRolePolicyDocument:
        Version: 2012-10-17
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - lambda.amazonaws.com
            Action:
              - 'sts:AssumeRole'
      Path: /
  EC2SchedulerPolicy:
    DependsOn:
      - EC2SchedulerRole
    Type: 'AWS::IAM::Policy'
    Properties:
      PolicyName: ec2-scheduler-role
      Roles:
        - !Ref EC2SchedulerRole
      PolicyDocument:
        Version: 2012-10-17
        Statement:
          - Effect: Allow
            Action:
              - 'logs:*'
            Resource:
              - 'arn:aws:logs:*:*:*'
          - Effect: Allow
            Action:
              - 'ec2:DescribeInstanceAttribute'
              - 'ec2:DescribeInstanceStatus'
              - 'ec2:DescribeInstances'
              - 'ec2:StartInstances'
              - 'ec2:StopInstances'
              - 'ec2:DeleteTags'
            Resource:
              - '*'
What ends up happening is, according to the CloudWatch logs from the first function (the script that logs instance state transitions), we get:
Received event:
{
    "version": "0",
    "id": "<guid>",
    "detail-type": "EC2 Instance State-change Notification",
    "source": "aws.ec2",
    "account": "12345678",
    "time": "2019-06-20T19:01:35Z",
    "region": "us-east-1",
    "resources": [
        "arn:aws:ec2:us-east-1:12345678:instance/i-0abcdef12345678"
    ],
    "detail": {
        "instance-id": "i-0abcdef12345678",
        "state": "pending"
    }
}
Received event:
{
    "version": "0",
    "id": "<guid>",
    "detail-type": "EC2 Instance State-change Notification",
    "source": "aws.ec2",
    "account": "12345678",
    "time": "2019-06-20T19:01:37Z",
    "region": "us-east-1",
    "resources": [
        "arn:aws:ec2:us-east-1:12345678:instance/i-0abcdef12345678"
    ],
    "detail": {
        "instance-id": "i-0abcdef12345678",
        "state": "stopping"
    }
}
Received event:
{
    "version": "0",
    "id": "<guid>",
    "detail-type": "EC2 Instance State-change Notification",
    "source": "aws.ec2",
    "account": "12345678",
    "time": "2019-06-20T19:01:37Z",
    "region": "us-east-1",
    "resources": [
        "arn:aws:ec2:us-east-1:12345678:instance/i-0abcdef12345678"
    ],
    "detail": {
        "instance-id": "i-0abcdef12345678",
        "state": "stopped"
    }
}
And according to the CloudWatch logs from the "worker" function (the function that actually tries to start the instances), we get:
StartInstancesRequest: {InstanceIds: [i-0abcdef12345678],}
StartInstancesResult: {StartingInstances: [{CurrentState: {Code: 0,Name: pending},InstanceId: i-0abcdef12345678,PreviousState: {Code: 80,Name: stopped}}]}
So from the perspective of the Java-based Lambda that does the work, it seems to do everything needed to issue the start command; but when the EC2 instance actually tries to start, it goes from "pending" to "stopping" to "stopped". If it didn't have permission, it wouldn't even get that far, right?
If it were an issue with the instance itself (e.g. hardware), I would expect that manually starting it using the AWS Console would fail. But it doesn't fail. It succeeds when started manually!
So what's happening? How do I diagnose this further? Is it permissions or is the instance screwed up?
I'm 99% sure this isn't due to a lack of available capacity in the AZ, because whenever I start the instance manually it always works. It's not an ephemeral issue or something that only started recently. It has persisted for several months: manual starting works 100% of the time, and script-based starting works 0% of the time.
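One possible way to surface this failure from inside the worker, sketched under assumptions: the "pending" in StartInstancesResult only shows that the request was accepted, so the worker could poll the instance state afterwards and treat anything other than "running" as a failed start. In the sketch below the state lookup is abstracted as a Supplier<String>; in real code it would presumably wrap a DescribeInstances call, whose response also carries the StateReason and StateTransitionReason fields that may explain the stop. StartVerifier, waitWhilePending and startSucceeded are hypothetical names, not part of the AWS SDK.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.function.Supplier;

// Hypothetical helper, for illustration only; not part of the AWS SDK.
class StartVerifier {

    // Polls until the state is no longer "pending" or maxAttempts lookups were made.
    // currentState stands in for a DescribeInstances lookup (an assumption).
    // Returns the last state observed.
    static String waitWhilePending(Supplier<String> currentState, int maxAttempts, long delayMillis) {
        String state = currentState.get();
        for (int attempt = 1; attempt < maxAttempts && "pending".equals(state); attempt++) {
            try {
                Thread.sleep(delayMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt and stop polling
                return state;
            }
            state = currentState.get();
        }
        return state;
    }

    // A start only succeeded if the instance actually reached "running".
    static boolean startSucceeded(String finalState) {
        return "running".equals(finalState);
    }

    public static void main(String[] args) {
        // Replays the state sequence seen in the state-change log.
        Iterator<String> observed = Arrays.asList("pending", "stopping", "stopped").iterator();
        String finalState = waitWhilePending(observed::next, 10, 0L);
        System.out.println("final state: " + finalState + ", started: " + startSucceeded(finalState));
    }
}
```

Logging the final state (and, in real code, the state reason) from the worker would put the failure next to the StartInstancesResult line instead of only in the separate state-change log.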
Upvotes: 0
Views: 1047
Reputation: 1336
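Booting from the EBS volumes might be the issue. As you mentioned, the EC2 instance has 3 EBS volumes with KMS encryption. The role that starts the instance needs KMS permission (kms:CreateGrant) on the key that encrypts those volumes. The resource below is a placeholder; use the full ARN of your key: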
{
    "Sid": "GrantAccess",
    "Effect": "Allow",
    "Action": "kms:CreateGrant",
    "Resource": "arn:aws:kms:::key/1234"
}
Upvotes: 5
Reputation: 990
Try this policy and see if it works. If it does, the problem is with your original policy:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": "arn:aws:logs:*:*:*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "ec2:Start*",
                "ec2:Stop*"
            ],
            "Resource": "*"
        }
    ]
}
Upvotes: 0