Reputation: 69
I used a CloudFormation template to create a 3-node cluster in AWS. The EC2 instances are in a private subnet and the ELB is in a public subnet, with the 'AssociatePublicIpAddress' configuration enabled. I configured the security groups to allow the necessary communication between the ELB and the EC2 instances. On initial creation of the stack, the EC2 instances join the cluster and work fine, but when I stop the EC2 instances and later start them again, the nodes do not rejoin the cluster. Any direction toward resolving the issue is appreciated.
[UPDATE]: Here is some additional information:
ELB Type: AWS::ElasticLoadBalancing::LoadBalancer with scheme as "internet-facing"
ASG: Yes using ASG with MinSize=1, MaxSize=3, DesiredCapacity=3
HealthCheck Type: Tried both ELB and EC2, set at the ASG level.
HealthCheck settings: see the details below.
"HealthCheck": {
    "Target": "HTTP:7997/",
    "HealthyThreshold": "2",
    "UnhealthyThreshold": "10",
    "Interval": "60",
    "Timeout": "30"
}
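To see what these settings mean in practice, here is a minimal Python sketch (standard library only) that imitates the check from a machine that can reach the instance: an HTTP GET against port 7997 that counts as healthy only on a 200 response within the timeout. The function names `probe_health` and `worst_case_detection_seconds` are hypothetical, and the detection-time estimate is a rough approximation, not the ELB's exact algorithm.

```python
import http.client


def probe_health(host, port=7997, path="/", timeout=30.0):
    """Return True only if GET http://host:port/path answers 200 in time.

    Approximates the "HTTP:7997/" target above. Redirects are not
    followed, so a 3xx response counts as unhealthy here.
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        return conn.getresponse().status == 200
    except (OSError, http.client.HTTPException):
        # Refused connections, timeouts, and malformed replies end up here.
        return False
    finally:
        conn.close()


def worst_case_detection_seconds(interval=60, unhealthy_threshold=10):
    """Rough time before consecutive failed checks mark a node unhealthy."""
    return interval * unhealthy_threshold
```

With Interval=60 and UnhealthyThreshold=10, `worst_case_detection_seconds()` gives about 600 seconds, so a stopped node may stay "healthy" from the ELB's point of view for roughly ten minutes. Running `probe_health("<instance-private-ip>")` from another host in the VPC shows whether the node answers the check at all after a restart.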
Upvotes: 0
Views: 191
Reputation: 3732
If the EC2 instances are in a private subnet, they may not have connectivity to all the AWS service endpoints that the managed cluster needs to operate. Since they joined on first start, they can evidently talk to DynamoDB. Since they did not rejoin after a stop/start, many things could be wrong. The behaviour doesn't quite make sense if the sample template was used: if you "stop" an EC2 instance, it should be terminated and then replaced by the ASG. If that is not happening, then something unrelated to ML is not configured correctly. Here are several approaches to debugging:
1) Temporarily recreate your cluster with public IP addresses but no other changes and see if that fixes things. If so, connectivity to AWS services is the likely issue, and you may need to create a NAT instance or a VPC endpoint.
2) Determine why stopping an instance does not trigger the ASG to terminate it (enable CloudWatch logs for the ASG and look at its messages).
3) Create an SNS topic, subscribe to it via email or some other method, and provide the SNS topic ARN to the CF script in the indicated parameter. The notifications contain a vast amount of detail that is difficult to find by other means.
4) Look at the log files on all of the EC2 instances, including system logs (/var/log/*).
5) Check the DynamoDB table to see if it is being updated as your instance state changes.
6) Check the behaviour of the cron job that is created on initial install; it should be polling the EC2 status of all nodes in the cluster and updating DynamoDB.
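As a quicker alternative to rebuilding the cluster for step 1, the connectivity question can be tested directly from one of the private-subnet instances. The following is a minimal Python sketch using only the standard library. The helper names are hypothetical, the `service.region.amazonaws.com` hostname pattern is an assumption that holds for the standard commercial regions, and the exact set of services the template depends on is also an assumption based on the ones mentioned above.

```python
import socket


def regional_endpoint(service, region):
    """Build the usual regional hostname, e.g. dynamodb.us-east-1.amazonaws.com.

    Assumption: standard commercial regions; other partitions differ.
    """
    return f"{service}.{region}.amazonaws.com"


def can_reach(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


def check_services(region, services=("dynamodb", "ec2", "sns")):
    """Map each service's regional endpoint to its reachability."""
    hosts = (regional_endpoint(s, region) for s in services)
    return {host: can_reach(host) for host in hosts}
```

Calling `check_services("us-east-1")` (substitute your own region) on an instance that failed to rejoin returns a dictionary of endpoint to True/False. If any entry is False in the private subnet, that points toward missing outbound routing rather than a problem inside the cluster itself. A successful TCP connection only shows network reachability; it does not prove that the instance's IAM role is permitted to call the service.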
Upvotes: 0