NDP

Reputation: 123

RabbitMQ autoscaling cluster in AWS: managing scaling events

I'm deploying RabbitMQ (3.8.0) in AWS using an autoscaling group and a load balancer (for management console web access and message routing).

The peer-discovery-aws plugin seems to work great for getting new nodes to join the cluster automatically (though we could only get the autoscaling-group discovery mode to work).

The target-group health check will mark an instance unhealthy when it stops responding on the AMQP port (5672), and the auto scaling group will then terminate and replace it. This is great when one node goes down. A scaling policy can also be used to grow the cluster when CPU utilization goes above a certain threshold (though I don't know whether that's a realistic use case).

We have run into a few scenarios in testing that I'm not sure how to resolve:

  1. We had an issue where the RabbitMQ application died on every node of the cluster. The scaling group killed a non-responding instance and replaced it with a new one, but the new instance could not auto-join the cluster because the remaining instance was also dead. When the auto scaling group then killed that last instance, all of the stored messages were lost, since the replacement had never joined. So problem 1 is: how do I persist the data when the last cluster node is terminated by a scaling event? (My instinct is to store messages on a separate EBS volume and attach it on startup, but when the instance gets terminated the volume goes away with it rather than being attached to a new instance. I have the feeling I'm trying to re-invent the HA wheel that Erlang is supposed to provide.)

  2. The other (related) issue: after I restarted the service with rabbitmqctl start_app and shut down the new instance, the scaling group replaced that new instance (because the old instance was now passing its health check). For whatever reason, though, the replacement could not join the cluster. The instance that had been terminated was still "remembered" as a cluster node, so when the second new instance came up and tried to join, the application crashed on the original node (which is strange behavior, but I reproduced it three times). So problem 2 is: when a cluster node is terminated by the auto scaling group, how can I automatically remove that node from the cluster so the replacement node can auto-join? (It seems like I should write a service that runs on each cluster node and uses the AWS CLI to continuously monitor the state of the other cluster nodes, and when one shows as "terminated" (a state that only lasts a few minutes before the instance disappears), run the remove-node command for the node that went down - a rough sketch of what I mean is below this list.) Or am I re-inventing the Erlang wheel again?

  3. The final problem: we initially deploy a new cluster and its resources with Terraform. The RabbitMQ nodes start in an "initialized" state, configured with no queues or exchanges, just the admin user and default password, etc. We export the JSON config file from the existing cluster, import it into the new cluster, then change the cluster name back to its original value. When we're ready to decommission the old cluster, we simply shovel the messages across and swap the DNS names of the load balancers, and we're able to do blue-green deploy upgrades this way. The problem: when a new node is instantiated by autoscaling, it sits in that initial state until it joins the cluster, and the load balancer will route to it anyway, because it doesn't know whether the node is in the RabbitMQ cluster or not. No producer or consumer can authenticate to the "initial-state" node, which is good. But the web console will alternately route to an old node or a new one, and the admin passwords differ between the two (i.e. the cluster can't be managed until all nodes have joined). Problem 3 is: I need a way to tell the Amazon ALB NOT to route 15672 (web console) traffic to nodes that haven't joined the cluster yet. It seems like this should be a simple problem, but it is causing havoc for our administrators.
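
To make #2 concrete, the kind of watcher I'm imagining is roughly the sketch below (boto3 rather than the raw AWS CLI; the region, the tag filter, and the rabbit@<short-hostname> node naming are all assumptions about my own setup):

    #!/usr/bin/env python3
    """Rough sketch: poll EC2 for cluster members that are shutting down or
    terminated and forget them on the local RabbitMQ node. The tag filter,
    region and node naming are illustrative, not definitive."""
    import subprocess
    import time

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")                # assumed region
    CLUSTER_FILTER = {"Name": "tag:Cluster", "Values": ["rabbitmq"]}  # assumed tag
    POLL_SECONDS = 60

    def dying_cluster_hostnames():
        """Yield private DNS names of tagged instances that are going away."""
        resp = ec2.describe_instances(
            Filters=[
                CLUSTER_FILTER,
                {"Name": "instance-state-name",
                 "Values": ["shutting-down", "terminated"]},
            ]
        )
        for reservation in resp["Reservations"]:
            for instance in reservation["Instances"]:
                if instance.get("PrivateDnsName"):   # empty once the instance is fully gone
                    yield instance["PrivateDnsName"]

    def forget(private_dns):
        # peer-discovery-aws names nodes rabbit@<short hostname> by default (assumption)
        node = "rabbit@" + private_dns.split(".")[0]
        subprocess.run(["rabbitmqctl", "forget_cluster_node", node], check=False)

    if __name__ == "__main__":
        while True:
            for hostname in dying_cluster_hostnames():
                forget(hostname)
            time.sleep(POLL_SECONDS)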

Upvotes: 0

Views: 1101

Answers (1)

Shahad

Reputation: 781

Disclaimer: I've never used RabbitMQ or set up autoscaling with it.

1)
- You can use EFS as a shared secondary data store [1]. EFS does have somewhat worse performance than EBS, so you'll need to test whether it works for RabbitMQ specifically.
- You can also have a secondary EBS volume attached to each instance, but this introduces a lot of annoying issues that you'd have to script for (EBS volumes are specific to an AZ, race conditions where the health-check-driven replacement comes up before the old instance has gone away and released its volume, AZ failures where all instances get moved out of an AZ, etc.).
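
If you do go the secondary-EBS route, the attach-at-boot piece would look roughly like this (boto3; the tag, device name and data-dir mount are assumptions, and it shows where the AZ scoping bites, since only a volume already in the instance's AZ can be attached):

    """Sketch: attach a pre-created, tagged EBS data volume to this instance at
    boot. Tag name, device path and the flag semantics are assumptions."""
    from urllib.request import urlopen

    import boto3

    METADATA = "http://169.254.169.254/latest/meta-data/"
    instance_id = urlopen(METADATA + "instance-id").read().decode()
    az = urlopen(METADATA + "placement/availability-zone").read().decode()

    ec2 = boto3.client("ec2", region_name=az[:-1])   # e.g. us-east-1a -> us-east-1

    # EBS volumes are AZ-scoped: only detached volumes in *this* AZ are candidates.
    volumes = ec2.describe_volumes(
        Filters=[
            {"Name": "tag:Purpose", "Values": ["rabbitmq-data"]},   # assumed tag
            {"Name": "availability-zone", "Values": [az]},
            {"Name": "status", "Values": ["available"]},
        ]
    )["Volumes"]

    if not volumes:
        # The race / AZ-failure cases mentioned above end up here and need handling.
        raise SystemExit("no detached data volume available in " + az)

    ec2.attach_volume(
        VolumeId=volumes[0]["VolumeId"],
        InstanceId=instance_id,
        Device="/dev/sdf",   # then wait for attachment and mount as the RabbitMQ data dir
    )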

2)
- You can add a Terminating lifecycle hook [2] to the group. This keeps the instance in a Terminating:Wait state for a short period after the ASG decides to terminate it but before it's actually terminated, and fires an event when that happens. You can have a CloudWatch Event rule trigger on that lifecycle hook and invoke something like a Lambda function or an SSM runbook that runs the command to deregister the instance from the cluster.
- If you don't need the instance to still be running when you deregister it, there's also a CloudWatch event emitted whenever an ASG instance terminates successfully [3], so you could trigger off that without setting up a lifecycle hook on the ASG at all.
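
A minimal sketch of the Lambda piece of that flow, assuming a CloudWatch Event rule on the "EC2 Instance-terminate Lifecycle Action" event, the SSM agent on the RabbitMQ nodes, and rabbitmqctl forget_cluster_node as the deregistration command (all assumptions about your setup, since I haven't run RabbitMQ myself):

    """Sketch: Lambda handler fired by the terminate lifecycle-hook event.
    It asks a surviving node (via SSM Run Command) to forget the dying node,
    then lets the ASG finish terminating it. Node naming is an assumption."""
    import boto3

    ec2 = boto3.client("ec2")
    ssm = boto3.client("ssm")
    autoscaling = boto3.client("autoscaling")

    def handler(event, context):
        detail = event["detail"]
        dying_id = detail["EC2InstanceId"]
        asg_name = detail["AutoScalingGroupName"]

        # Short hostname of the dying instance -> rabbit@<hostname> (default naming).
        dying = ec2.describe_instances(InstanceIds=[dying_id])
        private_dns = dying["Reservations"][0]["Instances"][0]["PrivateDnsName"]
        dying_node = "rabbit@" + private_dns.split(".")[0]

        # Any other in-service instance in the same ASG can run the forget command.
        group = autoscaling.describe_auto_scaling_groups(
            AutoScalingGroupNames=[asg_name])["AutoScalingGroups"][0]
        survivors = [i["InstanceId"] for i in group["Instances"]
                     if i["InstanceId"] != dying_id
                     and i["LifecycleState"] == "InService"]

        if survivors:
            ssm.send_command(
                InstanceIds=[survivors[0]],
                DocumentName="AWS-RunShellScript",
                Parameters={"commands": [
                    "rabbitmqctl forget_cluster_node " + dying_node]},
            )

        # Tell the ASG it can go ahead and terminate the instance.
        autoscaling.complete_lifecycle_action(
            LifecycleHookName=detail["LifecycleHookName"],
            AutoScalingGroupName=asg_name,
            LifecycleActionToken=detail["LifecycleActionToken"],
            LifecycleActionResult="CONTINUE",
        )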

3)
- ALB just released a feature aimed at people doing blue/green deployments: you can connect two target groups to a single listener action and put weights on how much traffic goes to each one [4]. Put the old instances/ASG in one target group and the new instances/ASG in another, and don't send traffic to the new nodes until they're up and running.
- Alternatively, if you want this behaviour in general and not just during deployments, take advantage of the ELB health checks. Set up a custom health check path pointing at a page that doesn't exist by default. Once the instance is configured and you want to send it traffic (I assume an automated process is doing this?), create a file at that path so the instance starts passing health checks and the ALB starts routing to it. (I'm assuming you currently have two target groups associated with the ASG, one for each of the two ports/types of traffic, each with its own health check.) When doing this, make sure the ASG's health check grace period is long enough that instances don't get terminated for being unhealthy before they've joined.
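
For the health-check-path trick, the node-side piece could be as small as the sketch below: wait until the local node actually shows up as clustered, then create the file the 15672 target group's custom health check points at (the flag path, whatever serves it, and the rabbitmqctl cluster_status --formatter json output format are all assumptions about the setup):

    """Sketch: only start passing the management-console target group's health
    check once this node has joined the cluster. The flag path and the thing
    serving it are assumptions about the local setup."""
    import json
    import pathlib
    import subprocess
    import time

    # Path the custom ALB health check for port 15672 is pointed at (assumption).
    HEALTH_FLAG = pathlib.Path("/var/www/health/clustered")

    def joined_cluster():
        """True once this node reports more than one running cluster member.
        (The very first node of a brand-new cluster would need special-casing.)"""
        result = subprocess.run(
            ["rabbitmqctl", "cluster_status", "--formatter", "json"],
            capture_output=True, text=True,
        )
        if result.returncode != 0:
            return False
        status = json.loads(result.stdout)
        return len(status.get("running_nodes", [])) > 1

    if __name__ == "__main__":
        while not joined_cluster():
            time.sleep(10)
        HEALTH_FLAG.parent.mkdir(parents=True, exist_ok=True)
        HEALTH_FLAG.touch()   # health check starts passing; ALB begins routing 15672 here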

[1] https://docs.aws.amazon.com/efs/latest/ug/mount-fs-auto-mount-onreboot.html

[2] https://docs.aws.amazon.com/autoscaling/ec2/userguide/lifecycle-hooks.html

[3] https://docs.aws.amazon.com/autoscaling/ec2/userguide/cloud-watch-events.html#terminate-successful

[4] https://aws.amazon.com/blogs/aws/new-application-load-balancer-simplifies-deployment-with-weighted-target-groups/

Upvotes: 2
