Reputation: 9297
I'm writing a fairly complex Go webapp that I want to make highly available. I'm planning on having multiple VMs running the app, with a load-balancer directing traffic between them.
Where it gets complex is that the webapp has a sort of database book-keeping routine running, which I only want (at most) one instance of at any time. So if I have three webapp VMs, only one of them should be doing the book-keeping.
(Yes, I'm aware that I could split the book-keeping off into a separate VM instance entirely, but the code has been rather heavily integrated with the rest of the webapp.)
I've spent several hours looking at things like etcd, raft, bully, memberlist, Pacemaker, and so on. They all seem to require absorbing quite a lot of information to accomplish what I'm after, or I can't see a clear way of using them.
What I would like, in this specific use case, is a system whereby Go webapp nodes automatically detect each other and elect a "leader" to do the book-keeping. Ideally this would scale anywhere from 2 to 10 nodes, and not require manually adding IP addresses to config files (but possible, if necessary).
I was thinking that in the case of a network partition, where one node cannot see the others, I wouldn't want that node to elect itself as leader, because two nodes could end up doing the book-keeping at the same time. That also means that if I stripped the cluster down to a single VM, no book-keeping would occur, but that could be tolerated for a brief period during maintenance, or I could set some sort of flag somewhere.
I'm wondering if someone could point me in the right direction, and ideally explain how I can accomplish this with low complexity, while leveraging existing libraries as much as possible.
Upvotes: 2
Views: 1438
Reputation: 8185
Based on your fault-tolerance and consistency requirements - in particular, preventing split brain during a partition - a consensus algorithm like Raft is most definitely what you want. But even though Raft was designed for understandability, it still requires significant expertise to implement correctly. Therefore, as others have mentioned, you should look into existing services or implementations.
ZooKeeper (ZAB), etcd (Raft), and Consul (Raft) are the most widely used systems for doing things like this. Considering that you want your VMs to scale from 2 to 10 nodes, this is most likely the way you want to go. Raft and other consensus algorithms have quorum requirements that can make scaling in this manner less practical if the algorithm is directly embedded in your VMs. By using an external service, your VMs simply become clients of the consensus service and can scale independently of it.
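For a concrete picture of the client approach, here is a minimal sketch of leader election against etcd using its clientv3 concurrency package. The endpoint address, the election key prefix, and the `runBookkeeping` routine are assumptions for illustration; `Campaign` blocks until this node wins the election, and the session's lease means leadership is automatically released if the node dies or is partitioned away from the etcd cluster.

```go
package main

import (
	"context"
	"log"
	"os"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

// runBookkeeping stands in for your integrated book-keeping routine.
func runBookkeeping(ctx context.Context) {
	for {
		select {
		case <-ctx.Done():
			return
		case <-time.After(time.Minute):
			log.Println("book-keeping pass")
		}
	}
}

func main() {
	// Endpoint is an assumption; point this at your etcd cluster.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"etcd:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// The session holds a lease that is kept alive in the background;
	// if this node dies or is cut off, the lease expires and leadership
	// passes to another candidate.
	session, err := concurrency.NewSession(cli, concurrency.WithTTL(10))
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	hostname, _ := os.Hostname()
	election := concurrency.NewElection(session, "/webapp/bookkeeper")

	// Campaign blocks until this node is elected leader.
	if err := election.Campaign(context.Background(), hostname); err != nil {
		log.Fatal(err)
	}

	// Run book-keeping only while the session (and thus leadership) holds.
	ctx, cancel := context.WithCancel(context.Background())
	go func() {
		<-session.Done() // lease lost: stop book-keeping immediately
		cancel()
	}()
	runBookkeeping(ctx)
}
```

One caveat: an old leader may take up to the lease TTL to notice it has lost its session, so there is a brief window where two nodes could overlap. Keep the book-keeping idempotent or guarded where possible.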
Alternatively, if you don't want to depend on an external service for coordination, the Raft website has an exhaustive list of implementations in various languages, some of which are consensus services and some of which can be embedded. However, note that many of them are incomplete. At a minimum, any Raft implementation suitable for production must implement log compaction. Without log compaction, servers can only operate for a finite amount of time - until the disk fills up with logs.
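If you do embed an implementation, here is a rough sketch of the pattern using hashicorp/raft, one of the Go libraries on that list. It uses in-memory stores and a single bootstrapped node purely to stay self-contained; a production setup would use durable log/stable stores (e.g. raft-boltdb), a network transport, a snapshot store that actually performs the log compaction mentioned above, and the full peer list.

```go
package main

import (
	"io"
	"log"

	"github.com/hashicorp/raft"
)

// noopFSM satisfies raft.FSM. Leader election alone needs no replicated
// state; a real FSM would apply committed log entries here.
type noopFSM struct{}

func (noopFSM) Apply(*raft.Log) interface{}         { return nil }
func (noopFSM) Snapshot() (raft.FSMSnapshot, error) { return nil, nil }
func (noopFSM) Restore(io.ReadCloser) error         { return nil }

func main() {
	cfg := raft.DefaultConfig()
	cfg.LocalID = raft.ServerID("node-1")

	// In-memory stores keep the demo self-contained; production needs
	// durable stores plus snapshots so compaction can reclaim disk.
	store := raft.NewInmemStore()
	snaps := raft.NewInmemSnapshotStore()
	addr, transport := raft.NewInmemTransport("")

	r, err := raft.NewRaft(cfg, noopFSM{}, store, store, snaps, transport)
	if err != nil {
		log.Fatal(err)
	}

	// Bootstrap a one-node cluster; a real cluster lists every peer.
	r.BootstrapCluster(raft.Configuration{
		Servers: []raft.Server{{ID: cfg.LocalID, Address: addr}},
	})

	// LeaderCh delivers true/false on leadership gain/loss; gate the
	// book-keeping goroutine on it.
	for isLeader := range r.LeaderCh() {
		if isLeader {
			log.Println("became leader: start book-keeping")
		} else {
			log.Println("lost leadership: stop book-keeping")
		}
	}
}
```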
Upvotes: 2