how to prove a consensus implementation like multipaxos is right?

Question

I want to prove that my implementation of multi-paxos is right. Are there any valid examples for me to test on? Or there can be some other ways to convince others that my implementation is right.

I tried to find some paper that contained the examples, but most papers just specified the algorithms.

simbo1905 · Accepted Answer

Elastic the company behind Elasticsearch wanted to tighten up on whether they had design bugs. They built TLA+ models of all their algorithms up on GitHub here that prove the algorithms lead to safety. Then they needed to check that their code doesn’t deviate from the model. They wrote a blog about finding and fixing an old bug this way. Such an approach prevents errors of design in that you learn your intended implementation is correct. Then you have to worry about errors of commission which is implementation bugs where you code deviates from the model. Clearly that is a very significant investment of work significantly greater than actually writing the code you are proving.

In contrast if you look at the famous google chubby paper about using Paxos at Google they didnt use formal proofs. They stress tested with tests that injected random message drops and crashes for a very long time looking to shake out bugs. You then have no proof it’s correct only some evidence that there are no observed bugs in thousands of hours of simulations of crashes and network errors. That sort of confidence building exercise is something that is feasible a single person who wrote an implementation can setup and run.

Kyle Kingsbury’s Jepson project shows how he finds and proves bugs in other people’s implementations. He carefully studies what people claim are their safety properties, then designs a test client, and runs the system on vms and injects network partitions, message lose and crashes. He then has a checker that checks all responses seen by all test clients searching for inconsistencies. He has found a lot of bugs in a lot of systems. So companies now hire him to find bugs. If he finds none it is not proof of no bugs only something to make people feel more confident (and usually bugs are found!). Hiring someone who wrote an opensource checker to spend a few months trying to glitch your code is a significant investment. Kyle teaches in-person training course to show you how to run his opensource software and you practice in code finding a bug in an old version of an sql database. I have attended the course and I can highly recommend it.

In the case of writing you own implementation it is a question of how much effort you are going to spend. Paxos is proven to he correct where implementations struggle are all the real world things you need to add to the core algorithm to make a practical system. By way of example you might have a bug in how nodes catch-up after being unreachable for a while. The approach of running experiments that simulate a vast amount of errors for long periods, that validate that all nodes remain constant, and that no client saw an an inconstant state, is probably the most feasible. Checking that all nodes went through the same states is trivial. Proving that no client observed a state that nodes never entered is harder to code. You can use Knassos which is Kyle's opensource checker written in Clojure.

Finally there is an online course by University of Washington with code on GitHub called DSLabs where students must write there own Paxos implementation in a project that links to the universities opensource checker that will check inconsistencies seen by clients during simulated network errors and crashes. As it is all opensource you can use it to check your own implementation. You can read a comsci paper about it titled Teaching rigorous distributed systems with efficient model checking. DSLabs is written in Java so plugging in your own implementation might not be so easy if it isn't written in a jvm language. Then again you can have Java call out to any other processes running in other languages so in theory you can write a Java shim that calls out to your implementation running in another process.

Update: People may be interested in this paper that mentions that the cost of proving an algorithm is correct takes person years and may be ten times larger than the code it proves https://blog.acolyer.org/2019/11/13/scaling-symbolic-evaluation-serval/

how to prove a consensus implementation like multipaxos is right?

Answers (2)

Related Questions