Synchronize actions in a distributed system

Question

What techniques/tools can be used to implement a distributed system with these requirements:

At a given time, the system can be in one of 3 states: SYNCING, COMPUTING, or IDLE.
Each node in the system can receive two instructions: sync() and compute().
A sync() instruction will be sent to all nodes at once. Upon getting a sync() instruction, if the system is IDLE, each node should sync its local cache with the database, and system state is changed to SYNCING. When all nodes finish syncing, the system state is changed to IDLE. In case of a node failure, the system state should still change to IDLE as soon as all the alive nodes finish syncing.
Upon getting a compute() instruction, if the system is not SYNCING, a node will run some computation and the system state should be changed to COMPUTING. When the computation is finished, or in case of a node failure, if no other computation is in progress, the state should change to IDLE.

Nipun Talukdar · Accepted Answer

ZooKeeper is a good option for synchronizing these actions. Consider the following approach. There is a /sync znode whose data is updated with the current timestamp (or some other new value) each time a sync needs to be triggered. In other words, a master node updates the value of the /sync znode to trigger actions on the worker nodes.

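The worker nodes put a watch on the /sync znode for data changes, so every time the master node updates /sync, the workers are notified and refresh their local caches. Standard ZooKeeper watches fire only once, so a worker has to set the watch again each time it is notified.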
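The trigger and the watch described above could be sketched as follows. This is a minimal sketch, assuming the kazoo Python client: `zk` is assumed to be an already started `kazoo.client.KazooClient`, and `trigger_sync`, `watch_sync` and `refresh_cache` are hypothetical names used only for illustration. kazoo's `DataWatch` re-registers the underlying watch for you.

```python
import time

SYNC_PATH = "/sync"  # znode name taken from the answer


def trigger_sync(zk, now_ms=None):
    """Master side: write a fresh timestamp to /sync to notify the workers.

    `zk` is assumed to be a started kazoo.client.KazooClient.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    token = str(now_ms).encode("ascii")
    zk.ensure_path(SYNC_PATH)  # create /sync (with empty data) if it is missing
    zk.set(SYNC_PATH, token)   # the data change fires the workers' watches
    return token


def watch_sync(zk, refresh_cache):
    """Worker side: call refresh_cache(token) whenever /sync holds a token.

    `refresh_cache` is a hypothetical callback that reloads the local cache.
    """
    @zk.DataWatch(SYNC_PATH)
    def _on_change(data, stat):
        # kazoo calls this once at registration and again on each change;
        # data is None while the znode does not exist and b"" right after
        # ensure_path created it, so both cases are skipped here.
        if data:
            refresh_cache(data)
```

Note that with this sketch a worker that starts after a sync was triggered sees the last token immediately and refreshes its cache once on startup, which may or may not be what you want.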

Worker nodes also register themselves under the /workers znode by creating an ephemeral znode, named with a UUID, under /workers. The znodes are ephemeral so that if a worker node or process dies, its znode disappears. Each worker puts a watch on the children of /workers, so it is notified when a new worker comes up or an existing worker disappears. Each worker also puts a data-change watch on every ephemeral znode under /workers.
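Registration and the membership watch might look like the sketch below, again assuming the kazoo Python client with `zk` as a started `KazooClient`. The function names and the `on_membership_change` callback are hypothetical.

```python
import uuid

WORKERS_PATH = "/workers"  # parent znode name taken from the answer


def register_worker(zk, worker_id=None):
    """Create an ephemeral znode for this worker and return its path.

    ZooKeeper deletes the znode when the worker's session ends, so a dead
    worker disappears from /workers without any cleanup code.
    """
    worker_id = worker_id or str(uuid.uuid4())
    path = f"{WORKERS_PATH}/{worker_id}"
    # makepath=True also creates /workers if it does not exist yet.
    return zk.create(path, b"", ephemeral=True, makepath=True)


def watch_workers(zk, on_membership_change):
    """Call on_membership_change(sorted_ids) when workers join or leave."""
    zk.ensure_path(WORKERS_PATH)  # the parent must exist before watching it

    @zk.ChildrenWatch(WORKERS_PATH)
    def _on_children(children):
        # kazoo calls this once at registration and on every child change.
        on_membership_change(sorted(children))
```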

Now how everything works:

1. The master updates /sync with the current timestamp to trigger a sync on the worker nodes.
2. All workers are notified of the data change on the /sync znode and read its new data.
3. The workers sync their caches from the database.
4. Each worker updates its own znode under /workers. For example, the worker with id 4dc1efd2-01c8-11e5-bee1-08002791d032 updates the znode /workers/4dc1efd2-01c8-11e5-bee1-08002791d032. The data written to the worker-specific znode is "synced_at_timestamp" (e.g. synced_at_1432451046000), using the timestamp read from /sync so that all workers write the same value.
5. Whenever any worker updates its znode, all other workers are notified.
6. The workers keep checking the current data on all znodes under /workers. When every worker znode under /workers holds the same synced_at_timestamp data, the workers switch to the IDLE state.
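The completion check in the last step can be sketched as a small function over the data read from the worker znodes. This is a sketch under assumptions: the helper names are hypothetical, the marker format follows the "synced_at_" example above, and `zk` in `read_worker_data` is assumed to be a started kazoo `KazooClient`.

```python
def expected_marker(sync_token: bytes) -> bytes:
    """Data a worker writes to its znode once it has synced for sync_token."""
    return b"synced_at_" + sync_token


def all_workers_synced(worker_data: dict, sync_token: bytes) -> bool:
    """True when every registered worker carries the marker for sync_token.

    worker_data maps worker id -> bytes stored on /workers/<id>. Dead
    workers have no znode, so only live workers are counted.
    """
    marker = expected_marker(sync_token)
    return bool(worker_data) and all(v == marker for v in worker_data.values())


def read_worker_data(zk, workers_path="/workers"):
    """Collect the current data of every child znode under workers_path."""
    data = {}
    for child in zk.get_children(workers_path):
        # A worker can vanish between the two calls; kazoo then raises
        # kazoo.exceptions.NoNodeError, which real code should catch and skip.
        value, _stat = zk.get(f"{workers_path}/{child}")
        data[child] = value
    return data
```

A worker would call `all_workers_synced(read_worker_data(zk), token)` from its watch callbacks and switch to IDLE when it returns true.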

There may be many other possible approaches. If you are familiar with Memcached, Redis, Hazelcast, etc., you could also use them to build such a system.
