Krishna M

Reputation: 1195

Zookeeper Working

I am learning about ZooKeeper. I have read that ZooKeeper is very useful when a cluster contains thousands of machines. I have a few questions.

I tried reading the following link to understand it:
Explaining Apache ZooKeeper

I have the following questions:

1) How is ZooKeeper helpful when handling thousands of machines in a cluster?
2) How does ZooKeeper solve the distributed synchronization problem?
3) How exactly does ZooKeeper solve the centralized configuration problem?

Upvotes: 1

Views: 851

Answers (1)

JensG

Reputation: 13421

How is ZooKeeper helpful when handling thousands of machines in a cluster?

There are a lot of possible use cases for ZooKeeper. The most prominent ones are surely:

  • service registry
  • configuration store
  • distributed locking
  • distributed notification services (using watchers)
  • and more ...

You can run as many ZooKeeper instances as required for your particular use case. Each of your 1000 machines and/or programs in your cluster connects to one of them.
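To illustrate how many clients spread over a few ZooKeeper instances, here is a minimal sketch. It assumes the usual comma-separated `host:port` connect string and models each client simply picking one server at random. The host names and port are borrowed from the sample configuration quoted further down, and `parse_connect_string` and `pick_server` are hypothetical helper names, not part of any ZooKeeper client library.

```python
import random
from collections import Counter


def parse_connect_string(connect_string):
    """Split 'host1:port1,host2:port2' into a list of (host, port) tuples."""
    servers = []
    for part in connect_string.split(","):
        host, _, port = part.strip().rpartition(":")
        servers.append((host, int(port)))
    return servers


def pick_server(servers, rng):
    """Model a client choosing one ensemble member to connect to."""
    return rng.choice(servers)


if __name__ == "__main__":
    servers = parse_connect_string("zoo1:2181,zoo2:2181,zoo3:2181")
    rng = random.Random(42)  # fixed seed so repeated runs give the same split
    # Simulate 1000 clients, each connecting to exactly one instance.
    load = Counter(pick_server(servers, rng) for _ in range(1000))
    for (host, port), clients in sorted(load.items()):
        print(f"{host}:{port} serves {clients} clients")
```

In a real deployment the client library handles server selection and reconnects; the sketch only shows that the load ends up shared between the instances.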

Of course, the ZooKeeper instances have to be configured accordingly in order to work together properly as a cluster. This is from the ZooKeeper website:

Clustered (Multi-Server) Setup

For reliable ZooKeeper service, you should deploy ZooKeeper in a cluster known as an ensemble. As long as a majority of the ensemble are up, the service will be available. Because Zookeeper requires a majority, it is best to use an odd number of machines. For example, with four machines ZooKeeper can only handle the failure of a single machine; if two machines fail, the remaining two machines do not constitute a majority. However, with five machines ZooKeeper can handle the failure of two machines.
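The majority rule in the quoted paragraph comes down to simple arithmetic. As a small sketch (the function names are made up for illustration), the following computes the quorum size and the number of failures an ensemble can survive:

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    if ensemble_size < 1:
        raise ValueError("ensemble must contain at least one server")
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many servers may fail while a majority is still up."""
    return ensemble_size - quorum_size(ensemble_size)


if __name__ == "__main__":
    for n in (3, 4, 5, 6, 7):
        print(f"{n} servers: quorum {quorum_size(n)}, "
              f"tolerates {tolerated_failures(n)} failure(s)")
```

Four servers tolerate one failure, exactly like three, which is why an even ensemble size buys no extra fault tolerance.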

[...]

4. Create a configuration file. This file can be called anything. Use the following settings as a starting point:

tickTime=2000
dataDir=/var/zookeeper/
clientPort=2181
initLimit=5
syncLimit=2
server.1=zoo1:2888:3888
server.2=zoo2:2888:3888
server.3=zoo3:2888:3888

Every machine that is part of the ZooKeeper ensemble should know about every other machine in the ensemble. You accomplish this with the series of lines of the form server.id=host:port:port. The parameters host and port are straightforward. You attribute the server id to each machine by creating a file named myid, one for each server, which resides in that server's data directory, as specified by the configuration file parameter dataDir.

5. The myid file consists of a single line containing only the text of that machine's id. So myid of server 1 would contain the text "1" and nothing else. The id must be unique within the ensemble and should have a value between 1 and 255.

[...]
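The `server.id=host:port:port` lines and the myid rules quoted above can be sanity-checked before starting the ensemble. The following is a rough sketch, not an official tool: `parse_servers` and `myid_contents` are hypothetical helpers, and the regular expression only covers the simple line format shown in the quoted sample.

```python
import re
from typing import Dict, Tuple

SAMPLE_CFG = """\
tickTime=2000
dataDir=/var/zookeeper/
clientPort=2181
initLimit=5
syncLimit=2
server.1=zoo1:2888:3888
server.2=zoo2:2888:3888
server.3=zoo3:2888:3888
"""

# Matches only the simple form server.<id>=<host>:<port>:<port>.
SERVER_LINE = re.compile(r"^server\.(\d+)=([^:\s]+):(\d+):(\d+)$")


def parse_servers(cfg_text: str) -> Dict[int, Tuple[str, int, int]]:
    """Return {server_id: (host, first_port, second_port)} from config text."""
    servers: Dict[int, Tuple[str, int, int]] = {}
    for raw in cfg_text.splitlines():
        match = SERVER_LINE.match(raw.strip())
        if not match:
            continue  # skip every line that is not a server.<id> entry
        server_id = int(match.group(1))
        if not 1 <= server_id <= 255:
            raise ValueError(f"server id {server_id} is outside 1..255")
        if server_id in servers:
            raise ValueError(f"server id {server_id} is defined twice")
        servers[server_id] = (
            match.group(2), int(match.group(3)), int(match.group(4)))
    return servers


def myid_contents(server_id: int) -> str:
    """Text for the myid file: the id and nothing else."""
    return str(server_id)


if __name__ == "__main__":
    for server_id, (host, _, _) in sorted(parse_servers(SAMPLE_CFG).items()):
        print(f"{host}: myid should contain {myid_contents(server_id)!r}")
```

Each server would then get its own myid file inside its dataDir, containing the text returned by `myid_contents`.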

How does ZooKeeper solve the distributed synchronization problem?

ZooKeeper ensures consistency throughout the cluster by using a special protocol (ZooKeeper Atomic Broadcast, ZAB) to reach consensus on changes to the internal data structure and to publish the results. Although not exactly equivalent, it can roughly be imagined as a variant of the Paxos protocol. This protocol keeps the data consistent at every point in time, which allows for easy and safe failover, e.g. when one particular ZooKeeper instance crashes.

Technically, ZAB is not Paxos, but the difference does not really matter if you just want to use ZooKeeper.
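To make the idea of majority agreement more concrete, here is a deliberately simplified toy model. It is not ZAB: it leaves out leader election, recovery of servers that rejoin, and persistence, and the names `Replica`, `ToyLeader` and `txn_id` are invented for this sketch. It only shows that a write is applied when a majority of replicas acknowledges it, and rejected otherwise.

```python
class Replica:
    """One server holding a copy of the data (toy model)."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.log = []   # committed (txn_id, key, value) entries, in order
        self.data = {}

    def accept(self, entry):
        """Acknowledge a proposal; a crashed replica never answers."""
        return self.up

    def commit(self, entry):
        _, key, value = entry
        self.log.append(entry)
        self.data[key] = value


class ToyLeader:
    """Orders writes and commits them once a majority has acknowledged."""

    def __init__(self, replicas):
        self.replicas = replicas
        self.next_txn_id = 1

    def write(self, key, value):
        entry = (self.next_txn_id, key, value)
        acks = [r for r in self.replicas if r.accept(entry)]
        if len(acks) <= len(self.replicas) // 2:
            return False  # no majority: the write is not applied anywhere
        self.next_txn_id += 1
        for replica in acks:
            replica.commit(entry)
        return True


if __name__ == "__main__":
    replicas = [Replica(f"zk{i}") for i in range(1, 6)]
    leader = ToyLeader(replicas)
    print(leader.write("/config/db", "host-a"))  # all five replicas are up
    replicas[3].up = replicas[4].up = False      # two of five crash
    print(leader.write("/config/db", "host-b"))  # three of five still ack
    replicas[2].up = False                       # a third one crashes
    print(leader.write("/config/db", "host-c"))  # only two of five remain
```

Because every replica that is up applies the same entries in the same order, a client can fail over to another live instance and still see the same data. In the real protocol, a server that was down catches up before serving clients again; the toy model skips that step.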

How exactly does ZooKeeper solve the centralized configuration problem?

By not having a single, centralized copy of your (configuration) data. Although each ZooKeeper client program typically connects to one particular ZooKeeper instance, all data is always replicated across the instances of the ensemble.

Upvotes: 3
