Monday, September 29, 2014

ZooKeeper source code analysis

A ZooKeeper server starts up with three ports read from the configuration file: one listens for client requests, and the other two, taken from the server address (ip:port:port), are used for inter-server communication and leader election.
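To make the ip:port:port format concrete, here is a minimal sketch of splitting such an address into its parts. The class and field names are my own (hypothetical), not ZooKeeper's config parser, and it assumes a plain host with exactly two ports, so no IPv6 literals or extra suffixes.

```java
// Hypothetical helper for illustration only; this is not ZooKeeper's own parser.
class ServerAddressSketch {
    final String host;
    final int quorumPort;   // first port: inter-server communication
    final int electionPort; // second port: leader election

    ServerAddressSketch(String host, int quorumPort, int electionPort) {
        this.host = host;
        this.quorumPort = quorumPort;
        this.electionPort = electionPort;
    }

    // Assumes the simplified form "host:port:port" and nothing else.
    static ServerAddressSketch parse(String value) {
        String[] parts = value.split(":");
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected host:port:port, got " + value);
        }
        return new ServerAddressSketch(
                parts[0],
                Integer.parseInt(parts[1]),
                Integer.parseInt(parts[2]));
    }
}
```

For example, ServerAddressSketch.parse("10.0.0.1:2888:3888") yields the host 10.0.0.1 with 2888 for inter-server traffic and 3888 for election traffic.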

When more replicas are added to a ZooKeeper cluster, read throughput increases but write performance drops, because every operation proposed by the leader has to wait until more than half of the replicas ACK it. To resolve this, ZooKeeper introduced a new type of replica, the observer. An observer does everything a follower does except vote. Adding observers to the cluster therefore improves read performance without harming write throughput. Availability, however, still depends only on the number of voting servers, so observers do not make the cluster more fault tolerant. So there are three types of nodes: leader, follower and observer.
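The majority rule is easy to check with a little arithmetic. The sketch below is my own illustration (the class and method names are hypothetical, not ZooKeeper code) and it assumes a simple majority quorum in which only the leader and followers count as voters.

```java
// Hypothetical illustration of majority-quorum arithmetic; not ZooKeeper code.
class QuorumSketch {
    // Smallest number of voters that is more than half of them.
    static int quorumSize(int voters) {
        return voters / 2 + 1;
    }

    // ACKs a write needs; observers are ignored because they do not vote.
    static int acksNeeded(int voters, int observers) {
        return quorumSize(voters);
    }

    // How many voters can fail while a majority is still reachable.
    static int toleratedFailures(int voters) {
        return voters - quorumSize(voters);
    }
}
```

With 5 voters a write needs 3 ACKs and the cluster survives 2 voter failures. Adding 10 observers changes neither number, which is why observers scale reads without slowing writes or improving availability.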


Followers and observers are two different kinds of learner: both extend the Learner class.


If ZooKeeper finds only one server configured when it starts, it runs the standalone version, "ZooKeeperServer". If there are multiple servers, the QuorumPeer thread starts the leader election process. QuorumPeer has four states: looking, following, leading and observing. Leader election is implemented in "FastLeaderElection"; it is often described as based on Fast Paxos, although it is ZooKeeper's own election protocol. After the election, QuorumPeer starts the server class that matches its role and calls the corresponding main method:


leader -> LeaderZooKeeperServer -> lead()
follower -> FollowerZooKeeperServer -> followLeader()
observer -> ObserverZooKeeperServer -> observeLeader()
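
The mapping above can be pictured as a switch on the peer state. This is only a sketch under my own assumptions: the enum and class are hypothetical stand-ins, and the method returns the name of the entry point as a string instead of calling into real server classes.

```java
// Hypothetical sketch of dispatching on peer state; not the real QuorumPeer.
class PeerLoopSketch {
    enum State { LOOKING, FOLLOWING, LEADING, OBSERVING }

    // Returns the name of the method the peer would run in each state.
    static String entryPointFor(State state) {
        switch (state) {
            case LOOKING:
                return "lookForLeader()"; // run the election
            case FOLLOWING:
                return "followLeader()";
            case LEADING:
                return "lead()";
            case OBSERVING:
                return "observeLeader()";
            default:
                throw new IllegalStateException("unknown state " + state);
        }
    }
}
```

A peer starts in the looking state, and once the election settles it moves to one of the other three states and stays in that role's main method until something forces a new election.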

All types of server share the same set of request processors, but each type wires them into a different processor chain.








The actual update operation is done in the FinalRequestProcessor.
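
The processor chain idea can be sketched as a simple chain of responsibility, where each stage does its work and hands the request to the next one. Everything here is my own simplification: the interface, class names and stage names are hypothetical placeholders, and a request is just a string.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a request processor; not ZooKeeper's interface.
interface StageSketch {
    void process(String request);
}

// A stage that records its name, then forwards the request down the chain.
class LoggingStage implements StageSketch {
    private final String name;
    private final StageSketch next; // null for the last stage
    private final List<String> trace;

    LoggingStage(String name, StageSketch next, List<String> trace) {
        this.name = name;
        this.next = next;
        this.trace = trace;
    }

    public void process(String request) {
        trace.add(name + ":" + request);
        if (next != null) {
            next.process(request);
        }
    }
}

class ChainSketch {
    // Builds the chain back to front, so the last name is the final stage.
    static List<String> run(String request, String... stageNames) {
        List<String> trace = new ArrayList<>();
        StageSketch next = null;
        for (int i = stageNames.length - 1; i >= 0; i--) {
            next = new LoggingStage(stageNames[i], next, trace);
        }
        if (next != null) {
            next.process(request);
        }
        return trace;
    }
}
```

Different server roles would pass different stage lists to the same building blocks, with the final stage being the one that actually applies the update.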


DataTree

The client API provides create, get and set methods with a path as the access point. I was wondering how tree-structured access could be better than a hash table lookup with a simple id as the access point. It turns out it isn't. In ZooKeeper, the DataTree maintains two parallel data structures: a hashtable that maps full paths to DataNodes, and a tree of DataNodes. All accesses to a path go through the hashtable. The tree is traversed only when serializing to disk.

So logically the client is given a path to manage data nodes in a hierarchy, but access to a full path is just a hashtable lookup.
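
Here is a minimal sketch of the two parallel structures, assuming a much simplified node with only data and child names. The class and method names are my own (hypothetical), not the real DataTree or DataNode.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical, simplified model of a path-indexed tree; not the real DataTree.
class TreeSketch {
    static class Node {
        byte[] data;
        final Set<String> children = new TreeSet<>(); // child names, not full paths

        Node(byte[] data) {
            this.data = data;
        }
    }

    // Full path -> node. Every lookup goes through this map.
    private final Map<String, Node> nodes = new HashMap<>();

    TreeSketch() {
        nodes.put("/", new Node(new byte[0]));
    }

    void create(String path, byte[] data) {
        if (!path.startsWith("/") || path.length() < 2) {
            throw new IllegalArgumentException("bad path: " + path);
        }
        int slash = path.lastIndexOf('/');
        String parentPath = slash == 0 ? "/" : path.substring(0, slash);
        Node parent = nodes.get(parentPath); // hashtable lookup, no tree walk
        if (parent == null) {
            throw new IllegalArgumentException("no parent: " + parentPath);
        }
        if (nodes.containsKey(path)) {
            throw new IllegalArgumentException("already exists: " + path);
        }
        parent.children.add(path.substring(slash + 1)); // keep the tree in sync
        nodes.put(path, new Node(data));
    }

    byte[] get(String path) {
        Node node = nodes.get(path);
        if (node == null) {
            throw new IllegalArgumentException("no node: " + path);
        }
        return node.data;
    }

    Set<String> children(String path) {
        Node node = nodes.get(path);
        if (node == null) {
            throw new IllegalArgumentException("no node: " + path);
        }
        return node.children;
    }
}
```

Getting /a/b is a single map lookup on the full path string, no matter how deep the node sits. The child sets are only needed for listing children or walking the whole tree.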
