Saturday, September 19, 2015

why ZooKeeper

“After working for a long time on many different distributed systems, you learn to appreciate how complicated the littlest things can become.”

This is the opening sentence of the blog post in which Yahoo Research tells the story of how ZooKeeper came into being. It not only precisely describes the engineering effort that needs to go into a mature distributed system, but also provides the major motivation for a system like ZooKeeper. Although there exist various forms of applications that can be distributed to gain benefits like fault tolerance and scalability, some very general requirements are shared among them:

(1) Distributed applications need some “ground truth”.

The truth can come from an oracle embedded in the system or from an externally available service. Examples of such “truth” are leader election and failure detection. [Chandra et al. 1996] showed that the weakest failure detector that can solve consensus in an asynchronous system is of class $\diamond S$ (strong completeness plus eventual weak accuracy), so it is safe to say we only require “partial truth”.
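For reference, the two properties that define class $\diamond S$ can be stated as follows (paraphrased from [Chandra et al. 1996]):

```latex
% Class $\diamond S$ = strong completeness + eventual weak accuracy
\begin{itemize}
  \item \textbf{Strong completeness:} eventually, every process that
        crashes is permanently suspected by every correct process.
  \item \textbf{Eventual weak accuracy:} there is a time after which
        \emph{some} correct process is never suspected by any correct
        process.
\end{itemize}
```

Note that accuracy is only required eventually and only for one process, which is what makes $\diamond S$ “partial truth” rather than perfect failure detection.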

(2) The service needs to be simple.

Simple means easy to understand and easy to use. The blog post explains that the reason consensus is no longer shipped as a library is that it required developers to be experts in distributed systems: they had to understand the operational characteristics of failure modes and know how to debug them.

(3) The service needs good performance.

Consensus is expensive. Even so, a service should make the most of it to deliver good performance.

Therefore, after a series of distributed projects with animal names, Yahoo designed ZooKeeper, which uses the Zab (Paxos-like) consensus algorithm to satisfy (1), a hierarchical namespace with a file-system-like model to satisfy (2), and a non-blocking pipelined architecture to satisfy (3).

Zab provides different properties than Paxos. It is designed for coordination with high availability, thanks to primary-ordered passive state machine replication (SMR).

ZooKeeper's file-system-like API model has three advantages: first, a file system API is easy for most developers to understand; second, the hierarchical namespace makes more use cases of the service possible (e.g. service discovery); third, it partitions clients of different uses. However, it is unlike a real file system: some features (e.g. rename) are not provided because they are hard to implement and could cost system performance.
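A minimal sketch of the data model helps here. This is a hypothetical in-memory znode tree, not the real ZooKeeper client API (real clients talk to a server ensemble); it only illustrates how the hierarchical namespace and sequential nodes support use cases like discovery:

```python
# Hypothetical in-memory sketch of ZooKeeper's znode data model.
# Ephemeral nodes, watches, and versions are omitted for brevity.

class ZNodeTree:
    def __init__(self):
        self.nodes = {"/": b""}  # path -> data

    def create(self, path, data=b"", sequential=False):
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self.nodes:
            raise KeyError("no such parent: " + parent)
        if sequential:
            # ZooKeeper appends a monotonically increasing counter
            # to sequential nodes; we approximate it by counting.
            n = sum(1 for p in self.nodes if p.startswith(path))
            path = "%s%010d" % (path, n)
        self.nodes[path] = data
        return path

    def get_children(self, path):
        prefix = path.rstrip("/") + "/"
        return sorted(p[len(prefix):] for p in self.nodes
                      if p.startswith(prefix)
                      and "/" not in p[len(prefix):])

# Service discovery: each server registers under a shared parent,
# and clients list the children to find live servers.
tree = ZNodeTree()
tree.create("/services")
tree.create("/services/web-", b"host1:80", sequential=True)
tree.create("/services/web-", b"host2:80", sequential=True)
print(tree.get_children("/services"))
# → ['web-0000000000', 'web-0000000001']
```

In the real system the registration znodes would also be ephemeral, so a crashed server disappears from the listing automatically.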

The pipelined architecture consists of the different request processors described a few blog posts earlier. The different request processor chains are the main reason leader and follower nodes behave differently.
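The idea can be sketched roughly as follows. The processor names here are simplified stand-ins, not ZooKeeper's actual classes; the point is only that each node wires a different chain, so the same request flows through different stages on a leader than on a follower:

```python
# Hypothetical sketch of a request-processor chain. ZooKeeper's real
# processors differ; this only shows the chain-of-stages structure.

class Processor:
    def __init__(self, name, next_processor=None):
        self.name = name
        self.next = next_processor

    def process(self, request):
        # Record the stage, then hand the request to the next stage.
        request["trace"].append(self.name)
        if self.next:
            self.next.process(request)

def leader_chain():
    # Leader: prepare txn -> propose to quorum -> commit -> apply
    return Processor("prep",
           Processor("proposal",
           Processor("commit",
           Processor("final"))))

def follower_chain():
    # Follower: forward writes to leader -> sync to disk -> apply
    return Processor("follower",
           Processor("sync",
           Processor("final")))

req = {"op": "create", "trace": []}
leader_chain().process(req)
print(req["trace"])
# → ['prep', 'proposal', 'commit', 'final']
```

Because each stage only enqueues work for the next, many requests can be in flight at once, which is where the pipeline's throughput comes from.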

An early version of ZooKeeper was written by one engineer in three months. To the developers' surprise, the service could actually provide all kinds of coordination primitives beyond their initial requirements. The same story happened with Google's distributed lock service Chubby, which can be used for many tasks other than locks. Since its release, more contributors have added rich features to ZooKeeper, and years of work on implementation, optimization, and testing have made it mature and popular.

So far, ZooKeeper is the only consensus system that has passed the test, whereas etcd, based on the Raft algorithm, failed under partitions.

[Chandra et al. 1996] Chandra, Tushar Deepak, and Sam Toueg. "Unreliable failure detectors for reliable distributed systems." Journal of the ACM (JACM) 43.2 (1996): 225-267.