“After working for
a long time on many different distributed systems, you learn to
appreciate how complicated the littlest things can become.”
This is the opening sentence of the blog post in which Yahoo! Research tells the story of how ZooKeeper came into being. It not only precisely describes the engineering effort that needs to go into a mature distributed system, but also captures the major motivation for a system like ZooKeeper. Although applications of many different forms can be distributed to gain benefits like fault tolerance and scalability, some very general requirements are shared among them:
(1) Distributed applications need some “ground truth”.
The truth can come from an oracle embedded in the system or from an externally available service. Leader election and failure detection are two examples of such “truth”. [Chandra et al. 1996] showed that the weakest failure detector for solving consensus in an asynchronous system is of class $\diamond S$ (eventually strong), so it is safe to say we only require “partial truth”.
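For reference, class $\diamond S$ combines two properties (following the definitions in [Chandra et al. 1996]; the compact notation below is mine, not the paper's):

```latex
% Class $\diamond S$ (eventually strong), per [Chandra et al. 1996]:
%   strong completeness:    eventually every process that crashes is
%                           permanently suspected by every correct process
%   eventual weak accuracy: there is a time after which some correct
%                           process is never suspected by any correct process
\diamond S \;=\; \text{strong completeness} \,\wedge\, \text{eventual weak accuracy}
```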
(2) The service needs to be simple.
Simple means easy to understand and easy to use. The blog post explains why consensus is no longer shipped as a library: it required developers to be experts in distributed systems in order to understand the operational characteristics of failure modes and how to debug them.
(3) The service needs good performance.
Consensus is expensive. Even so, a service should make the most of it and provide as much performance as possible.
Therefore, after a series of distributed projects with animal names, Yahoo designed ZooKeeper, which uses the Zab (Paxos-like consensus) algorithm to satisfy (1), a hierarchical namespace with a file-system-like model to satisfy (2), and a non-blocking pipelined architecture to satisfy (3).
Zab provides different properties from Paxos. It is designed for highly available coordination, thanks to primary-order passive state machine replication (SMR).
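Its defining guarantee is primary order. Roughly, following the Zab paper (the $\prec_b$ / $\prec_d$ relations are my notation for broadcast order and delivery order, not the paper's): if a primary broadcasts transaction $\langle v, z \rangle$ before $\langle v, z' \rangle$, no process may deliver $\langle v, z' \rangle$ without having delivered $\langle v, z \rangle$ first:

```latex
% Local primary order (roughly, after the Zab paper):
%   v = a primary instance (epoch), z and z' = transaction identifiers
\langle v, z \rangle \prec_b \langle v, z' \rangle
  \;\Longrightarrow\;
\langle v, z \rangle \prec_d \langle v, z' \rangle
```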
ZooKeeper's file-system-like API model has three advantages: one, a file system API is easy for most developers to understand; two, a hierarchical namespace enables more use cases for the service (e.g. service discovery); three, subtrees naturally partition clients with different uses. However, it is unlike a real file system: some features (e.g. rename) are not provided, because they are hard to implement and would likely come at the cost of system performance.
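To make the model concrete, here is a minimal sketch using the standard ZooKeeper Java client; the "/app1" paths and payloads are hypothetical, chosen only for illustration:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZnodeDemo {
    public static void main(String[] args) throws Exception {
        // Connect to a local server; the watcher lambda receives session events.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 10_000,
                event -> System.out.println("event: " + event));

        // A persistent znode holding some configuration data.
        zk.create("/app1", "config-v1".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // An ephemeral sequential znode: it vanishes when this session dies,
        // which is the building block of group membership and leader election.
        zk.create("/app1/worker-", "addr".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

        // Read with a one-shot watch: we are notified once if the data changes.
        byte[] data = zk.getData("/app1", true, null);
        System.out.println(new String(data));

        zk.close();
    }
}
```

Note how the hierarchy does double duty: "/app1" scopes one application's clients away from others, and the children under it form a membership list.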
The pipelined architecture is made of the different request processors we described a few posts earlier. The differing request processor chains are the main reason leader and follower nodes behave differently.
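As a rough sketch of the idea (simplified from ZooKeeper's actual RequestProcessor interface, which also declares shutdown() and a checked exception; the stage bodies here are placeholders, not the real logic):

```java
// A request flows through a linked chain of stages; each stage does its
// own work and hands the request to the next, so stages run concurrently.
interface RequestProcessor {
    void processRequest(Request req);
}

class Request { /* opcode, session id, transaction payload, ... */ }

class PrepProcessor implements RequestProcessor {
    private final RequestProcessor next;
    PrepProcessor(RequestProcessor next) { this.next = next; }
    public void processRequest(Request req) {
        // turn the request into an idempotent transaction (sketch)
        next.processRequest(req);
    }
}

class FinalProcessor implements RequestProcessor {
    public void processRequest(Request req) {
        // apply the transaction to the in-memory tree and reply (sketch)
    }
}

class Pipelines {
    // Hypothetical wiring: a leader's chain has extra proposal/commit
    // stages in the middle; a follower's chain is shorter.
    static RequestProcessor leaderChain() {
        return new PrepProcessor(new FinalProcessor());
    }
}
```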
The early version of ZooKeeper was written by one engineer in three months. To the developers' surprise, the service could provide all kinds of coordination primitives beyond the initially requested ones. The same story happened with Google's distributed lock service Chubby, which turned out to be useful for many tasks other than locking. Since its release, many contributors have added rich features to ZooKeeper, and years of work on implementation, optimization, and testing have made it mature and popular.
So far, ZooKeeper is the only consensus system that has passed the test, whereas etcd, based on the Raft algorithm, failed under partitions.
[Chandra et al. 1996] Chandra, Tushar Deepak, and Sam Toueg. "Unreliable failure detectors for reliable distributed systems." Journal of the ACM (JACM) 43.2 (1996): 225-267.