Monday, September 29, 2014

ZooKeeper source code analysis

A ZooKeeper server starts up with three ports read from its configuration file: one for listening to client requests, while the other two, taken from the server entry (ip:port:port), are used for inter-server communication and leader election.
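
For example, a zoo.cfg for a three-server ensemble typically looks like this (2181, 2888 and 3888 are just the conventional ports, not required values):

clientPort=2181
dataDir=/var/lib/zookeeper
tickTime=2000
initLimit=10
syncLimit=5
# server.<id>=<host>:<quorum port>:<election port>
server.1=zoo1:2888:3888
server.2=zoo2:2888:3888
server.3=zoo3:2888:3888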

Adding more replicas to a ZooKeeper cluster increases read throughput but reduces write performance, because every operation proposed by the leader must wait for more than half of the replicas to ACK it. To address this, ZooKeeper introduced a new type of replica, the observer. An observer does everything a follower does except vote. Therefore, adding observers to the cluster improves read performance without harming write throughput. Availability, however, is still determined by the number of voting followers in the system. So there are three types of nodes: leader, follower and observer.
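
A minimal sketch of the majority check (hypothetical names, not the real QuorumMaj code): only voting members, i.e. the leader and the followers, are counted toward the quorum, so observers never raise the number of ACKs a proposal needs.

import java.util.Set;

// Sketch: a proposal commits once a strict majority of VOTING members
// have acked it. Observers are excluded from votingMembers, so adding
// observers never increases the number of acks required.
public class QuorumCheck {
    private final Set<Long> votingMembers;   // ids of leader + followers

    public QuorumCheck(Set<Long> votingMembers) {
        this.votingMembers = votingMembers;
    }

    public boolean containsQuorum(Set<Long> ackedIds) {
        long votes = ackedIds.stream().filter(votingMembers::contains).count();
        return votes > votingMembers.size() / 2;   // strict majority of voters
    }

    public static void main(String[] args) {
        QuorumCheck q = new QuorumCheck(Set.of(1L, 2L, 3L, 4L, 5L));
        // acks from 3 of 5 voters -> commit; acks from observers (ids 6, 7) are ignored
        System.out.println(q.containsQuorum(Set.of(1L, 2L, 3L, 6L, 7L))); // true
        System.out.println(q.containsQuorum(Set.of(1L, 6L, 7L)));         // false
    }
}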


Follower and Observer are two different kinds of learner; both extend the Learner class.


If ZooKeeper finds only one server configured at startup, it runs the standalone "ZooKeeperServer". If there are multiple servers, the QuorumPeer thread starts the leader election process. QuorumPeer has four states: looking, following, leading and observing. Leader election is based on the Fast Paxos algorithm, implemented in "FastLeaderElection". After the election, QuorumPeer starts the server class that matches its role and calls its main method:


leader -> LeaderZooKeeperServer -> lead()
follower -> FollowerZooKeeperServer -> followLeader()
observer -> ObserverZooKeeperServer -> observeLeader()
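
The dispatch can be pictured as a loop over the peer state (a simplified sketch of the idea, not the actual QuorumPeer.run(), which also handles shutdown and reconnection):

// Simplified sketch of the QuorumPeer main loop: elect, then serve in
// the elected role until the peer drops back to LOOKING and re-elects.
enum ServerState { LOOKING, FOLLOWING, LEADING, OBSERVING }

class QuorumPeerSketch {
    ServerState state = ServerState.LOOKING;

    void runLoop() {
        while (true) {
            switch (state) {
                case LOOKING:
                    state = lookForLeader();          // FastLeaderElection
                    break;
                case FOLLOWING:
                    followLeader();                   // FollowerZooKeeperServer
                    state = ServerState.LOOKING;      // leader lost -> re-elect
                    break;
                case LEADING:
                    lead();                           // LeaderZooKeeperServer
                    state = ServerState.LOOKING;
                    break;
                case OBSERVING:
                    observeLeader();                  // ObserverZooKeeperServer
                    state = ServerState.LOOKING;
                    break;
            }
        }
    }

    // placeholders standing in for the role-specific entry points listed above
    ServerState lookForLeader() { return ServerState.FOLLOWING; }
    void followLeader() {}
    void lead() {}
    void observeLeader() {}
}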

All server types share the same set of request processors, but each assembles a different processor chain, listed below.
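
For reference, the default chains (roughly as wired up in setupRequestProcessors() of each server class in the 3.4-era code; details vary by version) are:

leader -> PrepRequestProcessor -> ProposalRequestProcessor -> CommitProcessor -> ToBeAppliedRequestProcessor -> FinalRequestProcessor
    (ProposalRequestProcessor also forwards proposals to SyncRequestProcessor -> AckRequestProcessor)
follower -> FollowerRequestProcessor -> CommitProcessor -> FinalRequestProcessor
    (proposals from the leader are logged through SyncRequestProcessor -> SendAckRequestProcessor)
observer -> ObserverRequestProcessor -> CommitProcessor -> FinalRequestProcessor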

The actual update operation is done in the FinalRequestProcessor.


DataTree

The client API provides create, get and set methods with a path as the access point. I was wondering how tree-structured access could be better than a hash table lookup keyed by a simple id. It turns out it's not. In ZooKeeper, the DataTree maintains two parallel data structures: a hashtable that maps full paths to DataNodes, and a tree of DataNodes. All accesses to a path go through the hashtable. The tree is traversed only when serializing to disk.

So clients are logically given a path to manage data nodes in a hierarchy, but access to a full path is actually a hashtable lookup.
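
A minimal sketch of that dual structure (hypothetical names, not the real DataTree class):

import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the DataTree idea: a flat hashtable keyed by full path gives
// O(1) lookups, while each node also remembers its children so the
// hierarchy can still be walked (e.g. when serializing to disk).
class DataTreeSketch {
    static class Node {
        byte[] data;
        final Set<String> children = new HashSet<>();
    }

    private final ConcurrentHashMap<String, Node> nodes = new ConcurrentHashMap<>();

    DataTreeSketch() {
        nodes.put("/", new Node());               // root node
    }

    void create(String path, byte[] data) {
        String parent = path.substring(0, Math.max(1, path.lastIndexOf('/')));
        Node parentNode = nodes.get(parent);
        if (parentNode == null) throw new IllegalStateException("no parent: " + parent);
        Node node = new Node();
        node.data = data;
        nodes.put(path, node);                    // flat lookup table
        parentNode.children.add(path);            // tree structure
    }

    byte[] get(String path) {                     // single hash lookup, no traversal
        Node n = nodes.get(path);
        return n == null ? null : n.data;
    }

    public static void main(String[] args) {
        DataTreeSketch t = new DataTreeSketch();
        t.create("/app", "cfg".getBytes());
        t.create("/app/node1", "v1".getBytes());
        System.out.println(new String(t.get("/app/node1")));   // prints v1
    }
}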

Thursday, September 25, 2014

On GlusterFS

The type of a volume is specified at volume creation time, and it determines how and where data is placed. The following volume types are supported in GlusterFS:
1) Distribute
    Distributes files across various bricks of the volume, similar to file-level RAID 0.
    Directories are present on all bricks of the volume.
    It uses the Davies-Meyer hash algorithm, with the 32-bit hash space divided into N ranges for N bricks. When a directory is created, a hash range is assigned to each brick for that directory. When a file is created, its hash is computed from the file name and the file is placed on the brick whose range contains that hash (see the sketch after this list).

2) Stripe
    Individual files split among bricks, similar to block-level RAID 0.

3) Replication
    Copies files to multiple bricks, similar to file-level RAID 1.
    Synchronous replication of all directory and file updates.
    Transaction driven consistency.

4) Distributed Replicate
    Currently the most preferred deployment model. Reads are load balanced.

5) Striped Replicate
    Similar to RAID 10 (1+0)

6) Distributed Striped Replicate
    Limited Use Cases – Map Reduce
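
The hash-range placement used by the distribute type can be sketched as follows (a simplified illustration with a stand-in 32-bit hash, not the actual Davies-Meyer code used by the DHT translator):

import java.nio.charset.StandardCharsets;

// Sketch of DHT-style placement: the 32-bit hash space is split into N
// contiguous ranges, one per brick, and a file lands on the brick whose
// range contains the hash of its name. (Stand-in FNV-1a hash here; the
// real translator uses a Davies-Meyer construction.)
public class DhtPlacement {
    private final int numBricks;

    public DhtPlacement(int numBricks) { this.numBricks = numBricks; }

    // Stand-in 32-bit hash of the file name.
    static long hash32(String name) {
        long h = 0x811c9dc5L;                                   // FNV-1a offset basis
        for (byte b : name.getBytes(StandardCharsets.UTF_8)) {
            h = ((h ^ (b & 0xff)) * 0x01000193L) & 0xffffffffL;
        }
        return h;
    }

    // Which brick's range contains this file name's hash?
    public int brickFor(String fileName) {
        long rangeSize = (0x100000000L + numBricks - 1) / numBricks;  // ceil(2^32 / N)
        return (int) (hash32(fileName) / rangeSize);
    }

    public static void main(String[] args) {
        DhtPlacement dht = new DhtPlacement(4);
        for (String f : new String[] {"a.txt", "b.txt", "c.txt"}) {
            System.out.println(f + " -> brick " + dht.brickFor(f));
        }
    }
}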

There are no external metadata servers.

GlusterFS can be accessed through a FUSE mount point or through libgfapi; both sync and async interfaces are available.

A new feature in 3.5 is distributed geo-replication. It is change-log based, with one driver per replica set on the master and no SPOF. Replication is asynchronous, works across LAN or WAN, and follows a master-slave (cascading) model, so data is passed only between the defined master and slave; it is continuous and incremental.

Tuesday, September 23, 2014

On GPFS

These notes are from the GPFS paper [2002]; there is no open source version.

GPFS, from some perspectives, is still the fastest parallel file system; it is implemented in a shared-disk environment over a storage area network (SAN). Since any file system node can access any portion of the shared disks, GPFS requires a locking mechanism to support fully parallel access to both file data and metadata.

GPFS uses a centralized global lock manager in conjunction with local lock managers on each file system node. The global lock manager hands out lock tokens to requesting nodes, and a node that holds a token can satisfy subsequent lock requests on the same object locally, without further messages.
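
A toy model of the token idea (hypothetical class names, not the GPFS implementation): the central manager is involved only when token ownership moves between nodes; while a node keeps a token, repeated lock operations on that node stay local.

import java.util.HashMap;
import java.util.Map;

// Toy sketch of token-based locking: the central manager is contacted only
// when token ownership changes; while a node keeps the token, repeated
// lock calls on that node are purely local.
class TokenManager {
    private final Map<String, FsNode> holder = new HashMap<>();  // resource -> holding node

    synchronized void acquire(String resource, FsNode requester) {
        FsNode current = holder.get(resource);
        if (current != null && current != requester) {
            current.revoke(resource);                // pull the token back first
        }
        holder.put(resource, requester);
    }
}

class FsNode {
    final String name;
    private final TokenManager manager;
    private final Map<String, Boolean> tokens = new HashMap<>();

    FsNode(String name, TokenManager manager) { this.name = name; this.manager = manager; }

    void lock(String resource) {
        if (!tokens.getOrDefault(resource, false)) {
            manager.acquire(resource, this);          // remote round-trip to the manager
            tokens.put(resource, true);
            System.out.println(name + " fetched token for " + resource);
        } else {
            System.out.println(name + " locked " + resource + " locally (token cached)");
        }
    }

    void revoke(String resource) {
        tokens.put(resource, false);
        System.out.println(name + " gave up token for " + resource);
    }

    public static void main(String[] args) {
        TokenManager m = new TokenManager();
        FsNode a = new FsNode("A", m), b = new FsNode("B", m);
        a.lock("inode42");   // token fetched from the manager
        a.lock("inode42");   // no manager traffic
        b.lock("inode42");   // manager revokes from A, then grants to B
    }
}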

GPFS guarantees single-node-equivalent POSIX semantics for file system operations across the cluster, meaning a read on node A will see either all or none of a concurrent write on node B (read/write atomicity). There is one exception: the access time (atime) in the metadata is updated only periodically, because concurrent reads are very common and synchronizing atime would be very expensive.

The paper claims there are two approaches to achieving the necessary synchronization:
1. Distributed Locking: every FS operation acquires a read or write lock to synchronize with conflicting operations.
2. Centralized Management: all conflicting operations are forwarded to a designated node, which performs the requested operations.

GPFS uses byte-range locking for updates to file data, and dynamically elected "metanodes" for centralized management of file metadata. The argument for using different approaches for data and metadata is this:
(1) when lock conflicts are frequent (e.g. many nodes access different parts of a file, but all need to update the same metadata), the overhead of distributed locking may exceed the cost of forwarding requests to a central node;
(2) when different nodes operate on different pieces of file data, distributed locking allows greater parallelism.

Also, smaller lock granularity means more overhead due to frequent lock requests, whereas larger granularity may cause more frequent lock conflicts. Thus, byte-range locks are used for file data and a lock per file is used for metadata, as illustrated below.
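
As a toy illustration of why byte-range locks help for data (a hedged sketch, not GPFS code): two locks on the same file conflict only if their byte ranges overlap and at least one is a write, so writers touching disjoint regions proceed in parallel.

// Sketch of byte-range lock conflict detection: write locks on the same
// file conflict only when their byte ranges overlap, so nodes updating
// disjoint regions can hold locks concurrently.
class ByteRangeLock {
    final long start;      // inclusive
    final long end;        // exclusive
    final boolean write;

    ByteRangeLock(long start, long end, boolean write) {
        this.start = start; this.end = end; this.write = write;
    }

    boolean conflictsWith(ByteRangeLock other) {
        boolean overlap = this.start < other.end && other.start < this.end;
        return overlap && (this.write || other.write);   // read/read never conflicts
    }

    public static void main(String[] args) {
        ByteRangeLock w1 = new ByteRangeLock(0, 4096, true);        // node A writes block 0
        ByteRangeLock w2 = new ByteRangeLock(4096, 8192, true);     // node B writes block 1
        ByteRangeLock r  = new ByteRangeLock(0, 8192, false);       // node C reads both
        System.out.println(w1.conflictsWith(w2));   // false: disjoint writers
        System.out.println(w1.conflictsWith(r));    // true: reader overlaps a writer
    }
}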

However, there could be a third approach, a middle ground: Panopticon