Wednesday, November 25, 2015

TLA+ wiki

Modules:
  • Integers
  • Naturals
  • Reals
  • Sequences
  • TLC -- print
  • Bags -- (+) (-)
  • FiniteSets
  • RealTime -- RTBound RTnow now
Definition:
$<A>_v$ == $A \land (v' \neq v)$
$[A]_v$ == $A \lor (v'=v)$
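As a usage sketch (the standard idiom, not from the notes above; Init, Next, and vars are placeholder names), a spec usually allows stuttering on its next-state action and adds fairness with the angle-bracket form:

Spec == Init /\ [][Next]_vars /\ WF_vars(Next)
    \* [][Next]_vars: every step is a Next step or leaves vars unchanged
    \* WF_vars(Next): if <<Next>>_vars stays enabled, such a step is eventually taken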

Operators:
Divides (p, n) == $\exists q \in Int : n = q * p$
DivisorsOf (n) == {$p \in Int$ : Divides(p, n)}
SetMax (S) == CHOOSE $i \in S$ : $\forall j \in S$ : $i \geq j$
GCD (m, n) == SetMax(DivisorsOf(m) $\cap$ DivisorsOf(n))
SetGCD(T) == SetMax( {$d \in Int$ : $\forall t \in T$ : Divides(d, t)} )
max(x, y) == IF x>y THEN x ELSE y
min(x, y) == IF x<y THEN x ELSE y
maxv(x, y) == [$i \in$ DOMAIN x |-> max(x[i], y[i])]
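Some sample evaluations of the operators above (a sketch; TLC can check these ASSUME lines directly given EXTENDS Integers, but not the Int-based operators, since Int is not enumerable):

ASSUME SetMax({3, 7, 2}) = 7
ASSUME max(2, 5) = 5 /\ min(2, 5) = 2
ASSUME maxv(<<1, 5>>, <<4, 2>>) = <<4, 5>>  \* element-wise max over the shared domain
\* By the definitions, GCD(12, 18) = 6, but TLC cannot evaluate DivisorsOf,
\* because it filters the non-enumerable set Int.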

Cardinality(set)
SortSeq(s, <)
Len(s)
Head(s)
Tail(s)
Append(s,e)
Seq(s)
SubSeq(s, m, n)
SelectSeq(s, op)

successor[$i \in Nat$] == i+1
successor == [$i \in Nat$ |-> i+1]
factorial[$n \in Nat$] == IF n = 0 THEN 1 ELSE n * factorial[n-1]

RECURSIVE FactorialOp(_)
FactorialOp(n) == IF n=0 THEN 1 ELSE n * FactorialOp(n-1)

RECURSIVE Cardinality(_)
Cardinality(S) == IF S={} THEN 0 ELSE 1+Cardinality(S \ {CHOOSE $x \in S$ : TRUE})
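Quick sanity checks for the recursive definitions above (TLC can evaluate both the recursive function and the RECURSIVE operators on small arguments):

ASSUME factorial[4] = 24 /\ FactorialOp(4) = 24
ASSUME Cardinality({"a", "b", "c"}) = 3  \* matches the FiniteSets operator of the same name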

Sortings(S) == LET D == 1..Cardinality(S)
               IN {$seq \in [D \to S]$ :
                      $\land S \subseteq \{seq[i] : i \in D\}$
                      $\land \forall i, j \in D : (i < j) \implies (seq[i].key \leq seq[j].key)$}
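For example (with made-up records, just to show the shape of the result), sorting a two-element set of records on their key field yields exactly one sequence:

ASSUME Sortings({[key |-> 2], [key |-> 1]}) = {<<[key |-> 1], [key |-> 2]>>}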

RECURSIVE SetSum(_)
SetSum(S) == IF S={} THEN 0 ELSE LET s==CHOOSE $x \in S$ : TRUE IN s + SetSum(S \ {s})

RECURSIVE SeqSum(_)
SeqSum(s) == IF s = <<>> THEN 0 ELSE Head(s) + SeqSum(Tail(s))
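Sanity checks for the two sums (TLC-evaluable):

ASSUME SetSum({1, 2, 3}) = 6
ASSUME SeqSum(<<1, 2, 3>>) = 6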

Network:
variable network = [from $\in 1..N$ |-> [to $\in 1..N$ |-> <<>>]];

define
{
    send(from, to, msg) == [network EXCEPT ![from][to] = Append(@, msg)]
    bcast(from, msg) == [network EXCEPT ![from] = [to $\in 1..N$ |-> Append(network[from][to], msg)]]
}

macro rcv()
{
    with (from $\in$ {$j \in 1..N$ : Len(network[j][self]) > 0}) {
        msg := Head(network[from][self]);
        network[from][self] := Tail(network[from][self])
    };
}
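
A minimal sketch of how the network variable, the send/bcast definitions, and the rcv() macro might be used together in a PlusCal process (the process name Node, the "hello" and "heartbeat" payloads, and the local variable msg are made up for illustration; self is the process id):

process (Node \in 1..N)
variable msg = <<>>;  \* last message delivered to this node
{
n0: while (TRUE) {
        either {
            \* point-to-point: send a message to some other node
            with (dest \in 1..N \ {self}) {
                network := send(self, dest, "hello");
            }
        } or {
            \* broadcast a heartbeat to every node
            network := bcast(self, "heartbeat");
        } or {
            \* deliver one pending message; blocks while all channels to self are empty
            rcv();
        }
    }
}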


Wednesday, October 28, 2015

Building reliable services on top of unreliable services




NFV and Container

Even though only one NFV-related paper appeared among the SOSP 2015 distributed-systems papers, I still feel it has been a hot topic in the field since 2014.

NFV stands for network function virtualization. Inspired by the benefits of cloud computing, NFV moves network functions out of dedicated physical devices and into virtualized software applications that can run on commodity, general-purpose servers. These network functions include, but are not limited to, firewalls, NAT, QoS, VPNs, and routing. Network functions can also be viewed as software specifically designed for packet processing.




Tuesday, October 27, 2015

Proactive Serialization

This is a follow-up discussion of my adviser's original idea blog post, proactive serialization.

The general idea is that serialization does not need to be a Paxos-like total order; with a lock service, it can be proactive and anticipatory. The lock service can anticipate a future request and grant locks ahead of time, so a worker can proceed without an explicit request. Latency is reduced and throughput can be improved.

However, with a slightly different design, anticipation is no longer straightforward.

Problem statement: Given a star graph $G = <V, E>$ with center $k$, where $|V| = 1+|E|$ and $E = \{(k, j) : j \in V \setminus \{k\}\}$. There are $n$ data items shared among all nodes, and each data item is associated with one token. A node $j$ must hold the tokens of all data items it accesses in a transaction in order to commit that transaction. What is the most efficient location for each token? In other words, given this setup and the application's access pattern, the goal is to find the token placement with minimum cost, where cost is the communication between sites.

If node $k$ is a simple token manager / lock broker, then the problem is easy to solve: it keeps the access history of all other nodes and assigns tokens to them upon request or in anticipation.

However, suppose node $k$ is the elected leader responsible for exchanging tokens (locks) among the other nodes in the cluster, while at the same time $k$ behaves like any other node: it also accepts client requests and proposes/commits transactions. Then the problem becomes: should $k$ move the token, or keep it and commit at $k$? Should node $j$ block a transaction while waiting for a token, or forward the transaction to $k$ instead?

When two nodes $i$ and $j$ compete for the same data item, it is better to keep the token at $k$ than to move it back and forth all the time. A simple way to decide when to move a token is to remember the last N consecutive accesses to each item. Say that at some point $i$ holds the token and $j$ forwards its transaction to $k$: should $k$ wait until $i$'s lease times out, which blocks $j$'s request, or revoke the token back to $k$ immediately?

Let's try to set up some rules.

Rule #1: node $k$ should be a node geographically located in the middle of all nodes, so that it is relatively fair for every site to communicate with $k$; that is, $k$ is chosen so that $\sum_{j} cost(j, k)$ is minimized.

Rule #2: A token should remain at one site while there are still outstanding transactions. This rule is meant to guarantee the safety property, hopefully through the following two means:

(1) tokens are sent within a replication message.
(2) every token includes the last transaction id; node $k$ only releases a token after it has seen that transaction. (seen by the state machine??)

Rule #3: A create operation creates a new data item and thus a new token, which is initially unknown to $k$. This special operation admits three ways of creation: (1) create the data item globally, which requires the creating site to hold the parent data item in a tree data structure, so that no other site can create a conflicting key; (2) forward all create requests to site $k$ to commit; or (3) let any site create a local data item that can only be accessed by itself, in which case no token is needed.

Rule #4: Assume each site-to-site one-way communication takes T seconds, and the time to commit a transaction is t, where t << T. If data item D has the access pattern [i, j, i, j, i, j, …] from sites i and j, the most efficient policy is to keep token(D) at site $k$, so the average time per transaction is (2T+2t)/2 = T+t.

However, if the access pattern is [iii, j, iii, j, iii, j, …], then keeping the token at $k$ takes (6T+4t)/4 = 1.5T+t per transaction, whereas sending the token to $i$ first, revoking it for $j$'s single transaction, and then sending it to $i$ again takes (T+3t+T+T)/4 = 0.75T+0.75t.

Such patterns cannot be detected and anticipated from only two consecutive accesses.



On Ceph MDS

The Ceph Filesystem service includes the Ceph Metadata Server (MDS) deployed with the Ceph Storage cluster. The purpose of the MDS is to store all the filesystem metadata (directories, file ownership, access modes, etc) in high-availability Ceph Metadata Servers where the metadata resides in memory. The reason for the MDS (a daemon called ceph-mds) is that simple filesystem operations like listing a directory or changing a directory (ls, cd) would tax the Ceph OSD Daemons unnecessarily. So separating the metadata from the data means that the Ceph Filesystem can provide high performance services without taxing the Ceph Storage Cluster.
Ceph FS separates the metadata from the data, storing the metadata in the MDS, and storing the file data in one or more objects in the Ceph Storage Cluster. The Ceph filesystem aims for POSIX compatibility. ceph-mds can run as a single process, or it can be distributed out to multiple physical machines, either for high availability or for scalability.
  • High Availability: The extra ceph-mds instances can be standby, ready to take over the duties of any failed ceph-mds that was active. This is easy because all the data, including the journal, is stored on RADOS. The transition is triggered automatically by ceph-mon.
  • Scalability: Multiple ceph-mds instances can be active, and they will split the directory tree into subtrees (and shards of a single busy directory), effectively balancing the load amongst all active servers.
Combinations of standby and active etc are possible, for example running 3 active ceph-mds instances for scaling, and one standby instance for high availability.


The above is from the Ceph documentation; I need more details on multiple active MDS.

Saturday, September 19, 2015

why ZooKeeper

“After working for a long time on many different distributed systems, you learn to appreciate how complicated the littlest things can become.”

This is the opening sentence of the blog post in which Yahoo Research tells the story of how ZooKeeper came into being. It not only precisely describes the engineering effort that needs to go into a mature distributed system, but also provides the main motivation for a system like ZooKeeper. Although there are many kinds of applications that can be distributed to gain benefits like fault tolerance and scalability, they share some very general requirements:

(1) Distributed applications need some “ground truth”.

The truth can come from an oracle embedded in the system or from an externally available service. Examples of such “truth” are leader election and failure detection. [Chandra et al. 1996] showed that the weakest failure detector for solving consensus in an asynchronous system is of class $\diamond S$ (eventual weak accuracy), so it is safe to say we only require “partial truth”.

(2) The service needs to be simple.

Simple means easy to understand and easy to use. The blog post explains that the reason consensus is no longer offered as a library is that it required developers to be experts in distributed systems, understanding the operational characteristics of failure modes and how to debug them.

(3) The service needs good performance.

Consensus is expensive. Even so, a service should make the most of it to provide good performance.

Therefore, after a series of distributed projects with animal names, Yahoo designed ZooKeeper, which uses the Zab (Paxos-like consensus) algorithm to satisfy (1), a hierarchical namespace with a file-system-like model to satisfy (2), and a non-blocking pipelined architecture to satisfy (3).

Zab provides different properties from Paxos. It is designed for coordination with high availability, thanks to primary-order passive state machine replication (SMR).

ZooKeeper's file-system-like API model has three advantages: first, a file system API is easy for more developers to understand; second, the hierarchical namespace enables more use cases for the service (e.g., discovery); third, it partitions clients with different uses. However, it is unlike a real file system, because some features (e.g., rename) are not provided, since they are hard to implement and would likely come at the cost of system performance.

The pipelined architecture is built from the different request processors we described a few posts earlier. The different request-processor chains are the main reason leader and follower nodes behave differently.

An early version of ZooKeeper was written by one engineer in three months. To the developers' surprise, the service could provide all kinds of coordination primitives beyond the initial requirements. The same story happened with Google's distributed lock service, Chubby, which can be used for many tasks other than locking. Since its release, more contributors have added rich features to ZooKeeper, and years of work to implement, optimize, and test it have made ZooKeeper mature and popular.

So far, ZooKeeper is the only consensus system that passes the test, whereas etcd, based on the Raft algorithm, failed under partitions.

[Chandra et al. 1996] Chandra, Tushar Deepak, and Sam Toueg. "Unreliable failure detectors for reliable distributed systems." Journal of the ACM (JACM) 43.2 (1996): 225-267.