Wednesday, October 28, 2015

Building reliable service on top of unreliable services




NFV and Containers

Even though only one NFV-related paper appeared in the distributed systems sessions of SOSP 2015, I still feel it has been a hot topic in the field since 2014.

NFV stands for network function virtualization. Inspired by the benefits of cloud computing, NFV moves network functions out of dedicated physical devices and into virtualized software applications that run on commodity, general-purpose servers. These network functions include, but are not limited to, firewalls, NAT, QoS, VPNs, and routing. A network function can also be viewed as software specifically designed for packet processing.




Tuesday, October 27, 2015

Proactive Serialization

This is a follow-up discussion on my adviser's original idea blog post, proactive serialization.

The general idea is that serialization does not need to be a Paxos-like total order; with a lock service, it can be proactive and anticipatory. The lock service can anticipate a future request and hand out some locks ahead of time, so the worker proceeds without an explicit request. Latency is reduced and throughput can improve.
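To make this concrete, here is a minimal sketch of such an anticipatory lock broker. This is my own illustration rather than anything from the original post; the class, parameters, and lease policy are all hypothetical.

```python
# A minimal, hypothetical sketch of an anticipatory lock broker. The broker
# remembers which node touched each item recently; if the same node keeps
# asking, it hands out a longer lease so that node can continue committing
# without issuing explicit requests.
from collections import defaultdict, deque

class AnticipatoryLockBroker:
    def __init__(self, history_len=2, base_lease=1.0, extended_lease=5.0):
        self.history = defaultdict(lambda: deque(maxlen=history_len))
        self.base_lease = base_lease
        self.extended_lease = extended_lease

    def request(self, node, item):
        """Grant a lock on `item` to `node`; the lease length is chosen from
        the recent access history of that item."""
        past = self.history[item]
        if past and all(n == node for n in past):
            lease = self.extended_lease   # anticipate more requests from `node`
        else:
            lease = self.base_lease
        past.append(node)
        return {"item": item, "holder": node, "lease": lease}

broker = AnticipatoryLockBroker()
print(broker.request("j", "x"))   # no history yet: short lease
print(broker.request("j", "x"))   # history is all `j`: extended lease
```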

However, with a slightly different design, anticipation is no longer obvious.

Problem statement: Given a star graph $G = \langle V, E \rangle$ with center node $k$, where $|V| = 1 + |E|$ and $E = \{(k, j) : j \in V \setminus \{k\}\}$. There are $n$ data items shared among all nodes, and each data item is associated with one token. A node $j$ must hold the tokens of all data items a transaction accesses in order to commit that transaction. What is the most efficient location for each token? In other words, given the setup and the application's access pattern, the goal is to place the tokens so as to minimize cost, where cost is the communication between sites.

If node $k$ is a simple token manager / lock broker, then the problem is easy to solve: $k$ keeps the access history of all other nodes and assigns each token upon request or anticipation.

However, suppose node $k$ is the elected leader responsible for exchanging tokens (locks) among the other nodes in the cluster, while at the same time $k$ is like any other node: it also accepts requests from clients and proposes/commits transactions. Then the problem becomes: should $k$ move the token, or keep it and commit at $k$? Should node $j$ block a transaction while waiting for a token, or forward the transaction to $k$ instead?

When two nodes $i, j$ compete for the same data item, it is better to keep the token at $k$ instead of moving it back and forth all the time. A simple way to decide when to move a token is to remember the last N accesses for each item. Say that at a certain point in time $i$ holds the token; now if $j$ forwards its transaction to $k$, should $k$ wait until $i$'s lease times out (which blocks $j$'s request), or revoke the token back to $k$ immediately?
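One way to picture the decision (again a hypothetical sketch, not from the original idea blog) is as a heuristic over the last N recorded accesses of an item: ship the token to a node only when that node issued all of the recent accesses, otherwise keep the token at $k$.

```python
# Hypothetical heuristic: given the last N accesses of a data item (most
# recent last), decide where its token should live. Policy is illustrative.
def place_token(recent_accesses, center="k"):
    if not recent_accesses:
        return center                          # no history: keep the token at k
    last = recent_accesses[-1]
    if all(node == last for node in recent_accesses):
        return last                            # one node dominates: ship the token to it
    return center                              # contention: keep the token at k

print(place_token(["i", "j", "i", "j"]))       # -> "k"  (alternating access)
print(place_token(["i", "i", "i"]))            # -> "i"  (bursty access by i)
```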

Let's try to set up some rules.

Rule #1: node $k$ should be a node that is geographically located in the middle of all nodes, so that it is relatively fair for every site to communicate with $k$; that is, choose $k$ such that $\sum_{j} cost(j, k)$ is minimized.
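Assuming the pairwise communication costs are known up front, picking such a $k$ is a simple argmin; a small sketch with made-up costs:

```python
# Pick the center node k that minimizes the total communication cost to all
# other nodes; cost[a][b] is the (assumed known) one-way cost from a to b.
def pick_center(nodes, cost):
    return min(nodes, key=lambda k: sum(cost[j][k] for j in nodes if j != k))

nodes = ["a", "b", "c"]
cost = {"a": {"b": 10, "c": 50},
        "b": {"a": 10, "c": 20},
        "c": {"a": 50, "b": 20}}
print(pick_center(nodes, cost))   # -> "b" (total cost 30, the smallest)
```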

Rule #2: A token should remain at one site while there are still outstanding transactions on it. This rule is meant to guarantee the safety property, hopefully by the following two means (a rough sketch follows the list):

(1) tokens are sent within a replication message.
(2) every token includes the last transaction id, and node $k$ only releases a token after it has seen that transaction (seen by the state machine??).
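Here is what Rule #2's release check might look like; the field and function names are my own invention, and "seen" is modeled as the state machine having applied that transaction id.

```python
# Illustrative token metadata for Rule #2: a token is only passed on once the
# releasing side has applied the last transaction committed under it.
from dataclasses import dataclass

@dataclass
class Token:
    item: str
    holder: str
    last_txn_id: int   # id of the last transaction committed under this token

def can_release(token, applied_txn_ids):
    """k may hand the token to the next holder only after its state machine
    has applied the transaction recorded in the token."""
    return token.last_txn_id in applied_txn_ids

tok = Token(item="x", holder="i", last_txn_id=42)
print(can_release(tok, {40, 41}))       # False: txn 42 not applied yet
print(can_release(tok, {40, 41, 42}))   # True: safe to release
```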

Rule #3: A create operation creates a new data item and therefore a new token, which is initially unknown to $k$. This special operation immediately suggests three ways of creating data: (1) create the data globally, which requires the creating site to hold the parent data item in a tree data structure, so that no other site can create a conflicting key; or (2) forward all create requests to site $k$ to commit; or (3) any site may create a local data item that can only be accessed by itself, in which case no token is needed.

Rule #4: Assume each site-to-site one-way communication takes T seconds and committing a transaction takes t, where t << T. If data D has the access pattern [i, j, i, j, i, j, …] from sites i and j, the most efficient policy is to keep token(D) at site $k$, so the average time per transaction is (2T+2t)/2 = T+t.

However, if the access pattern is [i, i, i, j, i, i, i, j, …], then keeping the token at $k$ costs (6T+4t)/4 = 1.5T+t per transaction, while sending the token to $i$ first, revoking it back to $k$ for $j$'s single transaction, and then sending it to $i$ again costs (T+3t+T+T)/4 = 0.75T+0.75t per transaction.
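The exact numbers depend on how messages are counted, so here is a small sketch under a deliberately simplified cost model of my own (a forwarded transaction costs one message plus one commit; revoking and returning the token are one message each). The constants come out slightly different from the bookkeeping above, but the qualitative conclusion is the same: keeping the token at $k$ wins for the alternating pattern, while letting $i$ hold the token wins for the bursty one.

```python
# A deliberately simplified cost model (my own assumptions): a forwarded
# transaction costs one message T plus one commit t; revoking the token to k
# and returning it are one message each; commits at the token holder cost t.
T, t = 1.0, 0.01                     # one-way message cost and commit cost, t << T

def keep_at_k(pattern):
    """Policy A: every transaction is forwarded to k and committed there."""
    return sum(T + t for _ in pattern) / len(pattern)

def owner_holds_token(pattern, owner="i"):
    """Policy B: `owner` keeps the token; a transaction from any other node is
    forwarded to k, the token is revoked to k, committed there, and returned."""
    cost = 0.0
    for node in pattern:
        if node == owner:
            cost += t                # local commit at the token holder
        else:
            cost += T + T + t + T    # forward to k, revoke token, commit, return token
    return cost / len(pattern)

for name, pattern in [("alternating", ["i", "j"] * 2), ("bursty", ["i", "i", "i", "j"])]:
    print(name, round(keep_at_k(pattern), 3), round(owner_holds_token(pattern), 3))
# alternating: keeping the token at k wins; bursty: letting i hold the token wins
```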

Such patterns cannot be detected and anticipated by looking at only two consecutive accesses.



On Ceph MDS

The Ceph Filesystem service includes the Ceph Metadata Server (MDS) deployed with the Ceph Storage cluster. The purpose of the MDS is to store all the filesystem metadata (directories, file ownership, access modes, etc) in high-availability Ceph Metadata Servers where the metadata resides in memory. The reason for the MDS (a daemon called ceph-mds) is that simple filesystem operations like listing a directory or changing a directory (ls, cd) would tax the Ceph OSD Daemons unnecessarily. So separating the metadata from the data means that the Ceph Filesystem can provide high performance services without taxing the Ceph Storage Cluster.
Ceph FS separates the metadata from the data, storing the metadata in the MDS, and storing the file data in one or more objects in the Ceph Storage Cluster. The Ceph filesystem aims for POSIX compatibility. ceph-mds can run as a single process, or it can be distributed out to multiple physical machines, either for high availability or for scalability.
  • High Availability: The extra ceph-mds instances can be standby, ready to take over the duties of any failed ceph-mds that was active. This is easy because all the data, including the journal, is stored on RADOS. The transition is triggered automatically by ceph-mon.
  • Scalability: Multiple ceph-mds instances can be active, and they will split the directory tree into subtrees (and shards of a single busy directory), effectively balancing the load amongst all active servers.
Combinations of standby and active etc are possible, for example running 3 active ceph-mds instances for scaling, and one standby instance for high availability.


The above is from the Ceph documentation; I need more details on multiple active MDS.