Wednesday, October 28, 2015
NFV and Container
Even though only one NFV-related paper appeared at SOSP 2015, I still feel it has been a hot topic in distributed systems since 2014.
NFV stands for network function virtualization. Inspired by the benefits of cloud computing, NFV moves network functions out of dedicated physical devices and into virtualized software applications that run on commodity, general-purpose servers. These network functions include, but are not limited to, firewalls, NAT, QoS, VPN, and routing. Network functions can also be viewed as software specifically designed for packet processing.
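To make the "network functions are just software" view concrete, here is a minimal sketch (not any real NFV framework, and the rule set is hypothetical) of a firewall-style packet filter written as an ordinary function that could run on any general-purpose server:

```python
from collections import namedtuple

# A toy packet and a toy firewall policy: drop anything destined to port 23.
Packet = namedtuple("Packet", ["src_ip", "dst_ip", "dst_port"])
BLOCKED_PORTS = {23}  # hypothetical policy

def firewall(packet):
    """Return True to forward the packet, False to drop it."""
    return packet.dst_port not in BLOCKED_PORTS

packets = [Packet("10.0.0.1", "10.0.0.2", 80),
           Packet("10.0.0.1", "10.0.0.2", 23)]
print([p for p in packets if firewall(p)])  # only the port-80 packet is forwarded
```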
Tuesday, October 27, 2015
Proactive Serialization
This is a follow-up discussion of my adviser's original blog post on the idea of proactive serialization.
The general idea is that serialization does not need to be a Paxos-like total order; with a lock service, it can be proactive and anticipatory. The lock service can anticipate a future request and hand out locks ahead of time, so a worker can proceed without an explicit request. This reduces latency and can improve throughput.
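As a rough illustration of the anticipation idea (my own simplification, not the design from the original post), a lock service could remember recent requesters per lock and, on release, hand the lock straight to the most recent other requester, so that worker proceeds without issuing a new request:

```python
from collections import defaultdict, deque

class AnticipatoryLockService:
    def __init__(self, history_len=4):
        self.recent = defaultdict(lambda: deque(maxlen=history_len))
        self.holder = {}                     # lock_id -> current holder or None

    def request(self, worker, lock_id):
        self.recent[lock_id].append(worker)  # remember who wants this lock
        if self.holder.get(lock_id) is None:
            self.holder[lock_id] = worker
            return True                      # granted immediately
        return False                         # caller must wait / retry

    def release(self, worker, lock_id):
        assert self.holder.get(lock_id) == worker
        self.holder[lock_id] = None
        # Anticipation: grant to the most recent *other* requester, if any.
        for candidate in reversed(self.recent[lock_id]):
            if candidate != worker:
                self.holder[lock_id] = candidate
                return candidate             # proactive grant, no new request
        return None

svc = AnticipatoryLockService()
svc.request("i", "D")          # i gets the lock
svc.request("j", "D")          # j must wait, but its interest is recorded
print(svc.release("i", "D"))   # prints 'j': the lock is re-granted proactively
```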
However, with a slightly different design, anticipation is no longer obvious.
Problem statement: given a star graph $G = (V, E)$ with center node $k$, where $|V| = 1 + |E|$ and $E = \{(k, j) : j \in V, j \neq k\}$. There are $n$ data items shared among all nodes, and each data item is associated with one token. A node $j$ must hold the tokens of all data items it accesses in a transaction in order to commit that transaction. What is the most efficient location for each token? In other words, given the setup and the application's access pattern, the goal is to find the placement of tokens with minimum cost, where the cost is the communication between sites.
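One way to write the goal down more explicitly (my reading of the statement above, treating the placement as static for simplicity; this formulation is not from the original idea): let $\pi(d)$ be the site holding the token of data item $d$, and let each transaction $\tau$ be issued at $site(\tau)$ and access the item set $items(\tau)$. Then we want

$$\pi^* = \arg\min_{\pi} \sum_{\tau} \sum_{d \in items(\tau)} cost\big(site(\tau), \pi(d)\big),$$

where $cost(j, j')$ is the communication cost between sites $j$ and $j'$, and $cost(j, j) = 0$.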
If node $k$ is a simple token manager / lock broker, then the problem is easy to solve: $k$ keeps the access history of all other nodes and assigns each token to a node upon request or by anticipation.
However, suppose node $k$ is the elected leader responsible for exchanging tokens (locks) among the other nodes in the cluster, while at the same time $k$ behaves like any other node: it also accepts requests from clients and proposes/commits transactions. Then the problem becomes: should $k$ move the token, or keep it and commit at $k$? Should node $j$ block a transaction while waiting for the token, or forward the transaction to $k$ instead?
When two nodes $i$ and $j$ compete for the same data item, it is better to keep the token at $k$ instead of moving it back and forth all the time. A simple way to decide when to move a token is to remember the last $N$ consecutive accesses to each item. Say that at a certain point in time $i$ holds the token; now if $j$ forwards its transaction to $k$, should $k$ wait until $i$'s lease times out, which blocks $j$'s request, or revoke the token back to $k$ immediately?
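A minimal sketch of the "remember the last $N$ consecutive accesses" heuristic (my own simplification of the rule above): ship the token to a site only if that site issued all of the last $N$ accesses to the item; otherwise keep it at $k$:

```python
from collections import defaultdict, deque

N = 3
last_accesses = defaultdict(lambda: deque(maxlen=N))   # item -> recent sites

def record_access(item, site):
    last_accesses[item].append(site)

def should_move_token(item, requesting_site):
    recent = last_accesses[item]
    return len(recent) == N and all(s == requesting_site for s in recent)

for site in ["i", "i", "i"]:
    record_access("D", site)
print(should_move_token("D", "i"))  # True: i dominates, ship the token to i
print(should_move_token("D", "j"))  # False: keep the token at k
```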
Let's try to set up some rules.
Rule #1: node $k$ should be one of the nodes geographically located in the middle of all nodes, so that it is relatively fair for every site to communicate with $k$: choose $k$ such that $\sum_{j} cost(j, k)$ is minimal.
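Rule #1 amounts to choosing $k = \arg\min_{k} \sum_{j} cost(j, k)$. A tiny sketch, assuming $cost(j, k)$ is simply a known one-way latency table (the numbers below are made up):

```python
def pick_center(nodes, cost):
    """cost: dict of (a, b) -> latency; returns the node minimizing total cost."""
    return min(nodes, key=lambda k: sum(cost[(j, k)] for j in nodes if j != k))

nodes = ["a", "b", "c"]
cost = {("a", "b"): 10, ("b", "a"): 10, ("a", "c"): 30, ("c", "a"): 30,
        ("b", "c"): 20, ("c", "b"): 20}
print(pick_center(nodes, cost))   # 'b' sits in the middle
```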
Rule #2: a token should remain at one site while there are still outstanding transactions on it. This rule should guarantee the safety property, hopefully by the following two means:
(1) tokens are sent within a replication message;
(2) every token includes the last transaction id, and node $k$ only releases a token after it has seen that transaction (seen by the state machine??).
Rule #3: a create operation creates a new data item, and thus a new token, which is initially unknown to $k$. This special operation immediately leads to several ways of creation: (1) create the data item globally, which requires the creating site to hold the parent data item in a tree data structure, so that no other site can create a conflicting key; (2) forward all create requests to site $k$ to commit; or (3) let any site create a local data item that can be accessed only by itself, in which case no token is needed.
Rule #4: assume each site-to-site one-way communication takes $T$ seconds, and the time to commit a transaction is $t$, where $t \ll T$. If data item $D$ has the following access pattern from sites $i$ and $j$: $[i, j, i, j, i, j, \ldots]$, then the most efficient choice is to keep $token(D)$ at site $k$, so the average time for each transaction is $(2T+2t)/2 = T+t$.
However, if the access pattern is $[iii, j, iii, j, iii, j, \ldots]$, then keeping the token at $k$ costs $(6T+4t)/4 = 1.5T+t$ per transaction. But sending the token to $i$ first, then revoking it for $j$'s single transaction, then sending it to $i$ again, costs $(T+3t+T+T)/4 = 0.75T+0.75t$ per transaction.
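Plugging the post's own formulas into hypothetical numbers (with $t \ll T$) to double-check the comparison in Rule #4:

```python
def keep_token_at_k(T, t):
    # pattern [iii, j]: (6T + 4t) over 4 transactions, as counted above
    return (6 * T + 4 * t) / 4

def migrate_token(T, t):
    # send token to i, commit 3 txns locally, revoke for j, send again:
    # (T + 3t + T + T) over 4 transactions
    return (T + 3 * t + T + T) / 4

T, t = 100.0, 1.0              # hypothetical latencies, t << T
print(keep_token_at_k(T, t))   # 151.0  -> 1.5T + t per transaction
print(migrate_token(T, t))     # 75.75  -> 0.75T + 0.75t per transaction
```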
Such patterns cannot be detected and anticipated by looking at only two consecutive accesses.
On Ceph MDS
The Ceph Filesystem service includes the Ceph Metadata Server (MDS) deployed with the Ceph Storage cluster. The purpose of the MDS is to store all the filesystem metadata (directories, file ownership, access modes, etc) in high-availability Ceph Metadata Servers where the metadata resides in memory. The reason for the MDS (a daemon called ceph-mds) is that simple filesystem operations like listing a directory or changing a directory (ls, cd) would tax the Ceph OSD Daemons unnecessarily. So separating the metadata from the data means that the Ceph Filesystem can provide high performance services without taxing the Ceph Storage Cluster.
Ceph FS separates the metadata from the data, storing the metadata in the MDS, and storing the file data in one or more objects in the Ceph Storage Cluster. The Ceph filesystem aims for POSIX compatibility. ceph-mds can run as a single process, or it can be distributed out to multiple physical machines, either for high availability or for scalability.
- High Availability: The extra ceph-mds instances can be standby, ready to take over the duties of any failed ceph-mds that was active. This is easy because all the data, including the journal, is stored on RADOS. The transition is triggered automatically by ceph-mon.
- Scalability: Multiple ceph-mds instances can be active, and they will split the directory tree into subtrees (and shards of a single busy directory), effectively balancing the load amongst all active servers.
Combinations of standby and active etc are possible, for example running 3 active ceph-mds instances for scaling, and one standby instance for high availability.
The above is from the Ceph documentation; I need more details on multiple active MDS.