Sunday, February 21, 2016

Paper Review: Durability with BookKeeper

Logs, sometimes called write-ahead logs, commit logs, or transaction logs, are at the heart of many distributed data systems. A log is one of the simplest abstractions: an append-only, totally-ordered sequence of records, ordered by time. Conversely, time itself is abstracted by the log's sequence number (entry id). This simple abstraction solves many problems in distributed systems.

The log solves the consistency problem in the state machine replication approach: it is guaranteed that if two identical, deterministic processes begin in the same state and receive the same inputs in the same order (from a shared log), they will produce the same outputs and end in the same state. Not only do distributed data systems use the log for consistent replication; common consensus systems like Raft and ZooKeeper also maintain a shared log at their heart.

The log abstraction can bring even more benefits if wisely used:
1. It decouples event (log entry) producers from consumers, which avoids the slow-consumer problem and simplifies the communication model.
2. It unifies the messaging format.
3. It provides durability, making consumer recovery easy.
4. It enables streaming data flows.
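The append-only, totally-ordered abstraction described above can be sketched in a few lines. This is a minimal illustration, not the API of any real log service; the class and method names are my own:

```python
# Minimal sketch of the append-only log abstraction: a totally-ordered
# sequence of records, where the sequence number (entry id) stands in
# for time. Names here are illustrative, not from BookKeeper or Kafka.
class Log:
    def __init__(self):
        self._entries = []

    def append(self, record):
        """Append a record; return its sequence number (entry id)."""
        self._entries.append(record)
        return len(self._entries) - 1

    def read(self, from_seq=0):
        """Replay (seq, record) pairs in total order starting at from_seq."""
        return list(enumerate(self._entries))[from_seq:]

log = Log()
log.append({"op": "set", "key": "x", "value": 1})
log.append({"op": "set", "key": "x", "value": 2})

# Any deterministic replica replaying the same log reaches the same state.
state = {}
for seq, rec in log.read():
    state[rec["key"]] = rec["value"]
assert state == {"x": 2}
```

Replaying this log on any number of replicas yields the same final state, which is exactly the state-machine-replication guarantee.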

There are two ways of using a log for replication, primary-backup replication and active-active replication, as shown in the following figure:



The drawback of the shared-log approach is its cost. Existing distributed log services such as BookKeeper and Kafka make different design choices to optimize performance under their different workload assumptions, in the following ways:

1. Cost of disk space
Problem: To provide durability, every append to the log must be synchronously written to disk, which consumes a lot of space.
Solution: Traditional magnetic disks are cheap and a good fit for logs.

2. Cost of read/write speed
Problem: Writing to disk is slow in general, but the append-only property of logs makes it fast on magnetic disks, since the head barely moves across the disk. Sequential reading is also relatively fast, but some entries are redundant, since they may be overwritten by later records.
Solution: BookKeeper assumes a write-intensive workload that requires high throughput and low latency for writes. It separates the read and write devices to minimize the interference of read operations with writes. This design uses a different I/O component for each of the core log-storage workloads, resulting in optimized write performance and acceptable reads. There is a trade-off between fast writes of interleaved entries from different ledgers and the seek time of random reads; BookKeeper chooses the interleaved log and then optimizes random reads with a cache and an in-memory index. Periodic log compaction and up-to-date reads from the memory table also improve performance for tailing reads and catch-up reads.
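The interleaved-log trade-off above can be sketched as follows. This is a toy model, not BookKeeper's actual storage format: a single sequential journal keeps writes fast, while an in-memory index keeps random reads from scanning the whole journal (real bookies also persist their index):

```python
# Sketch of BookKeeper-style interleaving: entries from many ledgers are
# appended sequentially to one journal (fast, seek-free writes), while an
# in-memory index maps (ledger_id, entry_id) -> journal position so a
# random read is a single indexed lookup. Names are illustrative.
journal = []   # sequential, append-only
index = {}     # (ledger_id, entry_id) -> position in journal

def add_entry(ledger_id, entry_id, payload):
    index[(ledger_id, entry_id)] = len(journal)
    journal.append((ledger_id, entry_id, payload))  # one sequential write

def read_entry(ledger_id, entry_id):
    return journal[index[(ledger_id, entry_id)]][2]  # one indexed lookup

# Writers of different ledgers interleave freely in the journal:
add_entry(1, 0, b"a")
add_entry(2, 0, b"x")
add_entry(1, 1, b"b")
assert read_entry(1, 1) == b"b"
```

Without the index, reading one ledger back would require scanning past the other ledgers' entries, which is the random-read cost the design pays for its sequential writes.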


3. SPOF
Problem: The log becomes a single point of failure if the log storage server can fail: no read or write operations on the log are available during faults.
Solution: Replicate at log entry level (BookKeeper) or at log partition level (Kafka).
BookKeeper assumes there is a single writer for each ledger, and lets the client choose the quorum of bookies that replicate each log entry. The problem with this approach is that partial writes can easily result in inconsistent reads. BK uses ZooKeeper to guarantee agreement on the content of a ledger, by having the BK client write the ledger metadata into ZooKeeper.
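The entry-level replication just described can be sketched as a quorum write. This is a simplified model under my own naming, not BookKeeper's client protocol; in particular, the "partial write" outcome is what the ZooKeeper-held ledger metadata must ultimately resolve:

```python
# Sketch of entry-level quorum replication: the single writer sends an
# entry to an ensemble of bookies and considers it durable once a write
# quorum has acknowledged it. Class and function names are illustrative.
class Bookie:
    def __init__(self, up=True):
        self.up, self.entries = up, []
    def store(self, entry):
        if self.up:
            self.entries.append(entry)
        return self.up            # ack only if the bookie is healthy

def write_entry(entry, ensemble, quorum_size):
    """Send entry to every bookie; succeed once quorum_size acks arrive."""
    acks = 0
    for bookie in ensemble:
        if bookie.store(entry):   # may fail on a faulty bookie
            acks += 1
        if acks >= quorum_size:
            return True           # durable: quorum reached
    return False                  # partial write: must be resolved via
                                  # the ledger metadata in ZooKeeper

ensemble = [Bookie(), Bookie(up=False), Bookie()]
assert write_entry(b"e0", ensemble, quorum_size=2)
```

Here the entry survives one bookie failure because two of the three ensemble members acknowledged it.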
In early versions of Kafka, the log was not replicated. Later, Kafka used ZooKeeper to elect a leader among a set of replicas for each partition. The leader then serves all client write and read requests for that partition, and any synchronous replication is handled by the partition leader.


4. Scalability
Problem: A single log maintainer does not scale with the number of log producers and online/offline consumers.
Solution: A BookKeeper producer writes to ledgers (segments of the log); each entry in a ledger is distributed and striped across an ensemble of bookies (storage servers). The write throughput is therefore the bookie rate R, multiplied by the ensemble size E and divided by the quorum size Q. BookKeeper consumers read ledger entries in a similar way.
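As a quick sanity check of that throughput claim, the scaling factor works out as R * E / Q; the concrete numbers below are illustrative, not from the paper:

```python
# Back-of-the-envelope check of the write-throughput formula above:
# striping entries over an ensemble of E bookies with write quorum Q
# scales a single bookie's rate R by E/Q.
def write_throughput(bookie_rate, ensemble_size, quorum_size):
    return bookie_rate * ensemble_size / quorum_size

# E.g. 5,000 entries/s per bookie, ensemble of 6, write quorum of 3:
assert write_throughput(5000, 6, 3) == 10000.0
```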
Kafka uses a large number of partitions for each topic; usually the number of partitions |P| is greater than the number of consumers in a consumer group |C|. The Kafka producer can randomly select which partition the next message should be sent to, thereby balancing the workload.
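The random partition selection just mentioned balances load in expectation, which a small simulation can illustrate; the partition and message counts here are made up for the sketch:

```python
# Sketch of Kafka-style load balancing: a producer with no routing key
# spreads messages across partitions uniformly at random, so each of the
# |P| partitions receives roughly 1/|P| of the traffic.
import random
from collections import Counter

num_partitions = 8
sends = Counter(random.randrange(num_partitions) for _ in range(80_000))

# Each partition receives close to 80_000 / 8 = 10_000 messages.
assert all(abs(count - 10_000) < 1_000 for count in sends.values())
```

With |P| > |C|, each consumer in a group can then own several partitions, and adding consumers simply redistributes whole partitions among them.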






Kafka is unlike any traditional messaging service; it lies between a replicated log service and a messaging service. The reason is the stateless approach of Kafka's brokers (storage servers): a Kafka broker does not maintain consumer subscription information, but instead lets consumers manage that information themselves. This has two side effects: (1) a consumer can deliberately pull the same message more than once, and (2) messages cannot be deleted any time soon.
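The stateless-broker idea and both of its side effects can be sketched as follows. This is a toy model with illustrative names, not Kafka's actual consumer API:

```python
# Sketch of the stateless-broker design: the broker only serves the log;
# each consumer tracks its own offset. A consumer can therefore rewind
# and re-pull old messages, and the broker cannot know when a message
# has been consumed by everyone, so it cannot delete messages eagerly.
broker_log = ["m0", "m1", "m2", "m3"]   # broker keeps only the log

class Consumer:
    def __init__(self):
        self.offset = 0                 # kept by the consumer, not the broker

    def pull(self, n=1):
        msgs = broker_log[self.offset:self.offset + n]
        self.offset += len(msgs)
        return msgs

    def rewind(self, offset):
        self.offset = offset            # deliberate re-consumption

c = Consumer()
assert c.pull(2) == ["m0", "m1"]
c.rewind(0)
assert c.pull(1) == ["m0"]              # same message pulled again
```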

Hedwig is a messaging service built on top of BookKeeper, which can also be compared to Kafka. Hedwig's hubs (brokers) keep track of subscription and consumption information, so it can guarantee that each message is received by a consumer exactly once. Since Hedwig depends on BookKeeper to store its internal data, it focuses on high durability.

In general, Hedwig works better in situations with a large number of topics and few consumers, because more topics open more concurrent ledgers. Kafka works better with a large number of consumers, given its stateless broker design.

Paper Review: Cost-Aware Compressive Sensing for Networked Sensing Systems

The major issue addressed by this paper is designing algorithms for compressive sensing systems where each sample has a different cost; the algorithmic goal is to minimize the total cost subject to a recovery-accuracy constraint. Introducing the concept of per-sample cost is novel and genuinely fills the gap between the theory of compressive sensing and practical crowd-sensing systems, especially systems with limited resources.

The paper creatively uses the Regularized Column Sum (RCS), which is equivalent to a statistical RIP (Restricted Isometry Property) condition, to predict recovery accuracy, and formulates the RCS-constrained Optimization (RO) problem to find an optimal randomized sampling strategy $\pi$. To make RO solvable, the paper uses two relaxations:
$$ \sigma (F_\Omega) \leq \alpha \to \mathbb{E}[\sigma (F_\Omega)] \leq \alpha $$
$$ \pi \in \{0,1\}^n \to \pi \in [0,1]^n $$

However, RO is a centralized algorithm, since it requires global information about the cost map, which makes it impractical. The paper further provides two decentralized algorithms that use only partial information.

The paper is very well structured. It first gives an intuitive example of how a greedy algorithm can fail in real-world applications with a spatially correlated cost map, then clearly states all the challenges in solving the problem, including predicting accuracy, computational complexity, etc. The paper provides solutions with proofs as the foundation of the proposed algorithms. Finally, the evaluation is comprehensive, covering two types of applications and costs.

Critiques

(1) Both proposed algorithms assume a fixed sample size m and then compute the sampling strategy $\pi$, but the paper does not discuss how a crowd-sensing application would estimate m in the first place.

(2) The cost map $c = \{c_1, c_2, …, c_n\}$ in CACS is represented by a vector of real numbers with $0 \leq c_i \leq 1$. However, as the paper states, there are more complicated scenarios where the cost can take many forms. The paper only considers a linear combination of spatial cost maps and the battery penalty as the overall sensing cost.

(3) In Figure 7 of the evaluation section, for most cost-map choices the proposed distributed algorithms exhibit a cost ratio similar to the baseline greedy algorithm, even though the RO algorithm performs much better in all situations. I think this is because the partial-information limitation of PW and DWS leads to poor estimates of the better strategies that RO finds. It also makes one wonder whether it is worth using PW/DWS instead of a simple greedy strategy. The distributed algorithms of CACS definitely need improvement.