Logs, sometimes called write-ahead logs, commit logs, or transaction logs, are at the heart of many distributed data systems. A log is perhaps the simplest abstraction of this kind: an append-only, totally ordered sequence of records, ordered by time. Put differently, time is abstracted into the log’s sequence number (entry id). This simple abstraction solves many problems in distributed systems.
The log solves the consistency problem in the state machine replication approach: if two identical, deterministic processes begin in the same state and receive the same inputs in the same order (from a shared log), they will produce the same outputs and end in the same state. Not only do distributed data systems use the log for consistent replication; common consensus systems like Raft and ZooKeeper also maintain a shared log at their heart.
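To make the determinism argument concrete, here is a minimal sketch in plain Java. The `Command` type and the key-value state machine are hypothetical, not taken from any particular system; the point is only that two replicas replaying the same shared log end in the same state.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical command type: "put key value" entries appended to a shared log.
record Command(String key, String value) {}

// A deterministic state machine: applying the same commands in the same order
// always yields the same state.
class KvStateMachine {
    private final Map<String, String> state = new HashMap<>();

    void apply(Command c) {
        state.put(c.key(), c.value());
    }

    Map<String, String> snapshot() {
        return new HashMap<>(state);
    }
}

public class ReplicationDemo {
    public static void main(String[] args) {
        // The shared, totally ordered log of commands.
        List<Command> sharedLog = new ArrayList<>(List.of(
                new Command("x", "1"),
                new Command("y", "2"),
                new Command("x", "3")));

        KvStateMachine replicaA = new KvStateMachine();
        KvStateMachine replicaB = new KvStateMachine();

        // Each replica replays the log independently.
        sharedLog.forEach(replicaA::apply);
        sharedLog.forEach(replicaB::apply);

        // Deterministic replay => identical final states.
        System.out.println(replicaA.snapshot().equals(replicaB.snapshot())); // true
    }
}
```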
The log abstraction brings even more benefits if used wisely:
1. It decouples event (log entry) producers from consumers, which avoids the slow-consumer problem and simplifies the communication model.
2. It unifies the messaging format.
3. It provides durability, which makes consumer recovery easy (see the sketch after this list).
4. It enables streaming data flows.
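As a rough illustration of points 1 and 3, the sketch below (hypothetical names, not tied to any particular system) shows a producer appending to a durable, file-backed log while a consumer only needs to remember its last-read entry id to recover after a crash:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

// Hypothetical file-backed log: one record per line, the line number is the entry id.
class DurableLog {
    private final Path file;

    DurableLog(Path file) throws IOException {
        this.file = file;
        if (Files.notExists(file)) Files.createFile(file);
    }

    // Producer side: append a record; it is never modified afterwards.
    void append(String record) throws IOException {
        Files.write(file, List.of(record), StandardCharsets.UTF_8,
                StandardOpenOption.APPEND);
    }

    // Consumer side: read everything after the consumer's last known entry id.
    List<String> readFrom(long entryId) throws IOException {
        List<String> all = Files.readAllLines(file, StandardCharsets.UTF_8);
        return all.subList((int) Math.min(entryId, all.size()), all.size());
    }
}

public class ConsumerRecoveryDemo {
    public static void main(String[] args) throws IOException {
        DurableLog log = new DurableLog(Path.of("events.log"));
        log.append("order-created");
        log.append("order-paid");

        long lastConsumed = 1;                    // offset the consumer persisted before crashing
        List<String> pending = log.readFrom(lastConsumed);
        System.out.println(pending);              // [order-paid] -- resume where we left off
    }
}
```

The producer never waits on the consumer, and the consumer can come and go freely, which is exactly the decoupling the list above describes.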
There are two ways of using a log for replication, primary-backup replication and active-active replication, as shown in the following figure:
The drawback of the shared-log approach is its cost. Existing distributed log services such as BookKeeper and Kafka make different design choices to optimize performance under their respective usage assumptions, in the following ways:
1. Cost of disk space
Problem: To provide durability, every single append to the log must be synchronously written to disk, which consumes a lot of space.
Solution: Traditional magnetic disks are cheap and are a good fit for logs.
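For context on what "synchronously written" means, here is a minimal sketch using standard Java NIO (not specific to BookKeeper or Kafka): a durable append forces the data to the device before the write is acknowledged.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class DurableAppend {
    public static void main(String[] args) throws IOException {
        try (FileChannel journal = FileChannel.open(Path.of("journal.log"),
                StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                StandardOpenOption.APPEND)) {
            journal.write(ByteBuffer.wrap("entry-1\n".getBytes(StandardCharsets.UTF_8)));
            // force(false) flushes the file data (not metadata) to the physical device,
            // so the entry survives a crash once this call returns.
            journal.force(false);
        }
    }
}
```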
2. Cost of read/write speed
Problem: Writing to disk is slow in general, but the append-only property of logs makes writes fast even on magnetic disks, because the head moves across the disk far less. Sequential reads are also relatively fast, but some entries are redundant: they may be overwritten by later records.
Solution: BookKeeper assumes a write-intensive workload that requires high throughput and low latency for writes. It separates the read and write devices to minimize interference of reads with writes. This design uses different I/O components for each of the core log-storage workloads, resulting in optimized write performance and reasonably good reads. There is a trade-off between fast writes of interleaved entries from different ledgers and the seek time of random reads; BookKeeper chooses the interleaved log and then optimizes random reads with a cache and an in-memory index. Periodic log compaction and up-to-date reads from the memory table also improve performance for tailing reads and catch-up reads.
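A rough sketch of the interleaved-log idea (hypothetical and greatly simplified compared to BookKeeper's actual storage): entries from many ledgers are appended sequentially to one file, and an in-memory index maps (ledgerId, entryId) to a file offset so that a random read needs only a single positioned read.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashMap;
import java.util.Map;

// Simplified interleaved entry log: all ledgers share one sequentially written file.
class InterleavedEntryLog {
    private final FileChannel channel;
    private final Map<String, long[]> index = new HashMap<>(); // "ledger:entry" -> {offset, length}

    InterleavedEntryLog(Path path) throws IOException {
        channel = FileChannel.open(path, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE);
    }

    // Sequential append, regardless of which ledger the entry belongs to.
    void addEntry(long ledgerId, long entryId, byte[] data) throws IOException {
        long offset = channel.size();
        channel.write(ByteBuffer.wrap(data), offset);
        index.put(ledgerId + ":" + entryId, new long[]{offset, data.length});
    }

    // Random read served by the in-memory index: one seek, one read.
    byte[] readEntry(long ledgerId, long entryId) throws IOException {
        long[] loc = index.get(ledgerId + ":" + entryId);
        ByteBuffer buf = ByteBuffer.allocate((int) loc[1]);
        channel.read(buf, loc[0]);
        return buf.array();
    }

    public static void main(String[] args) throws IOException {
        InterleavedEntryLog log = new InterleavedEntryLog(Path.of("entrylog.0"));
        log.addEntry(1, 0, "ledger1-entry0".getBytes());
        log.addEntry(2, 0, "ledger2-entry0".getBytes());   // interleaved with ledger 1
        log.addEntry(1, 1, "ledger1-entry1".getBytes());
        System.out.println(new String(log.readEntry(1, 1))); // ledger1-entry1
    }
}
```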
3. SPOF
Problem: The log becomes a single point of failure if the log storage server can fail. No reads or writes to the log are possible during a fault.
Solution: Replicate at the log-entry level (BookKeeper) or at the log-partition level (Kafka).
BookKeeper assumes a single writer for each ledger and lets the client choose the quorum of bookies that replicates each log entry. The problem with this approach is that partial writes can easily result in inconsistent reads. BookKeeper uses ZooKeeper to guarantee agreement on the contents of a ledger by having the BookKeeper client write the ledger metadata into ZooKeeper.
Early versions of Kafka had no replication of the log. Later, Kafka used ZooKeeper to elect a leader among the set of replicas for each partition. The leader then serves all read and write requests from clients for that partition; any synchronous replication is handled by the partition leader.
4. Scalability
Problem: A single log maintainer does not scale with the number of log producers and online/offline consumers.
Solution: A BookKeeper producer writes to ledgers (segments of the log); each entry in a ledger is distributed and striped across an ensemble of bookies (storage servers). Therefore, the write throughput is roughly the per-bookie rate R multiplied by the ensemble size E and divided by the quorum size Q, i.e. R × E / Q. BookKeeper consumers read ledger entries in a similar way.
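As an illustration, the classic BookKeeper client API lets the writer choose the ensemble and quorum sizes when creating a ledger. This is only a sketch: the ZooKeeper address, sizes, and password below are placeholder values.

```java
import org.apache.bookkeeper.client.BookKeeper;
import org.apache.bookkeeper.client.LedgerHandle;

public class BkWriterSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder ZooKeeper connect string; BookKeeper discovers bookies through it.
        BookKeeper bk = new BookKeeper("zk1:2181");

        // Ensemble size E = 5, quorum size Q = 3: each entry is striped across the
        // ensemble and replicated to a quorum of bookies, giving write throughput
        // on the order of R * E / Q.
        LedgerHandle ledger = bk.createLedger(5, 3,
                BookKeeper.DigestType.MAC, "secret".getBytes());

        for (int i = 0; i < 10; i++) {
            ledger.addEntry(("entry-" + i).getBytes());
        }
        ledger.close();   // seals the ledger so readers can agree on its contents
        bk.close();
    }
}
```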
Kafka uses a larger number of partitions for each topic. Usually the number of partitions |P| is greater than the number of consumers in a consumer group |C|. A Kafka producer can randomly select which partition the next message should be sent to, thereby balancing the workload.
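A minimal sketch with the standard Kafka Java producer (the broker address and topic name are placeholders): when records carry no key, the producer's partitioner spreads them across the topic's partitions, balancing load over the brokers.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");          // placeholder broker
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 100; i++) {
                // No key: the partitioner distributes records across partitions.
                producer.send(new ProducerRecord<>("events", "message-" + i));
            }
        }
    }
}
```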
Kafka is unlike any traditional messaging service and lies somewhere between a replicated log service and a messaging service. The reason is the stateless approach of Kafka’s brokers (storage servers): a Kafka broker does not maintain consumer subscription information, but instead lets consumers manage that information themselves. There are two side effects: (1) a consumer can deliberately pull the same message more than once, and (2) messages cannot be deleted any time soon.
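For example, with the standard Kafka Java consumer (broker, topic, and group below are placeholders), a consumer can track offsets itself and seek backwards to pull messages it has already read, since the broker keeps no per-consumer state:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class KafkaConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");           // placeholder broker
        props.put("group.id", "demo-group");                      // placeholder group
        props.put("enable.auto.commit", "false");                 // consumer tracks offsets itself
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition partition = new TopicPartition("events", 0);
            consumer.assign(List.of(partition));
            consumer.seekToBeginning(List.of(partition));

            // First pass over the available messages.
            ConsumerRecords<String, String> first = consumer.poll(Duration.ofSeconds(1));
            System.out.println("first pass: " + first.count() + " records");

            // Rewind and pull the same messages again: re-consumption is entirely
            // the client's decision because the broker is stateless about consumers.
            consumer.seek(partition, 0);
            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                System.out.println(record.offset() + ": " + record.value());
            }
        }
    }
}
```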
Hedwig is a messaging service built on top of BookKeeper, which can also be compared to Kafka. Hedwig’s hubs (brokers) keep track of subscription and consumption information, so it can guarantee that a message is received by a consumer exactly once. Since Hedwig depends on BookKeeper to store its internal data, it focuses on high durability.
In general, Hedwig works better in situations with a large number of topics and few consumers, because more topics open more concurrent ledgers. Kafka works better with a large number of consumers, given its stateless broker design.