Monday, September 5, 2016

What Paxos Sounds Like

Sonification is the use of non-speech audio to convey information or perceptualize data. Auditory perception has advantages in temporal, spatial, amplitude, and frequency resolution that open possibilities as an alternative or complement to visualization techniques.    -- Wikipedia

By following a simple rule of assigning different audio frequencies to meaningful events or state changes in a computer program, we can clearly hear the characteristic patterns that distinguish them, as with the sorting algorithms in this video.

What about a distributed algorithm? When multiple processes join in to play sound, will it become an orchestra? As a fun experiment, we tried to make the Paxos implementation from this paper perform.

With 5 acceptors, 2 replicas, and 2 leaders, each process plays a different MIDI instrument, and each event in the Paxos system is assigned a different note. To make it sound better, we delay every event by 150 milliseconds, so that each note is held for 150 ms.
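As an illustration of the idea (not the actual implementation inside the jar), a minimal Python sketch of such an event-to-note mapping might look like the following; the event names, note numbers, and the play_note helper are hypothetical stand-ins.

import time

# Hypothetical mapping from Paxos event types to MIDI note numbers.
# The real program uses its own event set and per-process instruments.
EVENT_TO_NOTE = {
    "P1A_PREPARE":  60,  # leader sends phase-1a
    "P1B_PROMISE":  64,  # acceptor responds phase-1b
    "P2A_ACCEPT":   67,  # leader sends phase-2a
    "P2B_ACCEPTED": 72,  # acceptor responds phase-2b
    "DECISION":     76,  # replica learns a decision
}

def play_note(instrument, note, duration_ms=150):
    """Placeholder: hand the (instrument, note) pair to a MIDI backend."""
    print(f"instrument={instrument} note={note} for {duration_ms}ms")
    time.sleep(duration_ms / 1000.0)

def sonify(event_log):
    """event_log: iterable of (process_id, event_type) pairs in arrival order."""
    for process_id, event in event_log:
        play_note(instrument=process_id, note=EVENT_TO_NOTE[event])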



Since execution in a distributed system is non-deterministic (messages arrive at uncertain times), the real-time sound is different every time we run the program, but the pattern remains. Our experimental program can be downloaded here and run in a terminal with the following command:


java -jar paxos-sonification.jar


Friday, May 27, 2016

Survey on Compressive Sensing

Introduction

This survey was inspired by a seminar on crowd sensing systems, in which I was exposed to the celebrated theory of compressive sensing (also known as compressive sampling or CS), a novel sensing paradigm that can help reduce the sampling rate of sensing tasks. Sampling is a data acquisition protocol that aims to recover the original signal of interest; such signals include audio, medical images, radio, computer vision data, etc. Traditionally, Shannon’s theorem states that the sampling rate must be at least twice the maximum frequency present in the signal (the Nyquist rate). However, CS suggests that it is possible to surpass this traditional limit: many signals can be represented using only a few non-zero coefficients in a suitable basis or dictionary, and nonlinear optimization enables recovery of such signals from a small set of measurements. Another way to look at it is that rather than measuring at a high rate first and then compressing the sampled data, CS directly senses the data in a compressed form.

In this survey, we first provide a detailed review of the mathematical foundations underlying the compressive sensing problem. Then we present extensions of CS techniques.


Compressive Sensing Primer

Signals

Signals can be treated as real-valued functions $f$, with continuous or discrete domains, finite or infinite. Signals produced by natural or man-made systems usually have a discrete, finite domain. We can model such signals as vectors living in an $n$-dimensional Euclidean vector space, denoted by $\mathbb{R}^n$. The vector space structure allows us to add two signals together to generate a new, physically meaningful signal. Other geometric tools such as lengths, distances, and angles can also be used to describe and compare signals of interest.

The length of a vector $x = (x_1, x_2, ..., x_n)$ in $n$-dimensional real vector space $\mathbb{R}^n$ can be given by the Euclidean norm:
$$ \| x \|_2 = (x_1^2 + x_2^2 + ... + x_n^2)^{\frac{1}{2}}$$
The Euclidean distance between two points $x$ and $y$ is the length $\| x-y \|_2$ of the straight line between the two points. However, the Euclidean distance is insufficient in many situations. Instead, we make frequent use of the $p$-norms, which are defined for $p \in [0,\infty]$ as

\[ \| x \|_p =
\begin{cases}
|\text{supp}(x)|  & \quad p = 0; \\
\left(\sum_{i=1}^{n}|x_i|^p\right)^{\frac{1}{p}}       & \quad p \in [1, \infty); \\
\max_{i}|x_i|   & \quad p = \infty \\
\end{cases}
\]

where $\text{supp}(x) = \{ i : x_i \neq 0 \}$ denotes the support of $x$ and $|\text{supp}(x)|$ its cardinality.
Also note that, with the standard \emph{inner product} in $\mathbb{R}^n$,
$$ \langle x, x \rangle = x^Tx = \sum_{i=1}^{n}x_ix_i $$
thus the $\ell_2$ norm can be written as $\|x\|_2 = \sqrt{\langle x,x \rangle}$.
Norms are typically used as a measure of the strength of a signal or the size of an error.
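As a quick, purely illustrative check, the $p$-norms above can be computed with NumPy as follows:

import numpy as np

x = np.array([3.0, 0.0, -4.0, 0.0])

l0 = np.count_nonzero(x)          # |supp(x)|, the "l0 norm"
l1 = np.sum(np.abs(x))            # sum of absolute values
l2 = np.sqrt(np.sum(x ** 2))      # Euclidean norm, same as np.linalg.norm(x)
linf = np.max(np.abs(x))          # largest-magnitude entry

print(l0, l1, l2, linf)           # 2 7.0 5.0 4.0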


Bases

Any vector $f \in \mathbb{R}^n$ can be represented by a linear combination of basis vectors $\{\psi_i\}_{i=1}^{n}$ and (unique) coefficients $\{x_i\}_{i=1}^{n}$, such that
\begin{equation}
f(t) = \sum_{i=1}^{n} x_i \psi_i(t)
\label{eq:basis}
\end{equation}
Let $\Psi$ denote the $n \times n$ matrix with $\psi_i$ as columns, and let $x$ denote the vector with entries $x_i = \langle f,\psi_i \rangle$; we can then rewrite the equation more compactly as
\begin{equation}
f = \Psi x
\end{equation}
The discrete Fourier basis and the discrete cosine basis are typical choices for the sparsifying basis $\Psi$.
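As a small sketch, assuming an orthonormal DCT as the sparsifying basis $\Psi$ (and using SciPy purely for illustration), one can build $\Psi$ explicitly and verify that $f = \Psi x$ with $x_i = \langle f, \psi_i \rangle$:

import numpy as np
from scipy.fftpack import idct

n = 64
# Columns of Psi are the orthonormal DCT-II basis vectors:
# applying the inverse DCT to the identity gives the synthesis matrix.
Psi = idct(np.eye(n), axis=0, norm='ortho')

f = np.random.randn(n)            # an arbitrary signal
x = Psi.T @ f                     # coefficients x_i = <f, psi_i>
assert np.allclose(f, Psi @ x)    # f = Psi x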


Sparsity

A signal $f$ is $S$-sparse when it has at most $S$ nonzeros, i.e., $\|f\|_0 \leq S$. Signals that are not themselves sparse, but can be expressed as $f=\Psi x$ in some basis $\Psi$ where $\|x\|_0 \leq S$, are still referred to as $S$-sparse. When the signal has a sparse expansion, we can discard the small coefficients without much perceptual loss. Formally, $f_S(t)$ is obtained by keeping only the terms corresponding to the $S$ largest values of $x_i$ in the expansion above, i.e., $f_S := \Psi x_S$. We have
\begin{equation}
\| f - f_S \|_{\ell_2} = \| x - x_S \|_{\ell_2}
\label{eq:sparse}
\end{equation}
If the sorted magnitudes of $x_i$ decay quickly, we say $x$ is sparse or compressible; then $x$ is well approximated by $x_S$ and the error $\|f-f_S\|_{\ell_2}$ is small. In other words, a large fraction of the coefficients can be discarded (set to 0) without much loss.
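The compressibility argument can be illustrated with a short sketch: keep the $S$ largest-magnitude coefficients and measure the relative error. The power-law coefficient decay below is just a made-up example.

import numpy as np

def best_s_term(x, S):
    """Keep the S largest-magnitude entries of x, zero out the rest."""
    x_S = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-S:]
    x_S[idx] = x[idx]
    return x_S

# A compressible coefficient vector: sorted magnitudes decay like a power law.
n = 1000
x = np.sign(np.random.randn(n)) * np.arange(1, n + 1) ** (-1.5)

x_S = best_s_term(x, S=50)
# For an orthonormal Psi, ||f - f_S||_2 equals this coefficient-domain error.
print(np.linalg.norm(x - x_S) / np.linalg.norm(x))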


Sensing Matrices

In this survey, we only consider the standard finite-dimensional CS model. Specifically, given a signal $f \in \mathbb{R}^n$, we consider measurement systems that acquire $m$ linear measurements. This process is mathematically represented as
\begin{equation}
y_k = \langle f, \phi_k \rangle
\label{eq:measure}
\end{equation}
where $k = 1,...,m$. Let $A$ denote the $m\times n$ sensing matrix with the vectors $\phi_1^*,...,\phi_m^*$ as rows (where $^*$ denotes the conjugate transpose); we can then rewrite the equation as
\begin{equation}
y = Af
\end{equation}
The matrix $A$ represents a dimensionality reduction, i.e., it maps $\mathbb{R}^n$ into $\mathbb{R}^m$ where $n\gg m$. We assume that $f$ is a finite-length vector with a discrete-valued index (such as time or space).
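A minimal sketch of this measurement model, assuming a Gaussian random sensing matrix purely for illustration:

import numpy as np

n, m = 1024, 128                       # ambient dimension and number of measurements
f = np.random.randn(n)                 # the signal to be measured

# Each row of A plays the role of a measurement vector phi_k.
A = np.random.randn(m, n) / np.sqrt(m)
y = A @ f                              # y_k = <f, phi_k>, k = 1..m
print(y.shape)                         # (128,)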


Incoherent Sampling

Given a pair $(\Phi, \Psi)$ of orthobases of $\mathbb{R}^n$, $\Phi$ is the sensing basis used to measure $f$ as in equation~\ref{eq:measure}, and $\Psi$ is the basis used to represent $f$ as in equation~\ref{eq:basis}.
Definition 1. The coherence between the sensing basis $\Phi$ and the representation basis $\Psi$ is
\begin{equation}
\mu(\Phi, \Psi) = \sqrt{n} \cdot \max_{1\leq k,j\leq n} |\langle \phi_k, \psi_j \rangle|.
\end{equation}
The coherence measures the largest correlation between any two elements of $\Phi$ and $\Psi$ \cite{uncertainty}. If $\Phi$ and $\Psi$ contain correlated elements, the coherence is large. Compressive sampling is mainly concerned with low-coherence pairs.

Random matrices are largely incoherent with any fixed basis $\Psi$. If we select an orthobasis $\Phi$ uniformly at random, then with high probability the coherence between $\Phi$ and $\Psi$ is about $\sqrt{2\log n}$.
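Definition 1 translates directly into code. The sketch below compares the coherence of the (spike, DCT) pair with that of a random orthobasis paired with the DCT; both choices are just examples.

import numpy as np
from scipy.fftpack import idct

n = 256
Psi = idct(np.eye(n), axis=0, norm='ortho')        # representation basis (DCT)

def coherence(Phi, Psi):
    # mu(Phi, Psi) = sqrt(n) * max_{k,j} |<phi_k, psi_j>|
    return np.sqrt(Phi.shape[0]) * np.max(np.abs(Phi.T @ Psi))

Phi_spike = np.eye(n)                              # spike (identity) basis
Phi_rand, _ = np.linalg.qr(np.random.randn(n, n))  # a random orthobasis

print(coherence(Phi_spike, Psi))   # low (about sqrt(2)): spikes vs. sinusoids
print(coherence(Phi_rand, Psi))    # close to sqrt(2 log n) with high probability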


Signal Recovery via $\ell_1$ Minimization




To summarize the sections above: let $f\in\mathbb{R}^n$ be the target signal we would like to reconstruct. $f$ can be decomposed under a certain basis $\Psi$, i.e. $f = \Psi x$, where $x$ is the coefficient vector. $\Phi$ is a linear encoder that projects $n$-dimensional data into an $m$-dimensional subspace ($m < n$). CS allows us to reconstruct a sparse $f$ from its linear measurements
\begin{equation}
y = \Phi f = \Phi \Psi x
\label{eq:recover}
\end{equation}
In the equation above, $m$ can be as small as $O(S\log\frac{n}{S})$ \cite{compressive}. The sensing matrix $A = \Phi\Psi$ has to satisfy the Restricted Isometry Property (RIP) \cite{csstable}.

Finally, we can perform the $\ell_1$-norm minimization
\begin{equation}
(\mathrm{P}_1) \text{arg min}_{\hat{x} \in \mathbb{R}^n} \|\hat{x} \|_{\ell_1}, \text{ subject to } \Phi\Psi\hat{x} = y = Ax
\label{eq:l1min}
\end{equation}
where $\|x\|_{\ell_1} := \sum_i |x_i|$.

This $\ell_1$-norm minimization replaces the NP-hard $\ell_0$-norm minimization that directly searches for the sparsest $\hat{x}$. $\ell_1$-minimization can be solved in polynomial time by linear programming; in addition, various greedy algorithms are practical alternatives.
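As an illustrative sketch (not the formulation of any particular paper), problem $(\mathrm{P}_1)$ can be cast as a linear program by introducing auxiliary variables $u_i \geq |\hat{x}_i|$ and handed to an off-the-shelf LP solver:

import numpy as np
from scipy.optimize import linprog

def basis_pursuit(A, y):
    """Solve min ||x||_1 s.t. A x = y via an equivalent LP in variables [x, u]."""
    m, n = A.shape
    c = np.concatenate([np.zeros(n), np.ones(n)])      # minimize sum(u)
    # Constraints x <= u and -x <= u, i.e. |x_i| <= u_i.
    A_ub = np.block([[np.eye(n), -np.eye(n)],
                     [-np.eye(n), -np.eye(n)]])
    b_ub = np.zeros(2 * n)
    A_eq = np.hstack([A, np.zeros((m, n))])             # A x = y
    bounds = [(None, None)] * n + [(0, None)] * n
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=y,
                  bounds=bounds, method='highs')
    return res.x[:n]

# Toy example: recover a 5-sparse x from m = 60 random measurements.
n, m, S = 200, 60, 5
x_true = np.zeros(n)
x_true[np.random.choice(n, S, replace=False)] = np.random.randn(S)
A = np.random.randn(m, n) / np.sqrt(m)
x_hat = basis_pursuit(A, A @ x_true)
print(np.linalg.norm(x_hat - x_true))   # should be close to zero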



Extensions

Robust Compressive Sampling

In the previous section we showed how to recover sparse signals from a few measurements. However, to be useful in practice, CS needs to be able to deal with nearly sparse signals and with noise. In real applications, measured data will inevitably be corrupted by some noise, since the sensing device does not have infinite precision. Here, our goal is that small perturbations in the data should cause small perturbations in the reconstruction.
Consider recovering a vector $x\in\mathbb{R}^n$ from data
$$ y = Ax + e$$
where $A$ is the same $m\times n$ sensing matrix, and $e$ is a stochastic or deterministic unknown error term with $\|e\|_{\ell_2} \leq \epsilon$. We can modify the $\ell_1$ minimization with a relaxed constraint for reconstruction:
\begin{equation}
(\mathrm{P}_2) \text{arg min}_{\hat{x} \in \mathbb{R}^n} \|\hat{x} \|_{\ell_1}, \text{ subject to } \|\Phi\Psi\hat{x} - y \|_{\ell_2} \leq \epsilon
\label{eq:l1min2}
\end{equation}


Hybrid CS

In some problems, the compressive sensing model can be deployed in a hybrid way, where CS does not carry all of the load. In such a deployment, we apply CS to measure only the fine-scale properties of the signal, while ordinary linear measurement and reconstruction is used to obtain the coarse-scale properties of the signal.
For example, we can expand the object $x_0$ in the wavelet basis
$$x_0 = \sum_k \beta_{j_0,k}\phi_{j_0,k} + \sum_{j=j_0}^{j_1}\sum_k \alpha_{j,k}\psi_{j,k}$$
where $j_0$ is some specified coarse scale, $j_1$ is the finest scale, $\phi_{j_0,k}$ are the male (coarse-scale) wavelets and $\psi_{j,k}$ are the female (fine-scale) wavelets. Let $\alpha$ denote the grouping together of all fine-scale wavelet coefficients, and $\beta$ the male coefficients. For the male coarse-scale coefficients we take direct measurements, whereas for the female fine-scale coefficients we apply the CS scheme. Let
$$ y = \Phi\Psi^T x_0 $$
To reconstruct from these observations, consider the basis-pursuit optimization problem
\begin{equation}
(\mathrm{BP}) \min_{\hat{x}} \|\hat{x}\|_{\ell_1} \text{ subject to } y = \Phi\Psi^T\Psi \hat{x}
\end{equation}
Results show that Hybrid CS reconstructs signals with even fewer samples while achieving comparable accuracy.


Multiscale CS

Inspired by the success of Hybrid CS, we may consider a fully multiscale deployment of CS. We expand the object $x_0$ in the wavelet basis in the same way, then partition the coefficient vector as $[\beta_{j_0,\cdot}, \alpha_{j_0,\cdot},...,\alpha_{j_1-1,\cdot}]$. We then apply ordinary linear sampling to measure the coefficients $\beta_{j_0,\cdot}$ directly, and separately apply compressed sensing scale by scale, sampling data $y_j$ about the coefficients $\alpha_{j,\cdot}$ at level $j$ using an $n_j \times 2^j$ CS matrix $\Phi_j$. To obtain a reconstruction, we solve the sequence of problems
\begin{equation}
(\mathrm{BP_j})  \min_{\hat{x}} \|\hat{x}\|_{\ell_1} \text{ subject to } y_j = \Phi_j \hat{x}
\end{equation}


Cost Aware CS

In applications like mobile crowd-sensing or wireless sensor networks, the resource burden of collecting samples is usually a major concern. Ordinary CS assumes that every sample has the same cost and simply reduces the number of samples, whereas Cost-Aware CS (CACS) takes the cost of each sample into consideration. The objective becomes not only maximizing recovery accuracy but also minimizing the sampling cost. CACS \cite{cacs} uses the Regularized Column Sum (RCS), which is equivalent to a statistical RIP (Restricted Isometry Property) condition, to predict recovery accuracy, and formulates the RCS-constrained Optimization (RO) problem to find an optimal randomized sampling strategy $\pi$. In order to make RO solvable, the paper uses two relaxations:
\begin{align}
\sigma(F_{\Omega}) \leq \alpha \quad & \to \quad \mathbb{E}[\sigma(F_{\Omega})] \leq \alpha \\
\pi \in \{0,1\}^n \quad & \to \quad \pi \in [0,1]^n
\end{align}

Then we can use the following convex optimization problem to find an optimal randomized sampling strategy $\pi$ which satisfies the given RCS constraint with the lowest cost.
\begin{align*}
(\mathrm{P}) \min_{\pi} \quad & c^T \pi \\
\text{subject to} \quad & 1^T \pi = m \\
\quad & (Re(F_j)^T \pi)^2 + (Im(F_j)^T \pi)^2 \leq \alpha^2 \\
\quad & 0 \leq \pi_i \leq 1, i = 1,...,n.
\end{align*}
where $c$ is the cost map, $\pi$ is the sampling strategy, $m$ is the expected sample size, and $Re(F_j)$ and $Im(F_j)$ denote the real and imaginary components of the $j$th column of $F$. The above problem can be solved in polynomial time via standard interior point methods.
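For concreteness, here is a sketch of problem (P) written with CVXPY; the cost map, measurement ensemble, and parameter values are made up for illustration and are not those used in the paper.

import numpy as np
import cvxpy as cp

n, m, alpha = 100, 30, 4.0
c = np.random.rand(n)                       # hypothetical cost map
F = np.fft.fft(np.eye(n)) / np.sqrt(n)      # an example Fourier measurement ensemble
ReF, ImF = F.real, F.imag

pi = cp.Variable(n)
constraints = [cp.sum(pi) == m, pi >= 0, pi <= 1]
# RCS constraint per column j (skipping the DC column, as an example choice):
for j in range(1, n):
    constraints.append(cp.square(ReF[:, j] @ pi) + cp.square(ImF[:, j] @ pi)
                       <= alpha ** 2)

prob = cp.Problem(cp.Minimize(c @ pi), constraints)
prob.solve()
print(prob.status, prob.value)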

However, RO is a centralized algorithm: it requires global information about the cost map, which makes it impractical in some settings. The paper further provides two decentralized algorithms that use only partial information.


Distributed CS

Distributed Compressive Sensing (DCS) \cite{dcs} improves the signal recovery performance of multi-signal ensembles by exploiting both intra- and inter-signal correlation and sparsity structure. DCS is particularly useful in scenarios where multiple sensors carry out compression in a distributed way, without cooperating with each other, and transmit the compressed signals to a sink node. At the sink node, the received signals from all the sensors are recovered jointly. The key idea of DCS is joint sparsity, defined as the sparsity of the entire signal ensemble. DCS considers three types of joint sparsity models:

  1. Each signal is individually sparse, and there are also common components shared by every signal, namely the common information. This allows joint recovery with fewer measurements.
  2. All signals share the same support, i.e., the locations of the nonzero coefficients.
  3. No signal is sparse by itself; nevertheless, the signals share a large amount of common information.

One example application of DCS is \cite{multimodal}: when combining multiple images of the same scene into a single image, the authors treat the constant background as common information and the variable foreground as innovation information to make the process efficient.


Generalized Distributed CS

Generalized Distributed Compressive Sensing (GDCS) \cite{gdcs} can improve sparse signal detection performance given arbitrary types of common information, covering not just the full common information of DCS but also a variety of partial common information.

Sunday, February 21, 2016

Paper Review: Durability with BookKeeper

Logs, sometimes called write-ahead logs, commit logs, or transaction logs, are at the heart of many distributed data systems. A log is the simplest abstraction: an append-only, totally-ordered sequence of records ordered by time. Conversely, time is abstracted by the log’s sequence number (entry id). As a result, logs solve many problems in distributed systems.

The log solves the consistency issue in the state machine replication approach, since it is guaranteed that if two identical, deterministic processes begin in the same state and receive the same inputs in the same order (from a shared log), they will produce the same output and end in the same state. Not only do distributed data systems use the log for consistent replication, but common consensus systems like Raft and ZooKeeper also maintain a shared log at their heart.

The log abstraction can bring even more benefits if wisely used:
1. It decouples the event (log entry) producers and consumers, which avoids the slow-consumer problem and simplifies the communication model.
2. It unifies the messaging format.
3. It provides durability, which makes consumer recovery easy.
4. It provides the ability to stream data flows.

There are two ways of using a log for replication, primary-backup replication and active-active replication, as shown in the following figure:



The drawback of the shared log approach is its cost. Existing distributed log services such as BookKeeper and Kafka implement different design details to optimize performance under their different usage assumptions, in the following ways:

1. Cost of disk space
Problem: In order to provide durability, every append to the log must be synchronously written to disk, which consumes a lot of space.
Solution: Traditional magnetic disks are cheap and are a good fit for logs.

2. Cost of read/write speed
Problem: Writing to disk is slow in general, but the append-only property of logs makes writes fast on magnetic disks because the head moves much less across the disk. Sequential reading is relatively fast, but some entries are redundant because they may be overwritten by later records.
Solution: BookKeeper assumes a write-intensive workload that requires high throughput and low latency for writes. BookKeeper separates the read and write devices to minimize the interference of read operations with writes. This design uses different I/O components to handle each of the core log storage workloads, resulting in optimized write performance and acceptable reads. There is a trade-off between fast writes of interleaved entries from different ledgers and random-read seek time; BookKeeper chooses the interleaved log and then optimizes random reads using a cache and an in-memory index. Periodic log compaction and up-to-date reads from the memory table also improve performance for tailing reads and catch-up reads.


3. SPOF
Problem: The log becomes a single point of failure if the log storage server can fail; no read/write operations on the log are available during faults.
Solution: Replicate at the log entry level (BookKeeper) or at the log partition level (Kafka).
BookKeeper assumes there is a single writer for each ledger and lets the client choose the quorum of bookies that replicate each log entry. The problem with this approach is that partial writes can easily result in inconsistent reads. BK uses ZooKeeper to guarantee agreement on the content of a ledger by having the BK client write ledger metadata into ZooKeeper.
In early versions of Kafka, there was no replication of the log. Later, Kafka used ZooKeeper to elect a leader among a set of replicas for each partition. The leader then serves all write and read requests from clients for that partition, and any synchronous replication is handled by the partition leader.


4. Scalability
Problem: A single log maintainer does not scale with the number of log producers and online/offline consumers. 
Solution: A BookKeeper producer writes to ledgers (segments of the log); each entry in a ledger is distributed and striped across an ensemble of bookies (storage servers). Therefore, the write throughput is roughly the bookie rate R multiplied by the ensemble size E and divided by the quorum size Q (a quick numeric example follows below). BookKeeper consumers read ledger entries in a similar way.
Kafka uses a large number of partitions for each topic; usually the number of partitions |P| is greater than the number of consumers in a consumer group |C|. A Kafka producer can randomly select which partition the next message should be sent to, thereby balancing the workload.
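As a back-of-the-envelope example (the numbers here are made up, not measurements), the write-throughput relationship above works out like this:

# Hypothetical numbers: per-bookie write rate R, ensemble size E, write quorum Q.
R = 50.0   # MB/s a single bookie can sustain
E = 5      # entries are striped across 5 bookies
Q = 3      # each entry is written to 3 of them

aggregate_write_throughput = R * E / Q   # = 83.3 MB/s
print(aggregate_write_throughput)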






Kafka is unlike traditional messaging services; it lies somewhere between a replicated log service and a messaging service. The reason is the stateless approach of Kafka’s brokers (storage servers): Kafka brokers do not maintain consumer subscription information, and instead let consumers manage that information themselves. This has two side effects: (1) it allows a consumer to deliberately pull the same message more than once, and (2) messages cannot be deleted any time soon.

Hedwig is a messaging service built on top of BookKeeper, and it can also be compared to Kafka. Hedwig’s hubs (brokers) keep track of subscription and consumption information, so it is guaranteed that a message will be received by a consumer exactly once. Since Hedwig depends on BookKeeper to store its internal data, it focuses on high durability.

In general, Hedwig works better with a large number of topics and few consumers, because more topics open more concurrent ledgers. Kafka works better with a large number of consumers, given its stateless broker design.

Paper Review: Cost-Aware Compressive Sensing for Networked Sensing Systems

The major issue addressed by this paper is the design of algorithms for compressed sensing systems where each sample has a different cost; the algorithmic goal is to minimize the total cost subject to a recovery accuracy constraint. Introducing the concept of per-sample cost is novel and fills the gap between the theory of compressive sensing and practical crowd sensing systems, especially systems with limited resources.

The paper creatively uses the Regularized Column Sum (RCS), which is equivalent to a statistical RIP (Restricted Isometry Property) condition, to predict recovery accuracy, and formulates the RCS-constrained Optimization (RO) problem to find an optimal randomized sampling strategy $\pi$. In order to make RO solvable, the paper uses two relaxations:
$$ \sigma (F_\Omega) \leq \alpha \to \mathbb{E}[\sigma (F_\Omega)] \leq \alpha $$
$$ \pi \in \{0,1\}^n \to \pi \in [0,1]^n $$

However, RO is a centralized algorithm since it requires global information about the cost map, which makes it impractical. The paper further provides two decentralized algorithms that use only partial information.

The paper is very well structured. It first gives an intuitive example of how a greedy algorithm can fail in real-world applications with a spatially correlated cost map, then clearly states the challenges of solving such a problem, including predicting accuracy, computational complexity, etc. The paper provides solutions with proofs as the foundation of the proposed algorithms. Finally, the evaluation is comprehensive, covering two types of applications and costs.

Critiques

(1) Both proposed algorithms assume a fixed sample size m and then calculate the sampling strategy $\pi$, but the paper does not discuss how a crowd sensing application would estimate m in the first place.

(2) The cost map $c = \{c_1, c_2, \ldots, c_n\}$ in CACS is represented by a vector of real numbers with $0 \leq c_i \leq 1$. However, as the paper states, there exist more complicated scenarios where the cost can take many forms. The paper only considers a linear combination of spatial cost maps and a battery penalty as the overall sensing cost.

(3) In Figure 7 of the evaluation section, for most cost map choices the proposed distributed algorithms exhibit a cost ratio similar to the baseline greedy algorithm, even though the RO algorithm performs much better in all situations. I think this is because the partial-information limitation of PW and DWS leads to a poor estimate of the better strategies that RO can find. It also makes one wonder whether it is worth using PW/DWS instead of the simple greedy strategy. The distributed algorithms of CACS definitely need improvement.