Paxos Consensus Algorithm

Paxos consensus algorithm is used to achieve consensus across a set of unreliable processes which are running independently.  Consensus is required to agree on a data value proposed by one of the processes. Reaching consensus becomes difficult when processes fail or messages can be lost or delayed. An example of this kind of environment is asynchronous distributed system.

Paxos consensus algorithm is used in some of the popular highly scalable systems such as Google Spanner and Amazon Dynamo DB. Most of the modern distributed systems use commodity hardware and they fall in asynchronous distributed system category.  They use multiple machines and in some cases hosted in multiple data centers to provide high availability and scalability. Machine outages, disk failures and network partitions are common in these kind of environments. Consensus is required for the common functions such as designating one of the process as a leader to perform some computation, ordering input events so that same output is calculated by multiple processes, acquiring a lock/lease on a shared resource, replicating some state across multiples machines or to provide distributed transactions.

An example use case of Paxos algorithm is to replicate the state of a key-value store to multiple instances to make it fault tolerant and to improve read latency. In a key-value store, the data is managed as a sorted map of key-value records. Each record is associated with a unique key and it is used to access its associated value. A reliable technique to replicate data is using state machine replication approach.

In state machine replication, a set of server processes called replicas, each execute a finite state machine. Each replica starts in the same initial state. The input to the state machine is client requests. The state machine processes the input request and transitions to an output state and produces output which is sent back to client. State machine maps (input, state) pair to (output, state) pair.

Client can send requests to any replica. Each replica based on its current state executes client requests and produces results.  If each replica started with same initial state and receives inputs in same order, all replicas will transition to the same state and the data is consistent.

In our example input is key-value store commands such as put or delete. Commands are first appended to a replication log and data snapshots are created from it. The replication log at each replica needs to contain the input commands in the same order. In other words all replicas need to agree on the order of inputs in replication log. This can be accomplished using Paxos consensus algorithm.

A variant of Paxos algorithm called Multi-Paxos is used to sequence the commands in the replication log which uses instance numbers.One of the replica is disignated as a leader which proposes values. Client commands are forwarded to leader replica which assigns them sequence numbers in the replication log file.

Replication log contains a sequence of log records. Each log record is uniquely identified by a sequence number and contains an input command to data store. For example a log record of insert operation contains key and value being inserted. Replication log file is stored on disk. Only read and append operations supported on the log file. Database snapshots are created by replaying the records in replication log.


No comments:

Post a Comment