Cassandra Tombstones

In an eventual consistent system like Cassandra, information about deleted keys should be stored to avoid reading the deleted data. When a row or column is deleted, this information is stored as tombstone.  Tombstones are stored until gc grace period associated with column family not reached. Only major compaction removes tombstones that are older than gc grace period. Tombstones are spread to all replicas when repair is performed. It is important to run repair regularly to eliminate resurrecting deleted data.

A Row tombstone is created when a row is deleted, a Column tombstone is created when a specified column is deleted from a row and a Range tombstone is created when user deletes a super column. Tombstone includes application's deletion timestamp (MarkedForDeleteAt) and local deletion timestamp (LocalDeletionTime). Range tombstone also includes start and end column names.

MarkedForDeleteAt

It is a timestamp in milliseconds specified by Application in delete request. For a live column or row it is stored as max long value 0x8000000000000000.

LocalDeletionTime

It is a timestamp in seconds set by Cassandra server when it receives a request to delete a column or super column. The value is set as current time in seconds. For a live column or row it is stored as max integer value 0x7FFFFFFF..

Tombstone thresholds

tombstone_warn_threshold (Default value 1000) is used to log a warning message when  scanned tombstone count is greater than this limit but the query execution will continue and tombstone_failure_threshold (Default value 100000) is used to abort slice query if scanned tombstone count exceeds this limit. An error message (TombstoneOverwhelmingException) is also logged. These settings can be changed in cassandra.yaml

Tombstones can impact performance of slice queries especially for wider rows. When Cassandra executes a column slice query it needs to read columns from all the SSTables that include the given row and filter out tombstones. And all these tombstones need to be kept in memory till the row fragments from all SSTables are merged which increase heap space usage.

 Estimated droppable tombstones

sstablemetadata tool can be used to estimate the amount of tombsones in a given SSTable. The value is the ratio of tombsones to estimated column count in a SSTable. Tombsones are cleared during compaction based on the gc_grace period. The tool by default uses the current time as the gc time to decide whether a tombstone can be dropped or not. To find out the tombstones are created before a given time, use  command line argument --gc_grace_seconds to pass GC time in seconds

Usage: sstablemetadata [--gc_grace_seconds n] <sstable filenames>

$ ./tools/bin/sstablemetadata data/data/ycsb/usertable-8dd787404e2c11e8b1a22f4e9082bb4e/mc-1-big-Data.db | grep "Estimated droppable tombstones"
Estimated droppable tombstones: 0.0

Paxos Consensus Algorithm

Paxos consensus algorithm is used to achieve consensus across a set of unreliable processes which are running independently.  Consensus is required to agree on a data value proposed by one of the processes. Reaching consensus becomes difficult when processes fail or messages can be lost or delayed. An example of this kind of environment is asynchronous distributed system.

Paxos consensus algorithm is used in some of the popular highly scalable systems such as Google Spanner and Amazon Dynamo DB. Most of the modern distributed systems use commodity hardware and they fall in asynchronous distributed system category.  They use multiple machines and in some cases hosted in multiple data centers to provide high availability and scalability. Machine outages, disk failures and network partitions are common in these kind of environments. Consensus is required for the common functions such as designating one of the process as a leader to perform some computation, ordering input events so that same output is calculated by multiple processes, acquiring a lock/lease on a shared resource, replicating some state across multiples machines or to provide distributed transactions.

An example use case of Paxos algorithm is to replicate the state of a key-value store to multiple instances to make it fault tolerant and to improve read latency. In a key-value store, the data is managed as a sorted map of key-value records. Each record is associated with a unique key and it is used to access its associated value. A reliable technique to replicate data is using state machine replication approach.

In state machine replication, a set of server processes called replicas, each execute a finite state machine. Each replica starts in the same initial state. The input to the state machine is client requests. The state machine processes the input request and transitions to an output state and produces output which is sent back to client. State machine maps (input, state) pair to (output, state) pair.

Client can send requests to any replica. Each replica based on its current state executes client requests and produces results.  If each replica started with same initial state and receives inputs in same order, all replicas will transition to the same state and the data is consistent.

In our example input is key-value store commands such as put or delete. Commands are first appended to a replication log and data snapshots are created from it. The replication log at each replica needs to contain the input commands in the same order. In other words all replicas need to agree on the order of inputs in replication log. This can be accomplished using Paxos consensus algorithm.

A variant of Paxos algorithm called Multi-Paxos is used to sequence the commands in the replication log which uses instance numbers.One of the replica is disignated as a leader which proposes values. Client commands are forwarded to leader replica which assigns them sequence numbers in the replication log file.

Replication log contains a sequence of log records. Each log record is uniquely identified by a sequence number and contains an input command to data store. For example a log record of insert operation contains key and value being inserted. Replication log file is stored on disk. Only read and append operations supported on the log file. Database snapshots are created by replaying the records in replication log.