Cassandra nodetool repair

Cassandra nodetool provides a command to run repair on a specified endpoint. The repair command uses the Anti-Entropy service, which detects inconsistencies and repairs data across all replicas of the specified endpoint. Cassandra provides high availability and durability by storing multiple copies (replicas) of data, and inconsistencies between replicas can occur due to server failures, disk failures, or network partitions.

nodetool repair is executed against a specified endpoint in the cluster and provides several options to control the amount of data to be repaired. Data is organized into keyspaces and column families and distributed to the nodes using a data partitioning scheme. The user can repair a specified column family, all column families in a specified keyspace, or all column families in all keyspaces.

At the storage layer, each column family's data is stored separately in SSTables. A column family contains a set of rows. Each row is mapped to a token using its key and is assigned to a node by the data partitioning scheme. Rows in a column family are sorted in token order in SSTables. Rows are replicated to other nodes using the replication strategy, and the number of replicas is determined by the replication factor. Each keyspace can have a different replication strategy, so Cassandra repairs each keyspace's data separately.

In order to perform repair, the replicas of a row must be identified. Each row can have a different set of replicas based on its token, but all rows in a token range have the same set of replicas. So instead of repairing each row separately, Cassandra repairs the rows of a token range together.

Cassandra uses token ranges to split a node's repair activity into small repair jobs. By default, all token ranges assigned to a given node are repaired. These token ranges include the node's primary partition ranges as well as its replication ranges. The nodetool repair -pr command can be used to repair only the primary partition ranges of a node.
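For reference, these scoping options correspond to invocations like the ones below (the keyspace and column family names are placeholders):

    nodetool repair                          # repair all column families in all keyspaces
    nodetool repair demo_keyspace            # repair all column families in one keyspace
    nodetool repair demo_keyspace users      # repair a single column family
    nodetool repair -pr demo_keyspace        # repair only the node's primary partition ranges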


The target node which receives the repair request from the nodetool repair command is also called the repair coordinator node. The repair coordinator node creates a repair session for each token range being repaired, and each repair session is assigned a unique UUID.
 
A repair session (RepairSession) is responsible for repairing all the column families in a specified keyspace. It creates a repair job (RepairJob) for each column family to be repaired and adds the jobs to a queue. A repair job repairs a single token range across all replicas associated with that token range. A repair job consists of two phases. The first is the validation phase, in which Merkle trees are built at each replica. The second is the sync phase, in which each replica's Merkle tree is compared with every other replica's Merkle tree and the differing rows are synced. Repair jobs in a repair session can be executed either sequentially or in parallel.
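As a rough illustration of this flow, the sketch below models a repair session that queues one repair job per column family and runs each job's two phases; it is a simplified outline, not Cassandra's actual RepairSession/RepairJob implementation, and the class and method names are assumed for illustration.

    import java.util.ArrayDeque;
    import java.util.List;
    import java.util.Queue;
    import java.util.UUID;

    // Simplified model of a repair session driving per-column-family repair jobs.
    class RepairSessionSketch {
        record TokenRange(long left, long right) {}

        static class RepairJobSketch {
            final String keyspace;
            final String columnFamily;
            final TokenRange range;

            RepairJobSketch(String keyspace, String columnFamily, TokenRange range) {
                this.keyspace = keyspace;
                this.columnFamily = columnFamily;
                this.range = range;
            }

            void validationPhase() { /* request a Merkle tree from every replica of the range */ }
            void syncPhase()       { /* compare the Merkle trees and stream the differing rows */ }
        }

        final UUID sessionId = UUID.randomUUID();        // each repair session gets a unique UUID
        final Queue<RepairJobSketch> jobs = new ArrayDeque<>();

        RepairSessionSketch(String keyspace, TokenRange range, List<String> columnFamilies) {
            // one repair job per column family in the keyspace
            for (String cf : columnFamilies) {
                jobs.add(new RepairJobSketch(keyspace, cf, range));
            }
        }

        // Sequential execution: run one job at a time. A parallel variant could
        // submit all jobs to an executor instead.
        void runSequentially() {
            while (!jobs.isEmpty()) {
                RepairJobSketch job = jobs.poll();
                job.validationPhase();
                job.syncPhase();
            }
        }

        public static void main(String[] args) {
            RepairSessionSketch session = new RepairSessionSketch(
                    "demo_keyspace", new TokenRange(0, 1023), List.of("users", "events"));
            session.runSequentially();
            System.out.println("repair session " + session.sessionId + " finished");
        }
    }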

Validation Phase

In the validation phase, the repair coordinator node requests a Merkle tree from each replica for the token range being repaired. Each replica builds a Merkle tree containing hash values of the rows from the requested column family and token range. A validation compaction is triggered for the requested column family. For each row, an MD5 hash of its contents is computed and added to the Merkle tree. It is not feasible to construct a Merkle tree to its maximum hash depth so that each row's hash value gets its own leaf node. Instead, Cassandra builds a Merkle tree of variable depth based on the estimated number of keys in the requested token range and limits the maximum depth of the tree to 20. Each leaf node's hash value is computed by mixing the hash values of the rows that belong to its sub token range. If a leaf node's sub token range does not contain any rows, an empty hash value is used as its hash value.
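The sketch below illustrates the leaf-hashing idea for a fixed depth, assuming the row hashes are mixed into a leaf by XOR; it is a simplification and not Cassandra's MerkleTree implementation, which sizes the tree from the estimated key count and caps its depth at 20.

    import java.math.BigInteger;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.SortedMap;
    import java.util.TreeMap;

    // Sketch: split a token range into 2^depth equal sub-ranges (the leaves) and mix
    // the MD5 hash of every row into the leaf that covers its token. Leaves whose
    // sub-ranges contain no rows keep an all-zero "empty" hash.
    class MerkleLeafSketch {
        static byte[][] buildLeaves(SortedMap<BigInteger, byte[]> rowsByToken,
                                    BigInteger rangeStart, BigInteger rangeEnd, int depth)
                throws NoSuchAlgorithmException {
            int leafCount = 1 << depth;                    // 2^depth leaves
            byte[][] leaves = new byte[leafCount][16];     // MD5 digests are 16 bytes, zero-initialized
            BigInteger width = rangeEnd.subtract(rangeStart).divide(BigInteger.valueOf(leafCount));
            MessageDigest md5 = MessageDigest.getInstance("MD5");

            for (var entry : rowsByToken.subMap(rangeStart, rangeEnd).entrySet()) {
                byte[] rowHash = md5.digest(entry.getValue());          // MD5 of the row contents
                int leaf = entry.getKey().subtract(rangeStart).divide(width)
                                .min(BigInteger.valueOf(leafCount - 1)).intValue();
                for (int i = 0; i < 16; i++) {
                    leaves[leaf][i] ^= rowHash[i];                      // mix the row hash into its leaf
                }
            }
            return leaves;
        }

        public static void main(String[] args) throws Exception {
            SortedMap<BigInteger, byte[]> rows = new TreeMap<>();
            rows.put(BigInteger.valueOf(100), "row-a".getBytes());
            rows.put(BigInteger.valueOf(700), "row-b".getBytes());
            byte[][] leaves = buildLeaves(rows, BigInteger.ZERO, BigInteger.valueOf(1024), 3);
            System.out.println("leaves built: " + leaves.length);       // 8 leaves for depth 3
        }
    }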

Sync Phase

In this phase, the received Merkle trees are compared. Each replica's Merkle tree is compared with that of every other replica, and the differing sub token ranges between replicas are identified. If the root hash values of two Merkle trees match, the data in the token range being compared is consistent. If the root hash values are not equal, the hash values of the left branches are compared recursively, followed by the hash values of the right branches, until all the nodes in the trees are traversed. For each differing sub token range, streaming repair is performed between the replicas to sync up the data.
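A minimal sketch of that recursive comparison is shown below; the node structure and method names are assumptions, and the output is the list of differing sub token ranges that would then be streamed between replicas.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    // Sketch: recursively compare two Merkle trees and collect the token ranges whose
    // hashes differ. Matching hashes mean the compared sub-range is consistent; the
    // differing ranges are the ones that need streaming repair.
    class MerkleDiffSketch {
        // A minimal tree node: an inner node with two children, or a leaf (no children).
        record Node(byte[] hash, Node left, Node right, long rangeStart, long rangeEnd) {
            boolean isLeaf() { return left == null && right == null; }
        }

        static List<long[]> difference(Node a, Node b) {
            List<long[]> differing = new ArrayList<>();
            diff(a, b, differing);
            return differing;
        }

        private static void diff(Node a, Node b, List<long[]> out) {
            if (Arrays.equals(a.hash(), b.hash())) {
                return;                                             // identical sub-range, nothing to repair
            }
            if (a.isLeaf() || b.isLeaf()) {
                out.add(new long[]{a.rangeStart(), a.rangeEnd()});  // smallest differing range found
                return;
            }
            diff(a.left(), b.left(), out);                          // descend into left branches first,
            diff(a.right(), b.right(), out);                        // then into right branches
        }

        public static void main(String[] args) {
            Node a = new Node(new byte[]{1}, null, null, 0, 512);
            Node b = new Node(new byte[]{2}, null, null, 0, 512);
            System.out.println("differing ranges: " + difference(a, b).size());   // prints 1
        }
    }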

Cassandra Data Replication

In a distributed system like Cassandra, data replication enables high availability and durability. Cassandra replicates the rows of a column family onto multiple endpoints based on the replication strategy associated with its keyspace. The endpoints which store a row are called replicas or natural endpoints for that row. The number of replicas and their locations are determined by the replication factor and the replication strategy: the replication strategy controls how the replicas are chosen, and the replication factor determines the number of replicas for a key. The replication strategy is defined when creating a keyspace, and the replication factor is configured differently depending on the chosen strategy.

Two kinds of replication strategies are available in Cassandra. The first one is SimpleStrategy, which is a rack-unaware and data-center-unaware policy. It is commonly used when the nodes are in a single data center. The second is NetworkTopologyStrategy, which is both rack aware and data center aware. Even if you have a single data center but nodes in different racks, it is better to use NetworkTopologyStrategy, which is rack aware and improves fault tolerance by choosing replicas from different racks. Also, if you plan to expand the cluster in the future to more than one data center, choosing NetworkTopologyStrategy from the beginning avoids data migration.

The data partitioner determines the coordinator node for each key. The coordinator node is the first replica for a key and is also called the primary replica. If the replication factor is N, a key's coordinator node replicates it to the other N-1 replicas.

In SimpleStrategy, the successor nodes, i.e. the nodes on the ring immediately following the coordinator node in clockwise direction, are selected as replicas.
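A minimal sketch of that selection rule, assuming a static mapping of tokens to nodes and ignoring racks and data centers, is shown below.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.SortedMap;
    import java.util.TreeMap;

    // Sketch of SimpleStrategy-style selection: start at the coordinator (the first
    // node at or after the key's token, wrapping around the ring) and take the next
    // replicationFactor nodes in clockwise order.
    class SimpleStrategySketch {
        static List<String> replicas(SortedMap<Long, String> ring, long keyToken, int replicationFactor) {
            List<String> clockwise = new ArrayList<>();
            clockwise.addAll(ring.tailMap(keyToken).values());   // nodes at or after the key's token
            clockwise.addAll(ring.headMap(keyToken).values());   // then wrap around the ring
            return clockwise.subList(0, Math.min(replicationFactor, clockwise.size()));
        }

        public static void main(String[] args) {
            SortedMap<Long, String> ring = new TreeMap<>();
            ring.put(10L, "A"); ring.put(40L, "B"); ring.put(70L, "C");
            // a key with token 55 lands on coordinator C, then wraps around to A for RF = 2
            System.out.println(replicas(ring, 55L, 2));          // prints [C, A]
        }
    }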



In NetworkTopologyStrategy, nodes from distinct available racks in each data center are chosen as replicas. In a multiple data center environment, the user needs to specify a per data center replication factor. In each data center, the successor nodes to the coordinator node that are on distinct racks are chosen.
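A rough sketch of the per-data-center selection is shown below; it assumes the data center's nodes are already listed in clockwise ring order starting at the coordinator, and the node names, rack labels and fallback behavior are illustrative assumptions.

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    // Sketch of per-data-center, rack-aware selection: walk the ring clockwise from
    // the coordinator and pick nodes whose racks have not been used yet, until the
    // per-DC replication factor is met; remaining slots fall back to skipped nodes.
    class RackAwareSketch {
        static List<String> replicasInDc(LinkedHashMap<String, String> clockwiseNodeToRack, int dcRf) {
            List<String> replicas = new ArrayList<>();
            Set<String> usedRacks = new HashSet<>();
            List<String> skipped = new ArrayList<>();

            for (Map.Entry<String, String> e : clockwiseNodeToRack.entrySet()) {
                if (replicas.size() == dcRf) break;
                if (usedRacks.add(e.getValue())) {
                    replicas.add(e.getKey());     // first node seen from a new rack
                } else {
                    skipped.add(e.getKey());      // rack already used; keep only as a fallback
                }
            }
            for (String node : skipped) {         // fewer racks than replicas: reuse racks
                if (replicas.size() == dcRf) break;
                replicas.add(node);
            }
            return replicas;
        }

        public static void main(String[] args) {
            // nodes of one data center in clockwise ring order starting at the coordinator
            LinkedHashMap<String, String> ring = new LinkedHashMap<>();
            ring.put("node1", "rack1"); ring.put("node2", "rack1");
            ring.put("node3", "rack2"); ring.put("node4", "rack3");
            System.out.println(replicasInDc(ring, 3));   // prints [node1, node3, node4]
        }
    }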



The Cassandra nodetool getendpoints command can be used to retrieve the replicas of a key.
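For example (the keyspace, column family and key below are placeholders):

    nodetool getendpoints demo_keyspace users user42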

Cassandra Data Partitioning with Random Partitioner

Cassandra is a distributed hash table (DHT) based storage system where data is dynamically partitioned across a set of storage nodes in a cluster using a data partitioning scheme. It provides two data partitioning schemes called RandomPartitioner and ByteOrderedPartitioner. This blog article provides an overview of RandomPartitioner, which is the commonly used partitioning scheme.

In the RandomPartitioner strategy, keys are uniformly distributed across the nodes using the MD5 hash algorithm. The MD5 hash provides a natural way of load balancing keys to nodes. Each data item is mapped to a token by calculating the MD5 hash of its key. A token is a 16-byte big integer represented using 2's complement notation, and the hash function's output range is from 0 to 2^127 - 1. RandomPartitioner uses consistent hashing to minimize data movement when nodes are added or removed. In consistent hashing, the token space is treated as a circular space (ring) where the largest hash value (token) is followed by the smallest.
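A minimal sketch of that key-to-token mapping is shown below; it follows the description above by taking the absolute value of the signed 128-bit MD5 digest, and is not guaranteed to match Cassandra's RandomPartitioner byte for byte.

    import java.math.BigInteger;
    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    // Sketch: map a row key to a token by hashing it with MD5 and interpreting the
    // 16-byte digest as a signed 2's complement BigInteger whose sign is dropped,
    // yielding a non-negative 128-bit token.
    class TokenSketch {
        static BigInteger token(String key) throws NoSuchAlgorithmException {
            byte[] digest = MessageDigest.getInstance("MD5")
                                         .digest(key.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(digest).abs();
        }

        public static void main(String[] args) throws Exception {
            System.out.println(token("user42"));   // prints the key's non-negative token
        }
    }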

Each storage node is randomly assigned a token which also represents its position on the ring. A data item is stored on the first node whose token follows the item's token in clockwise order. This node is also called the primary replica or coordinator node for that item.

Each node stores the keys whose tokens fall between its predecessor node's token (exclusive) and its own token (inclusive). This region of the ring is also called the node's token range. When a node leaves or joins, only the immediate neighbor nodes are affected.

In the example below, a hypothetical Cassandra cluster is shown with a token space from 0 to 1023. It consists of four nodes N1, N2, N3 and N4.

An example data item that is mapped to token 125 is stored on the primary node N2 and replicated to N3 and N4 using SimpleStrategy replication with replication factor (RF) 3.
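The placement in this example can be checked with the short sketch below; the node token assignments (N1 = 0, N2 = 256, N3 = 512, N4 = 768) are assumed for illustration and are consistent with the placement described above.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.TreeMap;

    // Worked check of the example: four nodes on a 0..1023 ring, SimpleStrategy, RF = 3.
    // The node token assignments below are assumed for illustration.
    class RingExample {
        public static void main(String[] args) {
            TreeMap<Long, String> ring = new TreeMap<>();
            ring.put(0L, "N1"); ring.put(256L, "N2"); ring.put(512L, "N3"); ring.put(768L, "N4");

            long itemToken = 125;            // the example data item's token
            int rf = 3;                      // replication factor

            // walk the ring clockwise, starting from the first node at or after the token
            List<String> replicas = new ArrayList<>();
            replicas.addAll(ring.tailMap(itemToken).values());
            replicas.addAll(ring.headMap(itemToken).values());
            System.out.println(replicas.subList(0, rf));   // prints [N2, N3, N4]
        }
    }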