Cassandra: how to repair a row when a regular repair fails

Recently we hit a strange issue with Cassandra repair. A table is replicated across three Datacenters, and some of the rows were missing on all the nodes in a newly added Datacenter. We run hundreds of production Cassandra clusters of various sizes, from a few nodes to tens of nodes per cluster. Most of the clusters have multiple Datacenters and replicate across them, and we had not seen this kind of issue before. This article describes the various ways we tried to fix the problem. These steps should be useful for recovering data in Cassandra 2.2.*.

The first thing we tried was listing the rows: we ran cqlsh on each node and executed a SELECT query against this particular table. Every node returned the rows except the nodes in the recently added DC, let's call it DC3, where all the rows were missing from the result. This turned out to be an obvious problem: the replication settings for the Keyspace where this table resides did not include DC3.

Keyspace: Test: Replication Strategy: org.apache.cassandra.locator.NetworkTopologyStrategy Durable Writes: true Options: [DC1:3, DC2:3]
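The replication settings can also be confirmed from cqlsh. A quick sketch, using the Keyspace name from this post (in Cassandra 2.2 the schema still lives in the system keyspace):

-- describe the keyspace and check the replication options
DESCRIBE KEYSPACE "Test";
-- or query the schema table directly
SELECT keyspace_name, strategy_class, strategy_options FROM system.schema_keyspaces WHERE keyspace_name = 'Test';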

We manually added DC3 to the Keyspace replication strategy and ran nodetool repair on the DC3 nodes for this Keyspace and Table. After the repair finished we verified the table rows on the DC3 nodes, but the rows were still missing.

Keyspace: Test: Replication Strategy: org.apache.cassandra.locator.NetworkTopologyStrategy Durable Writes: true Options: [DC1:3, DC2:3, DC3:3]
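For reference, this is roughly what that step looks like; a sketch assuming the Keyspace and Table names used later in this post (Test and SampleTable) and a replication factor of 3 in each Datacenter:

-- in cqlsh: add DC3 to the replication options
ALTER KEYSPACE "Test" WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 3, 'DC3': 3};

# then, on each DC3 node, repair just this Keyspace and Table
nodetool repair Test SampleTable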

The second thing we tried was running nodetool repair with different DC options and on nodes in different Datacenters, but the rows were still missing. We tried a full repair and a repair of the entire Keyspace, but none of it was able to repair the rows; the variants we mean are sketched below.
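A sketch of the kinds of invocations this covers, using Cassandra 2.2 nodetool options and the names from this post:

# full (non-incremental) repair of the table
nodetool repair -full Test SampleTable
# repair restricted to the DC3 nodes
nodetool repair -dc DC3 Test SampleTable
# repair only within the local Datacenter
nodetool repair -local Test
# repair only the node's primary ranges, for the whole Keyspace
nodetool repair -pr Test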

The third thing we tried was removing all the SSTables for this problematic Table and running repair again. The repair restored the SSTables, but the rows were still missing.
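A rough outline of that step; the data directory path and service command below are assumptions (they depend on data_file_directories in cassandra.yaml and on how Cassandra is installed), and the Table directory name carries a generated ID, hence the wildcard:

# stop the node before touching its SSTables (service name varies by install)
sudo service cassandra stop
# remove the SSTable files for Test.SampleTable (path is an assumption)
rm -f /var/lib/cassandra/data/Test/SampleTable-*/*
# start the node and repair the Table so the data is streamed back
sudo service cassandra start
nodetool repair Test SampleTable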

The fourth thing we tried was resetting the repairedAt status of all the SSTables to unrepaired with the sstablerepairedset tool and running the repair again. The rows were still missing.
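For reference, this is roughly how that metadata reset goes; the data path is again an assumption, and the node has to be stopped while sstablerepairedset rewrites the SSTable metadata:

# stop the node before modifying SSTable metadata
sudo service cassandra stop
# collect the Data.db files for this Keyspace and mark them all unrepaired
find /var/lib/cassandra/data/Test/ -name "*Data.db" > /tmp/sstables.txt
sstablerepairedset --really-set --is-unrepaired -f /tmp/sstables.txt
# start the node and run the repair again
sudo service cassandra start
nodetool repair Test SampleTable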

Finally, we used Cassandra read repair to fix the missing rows. We ran cqlsh on the DC3 nodes, set the consistency level to QUORUM, and read each missing row one by one with a SELECT. Because a QUORUM read forces the coordinator to compare replicas and write the newest version back to any that are out of date, this fixed the issue and repaired the data.

# cat commands
-- show the current consistency level
CONSISTENCY
-- switch to QUORUM so the read has to touch up-to-date replicas in the other Datacenters
CONSISTENCY QUORUM
-- confirm the new level
CONSISTENCY
-- read the missing row; the read repair writes it back to the stale DC3 replicas
select * from "Test"."SampleTable" WHERE key = 60b8b2e1-5866-4a4a-b6f8-3e521c44f43c;

Hope this helps if someone is stuck with a similar issue.

