In an eventually consistent system like Cassandra, information about deleted keys must be stored so that reads do not return deleted data. When a row or column is deleted, the deletion is recorded as a tombstone. Tombstones are kept until the gc grace period associated with the column family has elapsed. Compaction removes tombstones only once they are older than the gc grace period. Tombstones are propagated to all replicas when repair is performed, so it is important to run repair regularly to prevent deleted data from being resurrected.
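The resurrection scenario can be illustrated with a small simulation. This is a simplified sketch, not Cassandra's implementation: the Cell class, the reconcile, compact and read helpers, and the timestamps are all hypothetical, and the gc grace period is assumed to be 864000 seconds (10 days).

```python
from dataclasses import dataclass
from typing import Optional

GC_GRACE_SECONDS = 864000  # assumed gc grace period (10 days), for illustration


@dataclass(frozen=True)
class Cell:
    timestamp: int                             # write timestamp supplied by the client
    value: Optional[str] = None                # None means this cell is a tombstone
    local_deletion_time: Optional[int] = None  # seconds; set only on tombstones

    @property
    def is_tombstone(self) -> bool:
        return self.value is None


def reconcile(a: Optional[Cell], b: Optional[Cell]) -> Optional[Cell]:
    """Pick the winner between two replicas: the highest timestamp wins."""
    if a is None:
        return b
    if b is None:
        return a
    return a if a.timestamp >= b.timestamp else b


def compact(cell: Optional[Cell], now: int) -> Optional[Cell]:
    """Drop a tombstone once it is older than the gc grace period."""
    if (cell is not None and cell.is_tombstone
            and cell.local_deletion_time + GC_GRACE_SECONDS < now):
        return None
    return cell


def read(a: Optional[Cell], b: Optional[Cell]) -> Optional[str]:
    """Return the visible value, or None if the data is deleted or absent."""
    winner = reconcile(a, b)
    return None if winner is None or winner.is_tombstone else winner.value


# Replica A saw the delete; replica B missed it and still holds the old value.
replica_a = Cell(timestamp=200, local_deletion_time=1_000)
replica_b = Cell(timestamp=100, value="v1")

before = read(replica_a, replica_b)  # the tombstone wins, so the data stays deleted

# No repair ran, and compaction on A purges the tombstone after gc grace.
replica_a = compact(replica_a, now=1_000 + GC_GRACE_SECONDS + 1)
after = read(replica_a, replica_b)   # only B's stale value is left to read
```

Here before is None while after is "v1": once the tombstone is gone from replica A, nothing shadows the stale value on replica B. Had repair copied the tombstone to B within the gc grace period, both replicas would have dropped the data consistently.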
A Row tombstone is created when a row is deleted, a Column tombstone is created when a specified column is deleted from a row, and a Range tombstone is created when a user deletes a super column. A tombstone includes the application's deletion timestamp (MarkedForDeleteAt) and the local deletion timestamp (LocalDeletionTime). A Range tombstone also includes the start and end column names.
MarkedForDeleteAt
It is a timestamp (by convention in microseconds since the epoch) specified by the application in the delete request. For a live column or row it is stored as the minimum long value 0x8000000000000000 (Long.MIN_VALUE).
LocalDeletionTime
It is a timestamp in seconds set by the Cassandra server when it receives a request to delete a column or super column. The value is set to the current time in seconds. For a live column or row it is stored as the maximum integer value 0x7FFFFFFF.
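As a rough illustration of the two fields, the sketch below models them in Python. The DeletionInfo class and its method names are hypothetical and only mirror the description above. Note that 0x8000000000000000, read as a signed 64-bit integer, is the smallest long value. The purge rule (local deletion time earlier than a gc_before cutoff) is an assumption about how the gc grace check is applied.

```python
from dataclasses import dataclass

LONG_MIN = -(2 ** 63)   # 0x8000000000000000 read as a signed 64-bit integer
INT_MAX = 2 ** 31 - 1   # 0x7FFFFFFF


def to_signed64(raw: int) -> int:
    """Interpret an unsigned 64-bit bit pattern as a signed value."""
    return raw - (1 << 64) if raw >= (1 << 63) else raw


@dataclass(frozen=True)
class DeletionInfo:
    # Defaults are the sentinel values stored for a live column or row.
    marked_for_delete_at: int = LONG_MIN
    local_deletion_time: int = INT_MAX

    def is_live(self) -> bool:
        return (self.marked_for_delete_at == LONG_MIN
                and self.local_deletion_time == INT_MAX)

    def is_purgeable(self, gc_before: int) -> bool:
        """True if this is a tombstone whose local deletion time is before the cutoff."""
        return not self.is_live() and self.local_deletion_time < gc_before


live = DeletionInfo()
tombstone = DeletionInfo(marked_for_delete_at=1_500_000_000_000_000,
                         local_deletion_time=1_500_000_000)
```

With these definitions, live.is_live() is True, and tombstone.is_purgeable(gc_before) becomes True only once gc_before (for example, the current time minus the gc grace period) is later than 1_500_000_000.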
Tombstone thresholds
tombstone_warn_threshold (default value 1000) controls when a warning is logged: if the number of tombstones scanned by a query exceeds this limit, a warning message is logged but query execution continues. tombstone_failure_threshold (default value 100000) controls when a slice query is aborted: if the scanned tombstone count exceeds this limit, the query fails and an error message (TombstoneOverwhelmingException) is logged. Both settings can be changed in cassandra.yaml.
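The interaction of the two thresholds can be sketched as follows. This is an illustrative model rather than Cassandra's read path: the scan_slice function, the TombstoneOverwhelmingError class, and the representation of a tombstone as a cell whose value is None are all assumptions made for the example.

```python
import logging

log = logging.getLogger("tombstone-sketch")

TOMBSTONE_WARN_THRESHOLD = 1000       # default from the text above
TOMBSTONE_FAILURE_THRESHOLD = 100000  # default from the text above


class TombstoneOverwhelmingError(Exception):
    """Hypothetical stand-in for Cassandra's TombstoneOverwhelmingException."""


def scan_slice(cells, warn=TOMBSTONE_WARN_THRESHOLD, fail=TOMBSTONE_FAILURE_THRESHOLD):
    """Scan (name, value) cells; a value of None marks a tombstone.

    Returns the live cells and the number of tombstones scanned.
    Aborts as soon as the tombstone count exceeds the failure threshold.
    """
    live = []
    tombstones = 0
    for name, value in cells:
        if value is None:
            tombstones += 1
            if tombstones > fail:
                log.error("Scanned over %d tombstones; query aborted", fail)
                raise TombstoneOverwhelmingError(f"more than {fail} tombstones scanned")
        else:
            live.append((name, value))
    if tombstones > warn:
        # Above the warn threshold the query still completes.
        log.warning("Read %d live cells and %d tombstones", len(live), tombstones)
    return live, tombstones


# Usage with small thresholds so the behaviour is easy to see.
cells = [("a", None), ("b", "x"), ("c", None), ("d", None)]
live_cells, scanned = scan_slice(cells, warn=2, fail=5)
```

With warn=2 and fail=5 the call logs a warning and still returns the single live cell; lowering fail to 2 makes the same scan raise TombstoneOverwhelmingError on the third tombstone.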
Tombstones can hurt the performance of slice queries, especially for wide rows. When Cassandra executes a column slice query, it needs to read columns from all the SSTables that include the given row and filter out tombstones. All of these tombstones must be kept in memory until the row fragments from all SSTables are merged, which increases heap space usage.
Estimated droppable tombstones
The sstablemetadata tool can be used to estimate the amount of tombstones in a given SSTable. The reported value is the ratio of droppable tombstones to the estimated column count in the SSTable. Tombstones are cleared during compaction based on the gc_grace period. By default, the tool uses the current time as the gc time when deciding whether a tombstone can be dropped. To estimate the tombstones created before a given time, use the command line argument --gc_grace_seconds to pass the GC time in seconds.
Usage: sstablemetadata [--gc_grace_seconds n] <sstable filenames>
$ ./tools/bin/sstablemetadata data/data/ycsb/usertable-8dd787404e2c11e8b1a22f4e9082bb4e/mc-1-big-Data.db | grep "Estimated droppable tombstones"
Estimated droppable tombstones: 0.0
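To make the reported ratio concrete, the sketch below computes it from a plain list of tombstone deletion times and parses the value out of the tool's output line. Both helpers are hypothetical, and the calculation is a simplification: it assumes the exact deletion time of every tombstone is known, whereas the real tool works from statistics stored in the SSTable metadata.

```python
import re
from typing import Iterable, Optional


def droppable_ratio(tombstone_deletion_times: Iterable[int],
                    estimated_column_count: int,
                    gc_before: int) -> float:
    """Ratio of tombstones deleted before gc_before to the estimated column count."""
    if estimated_column_count <= 0:
        return 0.0
    droppable = sum(1 for t in tombstone_deletion_times if t < gc_before)
    return droppable / estimated_column_count


def parse_estimated_droppable(output: str) -> Optional[float]:
    """Extract the value from an 'Estimated droppable tombstones' line, if present."""
    match = re.search(r"Estimated droppable tombstones:\s*([0-9.eE+-]+)", output)
    return float(match.group(1)) if match else None


# Two of three tombstones are older than the cutoff, out of ten estimated columns.
ratio = droppable_ratio([100, 200, 300], estimated_column_count=10, gc_before=250)
parsed = parse_estimated_droppable("Estimated droppable tombstones: 0.0")
```

A ratio close to 0 suggests compacting the SSTable would reclaim little space from tombstones, while a high ratio suggests many tombstones are already past the cutoff.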