Cassandra compaction

Cassandra compaction is a background process that merges multiple SSTables of a Column Family into one or more SSTables to improve read performance and to reclaim the space occupied by deleted data. Cassandra supports different compaction strategies to accommodate different workloads. In this blog post, the size tiered compaction strategy is described.

In the size tiered compaction strategy, SSTables are grouped into buckets based on their size, so that SSTables with similar sizes end up in the same bucket. For each bucket, an average size is calculated from the sizes of the SSTables in it. Two options, bucketLow and bucketHigh, define the bucket width: if an SSTable's size falls between (bucketLow * average bucket size) and (bucketHigh * average bucket size), it is included in that bucket, and the bucket's average size is updated with the newly added SSTable's size. The default value for bucketLow is 0.5 and for bucketHigh is 1.5.
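The bucketing step can be sketched as follows. This is a minimal Python illustration, not Cassandra's actual Java implementation; it omits details such as the special bucket for very small SSTables (min_sstable_size), and represents each SSTable simply by its size in bytes.

```python
BUCKET_LOW = 0.5
BUCKET_HIGH = 1.5

def get_buckets(sizes, bucket_low=BUCKET_LOW, bucket_high=BUCKET_HIGH):
    """Group SSTable sizes into buckets of similar size."""
    buckets = []  # each bucket is [running_average, [member sizes]]
    for size in sorted(sizes):
        for bucket in buckets:
            avg = bucket[0]
            # an SSTable joins a bucket if its size falls inside the
            # (bucketLow * avg, bucketHigh * avg) window
            if bucket_low * avg <= size <= bucket_high * avg:
                bucket[1].append(size)
                # update the average with the newly added SSTable size
                bucket[0] = sum(bucket[1]) / len(bucket[1])
                break
        else:
            # no matching bucket: start a new one seeded with this size
            buckets.append([size, [size]])
    return [b[1] for b in buckets]
```

With the defaults, three 100-ish MB SSTables land together while a 400 MB and a 50 MB SSTable each seed separate buckets.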

Next, the buckets with an SSTable count greater than or equal to the minimum compaction threshold are selected. The thresholds are defined as part of the compaction settings in the Column Family schema:

compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'} 

Finally, Cassandra selects the bucket containing the SSTables with the highest read activity. Each bucket is assigned a read hotness score, and the bucket with the highest score is selected for compaction. The read hotness score of an SSTable is its number of reads per second per key, calculated over a two-hour read rate. The read hotness score of a bucket is the combined score of all SSTables in it.

If two buckets have the same read hotness score, the bucket with the smaller SSTables is selected. If the selected bucket has more SSTables than the maximum compaction threshold, the coldest SSTables are dropped from the bucket to meet the max threshold: the SSTables in the bucket are sorted in descending order of read hotness score, so the coldest SSTables fall at the end of the sorted order.
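The selection and trimming logic above can be sketched like this. Again a simplified Python illustration, not Cassandra's API: each SSTable is modeled as a (size, reads_per_sec_per_key) pair, and the names are invented for this sketch.

```python
def hotness(bucket):
    # a bucket's hotness is the combined hotness of its SSTables
    return sum(reads for _, reads in bucket)

def select_bucket(buckets, min_threshold=4, max_threshold=32):
    # only buckets at or above the minimum threshold are candidates
    eligible = [b for b in buckets if len(b) >= min_threshold]
    if not eligible:
        return None
    # hottest bucket wins; ties go to the bucket with smaller total size
    best = max(eligible, key=lambda b: (hotness(b), -sum(s for s, _ in b)))
    # sort hottest-first and drop the coldest SSTables over the max
    best = sorted(best, key=lambda t: t[1], reverse=True)
    return best[:max_threshold]
```

For example, a four-SSTable bucket with total hotness 5.0 beats a four-SSTable bucket with hotness 2.0, and a three-SSTable bucket is skipped entirely when min_threshold is 4.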

Before performing compaction, the available disk space is checked against the estimated compaction output size. If there is not enough disk space, the largest SSTables are dropped from the input list to reduce the storage space needed. During compaction, the input SSTables are scanned and matching rows from multiple SSTables are merged. Compaction can cause heavy disk I/O. To limit this, the compaction_throughput_mb_per_sec setting throttles compaction activity, and the concurrent_compactors setting limits the number of simultaneous compactions.
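Both settings live in cassandra.yaml. The values below are only an example; 16 MB/s is the long-standing default throughput, and concurrent_compactors defaults to a value derived from the number of disks and cores if left unset.

```yaml
# cassandra.yaml (example values)
compaction_throughput_mb_per_sec: 16
concurrent_compactors: 2
```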

How the actual compaction is performed

A scanner object is created for each input SSTable, which can sequentially iterate the partitions in it. The scanners are added to a min-heap based priority queue to merge the partitions, and partitions are consumed from the queue. The scanner at the root of the priority queue holds the least element in the sorted order of partitions. If multiple SSTables contain the partition that matches the root of the priority queue, the matching partition is consumed from all of those SSTables and their scanners are advanced to the next partitions. Once a scanner's current partition is consumed, it is removed from the top and added back, and the data structure re-heapifies to bubble the smallest element up to the root. During compaction, deleted rows whose deletion timestamp is older than the GC grace period (gc_before) are also dropped.
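The merge described above is a classic k-way merge on a min-heap. Here is a minimal Python sketch (not Cassandra's implementation), assuming each "scanner" is just an iterator of (partition_key, row) pairs in sorted key order, and that merging a partition simply collects the rows from every SSTable that contains it.

```python
import heapq

def merge_scanners(scanners):
    """Yield (partition_key, [rows...]) merged across all scanners."""
    iters = [iter(s) for s in scanners]
    heap = []
    # seed the heap with each scanner's first partition
    for idx, it in enumerate(iters):
        first = next(it, None)
        if first is not None:
            heapq.heappush(heap, (first[0], idx, first[1]))
    while heap:
        # the root holds the least partition key in sorted order
        key, idx, row = heapq.heappop(heap)
        rows = [row]
        # consume the same partition from every other scanner
        while heap and heap[0][0] == key:
            _, j, r = heapq.heappop(heap)
            rows.append(r)
            nxt = next(iters[j], None)
            if nxt is not None:
                heapq.heappush(heap, (nxt[0], j, nxt[1]))
        # advance the scanner we started with and re-heapify
        nxt = next(iters[idx], None)
        if nxt is not None:
            heapq.heappush(heap, (nxt[0], idx, nxt[1]))
        yield key, rows
```

A real compactor would reconcile the rows by timestamp and drop expired tombstones at this point, instead of just collecting them into a list.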

Compaction Statistics

The progress of a running compaction can be monitored using the nodetool compactionstats command. It displays the number of pending compactions, the keyspace and column family being compacted, and the estimated remaining time of the compaction.

At the end of compaction, Cassandra stores various statistics collected during the compaction into the compaction_history table. The statistics can be viewed using the nodetool compactionhistory command.

$ ./bin/nodetool compactionhistory

Compaction History:
id            keyspace_name   columnfamily_name compacted_at            bytes_in bytes_out rows_merged    
f0382bb0-c692-11eb-badd-7377db2be2d0 system  local 2021-06-05T23:46:36.266 10384 5117 {5:1}

Stop Active Compaction

To stop an active compaction, use:

$ nodetool stop COMPACTION

Disable Auto Compactions

Cassandra nodetool provides a command to disable auto compactions.

$ /opt/cassandra/bin/nodetool disableautocompaction

To check the compaction status:

$ /opt/cassandra/bin/nodetool compactionstats

pending tasks: 0

Tombstone Compaction

Sometimes we want to run a single-SSTable compaction to drop the tombstones that are older than the gc_grace period. This can be done using the JMX interface, which can be accessed using the jmxterm jar. The allowed tombstone threshold can be set in the Column Family compaction properties using the ALTER TABLE command. This setting is allowed only for the SizeTieredCompactionStrategy type.
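For example, the tombstone_threshold compaction option can be set like this. The keyspace and table names are placeholders; 0.2 is the documented default ratio of droppable tombstones above which a single-SSTable tombstone compaction is considered.

```sql
ALTER TABLE my_keyspace.my_table
  WITH compaction = {'class': 'SizeTieredCompactionStrategy',
                     'tombstone_threshold': '0.2'};
```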

$ java -jar jmxterm-1.0-alpha-4-uber.jar
$>open <Cassandra PID>
$>run -b org.apache.cassandra.db:type=CompactionManager forceUserDefinedCompaction <SSTable file path> 

We can choose the SSTable for tombstone compaction by checking the SSTable metadata:

$>/opt/cassandra/tools/bin/sstablemetadata <SSTable path> | grep "Estimated droppable tombstones"
Estimated droppable tombstones: 0.1  

Major Compaction

By default, auto compaction selects the input SSTables based on the thresholds defined as part of the Column Family definition. To run compaction on all the SSTables of a given column family, the nodetool compact command can be used to force a major compaction. A major compaction can result in two compaction runs: one for repaired SSTables and one for unrepaired SSTables. If tombstones and live columns are spread across these two sets of SSTables, the major compaction may not be able to drop the tombstones. The tool /opt/cassandra/tools/bin/sstablerepairedset can be used to reset repairedAt to 0 so that all SSTables are included in a single set.

Typical Problems

Compaction is causing high CPU usage. This could be due to many pending compactions or a low throughput limit. Sometimes increasing the throughput limit (compaction_throughput_mb_per_sec) from the default 16 MB/s to a higher value helps. In other cases, reducing concurrent_compactors to 1 can help ease the load.