Designing Hyper Scalable Services

Scaling services is a hard problem, but operating them at scale is even harder. You need to consider the operational aspects of running a service at scale during design time. Service downtime can cause huge losses to the business and leads to poor customer experience.

Operational Challenges

Workloads

Workloads can be unpredictable. Especially in multi-tenant or multi-application systems, each tenant's or application's use case, and the corresponding workload, can be different. Tuning the service for one type of workload can cause severe problems for other workloads.

Resources

Modern services are designed as microservices. One microservice consuming too many resources can cause problems for other microservices. Careful tuning and monitoring are required to operate the overall service efficiently. Sometimes one machine can have a high load compared to other machines due to uneven load distribution or due to configuration issues in DNS.

Reliability

Commodity hardware is increasingly used to deploy services, and failures of different components are common in these environments. Detection and self-healing become critical to reduce downtime. Failures can also come from less frequent, unexpected events such as datacenter power failures and network cable issues.

Security

Security should be the highest priority for any kind of service. From the beginning, you need to design a strong security posture. This includes how you authenticate and authorize users, how you protect data privacy, and how you monitor for vulnerabilities. Threats are becoming common, and immediate remediation and patching are needed when vulnerabilities are identified. Different countries have different privacy regulations, and these need to be considered at design time. You may also need auditing capabilities if you are providing compliance.

Performance

Faster response times are crucial for a business to reduce costs, gain competitive advantage, and provide a better user experience. It is crucial to know the bottlenecks ahead of time by doing back-of-the-envelope calculations. Simulations can help you understand the system at large scale and identify its limits theoretically.

Capacity

Dynamically expanding and shrinking capacity, as public clouds do, is critical for modern services to reduce operating costs.

Management

With the popularity of public clouds, customers and DevOps teams expect services that are easy to manage and operate at scale.

Upgrades

It is difficult to upgrade a service at scale, and release cycles tend to take more time the longer the service is in operation. You need to spend more time on testing and ensure the quality of each release so that an upgrade has less impact. Upgrade testing should be run with real-world service data and different deployment configurations.

Programmability

It is hard to release a patch for every small bug or improvement. If a service provides a well-defined management or administrative API, it can be used to provide workarounds for issues until the next release.

Backward Compatibility

Between releases, API interfaces or data formats can change. Some functionality available in previous releases may be deprecated while users still depend on it. It is extremely important to consider backward compatibility during design.

Hardware Replacement

Commodity hardware has a limited lifetime, and replacing it impacts service availability. If the service is deployed in a redundant configuration, the impact can be minimized.

Logging

Application logs are very important for investigating production issues. There should be the right amount of logging in each transaction or API-processing path to debug issues.

Statistics

You cannot log each step of processing, and even very detailed logs can be rolled over due to predefined thresholds. It is better to log statistics to a separate log file and also expose them through an API or command. Statistics can help in investigating complex problems such as performance or resource issues.

QOS

When multiple applications or users share the same platform, QoS prevents noisy-neighbor problems and ensures fair sharing.

Tools

Tools provide various advantages to a service. They can provide access to some of the internal state and a way to operate on the collected information. Sometimes they can provide a workaround that gives immediate relief for an operational issue.

Given all the issues listed above, designing resilient services requires considering and planning for operational aspects at design time.


RocksDB Java Native Memory Growth

RocksDB is developed in C/C++ and has a Java binding which provides a set of Java classes to access RocksDB. In a Java application, memory is managed by the JVM: object allocations are done on the JVM heap and GC manages the heap. But RocksDB allocates C/C++ objects using the OS malloc library. When RocksDB is used in a Java application, we need to be extra careful about the allocations made by RocksDB.

For a Java process these are native memory allocations which are not on the heap. If the Java process doesn't delete those objects or close the handles properly after use, they will remain, and the Java process resident set size (RSS) starts growing. Even if the application is well written and properly handles the allocations, sometimes, depending on the workload, it may need to access several SSTables (RocksDB sst files) to serve client requests. This can cause uncontrollable growth if RocksDB is not tuned properly, leading to out-of-memory conditions and eventually a process crash by the kernel OOM killer. There is a lengthy discussion on the thread listed in the References section which provides a lot of detail on how to monitor and resolve this problem if you are experiencing it.
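Most RocksJava classes that wrap native objects implement AutoCloseable, so the usual defense is try-with-resources. Here is a minimal sketch (the path and key are just placeholders):

import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class CloseHandles {
    public static void main(String[] args) throws RocksDBException {
        RocksDB.loadLibrary();
        // Options and RocksDB wrap native C++ objects; closing them frees
        // the native allocations instead of letting the process RSS grow.
        try (Options options = new Options().setCreateIfMissing(true);
             RocksDB db = RocksDB.open(options, "/tmp/testdb")) {
            db.put("key".getBytes(), "value".getBytes());
        } // close() releases the native handles here
    }
}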



References

https://github.com/facebook/rocksdb/issues/4112


RocksDB

RocksDB is a popular open source embedded key-value database used by several other popular systems. It is built on the LevelDB code base, which was developed by Google. More and more open source as well as commercial systems have started using RocksDB due to its high performance, flexibility, and tuning features. There are also many other databases that have improved or customized RocksDB based on their needs.

I am going to write posts which can help in reviewing the RocksDB code. The posts I have already published are listed below.

RocksDB Environment

How To Debug RocksDB Source Code

RocksDB Put Operation

RocksDB Get Operation

RocksDB Java Native Memory Growth 

Please let me know if you would like to know about any particular area of RocksDB. Also let me know if any updates are needed to the above posts.

RocksDB Put Operation

 

RocksDB Put operation creates a new record in the specified DB. The RocksDB code is very flexible and has several levels of abstraction. If you are planning to understand the code path, you can start reading from the functions listed in this post.

DB Put Operation
RocksDB stores recently inserted data in the MemTable and periodically flushes it to disk-based SSTables (sst files). But MemTable data can be lost when a power failure occurs or the application crashes, so the changes need to be made durable. RocksDB uses a WAL (write-ahead log) file to provide durability: every write is stored in the MemTable and also committed to the WAL. The entry function for the Put operation is db/db_impl/db_impl_write.cc:Put().

Write Batch

RocksDB creates a write batch even for a single write. The write batch holds a set of key-value records, the count of records in it, a sequence number, and a checksum. This code is in db/write_batch.cc.
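To see a batch from the application side, here is a minimal sketch using the Java binding (the C++ API is analogous). Both puts travel as one atomic batch: one sequence number range and a single WAL append.

import org.rocksdb.*;

public class BatchExample {
    public static void main(String[] args) throws RocksDBException {
        RocksDB.loadLibrary();
        try (Options options = new Options().setCreateIfMissing(true);
             RocksDB db = RocksDB.open(options, "/tmp/testdb");
             WriteBatch batch = new WriteBatch();
             WriteOptions wo = new WriteOptions()) {
            // Records accumulate in the batch first ...
            batch.put("k1".getBytes(), "v1".getBytes());
            batch.put("k2".getBytes(), "v2".getBytes());
            // ... then one write() commits the whole batch.
            db.write(wo, batch);
        }
    }
}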
 
WAL Commit
 
This code is in db/db_impl/db_impl_write.cc:WriteToWAL(). It serializes the write batch into the disk storage format and appends it to the current WAL file. See also db/log_writer.cc:log::Writer::EmitPhysicalRecord() and file/writable_file_writer.cc:WritableFileWriter::Append().

MemTable Write

The MemTable insert code is in the WriteBatchInternal::InsertInto() function. There are different configuration options for performing the MemTable write. The code is in DBImpl::WriteImpl().

RocksDB Get Operation

RocksDB Get operation retrieves the record value for a given key. The RocksDB code is very flexible and has several levels of abstraction. If you are planning to understand the code path, you can start reading from the functions listed in this post.

DB Get Operation
RocksDB stores recently inserted data in the MemTable and periodically flushes it to disk-based SSTables (sst files). It needs to search both the MemTable and the SSTables for the given key.
The entry function for the Get operation is db_impl_readonly.cc:DBImplReadOnly::Get().

MemTable lookup
 
First it searches for the given key in the MemTable and returns the value if it is found there. The MemTable is managed using a SkipList, and the search is done on it. The main function for this is db/memtable.cc:MemTable::GetFromTable().

Table Lookup

If the record is not found in the MemTable, the key is searched in the table cache. This code is in db/table_cache.cc:TableCache::Get().
 
If it is not found in the table cache, the files at each level are searched, starting from level 0. If you use BlockBasedTable, the default option, the entry point is table/block_based/block_based_table_reader.cc:BlockBasedTable::Get(). It searches the block index to find the given key, using the BlockIter::Seek() function. The keys are compared using the comparator function and the corresponding value is returned.


RocksDB Environment

RocksDB and LevelDB have an abstraction called Env (environment) which provides an interface to operating-system-specific functions. This abstraction exists in the traditional BerkeleyDB as well. It nicely separates the database code from OS-specific functionality by encapsulating it.

The Env object provides an interface to access the system time, the file system API, the thread API, synchronization primitives, and the shared library management API. The RocksDB distribution comes with a few supported environments such as POSIX, HDFS, and Windows. You can also customize parts of the environment by providing a new implementation. For example, you can replace the file system calls used by the Env object with a memory-based implementation, or you can mirror writes to another environment.

To implement a new environment, for example a cloud environment, you need to override the virtual functions in the Env class and provide an implementation using the cloud APIs.

Posix Environment

It is the default environment when you create a new database. It uses various POSIX calls to provide OS-level services. The table below describes the APIs used.

Service     System Calls
Time        clock_gettime(), gettimeofday()
File        fopen(), fclose(), open(), close(), fcntl(), stat(), fsync(), statvfs(), rename(), link(), unlink(), access(), pread(), pwrite()
Directory   mkdir(), rmdir(), getcwd(), opendir(), readdir()
Threads     pthread_create(), pthread_self(), pthread_mutex_init(), pthread_mutex_lock(), pthread_mutex_unlock(), pthread_join()
Libraries   dlopen(), dlclose(), dlsym(), dlerror()

 

If you want to understand the RocksDB source code you can debug the code using the information provided in another blog post How To Debug RocksDB Source Code.

How To Debug RocksDB Source Code

RocksDB is a popular open source embedded key-value database used by several other popular systems. It is a derivative of LevelDB, which was developed by Google. More and more open source as well as commercial systems have started using RocksDB due to its high performance, flexibility, and tuning features.

I wanted to read the source code, but the code base has grown over the years. I tried to search for some of the function implementations, and they are not easy to find due to the many layers of abstraction. The RocksDB site provides very good information on the architecture, internal mechanisms, configuration options, API, etc. It shows a sample program in Getting Started, and I wanted to use this program to debug the RocksDB code.

First we need to build RocksDB with debug information enabled in the library. For this you can download the source code from GitHub.


git clone https://github.com/facebook/rocksdb.git

In order to compile the code you need to install the dependent libraries. The best place to get this information is INSTALL.md, which has instructions for each supported OS. Please follow it and install all the required libraries. The latest code requires GCC version 7.0 or greater. I am using CentOS 7.8, which ships GCC 4.8.5, so I had to download the GCC source code, compile it, and install it to get GCC 7.3.0. The GCC compilation took a while. Once GCC was installed successfully, I compiled the RocksDB code.
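The build step itself looked roughly like the following; treat the exact target and DEBUG_LEVEL semantics as assumptions and check INSTALL.md for your version of the source tree:

$ cd rocksdb
$ make static_lib DEBUG_LEVEL=1 -j4   # debug build; produces librocksdb_debug.a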

 Here is my sample program based on the Getting Started guide.

 

$ cat Test.cc
#include <iostream>
#include "assert.h"
#include "rocksdb/db.h"

int main(int argc, char** argv) {
    rocksdb::DB* db;
    rocksdb::Options options;
    options.create_if_missing = true;
    rocksdb::Status s =
      rocksdb::DB::Open(options, "/tmp/testdb", &db);
    assert(s.ok());
    std::string value = "Hello World";
    std::string key = "key";
    std::string value1;

    if (s.ok()) s = db->Put(rocksdb::WriteOptions(), key, value);
    if (s.ok()) s = db->Get(rocksdb::ReadOptions(), key, &value1);
    if (s.ok()) std::cout << value1 << std::endl;

    delete db;
    return 0;
}

To compile you need to include all the dependent libraries. 

g++ -v -I./include -I/usr/include -g -o Test Test.cc \
-std=gnu++17 -lzstd -lpthread -lsnappy -lbz2 -llz4 \
-lz -ldl ./librocksdb_debug.a

I used the -v flag to see more information when the compiler reports errors, and the -g flag to include debug information in the test program. Once you compile the sample code successfully, you can start debugging the code using gdb.
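For example, you can set a breakpoint on the Put path (DBImpl::Put in db/db_impl/db_impl_write.cc, the entry point mentioned in the RocksDB Put Operation post) and inspect the call stack:

$ gdb ./Test
(gdb) break rocksdb::DBImpl::Put
(gdb) run
(gdb) backtrace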

The dynamic libraries linked to the program are:

$ ldd Test
    linux-vdso.so.1 =>  (0x00007ffd52ffb000)
    libzstd.so.1 => /usr/local/lib/libzstd.so.1 (0x00007f26d5a7a000)
    libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x00007f26d585e000)
    libsnappy.so.1 => /usr/lib64/libsnappy.so.1 (0x00007f26d5658000)
    libbz2.so.1 => /usr/lib64/libbz2.so.1 (0x00007f26d5448000)
    liblz4.so.1 => /usr/lib64/liblz4.so.1 (0x00007f26d5239000)
    libz.so.1 => /usr/lib64/libz.so.1 (0x00007f26d5023000)
    libdl.so.2 => /usr/lib64/libdl.so.2 (0x00007f26d4e1f000)
    libstdc++.so.6 => /usr/local/lib64/libstdc++.so.6 (0x00007f26d4a9d000)
    libm.so.6 => /usr/lib64/libm.so.6 (0x00007f26d479b000)
    libgcc_s.so.1 => /usr/local/lib64/libgcc_s.so.1 (0x00007f26d4584000)
    libc.so.6 => /usr/lib64/libc.so.6 (0x00007f26d41b6000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f26d5cef000)

Please let me know if you encounter any issues. I spent a lot of time figuring out how to debug this simple program. Hope this helps; enjoy debugging the code. I also created another blog article on the RocksDB Environment, which abstracts the operating system specifics from the DB code.

 

Cassandra Heap Issues

Cassandra Out Of Memory (OOM) exceptions can cause performance impact and reduce availability. An OOM exception is reported when Cassandra is unable to allocate heap space. Sometimes the application workload alone can cause Cassandra to quickly use all the available heap space and throw OOM exceptions. More commonly, application workload combined with background jobs such as repair or compaction increases heap usage and eventually leads to OOM exceptions. Some of the components and background jobs that use significant heap space are:

Memtables

Cassandra by default allocates 1/4 of the max heap to Memtables. This can be configured in cassandra.yaml to reduce the heap used by Memtable allocations.

# memtable_heap_space_in_mb: 2048
# memtable_offheap_space_in_mb: 2048

When Cassandra is under severe heap pressure it starts flushing aggressively, which results in smaller SSTables and creates high background compaction activity.

Reads

CQL queries that read many columns or rows per query can consume a lot of heap space, especially when the rows are wide (contain a huge number of columns per row) or contain tombstones. When rows are replicated to multiple nodes, the coordinator node needs to hold the data longer to meet the consistency constraints. When rows are spread across multiple SSTables, the results need to be merged from multiple SSTables, which increases heap usage. Cassandra also needs to filter out tombstones to read the live columns from a row, so the number of tombstones scanned per query affects heap usage as well. The Cassandra nodetool cfstats command provides some metrics on these factors. The following sample cfstats output, collected from a live cluster, shows an application using wide rows with a huge number of columns and a huge number of tombstones per row:

                Average live cells per slice (last five minutes): 8.566489361702128
                Maximum live cells per slice (last five minutes): 1109
                Average tombstones per slice (last five minutes): 1381.0478723404256
                Maximum tombstones per slice (last five minutes): 24601

The Cassandra nodetool cfhistograms command also provides metrics on the SSTables. It displays statistics on row size, column count, and the number of SSTables scanned per read.

Percentile   SSTables   Write Latency   Read Latency   Partition Size   Cell Count
                        (micros)        (micros)       (bytes)
50%          1.00       51.01           182.79         1331             4
75%          1.00       61.21           219.34         2299             7
95%          3.00       73.46           2816.16        42510            179
98%          3.00       88.15           4055.27        182785           770
99%          3.00       105.78          4866.32        454826           1916
Min          0.00       6.87            9.89           104              0
Max          3.00       379.02          74975.55       44285675122      129557750

Concurrent Read threads

Cassandra by default uses 32 concurrent threads in the read stage. These threads serve read queries; if more read requests are received, they go into a pending queue. The Cassandra nodetool tpstats command can be used to check the active and pending read-stage tasks. If each read query consumes a large amount of heap space, reducing this setting can lower heap usage.

Repair 

Cassandra repair creates Merkle trees to detect inconsistencies in row data among multiple replicas. By default Cassandra creates Merkle trees to a depth of 15, so each tree contains 32768 leaf nodes and consumes a lot of heap space. Repair running concurrently with application load can cause high heap usage.

Compaction

Background compactions can also cause high heap usage. By default compaction_throughput_mb_per_sec is set to 16 MB.

Other factors

Other factors that can cause high heap usage are Cassandra hints, the row cache, tombstones, and the space used for index entries.

Workarounds

If enough system memory is available, increasing the max heap size can help. Sometimes min heap and max heap are set to the same value; reducing the min heap size can help reduce heap usage. If the application allocates bigger chunks of memory, increasing the G1GC region size can help reduce heap fragmentation and improve heap utilization. Reducing the concurrent read threads can also lower heap usage, at the cost of read performance.
 

Cassandra compaction

Cassandra compaction is a background process which merges multiple SSTables of a column family into one or more SSTables to improve read performance and to reclaim the space occupied by deleted data. Cassandra supports different compaction strategies to accommodate different workloads. In this blog post, the size tiered compaction strategy is described.

In the size tiered compaction strategy, SSTables are grouped into buckets based on their size; SSTables with similar sizes land in the same bucket. For each bucket an average size is calculated from the sizes of the SSTables in it. There are two options per bucket, bucketLow and bucketHigh, which define the bucket width. If an SSTable's size falls between (bucketLow * average bucket size) and (bucketHigh * average bucket size), it is included in this bucket, and the average size of the bucket is updated with the newly added SSTable's size. The default value for bucketLow is 0.5 and for bucketHigh is 1.5.
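The bucketing logic is roughly the following; this is a minimal, self-contained Java sketch with hypothetical names, not Cassandra's actual SizeTieredCompactionStrategy code:

import java.util.*;

public class SizeTieredBuckets {
    static final double BUCKET_LOW = 0.5, BUCKET_HIGH = 1.5;

    static List<List<Long>> bucket(List<Long> sstableSizes) {
        List<List<Long>> buckets = new ArrayList<>();
        List<Double> averages = new ArrayList<>();
        List<Long> sorted = new ArrayList<>(sstableSizes);
        Collections.sort(sorted);
        for (long size : sorted) {
            boolean placed = false;
            for (int i = 0; i < buckets.size(); i++) {
                double avg = averages.get(i);
                // An SSTable joins a bucket when its size is within the
                // [bucketLow * avg, bucketHigh * avg] band of that bucket.
                if (size >= BUCKET_LOW * avg && size <= BUCKET_HIGH * avg) {
                    buckets.get(i).add(size);
                    // Update the running average with the new member.
                    averages.set(i, (avg * (buckets.get(i).size() - 1) + size)
                                     / buckets.get(i).size());
                    placed = true;
                    break;
                }
            }
            if (!placed) { // no band fits: start a new bucket for this size
                buckets.add(new ArrayList<>(List.of(size)));
                averages.add((double) size);
            }
        }
        return buckets;
    }

    public static void main(String[] args) {
        // Similar sizes group together; the outlier gets its own bucket.
        System.out.println(bucket(List.of(100L, 110L, 95L, 500L, 520L, 4000L)));
    }
}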

Next, the buckets with an SSTable count greater than or equal to the minimum compaction threshold are selected. The thresholds are defined as part of the compaction settings in the column family schema.

compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'} 

Finally, Cassandra selects the bucket that contains the SSTables with the highest read activity. Each bucket is assigned a read hotness score, and the bucket with the highest score is selected for compaction. The read hotness score for an SSTable is the number of reads per second per key, calculated over a two-hour read rate. For a bucket, the read hotness score is the combined score of all the SSTables in it.

If two buckets have the same read hotness score, the bucket with smaller SSTables is selected. If the selected bucket has more SSTables than the maximum compaction threshold, the coldest SSTables are dropped from the bucket to meet the max threshold. The SSTables in a bucket are sorted in descending order of read hotness score, so the coldest SSTables are at the end.

Before performing compaction, the available disk space is checked against the estimated compaction output size. If the available disk space is not enough, the largest SSTables are dropped from the input list to reduce the storage space needed. During compaction, the input SSTables are scanned and the same rows from multiple SSTables are merged. Compaction can cause heavy disk I/O; to limit it, the compaction_throughput_mb_per_sec setting throttles compaction activity, and concurrent_compactors limits the number of simultaneous compactions.
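Both knobs live in cassandra.yaml. Shown below with the default throughput; concurrent_compactors is commented out by default, in which case Cassandra derives a value from the number of disks and cores:

compaction_throughput_mb_per_sec: 16
# concurrent_compactors: 1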

How the actual compaction is performed

A scanner object is created for each input SSTable, which can sequentially iterate the partitions in it. The input SSTable scanners are added to a min-heap based priority queue to merge the partitions, and partitions are consumed from the queue. The scanner at the root of the priority queue holds the least element in the sorted order of partitions. If multiple SSTables contain the same partition matching the root of the priority queue, the matching partition is consumed from all of those SSTables and each of them advances to its next partition. Once a scanner's element is consumed from the priority queue, it is removed from the top and added back with its next element, and the data structure re-heapifies to bubble the smallest element up to the root. During compaction, deleted rows with a delete timestamp older than the GC grace period (gc_before) are also dropped.
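Roughly, the merge works like the following sketch; the types and names here are hypothetical, and Cassandra's real implementation is considerably more involved:

import java.util.*;

public class CompactionMerge {
    // Wraps an iterator (a stand-in for an SSTable scanner) and exposes
    // its current (head) partition key.
    static class PeekingScanner {
        final Iterator<String> it;
        String head;
        PeekingScanner(Iterator<String> it) { this.it = it; this.head = it.next(); }
        boolean advance() {
            if (!it.hasNext()) return false;
            head = it.next();
            return true;
        }
    }

    static void merge(List<Iterator<String>> scanners) {
        // Min-heap ordered by the scanners' current partition key.
        PriorityQueue<PeekingScanner> heap =
            new PriorityQueue<>(Comparator.comparing((PeekingScanner s) -> s.head));
        for (Iterator<String> it : scanners)
            if (it.hasNext()) heap.add(new PeekingScanner(it));

        while (!heap.isEmpty()) {
            String key = heap.peek().head;
            int versions = 0;
            // Consume the same partition from every scanner that has it;
            // each consumed scanner advances and re-enters the heap, which
            // re-heapifies to bubble the new smallest key to the root.
            while (!heap.isEmpty() && heap.peek().head.equals(key)) {
                PeekingScanner s = heap.poll();
                versions++;
                if (s.advance()) heap.add(s);
            }
            System.out.println("merged " + versions + " version(s) of partition " + key);
        }
    }

    public static void main(String[] args) {
        merge(List.of(List.of("a", "c", "d").iterator(),
                      List.of("a", "b", "d").iterator()));
    }
}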

Compaction Statistics

The progress of a running compaction can be monitored using the nodetool compactionstats command. It displays the number of pending compactions, the keyspace and column family being compacted, and the remaining compaction time.

At the end of a compaction, Cassandra stores various statistics collected during the compaction into the compaction_history table. The statistics can be viewed using the nodetool compactionhistory command.

$ ./bin/nodetool compactionhistory

Compaction History:
id            keyspace_name   columnfamily_name compacted_at            bytes_in bytes_out rows_merged    
f0382bb0-c692-11eb-badd-7377db2be2d0 system  local 2021-06-05T23:46:36.266 10384 5117 {5:1}          
Stop Active Compaction

To stop an active compaction use

$ nodetool stop COMPACTION

Disable Auto Compactions

Cassandra nodetool provide a command to disable auto compactions. 

$ /opt/cassandra/bin/nodetool disableautocompaction

To check the compaction status

$ /opt/cassandra/bin/nodetool compactionstats

pending tasks: 0

Tombstone Compaction

Sometimes we want to run a single-SSTable compaction to drop the tombstones that are older than the gc_grace period. This can be done using the JMX interface, which can be accessed using the jmxterm jar. The allowed tombstone threshold can be set on the column family compaction properties using an ALTER TABLE command. This setting is allowed only for the SizeTieredCompactionStrategy type.

$ java -jar jmxterm-1.0-alpha-4-uber.jar
$>open <Cassandra PID>
$>run -b org.apache.cassandra.db:type=CompactionManager forceUserDefinedCompaction <SSTable file path> 

We can choose the SSTable for tombstone compaction by checking the SSTable metadata:

$>/opt/cassandra/tools/bin/sstablemetadata <SSTable path> | grep "Estimated droppable tombstones"
Estimated droppable tombstones: 0.1  

Major Compaction

By default, auto compaction selects the input SSTables for compaction based on the thresholds defined as part of the column family definition. To run compaction on all the SSTables of a given column family, the nodetool compact command can be used to force a major compaction. A major compaction can result in two compaction runs: one for repaired SSTables and another for unrepaired SSTables. If tombstones and live columns are spread across these two different sets of SSTables, a major compaction may not be able to drop the tombstones. The tool /opt/cassandra/tools/bin/sstablerepairedset can be used to reset repairedAt to 0 so that all SSTables are included in a single set.

Typical Problems

Compaction causing very high CPU usage: this could be due to many pending compactions or too low a throughput limit. Sometimes increasing the throughput limit (compaction_throughput_mb_per_sec) from the default 16 MB to a higher value can help. In other cases, reducing concurrent_compactors to 1 can help ease the load.



Cassandra Commit Log

Cassandra Commit Log is an append-only log that is used to track the mutations made to column families and provide durability for those mutations. Every mutation is appended to the Commit Log first, before being applied to an in-memory write-back cache called the Memtable. If a Memtable is not flushed to disk to create an SSTable, the mutations stored in it can be lost if Cassandra suddenly shuts down or crashes. At startup, Cassandra replays the Commit Log to recover the data that was not stored to disk.

A mutation can be an insert, update, or delete operation on a partition and can involve one or more columns. Each mutation is associated with a partition key and a timestamp. If multiple mutations occur on the same partition key, Cassandra maintains the latest one. The Memtable also combines different mutations occurring on the same partition and groups them as a single row mutation.

The Commit Log provides a faster way to persist the changes applied to Memtables before they are flushed to disk. SSTables are stored in separate files for each column family, and Cassandra can receive mutations to multiple column families at the same time. If these mutations were persisted directly, Cassandra would need to issue multiple disk writes, one for each mutation applied to a column family, causing more disk seeks and random disk I/O, which would hurt performance. The Commit Log solves this performance issue by converting the random I/O into sequential I/O: mutations of different column families are written to the same file. It also batches mutations, so that a single disk write syncs multiple mutations to disk, reducing disk I/O while making the changes durable. This in turn allows the mutations of each column family to be buffered in a separate Memtable and flushed to disk using sequential I/O.

The Commit Log cannot grow forever. It needs to be truncated often to reduce its size, so that Cassandra spends less time replaying the unsaved changes the next time it starts after a sudden shutdown. Cassandra provides a few configuration settings to tune the Commit Log size. Mutations in the Commit Log are also cleaned up when a Memtable is flushed to an SSTable. For each Memtable flush, Cassandra maintains the latest offset of the Commit Log, called the replay position. When the Memtable flush completes successfully, all the mutations for that particular column family which were stored before the replay position are discarded from the Commit Log.

Commit Log Configuration

The setting commitlog_segment_size_in_mb controls the max size of an individual Commit Log segment file. A new Commit Log segment is created when the current active segment reaches this threshold. It defaults to 32 MB.

Commit Log disk sync frequency can be controlled using the configuration setting commitlog_sync. If it is set to batch mode, Cassandra will not acknowledge writes until the Commit Log is synced to disk. Instead of executing an fsync for every write, another setting, commitlog_sync_batch_window_in_ms, indicates the number of milliseconds between successive fsyncs; it is set to 2 milliseconds by default. The alternative is setting commitlog_sync to periodic, in which case Cassandra acknowledges writes immediately and the Commit Log is synced every commitlog_sync_period_in_ms milliseconds. The default value for commitlog_sync is periodic with a sync period of 10 seconds (10000 ms). The Commit Log protects only the data stored locally, so the interval between fsyncs has a direct impact on the amount of unsaved data when Cassandra goes down suddenly. Having a replication factor greater than one protects the data if one of the replicas goes down.
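For reference, the relevant cassandra.yaml settings with the defaults described above (the commented lines show the batch-mode alternative):

commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000
# commitlog_sync: batch
# commitlog_sync_batch_window_in_ms: 2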

The total disk space used for Commit Log files is controlled using the commitlog_total_space_in_mb setting. When the disk space reaches this threshold, Cassandra flushes all the column families that have mutations in the oldest Commit Log segment and removes that segment. For this purpose, a list of dirty column families is maintained in each Commit Log segment. The threshold defaults to the smaller of 8 GB and 1/4 of the total space of the Commit Log volume.

There can be multiple segments storing the Commit Log at any time. All Commit Log segments are managed in a queue, with the current active segment at the tail of the queue. A Commit Log segment can be recycled when all the mutations in it have been flushed successfully.

Commit Log Segment

It represents a single Commit Log file on disk and stores mutations of different column families. The file name format is CommitLog-<Version>-<Msec Timestamp>.log (for example: CommitLog-6-1621835590940.log). It maintains a map to manage the dirty column families and their last mutation positions in the file. The Commit Log segment is memory mapped, and the operations are applied to the memory-mapped buffer and synced to disk periodically.


 

The contents of a Commit Log segment are memory mapped to a buffer. Mutations are applied to this buffer and persisted to disk on sync. At the beginning of the Commit Log a file header is written which includes the version, the Commit Log ID, and parameters associated with compressed or encrypted Commit Logs. A CRC checksum computed over the header fields is written at the end of the header.

Mutations are allocated on the active segment. If there is not enough space to write a mutation in the current active segment, a new segment is created. Commit Log segments are managed as a queue, with the active segment at the tail of the queue.

The Commit Log remembers which mutations have been synced to disk and which have not using two pointers. The Last Synced Offset represents the position in the Commit Log before which all mutations have been synced to disk. The Last Marker Offset points to the position of the last mutation that has not yet been synced to disk. All the mutations between these two pointers are not yet synced.

A Commit Log is divided into sections separated by sync markers. These sync markers are chained: each marker points to the next one. Every time a sync is performed on a Commit Log, a new marker is created. The sync marker consists of two integers. The first integer stores the Commit Log file position where the next sync marker will be written, which indicates the section length. The second integer stores the CRC of the Commit Log ID and the file position where the sync marker is written.
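Putting the pieces together, a segment laid out on disk looks roughly like this (a simplified illustration; the exact encoding depends on the Commit Log version):

[file header: version | segment id | compression/encryption params | header CRC]
[sync marker: next marker position | marker CRC] [serialized mutations ...]
[sync marker: next marker position | marker CRC] [serialized mutations ...]
...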