Key Concepts

How KV Cache is revolutionizing AI inference

In AI inference, KV Cache is a memory optimization technique that reduces inference costs and improves Time to First Token (TTFT) from seconds to milliseconds. It helps in prefill stage of inference and prevents a model from re computing everything the model has read or processed during a single conversation. Without it, the computation work grows exponentially.

LLMs based on Transformer architecture are auto-regressive. They generate output text one token at a time. To predict the next word, the model needs to process all previous words in the sequence. For example if a user ask "How is the weather in San Francisco?", the model processes all seven words in the input prompt and predicts "It". To predict the next token may be "is" it needs to reprocess "How is the weather in San Francisco? It". This generation process will become incredibly slow as the length of the prompt increases. The solution is to cache the already processed previous tokens.

During the Attention phase, the LLM calculates three vectors Query(Q), Key(K), and Value(V) for each token. The Key vector describes what the token contains while the Value vector describes the actual meaning of the token in its context. The KV Cache stores these processed vectors. Instead of recalculating the K, V values for the tokens which have been already processed, every time a new word is generated, the LLM simply looks up in the cache for the previously calculated key, values for the previous words. It only processes the latest token.

So far we discussed KV Cache for a single user conversation. It can also be beneficial for multiple user conversations. KV Cache reuse will work well if the context that prefix a prompt remain the same for multiple users. This may not be easy but for shared content types the context can be cached. For example multiple developers working on the same code repository or multiple students trying to learn from the same text book.

The popular method is matching the prefix of the contexts of two users. If User A and User B both start their prompt with the same 1000 token system instruction, the model computes that 1000 token KV cache once. User B gets that part for free (near-zero compute time). This is the best case. But for most of the cases there will be some changes in the context. Even if there is a single character mismatch then the KV Cache can not be reused. Some advanced frameworks can split the input prompt into multiple srgments and reuse the KV Cache for the matching segments even if the prefixes don't match.

We use LLMs to perform common tasks like coding and solving problems. For these kind of tasks we need to provide additional information to LLM, called context. For example you attach a source code file to understand a function or upload a pdf document to summarize. In a multi turn interaction with LLM, all the information exchanged in that specific chat session becomes context. The context helps LLM to provide more accurate and relevant response to a prompt. Companies now freeze the KV Cache for popular documents (like a coding library or a textbook) so they don't have to pay to read it every time a new user asks a question. Many API providers offer Context Caching discounts, charging up to 90% less for tokens that hit the cache ($0.20 vs $2.00 per 1M tokens).

As context windows have exploded (from 8k to 2M+ tokens), the KV Cache has become the primary bottleneck for AI hardware.The KV Cache lives close to the GPU in the fastest memory (HBM). However a GPU has limited capacity of HBM.

KV Context is specific to each Model type. You cannot take a KV cache from GPT-4 and give it to Claude 3. The Keys and Values are calculated based on that specific model's internal weights, hidden dimensions, and number of attention heads.

KV Cache has revolutionized the way LLMs process and generate text. By caching already processed previous tokens in a sequence, it significantly reduces inference costs and improves Time to First Token (TTFT). While it may not be suitable for all use cases, its benefits are evident in many applications, including multiple user conversations, coding, and solving problems. As context windows continue to grow, the KV Cache will remain a crucial component of AI inference.

Designing Hyper Scalable Services

Scaling services is a hard problem but operating them at the scale is even more hard problem. You need to consider the operational aspects of running it at scale during design time. Service downtime can cause huge losses to business and leads to poor customer experience.

Operation Challenges

Workloads

Workloads can be unpredictable. Especially in case of multi-tenant systems, or multi application systems each tenant or application use case and the corresponding workload can be different. Tuning the service to one type of workload can cause severe problems to other workloads.

Resources

Modern services are designed as Micro services. One micro service consuming high resources can cause problems to other micro services. Careful tuning and monitoring is required to operate overall service efficiently. Sometimes one machine can have high load compared to other machines due to uneven distribution of load or due to configuration issues in DNS.

Reliability

Commodity hardware has been used increasingly to deploy the services. Failures of different components is common in these environments. Detecting, self healing becomes critical to reduce the downtime. Failures can come from unexpected events also which are less frequent such as Datacenter power failure and Networking cable issues.

Security

Security should be the highest priority for any kind of service. From the beginning, you need to design a strong security posture. This includes how you authenticate, authorize the users, how do you protect the data privacy, and also how you monitor the vulnerabilities. Threats are becoming common and immediate remediation and patching needed when vulnerabilities identified. Different countries have different privacy regulations and these needs to be considered at the design time. You may need auditing capabilities also if you are providing compliance.

Performance

Faster response times are crucial for business to reduce the costs, gain competitive advantage and better user experience. It is crucial to know the bottlenecks ahead by doing some kind of back on the envelop calculations. Simulations can help understand the system at large scale and identify the limits theoratically.

Capacity

Dynamically expanding and shrinking like public clouds is critical for modern services to reduce the operation costs of service.

Management

With the popularity of public clouds customers or devops are expecting easy to manage services and operate them at scale

Upgrades

It is difficult to upgrade the service at scale and the release cycles will tend to take more time as the service is in operation longer. You need to spend more time on testing and ensure the quality for each release so that you can have less impact when a upgrade is done. The upgrading testing should be run with some real world service data and different deployment configurations

Programmability

It is hard to release a patch for every small bug or improvement. If service can provide a well defined management or administrative API then it can be used to provide a workaround for the issues till the next release

Backward Compatibility

Between releases, API interface or data formats can change. Some of the functionality available in previous releases can be deprecated but users are depending on it. It is extremely important to consider the backward compatibility during design

Hardware Replacement

Commodity hardware will have certain time of life. Replacing the hardware impacts service availability. If the service is deployed in a redundancy configuration the impact can be minimized

Logging

Application logs are very important to investigate the production issues. There should be right amount of logging in each transaction or API processing to debug the issues

Statistics

You can not log each step of the processing. Even your logs are very detailed the logs can be rolled over due to the predefined thresholds. It is better to log the statistics in to a separate log file and also to expose them through an API or command. The Statistics can help in investigating the complex problems such as performance issues or resource issues.

QOS

When multiple applications,or users using the same platform, QOS will prevent noisy neighbor problems for fair sharing,

Tools

Tools provide various advantages to a service. They can provide a way to access some of the internal state and also provide a way to operate on the collected information. Some times they can provide a work around to provide a immediate relief on some operational issues.

With all the issues listed above, designing resilient services need to consider and plan for operational aspects during design time

RocksDB Java Native Memory Growth

RocksDB is developed in C/C++ programming language and has a JAVA binding which provides a set of JAVA classes to access RocksDB. In a Java application, the memory is managed by JVM. Object allocations are done on JVM heap and GC manages the heap. But RocksDB allocates C/C++ object using the OS malloc library. When RocksDB is used in a Java application we need to extra careful about the allocations made by RocksDB.

For a Java process these are native memory allocations which are not on the heap. If Java process doesn't delete those objects or close the handles properly after use then they will remain and the JAVA process resident memory size (RSS) starts growing. Even application is well written and properly handling the allocations, some time depending on workload, it may need to access several SSTables (RocksDB sst files) to serve client requests. This case cause uncontrollable growth if RocksDB not tuned properly and can cause Out Of Memory and eventually lead to process crash by Kernel OOM killer. There is a lengthy discussion on the thread specified in Reference section which provide lot of details and how to monitor it and resolve it if you are experiencing this problem.

References

https://github.com/facebook/rocksdb/issues/4112

RocksDB

RocksDB is one of the popular open source embedded key value database used by several other popular systems. It is built based on LevelDB code which is developed by Google. More and more opensource as wells as commercial systems started using RocksDB due to its high performance, flexibility and tuning features. Also there are may other Databases improved or customized RocksDB based on their needs.

I am going to write posts which can help in reviewing RocksDB code. Some of the posts which I already published are here below.

RocksDB Environment

How To Debbug RocksDB Source Code

RocksDB Put Operation

RocksDB Get Operation

RocksDB Java Native Memory Growth

Please let me know if you need to know any particular area of RocksDB. Also let me know if any updates are needed to above posts.

RocksDB Put Operation

RocksDB Put operation creates a new record in the specified DB. RocksDB code is very flexible and has several levels of abstractions. If you are planning to understand the code path you can start reading from the functions specified in this post.

DB Put Operation

RocksDB stores the recently inserted data in the MemTable and periodically flushes the data to disk based SSTables (sst files). But MemTable data can be lost when power failure occurs or the application crashes. It needs to make the changes durable. It uses WAL file to provide the durability. Every write is stored in MemTable and also committed to WAL. The entry function for the Put operation is db/db_impl/db_impl_write.cc:Put() .

Write Batch

RocksDB creates a write batch even for single write. The write batch will have a set of key, value records, the count of records in it, sequence number, and a checksum. This code is in db/write_batch.cc

WAL Commit

This code is in db/db_impl/db_impl_write.cc:WriteToWAL(). It serializes the write batch in to disk storage format and and appends to the current WAL file. db/log_writer.cc:og::Writer::EmitPhysicalRecord() and file/writable_file_writer.cc:WritableFileWriter::Append()

MemTable Write

The MemTable Insert code is in WriteBatchInternal::InsertInto() function. There are different configuration option to perform MemTable write. The code is in DBImpl::WriteImpl()

RocksDB Get Operation

RocksDB Get operation retrieves the record value for a given key. RocksDB code is very flexible and has several levels of abstractions. If you are planning to understand the code path you can start reading from the functions specified in this post.

DB Get Operation

RocksDB stores the recently inserted data in the MemTable and periodically flushes the data to disk based SSTables (sst files). It needs to search both MemTable and SSTables for the given key.

The entry function for the Get operation is db_impl_readonly.cc:DBImplReadOnly::Get() .

MemTable lookup

First it searches the given key in MemTable and returns the value if it is found in it. MemTable managed using SkipList and a search is done on it. The main function for this is db/memtable.cc:MemTable::GetFromTable()

Table Lookup

If the record not found in the MemTable then it will search the key in the table cache this code is in db/table_cache.cc:TableCache::Get()

If it is not found in table cache then it searches the files at each level starting from level 0. If you use BlockBasedTable the default option the entry point is table/block_based/block_based_table_reader.cc:BlockBasedTable::Get(). It will search the block index to find the given key. It uses BlockIter::Seek() function to search. The keys are compared using the comparator function and the corresponding value is returned.

RocksDB Environment

RocksDB and LevelDB has an abstraction called Env (environment) which provides an interface to access Operating system specific functions. This abstraction is there in the traditional BerkeleyDB also. It nicely separates Database code from the OS specific functionality by encapsulating it.

The Env object provides an interface to access System time, File System API, Thread API, Synchronization primitives and Shared Library management API. The RocksDB distributions comes with few supported environments such as Posix, HDFS, Windows, etc. You can also customize some part of the environment by providing new implementation. For example you can customize FileSystem calls used by Env object with memory based implementation or you can mirror the writes to another environment.

To implement a new environment for ex: Cloud environment, you need to override the virtual functions in Env class and provide an implementation with Cloud APIs.

Posix Environment

It is the default Environment when you create a new Database. It uses various POSIX calls for proving OS level services. Below table describes the various APIs used

Service	System Calls
Time	clock_gettime(), gettimeofday()
File	fopen(), fclose(), open(), close(), fcntl(), stat(), fsync(), statvfs(), rename(), link(), unlink(), access(), pread(), pwrite()
Directory	mkdir(), rmdir(), getcwd(), opendir() and readdir()
Threads	pthread_create(), pthread_self(), pthread_mutex_init(), pthread_mutex_lock(), pthread_mutex_unlock(), pthread_join()
Libraries	dlopen(), dlclose(), dlsym() and dlerror()

If you want to understand the RocksDB source code you can debug the code using the information provided in another blog post How To Debug RocksDB Source Code.

How To Debug RocksDB Source Code

RocksDB is one of the popular open source embedded key value database used by several other popular systems. It is a derivative of LevelDB which is developed by Google. More and more opensource as wells as commercial systems started using RocksDB due to its high performance, flexibility and tuning features.

I wanted to read the source code but the code base grown over the years. I tried to search some of the function implementations and it is not very easy to find due to lot if abstraction. RocksDB site provides very good information on the architecture, internal mechanisms, configuration options and API etc. It showed a sample program in Getting Started. I wanted to use this program to debug the RocksDB code.

First we need to build the RocksDB with debug information enabled in the library. For this you can download the source code from github.


  git clone https://github.com/facebook/rocksdb.git

In order to compile the code you need to install/download dependent libraries. The best place to get this information is in INSTALL.md. In this file there are instruction for each supported OS type. Please follow and install all the required libraries. The latest code uses gcc version of 7.0 or greater. I am using CentOS 7.8. It support GCC version of 4.8.5. I need to get the gcc source code and compile it and install it to get the GCC version 7.3.0. The GCC compilation tool a while.Once the GCC installed successfully I compiled the RocksDB code.

Here is my sample program based on the Getting Started guide.

$ cat Test.cc
#include <iostream>
#include "assert.h"
#include "rocksdb/db.h"

int main(int argc, char** argv) {
    rocksdb::DB* db;
    rocksdb::Options options;
    options.create_if_missing = true;
    rocksdb::Status s =
      rocksdb::DB::Open(options, "/tmp/testdb", &db);
    assert(s.ok());
    std::string value = "Hello World";
    std::string key = "key";
    std::string value1;

    if (s.ok()) s = db->Put(rocksdb::WriteOptions(), key, value);
    if (s.ok()) s = db->Get(rocksdb::ReadOptions(), key, &value1);
    if (s.ok()) std::cout << value1 << std::endl;;

    delete db;
    return 0;
}

To compile you need to include all the dependent libraries.

g++ -v -I./include -I/usr/include -g -o Test Test.cc \
-std=gnu++17 -lzstd -lpthread -lsnappy -lbz2 -llz4 \
-lz -ldl ./librocksdb_debug.a

I used -v flag to see more information when compile is reports any errors, -g flag to include debug information for the test program. Once you compile the sample code successfully then you can start debugging the code using gdb.

The dynamic libraries linked to program are

$ ldd Test
    linux-vdso.so.1 =>  (0x00007ffd52ffb000)
    libzstd.so.1 => /usr/local/lib/libzstd.so.1 (0x00007f26d5a7a000)
    libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x00007f26d585e000)
    libsnappy.so.1 => /usr/lib64/libsnappy.so.1 (0x00007f26d5658000)
    libbz2.so.1 => /usr/lib64/libbz2.so.1 (0x00007f26d5448000)
    liblz4.so.1 => /usr/lib64/liblz4.so.1 (0x00007f26d5239000)
    libz.so.1 => /usr/lib64/libz.so.1 (0x00007f26d5023000)
    libdl.so.2 => /usr/lib64/libdl.so.2 (0x00007f26d4e1f000)
    libstdc++.so.6 => /usr/local/lib64/libstdc++.so.6 (0x00007f26d4a9d000)
    libm.so.6 => /usr/lib64/libm.so.6 (0x00007f26d479b000)
    libgcc_s.so.1 => /usr/local/lib64/libgcc_s.so.1 (0x00007f26d4584000)
    libc.so.6 => /usr/lib64/libc.so.6 (0x00007f26d41b6000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f26d5cef000)

Please let me know if you encounter any issue. I spent lot of time to figure out the things to debug this simple program. Hope this helps. Enjoy debugging the code. I also created another blog article on RocksDB environment which abstracts the operating system specifics from the DB code.

Cassandra Heap Issues

Cassandra Out Of Memory (OOM) exceptions can cause performance impact and reduce availability. The OOM exception is reported when Cassandra is unable to allocate heap space. Some times the application workload alone can cause Cassandra to quickly use all the available heap space and throw OOM exceptions. Common cause for OOM is application workload combined with background jobs such as repair or Compactions can increase heap usage and eventually lead to OOM exceptions. Some of the components and background jobs which uses high heap space are

Memtables

Cassandra by default allocates 1/4th of max heap to Memtables. It can be configured cassandra.yaml to reduce the heap usage by Memtable allocations.

# memtable_heap_space_in_mb: 2048
# memtable_offheap_space_in_mb: 2048

When Cassandra is under severe heap pressure it will start flushing aggressively which will result in smaller SSTables and the creates high background compaction activity.

Reads

CQL queries that read many columns or rows per query can consume lot of heap space. Especially when the rows are wider (contains huge number of columns per row) or when rows contains tombstones. When rows are replicated to multiple nodes, the coordinator node needs to hold the data longer to meet the consistency constraints. When rows are spread across multiple SSTables, the results need to be merges from multiple SSTables and this increases heap space usage. Cassandra need to filter out the tombstones to read the live columns from a row. The number of tombstones scanned per query also affects the heap usage. Cassandra nodetool cfstats command provides some metrics on these factors. Following sample cfstat output collected from a live cluster shows the application is using wider rows with huge number of columns and there exist huge number of tombstones per row

                Average live cells per slice (last five minutes): 8.566489361702128
                Maximum live cells per slice (last five minutes): 1109
                Average tombstones per slice (last five minutes): 1381.0478723404256
                Maximum tombstones per slice (last five minutes): 24601

Cassandra nodetool cfhistograms command also provides some metrics on the SSTables. It displays statistics on row size, column count and average number of SSTables scanned per read.

Percentile SSTables     Write Latency    Read Latency    Partition Size        Cell Count
                                      (micros)             (micros)             (bytes)
50%             1.00           51.01                 182.79               1331                      4
75%             1.00           61.21                 219.34               2299                      7
95%             3.00           73.46                 2816.16             42510                    179
98%             3.00           88.15                 4055.27             182785                  770
99%             3.00           105.78               4866.32             454826                  1916
Min              0.00           6.87                   9.89                  104                         0
Max             3.00           379.02               74975.55          44285675122         129557750

Concurrent Read threads

Cassandra uses by default 32 concurrent threads in read stage. These threads serve read queries. If there are more read requests are received then they go in to pending queue. Cassandra nodetool tpstats command ca be used to check the active and pending read stage tasks. If each read query consuming high amount of heap space then reducing this setting can help heap usage.

Repair

Cassandra repair creates Merkle trees to detect inconsistencies row data among multiple replicas. By default Cassandra creates Merkle tree till depth 15. Each Merkle tree contains 32768 internal nodes and consumes lot of heap space. Repair running concurrently with application load can cause high heap usage.

Compaction

Background compactions can also cause high heap usage. By default compaction_throughput_mb_per_sec is set to 16 MB.

Other factors

Other factors that can cause high heap usage are Cassandra Hints, Row Cache, Tombstones and space used for Index entries.

Workarounds

If enough system memory available, increasing the max heap size can help. Some time min heap and max heap set to same value. reducing the min heap size can help reduce the heap usage. If application is allocating bigger chunks of memory, increasing G1GC region size can help in reducing heap fragmentation and improve heap utilization. Some times reducing concurrent read threads also help in reducing heap usage which can impact read performance.

Cassandra compaction

Cassandra compaction is a background process which merges multiple SSTables of a Column Family to one or more SSTables to improve the read performance and to reclaim the space occupied deleted data. Cassandra supports different Compaction strategies to accommodate different workloads. In this blog post, the size tiered strategy is described.

In size tiered compaction strategy, SSTables are grouped in to different buckets based on their size. SSTables with similar sizes are grouped in to same bucket. For each bucket an average size is calculated using the sizes of SSTables in it. There are two options for a bucket, bucketLow and bucketHigh which define the bucket width. If a SSTable size falls between (bucketLow * average bucket size) and (bucketHigh * average bucket size) then it will be includes in this bucket. The average size of the bucket is updated by the newly added SSTable size. The default value for bucketLow is 0.5 and for bucketHigh is 1.5.

Next, the buckets with SSTable count greater than or equal to the minimum compaction threshold are selected. The thresholds are defined as part of Compaction settings in Column Family schema.

compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}

Finally, Cassandra selects the bucket that contains SSTables with highest read activity. For each bucket a read hotness score is assigned and the bucket with highest read hotness score is selected for compaction. Read Hotness score for a SSTable is the number of reads per second per key. A two hour read rate is used in the calculation. For a bucket, the read hotness score is the combined read hotness score of all SSTable in it.

If two buckets have same read hotness score then the bucket with smaller SSTables is selected. If the selected bucket has SSTable count more than maximum compaction threshold then the coldest SSTables are dropped from the bucket to meet the max threshold. The SSTables in a bucket are sorted in descending order on their read hotness score and the coldest SSTables will be at the end of sorted order.

Before performing compaction, the available disk space is checked against the estimated compaction output size. If the available disk space is not enough then largest SSTables are dropped from the input list of SSTables to reduce the storage space needed. During compaction, the input SSTables are scanned and the same rows from multiple SSTables are merged. Compaction can cause heavy disk I/O. To limit the disk I/O, there is a configuration setting to throttle the compaction activity compaction_throughput_mb_per_sec. The setting concurrent_compactors limits the number of simultaneous compactions at a time.

How the actual compaction is performed

A scanner object is created for each input SSTable which can sequentially iterate the partitions in it. The input SSTable scanners added to a min heap based Priority queue to merge the partitions. The partitions are consumed from the priority queue. The scanner at the root of the priority queue contains the least element in the sorted order of partitions. If multiple SSTables contain the same partition which matches with the root node of priority queue, matching partition from all those SSTables are consumes and the SSTables are advanced to point to the next partitions. Once a scanner is consumed from Priority queue, it is removed from the top and added back at the end and the data structure will re-heapify to bubble up the smallest element to the root. During compaction, deleted rows with delete timestamp older than the GC grace period (gc_before) are also deleted.

Compaction Statistics

The running compaction progress can be monitored using nodetool compactionstats command. It will display the number of pending compactions, the keyspace name and the column family that is being compacted and the remaining time of compaction.

At the end of Compaction, Cassandra stores various statistics collected during compaction in to compaction_history table. The statistics can be viewed using nodetool compactionhistory command.

$ ./bin/nodetool compactionhistory

Compaction History:
id keyspace_name columnfamily_name compacted_at bytes_in bytes_out rows_merged
f0382bb0-c692-11eb-badd-7377db2be2d0 system local 2021-06-05T23:46:36.266 10384 5117 {5:1}
Stop Active Compaction

To stop an active compation use

$ nodetool stop COMPACTION

Disable Auto Compactions

Cassandra nodetool provide a command to disable auto compactions.

$ /opt/cassandra/bin/nodetool disableautocompaction

To check the compaction status

$ /opt/cassandra/bin/nodetool compactionstats

pending tasks: 0

Tombstone Compaction

Some times we want to run a single SSTable compaction to drop the tombstones which are older than gc_grace period. It can be done using the JMX interface. Cassandra JMX interface can be accessed using jmxterm jar. The allowed tombstone threshold can be set on Column family compaction properties using alter table command. This setting is allowed only for SizeTieredCompactionStrategy type.

$ java -jar jmxterm-1.0-alpha-4-uber.jar

$>open <Cassandra PID>

$>run -b org.apache.cassandra.db:type=CompactionManager forceUserDefinedCompaction <SSTable file path>

We can chose the SSTable to perform tombstone compaction by checking the sstable metadata

$>/opt/cassandra/tools/bin/sstablemetadata <SSTable path> | grep "Estimated droppable tombstones"

Estimated droppable tombstones: 0.1

Major Compaction

By default auto compaction selects the input SSTables for compaction based on the threshold defined as part of Column family definition. To run compaction on all the SSTables for a given column family, nodetool compact command can be be used to force a major compaction. Major compaction can results in to two compaction runs one of repaired SSTables and other one is for not repaired SSTables. If tombstones and live columns spread across these two different sets of SSTables, major compaction may not be able to drop tombstones. The tool /opt/cassandra/tools/bin/sstablerepairedset can be used to reset the repairAt to 0 as to include all SSTables in single set.

Typical Problems

Compaction is causing too much high CPU usage. This could be due to many pending compactions or the throughput is low. Sometimes increasing the throughput limit (compaction_throughput_mb_per_sec) from default 16 MB to higher value can benefit. In other cases reducing the concurrent_compactors to 1 can help easing the load.