In AI inference, the KV Cache is a memory optimization technique that reduces inference costs and can improve Time to First Token (TTFT) from seconds to milliseconds. It prevents a model from re-computing everything it has already read or processed during a conversation, and when a cached prefix can be reused it lets the model skip most of the prefill stage. Without it, each new token would require re-processing every token before it, so the total computation grows quadratically with sequence length.
LLMs based on the Transformer architecture are auto-regressive: they generate output text one token at a time. To predict the next word, the model needs to process all previous words in the sequence. For example, if a user asks "How is the weather in San Francisco?", the model processes all seven words in the input prompt and predicts "It". To predict the next token, perhaps "is", it would need to reprocess "How is the weather in San Francisco? It". This generation process becomes incredibly slow as the length of the sequence increases. The solution is to cache the work already done for previous tokens.
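The savings are easy to quantify with a toy counter. This is an illustrative sketch, not a real model: it only counts how many tokens the attention stack must process to generate `n_new` tokens after a prompt of `n_prompt` tokens, with and without a KV Cache.

```python
def tokens_processed_without_cache(n_prompt: int, n_new: int) -> int:
    # Every generation step re-processes the entire sequence seen so far.
    return sum(n_prompt + i for i in range(1, n_new + 1))

def tokens_processed_with_cache(n_prompt: int, n_new: int) -> int:
    # Prefill processes the prompt once; each decode step processes one token.
    return n_prompt + n_new

print(tokens_processed_without_cache(1000, 100))  # 105050
print(tokens_processed_with_cache(1000, 100))     # 1100
```

For a 1000-token prompt and 100 generated tokens, the uncached version touches roughly 95x more tokens, and the gap widens as the sequence grows.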
During the attention phase, the LLM calculates three vectors for each token: Query (Q), Key (K), and Value (V). The Key vector describes what the token contains, while the Value vector carries the token's actual meaning in its context. The KV Cache stores these processed vectors. Instead of recalculating the K and V values for already-processed tokens every time a new word is generated, the LLM simply looks up the previously calculated Keys and Values in the cache. It only processes the latest token.
So far we have discussed the KV Cache for a single user conversation. It can also benefit multiple user conversations. KV Cache reuse works well when the context that prefixes a prompt remains the same across users. This is not always easy, but for shared content the context can be cached: for example, multiple developers working on the same code repository, or multiple students learning from the same textbook.
The popular method is matching the prefixes of two users' contexts. If User A and User B both start their prompt with the same 1000-token system instruction, the model computes that 1000-token KV Cache once, and User B gets that part for free (near-zero compute time). This is the best case. In most cases, though, there will be some differences in the context, and with pure prefix matching even a single character mismatch means the KV Cache cannot be reused beyond that point. Some advanced frameworks can split the input prompt into multiple segments and reuse the KV Cache for the matching segments even when the prefixes don't match.
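Prefix matching can be sketched as follows. The block size and block-granular reuse here are assumptions modeled loosely on how frameworks like vLLM manage the cache in fixed-size pages; the exact numbers are illustrative.

```python
BLOCK = 16  # tokens per cache block (illustrative choice)

def shared_prefix_blocks(tokens_a: list[int], tokens_b: list[int]) -> int:
    """Number of whole cache blocks reusable between two token sequences."""
    matched = 0
    for a, b in zip(tokens_a, tokens_b):
        if a != b:
            break                # first mismatch ends prefix reuse
        matched += 1
    return matched // BLOCK      # only complete matching blocks can be reused

system = list(range(1000))       # a shared 1000-token system instruction
user_a = system + [5001, 5002]   # User A's question
user_b = system + [7001]         # User B's different question
print(shared_prefix_blocks(user_a, user_b))  # 62 (1000 // 16 full blocks)
```

Note that rounding down to whole blocks means a small tail of the shared prefix may still be recomputed; the bulk of the savings comes from the long matching run of blocks.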
We use LLMs to perform common tasks like coding and solving problems. For these kinds of tasks we need to provide additional information to the LLM, called context. For example, you attach a source code file so the model can understand a function, or upload a PDF document to summarize. In a multi-turn interaction with an LLM, all the information exchanged in that chat session becomes context. The context helps the LLM provide more accurate and relevant responses to a prompt. Companies now freeze the KV Cache for popular documents (like a coding library or a textbook) so they don't have to pay to read it every time a new user asks a question. Many API providers offer context-caching discounts, charging up to 90% less for tokens that hit the cache (e.g. $0.20 vs $2.00 per 1M tokens).
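A back-of-envelope calculation shows why this matters, using the example rates from the text ($2.00 per 1M uncached input tokens, $0.20 per 1M cached tokens); the document sizes are hypothetical.

```python
UNCACHED = 2.00 / 1_000_000   # $ per fresh input token (example rate)
CACHED   = 0.20 / 1_000_000   # $ per cached input token (example rate)

def input_cost(cached_tokens: int, fresh_tokens: int) -> float:
    """Input-token cost for one request, split by cache hit/miss."""
    return cached_tokens * CACHED + fresh_tokens * UNCACHED

# A 100k-token textbook prefix is cached; each student adds a 500-token question:
print(round(input_cost(100_000, 500), 4))   # 0.021
# The same request with no cache hit:
print(round(input_cost(0, 100_500), 4))     # 0.201
```

Per request the savings is nearly 10x, and it compounds across every student who asks about the same textbook.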
As context windows have exploded (from 8k to 2M+ tokens), the KV Cache has become a primary bottleneck for AI hardware. The KV Cache lives close to the GPU in the fastest memory (HBM), but a GPU has limited HBM capacity.
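A rough size estimate shows how quickly HBM fills up. The formula counts 2 tensors (K and V) per layer per token; the model shape numbers below are assumptions loosely modeled on a Llama-2-7B-style configuration (32 layers, 32 KV heads, head dimension 128, fp16), not exact vendor specs.

```python
def kv_cache_bytes(seq_len: int, layers: int = 32, kv_heads: int = 32,
                   head_dim: int = 128, dtype_bytes: int = 2) -> int:
    """Approximate KV Cache size in bytes for one sequence."""
    # 2 accounts for storing both the K and the V tensor at every layer.
    return 2 * layers * kv_heads * head_dim * dtype_bytes * seq_len

gib = kv_cache_bytes(seq_len=4096) / 2**30
print(f"{gib:.1f} GiB")  # 2.0 GiB for a single 4096-token sequence
```

At half a megabyte per token, a handful of long-context conversations can consume tens of gigabytes of HBM, which is why techniques like grouped-query attention and cache quantization exist to shrink these numbers.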
The KV Cache is specific to each model. You cannot take a KV Cache from GPT-4 and give it to Claude 3: the Keys and Values are calculated from that specific model's internal weights, hidden dimensions, and number of attention heads.
The KV Cache has transformed the way LLMs process and generate text. By caching the already-processed tokens in a sequence, it significantly reduces inference costs and improves Time to First Token (TTFT). While it may not suit every use case, its benefits are evident in many applications, including multi-user conversations, coding, and problem solving. As context windows continue to grow, the KV Cache will remain a crucial component of AI inference.