Beyond TCP/IP: How RDMA is Powering Modern HPC and AI Infrastructure

  

In modern data centers supporting HPC and AI workloads, traditional TCP/IP has become a massive bottleneck, making RDMA fabrics the undisputed gold standard. When synchronizing billions of parameters across thousands of GPUs during Large Language Model (LLM) training, standard TCP cannot keep up due to its high protocol overhead and latency. RDMA delivers unparalleled, hardware native ultra low latency and maximum bandwidth with near zero CPU utilization, ensuring that thousands of tightly coupled compute nodes remain fully saturated rather than idling while waiting for data.

TCP was originally designed for the unpredictable, best effort public internet to guarantee that every single packet is delivered safely and in the exact right order. To ensure this reliability over unstable networks, the protocol relies on heavy underlying mechanisms like handshakes, acknowledgments, and flow control.

1. Protocol Overhead

TCP is designed as a connection oriented protocol that relies on a heavy handshake mechanism to set up and tear down connections. It requires explicit acknowledgments from the receiving end for data sent and strictly enforces in order packet delivery by buffering out of order data. For example, in a sequence of packets numbered 1 to 100, if packet 2 is delayed but the rest arrive successfully, packets 3 through 100 are blocked in the buffer waiting for packet 2. This is known as the Head of Line blocking problem.

In AI training, thousands of GPUs must constantly perform collective communication operations (like All-Reduce) to synchronize gradients. If even a single GPU experiences Head of Line blocking and is forced to wait for its buffered packets, the entire training cluster grinds to a halt. This TCP tax introduces severe tail latency into the communication fabric

2. CPU Saturation and Memory Copies

Because TCP relies heavily on the host CPU for protocol processing and data copying, its computational overhead climbs steeply as network bandwidth scales to 400 Gbps and beyond. In high demand environments like distributed AI training and NVMe storage, the OS kernel becomes a critical bottleneck. It introduces tens of microseconds of fixed latency due to frequent kernel context switches, protocol encapsulation, and multiple intermediate memory copies.

3. Congestion Control

TCP handles network traffic congestion using traditional software algorithms (like Cubic or BBR) that drastically cut transmission speeds the moment a dropped packet or congestion is detected. In a tightly coupled AI data center, these sudden, drastic drops in throughput are catastrophic for workload efficiency.

4. Redundant Software Reliability in a Lossless Fabric

TCP’s main value proposition is that it guarantees reliability over an unpredictable, lossy network. However, modern AI data centers are built with specialized InfiniBand or high end Ethernet switches where the underlying physical network is explicitly configured to be lossless.

Using specialized hardware level mechanisms (like Priority Flow Control), the network itself ensures packets are virtually never dropped due to buffer overflows. Because the underlying fabric is inherently reliable, paying the massive performance tax for TCP's software level reliability is completely redundant.

As high performance computing (HPC) and AI/ML workloads began requiring microsecond level latency and near line rate bandwidth, a new transport paradigm became mandatory. RDMA was developed to bypass the CPU and operating system entirely, offloading the transport layer to hardware and enabling direct, wire speed, memory to memory data transfers.

What is RDMA?

Remote Direct Memory Access (RDMA) extends traditional DMA capabilities across a network, enabling one server to directly read or write the memory of another server without involving either host's CPU or operating system kernel.

The History and Evolution

While the foundations of RDMA were laid in early industry patents, the practical breakthrough occurred in 1995 when Cornell University researchers introduced U-Net (User Level Network Interface). They demonstrated that a parallel supercomputer could be built using commodity servers by completely bypassing the OS kernel. By memory mapping the network interface card (NIC) directly into the application's user space, they eliminated the CPU overhead of traditional networking.

  • 2000 (InfiniBand): The InfiniBand architecture specification was released, adopting RDMA as its native, hardware driven transport protocol for high performance computing (HPC).
  • 2010 (RoCEv1): The RDMA over Converged Ethernet (RoCEv1) standard was introduced to bring RDMA to Ethernet fabrics. However, it was strictly a Layer 2 protocol, meaning it could not be routed across different network subnets.
  • 2014 (RoCEv2): The protocol was modernized by encapsulating RDMA packets inside standard UDP/IP headers. This critical evolution made RoCEv2 fully routable over standard enterprise IP networks, setting the stage for its adoption in modern AI data centers.

How RDMA Achieves Low Latency and High Bandwidth

RDMA achieves its blazing fast low latency and massive bandwidth by fundamentally changing how data travels across a network, entirely bypassing the core bottlenecks inherent to traditional TCP/IP.

1. Zero Copy Networking

In a standard TCP/IP data transfer, data is copied multiple times. It moves from the application buffer to the OS kernel buffer (via the socket API), then to the network driver, and finally to the network interface card (NIC) and repeating this entire tedious process in reverse at the destination.

RDMA eliminates these redundant intermediate copies and steps. The local network card reads data directly from the source application's memory and transmits it across the wire. The receiving RDMA enabled NIC (RNIC) then writes that data directly into the target application's memory space. Eliminating these memory copies dramatically reduces data transfer latency and frees up massive amounts of host memory bandwidth. 


 

2. Kernel Bypass

Typically, the OS kernel must handle every network request, forcing a process to constantly switch contexts between user mode and kernel mode to process network protocols. These context switches are incredibly expensive for a host CPU. RDMA allows the application to communicate directly with the RNIC hardware straight from user space. By skipping the OS kernel entirely, the time required to initiate and receive a transfer drops from milliseconds to ultra low microseconds.

3. CPU Offloading

With standard TCP/IP, the host CPU is entirely responsible for breaking data into packets, adding protocol headers, checking for errors, and managing flow control. At modern 100 Gbps to 800 Gbps speeds, a host CPU can spend 100% of its processing cycles just handling network traffic.

RDMA offloads the entire transport layer protocol stack directly onto the onboard hardware of the RNIC. Hardware level execution means packets are processed at line rate with near zero CPU utilization, leaving the host CPU completely free to compute actual AI or HPC workloads.

By turning network transfers into a hardware to hardware transaction, RDMA unlocks the true physical wire speed of modern networks. In modern RoCEv2 deployments, encapsulating these RDMA packets inside standard UDP/IP headers provides the best of both worlds, the ability to seamlessly route traffic across standard enterprise network switches, combined with hardware level, lightning fast speeds. This architectural advantages, precisely why RDMA has become the indispensable backbone of modern AI clusters, high performance computing (HPC), and next generation distributed storage networks.

The Three types of RDMA Protocol

  • InfiniBand: The native, purpose built RDMA architecture. Extremely high throughput, lowest possible latency, but requires specialized, expensive switches and cards.
  • RoCEv2 (RDMA over Converged Ethernet): Encapsulates RDMA in standard UDP/IP packets. Allows enterprises to use standard Ethernet switches, but requires careful network tuning to maintain a lossless environment.
  • iWARP: Encapsulates RDMA over standard TCP/IP. It doesn't require a lossless network and is easy to deploy, but it has slightly higher latency than RoCEv2 because it still deals with TCP's underlying characteristics.

Use Cases

Modern data centers investing millions of dollars building dedicated RDMA fabrics because certain workloads hit an absolute performance wall on standard TCP/IP networks. Here are the three primary use cases where RDMA is used.

Distributed AI Training & Large Language Models (LLMs)

Training massive frontier AI models requires thousands of interconnected GPUs working in unison. Communication libraries like NVIDIA NCCL or AMD RCCL use RDMA to orchestrate All Reduce operations, which synchronize millions of weight gradients across different server nodes during backpropagation.

By utilizing NVIDIA's GPUDirect RDMA, network interfaces bypass the host CPU and system RAM entirely. They stream gradient data directly from the High Bandwidth Memory (HBM) of one GPU to another across the fabric. This eliminates tail latency, prevents the GPUs from sitting idle, and keeps cluster utilization near 100%.

Next Generation Enterprise Storage (NVMe-oF)

Modern flash storage SSDs are so incredibly fast that the traditional network stack became the primary bottleneck when trying to share them across a data center. Storage fabrics use NVMe over Fabrics (NVMe-oF) to pool ultra fast NVMe SSDs into centralized storage arrays, connecting them to remote compute servers via RoCEv2 or InfiniBand.

NVMe-oF maps the remote storage drive's PCIe commands directly over the RDMA network layer. A remote computing node can read or write data to a centralized storage array with sub microsecond latencies, practically matching the raw speed and performance of an SSD plugged directly into its local physical motherboard.

High Frequency Trading (HFT)

In the world of electronic financial markets, a latency difference of just a few nanoseconds can mean the difference between making millions or losing a trade. Financial institutions and stock exchanges use RDMA fabrics to handle market data feeds, risk management calculations, and order execution matching engines.

By eliminating the operating system kernel and protocol encapsulation delays, RDMA enables deterministic, ultra low latency messaging. Trading firms use it to process market fluctuations and execute buy/sell orders at hardware native wire speeds, ensuring their algorithms react to market events faster than any competitor relying on traditional network infrastructure.

Technical Deployment Requirements: Building a RoCEv2 Fabric

Deploying an RDMA fabric requires a tightly coordinated ecosystem of specialized hardware, specific network configurations, and an RDMA aware software stack. While RDMA can be deployed using three distinct protocols InfiniBand, RoCEv2, or iWARP, this section focuses specifically on the infrastructure required to deploy a modern RoCEv2 ecosystem over Ethernet.

1. Hardware Requirements

Host Network Adapters (RNICs)

Standard network interface cards (NICs) cannot process hardware offloaded RDMA instructions. Servers must be equipped with specialized RDMA enabled Network Interface Cards (RNIC) featuring onboard processors to handle transport protocol execution.

Lossless Network Infrastructure

Because RoCEv2 lacks standard TCP software level retransmission safety nets, it requires a strictly lossless Ethernet network to prevent packet drops. The underlying Ethernet switches must support specifically:

  • Priority Flow Control (PFC): To pause specific classes of traffic before buffers overflow.
  • Explicit Congestion Notification (ECN): To signal and throttle traffic endpoints before congestion triggers packet drops.

Server Architecture (PCIe & NUMA alignment)

To achieve true microsecond level latency, physical hardware placement inside the server chassis matters. The RNIC must be installed in a PCIe slot directly routed to the same CPU socket (NUMA node) as the workload (or the GPUs in AI clusters). This prevents data from crossing the internal CPU interconnect bus, which introduces unwanted latency spikes.

2. Software Requirements

Operating System & Kernel Support

The host operating system must feature builtin drivers and kernel subsystems capable of managing the hardware bypass. For example, Linux operating sysgtem requires the installation of the userspace rdma-core subsystem and corresponding kernel modules. This includes vendor specific hardware drivers

The RDMA Programming Interface (API)

Applications cannot communicate using standard network sockets (IP:Port). Instead, they must interface with a specialized middleware layer:

  • InfiniBand Verbs (libibverbs): The low level API used to program RDMA operations. It handles memory registration, creates RDMA objects Queue Pairs (QPs) (Send/Receive queues), and manages Completion Queues (CQs).
  • OFED (OpenFabrics Enterprise Distribution): A unified software stack that packages RDMA drivers, core libraries, and diagnostic command line utilities.

Application Layer (RDMA Aware Software)

Finally, upper level software running actual enterprise workloads must be explicitly compiled or configured to utilize RDMA verbs rather than standard TCP sockets:

  • AI/Deep Learning: Frameworks like PyTorch or TensorFlow rely on NCCL (NVIDIA Collective Communications Library) or MPI (Message Passing Interface) explicitly toggled to communicate over RDMA to sync gradients.
  • Enterprise Storage: Storage fabrics like NVMe-oF (NVMe over Fabrics) or distributed object stores like Cloudian Hyperstore supports underlying transport layer switched from TCP to RDMA for lightening speed data transfers.

Congestion Management: How RDMA Maintains a Lossless Fabric

RDMA over Converged Ethernet (RoCEv2), solves congestion by using hardware level flow control and end to end intelligent signaling.

1. Priority Flow Control (PFC)

PFC operates at Layer 2 (Data Link Layer) inside network switches to provide lossless ethernet in a RoCEv2 environment.

When a receiving server or an intermediate switch becomes overwhelmed and its memory buffer queues begin filling to capacity, it refuses to drop incoming packets. Instead, it sends a specialized PAUSE frame backward to the upstream switch or sender. The upstream device immediately halts transmission on that specific traffic priority lane until the receiver's buffer clears, then resumes.

2. DCQCN (Data Center Quantized Congestion Notification)

Because PFC is highly disruptive, modern RoCEv2 deployments implement DCQCN at Layers 3 and 4. DCQCN acts as an early warning system designed to throttle the sending server's speed before network switches are forced to trigger a panic induced PFC PAUSE frame.

DCQCN is a hybrid algorithm that coordinates three key components:

  • The Switch (ECN Marking): As packets transit through network switches, the hardware monitors queue depths. If a queue begins building up, the switch flips a specific bit in the IP header marked Congestion Experienced (CE) but allows the packet to continue forward.
  • The Destination (CNP Generation): When the target server receives a packet with the "CE" bit enabled, it recognizes that the path is congested. The destination's RNIC immediately generates a highly urgent control packet called a Congestion Notification Packet (CNP) and sends it straight back to the source.
  • The Source (Rate Limitation): The moment the original sender's hardware receives the CNP, it mathematically scales down its packet injection rate. If the congestion clears and the sender stops receiving CNPs, an internal hardware timer gradually ramps the transmission speed back up to line rate.

The Hyperscale Wall: Why AWS and Google Built Alternatives to RDMA

While traditional RDMA is exceptionally performant in dedicated environments, its deployment and management creates severe challenges at multi-tenant cloud scale. When Amazon Web Services (AWS) and Google attempted to deploy standard RDMA (RoCEv2 and InfiniBand) across hundreds of thousands of servers, they ran into fundamental architectural roadblocks. To bypass these limitations, AWS developed Scalable Reliable Datagram (SRD) and Google created Falcon, shifting away from standard RDMA due to three critical drawbacks:

1. The Fragility of Lossless Ethernet (PFC Deadlocks)

Traditional RoCEv2 requires a perfectly lossless network enforced by Layer 2 Priority Flow Control (PFC). When congestion occurs, switches propagate PAUSE frames upstream to halt transmission, which at hyperscale creates a dangerous domino effect. A single sluggish server node can trigger a chain reaction of PAUSE frames that ripples backward through the fabric, causing catastrophic network deadlocks and PFC storms. To eliminate this operational fragility, AWS and Google abandoned lossless requirements altogether, engineering hardware driven protocols (SRD and Falcon) designed to run over standard, lossy Ethernet by handling packet drops via ultra fast retransmissions at the endpoints.

2. Inability to Multipath (ECMP Hash Collisions)

RDMA requires strict, in order packet delivery at the hardware layer, forcing an entire data stream to lock onto a single network path via Equal Cost Multi Pathing (ECMP) routing hashes. If two massive AI workloads randomly get assigned to the same physical link, that link becomes a severe bottleneck while adjacent paths sit completely empty. Standard RDMA cannot divert traffic to those empty paths because out of order delivery is treated as a fatal error. Cloud Service Providers (CSP) solve this by being natively tolerant of out of order delivery; AWS's SRD, for instance, dynamically distributes packets across all available network paths simultaneously, relying on specialized hardware at the destination to reorder them on arrival and eliminate hotspots.

3. Scale Out Connection Limits (Queue Pair Exhaustion)

Standard RDMA relies on a stateful Reliable Connection (RC) model where every server must maintain a dedicated communication channel a Queue Pair (QP) for every single node it communicates with. Network interface cards (NICs) have limited high speed, on chip cache memory (SRAM) to store these connection states. As an AI cluster scales up to tens of thousands of GPUs, the number of connection pairs explodes exponentially, causing cache exhaustion that forces the NIC to constantly fetch state data from the main server RAM. This cache misses destroy RDMA's ultra low latency benefits, prompting hyperscalers to shift to a decoupled, connectionless datagram model where a single endpoint can dynamically send data to any node without maintaining persistent connection states.

Despite the rise of cloud specific alternatives like SRD and Falcon, standard RDMA specifically InfiniBand and high performance RoCEv2 remains the undisputed gold standard for dedicated AI data centers and High Performance Computing (HPC). When training massive, multi billion parameter AI models, an optimal architecture needs absolute raw performance. Standard RDMA delivers unparalleled, hardware native ultra low latency and maximum bandwidth with zero CPU overhead, ensuring that thousands of tightly coupled GPUs or compute nodes remain fully utilized rather than waiting on data. For enterprises and research institutions building dedicated fabrics where they can tightly control the network topology, the massive performance edge of RDMA heavily outweighs its configuration complexities, cementing its role as the foundational fabric powering modern supercomputing and frontier AI development.

References


How KV Cache is revolutionizing AI inference

In AI inference, KV Cache is a memory optimization technique that reduces inference costs and improves Time to First Token (TTFT) from seconds to milliseconds. It helps in prefill stage of inference and prevents a model from re computing everything the model has read or processed during a single conversation. Without it, the computation work grows exponentially. 

LLMs based on Transformer architecture are auto-regressive. They generate output text one token at a time. To predict the next word, the model needs to process all previous words in the sequence. For example if a user ask "How is the weather in San Francisco?", the model processes all seven words in the input prompt and predicts "It". To predict the next token may be "is" it needs to reprocess "How is the weather in San Francisco? It". This generation process will become incredibly slow as the length of the prompt increases. The solution is to cache the already processed previous tokens. 

During the Attention phase, the LLM calculates three vectors Query(Q), Key(K), and Value(V) for each token. The Key vector describes what the token contains while the Value vector describes the actual meaning of the token in its context. The KV Cache stores these processed vectors. Instead of recalculating the K, V values for the tokens which have been already processed,  every time a new word is generated, the LLM simply looks up in the cache for the previously calculated key, values for the previous words. It only processes the latest token. 

So far we discussed KV Cache for a single user conversation. It can also be beneficial for multiple user conversations. KV Cache reuse will work well if the context that prefix a prompt remain the same for multiple users. This may not be easy but for shared content types the context can be cached. For example multiple developers working on the same code repository or multiple students trying to learn from the same text book.

The popular method is matching the prefix of the contexts of two users. If User A and User B both start their prompt with the same 1000 token system instruction, the model computes that 1000 token KV cache once. User B gets that part for free (near-zero compute time). This is the best case. But for most of the cases there will be some changes in the context. Even if there is a single character mismatch then the KV Cache can not be reused. Some advanced frameworks can split the input prompt into multiple srgments and reuse the KV Cache for the matching segments even if the prefixes don't match.

We use LLMs to perform common tasks like coding and solving problems. For these kind of tasks we need to provide additional information to LLM, called context. For example you attach a source code file to understand a function or upload a pdf document to summarize. In a multi turn interaction with LLM, all the information exchanged in that specific chat session becomes context. The context helps LLM to provide more accurate and relevant response to a prompt. Companies now freeze the KV Cache for popular documents (like a coding library or a textbook) so they don't have to pay to read it every time a new user asks a question. Many API providers  offer Context Caching discounts, charging up to 90% less for tokens that hit the cache ($0.20 vs $2.00 per 1M tokens).



As context windows have exploded (from 8k to 2M+ tokens), the KV Cache has become the primary bottleneck for AI hardware.The KV Cache lives close to the GPU in the fastest memory (HBM). However a GPU has limited capacity of HBM. 

KV Context is specific to each Model type. You cannot take a KV cache from GPT-4 and give it to Claude 3. The Keys and Values are calculated based on that specific model's internal weights, hidden dimensions, and number of attention heads. 

KV Cache has revolutionized the way LLMs process and generate text. By caching already processed previous tokens in a sequence, it significantly reduces inference costs and improves Time to First Token (TTFT). While it may not be suitable for all use cases, its benefits are evident in many applications, including multiple user conversations, coding, and solving problems. As context windows continue to grow, the KV Cache will remain a crucial component of AI inference. 

Designing Hyper Scalable Services

Scaling services is a hard problem but operating them at the scale is even more hard problem. You need to consider the operational aspects of running it at scale during design time. Service downtime can cause huge losses to business and leads to poor customer experience.

Operation Challenges

Workloads

Workloads can be unpredictable. Especially in case of multi-tenant systems, or multi application systems each tenant or application use case and the corresponding workload can be different. Tuning the service to one type of workload can cause severe problems to other workloads. 

Resources

Modern services are designed as Micro services. One micro service consuming high resources can cause problems to other micro services. Careful tuning and monitoring is required to operate overall service efficiently. Sometimes one machine can have high load compared to other machines due to uneven distribution of load or due to configuration issues in DNS.

Reliability

Commodity hardware has been used increasingly to deploy the services. Failures of different components is common in these environments. Detecting, self healing becomes critical to reduce the downtime. Failures can come from unexpected events also which are less frequent such as Datacenter power failure and Networking cable issues.

Security

Security should be the highest priority for any kind of service. From the beginning, you need to design a strong security posture. This includes how you authenticate, authorize the users, how do you protect the data privacy, and also how you monitor the vulnerabilities. Threats are becoming common and immediate remediation and patching needed when vulnerabilities identified. Different countries have different privacy regulations and these needs to be considered at the design time. You may need auditing capabilities also if you are providing compliance.

Performance

Faster response times are crucial for business to reduce the costs, gain competitive advantage and better user experience. It is crucial to know the bottlenecks ahead by doing some kind of back on the envelop calculations. Simulations can help understand the system at large scale and identify the limits theoratically.

Capacity

Dynamically expanding and shrinking like public clouds is critical for modern services to reduce the operation costs of service.

Management

With the popularity of public clouds customers or devops are expecting easy to manage services and operate them at scale

Upgrades

It is difficult to upgrade the service at scale and the release cycles will tend to take more time as the service is in operation longer. You need to spend more time on testing and ensure the quality for each release so that you can have less impact when a upgrade is done. The upgrading testing should be run with some real world service data and different deployment configurations

Programmability

It is hard to release a patch for every small bug or improvement. If service can provide a well defined management or administrative API then it can be used to provide a workaround for the issues till the next release

Backward Compatibility

Between releases, API interface or data formats can change. Some of the functionality available in previous releases can be deprecated but users are depending on it. It is extremely important to consider the backward compatibility during design

Hardware Replacement

Commodity hardware will have certain time of life. Replacing the hardware impacts service availability. If the service is deployed in a redundancy configuration the impact can be minimized

Logging

Application logs are very important to investigate the production issues. There should be right amount of logging in each transaction or API processing to debug the issues

Statistics

You can not log each step of the processing. Even your logs are very detailed the logs can be rolled over due to the predefined thresholds. It is better to log the statistics in to a separate log file and also to expose them through an API or command. The Statistics can help in investigating the complex problems such as performance issues or resource issues.

QOS

When multiple applications,or users using the same platform, QOS will prevent noisy neighbor problems for fair sharing,

Tools

Tools provide various advantages to a service. They can provide a way to access some of the internal state and also provide a way to operate on the collected information. Some times they can provide a work around to provide a immediate relief on some operational issues.

With all the issues listed above, designing resilient services need to consider and plan for operational aspects during design time


RocksDB Java Native Memory Growth

RocksDB is developed in C/C++ programming language and has a JAVA binding which provides a set of JAVA classes to access RocksDB. In a Java application, the memory is managed by JVM. Object allocations are done on JVM heap and GC manages the heap. But RocksDB allocates C/C++ object using the OS malloc library. When RocksDB is used in a Java application we need to extra careful about the allocations made by RocksDB. 

For a Java process these are native memory allocations which are not on the heap. If Java process doesn't delete those objects or close the handles properly after use then they will remain and the JAVA process resident memory size (RSS) starts growing.  Even application is well written and properly handling the allocations, some time depending on workload, it may need to access several SSTables (RocksDB sst files) to serve client requests. This case cause uncontrollable growth if RocksDB not tuned properly  and can cause Out Of Memory and eventually lead to process crash by Kernel OOM killer. There is a lengthy discussion on the thread specified in Reference section which provide lot of details and how to monitor it and resolve it if you are experiencing this problem.



References

https://github.com/facebook/rocksdb/issues/4112


RocksDB

RocksDB is one of the popular open source embedded key value database used by several other popular systems. It is built based on LevelDB code which is developed by Google. More and more opensource as wells as commercial systems started using RocksDB due to its high performance, flexibility and tuning features. Also there are may other Databases improved or customized RocksDB based on their needs.

I am going to write posts which  can help in reviewing RocksDB code. Some of the posts which I already published are here below.

RocksDB Environment

How To Debbug RocksDB Source Code 

RocksDB Put Operation

RocksDB Get Operation

RocksDB Java Native Memory Growth 

Please let me know if you need to know any particular area of RocksDB. Also let me know if any updates are needed to above posts.

RocksDB Put Operation

 

RocksDB Put operation creates a new record in the specified DB. RocksDB code is very flexible and has several levels of abstractions. If you are planning to understand the code path you can start reading from the functions specified in this post.

DB Put Operation
RocksDB stores the recently inserted data in the MemTable and periodically flushes the data to disk based SSTables (sst files). But MemTable data can be lost when power failure occurs or the application crashes. It needs to make the changes durable. It uses WAL file to provide the durability. Every write is stored in MemTable and also committed to WAL. The entry function for the Put operation is db/db_impl/db_impl_write.cc:Put() .

Write Batch

RocksDB creates a write batch even for single write. The write batch will have a set of key, value records, the count of records in it, sequence number, and a checksum. This code is in db/write_batch.cc
 
WAL Commit
 
This code is in db/db_impl/db_impl_write.cc:WriteToWAL(). It serializes the write batch in to disk storage format and and appends to the current WAL file. db/log_writer.cc:og::Writer::EmitPhysicalRecord() and file/writable_file_writer.cc:WritableFileWriter::Append()

MemTable Write

The MemTable Insert code is in WriteBatchInternal::InsertInto() function. There are different configuration option to perform MemTable write. The code is in DBImpl::WriteImpl()

RocksDB Get Operation

RocksDB Get operation retrieves the record value for a given key. RocksDB code is very flexible and has several levels of abstractions. If you are planning to understand the code path you can start reading from the functions specified in this post.

DB Get Operation
RocksDB stores the recently inserted data in the MemTable and periodically flushes the data to disk based SSTables (sst files). It needs to search both MemTable and SSTables for the given key. 
The entry function for the Get operation is db_impl_readonly.cc:DBImplReadOnly::Get() .

MemTable lookup
 
First it searches the given key in MemTable and returns the value if it is found in it. MemTable managed using SkipList and a search is done on it. The main function for this is db/memtable.cc:MemTable::GetFromTable()

Table Lookup

If the record not found in the MemTable then it will search the key in the table cache this code is in db/table_cache.cc:TableCache::Get()
 
If it is not found in table cache then it searches the files at each level starting from level 0.  If you use BlockBasedTable the default option the entry point is table/block_based/block_based_table_reader.cc:BlockBasedTable::Get(). It will search the block index to find the given key. It uses BlockIter::Seek() function to search. The keys are compared using the comparator function and the corresponding value is returned.


RocksDB Environment

RocksDB and LevelDB has an abstraction called Env (environment) which provides an interface to access Operating system specific functions. This abstraction is there in the traditional BerkeleyDB also. It nicely separates Database code from the OS specific functionality by encapsulating it.

The Env object provides an interface to access System time, File System API, Thread API, Synchronization primitives and Shared Library management API. The RocksDB distributions comes with few supported environments such as Posix, HDFS, Windows, etc. You can also customize some part of the environment by providing new implementation. For example you can customize FileSystem calls used by Env object with memory based implementation or you can mirror the writes to another environment. 

To implement a new environment for ex: Cloud environment, you need to override the  virtual functions in Env class and provide an implementation with Cloud APIs.

Posix Environment

It is the default Environment when you create a new Database. It uses various POSIX calls for proving OS level services. Below table describes the various APIs used

ServiceSystem Calls
 Time      
 clock_gettime(), gettimeofday()
 File fopen(), fclose(), open(), close(), fcntl(), stat(), fsync(), statvfs(), rename(), link(), unlink(),  access(), pread(), pwrite()
 Directory             
 mkdir(), rmdir(), getcwd(), opendir() and readdir()
 Threads pthread_create(), pthread_self(), pthread_mutex_init(), pthread_mutex_lock(), pthread_mutex_unlock(), pthread_join()
 Libraries  
 dlopen(), dlclose(), dlsym() and dlerror()

 

If you want to understand the RocksDB source code you can debug the code using the information provided in another blog post How To Debug RocksDB Source Code.

How To Debug RocksDB Source Code

 RocksDB is one of the popular open source embedded key value database used by several other popular systems. It is a derivative of LevelDB which is developed by Google. More and more opensource as wells as commercial systems started using RocksDB due to its high performance, flexibility and tuning features.

I wanted to read the source code but the code base grown over the years. I tried to search some of the function implementations and it is not very easy to find due to lot if abstraction. RocksDB site provides very good information on the architecture, internal mechanisms, configuration options and API etc. It showed a sample program in Getting Started. I wanted to use this program to debug the RocksDB code.

First we need to build the RocksDB with debug information enabled in the library. For this you can download the source code from github.


git clone https://github.com/facebook/rocksdb.git

In order to compile the code you need to install/download dependent libraries. The best place to get this information is in INSTALL.md. In this file there are instruction for each supported OS type. Please follow and install all the required libraries. The latest code uses gcc version of 7.0 or greater. I am using CentOS 7.8. It support GCC version of 4.8.5. I need to get the gcc source code and compile it and install it to get the GCC version 7.3.0. The GCC compilation tool a while.Once the GCC installed successfully I compiled the RocksDB code. 

 Here is my sample program based on the Getting Started guide.

 

$ cat Test.cc
#include <iostream>
#include "assert.h"
#include "rocksdb/db.h"

int main(int argc, char** argv) {
    rocksdb::DB* db;
    rocksdb::Options options;
    options.create_if_missing = true;
    rocksdb::Status s =
      rocksdb::DB::Open(options, "/tmp/testdb", &db);
    assert(s.ok());
    std::string value = "Hello World";
    std::string key = "key";
    std::string value1;

    if (s.ok()) s = db->Put(rocksdb::WriteOptions(), key, value);
    if (s.ok()) s = db->Get(rocksdb::ReadOptions(), key, &value1);
    if (s.ok()) std::cout << value1 << std::endl;;

    delete db;
    return 0;
}

To compile you need to include all the dependent libraries. 

g++ -v -I./include -I/usr/include -g -o Test Test.cc \
-std=gnu++17 -lzstd -lpthread -lsnappy -lbz2 -llz4 \
-lz -ldl ./librocksdb_debug.a

I used -v flag to see more information when compile is reports any errors, -g flag to include debug information for the test program. Once you compile the sample code successfully then you can start debugging the code using gdb.

The dynamic libraries linked to program are 

$ ldd Test
    linux-vdso.so.1 =>  (0x00007ffd52ffb000)
    libzstd.so.1 => /usr/local/lib/libzstd.so.1 (0x00007f26d5a7a000)
    libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x00007f26d585e000)
    libsnappy.so.1 => /usr/lib64/libsnappy.so.1 (0x00007f26d5658000)
    libbz2.so.1 => /usr/lib64/libbz2.so.1 (0x00007f26d5448000)
    liblz4.so.1 => /usr/lib64/liblz4.so.1 (0x00007f26d5239000)
    libz.so.1 => /usr/lib64/libz.so.1 (0x00007f26d5023000)
    libdl.so.2 => /usr/lib64/libdl.so.2 (0x00007f26d4e1f000)
    libstdc++.so.6 => /usr/local/lib64/libstdc++.so.6 (0x00007f26d4a9d000)
    libm.so.6 => /usr/lib64/libm.so.6 (0x00007f26d479b000)
    libgcc_s.so.1 => /usr/local/lib64/libgcc_s.so.1 (0x00007f26d4584000)
    libc.so.6 => /usr/lib64/libc.so.6 (0x00007f26d41b6000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f26d5cef000)

Please let me know if you encounter any issue.  I spent lot of time to figure out the things to debug this simple program. Hope this helps. Enjoy debugging the code. I also created another blog article on RocksDB environment which abstracts the operating system specifics from the DB code.

 

Cassandra Heap Issues

 Cassandra Out Of Memory (OOM) exceptions can cause performance impact and reduce availability.  The OOM exception is reported when Cassandra is unable to allocate heap space. Some times the application workload alone can cause Cassandra to quickly use all the available heap space and throw OOM exceptions. Common cause for OOM is application workload combined with background jobs such as repair or Compactions can increase heap usage and eventually lead to OOM exceptions. Some of the components and background jobs which uses high heap space are

Memtables

Cassandra by default allocates 1/4th of max heap to Memtables. It can be configured cassandra.yaml to reduce the heap usage by Memtable allocations.   

# memtable_heap_space_in_mb: 2048
# memtable_offheap_space_in_mb: 2048

When Cassandra is under severe heap pressure it will start flushing aggressively which will result in smaller SSTables and the creates high background compaction activity.

Reads

CQL queries that read many columns or rows per query can consume lot of heap space. Especially when the rows are wider (contains huge number of columns per row) or when rows contains tombstones. When rows are replicated to multiple nodes, the coordinator node needs to hold the data longer to meet the consistency constraints. When rows are spread across multiple SSTables, the results need to be merges from multiple SSTables and this increases heap space usage. Cassandra need to filter out the tombstones to read the live columns from a row. The number of tombstones scanned per query also affects the heap usage. Cassandra nodetool cfstats command provides some metrics on these factors. Following sample cfstat output collected from a live cluster shows the application is using wider rows with huge number of columns and there exist huge number of tombstones per row

                Average live cells per slice (last five minutes): 8.566489361702128
                Maximum live cells per slice (last five minutes): 1109
                Average tombstones per slice (last five minutes): 1381.0478723404256
                Maximum tombstones per slice (last five minutes): 24601

Cassandra nodetool cfhistograms command also provides some metrics on the SSTables. It displays statistics on row size, column count and average number of SSTables scanned per read.

Percentile  SSTables     Write Latency    Read Latency    Partition Size        Cell Count
                                      (micros)             (micros)             (bytes)
50%             1.00           51.01                 182.79               1331                      4
75%             1.00           61.21                 219.34               2299                      7
95%             3.00           73.46                 2816.16             42510                    179
98%             3.00           88.15                 4055.27             182785                  770
99%             3.00           105.78               4866.32             454826                  1916
Min              0.00           6.87                   9.89                  104                         0
Max             3.00           379.02               74975.55          44285675122         129557750

Concurrent Read threads

Cassandra uses by default 32 concurrent threads in read stage. These threads serve read queries. If there are more read requests are received then they go in to pending queue. Cassandra nodetool tpstats command ca be used to check the active and pending read stage tasks. If each read query consuming high amount of heap space then reducing this setting can help heap usage.

Repair 

Cassandra repair creates Merkle trees to detect inconsistencies row data among multiple replicas. By default Cassandra creates Merkle tree till depth 15. Each Merkle tree contains 32768 internal nodes and consumes lot of heap space. Repair running concurrently with application load can cause high heap usage.

Compaction

Background compactions can also cause high heap usage. By default compaction_throughput_mb_per_sec is set to 16 MB.

Other factors

Other factors that can cause high heap usage are Cassandra Hints,  Row Cache, Tombstones and  space used for Index entries. 

Workarounds

If enough system memory available, increasing the max heap size can help. Some time min heap and max heap set to same value. reducing the min heap size can help reduce the heap usage. If application is allocating bigger chunks of memory, increasing G1GC region size can help in reducing heap fragmentation and improve heap utilization. Some times reducing concurrent read threads also help in reducing heap usage which can impact read performance.