In modern data centers supporting HPC and AI workloads, traditional TCP/IP has become a major bottleneck, making RDMA fabrics the gold standard. When synchronizing billions of parameters across thousands of GPUs during Large Language Model (LLM) training, standard TCP cannot keep up because of its high protocol overhead and latency. RDMA delivers hardware-native, ultra-low latency and near-line-rate bandwidth with near-zero CPU utilization, so that thousands of tightly coupled compute nodes stay busy rather than idling while they wait for data.
TCP was originally designed for the unpredictable, best effort public internet to guarantee that every single packet is delivered safely and in the exact right order. To ensure this reliability over unstable networks, the protocol relies on heavy underlying mechanisms like handshakes, acknowledgments, and flow control.
1. Protocol Overhead
TCP is designed as a connection oriented protocol that relies on a heavy handshake mechanism to set up and tear down connections. It requires explicit acknowledgments from the receiving end for data sent and strictly enforces in order packet delivery by buffering out of order data. For example, in a sequence of packets numbered 1 to 100, if packet 2 is delayed but the rest arrive successfully, packets 3 through 100 are blocked in the buffer waiting for packet 2. This is known as the Head of Line blocking problem.
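The packet 1 to 100 example above can be reproduced with a few lines of Python. This is a minimal sketch of an in-order receive buffer, not a model of a real TCP stack; the function name `deliver_in_order` is hypothetical.

```python
def deliver_in_order(arrived, next_expected=1):
    """Return (delivered, buffered) for a receiver that releases data strictly in order."""
    arrived = set(arrived)
    delivered = []
    # Release packets only while the next expected sequence number is present.
    while next_expected in arrived:
        delivered.append(next_expected)
        next_expected += 1
    # Everything past the gap has arrived but must wait in the buffer.
    buffered = sorted(seq for seq in arrived if seq >= next_expected)
    return delivered, buffered


# Packets 1..100 were sent; everything except packet 2 has arrived.
arrived = [seq for seq in range(1, 101) if seq != 2]
delivered, buffered = deliver_in_order(arrived)
print(len(delivered), len(buffered))  # 1 98

# Once packet 2 arrives, the whole backlog is released at once.
delivered, buffered = deliver_in_order(arrived + [2])
print(len(delivered), len(buffered))  # 100 0
```

Only packet 1 reaches the application until the gap is filled, even though 98 later packets are already sitting in memory.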
In AI training, thousands of GPUs must constantly perform collective communication operations (like All-Reduce) to synchronize gradients. Because every participant has to finish before the next step can start, a single GPU that experiences Head of Line blocking and is forced to wait for its buffered packets stalls the entire training cluster. This TCP tax introduces severe tail latency into the communication fabric.
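A rough way to see why a single slow worker matters: a collective step finishes only when the slowest participant finishes, so the chance of a delayed step grows quickly with cluster size. The sketch below assumes each GPU stalls independently with the same probability, which is a simplification; the function names and the 0.1% stall probability are illustrative assumptions.

```python
def collective_step_time(worker_times):
    """A synchronous collective completes only when the slowest worker completes."""
    return max(worker_times)


def prob_step_delayed(num_gpus, p_stall):
    """Probability that at least one of num_gpus independent workers stalls in a step."""
    return 1.0 - (1.0 - p_stall) ** num_gpus


# One straggler at 9.0 ms sets the pace for everyone else at about 1 ms.
print(collective_step_time([1.0, 1.1, 0.9, 9.0]))

# Assumed per-GPU stall probability of 0.1% per step.
for n in (8, 1024, 8192):
    print(n, round(prob_step_delayed(n, 0.001), 3))
```

Under these assumptions a stall that is rare for one GPU becomes the common case once thousands of GPUs must all finish together.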
2. CPU Saturation and Memory Copies
Because TCP relies heavily on the host CPU for protocol processing and data copying, its computational overhead climbs steeply as network bandwidth scales to 400 Gbps and beyond. In high demand environments like distributed AI training and NVMe storage, the OS kernel becomes a critical bottleneck. It introduces tens of microseconds of fixed latency due to frequent kernel context switches, protocol encapsulation, and multiple intermediate memory copies.
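To put the 400 Gbps figure in perspective, the arithmetic below estimates how many packets per second a host must handle and how little time that leaves per packet. It is a back-of-the-envelope sketch that ignores Ethernet framing overhead (preamble, inter-frame gap) and segmentation offloads; the function names are hypothetical.

```python
def packets_per_second(link_gbps, packet_bytes):
    """Packets per second needed to fill the link with packets of the given size."""
    return link_gbps * 1e9 / (packet_bytes * 8)


def per_packet_budget_ns(link_gbps, packet_bytes):
    """Time available per packet in nanoseconds (bits divided by Gbit/s gives ns)."""
    return packet_bytes * 8 / link_gbps


for size in (1500, 4096, 9000):
    mpps = packets_per_second(400, size) / 1e6
    budget = per_packet_budget_ns(400, size)
    print(f"{size} B: {mpps:.1f} Mpps, {budget:.1f} ns per packet")
```

At 400 Gbps with 1500-byte packets the budget is 30 ns per packet, far less than the cost of a single kernel context switch.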
3. Congestion Control
TCP handles network congestion with software algorithms running on the host. Loss-based algorithms such as CUBIC cut the congestion window sharply the moment a dropped packet is detected, while model-based algorithms such as BBR pace traffic according to estimated bandwidth and round-trip time. In a tightly coupled AI data center, the sudden throughput drops and slow recovery that follow a loss are catastrophic for workload efficiency.
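The reaction of a loss-based sender can be sketched with a simple additive-increase, multiplicative-decrease (AIMD) loop. This is a simplified Reno-style model with an assumed halving factor; it is not CUBIC's actual window growth function and it does not model BBR. The function name and parameters are hypothetical.

```python
def aimd_trace(rtts, loss_at, start_cwnd=100.0, beta=0.5, increase=1.0):
    """Congestion window per RTT: grow by `increase`, multiply by `beta` on a loss."""
    cwnd = start_cwnd
    trace = []
    for rtt in range(rtts):
        if rtt in loss_at:
            cwnd = max(1.0, cwnd * beta)   # multiplicative decrease on loss
        else:
            cwnd += increase               # additive increase otherwise
        trace.append(cwnd)
    return trace


trace = aimd_trace(rtts=10, loss_at={5})
print(trace)
# [101.0, 102.0, 103.0, 104.0, 105.0, 52.5, 53.5, 54.5, 55.5, 56.5]
```

A single loss halves the sending window, and at one unit per round trip the sender needs dozens of round trips to climb back.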
4. Redundant Software Reliability in a Lossless Fabric
TCP’s main value proposition is that it guarantees reliability over an unpredictable, lossy network. However, modern AI data centers are built with specialized InfiniBand or high end Ethernet switches where the underlying physical network is explicitly configured to be lossless.
Using specialized hardware level mechanisms (like Priority Flow Control), the network itself ensures packets are virtually never dropped due to buffer overflows. Because the underlying fabric is inherently reliable, paying the massive performance tax for TCP's software level reliability is completely redundant.
As high performance computing (HPC) and AI/ML workloads began requiring microsecond level latency and near line rate bandwidth, a new transport paradigm became mandatory. RDMA was developed to bypass the CPU and operating system entirely, offloading the transport layer to hardware and enabling direct, wire speed, memory to memory data transfers.
What is RDMA?
Remote Direct Memory Access (RDMA) extends traditional DMA capabilities across a network, enabling one server to directly read or write the memory of another server without involving either host's CPU or operating system kernel.
The History and Evolution
While the foundations of RDMA were laid in early industry patents, the practical breakthrough occurred in 1995 when Cornell University researchers introduced U-Net (User Level Network Interface). They demonstrated that a parallel supercomputer could be built using commodity servers by completely bypassing the OS kernel. By memory mapping the network interface card (NIC) directly into the application's user space, they eliminated the CPU overhead of traditional networking.
- 2000 (InfiniBand): The InfiniBand architecture specification was released, adopting RDMA as its native, hardware driven transport protocol for high performance computing (HPC).
- 2010 (RoCEv1): The RDMA over Converged Ethernet (RoCEv1) standard was introduced to bring RDMA to Ethernet fabrics. However, it was strictly a Layer 2 protocol, meaning it could not be routed across different network subnets.
- 2014 (RoCEv2): The protocol was modernized by encapsulating RDMA packets inside standard UDP/IP headers. This critical evolution made RoCEv2 fully routable over standard enterprise IP networks, setting the stage for its adoption in modern AI data centers.
How RDMA Achieves Low Latency and High Bandwidth
RDMA achieves its blazing fast low latency and massive bandwidth by fundamentally changing how data travels across a network, entirely bypassing the core bottlenecks inherent to traditional TCP/IP.
1. Zero Copy Networking
In a standard TCP/IP data transfer, data is copied multiple times. It moves from the application buffer to the OS kernel buffer (via the socket API), then to the network driver, and finally to the network interface card (NIC). The entire process then repeats in reverse at the destination.
RDMA eliminates these redundant intermediate copies and steps. The local network card reads data directly from the source application's memory and transmits it across the wire. The receiving RDMA enabled NIC (RNIC) then writes that data directly into the target application's memory space. Eliminating these memory copies dramatically reduces data transfer latency and frees up massive amounts of host memory bandwidth.
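The difference between copying a buffer and accessing it in place can be illustrated inside a single Python process. This is only an analogy: it uses `memoryview` to share a buffer without copying and does not involve a NIC or any RDMA library.

```python
payload = bytearray(b"gradient-shard-0")

# Copy path: slicing a bytearray allocates a new object, much like an
# intermediate socket buffer holds its own copy of the data.
copied = bytes(payload[:8])

# Shared path: a memoryview exposes the same underlying bytes without copying,
# much like an RNIC reading straight from registered application memory.
view = memoryview(payload)[:8]

# Mutate the original buffer in place.
payload[0] = ord("G")

print(copied)       # the copy still holds the old bytes
print(bytes(view))  # the view reflects the update
```

The copy costs memory bandwidth and goes stale; the view costs nothing extra and always sees the application's buffer.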
2. Kernel Bypass
Typically, the OS kernel must handle every network request, forcing a process to constantly switch contexts between user mode and kernel mode to process network protocols. These context switches are expensive for a host CPU. RDMA allows the application to communicate directly with the RNIC hardware from user space. By skipping the OS kernel on the data path, the time required to initiate and receive a transfer drops from tens of microseconds to the low single-digit microseconds.
3. CPU Offloading
With standard TCP/IP, the host CPU is responsible for breaking data into packets, adding protocol headers, checking for errors, and managing flow control. At modern 100 Gbps to 800 Gbps speeds, a host can dedicate entire CPU cores to nothing but handling network traffic.
RDMA offloads the entire transport layer protocol stack directly onto the onboard hardware of the RNIC. Hardware level execution means packets are processed at line rate with near zero CPU utilization, leaving the host CPU completely free to compute actual AI or HPC workloads.
By turning network transfers into a hardware-to-hardware transaction, RDMA unlocks the true physical wire speed of modern networks. In modern RoCEv2 deployments, encapsulating these RDMA packets inside standard UDP/IP headers provides the best of both worlds: the ability to route traffic across standard enterprise network switches, combined with hardware-level speed. These architectural advantages are precisely why RDMA has become the indispensable backbone of modern AI clusters, high performance computing (HPC), and next-generation distributed storage networks.
The Three Types of RDMA Protocols
- InfiniBand: The native, purpose built RDMA architecture. Extremely high throughput, lowest possible latency, but requires specialized, expensive switches and cards.
- RoCEv2 (RDMA over Converged Ethernet): Encapsulates RDMA in standard UDP/IP packets. Allows enterprises to use standard Ethernet switches, but requires careful network tuning to maintain a lossless environment.
- iWARP: Encapsulates RDMA over standard TCP/IP. It doesn't require a lossless network and is easy to deploy, but it has slightly higher latency than RoCEv2 because it still deals with TCP's underlying characteristics.
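The RoCEv2 encapsulation in the list above can be made concrete by adding up the headers that wrap each RDMA payload. The sizes below are the commonly cited ones for an untagged Ethernet frame with IPv4 and the InfiniBand Base Transport Header (BTH); optional extended headers and VLAN tags are left out, so treat the result as an approximation. UDP destination port 4791 is the registered RoCEv2 port; the function name is hypothetical.

```python
ROCEV2_UDP_PORT = 4791  # registered UDP destination port for RoCEv2

# Per-packet framing in bytes (assumes no VLAN tag and no extended transport headers).
HEADER_BYTES = {
    "ethernet": 14,  # destination MAC, source MAC, EtherType
    "ipv4": 20,      # makes the packet routable across subnets
    "udp": 8,        # source port adds entropy for ECMP hashing
    "bth": 12,       # InfiniBand Base Transport Header
    "icrc": 4,       # invariant CRC from the InfiniBand transport
    "fcs": 4,        # Ethernet frame check sequence
}


def wire_efficiency(payload_bytes):
    """Fraction of on-wire bytes that carry RDMA payload."""
    overhead = sum(HEADER_BYTES.values())
    return payload_bytes / (payload_bytes + overhead)


for payload in (256, 1024, 4096):
    print(payload, round(wire_efficiency(payload), 4))
```

With a 4096-byte payload, framing costs under 2% of the wire, which is why large RDMA MTUs are preferred for bulk transfers.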
Use Cases
Modern data centers are investing millions of dollars in building dedicated RDMA fabrics because certain workloads hit an absolute performance wall on standard TCP/IP networks. Here are the three primary use cases where RDMA is used.
Distributed AI Training & Large Language Models (LLMs)
Training massive frontier AI models requires thousands of interconnected GPUs working in unison. Communication libraries like NVIDIA NCCL or AMD RCCL use RDMA to orchestrate All-Reduce operations, which synchronize billions of weight gradients across different server nodes during backpropagation.
By utilizing NVIDIA's GPUDirect RDMA, network interfaces bypass the host CPU and system RAM entirely. They stream gradient data directly from the High Bandwidth Memory (HBM) of one GPU to another across the fabric. This cuts tail latency, keeps the GPUs from sitting idle, and pushes cluster utilization toward 100%.
Next Generation Enterprise Storage (NVMe-oF)
Modern flash storage SSDs are so incredibly fast that the traditional network stack became the primary bottleneck when trying to share them across a data center. Storage fabrics use NVMe over Fabrics (NVMe-oF) to pool ultra fast NVMe SSDs into centralized storage arrays, connecting them to remote compute servers via RoCEv2 or InfiniBand.
NVMe-oF carries NVMe commands, which normally travel over a local PCIe bus, across the RDMA network layer. A remote compute node can read or write data on a centralized storage array while adding only microseconds of fabric latency, approaching the raw speed and performance of an SSD plugged directly into its own motherboard.
High Frequency Trading (HFT)
In the world of electronic financial markets, a latency difference of just a few nanoseconds can mean the difference between making millions or losing a trade. Financial institutions and stock exchanges use RDMA fabrics to handle market data feeds, risk management calculations, and order execution matching engines.
By eliminating the operating system kernel and protocol encapsulation delays, RDMA enables deterministic, ultra low latency messaging. Trading firms use it to process market fluctuations and execute buy/sell orders at hardware native wire speeds, ensuring their algorithms react to market events faster than any competitor relying on traditional network infrastructure.
Technical Deployment Requirements: Building a RoCEv2 Fabric
Deploying an RDMA fabric requires a tightly coordinated ecosystem of specialized hardware, specific network configurations, and an RDMA-aware software stack. While RDMA can be deployed using three distinct protocols (InfiniBand, RoCEv2, or iWARP), this section focuses specifically on the infrastructure required to deploy a modern RoCEv2 ecosystem over Ethernet.
1. Hardware Requirements
Host Network Adapters (RNICs)
Standard network interface cards (NICs) cannot process hardware offloaded RDMA instructions. Servers must be equipped with specialized RDMA enabled Network Interface Cards (RNIC) featuring onboard processors to handle transport protocol execution.
Lossless Network Infrastructure
Because RoCEv2 has no TCP-style software retransmission safety net, and its hardware loss recovery is comparatively costly, it is normally deployed on a lossless Ethernet network to prevent packet drops. The underlying Ethernet switches must support:
- Priority Flow Control (PFC): To pause specific classes of traffic before buffers overflow.
- Explicit Congestion Notification (ECN): To signal and throttle traffic endpoints before congestion triggers packet drops.
Server Architecture (PCIe & NUMA alignment)
To achieve true microsecond level latency, physical hardware placement inside the server chassis matters. The RNIC must be installed in a PCIe slot directly routed to the same CPU socket (NUMA node) as the workload (or the GPUs in AI clusters). This prevents data from crossing the internal CPU interconnect bus, which introduces unwanted latency spikes.
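On Linux, one way to check this placement is to read the NUMA node that the kernel reports for each RDMA device through sysfs. The sketch below assumes the usual `/sys/class/infiniband/<device>/device/numa_node` layout; the function name is hypothetical, and the function simply returns an empty result on hosts without RDMA devices.

```python
from pathlib import Path


def rnic_numa_nodes(sysfs_root="/sys/class/infiniband"):
    """Map each RDMA device name to the NUMA node of its PCI device (-1 if unknown)."""
    nodes = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return nodes  # no RDMA devices visible on this host
    for dev in sorted(root.iterdir()):
        numa_file = dev / "device" / "numa_node"
        try:
            nodes[dev.name] = int(numa_file.read_text().strip())
        except (OSError, ValueError):
            nodes[dev.name] = -1
    return nodes


print(rnic_numa_nodes())
```

Comparing the reported node with the NUMA node of the workload's CPU cores or GPUs shows whether traffic would have to cross the inter-socket interconnect.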
2. Software Requirements
Operating System & Kernel Support
The host operating system must include built-in drivers and kernel subsystems capable of managing the hardware bypass. For example, Linux requires the userspace rdma-core package and the corresponding kernel modules, including vendor-specific hardware drivers.
The RDMA Programming Interface (API)
Applications cannot communicate using standard network sockets (IP:Port). Instead, they must interface with a specialized middleware layer:
- InfiniBand Verbs (libibverbs): The low-level API used to program RDMA operations. It handles memory registration, creates RDMA objects such as Queue Pairs (QPs, each consisting of a Send queue and a Receive queue), and manages Completion Queues (CQs).
- OFED (OpenFabrics Enterprise Distribution): A unified software stack that packages RDMA drivers, core libraries, and diagnostic command line utilities.
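The queue-based programming model behind verbs can be sketched with a toy simulation. The classes below are hypothetical and are not the libibverbs API; they only mirror the flow that real applications follow with calls such as `ibv_post_send` and `ibv_poll_cq`: post a work request to a queue, let the hardware process it, then poll a completion queue for the result.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class WorkRequest:
    wr_id: int    # application-chosen identifier, echoed back in the completion
    opcode: str   # for example "RDMA_WRITE" or "SEND"
    length: int   # bytes to transfer


class ToyQueuePair:
    """Toy model of a send queue plus a completion queue (not a real verbs object)."""

    def __init__(self):
        self.send_queue = deque()
        self.completion_queue = deque()

    def post_send(self, wr):
        # The application only enqueues a descriptor; no data is copied here.
        self.send_queue.append(wr)

    def hardware_progress(self):
        # Stand-in for the RNIC: drain the queue and emit one completion per request.
        while self.send_queue:
            wr = self.send_queue.popleft()
            self.completion_queue.append((wr.wr_id, "SUCCESS"))

    def poll_cq(self, max_entries):
        # The application polls for finished work instead of blocking in the kernel.
        done = []
        while self.completion_queue and len(done) < max_entries:
            done.append(self.completion_queue.popleft())
        return done


qp = ToyQueuePair()
qp.post_send(WorkRequest(wr_id=1, opcode="RDMA_WRITE", length=4096))
qp.post_send(WorkRequest(wr_id=2, opcode="RDMA_WRITE", length=4096))
qp.hardware_progress()
print(qp.poll_cq(max_entries=16))  # [(1, 'SUCCESS'), (2, 'SUCCESS')]
```

The point of the model is the asynchronous split: posting work and harvesting completions are separate steps, and the transfer itself happens without the application's involvement.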
Application Layer (RDMA Aware Software)
Finally, upper level software running actual enterprise workloads must be explicitly compiled or configured to utilize RDMA verbs rather than standard TCP sockets:
- AI/Deep Learning: Frameworks like PyTorch or TensorFlow rely on NCCL (NVIDIA Collective Communications Library) or MPI (Message Passing Interface) explicitly toggled to communicate over RDMA to sync gradients.
- Enterprise Storage: Storage fabrics like NVMe-oF (NVMe over Fabrics) or distributed object stores like Cloudian HyperStore support switching the underlying transport layer from TCP to RDMA for lightning-fast data transfers.
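For the AI case, NCCL's transport selection is commonly steered through environment variables set before the training framework initializes its process group. The sketch below uses documented NCCL variable names, but the values (device name `mlx5_0`, GID index `3`) are assumptions that depend entirely on the host and must be checked against the local RDMA configuration; the helper function is hypothetical.

```python
import os


def rdma_nccl_env(hca="mlx5_0", gid_index="3"):
    """Build NCCL settings that prefer the RDMA transport (values are site-specific)."""
    return {
        "NCCL_IB_DISABLE": "0",          # keep the InfiniBand/RoCE transport enabled
        "NCCL_IB_HCA": hca,              # which RDMA device(s) NCCL may use
        "NCCL_IB_GID_INDEX": gid_index,  # GID table entry used for RoCE addressing
        "NCCL_DEBUG": "INFO",            # log which transport NCCL actually selected
    }


# Apply before the framework creates its NCCL communicator.
os.environ.update(rdma_nccl_env())
print({k: os.environ[k] for k in rdma_nccl_env()})
```

With debug logging enabled, the NCCL startup output indicates whether communication went over the RDMA transport or fell back to sockets.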
Congestion Management: How RDMA Maintains a Lossless Fabric
RDMA over Converged Ethernet (RoCEv2) solves congestion by using hardware-level flow control and intelligent end-to-end signaling.
1. Priority Flow Control (PFC)
PFC operates at Layer 2 (Data Link Layer) inside network switches to provide lossless Ethernet in a RoCEv2 environment.
When a receiving server or an intermediate switch becomes overwhelmed and its memory buffer queues begin filling to capacity, it refuses to drop incoming packets. Instead, it sends a specialized PAUSE frame backward to the upstream switch or sender. The upstream device immediately halts transmission on that specific traffic priority lane until the receiver's buffer clears, then resumes.
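The pause and resume behavior described above can be sketched as a queue with two thresholds. This is a simplified single-priority model with hypothetical names: it ignores packets already in flight when the PAUSE is sent (real switches reserve headroom for those), and traffic offered while paused is simply held back by the sender rather than dropped.

```python
def simulate_pfc(arrivals, drain_per_tick, xoff, xon):
    """Return per-tick (queue_depth, paused) for one priority queue."""
    queue = 0
    paused = False
    history = []
    for offered in arrivals:
        # While paused, the upstream sender holds its traffic instead of sending it.
        if not paused:
            queue += offered
        queue = max(0, queue - drain_per_tick)
        if not paused and queue >= xoff:
            paused = True    # receiver emits a PAUSE for this priority
        elif paused and queue <= xon:
            paused = False   # buffer has drained, sender may resume
        history.append((queue, paused))
    return history


history = simulate_pfc(arrivals=[10] * 8, drain_per_tick=4, xoff=20, xon=8)
for tick, (depth, paused) in enumerate(history):
    print(tick, depth, "PAUSED" if paused else "sending")
```

The queue climbs to the XOFF threshold, the sender stops, and transmission resumes only after the queue has drained to the XON threshold, so nothing is ever dropped for lack of buffer space.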
2. DCQCN (Data Center Quantized Congestion Notification)
Because PFC is highly disruptive, modern RoCEv2 deployments implement DCQCN at Layers 3 and 4. DCQCN acts as an early warning system designed to throttle the sending server's speed before network switches are forced to trigger a panic induced PFC PAUSE frame.
DCQCN is a hybrid algorithm that coordinates three key components:
- The Switch (ECN Marking): As packets transit network switches, the hardware monitors queue depths. If a queue begins building up, the switch sets the ECN field in the IP header to Congestion Experienced (CE) but allows the packet to continue forward.
- The Destination (CNP Generation): When the target server receives a packet with the "CE" bit enabled, it recognizes that the path is congested. The destination's RNIC immediately generates a highly urgent control packet called a Congestion Notification Packet (CNP) and sends it straight back to the source.
- The Source (Rate Limitation): The moment the original sender's hardware receives the CNP, it mathematically scales down its packet injection rate. If the congestion clears and the sender stops receiving CNPs, an internal hardware timer gradually ramps the transmission speed back up to line rate.
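The sender-side reaction in the three steps above can be sketched numerically. The update rules below follow the commonly published description of DCQCN (cut the rate by a factor tied to a congestion estimate `alpha`, then recover toward the previous rate), but this is a simplified sketch: it omits the later additive and hyper increase stages, the timers, and the byte counters, and the parameter `g` and function names are illustrative assumptions.

```python
def dcqcn_on_cnp(rate, alpha, g=1 / 256):
    """React to a CNP: remember the old rate as the target, cut the rate, raise alpha."""
    target = rate
    rate = rate * (1 - alpha / 2)
    alpha = (1 - g) * alpha + g
    return rate, target, alpha


def dcqcn_alpha_decay(alpha, g=1 / 256):
    """With no CNP in the last interval, the congestion estimate decays."""
    return (1 - g) * alpha


def dcqcn_fast_recovery(rate, target):
    """With no CNP, move halfway back toward the rate used before the cut."""
    return (target + rate) / 2


# Start at 100 Gbps with alpha = 1, receive one CNP, then see five quiet intervals.
rate, target, alpha = dcqcn_on_cnp(rate=100.0, alpha=1.0)
trace = [rate]
for _ in range(5):
    rate = dcqcn_fast_recovery(rate, target)
    alpha = dcqcn_alpha_decay(alpha)
    trace.append(rate)
print(trace)  # [50.0, 75.0, 87.5, 93.75, 96.875, 98.4375]
```

The cut is immediate, while the recovery is gradual and stays below the pre-congestion rate until the path has stayed quiet for several intervals.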
The Hyperscale Wall: Why AWS and Google Built Alternatives to RDMA
While traditional RDMA is exceptionally performant in dedicated environments, its deployment and management create severe challenges at multi-tenant cloud scale. When Amazon Web Services (AWS) and Google attempted to deploy standard RDMA (RoCEv2 and InfiniBand) across hundreds of thousands of servers, they ran into fundamental architectural roadblocks. To bypass these limitations, AWS developed Scalable Reliable Datagram (SRD) and Google created Falcon, shifting away from standard RDMA due to three critical drawbacks:
1. The Fragility of Lossless Ethernet (PFC Deadlocks)
Traditional RoCEv2 requires a perfectly lossless network enforced by Layer 2 Priority Flow Control (PFC). When congestion occurs, switches propagate PAUSE frames upstream to halt transmission, which at hyperscale creates a dangerous domino effect. A single sluggish server node can trigger a chain reaction of PAUSE frames that ripples backward through the fabric, causing catastrophic network deadlocks and PFC storms. To eliminate this operational fragility, AWS and Google abandoned lossless requirements altogether, engineering hardware driven protocols (SRD and Falcon) designed to run over standard, lossy Ethernet by handling packet drops via ultra fast retransmissions at the endpoints.
2. Inability to Multipath (ECMP Hash Collisions)
RDMA requires strict, in-order packet delivery at the hardware layer, forcing an entire data stream to lock onto a single network path chosen by the Equal Cost Multi Pathing (ECMP) routing hash. If two massive AI workloads happen to be hashed onto the same physical link, that link becomes a severe bottleneck while adjacent paths sit completely empty. Standard RDMA cannot divert traffic to those empty paths because out-of-order delivery is treated as an error that triggers retransmission. The hyperscalers' transports solve this by natively tolerating out-of-order delivery; AWS's SRD, for instance, dynamically distributes packets across all available network paths simultaneously, relying on specialized hardware at the destination to reorder them on arrival and eliminate hotspots.
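The hotspot effect can be sketched by hashing a handful of flows onto a set of equal-cost links and comparing the result with ideal per-packet spraying. The hash below (CRC32 over the 5-tuple) is a stand-in for whatever function a real switch uses, and the addresses, ports, and rates are made up for illustration.

```python
import zlib


def ecmp_link(flow, num_links):
    """Pick a link by hashing the flow's 5-tuple (stand-in for a switch ECMP hash)."""
    key = "|".join(str(field) for field in flow).encode()
    return zlib.crc32(key) % num_links


def per_flow_load(flows, num_links):
    """Each flow is pinned to one link, so its full rate lands on that link."""
    load = {link: 0.0 for link in range(num_links)}
    for flow, gbps in flows:
        load[ecmp_link(flow, num_links)] += gbps
    return load


def sprayed_load(flows, num_links):
    """Ideal packet spraying: every flow is spread evenly over all links."""
    total = sum(gbps for _, gbps in flows)
    return {link: total / num_links for link in range(num_links)}


# Eight hypothetical 100 Gbps RoCEv2 flows (src, dst, protocol, src port, dst port).
flows = [(("10.0.0.%d" % i, "10.0.1.%d" % i, "udp", 49152 + i, 4791), 100.0)
         for i in range(8)]

print("per-flow ECMP:", per_flow_load(flows, 8))
print("packet spray: ", sprayed_load(flows, 8))
```

Whenever two flows hash to the same link, that link carries 200 Gbps or more while another sits idle; spraying keeps every link at the average.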
3. Scale Out Connection Limits (Queue Pair Exhaustion)
Standard RDMA relies on a stateful Reliable Connection (RC) model where every server must maintain a dedicated communication channel, a Queue Pair (QP), for every single node it communicates with. Network interface cards (NICs) have limited high-speed, on-chip cache memory (SRAM) to store these connection states. As an AI cluster scales up to tens of thousands of GPUs, the number of connections each NIC must track grows with the size of the cluster and the cluster-wide total grows quadratically, causing cache exhaustion that forces the NIC to constantly fetch state data from the main server RAM. These cache misses destroy RDMA's ultra-low latency benefits, prompting hyperscalers to shift to a decoupled, connectionless datagram model where a single endpoint can dynamically send data to any node without maintaining persistent connection states.
Despite the rise of cloud-specific alternatives like SRD and Falcon, standard RDMA (specifically InfiniBand and high-performance RoCEv2) remains the gold standard for dedicated AI data centers and High Performance Computing (HPC). When training massive, multi-billion-parameter AI models, an optimal architecture needs absolute raw performance. Standard RDMA delivers hardware-native, ultra-low latency and maximum bandwidth with near-zero CPU overhead, ensuring that thousands of tightly coupled GPUs or compute nodes remain fully utilized rather than waiting on data. For enterprises and research institutions building dedicated fabrics where they can tightly control the network topology, the performance edge of RDMA outweighs its configuration complexities, cementing its role as the foundational fabric powering modern supercomputing and frontier AI development.
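The growth in connection state can be estimated with simple arithmetic. The sketch below assumes a full mesh in which every local process holds one RC queue pair to every remote process; the 512 bytes of state per QP is a made-up placeholder, since real per-QP context sizes vary by NIC. Function names are hypothetical.

```python
def rc_qps_per_node(num_nodes, procs_per_node=1):
    """QPs one node holds when each local process connects to every remote process."""
    return procs_per_node * procs_per_node * (num_nodes - 1)


def rc_qps_cluster(num_nodes, procs_per_node=1):
    """Total QP endpoints across the cluster (grows with the square of num_nodes)."""
    return num_nodes * rc_qps_per_node(num_nodes, procs_per_node)


def qp_state_bytes(qp_count, bytes_per_qp=512):
    """Connection state to keep hot on the NIC (bytes_per_qp is an assumed figure)."""
    return qp_count * bytes_per_qp


for nodes in (100, 1000, 10000):
    per_node = rc_qps_per_node(nodes, procs_per_node=8)
    mib = qp_state_bytes(per_node) / 2**20
    print(f"{nodes} nodes: {per_node} QPs per node, about {mib:.1f} MiB of state, "
          f"{rc_qps_cluster(nodes, 8)} QP endpoints cluster-wide")
```

Under these assumptions a node in a 10,000-node cluster would need hundreds of megabytes of connection state, far more than fits in on-chip NIC memory.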

