NVIDIA’s Spectrum-X Enhances AI Storage Bandwidth by Up to 48%
Feb. 6, 2025 -- AI factories rely on more than just compute fabrics. While the East-West network connecting the GPUs is critical to AI application performance, the storage fabric—connecting high-speed storage arrays—is equally important. Storage performance plays a key role across several stages of the AI lifecycle, including training checkpointing, inference techniques such as retrieval-augmented generation (RAG), and more.
To address these demands, NVIDIA and the storage ecosystem are extending the NVIDIA Spectrum-X networking platform to the data storage fabric, bringing higher performance and faster time to AI. Because Spectrum-X adaptive routing mitigates flow collisions and increases effective bandwidth, storage performance is much higher than with standard RoCE v2, the Ethernet networking protocol used by a majority of data centers for AI compute and storage fabrics.
Spectrum-X increased read bandwidth by up to 48% and write bandwidth by up to 41%. This added bandwidth translates to faster completion of the storage-dependent steps of AI workflows, leading to faster job completion times for training and lower inter-token latency for inference.
Key Storage Partners Integrate Spectrum-X
As AI workloads grow in scale and complexity, storage solutions must evolve to keep pace with the demands of modern AI factories. Leading storage vendors, including DDN, VAST Data, and WEKA, are partnering with NVIDIA to integrate and optimize their solutions for Spectrum-X, bringing cutting-edge capabilities to AI storage fabrics.
Spectrum-X Impact at Scale with Israel-1 Supercomputer
NVIDIA built Israel-1, a generative AI supercomputer, to optimize Spectrum-X performance and to provide a pretested, validated blueprint that simplifies the deployment of AI fabrics. This makes Israel-1 an ideal test bed for measuring how Spectrum-X affects storage workloads, showcasing the impact of the network on storage performance under real-world supercomputer operating conditions.
To see the impact of Spectrum-X on the storage network, the Israel-1 team measured the read and write bandwidth achieved by NVIDIA HGX H100 GPU server clients accessing the storage. The test (using the Flexible I/O Tester, or fio, benchmark) was performed once with the network configured as a standard RoCE v2 fabric, and then re-run with the adaptive routing and congestion control from Spectrum-X turned on.
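For illustration, a minimal fio read test of this kind could be driven from Python as sketched below. All job parameters (block size, queue depth, file size, mount path) are hypothetical placeholders, not the actual Israel-1 benchmark configuration.

```python
import subprocess

# Hypothetical fio sequential-read job; swap --rw=read for --rw=write
# to measure the write direction. Parameters are illustrative only.
result = subprocess.run(
    [
        "fio",
        "--name=seq-read",
        "--rw=read",
        "--ioengine=libaio",         # asynchronous I/O
        "--direct=1",                # bypass the page cache
        "--bs=1M",                   # 1 MiB block size
        "--iodepth=32",              # outstanding I/Os per job
        "--numjobs=8",               # parallel workers per client
        "--size=10G",                # data set per worker
        "--directory=/mnt/storage",  # hypothetical storage-array mount
        "--group_reporting",
        "--output-format=json",
    ],
    capture_output=True, text=True, check=True,
)
# Aggregate read bandwidth (KiB/s) appears under jobs[0]["read"]["bw"].
print(result.stdout)
```

Comparing the aggregate bandwidth of a RoCE v2 run against a run with adaptive routing and congestion control enabled yields the improvement figures below.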
These tests were run with different numbers of GPU servers as clients, representing from 40 up to 800 GPUs. In every case, Spectrum-X performed better. Read bandwidth improvements ranged from 20% to 48%, and write bandwidth improvements from 9% to 41%. These results are comparable to the speedups achieved by ecosystem partners DDN, VAST Data, and WEKA.
Storage Network Performance Is Critical to AI Performance
To see why Spectrum-X makes such a difference, it helps to consider why storage is a factor for AI. AI performance is not simply a function of large language model (LLM) step completion time, as many other factors are involved. For instance, because model training often takes days, weeks, or months to complete, it makes sense to checkpoint, or save, the partially trained models to storage mid-training, typically every few hours. This means that, in the event of a system outage, the training progress is not lost.
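In PyTorch, for example, the periodic save amounts to serializing model and optimizer state to the storage fabric. The mount point, file naming, and checkpoint cadence below are illustrative assumptions; large-scale training frameworks typically shard this state across many ranks and files.

```python
import torch

def save_checkpoint(model, optimizer, step, path="/mnt/checkpoints"):
    """Persist training state so a failure costs at most one interval."""
    torch.save(
        {
            "step": step,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
        },
        f"{path}/ckpt_{step:08d}.pt",  # hypothetical file layout
    )

# Inside the training loop, checkpoint on a fixed cadence, e.g.:
# if step % checkpoint_interval == 0:
#     save_checkpoint(model, optimizer, step)
```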
With billion- and trillion-parameter models, these checkpoint states become large enough (up to several terabytes of data for today's largest LLMs) that saving or restoring them generates "elephant flows": large bursts of data that can overwhelm switch buffers and links. The network has to keep utilization high so these bursts do not stall the training workload.
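Back-of-the-envelope arithmetic shows why. Assuming mixed-precision Adam training, a common rule of thumb is roughly 14 bytes of checkpoint state per parameter (2-byte fp16 weights, a 4-byte fp32 master copy, and two 4-byte fp32 optimizer moments); the model sizes below are illustrative.

```python
# Illustrative checkpoint-size estimate for mixed-precision Adam training.
BYTES_PER_PARAM = 2 + 4 + 4 + 4  # fp16 weights + fp32 master + two moments

for params in (70e9, 175e9, 1e12):  # hypothetical model sizes
    terabytes = params * BYTES_PER_PARAM / 1e12
    print(f"{params / 1e9:>6.0f}B params -> ~{terabytes:.1f} TB checkpoint")

# ~1 TB at 70B, ~2.5 TB at 175B, ~14 TB at 1T parameters: bursts this
# size are the elephant flows the storage fabric must absorb.
```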
RAG is another instance where the storage fabric can make or break workload performance. With RAG, an LLM is combined with a constantly growing knowledge base that adds domain-specific context to the model, providing better responses without requiring additional model training or fine-tuning. RAG works by embedding the additional content, or knowledge, in a vector database, making it a searchable knowledge base.
When an inference prompt comes in, it is parsed and embedded, and the database is searched; the retrieved content adds context to the prompt to help the LLM formulate the best possible answer. Vector databases are high-dimensional and can be quite large, especially for knowledge bases consisting of images and videos.
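The retrieval step itself can be sketched with plain NumPy cosine similarity over a toy in-memory store; a real deployment would use a learned embedding model and a dedicated vector database, both stubbed out here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for embeddings that would normally come from an
# embedding model and live in a vector database on the storage fabric.
doc_vectors = rng.normal(size=(100_000, 768)).astype(np.float32)
doc_vectors /= np.linalg.norm(doc_vectors, axis=1, keepdims=True)

def retrieve(query_vector: np.ndarray, k: int = 5) -> np.ndarray:
    """Return the indices of the k most similar documents."""
    q = query_vector / np.linalg.norm(query_vector)
    scores = doc_vectors @ q              # cosine similarity per document
    return np.argsort(scores)[-k:][::-1]  # best matches first

query = rng.normal(size=768).astype(np.float32)
print(retrieve(query))  # retrieved documents become prompt context
```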
These databases are connected to the inference nodes through the storage fabric, and the network has to provide quick communication to keep latencies minimal. This becomes especially important in the case of multitenant generative AI factories, where the number of queries per second is massive.
Applying Adaptive Routing and Congestion Control to Storage
The Spectrum-X platform introduced key innovations adapted from InfiniBand, such as RoCE Adaptive Routing and RoCE Congestion Control. By applying these innovations to the storage fabric, NVIDIA improves both performance and network utilization for storage workloads.
Adaptive Routing
To eliminate collisions between elephant flows, such as the bursts created during checkpointing, adaptive routing dynamically load balances traffic packet by packet. Spectrum-4 Ethernet switches select the least congested path based on real-time congestion data. Because the packets are sprayed across the network, they may arrive at the destination out of order, which under legacy Ethernet would require many packets to be retransmitted.
With Spectrum-X, the SuperNIC or data processing unit (DPU) in the destination host knows the correct order of the packets, placing them in order in the host memory and keeping the adaptive routing transparent to the application. This enables higher fabric utilization for higher effective bandwidth and predictable, consistent outcomes for checkpoint, data fetching, and more.
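Conceptually, the receiver-side logic behaves like the following sketch: packets carry sequence numbers, out-of-order arrivals are held briefly, and data is delivered to host memory strictly in order. This is a simplified illustration, not the actual SuperNIC implementation, which does this in hardware at line rate.

```python
def deliver_in_order(packets, total):
    """Reassemble packets that arrive out of order after per-packet spraying.

    `packets` is an iterable of (sequence_number, payload) in arrival order.
    """
    pending = {}  # out-of-order arrivals held until their turn
    next_seq = 0
    ordered = []
    for seq, payload in packets:
        pending[seq] = payload
        while next_seq in pending:  # drain every in-order run
            ordered.append(pending.pop(next_seq))
            next_seq += 1
    assert next_seq == total, "a gap here would trigger retransmission"
    return b"".join(ordered)

# Packets 0..3 sprayed across different paths arrive as 2, 0, 3, 1 but
# land in host memory in order, transparently to the application.
print(deliver_in_order([(2, b"c"), (0, b"a"), (3, b"d"), (1, b"b")], total=4))
```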
Congestion Control
Checkpoints and other storage operations often result in incast congestion, also known as many-to-one congestion, which can occur when multiple clients attempt to write to a single storage node. Spectrum-X addresses this with telemetry-based congestion control: hardware-generated telemetry from the switch tells the SuperNIC or DPU to slow the sender's data injection rate (that is, its RDMA writes and reads). This prevents congestion hot spots, which can propagate backward and unfairly impact neighboring jobs or processes.
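The control loop can be pictured as the sketch below: switch telemetry reports buffer occupancy, and the sender backs off multiplicatively under congestion and recovers additively otherwise. This is a generic AIMD-style illustration, not the actual Spectrum-X algorithm; all constants are assumptions.

```python
def adjust_injection_rate(rate_gbps, queue_occupancy, threshold=0.7,
                          max_rate_gbps=400.0, step_gbps=5.0, backoff=0.5):
    """Conceptual telemetry-driven rate control (AIMD-style sketch).

    `queue_occupancy` is the switch buffer fill level (0.0-1.0) reported
    via telemetry; all constants are illustrative, not Spectrum-X values.
    """
    if queue_occupancy > threshold:
        # Incast building up: back off before the hot spot spreads
        # to neighboring jobs.
        return rate_gbps * backoff
    # Fabric is healthy: recover gradually toward line rate.
    return min(rate_gbps + step_gbps, max_rate_gbps)

rate = 400.0
for occupancy in (0.2, 0.9, 0.8, 0.3, 0.3):  # sample telemetry readings
    rate = adjust_injection_rate(rate, occupancy)
    print(f"queue={occupancy:.1f} -> inject at {rate:.1f} Gb/s")
```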
Resiliency Enhancements
Because AI factories often consist of very large numbers of switches, cables, and transceivers, and any downed link can cause an outsized drop in network performance, network resiliency is crucial to maintaining healthy infrastructure. Spectrum-X global adaptive routing enables optimal and quick reconvergence around link outages, keeping the storage fabric well utilized.
Integration with the NVIDIA Stack
In addition to the innovations brought to the storage fabric from Spectrum-X, NVIDIA offers and recommends the use of several SDKs, libraries, and software offerings to accelerate the storage to GPU data path. These include but are not limited to the following:
- NVIDIA Air: A cloud-based network simulation tool for modeling switches, SuperNICs, and storage, accelerating Day 0, 1, and 2 storage fabric operations.
- NVIDIA Cumulus Linux: A network operating system built around automation and APIs, ensuring smooth operations and management at scale.
- NVIDIA DOCA: The SDK for NVIDIA SuperNICs and DPUs, unlocking unmatched programmability and performance for storage, security, and more.
- NVIDIA NetQ: A network validation toolset that integrates with switch telemetry to provide real-time visibility of the fabric.
- NVIDIA GPUDirect Storage: A technology that enables a direct data path between storage and GPU memory, making data transfer more efficient (see the sketch following this list).
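As one example of this data path, GPUDirect Storage can be exercised from Python through the RAPIDS kvikio bindings. The file path and buffer size below are hypothetical, and kvikio falls back to a bounce-buffer path on systems without GPUDirect Storage.

```python
import cupy as cp
import kvikio

# Read a file directly into GPU memory, skipping the CPU bounce buffer.
gpu_buffer = cp.empty(1_000_000, dtype=cp.float32)

with kvikio.CuFile("/mnt/storage/embeddings.bin", "r") as f:  # hypothetical path
    f.read(gpu_buffer)  # DMA from storage into GPU memory

print(gpu_buffer[:4])
```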
Get Started with Spectrum-X
As models get bigger and data becomes more multimodal, storage will continue to be a crucial element of the training and operationalization of generative AI. For more information, read the NVIDIA white paper, Optimizing AI Storage Fabrics: NVIDIA Spectrum-X Accelerates AI Storage Networks.
Check out the Storage Innovations for AI Workloads session at NVIDIA GTC 2025 for even more news in this exciting space.
Source: Taylor Allison, NVIDIA