Covering Scientific & Technical AI | Wednesday, January 29, 2025

Nvidia Shows Renewed Interest in Top500 List, but Not for GPUs 

Nvidia is showing keen interest in the Top500 list and says it will place more systems with its GPUs and technologies on it. However, the aim is to lend credibility to its networking technologies, not to prove the performance of its GPUs.

Two supercomputing systems built with Nvidia's Spectrum-X networking are debuting in the top 50 of the latest Top500 list.

The two systems are Nvidia's own Israel-1, which benchmarks at 42 petaflops, and a 38-petaflop system from GMO Internet Group.

Dell built both systems with Nvidia's BlueField-3 data processing unit and Spectrum-X800 switch. The 42-petaflop Israel-1 uses 936 H100 GPUs, while the 38-petaflop GMO Internet Group system uses 768 H200 GPUs.

“This will be the first of many to come,” said Dion Harris, Nvidia's director of accelerated computing.

With its latest Hopper and Blackwell GPUs, Nvidia primarily backs InfiniBand as a high-bandwidth, low-jitter networking interface for AI and HPC systems. Spectrum-X allows wider scale-out of systems beyond the NVLink interface, Harris said.

Harris said that Colossus, xAI's system with 100,000 Nvidia H100 GPUs, was built on Spectrum-X.

The Spectrum-X Ethernet interconnect achieves “an impressive 95% of theoretical data throughput compared to just 60% for traditional Ethernet,” Harris said.

Harris said the system also maintained zero latency degradation and suffered no packet loss from flow collisions across three tiers of the network fabric.
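As a back-of-the-envelope illustration of what those efficiency figures mean in delivered bandwidth, here is a quick calculation; the 400 Gb/s per-port line rate is an assumed example for the sketch, not a figure from Nvidia:

```python
# Effective throughput implied by the quoted efficiencies, for an assumed
# (hypothetical) 400 Gb/s Ethernet port; only the percentages come from Nvidia.
line_rate_gbps = 400.0

spectrum_x_efficiency = 0.95    # "95% of theoretical data throughput"
traditional_efficiency = 0.60   # "just 60% for traditional Ethernet"

spectrum_x_goodput = line_rate_gbps * spectrum_x_efficiency    # 380.0 Gb/s
traditional_goodput = line_rate_gbps * traditional_efficiency  # 240.0 Gb/s

print(f"Spectrum-X: {spectrum_x_goodput:.0f} Gb/s vs "
      f"traditional Ethernet: {traditional_goodput:.0f} Gb/s")
```

On that assumed port speed, the quoted gap works out to roughly 1.6x more delivered bandwidth per link.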

“Thus far, X has been thrilled with system performance,” Harris said.

What This Means for the Top500

Networking is table stakes in driving AI performance, and Nvidia is out to prove its Ethernet technologies work in large-scale installations.

Three of the top five systems (Frontier, Aurora, and LUMI) use HPE's Slingshot interconnect. Nvidia's InfiniBand interconnect is in the third-placed Eagle and in the supercomputers ranked seventh through tenth.

Last year, the Top500 organizers expressed concern over the steady decline, since 2017, in the number of new systems entering the list. Growth in the average performance of listed systems has also slowed over the last couple of years.

The slowdown also stems from architectural limitations and the difficulty of growing system sizes further. In this case, Nvidia wasn't chasing the performance prize; it submitted systems outside the top 10 to prove out its networking technology.

The Top500 also needs new submissions because the systems on the list are aging: the average age of Top500 systems doubled to 30 months in 2023 from about 15 months in 2018-2019.

AI is making its way into scientific computing workloads, and many sessions at Supercomputing 2024 will discuss mixed-precision benchmarking.

Blackwell

Blackwell is progressing smoothly after some design hiccups earlier this year. Nvidia said many partners will announce Blackwell servers at SC2024.

Blackwell is the successor to the successful Hopper GPU, which is in high demand. Cloud providers are building new AI data centers with Nvidia’s Hopper GPUs, and while Blackwell is significantly faster, it generates more heat.

“The rollout of Blackwell is progressing smoothly thanks to the reference architecture enabling partners to quickly bring products to market while adding their own customizations,” Harris said.

This year, the company announced the GB200 NVL4, a flexible server that hosts up to four Nvidia Blackwell GPUs connected to two Grace CPUs.

It is designed for AI and HPC workloads. The server is a successor to the quad-GPU GH200 NVL4 server, which was announced a year ago at Supercomputing 2023.

Nvidia calls the new CPU-GPU combination the “Grace Blackwell” superchip, in which the CPUs and GPUs are interconnected via Nvidia’s NVLink interconnect.

Harris said that partners can deliver single-server Blackwell solutions optimized for hybrid HPC and AI workloads, with 1.3 terabytes of coherent memory shared across the four GPUs over NVLink.

The GB200 NVL4 system is 2.2 times faster in simulation and 1.8 times faster in inference than its GH200 NVL4 predecessor. The simulation figure was benchmarked with MILC, a scientific computing benchmark, and the 1.8x inference figure was measured on Llama 2 7B at FP16.

With only four GPUs, the server is best suited to inference. The system is configurable, with a base TDP of about 5.4 kilowatts.

This month, Nvidia also released MLPerf benchmarks for Blackwell, which showed significant generation-to-generation improvement compared to Hopper.

Nvidia submitted benchmarks from its internal Blackwell supercomputer called Nyx, which is built on DGX B200 systems.

Blackwell achieved 2.2 times faster LLM fine-tuning performance and two times faster LLM pretraining performance per GPU than the H100 Tensor Core GPU, according to MLPerf. The benchmarks used the Llama 2 70B fine-tuning and GPT-3 175B pretraining workloads.

Other Nvidia News

Nvidia made some software and microservices announcements for scientists to implement AI in research.

Nvidia announced new NIMs (Nvidia Inference Microservices), which are prepackaged containers that scientists can deploy on GPUs to run AI inference services.

New containers from its BioNeMo platform let researchers use AI for drug discovery and biological research. The NIM supports the AlphaFold2 AI model, among other biological models and datasets. Google DeepMind's Demis Hassabis and John Jumper shared this year's Nobel Prize in Chemistry for creating AlphaFold2.

Nvidia also announced a new container called ALCHEMI, which allows scientists to run material discovery workloads. The ALCHEMI container requires Nvidia GPUs.

The company also announced an Earth-2 NIM for CorrDIFF, an AI model for weather forecasting.

“We’ve also worked with US weather forecasting agencies to develop a CorrDIFF model for the entire continental U.S. That’s an area about 300 times larger than the original Taiwan-based model,” Harris said.

Nvidia also announced cuPyNumeric, a drop-in replacement for NumPy. cuPyNumeric automatically distributes Python array workloads across CPUs and GPUs for faster performance.

“It can automatically scale and detect across CPU, GPU, multi-GPU, and multi-node. That’s essentially how it’s designed to make that scaling process very seamless,” Harris said.

Harris said cuPyNumeric works across multiple Nvidia GPU generations.

“I think that’ll be really great for a lot of our supercomputing systems that are now being deployed leveraging our Grace Hopper systems,” Harris said.
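The drop-in claim can be sketched as follows; this is a minimal illustration assuming cuPyNumeric preserves NumPy's import-compatible API, with a plain-NumPy fallback for machines without it:

```python
# Hypothetical drop-in usage: the workload code below is unchanged NumPy code;
# only the import decides whether it runs on cuPyNumeric or standard NumPy.
try:
    import cupynumeric as np   # assumed package name, per Nvidia's drop-in claim
except ImportError:
    import numpy as np         # CPU-only fallback; the rest of the code is identical

x = np.linspace(0.0, 2.0 * np.pi, 1_000)
energy = np.sum(np.sin(x) ** 2 + np.cos(x) ** 2)  # identity, sums to len(x)

print(bool(np.isclose(energy, 1000.0)))  # True
```

Because no GPU-specific calls appear in the workload itself, the same script is meant to scale from a laptop CPU to a multi-node GPU system without modification, which is the "seamless" scaling Harris describes.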

Nvidia also announced an Omniverse reference design for computer-aided design that can help engineers simulate, test, and design products more quickly. The tool is available through all major cloud providers.


This article first appeared on sister site HPCwire.
