Covering Scientific & Technical AI | Sunday, November 10, 2024

Nvidia’s Newly DPU-Enabled SuperPod Is a Multi-Tenant, Cloud-Native Supercomputer 

At GTC 2021, Nvidia announced an upgraded iteration of its DGX SuperPods, calling the new offering “the first cloud-native, multi-tenant supercomputer.” The newly announced SuperPods come just two years after the first SuperPods, networked clusters of 96 DGX-2H systems that debuted as one of the world’s most powerful supercomputers in 2019. In 2020, Nvidia expanded the availability of its SuperPod systems, allowing enterprises to purchase modules through partners.

As with the previous generation, these DGX SuperPods contain 20 or more Nvidia DGX A100 systems networked with Nvidia’s in-house InfiniBand HDR networking. On the storage front, Nvidia is working with DDN as its first storage partner for these SuperPods.

Nvidia's BlueField-2 PCIe card

At the heart of these SuperPods are Nvidia’s BlueField-2 data processing units (DPUs), two of which are installed in the PCIe slots of each constituent DGX A100 in the SuperPod. These BlueField DPUs allow the SuperPods to isolate users’ data, enabling the system’s robust multi-tenant functionality (see our coverage, "Nvidia Debuts BlueField-3 – Its Next DPU with Big Plans for an Expanded Role"). Nvidia says that this responds to a growing need to support multiple teams at different locations, a need no doubt accelerated by work-from-home arrangements but perennially relevant to academics and researchers, who often have to share their computational resources with outside organizations. “More and more, we’re seeing customers that want the security isolation between their users, even if they’re all in the same company,” said Charlie Boyle, vice president and general manager of DGX systems at Nvidia.

Nvidia is further enabling this functionality with its Base Command software, which permits an organization to grant access to multiple users and IT teams. In fact, Boyle shared that Base Command – which has been under development for four years – is the same software that Nvidia has been using internally to manage its “thousands” of DGX systems (well over two thousand, according to Boyle). Base Command also includes built-in telemetry for validating deep learning models. “Now, for the first time, the entire management system that we use to manage our own fleet of DGX systems internally will be available to our SuperPod customers,” Boyle said. “We’ll be giving them the best of both worlds.”

The SuperPods start at $7 million and scale to $60 million for a full system.

Nvidia is using the occasion to tout the success of its existing SuperPods via high-profile clients, including Sony, NAVER, MTS and the University of Florida. And with drug discovery remaining a hot topic in the midst of the COVID-19 pandemic, Nvidia is highlighting SuperPod use by drug discovery company Recursion and announcing a partnership with Schrödinger, a pharmaceutical simulation software developer, to accelerate drug simulations. Recursion, Nvidia reported, was able to build an AI supercomputer for pharma applications (named Biohive-1) in 24 days, delivering enough compute to place it high in the Top500.

"I am pleased to see so much AI research advancing because of DGX," said Nvidia CEO Jensen Huang. "Top universities, research hospitals, telcos, banks, consumer products companies, carmakers and aerospace companies -- DGX helped their AI researchers, whose expertise is rare, scarce, and their work strategic. It is imperative to make sure they have the right instrument."

So far, though, Nvidia is keeping mum on clients for the new DPU-enabled SuperPods, saying that it will have more information to share after the systems are made available. On that front, the DPU SuperPods (and the accompanying Base Command software) are slated for availability sometime between May and July. With cloud-native supercomputing on the rise, one might expect similar offerings to emerge from other providers in the coming months.

Beyond the SuperPods, Nvidia is introducing another novel way to access its DGX technologies: a subscription service. The company will offer its workstation-format DGX Station A100 320G systems, announced last November, to enterprises starting at $9,000 per month, once again targeting distributed workforces and home offices.

Nvidia also announced an upgraded version of Megatron, a tool for training giant transformer models in a highly parallel fashion. With an eye to the future, Megatron is now capable of training models with hundreds of billions or even trillions of parameters. "We expect to see multi-trillion-parameter models by next year," said Huang, "and hundred-trillion parameter models by 2023." Further, Huang announced Megatron Triton, a DGX inference server that enables faster response times.
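To give a rough sense of why models at this scale have to be trained in a highly parallel fashion, the back-of-the-envelope sketch below estimates how much memory the weights alone would occupy. The figures are illustrative assumptions rather than Nvidia's numbers: 2 bytes per parameter (16-bit weights) and 80 GB of memory per GPU. The function names are hypothetical, and the estimate ignores optimizer state, gradients and activations, which would push real requirements considerably higher.

```python
# Back-of-the-envelope estimate of weight memory for very large models.
# Assumptions (not from Nvidia): 16-bit weights and 80 GB of memory per GPU.

BYTES_PER_PARAM = 2              # assumption: FP16/BF16 storage
GPU_MEMORY_BYTES = 80 * 10**9    # assumption: 80 GB per GPU


def weight_memory_bytes(num_params: int) -> int:
    """Bytes needed to hold the model weights alone."""
    return num_params * BYTES_PER_PARAM


def min_gpus_for_weights(num_params: int) -> int:
    """Smallest number of GPUs whose combined memory fits the weights."""
    total = weight_memory_bytes(num_params)
    # Integer ceiling division avoids floating-point rounding.
    return -(-total // GPU_MEMORY_BYTES)


if __name__ == "__main__":
    sizes = [
        ("100 billion", 100 * 10**9),
        ("1 trillion", 10**12),
        ("100 trillion", 100 * 10**12),
    ]
    for label, n in sizes:
        terabytes = weight_memory_bytes(n) / 10**12
        gpus = min_gpus_for_weights(n)
        print(f"{label} parameters: {terabytes:g} TB of weights, at least {gpus} GPUs")
```

Under these assumptions, a one-trillion-parameter model needs about 2 TB for its weights, which is more than any single GPU holds, so the model has to be split across many devices before training can begin.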

AIwire