Microsoft’s Rent-a-server Program Now Doing Generative AI
There's a new kind of chip shortage: there isn't enough compute capacity to handle the computing demand of ChatGPT, which has overloaded OpenAI's servers.
Microsoft hopes to fill some of that gap with a new virtual machine offering called the ND H100 v5, which includes a sea of Nvidia's latest H100 GPUs, codenamed Hopper, for generative AI applications.
The idea is to provide more speed for companies working with generative AI, which can dig deeper into data to establish relationships, reason and predict answers. Generative AI is in its early days, but results from applications like ChatGPT have demonstrated the technology's potential.
But the technology also needs massive computing power, which Microsoft is bringing to its Azure cloud service.
The VM offering can adjust to the size of generative AI applications, and scale up to thousands of H100 GPUs. The GPUs are interconnected by the chip maker's Quantum-2 InfiniBand technology.
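To give a rough sense of what "scaling across interconnected GPUs" looks like from the application side, here is a minimal PyTorch sketch that initializes distributed training over the NCCL backend, the collective-communication library Nvidia GPUs typically use over NVLink within a node and InfiniBand across nodes. The model, tensor sizes and launcher settings are illustrative assumptions, not details of Microsoft's offering.

```python
# Minimal multi-GPU training setup sketch. Assumes PyTorch with CUDA support
# and a launcher such as torchrun that sets RANK/WORLD_SIZE/LOCAL_RANK.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # NCCL handles GPU-to-GPU communication over NVLink and InfiniBand.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model; a real generative AI workload would load a
    # transformer here instead.
    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # One dummy forward/backward pass to show the wrapped model is usable.
    x = torch.randn(8, 1024, device=local_rank)
    loss = model(x).sum()
    loss.backward()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with something like `torchrun --nproc_per_node=8 train.py`, the same script runs one process per GPU on a node and extends to multiple nodes without code changes.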
Pricing for the H100-equipped virtual machines on Azure wasn't immediately available. Virtual machine prices vary by configuration, but a fully loaded A100 virtual machine with 96 CPU cores, 900GB of storage and eight A100 GPUs costs close to $20,000 per month.
The Nvidia GPUs faced a litmus test when ChatGPT opened its doors last year. The computing was handled by the OpenAI supercomputer, built with Nvidia A100 GPUs.
But the servers were overwhelmed by the growing demand for ChatGPT, and users complained that they were unavailable to handle queries.
The H100 could help close the speed gap that generative AI requires; the technology is already being used in healthcare, robotics and other sectors. Companies are also looking to fill the last-mile gap and deploy an interface that makes the AI presentable and usable, much like ChatGPT.
Nvidia and Microsoft are already building an AI supercomputer with H100s. The GPU was designed to work best with applications coded in CUDA, Nvidia's parallel programming framework. The chip maker's offerings include the Triton Inference Server, which helps deploy AI models like GPT-3 in its GPU environments.
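Triton Inference Server exposes HTTP and gRPC endpoints that applications call to run a deployed model. The sketch below uses the `tritonclient` Python package to send one request; the server URL, model name and tensor names are placeholders for illustration, not details confirmed by Nvidia or Microsoft.

```python
# Sketch of a Triton Inference Server client call. Assumes a server running
# at localhost:8000 serving a hypothetical model named "my_gpt_model" that
# accepts a batch of int32 token IDs.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Build the request input: a 1x8 batch of token IDs (placeholder data).
token_ids = np.array([[101, 2023, 2003, 1037, 3231, 102, 0, 0]], dtype=np.int32)
infer_input = httpclient.InferInput("input_ids", list(token_ids.shape), "INT32")
infer_input.set_data_from_numpy(token_ids)

# Request the model's output tensor by name (also a placeholder).
output = httpclient.InferRequestedOutput("logits")

response = client.infer(model_name="my_gpt_model",
                        inputs=[infer_input],
                        outputs=[output])
print(response.as_numpy("logits").shape)
```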
Microsoft has implemented a customized version of GPT-3, the large language model behind ChatGPT, in its Bing search engine. Microsoft is taking a DevOps-style iterative approach with Bing AI, in which the application is quickly updated as the company learns more about how it is used. Some disturbing responses from the AI have led Microsoft to limit the number of queries users can pose to it.
Google is planning its own deployment of a large language model called Bard, which will be handled by its internally developed chips, TPUs (Tensor Processing Units). Google Cloud also hosts Nvidia GPUs like the A100, which are already available to researchers via Colab Jupyter notebooks.
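For readers who want to see which Nvidia GPU a Colab notebook has been allotted, a quick check from Python looks roughly like this; the reported device depends on the Colab tier and is not guaranteed to be an A100.

```python
# Quick GPU check inside a Jupyter/Colab notebook (sketch).
import torch

if torch.cuda.is_available():
    # Reports the allotted accelerator, e.g. "Tesla T4" or "A100-SXM4-40GB".
    print("GPU:", torch.cuda.get_device_name(0))
    print("Memory (GB):", torch.cuda.get_device_properties(0).total_memory / 1e9)
else:
    print("No CUDA GPU allotted to this notebook.")
```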
The base configuration of the new Azure VM interconnects eight H100 Tensor Core GPUs via Nvidia's proprietary NVLink 4.0 interconnect, and can scale to more GPUs via the Quantum-2 interconnect. The servers use Intel's 4th Gen Xeon Scalable processors, codenamed Sapphire Rapids, which connect to the GPUs via the PCIe Gen5 interconnect.
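A small sketch of how an application could confirm that the GPUs in such a node can reach each other directly over NVLink or the PCIe fabric, using PyTorch's peer-access query; the GPU count is whatever the VM exposes, not hard-coded to eight.

```python
# Sketch: enumerate visible GPUs and check pairwise peer access, which
# succeeds when two devices can talk directly over NVLink or a shared
# PCIe fabric rather than staging transfers through host memory.
import torch

count = torch.cuda.device_count()
print(f"Visible GPUs: {count}")

for i in range(count):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1e9:.0f} GB")

for i in range(count):
    for j in range(count):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"Peer access {i} -> {j}: {ok}")
```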
Nvidia has already announced that H100 GPUs will be available from cloud providers Google and AWS, but those instances aren't available yet. Oracle is offering a full stack of Nvidia hardware and software in its cloud.
Nvidia CEO Jensen Huang announced Nvidia DGX Cloud, which he described as an "AI supercomputer," during an earnings call last month. The Oracle cloud service offering is an example of DGX Cloud, where Nvidia's full-stack AI offering consists of hardware, software and networking products.