
Global Chip Shortage Is Negatively Impacting AI Workloads in the Cloud 

If you’re finding it harder to get access to GPUs in the cloud to train your AI models lately, you’re not alone.

The combination of a global chip shortage and increased demand for AI model training is apparently leading to longer wait times for some cloud GPU users. Nvidia, which makes most of the GPUs used for AI, says its overall operations are supply-constrained at the moment, particularly in the gaming market, but that it’s not enough to impact its growth.

Gigaom AI analyst Anand Joshi says the problem is definitely a concern.

“A lot of GPU users are complaining that it’s hard for them to get the GPU time,” Joshi said. “They put a job in a queue and it takes a while for it to ramp. Previously they would just say there are [X number of] GPUs and they were just sitting there. Now they don’t always have GPUs available, so it takes a while for them to get in the queue and get their jobs running.”

While Joshi doesn’t have any firsthand knowledge of the cloud platforms’ GPU expansion plans, he said the wait times customers are experiencing are an indication that the cloud providers have not been able to obtain new GPUs at the pace they had expected or wanted. That, he says, may be impacting their ability to expand GPU cloud environments to keep up with increasing demand for model training, which is the most computationally demanding component of the AI lifecycle.
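To make the wait-and-retry pattern Joshi describes concrete, here is a minimal sketch of what requesting scarce GPU capacity can look like in practice, assuming AWS EC2 via the boto3 SDK. The instance type, AMI ID, and retry interval are illustrative placeholders, not details drawn from this article.

```python
import time

import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2", region_name="us-east-1")

def request_gpu_instance(instance_type="p3.2xlarge",
                         ami_id="ami-0000000000example",  # hypothetical AMI ID
                         retry_seconds=300,
                         max_attempts=48):
    """Try to launch a GPU instance, waiting and retrying while capacity is short."""
    for attempt in range(max_attempts):
        try:
            resp = ec2.run_instances(
                ImageId=ami_id,
                InstanceType=instance_type,
                MinCount=1,
                MaxCount=1,
            )
            return resp["Instances"][0]["InstanceId"]
        except ClientError as err:
            code = err.response["Error"]["Code"]
            if code in ("InsufficientInstanceCapacity", "InstanceLimitExceeded"):
                # No GPU capacity available right now -- wait and try again.
                time.sleep(retry_seconds)
            else:
                raise
    raise RuntimeError("No GPU capacity obtained after repeated attempts")
```

In a healthy market the loop returns on the first attempt; the longer it spins, the more the kind of shortage Joshi describes is biting.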


“The users are saying they’re not available, and the reason they’re not available is the capacity has not been increased as much as cloud guys would like,” Joshi speculated. “So that could mean that they’re not as able to get as many GPUs as they want.”

Nvidia, which makes the GPUs used in many AI applications, enjoyed record revenue of $5 billion for the fourth quarter of its fiscal 2021, which ended January 31, 2021. That was a whopping 61% increase from the same quarter a year earlier. With continued strong demand from its gaming and data center businesses, Nvidia expects revenue to grow to $5.3 billion for the first quarter of fiscal 2022, which ends April 30.
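A quick back-of-envelope calculation (my arithmetic on the rounded figures above, not Nvidia’s own disclosure) shows what those numbers imply about the year-ago quarter and the sequential growth baked into the guidance:

```python
# Sanity check on the reported figures; all values are rounded, so results are approximate.
q4_fy2021 = 5.0        # reported revenue for the quarter ended Jan 31, 2021 ($ billions)
yoy_growth = 0.61      # reported 61% year-over-year increase
q1_fy2022_guide = 5.3  # guided revenue for the following quarter ($ billions)

q4_fy2020 = q4_fy2021 / (1 + yoy_growth)
print(f"Implied year-ago quarter: ~${q4_fy2020:.1f}B")                      # ~$3.1B
print(f"Guided sequential growth: {q1_fy2022_guide / q4_fy2021 - 1:+.0%}")  # about +6%
```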

Even with those gaudy growth figures, it appears that the chipmaker, which has a market capitalization of $317 billion, left a bit of money on the table last quarter because it was unable to keep up with surging demand for GPUs.

“At the company level, we’re supply constrained. Our demand is greater than our supply,” Nvidia CEO Jensen Huang said during an earnings call last month, according to a transcript provided by Nvidia. “However, for data center, so long as the customers work closely with us and we do a good job planning between our companies, there shouldn’t be a supply issue for data centers…We shouldn’t be supply constrained there.”

Huang continued: “But at the company level we’re supply constrained. Demand is greater than supply. We have enough supply and we usually have enough supply to achieve better than the outlook and we had that situation in Q4. We expect that situation in Q1 and we have enough supply to grow through the year. But supply is constrained and demand is really, really great.”

You can blame COVID-19 for most (if not all) of this. In the early days of the pandemic, economists predicted that business around the globe would slow down as governments instituted lockdowns to prevent the spread of the novel coronavirus. And that did happen to some extent in sectors such as automotive manufacturing, which slowed to a crawl in mid-2020 before picking up recently.

However, that lull didn’t last for long. A side effect of the lockdowns was that they forced many aspects of human life to shift to the digital realm. Stuck at home, adults and children suddenly needed new laptop computers, tablets, and video gaming consoles to work, learn, and play. The surge in demand for the microchips that go into computers, cars, gaming consoles, and smartphones has overwhelmed supply, which has led to shortages of the devices themselves.

Nvidia’s gaming business has been booming during the COVID-19 pandemic. Just as we’ve seen with new iterations of the Microsoft Xbox and Sony PlayStation, new Nvidia graphics cards are being bought up immediately and put back on the secondary market, often for hundreds of dollars more than Nvidia’s MSRP.

At the same time, AI deployments surged under COVID-19 as companies sought to increase their competitiveness and deal with the sudden shift from physical to digital, such as by using conversational agents to interact with customers or using machine learning to boost the accuracy of supply chain planning in the consumer goods supply chain.

According to a recent survey by KPMG, the percentage of retailers running a moderately to fully functional AI deployment increased by 29 percentage points from late 2019 to early 2021, reaching 81% of companies surveyed. In financial services, it rose 37 percentage points from 2019 to 2021, to 84%.
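Because those survey results are expressed as percentage-point increases, it may be easier to read them as before-and-after levels. The figures below are my own arithmetic on the numbers above, not a breakdown published by KPMG:

```python
# Back out the implied starting levels from the reported gains (approximate).
retail_2021, retail_gain = 0.81, 0.29    # 81% in early 2021, up 29 points
finserv_2021, finserv_gain = 0.84, 0.37  # 84% in 2021, up 37 points

print(f"Retail, late 2019:        ~{retail_2021 - retail_gain:.0%}")    # ~52%
print(f"Financial services, 2019: ~{finserv_2021 - finserv_gain:.0%}")  # ~47%
```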


When demand for AI model training rises but the supply of GPUs to run the models on does not, the result is that some users experience delays in getting access to GPUs.

“Frankly, nobody saw this coming,” Gigaom’s Joshi tells Datanami. “Nobody saw the demand for chips would drastically increase during COVID times. Everybody thought that it would go down, but it went up. And everybody just suddenly started to scramble for chips.”

There are only a handful of companies left in the chip fabrication business, and the surge in demand for chips of all types (not just GPUs) means that the chip fabs are running at full production capacity. Nvidia likely had some priority with its fab partners as a top-line chipmaker, but due to increased demand for the other types of chips that these fabs make, the fabs (as well as other parts of the supply chain) simply don’t have the capacity to handle orders above and beyond what the companies had agreed to.

The situation has also impacted AI chip startups, which have not been able to get the chip fabs to manufacture their designs. Joshi says there are around 100 AI chip startups with various designs, including open-source RISC-V designs. But with limited capacity in the chip fabs and the entire chip supply chain, some of their new products won’t be coming to market any time soon.

“Some of the smaller guys are just being told to wait because they just don’t have the bandwidth,” Joshi said.  “So some of these guys, for new chipsets in particular, we might see some delay.”

So far, there hasn’t been much impact on pricing of GPUs, which are the favored type of chip to run AI workloads on, but that could change. If demand for actual GPUs, as well as GPUs running in the cloud, continues to exceed supply, don’t be surprised if there are price increases.

“We’ll have to watch that carefully, to see if the supply returns at some point, which I’m sure it will,” Joshi said. “But if it doesn’t or takes a long time, then we might see some change in pricing, or some premium tier which will guarantee that GPUs are always available for your job and things like that.”

Google Cloud declined to comment for this article. Amazon Web Services did not respond to a request for comment before this article was published.

This article first appeared on sister website Datanami.

About the author: Alex Woodie

Alex Woodie has written about IT as a technology journalist for more than a decade. He brings extensive experience from the IBM midrange marketplace, covering topics such as servers, ERP applications, programming, databases, security, high availability, storage, business intelligence, cloud, and mobile enablement. He resides in the San Diego area.
