
Tesla Expands Its GPU-Powered AI Supercomputer – Is Dojo Next? 

Tesla has revealed that its biggest in-house AI supercomputer – which we wrote about last year – now has a total of 7,360 A100 GPUs, a nearly 28 percent uplift from its previous total of 5,760 GPUs. That’s enough GPU oomph for a top seven spot on the Top500, although the tech company best known for its electric vehicles has not publicly benchmarked the system. If it had, it would have gone up against similarly equipped GPU-based systems, such as NERSC’s Perlmutter (6,144 Nvidia A100 GPUs, 70.87 Linpack petaflops) and Nvidia’s own in-house A100 system, Selene (4,480 A100 GPUs, 63.46 Linpack petaflops).

Using Selene’s Top500 submission as a proxy, we estimate that Tesla’s 7,360-GPU cluster would be capable of about 100 double-precision Linpack petaflops, though we expect that Tesla is running mainly single- and lower-precision workloads (FP32, FP16, bfloat16, etc.).
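For readers who want to reproduce that figure, here is a minimal back-of-the-envelope sketch, assuming Linpack performance scales linearly with GPU count (a simplification – measured Rmax also depends on interconnect, topology and tuning):

```python
# Back-of-the-envelope Linpack estimate for Tesla's 7,360-GPU cluster,
# scaled linearly from Selene's Top500 submission. Linear scaling is an
# assumption; real Rmax does not scale perfectly with GPU count.
selene_gpus = 4480
selene_rmax_pflops = 63.46   # Selene's measured Linpack result (Rmax)

tesla_gpus = 7360
tesla_estimate = selene_rmax_pflops * (tesla_gpus / selene_gpus)

print(f"Estimated Linpack: {tesla_estimate:.1f} petaflops")  # ~104.3
```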

An even larger AI supercomputer – from Meta/Facebook – was detailed earlier this year. The AI Research SuperCluster (RSC) will employ 16,000 A100 GPUs when completed this summer, delivering more than 200 double-precision petaflops.

The Tesla GPU system reveal came last June from Andrej Karpathy, then-senior director of AI at Tesla, at the 2021 Conference on Computer Vision and Pattern Recognition (CVPR 2021). “I wanted to briefly give a plug to this insane supercomputer that we are building and using now,” Karpathy said. At the time, the system spanned 720 nodes, each powered by eight Nvidia A100 GPUs (the 80GB model), for a total of 5,760 A100s. At eight GPUs per node, the infusion of another 1,600 GPUs adds 200 nodes to the installation for a total of 920 nodes.
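The node math checks out; here is a trivial sketch using the counts cited above (eight A100s per node, as Tesla describes):

```python
# Node arithmetic behind the upgrade, assuming eight A100s per node.
gpus_per_node = 8
old_gpus, new_gpus = 5760, 7360

added_nodes = (new_gpus - old_gpus) // gpus_per_node   # 1,600 / 8 = 200
total_nodes = new_gpus // gpus_per_node                # 7,360 / 8 = 920
uplift_pct = 100 * (new_gpus - old_gpus) / old_gpus    # ~27.8 percent

print(added_nodes, total_nodes, round(uplift_pct, 1))  # 200 920 27.8
```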

News of the upgrade came via a tweet from Tim Zaman, an engineering manager at Tesla, as part of a promotion for the upcoming MLSys conference. Tesla is sponsoring the conference, which runs from August 29 through September 1, 2022. The company is also holding its second AI Day event on September 30, 2022.

Tesla’s GPU clusters are prologue to the company’s upcoming, homegrown Dojo supercomputer, which has been in development since August 2020, when Tesla CEO Elon Musk tweeted: “Tesla is developing a [neural network] training computer called Dojo to process truly vast amounts of video data. It’s a beast! … A truly useful exaflop at de facto FP32.”

Tesla’s D1 chip. Image courtesy of Tesla.

The design of Dojo was revealed at Tesla’s inaugural AI Day event last August, when details of the system and its constituent D1 chip surfaced. Tesla may soon be ready to spill some additional Dojo tea next week at Hot Chips. The (all-virtual) event kicks off Sunday, August 21, and runs through Tuesday, August 23, 2022. Tesla has three slots on the program, all on Tuesday. In the morning, Tesla hardware engineer Emil Talpes is scheduled to give a presentation titled “Dojo: The Microarchitecture of Tesla’s Exa-Scale Computer,” followed by Bill Chang, Tesla’s principal system engineer for Dojo, with his talk, “Dojo – Super-Compute System Scaling for ML Training.”

Later that day, Ganesh Venkataramanan, senior director of autopilot hardware at Tesla, will deliver a keynote talk, “Beyond Compute – Enabling AI through System Integration.” That is the second of two keynotes being featured at Hot Chips 2022; the other (“Semiconductors Run the World”) will be given by Intel CEO Pat Gelsinger on Monday, August 22.

Several technologies are competing to power the fastest AI supercomputers in the world. In addition to market leader Nvidia’s GPUs, GPUs from AMD now power the world’s fastest (publicly-ranked) supercomputer, Frontier. And Intel is working to release its Ponte Vecchio GPU, the primary engine for the future Aurora supercomputer. Custom chips are taking off as well: Google is on its fourth-generation TPUs; Microsoft has invested in FPGAs for running AI workloads; and Amazon has launched its Trainium and Inferentia chips for AI.

About the author: Tiffany Trader

With over a decade’s experience covering the HPC space, Tiffany Trader is one of the preeminent voices reporting on advanced scale computing today.
