DeepSeek Week: What (Almost) Everyone Missed
Earlier this week, news of DeepSeek's new GenAI models swept through the media. The market reacted, and many hardware companies (e.g., Nvidia, ASML, Broadcom, Marvell) saw their market value drop. The presumption: demand for truckloads of GPUs will shrink because DeepSeek has delivered a foundation model that rivals the "big kids" at 1/50th the cost. That analysis missed the point, though old hands like Pat Gelsinger caught it.
To be clear, I am not offering investment advice; I'm an HPC grey hair who has seen some things. Because no one knows what truly constitutes an AI, or an AGI for that matter, the hardware and software needed to reproduce HAL 9000 are still a bit up in the air.
The training cost of DeepSeek is quoted at $5 million, about 50 times less than the estimated $250 million required to build a state-of-the-art model. There is much quibbling that this figure covers only the hardware used for training and not people, development, and other infrastructure. Let's sidestep the nitpicking by assuming DeepSeek reduced the cost by a factor of 10, meaning they spent $25 million on the works. That is still cheap compared to the data-center budgets we constantly hear about.
Take a pause here and think about the past. Has there ever been a time when the "cost of entry" to computing was reduced by a factor of 10? Consider supercomputing in the early 1990s. It was an expensive game, usually requiring seven figures to play, and the league was rather exclusive. Then Thomas Sterling and Don Becker, with $50,000 from Jim Fisher, built the commodity hardware-based "Beowulf" cluster computer.
The rest is a well-known part of history. The cost of entry to supercomputing was reduced by at least a factor of ten. As a result, a lot more people were doing research- and engineering-quality supercomputing for a lot less money. Stories circulated about buying a fast Beowulf cluster for the cost of a memory upgrade or an annual support agreement on an existing supercomputing system.
The advent of Beowulf commodity computing was an inflection point for the market. Almost anyone could get in the game — and did. The lowered "cost of entry" spurred increased hardware sales into a new market: commodity-based HPC. In addition to the centralized capability machines, there were now local machines designed to deliver the specific performance needed by users.
Back to today. If it can be reproduced by others, the DeepSeek news is the "Beowulf moment" of the GenAI market. The club is now open to many more organizations that were previously priced out. Consider a recent X/Twitter post by Matthew Carrigan:
Complete hardware + software setup for running Deepseek-R1 locally. The actual model, no distillations, and Q8 quantization for full quality. Total cost, $6,000.
To be clear, that is the full model, which is 650GB in size. Now the kicker. The full model runs on two AMD EPYC processors (9004 or 9005 series) with 768GB of RAM (enough to fit the model) across 24 memory channels. Throw in a case, power supply, and SSD, and you pretty much have a local machine that Matthew reports runs at "between 6 and 8 tokens/second." Surprisingly, there is no GPU in the hardware manifest. The limiting factors in running the full model are memory size and memory bandwidth, and in this case, the need for large amounts of memory swings the platform in favor of CPUs. Of course, you can get GPUs with that much memory, but they will cost you "a bit" more than $6,000. Note: GPUs are still needed for training the model, so don't sell your Nvidia stock just yet; a lower cost to train and run models may generate more hardware sales.
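A quick back-of-envelope check shows how these numbers hang together. The sketch below assumes the commonly reported DeepSeek-R1 figures (roughly 671 billion total parameters, with about 37 billion active per token under its Mixture-of-Experts routing) and a 24-channel DDR5-4800 memory configuration, which is one plausible setup for a dual-socket EPYC 9004 board; exact speeds and parameter counts may differ.

```python
# Rough sanity check of the quoted figures (all inputs are assumptions, see above).

total_params = 671e9    # total parameters (reported for DeepSeek-R1)
active_params = 37e9    # parameters active per token (MoE routing)
bytes_per_param = 1     # Q8 quantization: roughly one byte per parameter

# Model footprint at Q8 -- this is what must fit in the 768GB of RAM.
model_gb = total_params * bytes_per_param / 1e9
print(f"Model size at Q8: ~{model_gb:.0f} GB")  # close to the 650GB quoted

# Peak memory bandwidth: 24 channels, DDR5-4800 (4800 MT/s, 8 bytes/transfer).
bandwidth = 24 * 4800e6 * 8
print(f"Peak memory bandwidth: ~{bandwidth / 1e9:.0f} GB/s")

# Token generation is roughly bandwidth-bound: each token must stream the
# active weights from RAM at least once, so peak bandwidth sets a ceiling.
tokens_per_sec = bandwidth / (active_params * bytes_per_param)
print(f"Bandwidth-bound ceiling: ~{tokens_per_sec:.0f} tokens/s")
```

The theoretical ceiling comes out well above the reported 6-8 tokens/second, which is consistent: sustained memory bandwidth on real systems falls well short of peak, and inference reads more than just the weights.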
The DeepSeek-R1 release was accompanied by a detailed project release on GitHub and a technical paper outlining the project's key steps. Also, "Open-R1: a fully open reproduction of DeepSeek-R1" has been started on Hugging Face.
Finally, their paper also describes how the DeepSeek team optimized the model. These optimizations allow the model to run extremely fast and likely stem from the team's efforts to make the most of the available hardware. Remarkably, DeepSeek claims to have pre-trained its V3 model on only 2,048 "crippled" (export-limited) Nvidia H800 GPUs. Instead of taking the easier path of throwing more hardware at the problem (or, more likely, throwing a data center at the problem), DeepSeek worked on software optimizations. Continued testing and study of this model will undoubtedly reveal more details.
As Mark Twain reputedly said, "History doesn't repeat itself, but it often rhymes." Indeed, as we have seen in the past, when the barrier to entry for technical computing goes down, the big established companies tend to frown.
This article first appeared on sister site HPCwire.