
Honey I Shrunk the Model: Why Big Machine Learning Models Must Go Small 

Bigger is not always better for machine learning. Yet deep learning models and the datasets on which they’re trained keep expanding, as researchers race to outdo one another in the chase for state-of-the-art benchmarks. However groundbreaking these models are, the consequences of their growth are severe for budgets and the environment alike. For example, GPT-3, this summer’s massive, buzzworthy model for natural language processing, reportedly cost $12 million to train. What’s worse, UMass Amherst researchers found that the computing power required to train a large AI model can produce over 600,000 pounds of CO2 emissions – roughly five times the lifetime emissions of the average car.

At the pace the machine learning industry is moving today, there are no signs of these compute-intensive efforts slowing down. Research from OpenAI showed that between 2012 and 2018, the computing power used to train the largest deep learning models grew a staggering 300,000x, far outpacing Moore’s Law. The problem lies not only in training these models but also in running them in production, known as the inference phase. For many teams, practical use of deep learning models remains out of reach due to sheer cost and resource constraints.

Luckily, researchers have found a number of new ways to shrink deep learning models and optimize training datasets via smarter algorithms, so that models can run faster in production with less computing power. There’s even an entire industry summit dedicated to low-power, or “tiny,” machine learning. Pruning, quantization, and transfer learning are three specific techniques that could democratize machine learning for organizations that don’t have millions of dollars to invest in moving models to production. This is especially important for “edge” use cases, where large, specialized AI hardware is physically impractical.

The first technique, pruning, has become a popular research topic in the past few years. Highly cited papers including Deep Compression and the Lottery Ticket Hypothesis showed that it’s possible to remove some of the unneeded connections among the “neurons” in a neural network without losing accuracy – effectively making the model much smaller and easier to run on a resource-constrained device. Newer papers have further tested and refined earlier techniques to develop smaller models that achieve even greater speeds and accuracy levels. Some models, like ResNet, can be pruned by approximately 90 percent without impacting accuracy.
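To make the idea concrete, here is a minimal sketch of magnitude-based pruning using PyTorch’s built-in pruning utilities. The toy two-layer network and the 90 percent sparsity target are illustrative assumptions, not a reproduction of any of the papers above.

```python
# Minimal sketch: global magnitude pruning with torch.nn.utils.prune.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# A toy network used purely for illustration.
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

# Prune the 90% smallest-magnitude weights across all Linear layers.
parameters_to_prune = [
    (module, "weight") for module in model if isinstance(module, nn.Linear)
]
prune.global_unstructured(
    parameters_to_prune,
    pruning_method=prune.L1Unstructured,
    amount=0.9,
)

# Make the pruning permanent by removing the re-parameterization hooks.
for module, name in parameters_to_prune:
    prune.remove(module, name)

# Report the resulting overall sparsity.
zeros = sum(int((m.weight == 0).sum()) for m, _ in parameters_to_prune)
total = sum(m.weight.nelement() for m, _ in parameters_to_prune)
print(f"Global sparsity: {100.0 * zeros / total:.1f}%")
```

In practice, pruning is usually interleaved with fine-tuning so the remaining weights can recover any lost accuracy before the model is deployed.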

A second optimization technique, quantization, is also gaining popularity. Quantization covers a range of techniques for mapping values from a large set, such as 32-bit floating-point numbers, to a smaller one, such as 8-bit integers. Running a neural network involves millions of multiplication and addition operations, and reducing the precision of these mathematical operations shrinks memory requirements and computational cost, resulting in big performance gains.
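As an example of how little code this can take, here is a minimal sketch of post-training dynamic quantization with PyTorch, which converts the weights of the Linear layers from 32-bit floats to 8-bit integers. The toy network and input shape are illustrative assumptions.

```python
# Minimal sketch: post-training dynamic quantization in PyTorch.
import torch
import torch.nn as nn

# A toy network used purely for illustration.
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

# Quantize only the Linear layers' weights to 8-bit integers;
# activations are quantized dynamically at inference time.
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# The quantized model is a drop-in replacement for inference.
example_input = torch.randn(1, 784)
with torch.no_grad():
    output = quantized_model(example_input)
print(output.shape)  # torch.Size([1, 10])
```

Dynamic quantization is only one flavor; static and quantization-aware approaches trade a bit more setup for further speed and accuracy gains.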

Finally, while this isn’t a model-shrinking technique, transfer learning can help in situations where there’s limited data on which to train a new model. Transfer learning uses pre-trained models as a starting point. The model’s knowledge can be “transferred” to a new task using a limited dataset, without having to train a model from scratch. This is an important way to reduce the compute power, energy and money required to train new models.
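As a sketch of the idea, the snippet below loads a ResNet-18 pretrained on ImageNet via torchvision, freezes its backbone, and trains only a new classification head. The 10-class target task and the random batch are illustrative assumptions standing in for a real dataset.

```python
# Minimal sketch: transfer learning with a pretrained torchvision model.
import torch
import torch.nn as nn
from torchvision import models

# Start from a ResNet-18 pretrained on ImageNet.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pretrained backbone so only the new head is trained.
for param in model.parameters():
    param.requires_grad = False

# Replace the final classification layer for the new 10-class task.
model.fc = nn.Linear(model.fc.in_features, 10)

# Only the new head's parameters are handed to the optimizer.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a random batch (stand-in for real data).
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 10, (8,))
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```

Because only the small head is trained, the fine-tuning run needs a fraction of the compute that training the full network from scratch would require.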

The key takeaway is that models can (and should) be optimized whenever possible to operate with less computing power. Finding ways to reduce model size and related computing power – without sacrificing performance or accuracy – will be the next great unlock for machine learning.

When more people can run deep learning models in production at lower cost, we’ll truly be able to see new and innovative applications in the real world. These applications can run anywhere – even on the tiniest of devices – at the speed and accuracy needed to make split-second decisions. Perhaps the best effect of smaller models is that the entire industry can lower its environmental impact, instead of increasing it 300,000 times every six years.

About the Author:

Jeannie Finks brings passion, curiosity, and experience in developing teams, scaling businesses, and optimizing delivery of products to enhance customer outcomes. With more than 20 years of experience, she brings a unique background in customer success, technical program management, digital strategy, and open-source community evangelism. As Head of Customer Success at Neural Magic, Jeannie helps data scientists and AI/ML engineers achieve more by seeing how everyday CPUs can be turned into high performance compute resources.
