New Study Warns of Catastrophic Overtraining in Large Language Models

The race to build ever-larger language models is driven by the assumption that more pre-training data equals better performance. It's no surprise that AI companies have been scrambling to find enough quality data to train their models, often resorting to synthetic data to train and fine-tune them. But what if this core assumption is flawed?
A new study warns that more pre-training data may not always lead to better AI models. Researchers from Carnegie Mellon University, Stanford University, Harvard University, and Princeton University highlight a phenomenon they call “Catastrophic Overtraining.” Their research suggests that extending pre-training can actually degrade a model’s ability to be fine-tuned effectively, leading to poorer performance in real-world applications.
The researchers challenge the “more is better” belief when it comes to training AI models. “Contrary to common belief, longer pre-training does not always lead to better post-trained models,” wrote the authors in their study published on arXiv. “We have shown that this is a consequence of a broader underlying phenomenon where models become more sensitive to perturbations as they are pre-trained on more tokens.”
Why do AI models require pre-training? AI companies use pre-training to teach AI systems foundational skills relevant to their tasks. This could be anything from understanding language and analyzing images to predicting sequences or recognizing patterns in data.
Pre-training plays an important role as it allows models to generalize knowledge, adapt to diverse contexts, and perform effectively across a wide range of tasks. Just to be clear, the researchers don’t reject pre-training but suggest developers need to be more strategic about how much pre-training is enough.
To understand how extended pre-training impacts AI models, the researchers compared two versions of Ai2’s open-source OLMo-1B model. One was trained on 2.3 trillion tokens, the other on 3 trillion. Surprisingly, the model trained on more data performed worse after fine-tuning, showing 2-3% lower accuracy on standard benchmarks like ARC-Challenge, PIQA, and AlpacaEval.
The authors explain this degradation in performance through what they call “progressive sensitivity”. As the models are trained for longer, their internal parameters become increasingly sensitive to changes such as tweaking the model during fine-tuning or adding more data. This heightened sensitivity means that even minor adjustments or even small amounts of noise in the data can seriously disrupt what the model has already learned.
The study supports its findings through evidence from multiple angles. When the researchers added Gaussian noise to pre-trained models, they found performance became significantly worse with increasing pre-training tokens. Additionally, they validated their results using a different setup involving fine-tuned benchmarks, which yielded similar outcomes.
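The Gaussian-noise experiment described above can be illustrated with a toy sketch. This is not the authors' actual setup (they perturb the weights of OLMo-1B checkpoints and measure benchmark accuracy); here a small least-squares model stands in for a pre-trained checkpoint, and mean squared error on synthetic data stands in for benchmark performance. All names and data in the sketch are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a pre-trained model: a linear map fit to synthetic
# data (the study itself perturbs the weights of OLMo-1B checkpoints).
X = rng.normal(size=(200, 16))
true_w = rng.normal(size=16)
y = X @ true_w
w_fit = np.linalg.lstsq(X, y, rcond=None)[0]

def mse(w):
    """Loss on the synthetic data; stand-in for benchmark performance."""
    return float(np.mean((X @ w - y) ** 2))

# Perturbation test: add Gaussian noise of increasing scale to the
# weights and record how much the loss degrades relative to baseline.
base = mse(w_fit)
for sigma in (0.01, 0.05, 0.1):
    noisy = w_fit + rng.normal(scale=sigma, size=w_fit.shape)
    print(f"sigma={sigma}: loss increase = {mse(noisy) - base:.5f}")
```

The study's "progressive sensitivity" claim is that the same fixed-scale perturbation causes a larger degradation in checkpoints pre-trained on more tokens; this sketch only shows the mechanics of the perturb-and-measure test itself.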
The researchers acknowledge that their findings are not universal; their results suggest the risk of catastrophic overtraining is higher for smaller models. They also emphasize that overtraining can’t always be fixed, even with good techniques, if the tasks aren’t well-aligned.
“Catastrophic overtraining may be inevitable, even if the fine-tuning process is regularized, especially when the pre-training and fine-tuning tasks are misaligned,” shared the researchers. This highlights the importance of ensuring alignment between training and fine-tuning objectives.
AI model pre-training is a crucial component of the development process. However, the study's findings highlight the risks of overtraining. So, what is the sweet spot? According to the researchers, it involves striking a balance between base model quality and post-training adaptability.
Developers may need to rethink their approach to building AI models. As the researchers suggest, the focus should move away from simply scaling up data and model size toward optimizing the entire training pipeline. “Our findings call for a renewed focus on model scaling that considers the entire training pipeline,” the researchers emphasize.
The authors emphasize the need for further research to explore the factors that determine when and how catastrophic overtraining occurs. However, a key takeaway from their study is that by adopting smarter strategies for AI development, less can sometimes be more.