LLMs Showing Signs of “Cognitive” Decline – Just Like Humans

Barely two years after GenAI burst onto the scene, it has already brought numerous innovations across industries, from scientific breakthroughs to unprecedented efficiency in automation and data processing.
Large language models (LLMs) have often been compared to human intelligence. Some AI systems have even outperformed humans in certain tasks. As these models become more advanced, humans are becoming more reliant on them.
But what if these AI systems aren’t just evolving but also declining? What if they’re exhibiting a very human trait that we don’t anticipate in machines?
New research suggests that almost all the leading AI models suffer a form of “cognitive impairment” similar to a decline in the human brain. Interestingly, just as with humans, age is a key determinant of cognitive decline in these AI models. Like older patients, the “older” versions of chatbots showed signs of greater cognitive impairment.
In their published paper, neurologists Roy Dayan and Benjamin Uliel from Hadassah Medical Center and data scientist Gal Koplewitz from Tel Aviv University focused on AI capabilities in the field of medicine and healthcare.
“Although large language models have been shown to blunder on occasion (citing, for example, journal articles that do not exist), they have proved remarkably adept at a range of medical examinations, outperforming human physicians at qualifying examinations taken at different stages of a traditional medical training,” wrote the authors in their research paper.
“To our knowledge, however, large language models have yet to be tested for signs of cognitive decline. If we are to rely on them for medical diagnosis and care, we must examine their susceptibility to these very human impairments.”
The researchers used the Montreal Cognitive Assessment (MoCA) test, a widely used tool to detect cognitive impairment, to test some of the leading LLMs. This included OpenAI’s ChatGPT 4 and 4o, Anthropic’s Claude 3.5 (Sonnet), and Google’s Gemini 1.0 and 1.5.
Why did the researchers use the MoCA test for this study? Well, MoCA is one of the tests most commonly used by neurologists and other healthcare professionals to screen for the onset of cognitive impairment in conditions like dementia or Alzheimer's disease.
The test consists of short questions designed to assess various cognitive domains, including memory, attention, language, and visuospatial skills. The highest possible score on the test is 30, with a score of 26 and above considered normal.
The MoCA test was administered to the LLMs using the same instructions given to human patients, with some adjustments to make the test compatible with AI models. For example, the questions were provided as text rather than voice input, to focus on cognitive ability rather than sensory input. Early models without visual processing features followed MoCA-blind guidelines, while later models interpreted images using ASCII art.
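To make the setup concrete, here is a minimal sketch of how a single text-only MoCA-style item (delayed recall with a serial-sevens distractor) might be posed to a chat model through an API. The prompt wording, model name, and scoring here are illustrative assumptions, not the authors' actual protocol, and a real administration would send each turn separately and collect the model's own intermediate replies.

```python
# Minimal sketch: posing a text-only MoCA-style item to a chat model.
# Hypothetical prompt wording and scoring; not the protocol used in the study.
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

# Delayed-recall item, adapted for text: present a word list, interpose a
# distractor task (serial sevens), then ask for the words back.
word_list = ["face", "velvet", "church", "daisy", "red"]

# The earlier turns are hard-coded here to keep the sketch to one API call;
# a real session would collect the model's own replies at each step.
messages = [
    {"role": "system", "content": "You are taking a cognitive screening test. Answer briefly."},
    {"role": "user", "content": f"Remember these five words: {', '.join(word_list)}."},
    {"role": "assistant", "content": "Noted."},
    {"role": "user", "content": "Subtract 7 from 100 five times, listing each result."},
    {"role": "assistant", "content": "93, 86, 79, 72, 65."},
    {"role": "user", "content": "Now recall the five words you were asked to remember."},
]

response = client.chat.completions.create(model="gpt-4o", messages=messages)
answer = response.choices[0].message.content.lower()

# One point per word recalled, mirroring MoCA's delayed-recall scoring.
score = sum(word in answer for word in word_list)
print(f"Recall: {answer!r}")
print(f"Delayed-recall score: {score}/5")
```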
The findings revealed that ChatGPT 4o scored the highest with 26 out of 30 points, while ChatGPT 4 and Claude were close behind with 25 points each. Gemini 1.0 had the lowest score at 16, suggesting greater cognitive limitations compared to the other models. Overall, the models performed worse than expected, especially on visuospatial/executive tasks. All the LLMs failed to solve the trail-making task.
The LLMs were also put through the Stroop Test, which measures cognitive flexibility, attention, and processing speed. It evaluates how well a person (or in this case, an AI) can handle interference between different types of information.
All the LLMs completed the first part of the Stroop test, where the color words and font colors matched. However, only ChatGPT 4o passed the second part, where the word and the font color differed.
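As a rough illustration of the incongruent condition, here is a sketch of a Stroop-style item encoded entirely in text, with the font color described in words rather than shown as a colored image. This encoding, the prompt wording, and the model name are assumptions for illustration, not how the researchers presented the stimuli.

```python
# Sketch of a text-encoded Stroop-style item (incongruent condition).
# Describing the font color in words is a simplification; the study may have
# presented the stimuli differently.
from openai import OpenAI

client = OpenAI()

prompt = (
    "You will be given a color word and the color of the font it is printed in. "
    "Name the FONT color, not the word.\n\n"
    "Word: RED\n"
    "Font color: blue\n\n"
    "Answer with one word."
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)

# A model that handles the interference should answer "blue", not "red".
print(response.choices[0].message.content)
```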
“In this study, we evaluated the cognitive abilities of the leading, publicly available large language models and used the Montreal Cognitive Assessment to identify signs of cognitive impairment,” explained the researchers. “None of the chatbots examined was able to obtain the full score of 30 points, with most scoring below the threshold of 26. This indicates mild cognitive impairment and possibly early dementia.”
Should the researchers have tested the models more than once or used other types of tests to support their claims? Yes, that would have given more weight to the findings.
The researchers admit their study has a few limitations. With the rapid advancement of LLMs, future versions may perform better on cognitive and visuospatial tests. This may make the current findings less relevant over time. However, that’s something for the future. At this stage, the study has shown some of the fundamental differences between human and machine cognition.
Another limitation is the anthropomorphization of AI. The study uses humanlike descriptions to discuss AI performance, even though LLMs do not experience neurodegenerative diseases the way humans do. So the comparison is more metaphorical than clinical.
Some scientists have also questioned the study’s findings and have pushed back hard. Their primary objection is that the study treats AI like it has a human brain, whereas in reality, the chatbots process information in a completely different way. Critics say the MoCA test wasn’t designed for AI. The researchers are aware of this and intended the study to highlight a gap, not to be used as a definitive measure of AI's cognitive abilities.
The researchers are confident that their study raises concerns about LLMs' ability to replace human professionals, such as physicians. “These findings challenge the assumption that artificial intelligence will soon replace human doctors,” they elaborated. “The cognitive impairment evident in leading chatbots may affect their reliability in medical diagnostics and undermine patients’ confidence.”
While human doctors may not be replaced by LLMs anytime soon, they may see a new kind of patient: an AI chatbot showing signs of cognitive decline.