Despite Rapid Advances, Studies Show AI Can’t Match Human Creativity
We have witnessed the remarkable rise of GenAI in recent years. The technology has transformed the way we interact with machines, opening doors to possibilities that once seemed out of reach. From creating realistic images to composing music and generating narratives, GenAI has reshaped industries and empowered creators. But can it match, or even surpass, human creativity?
While AI and machine learning (ML) algorithms are advancing rapidly, true creativity remains uniquely human, at least according to current research.
Mirco Musolesi, a computer scientist at University College London (UCL), and Giorgio Franceschelli, an expert in AI and computational creativity at the University of Bologna, presented a paper arguing that GenAI output remains derivative, at least for now.
The researchers built an evaluation framework for testing LLM capabilities, measuring creativity against Margaret Boden’s three criteria: novelty, value, and surprise.
Their assessment was conceptual and theoretical rather than a formal experiment with quantifiable scores.
The study found that LLMs can produce valuable and somewhat novel or surprising results, but that their design prevents them from achieving deep, transformational creativity.
The LLMs lacked capacities essential to creativity, such as an understanding of social context and a broader perception of the world. The researchers concluded that while AI can help support human creators, it is unlikely to match human creativity. Continual learning and fine-tuning may improve model outputs over time, but the creativity will remain simulated.
“Specific fine-tuning techniques might help LLMs diversify productions and explore the conceptual space they learn from data,” noted the researchers in their paper. “Continual learning can enable long-term deployments of LLMs in a variety of contexts. While, of course, all these techniques would only simulate certain aspects of creativity, whether this would be sufficient to achieve artificial, i.e., non-human, creativity, is up to the humans themselves.”
Nanyun Peng, a computer scientist at the University of California, Los Angeles, and her team compared human and LLM storytelling capabilities, focusing on plot progression and narrative development. They concluded that LLM-generated stories showed less diversity in story arcs and often struggled with pacing; performance improved slightly when the models were given explicit instructions about story arcs.
There’s no doubt that objectively testing the creativity of LLMs is tricky. Researchers have used various methods, including relying on their own expert judgment to analyze the models’ creative outputs. This raises an important question: are such methods reliable? Would a more systematic approach reach a different conclusion?
In pursuit of more quantifiable methods to test LLM creativity, Ximing Lu, a computer scientist at the University of Washington, and her colleagues created a program that incorporates both objectivity and nuance in evaluating LLM outputs.
They designed a program, called DJ Search, that scans online databases for phrases that match a given text either verbatim or semantically, using AI-generated "embeddings" to represent word meanings. The tool strips out the matched phrases and calculates the ratio of content that remains, producing a creativity index that measures linguistic novelty. The methodology is detailed in the team’s paper published on arXiv.
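To make the mechanism concrete, the sketch below shows how a DJ-Search-style creativity index might be computed. It is an illustration rather than Lu’s implementation: the toy in-memory reference corpus, the n-gram length, and the helper names are all assumptions, and it only checks exact phrase overlap, whereas the actual tool also matches semantically similar phrases via embeddings against web-scale databases.

```python
# Minimal sketch of a DJ-Search-style "creativity index" (illustration only).
# The real system searches web-scale databases and also counts semantically
# similar phrases via embeddings; this toy version only checks exact n-gram
# overlap against a small in-memory reference corpus.

from typing import List, Set


def ngrams(tokens: List[str], n: int) -> Set[tuple]:
    """Return the set of all n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def creativity_index(text: str, reference_corpus: List[str], n: int = 5) -> float:
    """Fraction of the text's tokens NOT covered by any n-gram that also
    appears in the reference corpus. Higher means more linguistically novel."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return 1.0

    # Collect every n-gram that occurs anywhere in the reference corpus.
    reference_ngrams: Set[tuple] = set()
    for doc in reference_corpus:
        reference_ngrams |= ngrams(doc.lower().split(), n)

    # Mark token positions covered by a matched phrase ("remove matched phrases").
    covered = [False] * len(tokens)
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) in reference_ngrams:
            for j in range(i, i + n):
                covered[j] = True

    # The index is the share of tokens left over once matches are removed.
    return covered.count(False) / len(tokens)


if __name__ == "__main__":
    corpus = [
        "the quick brown fox jumps over the lazy dog",
        "to be or not to be that is the question",
    ]
    remix = "the quick brown fox jumps over a sleeping cat"
    original = "a copper fox pirouettes across the frostbitten meadow at dawn"
    print(creativity_index(remix, corpus, n=3))     # lower: reuses known phrases
    print(creativity_index(original, corpus, n=3))  # higher: mostly novel phrasing
```

In this toy example, the "remixed" sentence scores lower because most of its phrases already appear in the reference corpus, while the freshly worded sentence scores close to 1.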
According to Lu, LLMs put together pieces from existing works to make something new, much like a DJ remixing existing music; this is where the name "DJ Search" comes from. While such remixing is valuable, it doesn’t match the level of creativity involved in an original composition.
Although Lu and her team used a more computational method, they reached a conclusion similar to Musolesi and Franceschelli’s: human-written texts are significantly more original than GenAI outputs.
“Our findings suggest that the content and writing style of machine-generated texts may be less original and unique, as they contain significantly more semantic and verbatim matches with existing web texts compared to high-quality human writings,” Lu wrote in her paper. “We hypothesize that this limited creativity in models may result from the current data-driven paradigm used to train LLMs.”