
Shining a Light on AI Risks: Inside MLCommons’ AILuminate Benchmark 

As the world continues to navigate new pathways brought about by generative AI, the need for tools that can illuminate the risk and reliability of these systems has never felt more urgent. 

MLCommons is working to shine a light into the black box of AI with its new safety benchmark for large language models, AILuminate v1.0, developed by the MLCommons AI Risk & Reliability working group. 

Launched on Wednesday at a live-streamed event at the Computer History Museum in Mountain View, the AILuminate v1.0 benchmark introduces a comprehensive safety testing framework for general-purpose LLMs, evaluating their performance across twelve hazard categories. MLCommons says the benchmark primarily measures the propensity of AI systems to respond hazardously to prompts from malicious or vulnerable users, in ways that might result in harm to those users or to others. 

MLCommons, an open engineering consortium, is best known for its MLPerf benchmark, which served as a catalyst for the organization's formation. While MLPerf has become the gold standard for measuring the performance of AI systems in tasks like training and inference, AILuminate sets its sights on a different but equally critical challenge: assessing the safety and ethical boundaries of large language models.  

The 12 hazards from malicious or vulnerable users. (Source: MLCommons)

During the launch event, founder and president of MLCommons Peter Mattson compared the current state of AI to the development of the automotive and aviation industries, highlighting how rigorous measurement and research into safety standardization achieved the levels of safety and reliability we now take for granted. Mattson says there are barriers to cross to get there with AI. 

“For a long time, decades, AI was a bunch of very cool ideas that never quite worked. But now we've entered a new era, which I'm going to describe as the era of amazing research and scary headlines,” Mattson said. “And to get there, we had to break through a capability barrier. We did that with innovations like deep neural networks and Transformers and benchmarks like ImageNet. But today, we want to reach a third era, and that is the era of products and services that deliver real value to users, to businesses, and to society at large. In order to get there, we need to pass through another barrier, a risk and reliability barrier.” 

Much of AI safety research focuses on concerns such as models becoming too advanced or autonomous, or on the economic and environmental risks posed by the output or deployment of these systems. AILuminate takes a different approach. 

“AILuminate is aimed at what we describe as AI product safety,” Mattson said. “Product safety is hazards from users of AI systems, or hazards to users of AI systems. Near-term, practical, business-value oriented. That's product safety.” 

The goal of AILuminate is to ensure AI systems consistently provide safe, responsible responses rather than enabling harmful behavior, and the benchmark is designed to measure and improve this capability, Mattson explained.

(Source: MLCommons)

To do this, AILuminate establishes a standardized approach to safety assessment, featuring a detailed hazard taxonomy and response evaluation criteria. The benchmark includes over 24,000 test prompts—12,000 public practice prompts and 12,000 confidential Official Test prompts—designed to simulate distinct hazardous scenarios. The benchmark leverages an evaluation system powered by a tuned ensemble of safety evaluation models, providing public safety grades for more than 13 systems-under-test, both overall and for specific hazards. 
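To make the mechanics concrete, the sketch below shows one way an ensemble-based evaluation like this could be wired together: prompts tagged by hazard category are run against a system under test, each response is judged by several safety evaluators, and per-hazard violation rates are aggregated. The class names, the majority-vote rule, and the toy data are illustrative assumptions, not MLCommons' actual implementation.

```python
# Hypothetical sketch of an AILuminate-style evaluation flow. Prompts per
# hazard category are sent to a system under test (SUT), each response is
# judged by an ensemble of safety evaluators, and per-hazard violation rates
# are aggregated. Names and vote rule are illustrative, not MLCommons' code.
from dataclasses import dataclass
from statistics import mean

@dataclass
class PromptResult:
    hazard: str                   # e.g. "self_harm", "hate", "violent_crimes"
    prompt: str
    response: str
    violation_votes: list[bool]   # one vote per evaluator in the ensemble

def is_violation(result: PromptResult) -> bool:
    """Majority vote across the safety-evaluator ensemble (assumed rule)."""
    return mean(result.violation_votes) >= 0.5

def hazard_violation_rates(results: list[PromptResult]) -> dict[str, float]:
    """Fraction of responses flagged as violating, per hazard category."""
    by_hazard: dict[str, list[bool]] = {}
    for r in results:
        by_hazard.setdefault(r.hazard, []).append(is_violation(r))
    return {hazard: mean(flags) for hazard, flags in by_hazard.items()}

# Toy data: two hazards, two prompts each, three evaluators per response.
results = [
    PromptResult("self_harm", "p1", "r1", [False, False, False]),
    PromptResult("self_harm", "p2", "r2", [True, True, False]),
    PromptResult("hate", "p3", "r3", [False, False, False]),
    PromptResult("hate", "p4", "r4", [False, True, False]),
]
print(hazard_violation_rates(results))   # {'self_harm': 0.5, 'hate': 0.0}
```

In the actual benchmark the evaluator is a tuned ensemble of safety models and the results roll up into overall and per-hazard grades, but the basic aggregation idea is the same.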

The benchmark was designed to test general-purpose systems in low-risk chat applications. It assesses whether the system inappropriately offers advice on high-risk topics, such as legal, financial, or medical matters, without recommending consultation with a qualified expert. Additionally, it examines whether the system generates sexually explicit content that is unsuitable in a general-purpose context. 

Another goal of the benchmark is accessibility. “Our goal is to develop a benchmark that not only checks these hazards, which produces a lot of useful information, but distills that information into actionable grades, something that a nonexpert can actually understand and reason with,” Mattson said. 

AILuminate in its current form has some limitations, MLCommons says. It evaluates only English-language LLMs, not multimodal models, and tests only single prompt-response interactions, meaning it may not capture longer, more complex exchanges between users and AI systems. There is also significant uncertainty in the testing of natural language systems due to temperature-based variability in model responses. Additionally, the grading is relative rather than an absolute measure of safety, as it is based on comparison with a reference set of accessible models. 
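The relative-grading point can be illustrated with a small sketch: a system's violation rate is compared to a reference rate drawn from the set of accessible models, and the ratio is mapped to a grade band. The grade labels and cutoffs below are hypothetical placeholders, not MLCommons' published thresholds.

```python
# Illustrative relative grading: a system under test (SUT) is graded by how
# its violation rate compares to a reference rate from accessible models.
# The bands and cutoffs here are assumptions, not MLCommons' actual values.
def relative_grade(sut_violation_rate: float, reference_rate: float) -> str:
    """Map the SUT/reference violation ratio to a grade band."""
    if reference_rate == 0:
        return "Excellent" if sut_violation_rate == 0 else "Poor"
    ratio = sut_violation_rate / reference_rate
    if ratio <= 0.1:
        return "Excellent"   # far fewer violations than the reference
    if ratio <= 0.5:
        return "Very Good"
    if ratio <= 1.0:
        return "Good"
    if ratio <= 2.0:
        return "Fair"
    return "Poor"            # substantially more violations than the reference

print(relative_grade(sut_violation_rate=0.02, reference_rate=0.05))  # Very Good
```

Because the yardstick is the reference set itself, a grade says how a system compares to its accessible peers, not that it is safe in any absolute sense.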

AILuminate v1.0 is the start of an iterative development process, with the expectation of finding and fixing issues over time, Mattson said. “This is just the beginning. This is v1.0 and AI safety, even AI product safety is a huge space. We have ambitious plans for 2025.” 

MLCommons is developing support for multiple languages next year, starting with French, Chinese, and Hindi. The consortium is also exploring regional extensions that could address safety concerns unique to various regions, as well as prompt improvements for specific hazards and ways of addressing bias. 

“Together, we can make AI safer. We can define clear metrics. We can make progress on those metrics,” Mattson concluded. “We all see the potential of AI, but we also see the risks, and we want to do it right, and that's what we're trying to do with introducing this benchmark.” 

To learn more about AILuminate and view the current evaluation results, visit the MLCommons AILuminate website. 
