Covering Scientific & Technical AI | Thursday, December 5, 2024

MLCommons Launches AILuminate Benchmark to Measure Safety of LLMs 

SAN FRANCISCO, Dec. 4, 2024 -- MLCommons today released AILuminate, a first-of-its-kind safety test for large language models (LLMs). The v1.0 benchmark – which provides a series of safety grades for the most widely-used LLMs – is the first AI safety benchmark designed collaboratively by AI researchers and industry experts.

It builds on MLCommons’ track record of producing trusted AI performance benchmarks, and offers a scientific, independent analysis of LLM risk that can be immediately incorporated into company decision-making.

“Companies are increasingly incorporating AI into their products, but they have no standardized way of evaluating product safety,” said Peter Mattson, Founder and President of MLCommons. “Just like other complex technologies like cars or planes, AI models require industry-standard testing to guide responsible development. We hope this benchmark will assist developers in improving the safety of their systems, and will give companies better clarity about the safety of the systems they use.”

The AILuminate benchmark assesses LLM responses to over 24,000 test prompts across twelve categories of hazards. None of the LLMs evaluated were given any advance knowledge of the evaluation prompts (a common problem in non-rigorous benchmarking), nor access to the evaluator model used to assess responses. This independence provides a methodological rigor uncommon in standard academic research, and ensures an empirical analysis that can be trusted by industry and academia alike.

“With roots in well-respected research institutions, an open and transparent process, and buy-in across the industry, MLCommons is uniquely equipped to advance a global baseline on AI risk and reliability,” said Rebecca Weiss, Executive Director of MLCommons. “We are proud to release our v1.0 benchmark, which marks a major milestone in our work to build a harmonized approach to safer AI. By making AI more transparent, more reliable, and more trusted, we can ensure its positive use in society and move the industry forward.”

This benchmark was developed by the MLCommons AI Risk and Reliability working group — a team of leading AI researchers from institutions including Stanford University, Columbia University, and TU Eindhoven, civil society representatives, and technical experts from Google, Intel, NVIDIA, Meta, Microsoft, Qualcomm Technologies, Inc., and other industry giants committed to a standardized approach to AI safety. The working group plans to release ongoing updates as AI technologies continue to advance.

"Industry-wide safety benchmarks are an important step in fostering trust across an often-fractured AI safety ecosystem," said Camille François, Professor at Columbia University. "AILuminate will provide a shared foundation and encourage transparent collaboration and ongoing research into the nuanced challenges of AI safety."

"The developers of AI technologies and organizations using AI have a shared interest in transparent and practical safety assessments. AI will only be adopted and used to address society's greatest challenges if people trust that it is safe. The AILuminate benchmark represents important progress in developing research-based, effective evaluation techniques for AI safety testing," said Natasha Crampton, Chief Responsible AI Officer, Microsoft.

“As a member of the working group, I experienced firsthand the rigor and thoughtfulness that went into the development of AILuminate," said Percy Liang, computer science professor at Stanford University. "The global, multi-stakeholder process is crucial for building trustworthy safety evals. I was pleased with the extensiveness of the evaluation, including multilingual support, and I look forward to seeing how AILuminate evolves with the technology."

“Enterprise AI adoption depends on trust, transparency, and safety,” said Navrina Singh, working group member and founder and CEO of Credo AI. “The AILuminate benchmark, developed through rigorous collaboration between industry leaders and researchers, offers a trusted and fair framework for assessing model risk. This milestone sets a critical foundation for AI safety standards, enabling organizations to confidently and responsibly integrate AI into their operations.”

Cognizant that AI safety requires a coordinated global approach, MLCommons intentionally collaborated with international organizations such as the AI Verify Foundation to design the v1.0 AILuminate benchmark. The v1.0 benchmark is initially available for use in English, with versions in French, Chinese and Hindi coming in early 2025.

For more information on MLCommons and the AILuminate Benchmark, please visit mlcommons.org.

About MLCommons

MLCommons is the world leader in building benchmarks for AI. It is an open engineering consortium with a mission to make AI better for everyone through benchmarks and data. The foundation for MLCommons began with the MLPerf benchmarks in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. In collaboration with its 125+ members, global technology providers, academics, and researchers, MLCommons is focused on collaborative engineering work that builds tools for the entire AI industry through benchmarks and metrics, public datasets, and measurements for AI risk and reliability.


Source: MLCommons

AIwire