Covering Scientific & Technical AI | Saturday, January 18, 2025

OSI Open AI Definition Stops Short of Requiring Open Data 

The movement toward open source AI made progress today when the Open Source Initiative released the first Open Source AI Definition (OSAID). While the OSAID is a step forward, its lack of openness requirements for training data leaves a gap that eventually will need to be filled.

The OSAID was unveiled this week after two years of development at the OSI, the standards body that has worked for nearly three decades to define what open source means and to create licenses to help distribute open source software.

The process was “well-developed, thorough, inclusive and fair,” said Carlo Piana, the OSI board chair. “The board is confident that the process has resulted in a definition that meets the standards of Open Source as defined in the Open Source Definition and the Four Essential Freedoms, and we’re energized about how this definition positions OSI to facilitate meaningful and practical Open Source guidance for the entire industry.”

The Four Essential Freedoms require that, for any piece of software, every user must be free to:

  • “Use the system for any purpose and without having to ask for permission,”
  • “Study how the system works and understand how its results were created,”
  • “Modify the system for any purpose, including to change its output,” and
  • “Share the system for others to use with or without modifications, for any purpose.”

According to the OSAID 1.0 definition, open source AI is needed so that the benefits “accrue to everyone.” The AI definition requires that developers must provide the complete source code used to train and run the system, including “the full specification of how the data was processed and filtered, and how the training was done.”

This includes any code used “for processing and filtering data, code used for training including arguments and settings used, validation and testing, supporting libraries like tokenizers and hyperparameters search code, inference code, and model architecture,” the definition states. The author of an open AI system under OSAID must also fully disclose the model parameters, including weights and configuration settings.

But when it comes to the data used to train the model, the OSAID does not require that the training data be made available. Instead, it requires only “sufficiently detailed information about the data used to train the system so that a skilled person can build a substantially equivalent system,” the definition states.

The OSAID definition continues:

“In particular, this must include: (1) the complete description of all data used for training, including (if used) of unshareable data, disclosing the provenance of the data, its scope and characteristics, how the data was obtained and selected, the labeling procedures, and data processing and filtering methodologies; (2) a listing of all publicly available training data and where to obtain it; and (3) a listing of all training data obtainable from third parties and where to obtain it, including for fee.”
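
The three items in that passage can be read as a disclosure checklist. The following is a minimal sketch in Python of how a model publisher might track them. The class, field, and function names are hypothetical and do not come from the OSAID text, and passing this check would not by itself establish compliance with the definition.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataInformation:
    """Hypothetical record of the three data-information items quoted above."""
    # (1) Description of all training data: provenance, scope, selection,
    #     labeling procedures, and processing/filtering methodology.
    description: str = ""
    # (2) Where to obtain publicly available training data.
    public_sources: List[str] = field(default_factory=list)
    # (3) Where to obtain third-party training data, including for a fee.
    third_party_sources: List[str] = field(default_factory=list)
    # Flags the publisher sets to say which kinds of data were used.
    uses_public_data: bool = False
    uses_third_party_data: bool = False


def missing_items(info: DataInformation) -> List[str]:
    """Return the names of checklist items that are still empty."""
    missing = []
    if not info.description.strip():
        missing.append("description")
    if info.uses_public_data and not info.public_sources:
        missing.append("public_sources")
    if info.uses_third_party_data and not info.third_party_sources:
        missing.append("third_party_sources")
    return missing
```

Note that the sketch records where data can be obtained, not the data itself, which mirrors the distinction the definition draws.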

Ayah Bdeir, who leads AI strategy at Mozilla, said this goes beyond “what many proprietary or ostensibly Open Source models do today.” However, Bdeir seemed to acknowledge that not requiring a full copy of the training data represents a compromise on the part of the OSAID.

“This is the starting point to addressing the complexities of how AI training data should be treated, acknowledging the challenges of sharing full datasets while working to make open datasets a more commonplace part of the AI ecosystem,” she stated in the press release. “This view of AI training data in Open Source AI may not be a perfect place to be, but insisting on an ideologically pristine kind of gold standard that will not actually be met by any model builder could end up backfiring.”

Luca Antiga, the CTO of Lightning AI, wishes the OSI had gone a step further and required the training data to be open in its definition of open source AI.

“If we accept that the source code for a model is the data it was trained on–or at least a significant part is the data it was trained on–then we have an open source AI whose source is not open. That is not just an academic distinction,” he tells BigDATAwire. “I believe that to be of a practical value, a definition of open source needs to be all encompassing.”

The Apache 2.0 license is the gold standard in open source because it states that the creator of open source software will not sue the user. But leaving the training data out of the OSAID weakens the definition to the point where users won’t have the kind of assurance that commercial users of products licensed under Apache 2.0 have enjoyed, Antiga says.

“It’s going to be a bit too weak for open source to be perceived as something that is okay to use in a business situation,” he said.

These are difficult issues to grapple with, to be sure, especially in the context of large language models (LLMs), which are immensely large, difficult to build, and trained on huge swaths of data culled from the open Web as well as private Internet sites. Because of these hurdles, only a handful of the world’s largest tech firms have successfully developed and trained an LLM.

For instance, Meta’s Llama3 model is immensely popular and capable and free to download, but Meta has not called it an open source model, likely because it was trained on proprietary data–Facebook and Instagram conversations–which Meta won’t release. And despite its name, OpenAI, which kickstarted the LLM craze with the release of ChatGPT in November 2022, doesn’t even pretend that its models are open source.

Stefano Maffulli, the Executive Director of the OSI, seems to acknowledge the difficulties that adding open data as a requirement creates for open source AI.

“Arriving at today’s OSAID version 1.0 was a difficult journey, filled with new challenges for the OSI community,” Maffulli says in the OSI press release. “Despite this delicate process, filled with differing opinions and uncharted technical frontiers—and the occasional heated exchange—the results are aligned with the expectations set out at the start of this two-year process. This is a starting point for a continued effort to engage with the communities to improve the definition over time as we develop with the broader Open Source community the knowledge to read and apply OSAID v.1.0.”

Lightning AI’s Antiga acknowledges the difficulty of creating a standard for open source AI models, and commends the OSI for taking the issues up in the first place.

“I don’t want to criticize for the sake of criticizing. I think the people there, they did a good job at making the issue discussed,” he says. “I just think that the definition that is coming out of this is a compromise that is dictated by the current way AI needs to be trained, on gigantic, gigantic data sets.”

However, since the OSAID won’t provide the legal indemnification that would come with a definition requiring fully open training data, the industry will seek it elsewhere, Antiga says. Businesses, model developers, and the scientific community will likely look for an additional license for training data that, in combination with the OSAID, will provide the necessary disclosures to settle ethical and legal concerns, he says.

“I think in the end, practical needs will find their way,” he says. “It’s just like water. At some point it finds its way. So there will be the OSI definitions plus some conditions on the data, and people will accept that A plus X will be the open source thing. I think the picture will be completed by practice in the sense that enough people adopting models that are more kosher versus others that are less, will bring us to finding definitions for one and the other piece that’s missing. Although the OSI will not pronounce themselves on the other piece right now, it will just emerge.”


This article first appeared on sister site BigDATAwire.

About the author: Alex Woodie

Alex Woodie has written about IT as a technology journalist for more than a decade. He brings extensive experience from the IBM midrange marketplace, including topics such as servers, ERP applications, programming, databases, security, high availability, storage, business intelligence, cloud, and mobile enablement. He resides in the San Diego area.

AIwire