
Data Days Workshop Gathers DOE National Labs to Discuss Future of Data Management 

Dec. 9, 2024 -- Lawrence Livermore National Laboratory (LLNL) recently hosted the 5th annual Department of Energy (DOE) Data Days event, bringing together data scientists, researchers and policymakers for three days of discussion on the latest advancements in data management, AI and high-performance computing (HPC).

Drawing more than 300 attendees, this year’s “D3” workshop focused on tackling pressing data challenges in nuclear security, energy and collaborative scientific discovery, and featured a host of talks, presentations and panels. The workshop kicked off with opening remarks from LLNL’s Associate Deputy Director for Program Enablement in the Strategic Deterrence Directorate Valerie Noble and Paul Adamson, data science manager for the National Nuclear Security Administration’s (NNSA) Defense Nuclear Nonproliferation Office, who set the tone by outlining NNSA’s key objectives in data strategy and nuclear nonproliferation.

DOE CDO Rob King (with mic) speaks during a panel on data governance at the DOE Data Days event in October. The panel included (from left) King; Sandia National Laboratories CDO Tom Trodden; Michael Cooke, senior technical advisor for DOE’s Office of the Deputy Director for Science Programs; and Lavanya Viswanathan, CDO at DOE’s Office of Clean Energy Demonstrations. The panel was moderated by DOE Senior Advisor Josh Linard (far left). (Photo: Garry McLeod/LLNL)

DOE’s Chief Data Officer Rob King delivered the first keynote address, recapping DOE’s Enterprise Data Management strategy and the strides made in centralized, secure data handling across diverse fields. King also discussed the critical role of structured data in driving initiatives such as international nuclear safety and clean energy transitions.

The morning session featured talks on cloud and hybrid data management, such as the DOE Atmospheric Radiation Measurement program’s ARMFlo workflow, presented by software engineer Elvis Offor of Pacific Northwest National Laboratory, and DOE's spatial Platform for Advanced Research and Collaboration (sPARC), presented by program lead Kevin Wright. The session highlighted how advanced visualization and in-situ data analysis can accelerate scientific discovery and foster cross-sector collaboration. Other sessions explored how scientists can leverage cloud-based Jupyter Notebooks to enable real-time data processing and examined data management for large-scale clean energy projects in the Office of Clean Energy Demonstrations (OCED).

Jack Sarle of the National Energy Technology Laboratory (NETL) also updated attendees on Project Alexandria, a “state-of-the-art R&D approach to data” that seeks to create a central repository for nonproliferation research data from NNSA’s Defense Nuclear Nonproliferation (DNN) office across different levels of classification. A panel followed on big data and AI governance, including discussion of how the national labs are working with private companies on AI models. In the afternoon, participants engaged in interactive breakout sessions for deeper discussions on collaborating on data management systems, data security and sensitivity, and multi-cloud providers.

The Future of AI and Data Security in Energy and Defense

The event’s second day began with a DOE Leadership Panel on data governance that delved into challenges and opportunities in data-intensive computing. Panelists included DOE’s King, Sandia National Laboratories Chief Data Officer Tom Trodden, Chief Data Officer at DOE’s OCED Lavanya Viswanathan and Senior Technical Advisor for DOE’s Office of the Deputy Director for Science Programs Michael Cooke. The panel was moderated by DOE Senior Advisor Josh Linard.

During the panel, King emphasized the need for data literacy among the DOE workforce, incentives for data stewardship and leadership that institutionalizes the value of data.

“AI and data governance needs to be tied at the hip, and we’re driving toward that,” King explained.

Cooke spoke of ensuring that scientists who share their data see a career benefit, which would improve data sharing and accelerate scientific progress. Viswanathan discussed the need for a user-friendly data marketplace with streamlined processes for data sharing, while also educating the workforce on AI's potential and risks. Trodden outlined Sandia’s focus on making data access timely and establishing an AI Board of Directors, and stressed the importance of balancing enthusiasm for AI with a realistic understanding of its limitations and appropriate applications.

“AI will really transcend every business area,” Trodden said. “We’re really spending a good amount of our energy early on in educating our workforce to what AI is, its capabilities, and its shortfalls and shortcomings, and how to really understand the effective and appropriate use of AI in the workplace. We're focusing on education up-front.”

Following the panel, LLNL’s John Westlund delivered a keynote on recent advancements in HPC and AI integration, highlighting the Lab’s efforts to support AI/ML workflows and focusing on the critical role of data in improving model accuracy. He introduced the Unified Storage Namespace (USN), a metadata clearinghouse that aids in planning and aging analysis and streamlines data management across HPC systems. With tools like the USN and HPC-connected AI accelerators from companies like SambaNova and Cerebras, LLNL is advancing capabilities to connect external and internal data center workloads and creating a more efficient infrastructure for handling complex, large-scale data storage, Westlund said.

“Improving data and compute transparency is critical,” Westlund said. “As we capture more data to feed AI, we're going to see more users who primarily store and move data rather than compute data. And as the usage of storage becomes more complex, data discovery becomes paramount for not only administration, but for effective use of storage by all users. Likewise, managing vast quantities of data will require improved automation and tools to facilitate movement and curation.”

The day’s presenters also discussed AI-driven tools for advancing scientific research. National Renewable Energy Laboratory (NREL) lead technologist and data systems architect John Weers presented on the Open Energy Data Initiative (OEDI), an information portal sponsored by DOE and developed by NREL to support the Open Government Initiative. OEDI is an open architecture designed to make energy data universally accessible, FAIR (Findable, Accessible, Interoperable, Reusable)-compliant and AI-ready, Weers said. During his talk, Weers also announced the launch of the new "Ask OEDI" AI research assistant, powered by a large language model, which allows researchers to ask scientific questions and receive reliable, citation-based answers drawn from curated metadata.

Following Weers, Svitlana Volkova, chief of AI at Aptima Inc., presented on FusionSci, a framework aimed at optimizing human-AI performance for cross-disciplinary scientific discovery. FusionSci leverages large language models and knowledge graphs to enable knowledge generation and validation across various fields, including data science, materials science and nuclear nonproliferation, Volkova said. Brookhaven National Laboratory computer scientist Carlos Soto then discussed the critical challenges of ensuring safety, security and trustworthiness in generative AI ecosystems, particularly in scientific and security applications. Soto stressed the need for robust verification, privacy preservation and adversarial protections as generative AI models scale, since they present unique risks such as data leakage, biased output and reproducibility issues that require safeguards.

The afternoon concluded with a poster session in the Lab’s West Cafe, where attendees explored a range of projects, including data-intensive climate science and AI applications in scientific research.

Working Toward a Unified Data Ecosystem

D3’s final day focused on data curation and governance, starting with a keynote by DOE Deputy Chief Data Officer Seth Berl on developing curated data pipelines for advanced analytics and AI applications.

Presentations from experts Meghan Berry of Oak Ridge National Laboratory (ORNL) and Kim Maestas of Los Alamos National Laboratory emphasized the need for interconnected repositories among the DOE labs and explored how organizations could better support data stewards, the agents who could ensure data quality and security as data ecosystems grow more complex. Berry discussed Constellation 2.0, an open dataset repository at the Oak Ridge Leadership Computing Facility (OLCF), and explained how such repositories can catalyze new scientific discoveries and drive collaborative research in areas like climate science and nuclear safety.

A trio of LLNL speakers filled out the rest of the session. LLNL computer scientist and Center for Applied Scientific Computing group leader Dan Laney discussed modern challenges in managing HPC modeling and simulation workflows and an NNSA-wide digital transformation effort designed to increase agility and reduce time to production. The NNSA labs will need secure and reliable workflows before they can start thinking about digital twins for part manufacturing, Laney said.

“The design and production agencies are working very hard to get more interconnected and we're actively looking at ways to connect our sites together in a more automatic fashion,” Laney said. “Being able to do simulation work across sites for various things like manufacturing and digital engineering and moving towards fully digitized processes is going to be pushed … The key thing is that there are a lot of nice tools around digital engineering in the outside business pool, but connecting those tools to the way we do high-performance computing and at the scale we do is the challenge.”

Camille Mathieu, the Knowledge Management (KM) program director in LLNL’s Strategic Deterrence Directorate, emphasized the importance of building a KM program for the National Security Enterprise (NSE) that supports digital engineering through a structured environment. Mathieu outlined three critical elements for progress: a controlled environment enabling predictable, discrete functions; normalized data inputs with standardized metadata; and trained participants who understand the defined system rules. Mathieu said that while high-level FAIR principles are essential, success for a data-driven NSE depends on clear standards, defined metrics and validated information systems.

“None of this is outside of the realm of possibility, in large part, not because it isn't hard; it's ridiculously hard, and not because it isn't ambitious, but because we are ultimately a closed system,” Mathieu said. “There are a lot of benefits to operating in an internal environment that make this, although a difficult thing to think about doing, also something that would pay off in a major way.”

To end the session, LLNL computational engineer Kerianne Pruett talked about the Data Science Institute’s (DSI) Open Data Initiative (ODI), a platform through which DSI can share datasets from various Lab projects. The repository is designed to foster collaboration, curriculum development and community engagement, she explained. ODI supports programs like the Data Science Challenge and internal training, though incentivizing principal investigators to submit datasets remains a challenge, Pruett said.

“We've been trying to motivate scientists at the lab to submit datasets and host them, and we're working on streamlining the metadata gathering for all these different options,” Pruett said. “The long-term vision is to be hosting not just data, but dashboards and tools and tutorials and across all classification levels.”

As the event concluded, a panel featuring Laney, Mathieu, Pruett and others discussed DOE’s challenges and strategies for effective data management and governance. Speakers noted the need for clear role definitions, the difficulty in hiring data practitioners and the development of specific position descriptions. The conversation also touched on the importance of data sharing, the need for better tools to facilitate peer review and the challenges of citing and attributing closed data.

“The workshop was a great success,” said LLNL Geophysics Data Specialist Rebecca Rodd, a member of the organizing committee and host for the workshop. “We received positive feedback from both our sponsors and the attendees across the DOE complex. We continue to offer a space to meet and learn from others in data management roles across different domains and laboratories.”


Source: LLNL

AIwire