LLNL Study Explores ML-Driven Binary Analysis for Software Security
Feb. 3, 2025 -- Whether intended for a smartphone or a supercomputer, software updates can be vulnerable to attacks. Cybersecurity programs scrutinize software behavior before installation, detecting signatures of malicious activity. But lack of access to source code complicates this task, so researchers are turning to structural information contained in software binaries, the compiled, machine-readable ones and zeros.
Software binaries come with their own complexities, however. For instance, information deemed unnecessary can be discarded during compilation. Machine learning (ML) techniques—such as graph neural networks (GNNs) and natural language processing (NLP)—are opening new avenues for automating binary analysis.
Leveraging these techniques, computational mathematician Geoff Sanders and former LLNL data scientist Justin Allen explored ways to characterize software behaviors based on their similarity to previous threats. Allen built an ML-driven binary analysis pipeline that incorporates large-scale training data and hierarchical embeddings, and presented the team’s paper, “BobGAT: Towards Inferring Software Bill of Behavior with Pre-Trained Graph Attention Networks,” at the 2024 IEEE International Conference on Trust, Privacy and Security in Intelligent Systems and Applications. The work was part of a Laboratory Directed Research and Development project focusing on software assurance capabilities.
The pipeline turns software binaries into graph representations such as control-flow graphs (CFGs), which capture the possible execution paths among a program’s basic blocks, then applies a combination of GNN and NLP techniques to understand the semantics of the compiled binaries. This model uses a multilayered, hierarchical graph attention network—the GAT in BobGAT—which aggregates embeddings of assembly lines, basic blocks, functions, and binaries. This process repeats to accumulate fine-grained details of the graph’s features.
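A minimal sketch of this bottom-up aggregation, in NumPy, with a toy softmax-attention pool standing in for a trained GAT layer. The embedding size, hierarchy shapes, and pooling function are all illustrative assumptions, not BobGAT’s actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def attention_pool(children: np.ndarray) -> np.ndarray:
    """Collapse child embeddings (n, d) into one parent embedding (d,)
    using softmax attention against a query vector. The random query is
    an illustrative stand-in for a learned parameter."""
    query = rng.normal(size=children.shape[1])
    scores = children @ query
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ children

# Toy hierarchy: 2 functions, each holding basic blocks made of
# per-assembly-line embeddings of dimension d.
d = 8
functions = [
    [rng.normal(size=(3, d)), rng.normal(size=(5, d))],  # fn 1: two blocks
    [rng.normal(size=(4, d))],                           # fn 2: one block
]

# Aggregate bottom-up: assembly lines -> basic blocks -> functions -> binary.
block_embs = [np.stack([attention_pool(b) for b in fn]) for fn in functions]
fn_embs = np.stack([attention_pool(blocks) for blocks in block_embs])
binary_emb = attention_pool(fn_embs)
print(binary_emb.shape)  # (8,)
```

Each level reuses the same pooling idea, which is what makes the hierarchy composable from assembly lines all the way up to a whole-binary embedding.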
Additionally, shortcuts like skip connections provide easier access to information from earlier layers, for example allowing basic blocks to connect back to their initial features. Finally, the GNN is bidirectional, averaging what it learns from “forward” program execution with its “backward” transposed connections, which helps retain connectivity in sparser graphs. “These combined techniques allow the model to access the initial node features without having to retain them through multiple layers, and split up that information for the final embedding,” explains Allen, now an ML engineer at Meta. The result of this pipeline is ML-optimized data suitable for training binary analysis models.
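The forward/backward averaging and skip connection can be illustrated with a simplified message-passing step over a toy three-block CFG. This is a hedged sketch using plain mean aggregation; the actual model uses learned attention weights:

```python
import numpy as np

def bidirectional_layer(X: np.ndarray, A: np.ndarray, X0: np.ndarray) -> np.ndarray:
    """One simplified message-passing step:
    - the forward pass follows CFG edges (A), the backward pass their
      transpose (A.T), and the two directions are averaged;
    - a skip connection adds the initial node features X0 back in."""
    deg_f = np.clip(A.sum(axis=1, keepdims=True), 1, None)
    deg_b = np.clip(A.T.sum(axis=1, keepdims=True), 1, None)
    fwd = (A @ X) / deg_f    # mean over successor blocks
    bwd = (A.T @ X) / deg_b  # mean over predecessor blocks
    return 0.5 * (fwd + bwd) + X0

# Tiny linear CFG: block 0 -> block 1 -> block 2
A = np.array([[0, 1, 0],
              [0, 0, 1],
              [0, 0, 0]], dtype=float)
X0 = np.eye(3)  # one-hot initial block features
X = bidirectional_layer(X0, A, X0)
```

The skip connection guarantees that each block’s initial features survive the layer unchanged, matching Allen’s point about accessing initial node features without carrying them through every layer.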
Two complementary open-source tools are key to this pipeline. Developed for this research, CAP (Compile. Analyze. Prepare.) generates large-scale binary datasets from source code examples; BinCFG then parses compiler outputs, tokenizes and normalizes the binary data into assembly lines, and converts the data into ML-ready formats. “Data preparation, especially when working with assembly lines, is supremely important,” Allen points out. “These tools automate much of the tedious work of data preparation for binary analysis tasks.”
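To show the kind of normalization involved, here is a hypothetical tokenizer that replaces compilation-specific constants with placeholder tokens so that semantically identical instructions map to identical token sequences. The scheme below is an illustrative assumption; BinCFG’s actual tokenization rules may differ:

```python
import re

def normalize_asm(line: str) -> list[str]:
    """Tokenize one assembly line, normalizing values that vary between
    compilations (addresses, immediates) into a shared <imm> token."""
    line = line.strip().lower()
    line = re.sub(r"0x[0-9a-f]+", "<imm>", line)  # hex addresses/constants
    line = re.sub(r"\b\d+\b", "<imm>", line)      # decimal constants
    return re.split(r"[,\s]+", line)

print(normalize_asm("mov eax, 0x4005d0"))  # ['mov', 'eax', '<imm>']
print(normalize_asm("add rsp, 16"))        # ['add', 'rsp', '<imm>']
```

Without this step, two compilations of the same source would produce superficially different token streams, inflating the vocabulary the downstream model must learn.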
The more data, the better. “The approach Justin developed is data hungry,” Sanders notes. “He trained the models with labeled data from software binaries so they’ll make better predictions of real-world software behavior, even if it’s unexpected, not seen before, or a new version of the software.”
To test the pipeline, Allen scraped a massive dataset from the programming competition website Codeforces. Containing more than 10,000 problem sets and solutions, the dataset provided about 125 million code snippets across dozens of programming languages and millions of authors. According to Allen, running the pipeline at this large scale required about a week of compute time on one of LLNL’s high performance computing systems, and at one point ran on 1,200 nodes simultaneously.
As it turns out, more data is indeed better. “With millions or billions of labeled software samples, Justin showed that we can more accurately model tens of thousands of behaviors and predict changes in those behaviors,” says Sanders. Example prediction tasks include: (1) inferring the problem a compiled binary originated from; (2) grouping binaries by similarity, such as whether they came from the same or different problems; and (3) grouping new binaries not included in the training data. ML models trained on data processed by Allen’s pipeline achieved 92–99% accuracy on these types of tasks.
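Grouping binaries by similarity, as in task (2), typically reduces to comparing learned embeddings. A minimal illustration with made-up embedding vectors and cosine similarity (the vectors and the choice of metric are assumptions for illustration, not values from the paper):

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical binary embeddings: two solutions to the same problem
# should land closer together than a solution to a different problem.
same_a = np.array([0.9, 0.1, 0.2])
same_b = np.array([0.8, 0.2, 0.1])
other  = np.array([0.1, 0.9, 0.8])

assert cosine_sim(same_a, same_b) > cosine_sim(same_a, other)
```

In this setup, grouping previously unseen binaries (task 3) needs no retraining: new samples are embedded and compared against existing clusters with the same similarity measure.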
Analogous to a virus-scanning application, the pipeline can be used in installation workflows for both open-source and commercial software, and is suitable for large computing centers and even classified systems. Notes Allen, “The models can generalize to new binary analysis tasks and benchmarks, and these tools are well documented and extensible to new data types, programming languages, analysis paradigms, and instruction sets.”
Source: Holly Auten, LLNL