Covering Scientific & Technical AI | Saturday, January 18, 2025

Anthropic Breaks Open the Black Box 

One of the largest hurdles to trustworthy and responsible AI is the concept of the black box, and Anthropic just took a big step towards opening that box.

For the most part, humans aren’t able to understand how AI systems output answers. We know how to feed these models large amounts of data, and we know that the model can take this data and find patterns in it. But exactly how those patterns form and correspond to the output of answers is something of a mystery.

For a world increasingly relying on AI tools for important decisions, explaining those decisions is of the utmost importance. Anthropic’s recent research into the topic is shedding much-needed light on how AI systems work and how we can build toward more trustworthy AI models.

Anthropic chose Claude 3.0 Sonnet – a version of the company’s Claude 3 language model – to learn more about the black box phenomenon. Previous work by Anthropic had already discovered patterns in neuron activations that the company calls “features.” That earlier work used a technique called “dictionary learning” to isolate features that recur across many different contexts.

“Any internal state of the model can be represented in terms of a few active features instead of many active neurons,” the press release from Anthropic said. “Just as every English word in a dictionary is made by combining letters, and every sentence is made by combining words, every feature in an AI model is made by combining neurons, and every internal state is made by combining features.”
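To make the dictionary analogy concrete, here is a minimal toy sketch of an internal state built from a few active features. It is not Anthropic's training procedure: the feature names, the four-neuron directions, and the activation strengths are all made up for illustration.

```python
# Toy illustration only: three hypothetical features, each a direction
# over four hypothetical "neurons". None of these numbers come from Claude.
FEATURES = {
    "golden_gate_bridge": [1.0, 0.0, 0.5, 0.0],
    "immunology":         [0.0, 1.0, 0.0, 0.5],
    "code_bug":           [0.0, 0.0, 1.0, 1.0],
}

def reconstruct(activations):
    """Sum the feature directions, each weighted by its activation strength."""
    n_neurons = len(next(iter(FEATURES.values())))
    state = [0.0] * n_neurons
    for name, strength in activations.items():
        for i, weight in enumerate(FEATURES[name]):
            state[i] += strength * weight
    return state

# A sparse code: only two of the three features are active.
sparse_code = {"golden_gate_bridge": 2.0, "code_bug": 0.5}
state = reconstruct(sparse_code)
print(state)  # [2.0, 0.0, 1.5, 0.5]
```

The point of the sketch is that two numbers (the active features) describe the same state as four neuron values. In a real model the dictionary holds millions of features and the state has thousands of dimensions.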

In October 2023, Anthropic reported success in applying dictionary learning to a very small language model, but this most recent work scaled the technique up to the vastly larger Claude model. After overcoming some impressive engineering challenges, the Anthropic team was able to extract millions of features from the middle layer of Claude 3.0 Sonnet – which the company calls the “first ever detailed look inside a modern, production-grade large language model.”

Anthropic mapped features corresponding to entities such as the city of San Francisco, chemical elements like lithium, scientific fields like immunology, and more. These features are also multimodal and multilingual, which means they respond to images of a given entity as well as its name or description in a variety of languages. Claude even had more abstract features, responding to things like bugs in computer code or discussions of gender bias.

What’s even more amazing is that Anthropic’s engineers were able to measure the “distance” between features. For instance, by looking near the “Golden Gate Bridge” feature, they found features for Alcatraz Island, the Golden State Warriors, California Governor Gavin Newsom, and the 1906 earthquake.
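One common way to define such a “distance” is cosine similarity between feature directions. The sketch below assumes that measure and uses invented three-dimensional vectors; the article does not specify the exact metric, and the helper names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_features(query, directions, k=2):
    """Return the k feature names most similar to the query feature."""
    q = directions[query]
    scored = [
        (cosine_similarity(q, vec), name)
        for name, vec in directions.items()
        if name != query
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]

# Made-up directions, chosen so the Bay Area features point the same way.
DIRECTIONS = {
    "golden_gate_bridge": [0.9, 0.1, 0.0],
    "alcatraz_island":    [0.8, 0.2, 0.1],
    "1906_earthquake":    [0.7, 0.0, 0.3],
    "immunology":         [0.0, 0.1, 0.9],
}

neighbors = nearest_features("golden_gate_bridge", DIRECTIONS, k=2)
print(neighbors)  # ['alcatraz_island', '1906_earthquake']
```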

A map of the features near an "Inner Conflict" feature, including clusters related to balancing tradeoffs, romantic struggles, conflicting allegiances, and catch-22s. Credit: Anthropic

Even at higher levels of conceptual abstraction, Anthropic found that the internal organization within Claude corresponds to the human understanding of similarity.

However, Anthropic also made a discovery that could prove immensely important in the AI era – they were able to manipulate these features and artificially amplify or suppress them to change Claude’s responses.

When the “Golden Gate Bridge” feature was amplified, Claude’s answer to the question “What is your physical form?” changed dramatically. Before, Claude would have responded with something like this: “I have no physical form, I am an AI model.” After the amplification, Claude would respond with something like this: “I am the Golden Gate Bridge… my physical form is the iconic bridge itself…” In fact, Claude became obsessed with the bridge and would bring it up in answers to questions that weren’t even remotely relevant to it.
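Amplifying or suppressing a feature can be pictured as pushing the model's internal state along that feature's direction. The sketch below is a simplified illustration under that assumption, with a hypothetical unit-length direction and a made-up state; it does not reproduce how Anthropic intervened on Claude.

```python
def feature_activation(state, direction):
    """Projection of the state onto a unit-length feature direction."""
    return sum(s * d for s, d in zip(state, direction))

def steer(state, direction, target):
    """Shift the state along the direction until its activation equals target."""
    delta = target - feature_activation(state, direction)
    return [s + delta * d for s, d in zip(state, direction)]

# Hypothetical unit-length direction for a "Golden Gate Bridge" feature.
bridge_direction = [0.6, 0.8, 0.0]
state = [1.0, 0.0, 2.0]

amplified = steer(state, bridge_direction, target=10.0)   # feature forced high
suppressed = steer(state, bridge_direction, target=0.0)   # feature switched off

print(feature_activation(state, bridge_direction))
print(feature_activation(amplified, bridge_direction))
print(feature_activation(suppressed, bridge_direction))
```

Only the component along the chosen direction changes; the rest of the state (here the third coordinate) is left untouched, which is why a single intervention can shift behavior so selectively.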

However, the features that Anthropic identified weren’t all as harmless as the Golden Gate Bridge. They also found features connected to:

  • Capabilities with misuse potential such as code backdoors and the development of biological weapons
  • Different forms of bias such as gender discrimination and racist claims about crime
  • Potentially problematic AI behaviors such as power-seeking, manipulation, and secrecy

Another area of concern that Anthropic addressed is sycophancy, or the tendency of models to provide responses that match user beliefs rather than the truth. The team studying Claude found a feature associated with sycophantic praise. When the team set the “sycophantic praise” feature to a high value, Claude responded to overconfident users with praise and compliments rather than correcting objectively false claims.

Anthropic is quick to point out that the existence of this feature does not mean that Claude is inherently sycophantic. Rather, the company states that the feature means the model can be manipulated to be sycophantic.

AI tools are just that – tools. They are not inherently good or evil; they simply do what we tell them. That said, this research from Anthropic clearly shows that AI tools can be manipulated and distorted to provide a wide variety of responses regardless of their basis in reality. Additional research and public awareness are the only ways to ensure that these tools work for us, and not the other way around.

AIwire