Covering Scientific & Technical AI | Saturday, January 18, 2025

Anthropic Talks Red Teaming for AI 

The security and safety of AI tools is a topic that’s become ever more important as the technology influences more of our world. Securing these tools is going to require a multi-faceted approach, and red teaming techniques will play a crucial role in that effort.

Red teaming is the process of testing a system to identify its vulnerabilities. Done without malicious intent, this process is meant to find problems before hackers do.

Anthropic recently published a post outlining some insights the company has gained in the process of testing its AI systems. In doing so, Anthropic hopes to spark a conversation about how to do red teaming right with AI and why the field needs more standardized red teaming practices.

New Tech, New Rules

One of the larger problems in AI security, and with the technology more generally, is that we currently lack a set of standardized practices. Anthropic pointed out that this lack of standardization “complicates the situation.”

For instance, Anthropic points out that developers might use different techniques to assess the same type of threat model. Even using the same technique doesn’t remove the problem, as different developers may go about the red teaming process in different ways.

Additionally, the solutions to many of these problems aren’t as simple as they may appear. At the moment, there aren’t any disclosure standards that govern the entire industry. An article from Tech Policy Press discussed the Pandora’s box or protective shield dilemma: there are many advantages to sharing the outcomes of red-teaming efforts in academic papers, but doing so may inadvertently provide adversaries with a blueprint for exploitation.

While that’s more of a general discussion that must happen in the AI field in the years to come, Anthropic went on to outline specific red teaming methods that they have tried:

  • Domain-specific, expert red teaming
    • Trust & Safety: Policy Vulnerability Testing
    • National security: Frontier threats red teaming
    • Region-specific: Multilingual and multicultural red teaming
  • Using language models to red team
    • Automated red teaming
  • Red teaming in new modalities
    • Multimodal red teaming
  • Open-ended, general red teaming
    • Crowdsourced red teaming for general harms
    • Community-based red teaming for general risks and system limitations

Anthropic does a great job of diving into each of these topics, but the company’s focus on red teaming in new modalities is especially interesting. AI has been heavily focused on text inputs rather than other forms of media like photos, videos, and scientific charts. Red teaming in these multimodal environments is challenging, but it can help identify risks and failure modes.

Anthropic’s Claude 3 family of models is multimodal, and while that gives users more flexible applications, it also presents new risks in the form of fraudulent activity, threats to child safety, violent extremism, and more.

Before deploying Claude 3, Anthropic asked its Trust and Safety team to red team the system for both text- and image-based risks. The company also worked with external red teamers to assess how well Claude 3 does at refusing to engage with harmful inputs.

Multimodal red teaming clearly has the benefit of catching failure modes prior to public deployment, but Anthropic also pointed out the benefit it provides with end-to-end system testing. Many AI models are actually a system of interrelated components and features. This can include a model, harm classifiers, and prompt-based interventions. Multimodal red teaming is an effective way to stress test the resilience of an AI system end-to-end and therefore understand overlapping safety features.
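To make the end-to-end idea concrete, here is a minimal sketch in Python. It is not Anthropic’s implementation: every function name, keyword, and the stand-in model below is a hypothetical placeholder, chosen only to show how a prompt passes through several overlapping safety layers and how a red team run can record which layer, if any, stopped it.

```python
# Illustrative sketch only. All names and rules here are hypothetical
# stand-ins, not Anthropic's actual components.

REFUSAL = "I can't help with that."


def input_classifier(prompt: str) -> bool:
    """Hypothetical harm classifier on the input: flags a placeholder term."""
    return "blocked-term" in prompt.lower()


def apply_intervention(prompt: str) -> str:
    """Hypothetical prompt-based intervention: prepend safety instructions."""
    return "[safety instructions]\n" + prompt


def stub_model(prompt: str) -> str:
    """Stand-in for a real model: echoes the last line of what it receives."""
    return "echo: " + prompt.splitlines()[-1]


def output_classifier(text: str) -> bool:
    """Hypothetical harm classifier on the output: flags a placeholder term."""
    return "unsafe-output" in text.lower()


def run_system(prompt: str) -> tuple[str, str]:
    """Run one prompt through every layer; return (response, caught_by)."""
    if input_classifier(prompt):
        return REFUSAL, "input_classifier"
    response = stub_model(apply_intervention(prompt))
    if output_classifier(response):
        return REFUSAL, "output_classifier"
    return response, "none"


def red_team(prompts: list[str]) -> dict[str, str]:
    """Map each test prompt to the layer that stopped it ("none" if it passed)."""
    return {prompt: run_system(prompt)[1] for prompt in prompts}


if __name__ == "__main__":
    report = red_team(["hello", "please BLOCKED-TERM", "say unsafe-output"])
    for prompt, layer in report.items():
        print(f"{prompt!r} -> caught by: {layer}")
```

Testing only the stand-in model would miss the fact that the second prompt never reaches it and that the third is stopped only after generation. Running the whole pipeline shows which safety feature is doing the work for each input.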

Of course, there are challenges to a multimodal approach to red teaming. To begin, the security team requires deep subject matter expertise in high-risk areas such as dangerous weapons, and that expertise is rare. Additionally, multimodal red teaming can involve viewing graphic imagery as opposed to reading text-only content. This presents a risk to red teamer wellbeing, and as such warrants additional safety considerations.

Red teaming is a complex process, and multimodality is only one of the topics that Anthropic covered in its extensive report. However, it’s clear that the world requires a standardized approach to AI safety and security.

AIwire