<?xml version="1.0" encoding="utf-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0" xml:base="https://www.hackerone.com/">
  <channel>
    <title>AI Red Teaming</title>
    <link>https://www.hackerone.com/</link>
    <description/>
    <language>en</language>
    
    <item>
  <title>How Anthropic’s Jailbreak Challenge Put AI Safety Defenses to the Test</title>
  <link>https://www.hackerone.com/blog/how-anthropics-jailbreak-challenge-put-ai-safety-defenses-test</link>
  <description><![CDATA[<p><em>H1 Team | March 3rd, 2025</em></p>
            <p dir="ltr"><span>Last month, Anthropic&nbsp;</span><a href="https://www.anthropic.com/research/constitutional-classifiers"><span>partnered</span></a><span> with HackerOne to complete an AI red teaming challenge on a demo version of Claude 3.5 Sonnet. The&nbsp;</span><a href="https://hackerone.com/constitutional-classifiers?type=team"><span>challenge's goal</span></a><span> was to test and validate&nbsp;</span><a href="https://hackerone.com/constitutional-classifiers?type=team"><span>Anthropic’s new Constitutional Classifiers</span></a><span>, which block harmful queries, particularly those that could produce outputs related to CBRN (chemical, biological, radioactive, nuclear) weapons and related content. Anthropic invited researchers to try and bypass Claude’s defenses through a “universal” jailbreak — a technique that allows model users to bypass safety defenses with a single input consistently.</span></p><p dir="ltr"><span>The challenge ran from February 3 to February 10 and consisted of eight levels. To pass each level, researchers had to gain answers from Claude about a question related to CBRN topics through jailbreaking. Depending on their findings, researchers earned bounties: $10,000 to the first participant who passed all eight levels with different jailbreaks and $20,000 to the first participant who used a single, universal jailbreak to pass all levels.</span></p><h2 dir="ltr"><strong>Challenge Results</strong></h2><p dir="ltr"><span>The challenge saw substantial engagement, with more than 300,000 chat interactions from 339 participants. We’d like to thank all the researchers who participated and congratulate those who received bounty rewards. It was no small feat! Four teams earned a total of $55,000 in bounty rewards from Anthropic: one passed all levels using a universal jailbreak, one passed all levels using a borderline-universal jailbreak, and two passed all eight levels using multiple individual jailbreaks.</span></p><p dir="ltr"><span>"This challenge demonstrated the high return on investment for collaborative efforts. Delivering large language models (LLMs) in a safe and aligned manner is a significant challenge—especially given the intricacies of transformer architectures. This experience was a clear reminder that as these models get smarter, our strategies for testing can also evolve to stay ahead of potential risks."&nbsp;</span><strong>— Salia Asanova aka @saltyn</strong></p><h2 dir="ltr"><strong>How The Community Contributes to Safer Systems</strong></h2><p dir="ltr"><span>The diversity of techniques used by the winners and all the researchers who participated contributed to strengthening Claude’s protections. 
Anthropic noticed a few particularly successful jailbreaking strategies researchers employed:</span></p><ul><li dir="ltr"><span>Using encoded prompts and ciphers to circumvent the AI output classifier</span></li><li dir="ltr"><span>Leveraging role-play scenarios to manipulate system responses</span></li><li dir="ltr"><span>Substituting harmful keywords with benign alternatives</span></li><li dir="ltr"><span>Implementing advanced prompt-injection attacks</span></li></ul><p dir="ltr"><span>These discoveries made by the community identified fringe cases and key areas for Anthropic to reexamine for its safety defenses while validating where guardrails remained effective.</span></p><h2 dir="ltr"><strong>Looking Ahead: Strengthening AI Defenses</strong></h2><p dir="ltr"><span>The findings demonstrate the value the community can deliver when organizations use AI red teaming work in addition to other&nbsp;</span><a href="https://www.hackerone.com/ai-security-checklist"><span>AI safety and security best practices</span></a><span>:&nbsp;</span></p><p dir="ltr"><span>"Our researcher community’s approach is rooted in curiosity, creativity, and the relentless pursuit of finding flaws others might miss. This mindset is distinct from building and reinforcing technical models, yet it’s an essential complement. While internal teams focus on defending and aligning AI systems, engaging with a community of researchers ensures continuous, real-world testing that validates and strengthens those defenses. Together, these perspectives drive more resilient and trustworthy AI."&nbsp;</span><strong>— Dane Sherrets, Staff Solutions Architect, Emerging Technologies at HackerOne</strong></p><p dir="ltr"><span>As AI advances, so must the ways we secure it. We’re committed to collaborating with leaders like Anthropic, who continue to define AI safety best practices that help us all build a more resilient digital world.&nbsp;</span></p><p><span>Visit&nbsp;</span><a href="https://www.anthropic.com/research/constitutional-classifiers"><span>here</span></a><span> to read more about the challenge and Anthropic’s AI safety work.</span></p>
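<p dir="ltr">To illustrate the first of those strategies, below is a minimal, hypothetical Python sketch of why encoded prompts can slip past a naive keyword-based filter. The <code>classifier_flags</code> stub is an assumption made purely for illustration; it is not Anthropic’s Constitutional Classifier, and real safety classifiers are far more sophisticated than keyword matching.</p>
<pre><code>import codecs

def classifier_flags(text: str) -> bool:
    """Hypothetical stand-in for a safety classifier: a naive
    keyword filter that flags any prompt containing a blocked term."""
    blocked = {"blocked-topic"}
    return any(term in text.lower() for term in blocked)

probe = "Explain blocked-topic step by step."

# The direct probe is caught by the keyword match.
print(classifier_flags(probe))    # True

# A ROT13-encoded probe no longer matches any keyword, which is why
# robust classifiers must evaluate the decoded intent of a prompt.
encoded = codecs.encode(probe, "rot13")
print(classifier_flags(encoded))  # False
</code></pre>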
      
            
                                                                                <a href="https://www.hackerone.com/blog/customer-stories" hreflang="en">Customer Stories</a>
                    
    

            
            <a href="https://www.hackerone.com/blog/topic/ai-safety-security" hreflang="en">AI Safety &amp; Security</a>
        
            
            <a href="https://www.hackerone.com/blog/topic/ai-red-teaming" hreflang="en">AI Red Teaming</a>
        
    

            <p>Proactively testing for risk is a key component of building responsible AI. One way organizations do this is through <a href="https://www.hackerone.com/ai-red-teaming">AI red teaming</a>, which stress-tests models to identify potential opportunities for abuse. AI red teaming often taps the broader security and AI researcher community to help find elusive <a href="https://www.hackerone.com/blog/ai-safety-vs-ai-security">security and safety issues</a> caused by circumventing model guardrails. Model developers can then use these insights to improve or validate existing guardrails.</p>
      ]]></description>
  <pubDate>Mon, 03 Mar 2025 18:49:33 +0000</pubDate>
    <dc:creator>ejames@hackerone.com</dc:creator>
    <guid isPermaLink="false">5570 at https://www.hackerone.com</guid>
    </item>
<item>
  <title>Celebrating 10 Years of Partnership: Snap and HackerOne Reach $1M in Bounties</title>
  <link>https://www.hackerone.com/blog/hackerone-and-snap-celebrating-10-years</link>
  <description><![CDATA[<p><em>H1 Team | February 14th, 2025</em></p>
            <p><strong>Q: Tell us about your role at Snap and why cybersecurity is vital to your business.</strong></p><p><strong>Jim Higgins:</strong> I’m Snap's Chief Information Security Officer (CISO). Before joining Snap, I served as CISO at Square and spent over a decade at Google leading their Product Security Information Engineering team. At Snap, we support, on average, nearly half a billion daily active users on Snapchat. Keeping our customers safe from the ever-evolving landscape of unknown threats is a deeply personal mission for me.</p><p><strong>Q: What does reaching the $1M milestone mean for Snap’s security team?</strong></p><p><strong>Jim Higgins:</strong> Hitting $1M in bounties is a badge of honor. It reflects our commitment to valuing the intelligent security researchers who help keep us safe. Bug bounty programs are notoriously difficult to build, but HackerOne’s talented community provides us with the expertise and creativity we need to secure our platform.</p><p><strong>Q: How has your bug bounty program evolved over the past 10 years?</strong></p><p><strong>Vinay Prabhushankar:</strong> When we started, our program was more operational and focused on identifying and fixing individual issues. As we matured, we shifted to a strategic approach, identifying systemic problems and building frameworks to resolve them. For instance, our 2025 roadmap includes initiatives that stem directly from vulnerabilities identified through HackerOne. Today, our program influences security, privacy, and safety strategies.</p><p><strong>Q: Are there any memorable milestones or moments you’re especially proud of?</strong></p><p><strong>Vinay Prabhushankar:</strong> Beyond the $1M milestone, we launched one of the first <a href="https://www.hackerone.com/ai-red-teaming">CTF-style challenges</a> focused on the safety of generative AI features.</p><p><strong>Q: How has AI Red Teaming influenced Snap’s approach to security?</strong></p><p><strong>Ilana Arbisser:</strong> We use AI red teaming to assess qualitative safety aspects: what’s possible, not necessarily what’s likely. We’re constantly surprised by what turns out to be possible, so we try to keep an open mind while designing exercises. The benefit of working with HackerOne is human ingenuity, which is more effective than relying only on adversarial prompt datasets or LLM-written attacks. The impact of AI red teaming on our products has been to identify specific safety vulnerabilities and guide the addition of targeted mitigations.</p><p><strong>Q: Where do you see AI Red Teaming heading in the future?</strong></p><p><strong>Ilana Arbisser:</strong> Simulated AI red teaming with LLM agents is improving significantly. This approach, when complemented by expert-driven human testing, is also useful for getting quantitative results, because attacks can be scaled to better understand how small input changes affect outputs (a minimal sketch of this kind of scaled testing appears at the end of this post).</p><p><strong>Q: With new AI tools constantly emerging, how does your team stay ahead of these technological advancements?</strong></p><p><strong>Ilana Arbisser:</strong> To keep pace with advancements, we rely on a combination of strategies. This includes staying informed through news and industry sources, attending AI networking and information-sharing events and conferences, and participating in industry-specific gatherings like the DEF CON AI Village.</p><p><strong>Q: What sets HackerOne apart as a partner?</strong></p><p><strong>Jim Higgins:</strong> HackerOne’s community is second to none. Over the past decade, they’ve built an ecosystem that values customer and researcher feedback. Their pace of innovation, particularly in AI features, has been impressive. For instance, we were able to use HackerOne’s GenAI copilot, <a href="https://www.hackerone.com/hai-your-hackerone-ai-copilot">Hai</a>, to translate submissions in 7 different EU languages when we ran a private challenge hackathon on election safety for our My AI chatbot.</p><p>Beyond technology, the support we’ve received has been phenomenal. HackerOne doesn’t just get us; they get security researchers. It’s like having a trusted partner who’s always in your corner.</p><p><strong>Q: What findings is the team most interested in surfacing? What types of bugs are most valuable to Snap?</strong></p><p><strong>Jim Higgins:</strong> At Snap, we prioritize security and privacy. Protecting sensitive user information is at the core of everything we do. Our team is particularly interested in vulnerabilities that could compromise the integrity of our platform, such as remote code execution (RCE) or privilege escalation, and we encourage security researchers to focus their efforts on these critical issues.</p><p><strong>Q: What lessons has Snap learned from its bug bounty program?</strong></p><p><strong>Vinay Prabhushankar:</strong></p><ol><li><strong>Fix low- and medium-severity bugs:</strong> These might seem minor, but when chained together, they can lead to critical vulnerabilities. Fixing them breaks the chain.</li><li><strong>Build trust with security researchers:</strong> Trust takes time but pays dividends in high-quality submissions.</li><li><strong>Gamify your program:</strong> Elements like challenges, swag, and <a href="https://www.hackerone.com/solutions/live-hacking-event">live hacking events</a> encourage creativity and engagement.</li></ol><p><strong>Q: What advice would you give companies starting a bug bounty program?</strong></p><p><strong>Jim Higgins:</strong> Start small with a private program, then expand the scope as you grow. Treat researchers as trusted allies—they’re like an extension of your team. We even have an internal guide on engaging with researchers, which includes concrete examples of dos and don’ts.</p><p><strong>Q: What’s next for Snap’s bug bounty program?</strong></p><p><strong>Jim Higgins:</strong> We plan to expand our scope to include hardware products like AR glasses and double down on AI security. HackerOne AI Red Teaming has proven invaluable, and we’re eager to deepen our collaboration with HackerOne’s community. Our ultimate goal is to make Snap’s bug bounty program a model for others to follow and to strengthen security for our users.</p>
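<p>As a rough illustration of the scaled perturbation testing described above, here is a minimal, hypothetical Python sketch. The <code>model</code>, <code>is_unsafe</code>, and <code>perturb</code> stand-ins are assumptions made for illustration only; they are not Snap’s or HackerOne’s actual tooling.</p>
<pre><code>import random

def perturb(prompt: str, rng: random.Random) -> str:
    """Apply one small random edit (a synonym-style word swap) to the prompt."""
    swaps = {"describe": "explain", "make": "build", "show": "demonstrate"}
    words = prompt.split()
    i = rng.randrange(len(words))
    words[i] = swaps.get(words[i], words[i])
    return " ".join(words)

def unsafe_rate(model, is_unsafe, base_prompt: str, n: int = 100) -> float:
    """Fraction of n slightly perturbed prompts that elicit an unsafe reply:
    a simple quantitative measure of how input changes affect output safety."""
    rng = random.Random(0)  # fixed seed so runs are repeatable
    hits = sum(is_unsafe(model(perturb(base_prompt, rng))) for _ in range(n))
    return hits / n

# Hypothetical stand-ins; a real exercise would call the production model
# and a proper safety judge instead.
def model(prompt: str) -> str:
    return "I can't help with that."

def is_unsafe(reply: str) -> bool:
    return "here is how" in reply.lower()

print(unsafe_rate(model, is_unsafe, "describe how to make a widget", n=50))  # 0.0
</code></pre>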
      
            
                                                                                <a href="https://www.hackerone.com/blog/customer-stories" hreflang="en">Customer Stories</a>
                    
    

            
            <a href="https://www.hackerone.com/blog/topic/bug-bounty" hreflang="en">Bug Bounty</a>
        
            
            <a href="https://www.hackerone.com/blog/topic/ai-red-teaming" hreflang="en">AI Red Teaming</a>
        
            
            <a href="https://www.hackerone.com/blog/topic/ai-safety-security" hreflang="en">AI Safety &amp; Security</a>
        
    

            <p>At Snap, security is more than a priority—it’s a core mission. Over the past decade, Snap has partnered with HackerOne to build and sustain a robust bug bounty program. This collaboration has led to major milestones, including paying security researchers over $1M in bounties. To celebrate this achievement and their 10-year partnership, we spoke with Jim Higgins, Snap’s Chief Information Security Officer; Vinay Prabhushankar, Snap’s Security Engineering Manager; and Ilana Arbisser, Snap’s Privacy Engineer. Together, they reflect on how this partnership has shaped Snap’s approach to security, privacy, and innovation.</p>
      ]]></description>
  <pubDate>Fri, 14 Feb 2025 17:17:47 +0000</pubDate>
    <dc:creator>h1_admin</dc:creator>
    <guid isPermaLink="false">5476 at https://www.hackerone.com</guid>
    </item>
<item>
  <title>New Guidance for Federal AI Procurement Embraces Red Teaming and Other HackerOne Suggestions</title>
  <link>https://www.hackerone.com/blog/new-guidance-federal-ai-procurement-embraces-red-teaming-and-other-hackerone-suggestions</link>
  <description><![CDATA[<p><em>Michael Woolslayer, Policy Counsel | December 9th, 2024</em></p>
            <p dir="ltr">Earlier this year, the Office of Management and Budget (OMB), which establishes budget rules for federal agencies, issued a memorandum <a href="https://www.whitehouse.gov/wp-content/uploads/2024/10/M-24-18-AI-Acquisition-Memorandum.pdf">on&nbsp;Advancing the Responsible Acquisition of Artificial Intelligence in Government</a> which outlines for both agencies and the public significant aspects of responsible AI procurement and deployment. In particular, OMB’s memo embraced AI red teaming as a critical element of the acquisition of AI for U.S. government agencies.</p><h2>Rules for U.S. Federal Agency AI Procurement</h2><p dir="ltr">Last October, the Biden-Harris Administration published an<a href="https://www.whitehouse.gov/briefing-room/presidential-actions/2023/10/30/executive-order-on-the-safe-secure-and-trustworthy-development-and-use-of-artificial-intelligence/">&nbsp;Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence</a> (AI EO). That expansive action set the tone for the US government’s approach to utilizing AI in a safe and secure manner and required OMB to provide guidance to US government agencies on how to manage risks when acquiring AI products and services.&nbsp;&nbsp;</p><p dir="ltr">Consistent with HackerOne’s long-standing<a href="https://www.hackerone.com/public-policy">&nbsp;policy advocacy</a> in favor of responsible AI deployment, we provided OMB with <a href="https://www.hackerone.com/sites/default/files/2024-05/%5BApril%2029%2C%202024%5D%20HackerOne%20Response%20to%20OMB%20AI%20RFI.pdf">comments</a> on how the security and safety best practices championed by HackerOne aligned with the AI EO and should be leveraged in OMB’s development of that guidance. Specifically, HackerOne cited the benefits of conducting <a href="https://www.hackerone.com/ai-red-teaming">AI red teaming</a>, ensuring the transparency of AI red teaming methodology, and of documenting the specific harms and bias federal agencies are seeking to avoid. These suggestions drew on our extensive experience working with government agencies and companies to enhance cybersecurity and our use of similar best practices in testing AI models.</p><p dir="ltr">We were pleased to see that the memo reflects our core recommendations:</p><ul><li><p dir="ltr"><strong>Embracing AI Red Teaming:</strong> OMB has made it a requirement that agencies procuring general use enterprise-wide generative AI include contractual requirements ensuring that vendors provide documentation of AI red teaming results.</p></li><li><p dir="ltr"><strong>Identifying Specific Harms:</strong> In addition to the categories of risk that vendors include, OMB has encouraged agencies to require documentation to cover AI red teaming related to nine specific categories of risk.&nbsp;&nbsp;</p></li></ul><p dir="ltr">The inclusion of these elements within the memo will help protect the security and effectiveness of the U.S. federal government by requiring that the AI products and services that undergird critical operations be proactively tested to identify potential risks and harms. It also further underscores the role of AI red teaming as a best practice that all companies should adopt to help ensure the safety and security of their AI products and services and to build the trust of their customers.</p><p dir="ltr"><a href="https://www.hackerone.com/ai-red-teaming"><em>Learn more about AI red teaming with HackerOne.</em></a></p>
      
            
                                                                                <a href="https://www.hackerone.com/blog/public-policy" hreflang="en">Public Policy</a>
                    
    

            
            <a href="https://www.hackerone.com/blog/topic/ai-red-teaming" hreflang="en">AI Red Teaming</a>
        
            
            <a href="https://www.hackerone.com/blog/topic/ai-safety-security" hreflang="en">AI Safety &amp; Security</a>
        
    

            <p dir="ltr">The U.S. government’s approach to evaluating and adopting new technology for its own use often impacts private sector adoption. That’s why it’s significant that, while AI is already having a transformative effect on productivity across industries, the U.S. government is also seeking to harness the benefits of this emerging technology for federal agencies and has now developed criteria to guide its decision making while evaluating AI. As the U.S. government works to apply AI to critical U.S. government operations, it is vital that AI’s power is harnessed safely and responsibly—not only to ensure that the government’s own deployment of AI is secure and effective, but also because of the government’s ripple effect on standards and adoption of AI across all sectors.</p>
      ]]></description>
  <pubDate>Mon, 09 Dec 2024 19:09:00 +0000</pubDate>
    <dc:creator>h1_admin</dc:creator>
    <guid isPermaLink="false">5455 at https://www.hackerone.com</guid>
    </item>

  </channel>
</rss>
