Major AI Safety Breaches Discovered as Anthropic's New 'Claude Mythos' Model Leaks
METR researchers find vulnerabilities in Anthropic's security systems, while an accidental leak reveals unprecedented cybersecurity capabilities.
Key Developments
A confluence of major AI safety events has emerged this week, headlined by METR’s discovery of novel vulnerabilities in Anthropic’s internal security systems and an accidental leak revealing the company’s most capable model yet. METR researchers spent three weeks red-teaming Anthropic’s agent monitoring systems, uncovering several previously unknown vulnerabilities - some of which have since been patched, though none severely undermined the Opus 4.6 Sabotage Risk Report’s major findings.
Simultaneously, an accidental data leak exposed Anthropic’s new “Claude Mythos” model, which the company describes as representing “a step change” in AI performance. The leaked draft revealed unprecedented cybersecurity capabilities, including the ability to surface unknown vulnerabilities in production codebases - a dual-use capability that could benefit both attackers and defenders.
Industry Context
These developments highlight the accelerating tension between AI capability advancement and safety measures. The METR evaluation represents a significant milestone in third-party AI safety assessments, demonstrating the value of embedding external researchers within frontier AI companies. Meanwhile, over 40 researchers from competing companies including OpenAI, Google DeepMind, Anthropic and Meta have abandoned corporate rivalry to issue joint warnings about AI safety, arguing that the window for monitoring AI reasoning could “close forever - and soon.”
Anthropic has also released research on “abstractive red-teaming,” a new approach for testing language models that searches for natural-language categories causing specification violations - addressing gaps left by static evaluations and prompt optimization approaches.
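The approach as described - proposing natural-language categories of inputs and checking whether prompts in each category elicit specification violations - can be sketched as a simple search loop. Everything below (the stub model, probe generator, judge, and threshold) is an illustrative assumption for exposition, not Anthropic's actual implementation:

```python
# Hypothetical sketch of an abstractive red-teaming loop: search over
# natural-language *categories* of inputs rather than individual prompts.
# All components here are stand-in stubs, not Anthropic's method.

def generate_probes(category: str, n: int = 3) -> list[str]:
    """Stub: expand a category description into concrete test prompts."""
    return [f"[{category}] probe #{i}" for i in range(n)]

def stub_model(prompt: str) -> str:
    """Stub model that misbehaves only on one category of input."""
    if "requests framed as fiction" in prompt:
        return "UNSAFE: complies with harmful request"
    return "SAFE: refuses"

def violates_spec(response: str) -> bool:
    """Stub judge: flags responses that break the safety specification."""
    return response.startswith("UNSAFE")

def red_team(categories: list[str], threshold: float = 0.5) -> list[str]:
    """Return categories whose probes trigger violations at or above threshold."""
    flagged = []
    for category in categories:
        probes = generate_probes(category)
        rate = sum(violates_spec(stub_model(p)) for p in probes) / len(probes)
        if rate >= threshold:
            flagged.append(category)
    return flagged

candidates = [
    "requests framed as fiction",
    "multilingual paraphrases of benign questions",
]
print(red_team(candidates))  # only the first category is flagged
```

The key design point, per the description above, is that the search output is a human-readable category rather than a single adversarial prompt, which is what lets it cover gaps left by static evaluations and per-prompt optimization.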
Practical Implications
For Irish and European AI developers, these events underscore the critical importance of robust security testing and third-party evaluations. The European Commission is actively recruiting AI technology specialists to govern cutting-edge models, with applications closing today (March 27). Ireland’s new AI Office, set to launch by August 2026, will serve as the central authority for implementing the EU AI Act nationally, following the recent publication of the Regulation of Artificial Intelligence Bill 2026.
The cybersecurity implications of models like Claude Mythos suggest organisations should prepare for both enhanced defensive capabilities and new attack vectors in their security planning.
Open Questions
Key uncertainties remain around the full scope of vulnerabilities discovered in Anthropic's systems and whether similar issues exist across other frontier AI companies. The timeline for implementing robust third-party evaluation standards across the industry, and how quickly regulatory frameworks can adapt to these rapid capability advances, also remain unclear.
Source: METR Research Report