AI Model Performance Timeline
Two charts, two stories. The first tracks MMLU, the dominant knowledge benchmark from 2022 to 2025, across every major OpenAI and Anthropic release. The headline: Claude entered 2023 thirteen points behind GPT-4 and had fully caught up by mid-2024. Both labs now sit near saturation at around 97%. The second chart covers where the frontier is actually contested in May 2026: SWE-bench Pro, Terminal-Bench, OSWorld, and GDPval. Both frontier models now exceed the human baseline on OSWorld desktop navigation, and competition has moved to harder, less-contaminated variants such as SWE-bench Pro, where Anthropic's just-released Opus 4.8 holds the lead. Hover any data point for model details and source notes.
MMLU (Massive Multitask Language Understanding) was the dominant benchmark from 2022–2025, tracking broad knowledge across 57 academic disciplines. Both labs are now near-saturated at 97%+, making it a poor differentiator for current frontier models — but the historical arc is striking.
MMLU measures what models know. These benchmarks measure what they can actually do — write and deploy working code, navigate a computer autonomously, and complete real professional tasks. SWE-bench Verified is now saturating around 88–89% for both labs, so the contested benchmark has moved to SWE-bench Pro. Both frontier models now exceed the human baseline on OSWorld. Hover each bar for model details and source notes.
Benchmark notes: MMLU scores are official where available; recent entries (*) are estimated from the ArtificialAnalysis Intelligence Index and relative positioning. MMLU has been saturated at 97%+ since late 2025 and is no longer the primary frontier differentiator. Capabilities benchmark scores are sourced from Anthropic and OpenAI launch posts, ArtificialAnalysis, TokenMix, and llm-stats.com (May 2026). SWE-bench Verified is saturating (~88–89% for both labs); SWE-bench Pro is now the more representative coding benchmark. Terminal-Bench moved to version 2.1 in May 2026. GDPval-AA is measured in Elo points and is not directly comparable to percentage-based benchmarks. Mythos-class preview models from Anthropic are excluded because they are not generally available.
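To give a rough sense of why Elo points do not map onto percentages, here is a minimal sketch. It assumes GDPval-AA ratings follow the standard Elo logistic curve with a 400-point scale, which the notes above do not confirm. The function name and the example ratings are illustrative, not real scores.

```python
def expected_win_rate(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Standard Elo expected score of A against B (the 400-point scale is an assumption)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Hypothetical ratings, not actual GDPval-AA results.
gap_100 = expected_win_rate(1100, 1000)
gap_0 = expected_win_rate(1000, 1000)

print(f"100-point lead: preferred about {gap_100:.0%} of the time")
print(f"Equal ratings: preferred {gap_0:.0%} of the time")
```

Under that assumption, a rating gap describes how often one model would be preferred over another head to head, and it depends only on the difference between the two ratings. A percentage benchmark instead reports the share of tasks each model solves on its own, so the two kinds of number cannot be placed on one axis.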
For the latest information: llm-stats.com — live benchmark tracking across 50+ evaluations and 20+ API providers.
The Expanding Universe of Large Language Models
A dark-observatory globe that grows the whole model landscape from the 2017 Transformer to today: a curated set of about 100 models, each plotted on the city of its lab, sized by parameters, raised by capability, and coloured by maker, with a playable timeline, usage-flow arcs, and sound.