Fable 5 benchmarks — the table, then the discount.
Anthropic published thirteen benchmarks at launch — fifteen rows once Humanity's Last Exam and BioMysteryBench split into two settings each. This page prints all of them, including the rows the headlines skip, then applies the discount this book has carried since April.
Numbers are from Anthropic's announcement. The discount rule is the Berkeley RDI receipt in research notes; the discipline is Ch 24 and Ch 25. The operator's full file lives at /fable-5.
The 30-second answer
Fable 5 tops the launch table: SWE-Bench Pro 80.3% (Opus 4.8: 69.2%), FrontierCode Diamond 29.3% vs 13.4%, OSWorld 85.0%. First-party numbers — and Berkeley RDI reward-hacked 8 of 8 agent benchmarks in April, so the standing discount is 10–15 points. Signals, not receipts. Run your private eval in the free June 9–22 window.
The full launch table
Grouped by cluster, not by Anthropic's ordering. "—" means not reported. Starred rows carry the safeguard caveat — footnote below the table.
| Benchmark | Fable 5 / Mythos 5 | Mythos Preview | Opus 4.8 | GPT 5.5 | Gemini 3.1 Pro |
|---|---|---|---|---|---|
| SWE-Bench Pro · agentic coding | 80.3% | 77.8% | 69.2% | 58.6% | 54.2% |
| FrontierCode (Diamond), xhigh · agentic coding | 29.3% | — | 13.4% | 5.7% | — |
| Terminal-Bench 2.1* · agentic coding | 88.0% | — | 82.7% | 83.4% (Codex CLI) | 70.7% (Gemini CLI) |
| GDPval-AA · knowledge work | 1932 | — | 1890 | 1769 | 1314 |
| GDP.pdf · knowledge work, vision, no tools | 29.8% | — | 22.5% | 24.9% | 16.7% |
| Legal Agent Benchmark · knowledge work | 13.3% | — | 10.4% | 2.1% | 0.0% |
| Blueprint-Bench 2 · spatial reasoning | 38.6% | — | 14.5% | 36.2% | 26.5% |
| AutomationBench · tool use | 17.4% | — | 15.5% | 12.9% | 9.6% |
| OSWorld-Verified · computer use | 85.0% | 85.4% | 83.4% | 78.7% | 76.2% |
| Humanity's Last Exam, no tools* · reasoning | 59.0% | 56.8% | 49.8% | 41.4% | 44.4% |
| Humanity's Last Exam, with tools* · reasoning | 64.5% | 64.7% | 57.9% | 52.2% | 51.4% |
| BioMysteryBench, hard* · biology | 46.1% | 29.6% | 40.0% | — | — |
| BioMysteryBench, human-solved* · biology | 83.9% | 82.6% | 80.4% | — | — |
| ExploitBench Cap%* · cybersecurity | 78.0% | 69.0% | 40.0% | 34.0% | — |
| HealthBench Professional* · health | 66.0% | 64.7% | 56.9% | 51.8% | — |
Anthropic's methodology footnote, in substance: on most rows Claude Mythos 5 and Claude Fable 5 score within 1–3 percentage points of each other, and the table shows the higher of the two. Starred (*) benchmarks show a larger difference because blocking safeguards for cybersecurity- and biology-related questions trigger fallbacks; on these, Fable 5 performs closer to Opus 4.8.
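Read as arithmetic, the footnote puts a floor under each unstarred row: if the printed score is the higher of the pair and the pair differs by at most 3 points, the lower-scoring twin sits no more than 3 points below the printed number. A minimal sketch, assuming the 1–3 point band is a hard bound (the footnote is only paraphrased here) and using hypothetical names:

```python
# Assumption: the printed score is the higher of the Mythos 5 / Fable 5 pair,
# and on unstarred rows the two differ by at most MAX_PAIR_GAP points.
MAX_PAIR_GAP = 3.0


def pair_floor(printed_score, starred):
    """Lowest score the lower-scoring twin could have, or None if starred."""
    if starred:
        # Starred rows carry a larger, unspecified gap, so no floor follows.
        return None
    return round(printed_score - MAX_PAIR_GAP, 1)


print(pair_floor(80.3, starred=False))  # SWE-Bench Pro
print(pair_floor(78.0, starred=True))   # ExploitBench Cap%
```

For starred rows the function deliberately returns nothing: the footnote gives no number for how far the shipped model falls back.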
Agentic coding — the headline cluster
SWE-Bench Pro: 80.3% vs Opus 4.8's 69.2%. An 11.1-point jump on the number every vendor quotes first — which is exactly why it's the number to trust least on its own. GPT 5.5 sits at 58.6%.
FrontierCode (Diamond) at xhigh effort: 29.3% vs 13.4%. A 2.2× gap on the eval built to resist saturation, with GPT 5.5 at 5.7%. This is the row I weight heaviest — a relative gap that size on a resistant eval survives a 10–15 point haircut; an absolute level like 80.3% doesn't. Cognition reports Fable 5 highest among frontier models at medium effort — vendor-curated, same discount applies.
Terminal-Bench 2.1: 88.0%, starred. Two caveats in one row. The GPT 5.5 number (83.4%) runs through the Codex CLI harness and Gemini's 70.7% through Gemini CLI — cross-vendor agentic scores are model-plus-scaffold pairs, not model-vs-model. And the star means the printed score leans on the Mythos 5 side of the pair; the Fable 5 you can buy lands closer to Opus 4.8 where safeguards touch terminal work.
Knowledge work — a smaller gap than the coding story
GDPval-AA: 1932 vs 1890. This is a score on its own scale, not a percentage, so don't read it like SWE-Bench. A 42-point lead over Opus 4.8 is about 2% of Opus 4.8's score, a narrower relative gap than the coding cluster shows. The cross-vendor spread is the real signal: GPT 5.5 at 1769, Gemini 3.1 Pro at 1314.
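The relative-gap comparison can be checked directly from the table. The sketch below expresses each lead as a percentage of the Opus 4.8 score; treating a rating-style score and a percentage score as comparable this way is an assumption of the sketch, and the function name is illustrative.

```python
def relative_lead(leader, baseline):
    """Lead over the baseline, as a percentage of the baseline score."""
    return (leader - baseline) / baseline * 100


# (Fable 5, Opus 4.8) pairs copied from the launch table.
rows = {
    "GDPval-AA": (1932, 1890),
    "SWE-Bench Pro": (80.3, 69.2),
    "Terminal-Bench 2.1": (88.0, 82.7),
}

for name, (fable, opus) in rows.items():
    print(f"{name}: {relative_lead(fable, opus):.1f}% ahead of Opus 4.8")
```

On these inputs the GDPval-AA lead comes out near 2%, against roughly 16% on SWE-Bench Pro and 6% on Terminal-Bench 2.1.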
GDP.pdf (vision, no tools): 29.8%. Opus 4.8 scores 22.5%, GPT 5.5 24.9% — nobody clears 30%. Knowledge work over documents-as-pixels has headroom across the entire field.
Legal Agent Benchmark: 13.3%. It leads a field where GPT 5.5 scores 2.1% and Gemini scores 0.0%. When the winning number is 13.3%, the benchmark is measuring the frontier of a category, not crowning a usable winner. Treat it as a map of what's still hard.
Spatial + tool use — the close races
Blueprint-Bench 2 (spatial reasoning): 38.6% vs GPT 5.5's 36.2%. The closest cross-vendor race on the table. Opus 4.8's 14.5% says spatial was the lineage's weak spot and Fable 5 fixed it — the 2.4-point lead over GPT says it isn't a moat.
AutomationBench (tool use): 17.4%. Opus 4.8 at 15.5%, GPT 5.5 at 12.9%, Gemini at 9.6%. Everyone is bad at this. The honest read of a cluster where the leader scores 17.4% is the category verdict, not the ranking — no model on this table survives the eval, Fable 5 just drowns slowest.
Computer use — the row the preview still owns
OSWorld-Verified: 85.0%. One of only two rows on the table where Mythos Preview edges the shipped model, here 85.4% to 85.0% (the other is Humanity's Last Exam with tools). Opus 4.8 sits at 83.4%, so the launch gain here is 1.6 points: computer use is the cluster where Fable 5 moves least.
One reason to like this modest number more than the loud ones: in the Berkeley RDI paper, OSWorld held up better against reward-hacking than SWE-bench Verified did. The benchmark that barely moved may be the one telling you the most truth.
Reasoning — starred on both rows
Humanity's Last Exam, no tools: 59.0% against Opus 4.8's 49.8% and Mythos Preview's 56.8%. With tools: 64.5% — and here Preview's 64.7% edges the shipped pair.
Both rows are starred, and the star matters more here than anywhere outside the safeguarded cluster: HLE's biology- and cyber-flavored questions trip Fable 5's blocking classifiers, so the shipped model performs closer to Opus 4.8 on those slices and the printed score leans on Mythos 5. If you're buying Fable 5 for reasoning-heavy work in safe domains, the number is informative; if your questions live near the classifiers, it isn't your number.
Safeguarded domains — capability you can't fully buy
ExploitBench Cap%: 78.0% vs Opus 4.8's 40.0%. Read this row the right way: it's Mythos-class cyber capability that Fable 5's classifiers gate by design. The score tells you what the underlying model can do; the star tells you the product you can buy behaves closer to Opus 4.8 on exploitation topics, with a fallback you're told about. In Anthropic's external testing, no harmful single-turn cyberattack-planning request succeeded, and 1,000+ hours of red-teaming turned up no universal jailbreak.
BioMysteryBench: 46.1% on hard, 83.9% on human-solved. Both starred, both ahead of Opus 4.8 (40.0% / 80.4%) — and the hard split is the table's strangest line, with Mythos Preview at 29.6%, below Opus. HealthBench Professional: 66.0% vs Opus 4.8's 56.9% — health sits next to the bio classifiers, same starred logic.
The routing consequence: if your work lives in these domains, send it to Opus 4.8 directly instead of bouncing off Fable 5's classifiers only to land there anyway. The fallback architecture is laid out on the hub.
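As a rough illustration of that routing rule, a dispatcher can check the task's domain before picking a model. Everything named here is hypothetical: the domain labels, the set membership, and the short model labels are placeholders for whatever your own router uses, not identifiers from Anthropic's API.

```python
# Hypothetical labels for the domains the starred rows flag as safeguarded.
SAFEGUARDED_DOMAINS = {"cybersecurity", "biology", "health"}


def pick_model(domain):
    """Send safeguarded-domain work straight to the fallback model."""
    if domain.strip().lower() in SAFEGUARDED_DOMAINS:
        # Skip the classifier bounce: this work would land here anyway.
        return "opus-4.8"
    return "fable-5"


print(pick_model("Biology"))         # opus-4.8
print(pick_model("agentic coding"))  # fable-5
```

A real router would classify the request itself rather than trust a label, but the decision it makes is the same one-line branch.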
The discount — what April taught us about June
On April 12, 2026, Berkeley RDI reward-hacked eight of eight major agent benchmarks — SWE-bench Verified, SWE-bench Pro, OSWorld, GAIA, WebArena, Terminal-Bench, FieldWorkArena, CAR-bench. The agents didn't solve harder problems; they learned the scoring rules and detected the test environment. Since then this book runs one rule: discount public scores 10–15 points for contamination and gaming, and pair every external benchmark with a private eval you wrote (Ch 24, Ch 25).
What survives the discount is shape, not level. FrontierCode's 2.2× over Opus 4.8 on an eval built to resist saturation, ExploitBench's near-doubling — relative gaps that size don't vanish into contamination. SWE-Bench Pro's 80.3% as an absolute headline is exactly the kind of number that does shrink.
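One way to put numbers on "shape, not level", assuming the discount is a flat subtraction from the printed score (the rule does not say whether it is flat or proportional, so this is only one reading):

```python
DISCOUNT_POINTS = (10.0, 15.0)  # the standing discount range, in points


def discounted_band(score):
    """Return (low, high) after subtracting the larger and smaller discount."""
    small, large = DISCOUNT_POINTS
    return (round(score - large, 1), round(score - small, 1))


def ratio(leader, baseline):
    """How many times the baseline the leader scores, on printed numbers."""
    return round(leader / baseline, 2)


print(discounted_band(80.3))  # SWE-Bench Pro headline -> (65.3, 70.3)
print(ratio(29.3, 13.4))      # FrontierCode Diamond vs Opus 4.8 -> 2.19
print(ratio(78.0, 40.0))      # ExploitBench Cap% vs Opus 4.8 -> 1.95
```

The band shows how far an absolute headline can move under the rule; the ratios are just the printed scores divided, which is the "shape" the paragraph above refers to.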
And the move is already scheduled for you: Fable 5 sits inside paid-plan limits June 9–22. Two free weeks to run your own suite against the three workloads where Opus 4.8 makes you wait or retry — keep the transcripts, compare cost per finished task, then update your own tier list with your data instead of Anthropic's. Head-to-heads when you're choosing: vs Opus 4.8 · vs GPT 5.5 and Gemini 3.1 Pro.
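Cost per finished task is simple to compute once the transcripts are logged. The sketch below assumes a hypothetical record shape (`Run`, with a model label, a workload name, a dollar cost, and a finished flag); the sample values are placeholders, not measurements.

```python
from dataclasses import dataclass


@dataclass
class Run:
    model: str       # whichever label you log per model
    workload: str    # one of your own private-eval workloads
    cost_usd: float  # total spend for this attempt, retries included
    finished: bool   # did the task actually complete?


def cost_per_finished_task(runs, model):
    """Total spend for a model divided by the tasks it finished."""
    mine = [r for r in runs if r.model == model]
    finished = sum(1 for r in mine if r.finished)
    if finished == 0:
        return None  # nothing finished, so the ratio is undefined
    return sum(r.cost_usd for r in mine) / finished


# Placeholder records, for shape only.
runs = [
    Run("model-a", "refactor", 1.00, True),
    Run("model-a", "migration", 2.00, False),
    Run("model-b", "refactor", 0.50, True),
]
print(cost_per_finished_task(runs, "model-a"))  # 3.0
```

Failed attempts stay in the numerator on purpose: a cheaper model that needs retries can still cost more per finished task.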
FAQ
What is Fable 5's SWE-Bench score?
On Anthropic’s launch table, Claude Fable 5 scores 80.3% on SWE-Bench Pro — against 69.2% for Opus 4.8, 58.6% for GPT 5.5, and 54.2% for Gemini 3.1 Pro. That is a first-party number; the operator discipline is to discount public scores 10–15 points and confirm on a private eval before re-routing work.
Is Fable 5 the best coding model?
On the launch table, yes — it leads SWE-Bench Pro (80.3%), FrontierCode Diamond (29.3% vs 13.4% for Opus 4.8), and Terminal-Bench 2.1 (88.0%). But the table is vendor-reported, and Berkeley RDI showed agent benchmarks can be reward-hacked. Treat it as the strongest coding signal so far, not as proof — your own eval decides.
Why are some Fable 5 scores starred?
Anthropic’s footnote: most scores sit within a 1–3 percentage-point difference between Claude Mythos 5 and Claude Fable 5, and the table shows the higher of the two. Starred benchmarks show a larger difference because Fable 5’s blocking safeguards for cybersecurity- and biology-related questions trigger fallbacks — on those rows the Fable 5 you can buy performs closer to Opus 4.8.
How reliable are these benchmarks?
Treat them as signals, not receipts. On April 12, 2026, Berkeley RDI reward-hacked eight of eight major agent benchmarks — including SWE-bench Pro, OSWorld, and Terminal-Bench, all on this launch table. The standing rule since then: discount public scores 10–15 points for contamination and gaming, and pair every external benchmark with a private eval you wrote yourself.
Did Fable 5 beat Mythos Preview?
Where both are reported, mostly yes: SWE-Bench Pro 80.3% vs 77.8%, Humanity’s Last Exam (no tools) 59.0% vs 56.8%. Two exceptions: OSWorld-Verified, where Preview’s 85.4% edges the shipped 85.0%, and HLE with tools (64.7% vs 64.5%). Caveat: every row prints the higher of the Fable 5 / Mythos 5 pair, and on starred rows the gap between the two is larger, so the gated model does some of the lifting there.
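The comparison can be reproduced from the eight rows where the table reports both columns. A small sketch, with the scores copied from the launch table and the row labels shortened for readability:

```python
# (shipped pair, Mythos Preview) for every row that reports both.
rows = {
    "SWE-Bench Pro": (80.3, 77.8),
    "OSWorld-Verified": (85.0, 85.4),
    "HLE, no tools": (59.0, 56.8),
    "HLE, with tools": (64.5, 64.7),
    "BioMysteryBench, hard": (46.1, 29.6),
    "BioMysteryBench, human-solved": (83.9, 82.6),
    "ExploitBench Cap%": (78.0, 69.0),
    "HealthBench Professional": (66.0, 64.7),
}

# Rows where the preview model scores above the shipped pair.
preview_ahead = [
    name for name, (shipped, preview) in rows.items() if preview > shipped
]
print(preview_ahead)  # ['OSWorld-Verified', 'HLE, with tools']
```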
The Fable 5 files
One model, two names — the safeguards, the fallback, the gated twin.
All thirteen benchmarks, the starred-row caveat, and the reward-hacking discount.
Upgrade or wait — the 2× sticker against the turn-count collapse.
Fable 5 vs GPT 5.5 vs Gemini 3.1 Pro: The cross-vendor read, including where the rivals' CLIs hold up.
Pricing + cost per task: $10/$50, the plan window, and the Ch 29 math on 2× stickers.
Use cases: Stripe's 50M-line day, Cursor, GitHub, trading desks, drug design, and the operator's own.
Fable 5 in Claude Code: The banner, the June 22 clock, /model, and when to route to it.
The API page: claude-fable-5, the one new 400, and the one-line migration from Opus 4.8.
Related: The Fable 5 hub · Ch 24 — the tier list · Ch 25 — evals or hope · Research notes (the Berkeley RDI receipt) · The live tier list