Fable 5 vs GPT 5.5 vs Gemini 3.1 Pro.
On June 9 Anthropic published a launch table with rival columns in it. Thirteen rows compare Claude Fable 5 against GPT 5.5 and Gemini 3.1 Pro. This page is the operator read: what the table says, where the rivals hold up, and why none of it re-tiers anything by itself.
Numbers sourced from Anthropic's announcement. Full per-row caveats on the benchmarks page; the standing discount rule in Ch 24.
The 30-second answer
On Anthropic's launch table, Claude Fable 5 leads GPT 5.5 and Gemini 3.1 Pro on every row all three reported — SWE-Bench Pro 80.3% vs 58.6% vs 54.2%. The closest race is Blueprint-Bench 2, 38.6% vs 36.2%; the closest agentic-coding race is Terminal-Bench, 88.0% vs Codex CLI's 83.4% — and that one is a harness comparison. First-party table: discount 10–15 points, run your own eval before re-tiering.
The cross-vendor table
Every row from the launch table where at least one rival is reported. "—" means Anthropic didn't report a number for that model.
| Benchmark | Fable 5 / Mythos 5 | GPT 5.5 | Gemini 3.1 Pro |
|---|---|---|---|
| SWE-Bench Pro (agentic coding) | 80.3% | 58.6% | 54.2% |
| FrontierCode Diamond, xhigh (agentic coding) | 29.3% | 5.7% | — |
| Terminal-Bench 2.1 (agentic coding)* | 88.0% | 83.4% (Codex CLI) | 70.7% (Gemini CLI) |
| GDPval-AA (knowledge work) | 1932 | 1769 | 1314 |
| GDP.pdf (knowledge-work vision, no tools) | 29.8% | 24.9% | 16.7% |
| Blueprint-Bench 2 (spatial reasoning) | 38.6% | 36.2% | 26.5% |
| AutomationBench (tool use) | 17.4% | 12.9% | 9.6% |
| OSWorld-Verified (computer use) | 85.0% | 78.7% | 76.2% |
| Legal Agent Benchmark | 13.3% | 2.1% | 0.0% |
| Humanity's Last Exam, with tools* | 64.5% | 52.2% | 51.4% |
| Humanity's Last Exam, no tools* | 59.0% | 41.4% | 44.4% |
| ExploitBench Cap% (cybersecurity)* | 78.0% | 34.0% | — |
| HealthBench Professional* | 66.0% | 51.8% | — |
* Starred rows show a larger gap between Mythos 5 and Fable 5 because blocking safeguards make Fable 5 perform closer to Opus 4.8 on cyber- and bio-adjacent questions; elsewhere the two are within 1–3 points and the table shows the higher. Anthropic chose these benchmarks, ran this table, and published it on launch day. Keep that sentence next to every number above.
The honest reads
Terminal-Bench is the closest agentic-coding race, and the one where the rivals brought their own harnesses. The rival numbers aren't bare models: GPT 5.5 ran through Codex CLI, Gemini 3.1 Pro through Gemini CLI. Codex CLI at 83.4% is 4.6 points off the lead, and that holds up. It's also the lesson of Ch 35: Codex isn't a competing model in my stack, it's the night shift (same repo, different contract), which is why the rivals' CLI numbers here deserve respect.
Blueprint-Bench 2 is the closest race on the table. 38.6% vs 36.2% on spatial reasoning — 2.4 points. If your workload is floor plans, diagrams, physical layout, this table gives you no reason to switch anything in either direction.
Legal Agent Benchmark says more about the eval than the loser. 13.3% vs 2.1% vs 0.0% — when the leader can't clear 14% and one model posts a zero, the row is telling you the task class is barely solved, not that one vendor cracked it.
Gemini 3.1 Pro is third on all but one row it appears in — it edges GPT 5.5 on Humanity's Last Exam without tools. Read that with both caveats on. First: this is Anthropic's table of Anthropic-chosen benchmarks. Second: Google's cycle already moved. On May 19 this site's research notes logged Gemini 3.5 Flash clearing 3.1 Pro on Google's own agentic boards — vendor slides, same discipline applied — with a 3.5 Pro promised. Anthropic launched against the model Google is in the middle of replacing. That's how launch tables work; it's also why they expire.
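The "third on all but one row" read can be checked mechanically against the table. A minimal sketch, assuming the ten rows where Gemini 3.1 Pro has a number are transcribed as plain tuples; the names `ROWS`, `gemini_rank`, and `rows_where_gemini_is_not_third` are illustrative, not from Anthropic's announcement.

```python
# Rows from the launch table where Gemini 3.1 Pro has a reported number.
# Tuple order: (Fable 5 / Mythos 5, GPT 5.5, Gemini 3.1 Pro).
# GDPval-AA is a score, the rest are percentages; only the ordering matters here.
ROWS = {
    "SWE-Bench Pro": (80.3, 58.6, 54.2),
    "Terminal-Bench 2.1": (88.0, 83.4, 70.7),
    "GDPval-AA": (1932, 1769, 1314),
    "GDP.pdf": (29.8, 24.9, 16.7),
    "Blueprint-Bench 2": (38.6, 36.2, 26.5),
    "AutomationBench": (17.4, 12.9, 9.6),
    "OSWorld-Verified": (85.0, 78.7, 76.2),
    "Legal Agent Benchmark": (13.3, 2.1, 0.0),
    "Humanity's Last Exam, with tools": (64.5, 52.2, 51.4),
    "Humanity's Last Exam, no tools": (59.0, 41.4, 44.4),
}


def gemini_rank(scores):
    """Return Gemini's 1-based rank within one row (1 = best score)."""
    gemini = scores[2]
    return 1 + sum(1 for s in scores if s > gemini)


def rows_where_gemini_is_not_third(rows):
    """List the rows where Gemini 3.1 Pro places better than third."""
    return [name for name, scores in rows.items() if gemini_rank(scores) != 3]


print(rows_where_gemini_is_not_third(ROWS))
```

The check only confirms the ordering inside Anthropic's own table; it says nothing about how those numbers were produced.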
The discipline — one vendor's table is not a leaderboard
This book has one rule for launch-day numbers and it didn't expire on June 9. Berkeley RDI reward-hacked eight major agent benchmarks in April 2026 — since then the standing discount on every public score is 10–15 points, and every external benchmark gets paired with a private eval (Ch 24, Ch 25). That discount applies to all three columns above, including the one with the accent color.
What survives the discount is shape, not precision. SWE-Bench Pro at 80.3% vs 58.6% is a 21.7-point gap — that doesn't dissolve into contamination noise. FrontierCode Diamond at 29.3% vs 5.7% is a 5× gap on the eval built to resist saturation. Terminal-Bench at 88.0% vs 83.4% is inside the discount band — treat that row as a tie until your own numbers say otherwise.
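The arithmetic behind those three reads is small enough to write down. A rough sketch, assuming one way to operationalize the rule: a gap under 10 points is read as a tie, a gap above 15 survives, and anything in between is labeled "borderline". That middle label and the function names are my assumptions, not part of the standing rule.

```python
# Standing discount on public scores: 10 to 15 points (Ch 24).
DISCOUNT_LOW = 10.0
DISCOUNT_HIGH = 15.0


def gap(leader, rival):
    """Gap in percentage points, rounded to one decimal to avoid float noise."""
    return round(leader - rival, 1)


def read_gap(points):
    """Classify a gap against the discount band (thresholds are an assumption)."""
    if points < DISCOUNT_LOW:
        return "tie"         # inside the band: treat as a tie
    if points <= DISCOUNT_HIGH:
        return "borderline"  # hypothetical middle bucket, not named in the book
    return "survives"        # larger than the full discount


swe_bench = gap(80.3, 58.6)       # SWE-Bench Pro, Fable 5 vs GPT 5.5
terminal_bench = gap(88.0, 83.4)  # Terminal-Bench 2.1, Fable 5 vs Codex CLI
frontier_ratio = round(29.3 / 5.7, 1)  # FrontierCode Diamond, as a ratio

print(swe_bench, read_gap(swe_bench))
print(terminal_bench, read_gap(terminal_bench))
print(frontier_ratio)
```

FrontierCode Diamond is read as a ratio rather than a point gap because both scores are low; a 23.6-point difference undersells a result that is roughly five times the rival's.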
The move
If you run a second-prior workflow — Claude on the day shift, Codex on the night shift, Gemini as the idea machine (Ch 35) — nothing in this table changes your setup today. Two priors triangulate; that logic got stronger, not weaker, now that the gaps between them are this uneven across task classes.
The one time-boxed action: Fable 5 sits inside paid-plan limits June 9–22, 2026, which makes the Claude leg of a three-way eval free for two weeks. Point all three models at the same three workloads, keep the transcripts, read the results against your own bar. Re-tier after that — not after a slide, mine or theirs.
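A three-way eval of that shape needs very little scaffolding. A minimal sketch, assuming each model is wrapped in a plain function that takes a prompt and returns text; the stub models, workload names, and file path are placeholders, and no vendor SDK calls are shown because the wiring depends on your own clients.

```python
import json
from typing import Callable, Dict, List

# Hypothetical wrapper type: prompt in, response text out.
ModelFn = Callable[[str], str]


def run_three_way_eval(models: Dict[str, ModelFn],
                       workloads: Dict[str, str]) -> List[dict]:
    """Send every workload prompt to every model and keep the full transcript."""
    transcripts = []
    for workload_name, prompt in workloads.items():
        for model_name, call_model in models.items():
            transcripts.append({
                "workload": workload_name,
                "model": model_name,
                "prompt": prompt,
                "response": call_model(prompt),
            })
    return transcripts


def save_transcripts(transcripts: List[dict], path: str) -> None:
    """Write one JSON object per line so transcripts can be diffed and re-read."""
    with open(path, "w", encoding="utf-8") as fh:
        for record in transcripts:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def make_stub(name: str) -> ModelFn:
    """Placeholder model for a dry run; swap in a real client wrapper."""
    return lambda prompt: f"[{name}] would answer: {prompt}"


# Placeholder names and prompts; replace with your own three workloads.
STUB_MODELS = {
    "claude": make_stub("claude"),
    "gpt": make_stub("gpt"),
    "gemini": make_stub("gemini"),
}
STUB_WORKLOADS = {
    "workload_a": "first real task from your backlog",
    "workload_b": "second real task from your backlog",
    "workload_c": "third real task from your backlog",
}

transcripts = run_three_way_eval(STUB_MODELS, STUB_WORKLOADS)
print(len(transcripts))
```

Scoring is deliberately left out: the point is that all three models see identical prompts and every response is kept, so the results can be read against your own bar rather than a vendor's.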
The question every launch invites — which model won the cycle — is the wrong question. The table above is one vendor's thirteen favorite rows on launch day; the May table was another vendor's. The operators who win cycles are the ones whose private evals were already running when the slides dropped.
FAQ
Is Claude Fable 5 better than GPT 5.5?
On Anthropic's launch table, yes: Fable 5 leads GPT 5.5 on every row reported, including SWE-Bench Pro 80.3% vs 58.6%, GDPval-AA 1932 vs 1769, and Humanity's Last Exam with tools 64.5% vs 52.2%. The closest race is Blueprint-Bench 2 (38.6% vs 36.2%); the closest agentic-coding race is Terminal-Bench 2.1 (88.0% vs 83.4%), where GPT 5.5 ran through Codex CLI, which makes it a harness comparison, not a pure model one. It is a first-party table; discount it 10–15 points and run your own eval before deciding.
Is Fable 5 better than Gemini?
Against Gemini 3.1 Pro, Fable 5 leads every row where Gemini is reported — GDPval-AA 1932 vs 1314, OSWorld-Verified 85.0% vs 76.2%, Terminal-Bench 88.0% vs Gemini CLI's 70.7%. But the comparison target is one Google cycle stale: Gemini 3.5 Flash was announced May 19, 2026 and clears 3.1 Pro on Google's own agentic boards, with a 3.5 Pro promised. The honest answer is 'yes, against the model Google is replacing.'
Should I switch from GPT 5.5 to Fable 5?
Not on a vendor launch table alone. The book's standing rule is to discount public benchmark scores 10–15 points after Berkeley RDI reward-hacked eight major agent benchmarks in April 2026. What survives the discount here is the gap shape: 80.3% vs 58.6% on SWE-Bench Pro doesn't vanish into contamination. Fable 5 sits inside paid-plan limits June 9–22, 2026, so the eval window on your own workload is free. Run it, then decide.
What about Gemini 3.5?
Anthropic's table compares against Gemini 3.1 Pro — but Google announced Gemini 3.5 Flash on May 19, 2026, and on the agentic and coding boards Google chose to show, the Flash clears 3.1 Pro. A 3.5 Pro is promised. This site logged that launch in research notes with the same discipline applied here: vendor launch-deck numbers are a signal, not a receipt. Until the 3.5 Pro ships and lands on independent leaderboards, the cross-vendor picture is incomplete by exactly one column.
Which is best for agentic coding?
On the launch table, Fable 5: SWE-Bench Pro 80.3% (GPT 5.5: 58.6%, Gemini 3.1 Pro: 54.2%) and FrontierCode Diamond 29.3% vs GPT 5.5's 5.7%. But the agentic rows are where harness matters most — GPT 5.5's strongest showing, Terminal-Bench 83.4%, came through Codex CLI. Best on your repo is decided by your eval, not this table. The live tier list stays the source of truth.
The Fable 5 files
One model, two names — the safeguards, the fallback, the gated twin.
Benchmarks, read honestly: All thirteen benchmarks, the starred-row caveat, and the reward-hacking discount.
Fable 5 vs Opus 4.8: Upgrade or wait? The 2× sticker against the turn-count collapse.
The cross-vendor read, including where the rivals' CLIs hold up.
$10/$50, the plan window, and the Ch 29 math on 2× stickers.
Use cases: Stripe's 50M-line day, Cursor, GitHub, trading desks, drug design, and the operator's own.
Fable 5 in Claude Code: The banner, the June 22 clock, /model, and when to route to it.
The API page: claude-fable-5, the one new 400, and the one-line migration from Opus 4.8.
Related: The Fable 5 hub · Benchmarks, read honestly · The live tier list · Research notes (the Gemini 3.5 Flash entry) · Ch 24 — the tier list · Ch 35 — Codex or Claude Code