Chapter 24

The tier list. Without mercy.

Public leaderboards rank capability. Operators rank usefulness. Below, four readings of the same models: the crowd's public benchmark, an independent referee that prices every run, the labs' own launch decks — then Vlad's actual stack, a builder you can drag, share, and argue about.

What the public says

LMArena public leaderboard

Crowdsourced head-to-head model votes from lmarena.ai. This is the canonical "which model is better at benchmark tasks" public ranking. It's the right answer to one specific question — and the wrong answer to the question you actually care about as an operator.

LMArena · Text
The headline board — head-to-head chat votes.
snapshot · 2026-06-11 (live unavailable)
board published 2026-06-10
Rank  Lab        Model                     Elo
1     Anthropic  claude-opus-4-6-thinking  1511
2     Anthropic  claude-opus-4-6           1507
3     OpenAI     gpt-5.4-mini-high         1499
4     Anthropic  claude-opus-4-7-thinking  1498
5     Anthropic  claude-fable-5            1497
6     OpenAI     gpt-5.4                   1495
7     OpenAI     gpt-5.2-high              1493
8     OpenAI     gpt-5.4-high              1491
9     Google     gemini-3.1-pro-preview    1490
10    Anthropic  claude-opus-4-7           1490
Anthropic holds 5 of the top 10 here. The Elo spread across this top 10 is 21 points — a gap operators rarely feel in practice.
Crowdsourced head-to-head votes. Live data via the lmarena-ai HF dataset; falls back to a hand-verified snapshot if offline. Source: lmarena.ai
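To put the 21-point spread in perspective, the sketch below converts a rating gap into an expected head-to-head win rate. It assumes the textbook Elo logistic curve with a 400-point scale; LMArena's published scores may be fit differently, so treat the result as an approximation. The function name `expected_win_rate` is illustrative, not part of any LMArena tooling.

```python
# Top-10 snapshot copied from the board above: (lab, model, score).
TOP10 = [
    ("Anthropic", "claude-opus-4-6-thinking", 1511),
    ("Anthropic", "claude-opus-4-6", 1507),
    ("OpenAI", "gpt-5.4-mini-high", 1499),
    ("Anthropic", "claude-opus-4-7-thinking", 1498),
    ("Anthropic", "claude-fable-5", 1497),
    ("OpenAI", "gpt-5.4", 1495),
    ("OpenAI", "gpt-5.2-high", 1493),
    ("OpenAI", "gpt-5.4-high", 1491),
    ("Google", "gemini-3.1-pro-preview", 1490),
    ("Anthropic", "claude-opus-4-7", 1490),
]


def expected_win_rate(gap: float, scale: float = 400.0) -> float:
    """Textbook Elo logistic (an assumption here): chance the higher-rated side wins."""
    return 1.0 / (1.0 + 10 ** (-gap / scale))


scores = [score for _, _, score in TOP10]
spread = max(scores) - min(scores)
anthropic_rows = sum(1 for lab, _, _ in TOP10 if lab == "Anthropic")

print(f"spread across the top 10: {spread} points")
print(f"Anthropic rows: {anthropic_rows}")
print(f"expected win rate, #1 vs #10: {expected_win_rate(spread):.1%}")
```

Under that assumption, the top model would be expected to win roughly 53% of votes against the tenth, which is close to a coin flip.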
What an independent referee measures

Artificial Analysis — the independent referee

The arena above is the crowd's taste; the boards below are the vendors' launch slides. This is neither. Artificial Analysis runs its own agentic harness (rebuilt in Intelligence Index v4.1 around long agentic chains: Terminal-Bench v2.1, 𝜏³-Banking, GDPval-AA v2) and reports the one number the other two never do: what each run costs. Cost per task, not just capability, is the column where the leaderboard starts speaking the operator's language. Read it as a referee's reading, not a verdict: independent buys disinterest, not the last word. Captured 2026-06-16; dated sourcing is on the research notes timeline.

Independent evals · Artificial Analysis
Agentic intelligence, priced per task
captured 2026-06-16 · Intelligence Index v4.1
Ranked by cost per task, cheapest first
#   Model             Intel  $/task  tok/s
1   DeepSeek V4 Pro   44.3   $0.056  75
2   gpt-oss-120b      23.8   $0.061  345
3   MiMo-V2.5-Pro     42.2   $0.062  41
4   Grok 4.3 (high)   37.6   $0.145  167
5   MiniMax-M3        44.4   $0.182  60
6   Nemotron 3 Ultra  37.8   $0.244  168
7   Kimi K2.6         42.8   $0.294  41
8   Gemini 3.1 Pro    46.5   $0.305  117
9   GPT-5.5 (xhigh)   54.8   $0.993  65
10  Claude Fable 5    59.9   $3.25
Claude Fable 5 tops the Index (59.9) — and is the most expensive per task on this board at $3.25, about 58× DeepSeek V4 Pro ($0.056). Capability is the vanity metric; cost-per-task is the one that shows up on the invoice. Sort by it.
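The arithmetic behind that callout is easy to reproduce. The sketch below copies the snapshot rows and derives the cost ratio, plus how many tasks a fixed budget buys at each end of the board. The `Row` type, the `tasks_per_budget` helper, and the $100 budget are illustrative assumptions, not anything Artificial Analysis publishes.

```python
from typing import NamedTuple, Optional


class Row(NamedTuple):
    model: str
    intel: float               # Intelligence Index v4.1 score
    usd_per_task: float
    tok_per_s: Optional[int]   # None where the snapshot shows no value


# Rows copied from the 2026-06-16 snapshot above.
BOARD = [
    Row("DeepSeek V4 Pro", 44.3, 0.056, 75),
    Row("gpt-oss-120b", 23.8, 0.061, 345),
    Row("MiMo-V2.5-Pro", 42.2, 0.062, 41),
    Row("Grok 4.3 (high)", 37.6, 0.145, 167),
    Row("MiniMax-M3", 44.4, 0.182, 60),
    Row("Nemotron 3 Ultra", 37.8, 0.244, 168),
    Row("Kimi K2.6", 42.8, 0.294, 41),
    Row("Gemini 3.1 Pro", 46.5, 0.305, 117),
    Row("GPT-5.5 (xhigh)", 54.8, 0.993, 65),
    Row("Claude Fable 5", 59.9, 3.25, None),
]


def tasks_per_budget(row: Row, budget_usd: float) -> int:
    """Whole tasks a budget covers at the row's cost per task (hypothetical helper)."""
    return int(budget_usd // row.usd_per_task)


cheapest = min(BOARD, key=lambda r: r.usd_per_task)
smartest = max(BOARD, key=lambda r: r.intel)
cost_ratio = smartest.usd_per_task / cheapest.usd_per_task

print(f"{smartest.model} costs {cost_ratio:.0f}x {cheapest.model} per task")
for row in (cheapest, smartest):
    print(f"$100 buys {tasks_per_budget(row, 100)} tasks on {row.model}")
```

Framing the gap as tasks per budget rather than a ratio tends to make the trade-off easier to argue about: the question becomes whether the extra Index points are worth the tasks you give up.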
How the Index is built — v4.1 (June 2026)
  • Agents · 34%: GDPval-AA v2 · 20%, 𝜏³-Banking · 14%
  • Coding · 24%: Terminal-Bench v2.1 · 16%, SciCode · 8%
  • Scientific reasoning · 24%: HLE · 12%, GPQA Diamond · 6%, CritPt · 6%
  • General · 18%: AA-Omniscience · 12%, AA-LCR · 6%
v4.1 retired IFBench (saturated), Terminal-Bench Hard, and 𝜏²-Bench Telecom to chase agentic signal. It's a weighted composite: change the weights and you change the king. Independent buys disinterest, not infallibility. Read it as a third reading that disagrees usefully with the crowd and the labs, not a tiebreaker that overrules them.
Source: Artificial Analysis (artificialanalysis.ai), independent evals run on their own harness. Hand-captured fair-use snapshot; figures verified against AA's public board on 2026-06-16.
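"Change the weights and you change the king" can be shown in a few lines. The sketch below uses the published v4.1 weights from the breakdown above, but the two models and their category scores are entirely made up for illustration; the per-benchmark scores behind the real Index are not reproduced in this chapter. It also assumes the composite is a plain weighted average, which the source only describes as "a weighted composite".

```python
# Published v4.1 weights from the breakdown above (percent of the Index).
WEIGHTS = {
    "Agents": {"GDPval-AA v2": 20, "𝜏³-Banking": 14},
    "Coding": {"Terminal-Bench v2.1": 16, "SciCode": 8},
    "Scientific reasoning": {"HLE": 12, "GPQA Diamond": 6, "CritPt": 6},
    "General": {"AA-Omniscience": 12, "AA-LCR": 6},
}

# Roll the per-eval weights up to category level.
category_weights = {cat: sum(evals.values()) for cat, evals in WEIGHTS.items()}

# HYPOTHETICAL category scores for two invented models. Not real results.
HYPOTHETICAL = {
    "model_a": {"Agents": 75, "Coding": 50, "Scientific reasoning": 50, "General": 50},
    "model_b": {"Agents": 50, "Coding": 65, "Scientific reasoning": 60, "General": 55},
}


def composite(scores: dict, weights: dict) -> float:
    """Plain weighted average of category scores (an assumed aggregation)."""
    total = sum(weights.values())
    return sum(scores[cat] * w for cat, w in weights.items()) / total


def leader(weights: dict) -> str:
    """Name of the hypothetical model with the highest composite."""
    return max(HYPOTHETICAL, key=lambda m: composite(HYPOTHETICAL[m], weights))


equal_weights = {cat: 1 for cat in category_weights}

print("category weights:", category_weights)
print("leader under v4.1 weights:", leader(category_weights))
print("leader under equal weights:", leader(equal_weights))
```

With the agent-heavy v4.1 weights the agent-strong invented model leads; flatten the weights and the other one takes over. The scores never changed, only the weighting did.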
What the labs claim

Launch-deck numbers — discounted on arrival

The arena above is crowd votes on chat prompts. The boards below are what vendors shipped on launch day for the question operators actually ask — agentic work — which LMArena doesn't measure. Read every row as a claim, not a receipt: Berkeley RDI reward-hacked 8 of 8 major agent benchmarks, so launch numbers carry a 10–15 point discount until your own eval confirms them. Models you can't buy don't appear — capability ceilings are not tier-list entries. Dated sourcing for every figure lives on the research notes timeline.

SWE-Bench Pro · agentic coding, the honest successor to Verified
  claude-fable-5    80.3%
  claude-opus-4-8   69.2%
  gpt-5.5           58.6%
  Source: Anthropic launch table · 2026-06-09
  Note: first-party launch numbers; discount per Ch 24

FrontierCode (Diamond) · long-horizon coding, built to be unsaturated
  claude-fable-5 (xhigh)   29.3%
  claude-opus-4-8          13.4%
  Source: Anthropic launch table · 2026-06-09
  Note: 2.2× the previous frontier; still a launch number

OSWorld · computer use, the board that held up best under RDI gaming tests
  claude-sonnet-4-6   72.5%
  Source: Anthropic product page · per Berkeley RDI (2026-04-12)
  Note: the production-realistic anchor to calibrate against

Terminal-Bench · agentic terminal work
  gemini-3.5-flash   76.2
  gemini-3.1-pro     70.3
  Source: Google launch presentation · 2026-05-19
  Note: low confidence; vendor slide, not an independent eval
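As a worked example of the 10–15 point discount described above, the sketch below turns each SWE-Bench Pro claim into a planning range. It assumes "points" means percentage points subtracted from the claimed score; the `discounted` helper and the zero floor are illustrative choices, not a method from Berkeley RDI or any vendor.

```python
# Claimed SWE-Bench Pro scores from the launch table above (percent).
CLAIMED = {
    "claude-fable-5": 80.3,
    "claude-opus-4-8": 69.2,
    "gpt-5.5": 58.6,
}

# The chapter's 10-15 point discount, read here as percentage points (assumption).
DISCOUNT_RANGE = (10.0, 15.0)


def discounted(claim: float, discount=DISCOUNT_RANGE) -> tuple:
    """Return (pessimistic, optimistic) scores after the discount, floored at 0."""
    smallest, largest = discount
    return (max(claim - largest, 0.0), max(claim - smallest, 0.0))


for model, claim in CLAIMED.items():
    low, high = discounted(claim)
    print(f"{model}: claimed {claim:.1f}%, plan on {low:.1f}-{high:.1f}% until your own eval says otherwise")
```

The range is a placeholder for your own measurement, not a replacement for it: once your eval has a number, that number wins.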
What an operator actually uses

Vlad's tier list

The arena measures capability head-to-head on neutral prompts. This list measures what runs Vlad's portfolio on a Tuesday at 11am. They disagree more than you'd expect: Claude Code and Cowork don't appear on the arena at all, and Perplexity loses head-to-head but is S-tier here because it answers the actual question 90% of the time. Drag tools, build your own, share the URL.

Build your own tier list
S · Run my life: remove this and three things break by Wednesday
A · Open every day
B · Useful for one job each
C · I see why people use these but I don't
D · Exists, fine, not for me
F · Actively bad / don't
Drag tools between tiers. State saves locally. Share opens copy / Tweet / LinkedIn / device share — the URL encodes every placement, so whoever opens it sees exactly your tiers.
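One way a share URL can carry every placement is to serialize the tiers and put them in the URL fragment. The sketch below is a guess at the general technique, not the builder's actual format: the base URL, the JSON payload shape, and both function names are hypothetical. The two S-tier placements are the ones this chapter names; the other tiers are left empty.

```python
import base64
import json
from urllib.parse import urlsplit

BASE_URL = "https://example.com/tier-list"  # hypothetical address


def encode_share_url(placements: dict, base_url: str = BASE_URL) -> str:
    """Pack tier placements into a URL fragment as unpadded URL-safe base64 JSON."""
    payload = json.dumps(placements, separators=(",", ":"), sort_keys=True)
    token = base64.urlsafe_b64encode(payload.encode("utf-8")).decode("ascii")
    return f"{base_url}#{token.rstrip('=')}"


def decode_share_url(url: str) -> dict:
    """Recover the placements from a URL produced by encode_share_url."""
    token = urlsplit(url).fragment
    padded = token + "=" * (-len(token) % 4)  # restore the stripped padding
    return json.loads(base64.urlsafe_b64decode(padded).decode("utf-8"))


placements = {
    "S": ["Claude Code", "Perplexity"],
    "A": [],
    "B": [],
    "C": [],
    "D": [],
    "F": [],
}

url = encode_share_url(placements)
print(url)
print(decode_share_url(url) == placements)
```

Keeping the state in the fragment means nothing has to be stored server-side: whoever opens the link rebuilds the same tiers from the URL alone.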
The argument

Benchmarks reward the model that beats other models. Operators reward the tool that doesn't break the workflow. The most useful tool you own is rarely the most capable one — it's the one with the lowest activation energy on a Tuesday morning when you have nineteen other problems. That's why Claude Code is S-tier in this list and not even ranked on LMArena. Different question. Different answer.
