The tier list. Without mercy.
Public leaderboards rank capability. Operators rank usefulness. Below, four readings of the same models: the crowd's public benchmark, an independent referee that prices every run, the labs' own launch decks — then Vlad's actual stack, a builder you can drag, share, and argue about.
LMArena public leaderboard
Crowdsourced head-to-head model votes from lmarena.ai. This is the canonical "which model is better at benchmark tasks" public ranking. It's the right answer to one specific question — and the wrong answer to the question you actually care about as an operator.
Artificial Analysis — the independent referee
The arena above is the crowd's taste; the boards below are the vendors' launch slides. This is neither. Artificial Analysis runs its own agentic harness (rebuilt in Intelligence Index v4.1 around long agentic chains: Terminal-Bench v2.1, 𝜏³-Banking, GDPval-AA v2) and reports the one number the other two never do: what each run costs. Cost per task, not just capability: that is the column where the leaderboard starts speaking the operator's language. Read it as a referee's reading, not a verdict; independence buys disinterest, not the last word. Captured 2026-06-16; dated sourcing is on the research notes timeline.
How the Index is built — v4.1 (June 2026)
- Agents · 34% — GDPval-AA v2 · 20% · 𝜏³-Banking · 14%
- Coding · 24% — Terminal-Bench v2.1 · 16% · SciCode · 8%
- Scientific reasoning · 24% — HLE · 12% · GPQA Diamond · 6% · CritPt · 6%
- General · 18% — AA-Omniscience · 12% · AA-LCR · 6%
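To make the weighting concrete, here is a minimal sketch of how a composite could be computed from those weights. The weights are the ones listed above; everything else is an assumption for illustration: the function names are hypothetical, and the plain weighted average over 0-100 scores is a guess at the aggregation, not Artificial Analysis's published method.

```python
# Benchmark weights (percent of the Index) as listed in the v4.1 breakdown.
WEIGHTS = {
    "GDPval-AA v2": 20,
    "𝜏³-Banking": 14,
    "Terminal-Bench v2.1": 16,
    "SciCode": 8,
    "HLE": 12,
    "GPQA Diamond": 6,
    "CritPt": 6,
    "AA-Omniscience": 12,
    "AA-LCR": 6,
}

# Grouping of benchmarks into the four categories from the same breakdown.
CATEGORIES = {
    "Agents": ["GDPval-AA v2", "𝜏³-Banking"],
    "Coding": ["Terminal-Bench v2.1", "SciCode"],
    "Scientific reasoning": ["HLE", "GPQA Diamond", "CritPt"],
    "General": ["AA-Omniscience", "AA-LCR"],
}


def category_weights() -> dict[str, int]:
    """Sum the benchmark weights inside each category."""
    return {
        category: sum(WEIGHTS[name] for name in names)
        for category, names in CATEGORIES.items()
    }


def composite(scores: dict[str, float]) -> float:
    """Weighted average of per-benchmark scores on a 0-100 scale.

    ASSUMPTION: a plain weighted mean. The real Index may normalize
    or rescale individual benchmarks before combining them.
    """
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(WEIGHTS.values())
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / total_weight
```

Under this reading, a model that aced GDPval-AA v2 and scored zero everywhere else would land at 20 on the composite, which is why the agentic benchmarks dominate the ranking.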
Launch-deck numbers — discounted on arrival
The arena above is crowd votes on chat prompts. The boards below are what vendors shipped on launch day for the question operators actually ask — agentic work — which LMArena doesn't measure. Read every row as a claim, not a receipt: Berkeley RDI reward-hacked 8 of 8 major agent benchmarks, so launch numbers carry a 10–15 point discount until your own eval confirms them. Models you can't buy don't appear — capability ceilings are not tier-list entries. Dated sourcing for every figure lives on the research notes timeline.
first-party launch numbers — discount per Ch 24
2.2× the previous frontier — still a launch number
the production-realistic anchor to calibrate against
low confidence — vendor slide, not an independent eval
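The 10–15 point discount above can be written down as a rule of thumb. This is a sketch under stated assumptions, not a calibrated method: it treats "points" as percentage points subtracted from a 0-100 launch score, and the function names are hypothetical.

```python
from typing import Optional


def discounted_range(
    launch_score: float, low: float = 10.0, high: float = 15.0
) -> tuple[float, float]:
    """Return (pessimistic, optimistic) planning scores for a launch number.

    ASSUMPTION: the discount is a flat subtraction in percentage points,
    floored at zero.
    """
    if not 0.0 <= launch_score <= 100.0:
        raise ValueError("launch_score must be on a 0-100 scale")
    return (max(launch_score - high, 0.0), max(launch_score - low, 0.0))


def planning_score(
    launch_score: float, own_eval: Optional[float] = None
) -> tuple[float, float]:
    """Prefer your own eval; fall back to the discounted launch number."""
    if own_eval is not None:
        # A confirmed result replaces the claim entirely.
        return (own_eval, own_eval)
    return discounted_range(launch_score)
```

For example, `planning_score(80.0)` gives `(65.0, 70.0)`, while `planning_score(80.0, own_eval=72.5)` collapses to `(72.5, 72.5)` once your own eval has confirmed a number.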
Vlad's tier list
The arena measures capability head-to-head on neutral prompts. This list measures what runs Vlad's portfolio on a Tuesday at 11am. They disagree more than you'd expect: Claude Code and Cowork don't appear on the arena at all, and Perplexity loses head-to-head but is S-tier here because it answers the actual question 90% of the time. Drag tools, build your own, share the URL.
Benchmarks reward the model that beats other models. Operators reward the tool that doesn't break the workflow. The most useful tool you own is rarely the most capable one — it's the one with the lowest activation energy on a Tuesday morning when you have nineteen other problems. That's why Claude Code is S-tier in this list and not even ranked on LMArena. Different question. Different answer.