Is Artificial Analysis an independent AI benchmark?

Yes. Artificial Analysis runs its own evaluation harness on dedicated hardware, so it is neither a crowdsourced preference vote (like LMArena) nor a vendor launch-deck number. On this tier list it sits as the independent referee between the public LMArena crowd leaderboard and the vendors’ own lab-claims.

What does cost per task mean for an AI model?

Cost per task is the weighted-average cost in US dollars to run one standardized Artificial Analysis Intelligence Index task. It prices a model’s real economics in agentic loops — the number that shows up on the invoice — rather than the headline capability score, so the most capable model is often not the cheapest to run.

Which AI model is cheapest per agentic task in 2026?

As of the 2026-06-16 capture of Artificial Analysis’s Intelligence Index v4.1 board, DeepSeek V4 Pro had the lowest cost per Intelligence Index task, while the most capable model on the board (Claude Fable 5) was the most expensive per task. The panel shows the dated per-task figures; sort by "Cost / task" to compare.

Chapter 24

The tier list. Without mercy.

Public leaderboards rank capability. Operators rank usefulness. Below, four readings of the same models: the crowd's public benchmark, an independent referee that prices every run, the labs' own launch decks — then Vlad's actual stack, a builder you can drag, share, and argue about.

What the public says

LMArena public leaderboard

Crowdsourced head-to-head model votes from lmarena.ai. This is the canonical "which model is better at benchmark tasks" public ranking. It's the right answer to one specific question — and the wrong answer to the question you actually care about as an operator.

LMArena · Text

The headline board — head-to-head chat votes.

snapshot · 2026-06-11 (live unavailable)

board published 2026-06-10

Lab · Model · Elo score
Anthropic · claude-opus-4-6-thinking · 1511
Anthropic · claude-opus-4-6 · 1507
OpenAI · gpt-5.4-mini-high · 1499
Anthropic · claude-opus-4-7-thinking · 1498
Anthropic · claude-fable-5 · 1497
OpenAI · gpt-5.4 · 1495
OpenAI · gpt-5.2-high · 1493
OpenAI · gpt-5.4-high · 1491
Google · gemini-3.1-pro-preview · 1490
Anthropic · claude-opus-4-7 · 1490

Anthropic holds 5 of the top 10 here. The Elo spread across this top 10 is 21 points — a gap operators rarely feel in practice.
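To put the 21-point spread in perspective, here is a minimal sketch that converts a rating gap into an expected head-to-head win rate. It assumes the textbook Elo logistic curve with a 400-point scale; LMArena's own fitting procedure may differ, so treat the result as a rough guide rather than the board's official math. The function name is illustrative.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo logistic curve.

    Assumption: the classic 400-point scale, not necessarily LMArena's exact model.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


top = 1511    # claude-opus-4-6-thinking, first on the board above
tenth = 1490  # claude-opus-4-7, tenth on the board above

p = elo_expected_score(top, tenth)
print(f"Expected win rate for a {top - tenth}-point gap: {p:.1%}")
```

Under that assumption the top model would be preferred over the tenth in roughly 53 of 100 votes, which is close to a coin flip.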

Crowdsourced head-to-head votes. Live data via the lmarena-ai HF dataset; falls back to a hand-verified snapshot if offline. Open lmarena.ai →

What an independent referee measures

Artificial Analysis — the independent referee

The arena above is the crowd's taste; the boards below are the vendors' launch slides. This is neither — Artificial Analysis runs its own agentic harness (rebuilt in Intelligence Index v4.1 around long agentic chains — Terminal-Bench 2.1, 𝜏³-Banking, GDPval v2) and reports the one number the other two never do: what each run costs. Cost per task, not just capability — the column where the leaderboard starts speaking the operator's language. Read it as a referee's reading, not a verdict: independent buys disinterest, not the last word. Captured 2026-06-16; dated sourcing on the research notes timeline.

Independent evals · Artificial Analysis

Agentic intelligence, priced per task

captured 2026-06-16 · Intelligence Index v4.1


Ranked by cost per task, lowest first.

Model · Intel · $/task · tok/s
DeepSeek V4 Pro · 44.3 · $0.056
gpt-oss-120b · 23.8 · $0.061 · 345
MiMo-V2.5-Pro · 42.2 · $0.062
Grok 4.3 (high) · 37.6 · $0.145 · 167
MiniMax-M3 · 44.4 · $0.182
Nemotron 3 Ultra · 37.8 · $0.244 · 168
Kimi K2.6 · 42.8 · $0.294
Gemini 3.1 Pro · 46.5 · $0.305 · 117
GPT-5.5 (xhigh) · 54.8 · $0.993
Claude Fable 5 · 59.9 · $3.25 · —

Claude Fable 5 tops the Index (59.9) — and is the most expensive per task on this board at $3.25, about 58× DeepSeek V4 Pro ($0.056). Capability is the vanity metric; cost-per-task is the one that shows up on the invoice. Sort by it.
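The sort and the 58× figure can be reproduced from the captured rows. The sketch below uses the Intel and $/task values from the 2026-06-16 capture; the "Index points per dollar" column is an illustrative ratio added here, not a metric Artificial Analysis publishes.

```python
# (model, Intelligence Index score, cost per task in USD), 2026-06-16 capture
BOARD = [
    ("DeepSeek V4 Pro", 44.3, 0.056),
    ("gpt-oss-120b", 23.8, 0.061),
    ("MiMo-V2.5-Pro", 42.2, 0.062),
    ("Grok 4.3 (high)", 37.6, 0.145),
    ("MiniMax-M3", 44.4, 0.182),
    ("Nemotron 3 Ultra", 37.8, 0.244),
    ("Kimi K2.6", 42.8, 0.294),
    ("Gemini 3.1 Pro", 46.5, 0.305),
    ("GPT-5.5 (xhigh)", 54.8, 0.993),
    ("Claude Fable 5", 59.9, 3.25),
]


def rank_by_cost(board):
    """Return rows sorted by cost per task, cheapest first."""
    return sorted(board, key=lambda row: row[2])


def index_points_per_dollar(row):
    """Illustrative ratio (an assumption of this sketch): Index score per USD of task cost."""
    _, intel, cost = row
    return intel / cost


ranked = rank_by_cost(BOARD)
cheapest, priciest = ranked[0], ranked[-1]
ratio = priciest[2] / cheapest[2]

print(f"{priciest[0]} costs {ratio:.0f}x {cheapest[0]} per task")
for row in ranked:
    print(f"{row[0]:<18} ${row[2]:<6} {index_points_per_dollar(row):7.1f} Index points per dollar")
```

Dividing score by cost is a crude lens, since Index points are not linear in usefulness, but it makes the capability-versus-invoice trade-off visible at a glance.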

How the Index is built — v4.1 (June 2026)

Agents · 34% — GDPval-AA v2 · 20% · 𝜏³-Banking · 14%
Coding · 24% — Terminal-Bench v2.1 · 16% · SciCode · 8%
Scientific reasoning · 24% — HLE · 12% · GPQA Diamond · 6% · CritPt · 6%
General · 18% — AA-Omniscience · 12% · AA-LCR · 6%

v4.1 retired IFBench (saturated), Terminal-Bench Hard, and 𝜏²-Bench Telecom to chase agentic signal. It's a weighted composite: change the weights and you change the king. Independent buys disinterest, not infallibility: read it as a third reading that disagrees usefully with the crowd and the labs, not a tiebreaker that overrules them.
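A small sketch of why the weights matter. The category weights are the v4.1 figures listed above; the two models and their category scores are hypothetical, chosen only to show a ranking flip, and the plain weighted mean is an assumption about how such a composite is typically computed rather than Artificial Analysis's exact formula.

```python
# Category weights from the v4.1 breakdown above.
WEIGHTS_V41 = {"agents": 0.34, "coding": 0.24, "science": 0.24, "general": 0.18}

# Alternative weighting for comparison (hypothetical): every category counts equally.
WEIGHTS_EQUAL = {"agents": 0.25, "coding": 0.25, "science": 0.25, "general": 0.25}

# Hypothetical category scores, invented for illustration only.
MODEL_A = {"agents": 70, "coding": 50, "science": 40, "general": 50}
MODEL_B = {"agents": 45, "coding": 60, "science": 60, "general": 55}


def composite(scores, weights):
    """Weighted mean of category scores, normalized by the total weight."""
    total = sum(weights.values())
    return sum(scores[name] * w for name, w in weights.items()) / total


for label, weights in (("v4.1 weights", WEIGHTS_V41), ("equal weights", WEIGHTS_EQUAL)):
    a, b = composite(MODEL_A, weights), composite(MODEL_B, weights)
    leader = "A" if a > b else "B"
    print(f"{label}: A={a:.1f}  B={b:.1f}  leader={leader}")
```

With the agent-heavy v4.1 weights the agent-strong model A leads; with equal weights model B does. Same scores, different king.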

Source: Artificial Analysis, independent evals run on their own harness. Hand-captured fair-use snapshot; figures verified against AA's public board on 2026-06-16. Methodology → Open artificialanalysis.ai →

What the labs claim

Launch-deck numbers — discounted on arrival

The arena above is crowd votes on chat prompts. The boards below are what vendors shipped on launch day for the question operators actually ask — agentic work — which LMArena doesn't measure. Read every row as a claim, not a receipt: Berkeley RDI reward-hacked 8 of 8 major agent benchmarks, so launch numbers carry a 10–15 point discount until your own eval confirms them. Models you can't buy don't appear — capability ceilings are not tier-list entries. Dated sourcing for every figure lives on the research notes timeline.
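The discount can be applied mechanically. This sketch reads "10–15 point discount" as subtracting 10 to 15 percentage points from the claimed score, which is an interpretation rather than a stated formula; the claimed figures are the SWE-Bench Pro launch numbers quoted on this page, and the function name is illustrative.

```python
def discounted_range(claimed: float, low: float = 10.0, high: float = 15.0) -> tuple[float, float]:
    """Return (pessimistic, optimistic) scores after the launch-deck discount.

    Assumption: the discount is in percentage points and scores cannot go below zero.
    """
    pessimistic = round(max(claimed - high, 0.0), 1)
    optimistic = round(max(claimed - low, 0.0), 1)
    return pessimistic, optimistic


# SWE-Bench Pro launch-table claims quoted on this page (percent).
LAUNCH_CLAIMS = {"claude-fable-5": 80.3, "claude-opus-4-8": 69.2, "gpt-5.5": 58.6}

for model, claimed in LAUNCH_CLAIMS.items():
    worst, best = discounted_range(claimed)
    print(f"{model}: claimed {claimed}%, plan for {worst}% to {best}% until your own eval says otherwise")
```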

SWE-Bench Pro

agentic coding — the honest successor to Verified

claude-fable-5 80.3%

claude-opus-4-8 69.2%

gpt-5.5 58.6%

Anthropic launch table · 2026-06-09
first-party launch numbers — discount per Ch 24

FrontierCode (Diamond)

long-horizon coding, built to be unsaturated

claude-fable-5 (xhigh) 29.3%

claude-opus-4-8 13.4%

Anthropic launch table · 2026-06-09
2.2× the previous frontier — still a launch number

OSWorld

computer use — the board that held up best under RDI gaming tests

claude-sonnet-4-6 72.5%

Anthropic product page · per Berkeley RDI (2026-04-12)
the production-realistic anchor to calibrate against

Terminal-Bench

agentic terminal work

gemini-3.5-flash 76.2

gemini-3.1-pro 70.3

Google launch presentation · 2026-05-19
low confidence — vendor slide, not an independent eval

What an operator actually uses

Vlad's tier list

The arena measures capability head-to-head on neutral prompts. This list measures what runs Vlad's portfolio on a Tuesday at 11am. They disagree more than you'd expect: Claude Code and Cowork don't appear on the arena at all, and Perplexity loses head-to-head but is S-tier here because it answers the actual question 90% of the time. Drag tools, build your own, share the URL.

Build your own tier list

Run my life — remove this and three things break by Wednesday

Open every day

Useful for one job each

I see why people use these but I don't

Exists, fine, not for me

Actively bad / don't

Unranked pool — drag into a tier

Drag tools between tiers. State saves locally. Share opens copy / Tweet / LinkedIn / device share — the URL encodes every placement, so whoever opens it sees exactly your tiers.
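One way a URL can carry every placement is to serialize the tier map and pack it into a query parameter. The sketch below is a guess at the general technique, not this page's actual scheme: the base URL, the `t` parameter name, the JSON-plus-base64 encoding, and the example placements are all hypothetical.

```python
import base64
import json
from urllib.parse import parse_qs, urlencode, urlparse

BASE_URL = "https://example.com/tier-list"  # hypothetical address


def encode_placements(placements: dict[str, list[str]]) -> str:
    """Pack a tier -> tools mapping into a shareable URL (illustrative scheme)."""
    payload = json.dumps(placements, separators=(",", ":"))
    token = base64.urlsafe_b64encode(payload.encode("utf-8")).decode("ascii")
    return f"{BASE_URL}?{urlencode({'t': token})}"


def decode_placements(url: str) -> dict[str, list[str]]:
    """Recover the tier -> tools mapping from a URL built by encode_placements."""
    token = parse_qs(urlparse(url).query)["t"][0]
    return json.loads(base64.urlsafe_b64decode(token.encode("ascii")))


# Example placements, for illustration only.
my_tiers = {"S": ["Claude Code", "Perplexity"], "unranked": ["Cowork"]}

share_url = encode_placements(my_tiers)
print(share_url)
print(decode_placements(share_url))
```

Keeping all state in the URL means no server-side storage is needed: whoever opens the link decodes the same placements the sender saw.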

The argument

Benchmarks reward the model that beats other models. Operators reward the tool that doesn't break the workflow. The most useful tool you own is rarely the most capable one — it's the one with the lowest activation energy on a Tuesday morning when you have nineteen other problems. That's why Claude Code is S-tier in this list and not even ranked on LMArena. Different question. Different answer.

The next edition lands when this list says it does.