Versus file · Claude Code / Codex

Claude Code vs Codex: one of them works for me at 3 AM. The other one argues with me at 9.

Claude Code vs Codex is the comparison every leaderboard thread frames as a choice. It isn't one. I run both against the same repos — Codex opened PR #4471 from a Sentry trace at 3:04 AM; Claude Code is where I reviewed and merged it before my coffee went cold. This page settles the question per job, with the bills and the failure stories attached.

The shift contract is Ch 35, the loop and fence discipline is Ch 42, the 91,874-line deletion receipt is Ch 43, the cost methodology is Ch 29 and /the-bill.

Jump to section tap to open

The 30-second answer

Claude Code and Codex aren't rivals — they're shifts. Claude Code wins long-horizon, opinionated work: features, refactors, parallel agents, CI. Codex wins tight-contract autonomy: incident response from logs, regression catching, overnight loops. Operator Vlad Podoliako runs both on the same repos — Claude Code drives the day, Codex works the night, and a human merge gate sits between them.

The standing verdict, in the words of Ch 35: "Stop thinking of Codex and Claude Code as competing models. They're not. They're shifts." Everything below is that sentence with receipts.

What's actually different between them?

Not the model quality — both vendors shipped frontier models within a day of each other in June 2026, and the gap moves monthly. The durable differences are contract-shaped. Five of them decide every row in the verdict table below.

Dimension	Claude Code	Codex
Where it lives	Your terminal, on your machine — the repo is local and so is the session	A cloud sandbox plus a CLI — boots on a signal, reads the repo's `.mcp.json`, works alone
Interaction model	Argue, then run — it pushes back on the brief before executing it	Delegate and wait — signal in, PR out, one Slack line, back to watching
Contract tolerance	Handles fuzzy — "make this better" with a repo and a CLAUDE.md is a workable brief	Tight only — "given this trace, produce this diff." Fuzzy contracts produce nine-file PRs
Autonomy shape	Swarms, subagents, hooks, headless runs you launch deliberately	Standing loops on signals — Sentry, PostHog, BetterStack crons, a worktree per fix
Review surface	A PR at the end of the run	A PR at the end of the run — this row is the same on purpose; the human merge is the gate in both

If your real question is harness-vs-IDE — where you want to review work, not who does which job — that's a different comparison and it has its own page: Claude Code vs Cursor. This page assumes you've already chosen the terminal and you're staffing it.

Which wins which job? The verdict table

Every row comes from a documented run on a real company, not a benchmark harness. The receipt column is the point — click through and check the numbers.

Job	Winner	The receipt
Incident response from logs	Codex	PR #4471: Sentry trace at 3:04 AM → three files, 91-line diff, one new test, merged with under four minutes of human time — Ch 35
One-shot a product	Claude Code	A full native SwiftUI app from one brief, proven by a RevenueCat renewal, not a demo — Ch 45
Mass reduction at scale	Codex, on a fenced loop	Net −91,874 lines across 718 files, 243 commits, 106 beginning with "Simplify," behind six layers of verification — Ch 43
The architecture argument	Claude Code	Ask for a retry loop: Codex adds the retry loop. CC tells you the call is already idempotent and what you need is a token bucket — Ch 35
Proof-checking a diff	Whichever one didn't write it	A different model reaches for different edge cases — Codex caught a regression by running the tests CC didn't think to run — Ch 42
Ideation before a spec exists	Neither	Both need a contract. The no-contract job goes to Gemini in AI Studio, and the code gets thrown away — Ch 35

What do they actually cost to run?

The seat math first, because it's the boring part: both land in the same shape. A hosted Codex seat ran $200-300 per developer per month as of early 2026, and the Claude Code seat sits in the same range — verify both on the vendors' pricing pages before you commit a line item, those numbers move quarterly and anti-bot walls make them annoying to pin from a script.

The usage is where the real bills live, and the corpus has three receipts. My night-shift Codex on one repo spent roughly $180 in API tokens in a month, against a workload of about 30 PRs and 60 Sentry triages (Ch 35). One agentic loop fix is about 30 round-trips of context — 1 to 2 million tokens per landed fix (/the-bill). And Claude Code across the whole portfolio burns between 3 and 10 billion tokens a month (Ch 1) — that's the workforce line, not a chat habit.

The number to respect: the 91,874-line Codex run was not free. It burned 46% of one week's usage — the cost is part of the receipt, and Ch 43 prints it next to the deletion count instead of hiding it. Big autonomy has a meter on it.

What do the benchmarks say — and how much should you care?

Two rows from Anthropic's June 9 launch table are the ones this comparison keeps reaching for. Terminal-Bench 2.1: Fable 5 at 88.0% vs GPT 5.5 through Codex CLI at 83.4% — the one row where the rival brought its own harness, which makes it the closest thing to an agent-vs-agent number that exists. SWE-Bench Pro: 80.3% vs 58.6%, a model-level gap of 21.7 points.

Now the standing discipline, which didn't expire on launch day: Berkeley RDI reward-hacked eight major agent benchmarks in April 2026, and since then every public score on this site carries a 10-15-point discount. Run that discount over both rows and they read differently. The Terminal-Bench gap is 4.6 points — inside the discount band, which makes it a tie until your own eval says otherwise. The SWE-Bench Pro gap survives as shape, not precision. Neither row re-tiers anything by itself.

The full thirteen-row table with per-row caveats is Fable 5 vs GPT 5.5; the live ranking that a launch slide doesn't touch is the tier list. This page defers to both — its receipts are production runs, which is the eval the benchmarks are a proxy for.

Can you run both on one repo?

Yes, and the infrastructure bill for the second agent is close to zero, because they share everything that matters. Both agents read the same .mcp.json — five MCP servers, one file; Codex reads it when it boots in its cloud sandbox, Claude Code reads it when the repo opens locally. Both read the same CLAUDE.md as the contract: file ownership, commit shape, what "done" means. The Belkins partner-portal CLAUDE.md runs 240 lines, and about a third of it is the section called "Codex never pushes to main" (Ch 35).

Since May 2026 the skill library is shared too: Codex CLI adopted the same SKILL.md format as Claude Code — same frontmatter, same triggers, same directory layout. The receipt is a deal-watcher skill of about 90 lines that fires from Codex at 3 AM when a Sentry event hits the deal-sync queue, and fires from a Claude Code session at 9 AM during PR review. One file, two CLIs, two shifts. Write skills once.

Three rules keep the night shift and the day driver from stepping on each other, and they're cheap to install. Branch protection on main — enforced at the GitHub level, not the social one; the social rule fails the first time Codex tries to "fix" a merge conflict at 4 AM. File ownership in CLAUDE.md — Codex owns the jobs, workers, and regression-test directories; the day driver owns the app surface and the schema, because schema changes need a rollback story. No agent merges its own PR, ever. The third rule is the one that actually matters, and the failure section below is why.

Where does each one bite?

Every failure in this section is a contract failure with my name on it, not a model failure — which is exactly why it transfers to your setup.

Codex, given merge rights: during a maintenance push I gave Codex merge permissions and forgot to revoke them. It auto-merged a one-line "fix" that suppressed an exception — and that exception was the only signal a customer's SMTP creds had rotated. Eleven days to notice, two days to find, the customer churned (Ch 35). Branch protection isn't paranoia; it's the cheapest insurance you'll ever buy.

Codex, given a fuzzy contract: confidence drops fast as the diff widens. A four-file PR from Codex is fine; a nine-file PR is a re-architecture wearing a fix's clothes — close it unread and reopen the problem in Claude Code, where you can argue. The loop version of the same bite is running "simplify" with no fence: a loop without a goal, a diff ceiling, and a stop condition isn't running until done, it's running until tired — and it will "simplify" your auth flow into a security hole on turn 14 while you're at lunch (Ch 42).

Claude Code, given depth without caps: the day driver's bite is the same autonomy pointed inward. An uncapped subagent looped 1,400 calls into a $1,847 bill in eleven hours — no single call crossed a threshold, the aggregate had no threshold to cross (Ch 28). Spend caps and loop detectors are the CC-side equivalent of Codex's branch protection: the fence you build before the night you need it.

How I actually split it

The standing allocation from Ch 35: Claude Code gets everything that needs an opinion — features, migrations, the swarm, headless CI, the argument about whether an abstraction earns its weight. Codex gets everything with a tight scope, a clear signal, and a deterministic test — Sentry traces, flaking tests, docs that lag the code, dependency bumps. The hand-off is a PR and a human, in that order. When a Codex PR shows up wearing more scope than its signal justifies, it doesn't get a closer read — it gets closed, and the problem reopens on the day shift.

The part I'd underline for anyone copying this: even the best run needed a hand on the wheel. Mid-run on the 91,874-line deletion, watching the diff balloon, the steer was — roughly — "you're making a lot of changes; test the build, what you already did, commit and push if no regressions" (Ch 43). That's not a footnote on the autonomy story. That is the autonomy story: the loop was good, and it still needed a human telling it to checkpoint.

No closing framework. Both vendors will leapfrog each other again before this page needs a rewrite — they shipped frontier models a day apart this cycle, and the tier list stays the live ranking so this page doesn't have to. What won't move: two shifts beat one agent with both, the contracts live in files the agents read, and the only door to main is a person. Staff accordingly.

FAQ

Is Codex better than Claude Code?

"Better at what" is the only honest framing. Codex wins tight contracts — given this trace, produce this diff: incident response from logs, regression catching, mechanical doc updates. Claude Code wins fuzzy ones — architecture, multi-hour features, pushback on the prompt itself. The live ranking lives on the tier list, not on this page — and on it Claude Code is S-tier and Codex is A-tier, doing different jobs.

Can I use Claude Code and Codex together on the same repo?

Yes — that's the recommended setup, not a workaround. Both agents read one .mcp.json and one CLAUDE.md, so there's no second source of truth to drift. Since May 2026 the same SKILL.md file fires in both CLIs. Three rules keep them apart: branch protection on main, file ownership written into CLAUDE.md, and no agent ever merges its own PR.

Is Codex cheaper than Claude Code?

Both seats land in the same $200-300/month shape as of early 2026 — verify on the vendors' pricing pages, those numbers move quarterly. The real costs are usage-shaped: one night-shift repo ran roughly $180/month in Codex API tokens against ~30 PRs and 60 Sentry triages; one big Codex refactor burned 46% of a week's usage; Claude Code burns 3-10 billion tokens a month across a portfolio. If you run both, the second seat is the cheap one — same repo, same MCPs, the marginal cost is the model bill.

Does Codex support Claude Code skills?

Since May 2026, yes — Codex CLI uses the same SKILL.md format as Claude Code: same frontmatter, same trigger phrases, same directory layout. Write the skill once and either agent can fire it. The receipt: a ~90-line deal-watcher skill that runs from Codex at 3 AM and from a Claude Code session at 9 AM, one file, two CLIs.

Should I switch from Claude Code to Codex (or back)?

Switching is usually the wrong frame — the failure stories on this page all come from giving one agent both shifts, not from picking the wrong vendor. If you can only run one, match it to your workload's contract-tightness: mostly tight, signal-driven fixes favor Codex; mostly fuzzy, opinionated builds favor Claude Code. The leaderboard is the last input, not the first.