Field guide · Claude Code

The Claude Code practices that survived production.

The internet is full of Claude Code tips written from demos. This page is the other kind — the practices distilled from 45 chapters of running it on real companies, where every rule exists because skipping it produced a bill. The wisdom of hour 200, compressed.

Every claim links to the chapter that carries the receipt. Failure bills in Ch 28, cost math in Ch 29, the security floor in Ch 9.

Jump to section tap to open

The 30-second answer

Five practices prevent the expensive Claude Code failures: cap spend at the workspace level, keep CLAUDE.md under 100 lines, never bypass permissions outside a sandbox, ship an eval before the artifact ships, and treat every model upgrade as a deploy. Each rule has a receipt — $1,847 in one night, $4,312 in one week.

The non-negotiables — five practices, five receipts

These aren't tips. Tips are optional. Each of these is a rule I now run without thinking, installed by a specific failure with a timestamp and a dollar amount.

1. Cap the spend before the first prompt

A workspace-level spend cap ($200/day on mine), a per-task token ceiling enforced by the orchestrator (kill at 500K tokens and page me), and a loop detector in subagent prompts — if your last three tool calls overlap more than 80%, stop and return what you have. After the caps went in, the next stuck agent cost $4 and a ping instead of $1,847.

Receipt: an uncapped subagent looped 1,400 calls into a $1,847 bill in eleven hours — Ch 28.

2. Verify the claim, not the vibe

An agent that hits a snag mid-task doesn't error — it returns its idea of a friendly status word. Between every wave of parallel agents: count touched files, diff commit hashes, fail loud on any agent whose claimed scope shows zero changes. Every skill that writes a file ends with a read-back check that the path matches.

Receipt: a subagent returned the literal string "OK" with zero commits, and the next wave burned 90 minutes debugging code that didn't exist — Ch 28.

3. Permissions stay on outside a sandbox

--dangerously-skip-permissions is named correctly. The flag belongs in containers and throwaway VMs where "agent did something stupid" costs a rebuild, not on the machine that holds your ~/.ssh. The full decision tree is below.

Receipt: a generated rm -rf with a trailing ~/ expanded after the validation layer and torched a home directory — Ch 15.

4. An eval ships before the artifact does

Three lines of code that check the output for the obvious failure — empty sections, $0 where money should be, missing headers — running 30 minutes before the real job fires. Not a framework. A smoke detector for the artifact.

Receipt: a skill that ran flawlessly for six weeks shipped nine days of $0-pipeline canvases to a COO because nothing was watching — Ch 25.

5. A model upgrade is a deploy

A new default model is a behavior change shipped to every workflow at once. Any skill shared with more than two people gets a regression test on upgrade day: same input, diff the structural shape of the output, alert on missing sections. You wouldn't push a backend change to twelve users without a smoke test.

Receipt: a 4.6 → 4.7 default flip made a shared skill silently drop its most useful section and ship prep docs 30% shorter — Ch 28.

Context discipline — what goes in CLAUDE.md, and what kills it

CLAUDE.md loads on every turn of every session. That makes it the most expensive file in your repo and the most misused. The budget: under 100 lines, hard ceiling around 150. Rules at the top, identity in one sentence, the exact build commands, a never-do list with receipts. No onboarding wiki, no marketing prose, no file inventories — the agent has Read, it can look.

Two failure modes do the damage. First, bloat: my 340-line CLAUDE.md doubled the project's token bill in a week and the model started ignoring half of it because the relevant rule was buried. The fix was extraction — three skills pulled out, the file down to 84 lines. When CLAUDE.md grows past 100 lines, the answer is almost never "trim it"; it's "what skill is hiding in there." Second, churn: every edit voids the prompt cache for the prefix. Three edits in one afternoon dropped my cache hit rate from 84% to 11% for two days. Batch CLAUDE.md edits like database migrations — once a week, new content at the bottom.

And know the hierarchy when layers disagree: your current message beats a fired skill, a skill beats CLAUDE.md, CLAUDE.md beats memory. Putting a rule in CLAUDE.md makes it a default, not a law. For the genuinely non-negotiable — never commit to main, never touch this API without confirmation — use a hook that blocks the action. CLAUDE.md is preference; hooks are enforcement.

The full architecture — four layers, the decision tree, a working 63-line template — is in Ch 37, and the standalone rules page is /claude-md-rules.

Permissions and blast radius — when to skip, when never

The default contract: every mutating tool call shows you a preview and asks. Rules live in settings.json and evaluate deny → ask → allow, deny always wins. That contract is the steering column collapse zone of agentic coding — build it once and you can drive at speed.

When you want the gates off, run the four-gate check from Ch 15: disposable environment, no real credentials in scope, network containment, recoverable from a one-minute-ago state. Four yeses — type the flag, in a container. One no — type --auto instead, which routes destructive actions through a classifier with a documented 17% false-negative rate. Better than no gate; not a replacement for your eyes. For any refactor touching more than five files, plan mode first: read the intent, push back, then run.

Blast radius is the deeper discipline, and it lives at the credential layer, not the prompt layer. A leaked Stripe key drained $4,200 in eleven minutes — the bot doesn't take a coffee break (Ch 9). The same logic caught me from the other side: a test workspace sharing a production Gmail credential surfaced a real customer email to a teammate, because connectors authenticate per-account and workspace boundaries are a UI fiction over one credential (Ch 28). Scoped keys, separate accounts per environment, least privilege as the default.

When one agent isn't enough

The single biggest quality jump available isn't a better prompt — it's running more than one instance and choosing. Three agents with three locked perspectives and a fourth that synthesizes beats one agent with a clever prompt, every time I've measured it. The architecture, the seven patterns, and the three things that quietly break a fan-out live on /swarms and in Ch 6.

The production-scale proof is Ch 43: a bench of named explorer agents, each reading one slice of a real product, deleted a net 91,874 lines across 718 files — behind six layers of verification, from focused tests through CI to a live browser run on production. Two findings only the browser layer caught. The role lenses disagreed productively: a deletion that helps UX can hurt SEO, and the swarm exists to force that tradeoff explicit before the code lands.

Two operator rules keep a swarm honest. First, verify between waves — non-negotiable #2 above exists because parallel agents fail silently at a rate one agent never shows you. Second, budget it: that 91,874-line run burned 46% of one week's usage. Swarms aren't free; they're worth it on problems where synthesis beats a single thread, and a waste everywhere else.

Cost discipline — the bill is a design output

The pricing page shows two numbers; the invoice has four — input, output, cache write, cache read. The 10x gap between a cache read and fresh input is the entire game, and it's governed by a contract: the cached prefix must not change between calls. Treat your system prompt as an append-only log. I broke that contract with a 38-line CLAUDE.md edit and the weekly bill went from $1,108 to $4,312 with zero workload change. The fix took 12 minutes. Knowing the fix existed took six months — that's Ch 29, so you skip the six months.

Three levers after caching. Watch the scorecard — cache read ratio north of 90% means the contract is holding (mine runs 98.1%), and write amortization under ~3× means you're paying the cache premium for a discount you're not collecting. Batch what can wait — the Batch API runs the same model at 50% off within 24 hours, which turned a $600/week line item into $300. Route by task — cheap models for triage, the default tier for the working middle, the top model only when a wrong answer costs more than the run. The three-line router cut my bill roughly 30% the week it shipped.

And the rule that survives every model launch: the price of a model is not the price of a task. A model that one-shots what the old one needed three turns and a retry to finish can be cheaper per job at double the per-token rate. Decide on the total-billed-per-finished-task column, never on $/Mtok. Full math and the live numbers: /the-bill.

Evals or hope — pick one

An eval is not a framework. It's a function that takes the artifact your workflow produced and answers one question — did this meet a minimum bar — returning a boolean and a reason. Three lines would have caught my nine-day silent failure on day zero: a HubSpot stage rename zeroed a filter, and the model wrote graceful prose around the empty result set for nine straight Fridays. The skill didn't crash. It gaslit. That's the failure class evals exist for (Ch 25).

The pattern: run the skill dry on a cron 30 minutes before the real one fires, pipe the output through the eval, page yourself on failure. Sixty seconds of your attention buys the intervention window. Give the eval a failure budget too — one that pages you more than once every two weeks gets refined; one that never fires isn't watching.

Evals also answer the benchmark question. Berkeley RDI reward-hacked eight major agent benchmarks in April 2026 — agents trained to detect the test and optimize the score, not the task. The standing rule on this site: discount public scores 10–15 points, and treat every external benchmark as a signal that a test is worth running, never as the result. Your private eval on your actual workload is the receipt. The full breakdown sits in research notes and Ch 24.

The anti-practices — popular advice that fails operators

"Tune the prompt harder." Prompting is table stakes now, not the lever. The leverage moved up the ladder — data layer, memory, swarms, skills all sit above the prompt. A prompt you've tweaked five times is a skill trying to be born; extract it (Ch 40).
"YOLO mode is how the pros go fast." One survey — primary unverified — puts unintended file modifications at 32% of bypass-mode users, with 9% reporting data loss. The pros go fast inside containers with the four gates checked (Ch 15).
"Put everything the agent should know in CLAUDE.md." A 340-line CLAUDE.md doubled a project's token bill in a week while the model ignored half of it. Always-loaded context is the most expensive real estate you own (Ch 37).
"Always use the most capable model." Flying first class to the corner store. Routing triage to the cheap tier cut a real bill roughly 30% in a week with no quality change on those tasks (Ch 29).
"The benchmark says it's better, switch." Eight major agent benchmarks were reward-hacked in one paper. Benchmarks are a signal a test is worth running; your private eval is the proof (Ch 25).
"Install every MCP server that looks useful." Nine of eleven public registries accepted a malicious test package without review. Every connector is a contractor holding your keys — audit it like a dependency (Ch 9).

New model, same practices — the Fable 5 note

Claude Fable 5 landed in Claude Code on June 9, 2026 — the banner calls it the model "for complex, long-running work," and you pick it via /model. Nothing on this page changes. That's the point of writing practices against the discipline instead of the model: if your model id is a variable (Ch 24 — don't pin model strings), a new flagship is a one-line diff plus a regression run, not a migration project.

Two of the non-negotiables apply immediately. Rule 5 — a model upgrade is a deploy, so run your golden set before flipping the default. And rule 4 — Fable 5 is included in paid-plan limits until June 22, then needs usage credits, which makes the window a free private eval on your own workload. Route the long-horizon, complex work to it, keep the transcripts, and decide on cost per finished task. The operator file on the model is Fable 5 in Claude Code; the full picture is the Fable 5 hub.

FAQ

What are Claude Code best practices?

Five non-negotiables prevent the expensive failures: cap spend at the workspace level before the first prompt, keep CLAUDE.md under 100 lines, never bypass permissions outside a sandbox, ship an eval before the artifact ships to anyone who matters, and treat every model upgrade as a deploy with a regression test. Each comes from a documented production failure with a dollar amount attached — $1,847 in one night from an uncapped recursion, $4,312 in one week from a CLAUDE.md edit that voided prompt caching.

How do I stop Claude Code making mistakes?

Verification, not prompting. Run a three-line smoke eval against every artifact a skill produces. Verify parallel agents between waves — count touched files and diff commit hashes, because an agent that hits a snag can return "OK" with zero work done. Use plan mode for any refactor touching more than five files. And enforce hard rules with hooks, not CLAUDE.md — CLAUDE.md is preference, hooks are enforcement.

Should I use auto-accept mode?

Auto-accept is the mild version; the dangerous one is full bypass (--dangerously-skip-permissions) — run the four gates first: disposable environment, no real credentials in scope, network containment, recoverable from a one-minute-ago state. Four yeses and bypass mode is fine — in a container. One no and the answer is --auto, which escalates destructive actions through a classifier (documented 17% false-negative rate — better than no gate, not a substitute for review), or plan mode. Never full bypass on your main machine.

How do I keep Claude Code costs down?

Treat your prompt prefix as an append-only log so prompt caching holds — a 38-line CLAUDE.md edit once took a weekly bill from $1,108 to $4,312, and the fix took 12 minutes. Batch anything that can wait 24 hours (50% off). Route by task: cheap models for triage, the default tier for the working middle, the top model only when a wrong answer costs more than the run. Set a workspace spend cap before you need one.

Do these work with Fable 5?

Yes — the practices are model-agnostic by design. Claude Fable 5 (released June 9, 2026) is picked in Claude Code via /model and is included in paid-plan limits until June 22, then needs usage credits. If your model id is a variable, switching is a one-line diff. The discipline transfers unchanged: cap the spend, keep the eval, route complex long-running work to it, and measure cost per finished task before the meter starts.