Run Until Done

Goals, Loops, and the Evaluator That Tells the Agent to Stop

/goalHaiku-as-evaluatorautonomous loopStop hook/loopevaluator-driven

Tuesday, May 12, 8:47 AM#

Claude Code 2.1.139 had been out for less than 24 hours. I had a migration I’d been putting off for a week — sweeping claude-sonnet-4 references out of two repos before the June 15 deprecation deadline. Mechanical work. Boring work. The kind I usually start, get bored of at file 4, and walk away from until the deadline hisses.

I typed /goal all references to claude-sonnet-4 in this repo are replaced with claude-opus-4-7, tests pass, the diff lives on a branch named model-bump, or stop after 30 turns. Hit return. The overlay panel appeared — ◎ /goal active, elapsed timer, turn counter, token meter. I went and made coffee.

When I came back the goal had cleared itself. Twelve turns. Eighteen minutes. Branch pushed. Tests green. Total Haiku evaluator spend: $0.04. Total Opus spend on the main turns: $3.12.

Then I tried it again on a vibe-eval. /goal investigate until you find the cause of a flaky test. Forty-one turns later I killed it manually because the agent was looping on the same three hypotheses, the evaluator kept saying “not yet — cause not isolated,” and I’d spent $11 to learn nothing. The first run was the whole point of /goal. The second run was the warning label.

That’s the chapter. The evaluator is the goal. The goal is the evaluator. If you can’t measure done, you can’t run until done.

What /goal actually is#

/goal <condition> shipped in Claude Code v2.1.139 on May 11, 2026, with a v2.1.140 hotfix the next day for a silent-hang case under certain hook configurations. The docs page at code.claude.com/docs/en/goal is short and exact about what it does.

It’s a session-scoped wrapper around a prompt-based . After every turn, a small fast model — Haiku 4.5 by default per /en/model-config — inspects the conversation transcript and judges whether the condition holds. If it does, the goal clears. If it doesn’t, Claude takes another turn instead of returning control to you.

Three properties matter:

That last property is the trap most operators hit on day one. The condition has to be a thing Claude’s own output can prove. “Tests pass” only works if Claude actually pasted the test runner output. If the test runs in a subprocess whose stdout doesn’t bubble back, the evaluator never sees a pass signal.

/goal isn’t only for code. Anywhere “done” is measurable in transcript — a memo with three receipts, a triage list with 12 verdicts, a prep doc with four sections — /goal works. You don’t need a repo. You need a condition you could grep for. The CFO defense, the proposal triage, and the mentee prep examples later in this chapter are the operator-shaped versions; the engineering scenes are the same primitive with a different surface.

The three autonomous-loop primitives#

The autonomous-work surface in Claude Code now reads as a clean three-way. Each primitive answers a different question.

/goal — evaluator-driven. “Run until this condition is met.” The Haiku evaluator decides. Use it when the finish line is measurable in transcript output — tests pass, file count below N, all PRs labeled, lint clean.

/loop [interval] [prompt] — interval-driven. “Run this prompt every N minutes.” No evaluator. Claude executes the prompt, sleeps, executes again. Use it for polling — /loop 5m check if the deploy finished is the canonical example. Pre-2026 you’d have rigged this with cron; now it’s a slash command.

Stop hooks — custom-logic-driven. The engineer-flavored version — covered in detail in Ch 16. Use when you want determinism — “stop when this script returns 0” — rather than a model judging the transcript. For curated community skills and the install audit pattern, see Ch 39.

/goal/loopStop hook
Next turn starts whenPrevious finishes + evaluator says “not yet”Interval elapsesPrevious finishes + script returns non-zero
Stops whenEvaluator says condition metYou stop itScript returns 0
JudgeHaiku reads transcriptNone — Claude self-pacesWhatever you wrote
ScopeSession-scopedSession-scopedSettings-scoped (every session)
DeterminismPrompt-judgedTime-drivenScript-driven

The three compose. A /goal session can have Stop hooks firing alongside it — the goal evaluator decides whether to take another turn, the Stop hook still runs after each turn and can format files, post Slack, draft commit messages, whatever. They’re orthogonal layers, not competitors.

Which primitive when?#

The table proves the primitives are different. This is the literal “I’m at the terminal — what do I type” flow. Five yes/no questions, each one collapses the surface.

start here ↓

1. is the finish line a thing claude's own transcript can prove?
   ├── yes → continue to 2
   └── no  → Stop hook with a real script. determinism beats vibes.

2. can you write a grep that returns 0/1 on "done"?
   ├── yes → continue to 3
   └── no  → rewrite the condition until you can, or pick Stop hook.

3. is the work driven by elapsed time, not by a stop condition?
   ├── yes → /loop [interval] [prompt]. polling, not running until done.
   └── no  → continue to 4

4. is the stop condition reusable across every future session?
   ├── yes → Stop hook in settings.json. session-scoped is the wrong scope.
   └── no  → continue to 5

5. is the condition cheap to check in transcript output?
   ├── yes → /goal <condition>, with an "or stop after N turns" tail.
   └── no  → split it. compound conditions get ambiguous; eval picks wrong.

default if you're not sure: /goal with a turn cap of 20.

The stack — Plan → Auto → /goal#

Read /goal as the third rung of an autonomy ladder — that framing isn’t Anthropic’s wording verbatim, but the per-prompt pairing they describe in the docs is exactly the same shape: each rung removes a class of operator approval.

The three stack. Plan mode shows me the work. Auto mode does the work without asking. /goal does the work until I tell it what done looks like. You can run all three at once and the result is an agent that proposes, executes, and self-terminates without you in the loop until the deliverable is real.

That’s the behavioral shift. The ladder isn’t a choice — it’s a sequence. Ch 21 was titled “Which Mode Right Now?” when there were three modes. There are four now, and /goal is the one that changes what you’re approving, not just how often.

/goal pre-session doc for Mentee A is drafted with
(1) their three commitments from last session, (2) this week's
async messages summarized into wins + blockers, (3) the
one open question I owe them a referral on, (4) my top 3
talking points ranked by Tier 1 cash impact, or stop after
5 turns

The night before, ~6:30 PM. I used to do this prep cold five minutes before the call and it showed — I’d forget what we’d agreed last time, scroll the message thread mid-call, miss the through-line. Ran the goal with last session’s transcript, this week’s message export, and the action tracker pasted in. Four turns. Haiku rejected turn 2 because the “wins + blockers” section was generic (“Mentee A is busy”). Turn 4 was specific: “Mentee A closed 2 deals but is blocked on a vendor contract.” Cleared.

Receipt: the session next day ran 11 minutes shorter and we hit all three priority items. Mentee A flagged later that “this was the tightest one.” Saved: 45 min of in-call drift, plus the implicit cost of looking like the mentor who forgot what we agreed last week.

The brittle line: /goal can only see what you paste. If WhatsApp export is incomplete or the action tracker is stale (it usually is by Friday), the doc is confident and wrong — which is worse than no doc. The 5-minute cost is keeping the action tracker current. If you don’t pay that cost, this use case is a fancy way to lie to yourself.

Eight operator scenes#

The trick to a useful /goal is the same as the trick to a useful : the condition has to be cheap to check, demonstrable in transcript, and measurable. Eight scenes I actually use.

1. CI loop — tests pass#

/goal pnpm test exits 0, eslint --max-warnings 0 is clean,
tsc --noEmit shows no errors, or stop after 20 turns

Eval reads the transcript for the runner exit codes Claude had to print. The agent iterates — write code, run test, read failure, patch — until the three lines land green. Single most-used /goal in my rotation.

2. Research loop — five sources agree#

/goal find five independent sources (different domains)
that cite the same claim about Opus 4.7's effort parameter,
list each with URL + publish date, or stop after 15 turns

Eval counts the URL list in transcript. Cheap to check. The agent keeps searching until the count hits five. Pairs well with WebFetch + WebSearch in auto mode.

3. Proposal pipeline triage — go / no-go / follow-up#

/goal triage these 12 inbound proposal threads — for each,
assign go / no-go / follow-up using the 5-question filter
(budget named? decision-maker on thread? timeline under 90d?
fit with our 3 ICPs? response within 48h?), output one row
per thread with the verdict and the failing question if no-go,
or stop after 1 turn per thread (max 12)

Wednesday, 7:02 AM. Twelve threads in the proposal inbox from the week. Usually I triage these over coffee and it eats an hour. Dropped the thread exports into a folder, ran the goal. Haiku evaluator was tight here — the condition was “12 rows present, each with verdict + failing question.” Three turns in, Claude tried to skip threads where decision-maker wasn’t clear. Evaluator rejected: “row 4 has no verdict.” Back to work. Turn 9 cleared with all 12 rows.

Receipt: 7 no-go (4 failed on budget, 2 on decision-maker, 1 on ICP fit), 3 follow-up (timeline soft), 2 go — which I personally replied to within 20 minutes. Usually those two would’ve been buried until Friday. Saved: ~50 minutes of triage, and the two go’s got same-day replies, which our last sales data says doubles close rate.

The brittle line: the 5-question filter is mine — if your filter is fuzzy, /goal can’t help you. “Good fit” isn’t a question. “Budget over $5k stated in thread” is. /goal exposes whether your sales process is actually a process or whether it’s been vibes the whole time.

4. Content loop — anti-takeaway closer lands#

/goal the draft of chapter 38 contains a final paragraph
that names a specific mistake I made (not a generic lesson),
uses the words "the trick" or "the truth is" or "the lesson"
exactly once, and ships under 150 words, or stop after 8 turns

Eval greps the transcript for the closer block. This is the chapter you’re reading. Eight turns. Five drafts.

5. Refactor loop — file size#

/goal no file in src/ exceeds 500 lines, find -name '*.ts'
| xargs wc -l | sort -n shows the longest file under 500,
existing tests still pass, or stop after 25 turns

Eval reads the wc -l output Claude pastes. Forces the agent to split files until the count falls. Boring work, mechanical receipt, exactly the shape /goal was built for.

6. Multi-condition stop#

/goal pnpm test passes AND eslint is clean AND the fix
for the auth race condition in src/auth/session.ts compiles
under strict tsc AND no file outside src/auth/ is modified,
or stop after 30 turns

Compound conditions work. The Haiku evaluator handles “all of A, B, C, D” fine. The risk isn’t expressiveness — it’s that an ambiguous compound condition can have multiple “done” interpretations and the evaluator picks the wrong one. Be precise on each clause.

7. The trap — open-ended#

/goal investigate until you find the cause

Don’t. The next section is about this.

8. CFO defense — the AI bill#

/goal a 600-word memo defending this month's Anthropic spend
is drafted with 3 specific labor-replacement receipts
(role, hours saved, dollars), a quote from the CFO's own
last objection, and a closing ask of "keep the line item"
or stop after 8 turns

Tuesday, 9:14 AM. Anthropic bill landed at $847 for the month. CFO emailed “what is this and why is it growing.” I had a board call at 11. I typed the goal, fed Claude my Customer.io usage report, the last three CFO emails, and a list of three tasks Claude actually did this month (CSV cleanup that would’ve been a $40/hr VA, a proposal draft that would’ve been 4 hours of mine, a research pass that replaced a $200 Fiverr gig). Six turns. The Haiku evaluator kept rejecting drafts that had generic “AI is the future” lines — turn 4’s reject reason was “no dollar receipt for claim 2.” Good. Turn 6 cleared. Memo was 612 words, three receipts named, CFO’s “we need to justify every SaaS line” line quoted back at her.

Receipt: memo sent at 9:41. CFO replied “fine, but cap at $1k.” Line item survived. Saved: roughly 90 minutes of writing + the 11 AM board prep I would’ve cannibalized.

The brittle line: if you don’t have receipts before you start, /goal makes them up. It’ll happily invent a “VA replaced at $40/hr × 12 hours” if you don’t paste the actual usage data first. The goal forces structure, not honesty. Honesty is upstream.

/goal active overlay during a real run
/goal active overlay during a real run A live Claude Code session with the ◎ /goal active panel visible — elapsed time, turns evaluated, tokens spent, and the last evaluator 'why not yet' line.

The Haiku-as-evaluator pattern#

The evaluator defaults to Haiku for a reason that becomes obvious the first time you do the math. A single Opus 4.7 turn on a real coding task — read files, think, write, run tests — runs on the order of cents-to-dimes per turn. A Haiku 4.5 evaluator pass on the same transcript runs roughly two orders of magnitude cheaper. The per-turn ratio is what matters, not the exact dollar figure — check your own pricing page before you reason about it.

If you ran the evaluator on Opus, a 30-turn /goal session roughly doubles your cost — eval overhead becomes its own line item. The productivity gain — fire and forget while you make coffee — gets eaten by the evaluator’s overhead. The whole pattern only pencils because Haiku is cheap enough to run after every turn without anyone noticing.

The first day I had /goal, I tried to force the evaluator to Opus through a custom hook config (I will not explain how — it was a bad idea). A 47-turn session ran roughly 2× the expected cost because the evaluator was as expensive as the worker. The fix was reverting to default. The lesson was that the cheap evaluator isn’t a cost-cutting move — it’s the move that makes the primitive viable at all.

The same logic applies if you build a custom Stop hook with an LLM-as-judge: judge cheap or don’t judge at all. Use Haiku 4.5 or Sonnet 4.6 for the eval step. Save Opus for the work.

Anti-pattern — open-ended goals#

/goal investigate until you find the cause. /goal keep going until the code is good. /goal ship this when it's ready. These are vibe-evals, and a vibe-eval is how you discover the agent’s failure mode is “loop on the same three cycles forever while you bleed tokens.”

The Haiku evaluator is reading transcript and judging against natural language. “Found the cause” has no measurable signal. The agent generates three hypotheses, tests two, fails to isolate, and on turn 4 the evaluator says “not yet” because the agent’s own writeup said “still investigating.” Turn 5: same. Turn 41: same. You watch $11 evaporate and learn nothing.

The rule is mechanical: every /goal needs a stop condition you could write a grep for. Test exit codes. File counts. URL counts. Specific strings. Compound boolean. If the eval condition reads like an essay prompt, you’re not running until done — you’re running until tired.

Two safety nets that cost nothing:

What I got wrong#

Saturday, May 16, 2026. I typed /goal ship the launch page by 6pm. The agent shipped at 5:58. The page deployed. The evaluator said the goal was met — the wall-clock condition was true. I opened the live URL and the checkout button 500’d because the agent had pushed without re-running the build after a last-minute env var rename. The eval was real. The output was broken. The page was live, the customers were locked out, and the receipt was a deploy log timestamped 5:58:14 PM with green check marks next to every step the agent watched itself complete.

The trick was that “ship by 6pm” is a clock condition, not a quality condition. A clock condition only knows about the clock. Pair every clock condition with a quality one — “ship by 6pm AND the live URL returns 200 AND the checkout flow completes a test transaction” — or don’t write a clock condition at all. The agent will hit the deadline. It will not, on its own, refuse to hit the deadline with broken code, because nothing in the eval told it to.

I rewrote the goal after I fixed the deploy. The new one ran three more turns and stopped clean at 6:11 PM. Eleven minutes late. Site working. That’s the version that should have shipped the first time.

Spotted something wrong, missing, or sharper? Email Vlad with feedback on this chapter →
Stay close

The next edition lands when this list says it does.

No course. No paywall. Operator playbooks weekly. 10K+ subscribers.