Method

Audit the auditor.

You audit your codebase with agent swarms. Your agent's own setup — the skills, hooks, memory, and permissions it runs on — rots faster than your code and never throws an error. Point the same machinery at it.

Everything below happened in one session, on my real setup, with the numbers kept. Sanitized where it has to be; honest everywhere.

Jump to section tap to open

The claim

Here's the move, and it's the whole move: the multi-agent audit you already run against codebases — parallel domain reviewers, adversarial verification of every finding, a synthesis you can act on — works on the agent's own configuration. Skills are code. Hooks are code. The memory index, the permission grants, the launchd jobs — all of it is a system you built incrementally, never reviewed holistically, and trust completely.

I ran it on mine. Five auditors in parallel, one per surface: the skill library, the per-session context budget, the memory system, hooks-and-permissions, and the automation layer. Then a red-team pass that re-verified every high and medium finding empirically and listed what all five auditors missed. Forty-one findings. The red-team confirmed twenty-nine, weakened four, and refuted two outright — and those two refutations are the most important part of this page, because I would have executed both.

This is the swarm and dynamic workflows pointed inward, the same way Dreaming points the loop at the agent's memory. The codebase was never the only system that rots.

Why it rots

Your agent config accretes the way a junk drawer does — one reasonable deposit at a time. Every incident leaves a skill. Every "just this once" leaves a permission grant. Every experiment leaves a hook. None of it ever gets removed, because removal has no trigger: nothing fails when a dead skill sits in the listing, it just quietly taxes every session and occasionally misroutes one.

Then the model under it all leapfrogs the scaffolding. Anthropic's own Fable 5 migration guidance says it flat out: skills developed for prior models are often too prescriptive and can degrade output quality. The workarounds you wrote for last year's model — agent-count ceilings, watchdog timers, token budgets per task — aren't just dead weight on the new one. They're instructions the model will dutifully follow into worse behavior. A capability jump is a standing audit trigger, and almost nobody treats it as one.

And the failure mode of config rot is never a crash. It's a skill that fires when it shouldn't. A memory index one byte over its load limit, silently dropping its tail. A hook that burns sixty seconds per file edit producing output nobody — human or model — ever sees. You will not notice any of this from inside a session. That's the point.

The harness

One workflow script, two phases. Phase one: five auditors in parallel, each owning one surface, each forced to return structured findings — title, evidence with file paths and byte counts, severity, a specific fix, and an effort estimate. Vague findings are useless findings; the schema is the discipline.

  • Skills inventory — overlap clusters, dead skills, misfire-prone trigger descriptions, name collisions with built-ins, cross-referenced against actual usage telemetry.
  • Context budget — what loads into every session before any work happens: context files, rules, the memory index, skill-listing bytes, plugin fleets, session-start injections. Measured in bytes, not vibes.
  • Memory system — index size vs. its load limit, stale entries contradicted by later events, broken cross-references, orphaned files.
  • Hooks + permissions — what every hook actually costs per invocation, whether its output ever reaches the model, and what the permission allowlist quietly auto-approves.
  • Automation layer — scheduled jobs: which exist, which run, which claim to run. Last-exit codes don't lie.

Phase two is the part most people skip and the part that pays: an adversarial red-team that takes every high and medium finding and tries to kill it — re-running the measurements, re-reading the configs, checking what breaks if the recommendation executes. It also answers a separate question no individual auditor can: what did all five miss? Mine found eight things, including an internal contradiction between two auditors that would have broken the system (next section) and the fact that the security work motivating the whole session was sitting undeployed on a branch.

The two refutations that paid for everything

Refutation one: the engines behind the keepers. The automation auditor recommended retiring six workflow scripts as "redundant with native features" — zero usage, plausible story. The skills auditor, independently, recommended keeping three skills as the consolidated review surface. The red-team caught what neither saw: those three keeper skills dispatch exactly those six scripts. Execute both recommendations as written and you delete the engines of the things you decided to keep. Two individually-reasonable audits, one system-breaking combination — this is why reconciliation is a phase, not a hope.

Refutation two: the wrong denominator. The context auditor computed the per-session overhead as a percentage of a 200k-token window and framed it as a crisis. The sessions run on a 1M-token model. Every urgency figure was overstated five-fold. The trims were still worth doing — bytes are real money and tight skill descriptions route better — but "worth doing" and "on fire" are different plans, and only one of them has you deleting things in a hurry.

One more catch worth naming: an auditor cited a months-old memory note as evidence for a security claim that a later commit had already fixed. The red-team re-verified against the filesystem and killed it. Your agent's memory is part of the audit surface, not a source of ground truth about it.

The numbers

From one evening, on a setup I'd have described as "well maintained" that morning:

Surface Before After The detail that stings
Findings 41 5 parallel auditors, one evening
Red-team verdicts 29 confirmed · 4 weakened · 2 refuted plus 8 issues every auditor missed
Skills 81 66 15 archived (not deleted) on telemetry evidence
Permission grants 162 49 incl. a live bot token and an auto-approved keychain read
Memory index 26.6 KB (truncating) 18.2 KB the tail had been invisible to every session for weeks
Hooks 1 typecheck per edit 0 it scanned a 15k-file tree per edit and never once finished
Scheduled jobs silently dead 2 0 unknown one exit-126 for weeks, one plist never loaded

The single best finding wasn't a number. It was a hook — added months earlier, perfectly reasonable at the time — that ran a full typecheck of a fifteen-thousand-file tree on every file edit. Forensics showed it had never completed once: no incremental build cache existed despite the flag being set. And because its output piped through a truncator that always exited clean, even a successful run would have been invisible to the model. Maximum cost, zero benefit, zero errors. That's config rot in one artifact.

Everything fails silently

Walk the list again and notice what's common: the over-limit memory index didn't warn — it just stopped loading its tail, which happened to include reference data I "knew" was loaded. The live token in the allowlist didn't leak — it just sat in a plaintext file that rides along in every backup. The dead scheduled job didn't alert — it recorded exit 126 in a table nobody reads. The misfire-prone skill didn't crash — it occasionally grabbed requests meant for something else, which reads as "the model being dumb," not "my trigger description claims every deployment on earth."

Config failures are silent by construction, because configs don't have users — they have a single operator who trusts them. Which yields the two operating rules:

Telemetry before deletion. One tiny hook logging every skill invocation to a JSONL file turned "I think I use this" into "zero fires in seven weeks, across two independent telemetry sources." Every kill decision in the audit cited usage evidence and stated its observation window. Without the log, pruning is vibes — and vibes keep everything.

Archive, don't delete. Every removed skill moved to an archive directory the loader can't see. Reversible removal converts each kill from a debate into an experiment: if something was secretly load-bearing, the restore is a mv.

The layer the agent can't own

One class of finding no in-session audit can fix: things that must happen when no session is open. The audit found a health heartbeat fifty-two days stale — the on-demand model for "check the state of my systems" had simply stopped being invoked, which is itself the lesson. Anything that depends on you remembering to ask will eventually not be asked.

The fix is the OS scheduler — Chapter 7's answer, still — running the agent headless on a calendar: a weekday health pulse that rewrites the heartbeat and alerts only on state transitions, and a weekly Dreaming run. The agent does the judgment; launchd does the showing up.

And close the loop: the health pulse's first duty is sweeping the scheduler itself for non-zero exit codes and configured-but-unloaded jobs — the automation watches the automation, on a clock no one has to remember.

Run it Monday

The whole recipe, model-agnostic:

  1. Instrument first. One pre-tool hook appending {skill, timestamp} to a JSONL. Let it run two weeks before you trust it; state the window whenever you cite it.
  2. Fan out five auditors — skills, context budget, memory, hooks + permissions, automation. Structured findings only: evidence with paths and bytes, severity, a specific fix.
  3. Red-team every high/medium finding. The prompt that matters: "Refute or weaken findings that are wrong, overstated, or whose recommendation would backfire — verify empirically. Then list what every auditor missed."
  4. Reconcile before executing. Specifically hunt for cross-auditor contradictions — the keep-list of one against the kill-list of another.
  5. Execute with the kill-rules: archive over delete, telemetry over intuition, and re-verify the load-bearing invariants (does the memory index fit its limit? do the smoke probes still pass?) after, in the transcript, not on faith.
  6. Install the spine. Whatever must outlive the session goes on the OS scheduler, and the scheduled job's first task is checking its siblings.

Then put it on a cadence. Mine is: every model-generation jump (the scaffolding audit), plus whenever the setup "feels slow" — which, I now know, is about two months after the last time.

Related: the setup this ran against · Swarms — the fan-out patterns · Dynamic Workflows — the harness that runs it · Ch 39 — the 73% problem, the same rot in public libraries

Stay close

The next edition lands when this list says it does.

No course. No paywall. Operator playbooks weekly. 10K+ subscribers.