Dreaming — Memory That Curates Itself

I have been running dynamic workflows since Opus 4.8 shipped them, and the thing that actually impressed me wasn’t the parallelism. Everyone fixates on the swarm — dozens of agents at once, the token bill, the speed. That’s not the part that stayed with me. The part that stayed with me is that the script held the memory.

Watch one run closely. Claude writes a few hundred lines of JavaScript that plans the job, then fans out a crowd of micro-agents. Each agent spins up, does one read, returns one quote-backed finding, and vanishes — it remembers nothing, because it doesn’t have to. The plan, the intermediate results, the running state, all of it lives in the script. Dozens of disposable agents, coordinated by a file small enough to read on one screen, and the file is the only thing that remembers anything. The orchestration is genuinely impressive, and it’s impressive for a reason nobody puts on the announcement slide: it solved working memory for a single task by writing it down in code and throwing the agents away.

Then the script finishes and gets deleted. The task’s memory was perfect and temporary.

That’s the question this chapter is about. A workflow remembers a job. The vault remembers a project. But the thing I learn across hundreds of sessions — the gotchas, the corrections, the “we tried that, don’t do it again” — what holds that, after the script for each job is gone? For a long time the honest answer was: nothing did.

The enemy isn’t forgetting. It’s curation.#

We already solved forgetting, three different ways. Chapter 3 named the problem — the model is a temp agency, it forgets you every morning. Chapter 4 gave you the fix — the vault is the journal you hand it back on wake-up. Chapter 37 gave you the architecture — the four layers, CLAUDE.md and memory/ and skills and the session, and which one wins when they disagree.

Every one of those is about writing memory and handing it over. None of them answers the question that shows up once you’ve actually been doing this for a year: across hundreds of sessions, how does that memory stay curated — deduplicated, verified, true, and under its own ceiling?

Here are my real numbers. My MEMORY.md index sits at 170 of the 200 lines I capped it at — 85% full, the brake already engaged. I wrote 483 session transcripts last week. Nine hundred and ninety-five all-time. No human reads 483 transcripts. So every lesson I learned this week and didn’t stop to hand-write into a memory file is, right now, already gone — sitting in a transcript no one will ever open.

Forgetting was solved. Curation wasn’t. That’s the gap.

Rick’s been dreaming on a Mac mini for months#

Here’s what’s funny: I wasn’t starting from zero. I’d just never made it safe.

Rick — my agent platform — has an OpenClaw, the research-and-synthesis archetype, the one built to read broadly, summarize, and cite. I’ve had one running on a Mac mini in the corner of the office for months, on a rotation of models rather than a single one, doing a version of exactly this every night: it reads back over what happened that day, looks for the patterns, and writes a dated file into a dreams/ folder. It has been dreaming, quietly, on local hardware, the whole time. The idea was never the hard part. I had a working one humming in the next room.

Then Anthropic shipped Dreaming in Managed Agents — a research preview, as of mid-2026 — an agent that reads its own past sessions, finds patterns, and can auto-update its own memory. That’s the fourth time in ninety days they’ve shipped the same move: define success, then walk away. /goal was one. Outcomes was another. The swarm in Chapter 38 was a third. Dreaming is the fourth, and it pointed the autonomy somewhere new — at the agent’s own memory. Seeing a frontier lab ship it was the validation that the idea was right. The same lab later put the macro version on the record — When AI builds itself, its recursive-self-improvement essay — the identical define-success-then-walk-away arc, scaled up from an agent curating its memory to a model building its own successor.

It was also exactly where I got off the train.

The Mac-mini Rick writes dream files — I read them, I decide. Anthropic’s version writes the memory. And the single most dangerous thing you can hand an autonomous writer is the one curated index you cannot afford to corrupt. So when I built the Claude Code version, I built it deliberately weaker than the two things that inspired it: propose-only. A surfacer, not a writer.

Four pieces, and the model only touches one#

The whole thing is four pieces — digest, extract, verify, review — and the model runs in exactly one of them.

Digest is the load-bearing move, and it’s plain Python, not a model. The largest session transcript I had was 5.8 megabytes. A 5.8-megabyte file cannot enter an agent — it doesn’t fit, and it shouldn’t try. So before any model sees anything, code strips the transcript down: keep the human prompts and the assistant’s prose, drop the tool calls, the tool results, the thinking blocks, the system reminders. In the one session I measured, the signal — the actual reasoning — was about 16% of the mass. toolUseResult alone was 732 kilobytes of it. What’s left after the strip is roughly 24,000 characters, small enough to hand to an agent. That’s the rule the whole book keeps coming back to: the model is for judgment, code is for everything code can do.

Extract is the one model stage, and it’s OpenClaw logic — the same read-broadly-summarize-cite shape Rick’s been running on the Mac mini. Read-only Explore agents fan out, one per session. Each candidate lesson must quote a real line from the transcript. The agents have no write tool. They cannot touch memory if they tried.

Verify is the gate that earns the whole thing. The agent read the digest — a lossy, truncated summary — so “grounded in the digest” is not the same as “grounded in what happened.” Before any candidate survives, its quote gets re-checked against the raw .jsonl, decode-aware, because the transcript stores text JSON-escaped and a naive search would miss it. Real quotes survive. Invented ones die.

Review is the only output: a dated review-<date>.md where each survivor is a claim, a why, a how-to-apply, and the verified quote with its source session. Then the ledger updates so nothing gets dreamed twice. It never writes to memory/. I do, with the existing /learn skill.

The pipeline: five stages, and only EXTRACT is the model. SELECT → DIGEST → EXTRACT → VERIFY → REVIEW. Four of five are deterministic code.

Most of a transcript is plumbing. About 16% is the lesson — one session, measured. Keep prompts + assistant prose, drop tool_use / tool_result / thinking. 5.8MB → ~24K (~240×).

Two runs, stated honestly#

I ran it twice the day I finished it.

Run	Sessions	Candidates	Verified	Dropped	Net-new	Tool-flagged dup	Missed dup	Trivial (correctly nothing)
Single-project	3	3	3	0	3	0	0	1 — “how many LOC in 30 days lol”
Cross-project	6	15	15	0	14	1	2	1 — an idle session

Read those numbers carefully, because the honest version matters here. Fifteen candidates surfaced, fifteen quote-verified, zero dropped, fourteen net-new, and one the tool itself flagged as a duplicate of a memory I already had. Zero dropped means nothing on that run was fabricated — it does not mean the verifier caught a hallucination, because there wasn’t one to catch. The anti-hallucination power is real, but I proved it separately: I fed the verifier fabricated quotes that read perfectly plausible, and watched every one of them drop. Real quotes survive the raw-transcript re-check. Invented ones don’t. That’s the test that earns the gate, and it’s a different test than these two runs.

The entire output on one screen — I skim it in two minutes. A dated review file: each candidate is Claim / Why / How-to-apply / the verified quote + source session, grouped NEW vs ⚠DUP.

There’s a receipt hiding in how this chapter got made, too. The plan for it — format, outline, the honesty guards you just read — came out of a swarm: five agents arguing different angles, then a reconciler, then a red team whose entire job was to attack the result. The red team caught me overstating one of these numbers and made me check it against disk. The thing this chapter is about — verify against ground truth, not the summary — is the thing that saved the chapter from shipping a wrong receipt about itself.

The fan-out that planned this chapter — five perspective agents in parallel. The same read-only fan-out /dream points at session transcripts, pointed here at the book.

Propose → Stress-test → Synthesize. Generate, adversarially verify, then synthesize. Eight agents, one answer — the red-team stage is where the overstated receipt got caught.

The second run is the one that made me trust it, and I want to walk the logic carefully because it’s the opposite of what it looks like at a glance.

On the cross-project run, the tool flagged one duplicate — a lesson about a fan-out workflow that rate-limits and reports “completed” while having done nothing — against my home index. Good catch. And it missed two: two lessons about rotating HubSpot credentials came back as clean, net-new candidates. They were not net-new. I’d saved both of them, by hand, days earlier.

At a glance that’s a failure. It isn’t, and the reason is the whole argument for propose-only.

The tool deduplicates against the portfolio index — MEMORY.md, the one-line-per-memory pointer file. The portfolio index holds pointers, not detail. The two HubSpot lessons lived one level down, inside a single project’s own memory file — a place the portfolio index points at but never reads. So the tool did not fail to notice a duplicate it could see. It correctly returned “new” for two candidates it had no structural way to know were already filed. The miss was a function of where I’d put the originals, not a flaw in its judgment. It told me the exact truth about the edge of its own vision.

Now hold that next to the cost. The worst thing that miss did was put two suggestions in a review file that I delete in three seconds. The worst thing the auto-writing version does, with the identical blind spot, is write a confident duplicate into an index I don’t discover is corrupted for three weeks. Same blind spot. Two completely different blast radii. One of them I can see and dismiss; the other I find out about long after it’s done its damage.

Memory is the moat. The moat is a lot of work.#

Here’s the part I’d keep if you keep nothing else from this chapter.

Everyone has the same model. You, me, your competitor down the street — we are all renting the same Opus by the token. The frontier model is not a moat; it’s a commodity with a price list. What compounds, what actually separates one operator from another a year in, is the curated record their sessions leave behind. Your memory layer is the only part of your stack the vendor can’t ship to your competitor next Tuesday. Memory is the moat.

And the moat is a lot of work. That’s not a footnote — it’s the catch. Chapter 3 told you the model forgets you. Chapter 4 gave you the vault to hand it back. Chapter 37 gave you the layers and the rules. This chapter is the maintenance loop that keeps all of that from quietly rotting while you’re busy shipping. The operator who wins with AI isn’t the one with the cleverest prompt or the biggest context window. It’s the one whose memory is the deepest and the cleanest a year from now. Memory is the key to succeeding with this — full stop.

Which is why the failure mode I built against isn’t a crash. Memory tools don’t crash; they get abandoned. They quietly surface nothing useful for a few weeks until you stop running them, and you never decide to stop — you just drift. So this one has a yield-floor tripwire: three runs in a row that surface zero new candidates and it prints, in plain text, “DREAM IS PROPOSING NOTHING — extractor may be mis-tuned or the corpus is saturated.” Now abandonment is a decision I make on purpose, not a drift I sleepwalk into.

And auto-write — letting it skip the review file and write memory itself — is earned, not assumed. It gets that promotion only after the frontmatter is normalized to one schema, the ledger is proven stable, and thirty days of accept/reject data show the extractor is right at least 80% of the time. Until then it surfaces, and I write. The loop pointed at your own memory is the one job you don’t hand to autonomy yet — precisely because it’s the one place a confident mistake is most expensive.

Propose-only isn't the default. It's the ceiling — nailed to the floor. Floor = surface a review file (where it lives AND its max). Auto-write is earned later. Auto-delete/overwrite: no code path exists.