Research notes

What the labs ship, and what it changes for operators.

External papers and benchmarks that materially shift how the book's patterns hold up. Not a literature review — only the findings that change what you should do Monday morning. Each note has the receipt, the operator implications, and the chapters it informs.

Anthropic's 81k interviews — what 80,508 Claude users in 159 countries actually want from AI

2026-05-13 · Anthropic · 80,508-respondent qualitative study · Dec 2024 fieldwork · Huang et al., 2026

Trust is the chokepoint. The leverage flows to operators, not to spreadsheets. "People are afraid they're the horses."

- Sample size: 80,508
- Countries / languages: 159 / 70
- #1 concern (unreliability): 26.7%
- Independent vs institutional empowerment: 50% vs 14% (3.5×)
- AI took steps toward stated vision: 81%
- Global net positive sentiment: 67%

The finding

Anthropic ran 80,508 conversational interviews across 159 countries and 70 languages, the largest multilingual qualitative AI study ever conducted. Claude served as both interviewer and classifier, and responses were de-identified before analysis. Three signals matter for operators.

First: unreliability tops every concern at 26.7%, the largest share of any single concern in the study, and it is the only benefit/harm tension where the negative (37%) overshadows the positive (22%).

Second: independent workers report economic empowerment at 50% vs 14% for institutional employees, a 3.5× gap that validates the solo-operator framing of this entire book at n = 80,508.

Third: the productivity / "acceleration treadmill" tension cuts cleanly. 50% report time gains, while 18% feel they're now running faster to stay in the same place, with freelancers most affected.

The most-quotable line from the dataset, from a US respondent: "In the third industrial revolution, horses disappeared from city streets, replaced by automobiles. Now people are afraid they're the horses."

Global net sentiment is 67% positive, but the geographic split is sharp: sub-Saharan Africa, Latin America, and Southeast Asia are the most optimistic (24-28% strong positive); Western Europe, North America, and Oceania are the most skeptical (~35% concerned).

Operator implications
- Unreliability is the #1 concern at 26.7%, the same chokepoint DELEGATE-52 identifies from the technical side. Two independent studies, two methods, one answer. The case for content-checksum evals (Ch 25) just gained an n = 80,508 citation. If your prospects or teammates are pushing back on AI adoption, their hesitation rests on reliability, not cost.
- Independent workers report 50% economic empowerment vs 14% for institutional employees, a 3.5× asymmetry. The leverage of AI flows to operators, not to spreadsheets. This is the whole thesis of the book, validated externally. /cfo-case now has an n = 80,508 citation: AI doesn't replace your team, it widens the gap between operators who run it themselves and orgs that watch it from a distance.
- The acceleration treadmill is real and asymmetric: 50% report time gains, 18% feel the treadmill sped up, and freelancers are the worst affected. Operator move: schedule the gain (Ch 7), but also defend the reclaimed time. Most operators auto-fill the gain with more meetings, which is how 'AI saved me 10 hours' becomes 'I'm working the same hours, just on different things.'
- Educators report witnessing cognitive atrophy at 2.5-3× baseline. Skills as policy (Ch 26): your team's CLAUDE.md needs to name "we don't outsource thinking, we outsource gathering" explicitly, or you'll grow a quietly atrophied org. The vault discipline (Ch 4) is the counter: forcing synthesis through the operator's own hands is what stops the atrophy.
- Sycophancy ranks in the top-10 concerns (10.8%). Reinforces the Ch 2 framing: "Claude pushes back when I'm wrong; GPT will helpfully ship the bad idea you asked for." Operators get more value from disagreement than from agreement at scale, so choose tools and prompts that earn the disagreement.
- Geographic split: emerging markets most optimistic, developed markets most skeptical. The book is written for a Western-operator audience that the data flags as the most cautious cohort. If you're operating with customers or teams in sub-Saharan Africa, Latin America, or Southeast Asia, expect them to pull harder for AI than your domestic peers, and calibrate accordingly.

Chapters this informs

- Ch 25: unreliability tops every concern at 26.7%; second independent study after DELEGATE-52 pointing at the same eval gap
- Ch 2: sycophancy in the top-10 concerns (10.8%) validates the 'Claude pushes back when I'm wrong' framing
- Ch 26: cognitive atrophy witnessed at 2.5-3× baseline by educators; skills as policy must name 'we don't outsource thinking' explicitly
- Ch 19: 50% economic empowerment for independent workers vs 14% for institutional employees; the operator path has 3.5× more leverage at n = 80,508
- Ch 17: the time-vs-treadmill tension is a tip in itself: schedule the gain (Ch 7) AND defend the reclaimed time
- Ch 4: the vault is the counter to cognitive atrophy: forced synthesis through the operator's own hands

Source: Anthropic feature page

DELEGATE-52 — frontier models corrupt ~25% of a document after 20 edits

2026-05-12 · Microsoft Research preprint · arXiv · MIT license

Don't delegate long doc-editing chains. Break them up. Add an eval.

- Top-3 models, content lost after 20 edits: ~25%
- Mean across all 19 models: ~50%
- Best model 'ready' domains: 11 / 52
- Tools added (search / exec / edit): +6% corruption
- Bursty drops account for: ~80% of loss
- Safest workload: Python (17/19 OK)

The finding

Microsoft Research built a benchmark called DELEGATE-52: 310 work scenarios across 52 domains, from Python and crystallography to recipes and music notation. Methodology: give a model an edit, then the reverse edit, and measure how far the file drifts from the original. Across 19 frontier models on documents of 3-5K tokens, the top three (GPT-5.4, Claude 4.6 Opus, Gemini 3.1 Pro) lose ~25% of content after 20 sequential edits. The average across all 19 is ~50%. The best model, Gemini 3.1 Pro, is rated 'ready for delegation' (≥98% preservation) in only 11 of 52 domains.

Plugging in agentic tools (search, code-exec, direct file edit) makes results ~6% worse on average, not better. Losses are bursty: ~80% of total corruption comes from rare single-iteration drops of 10-30%. Weak models delete chunks wholesale; top models keep the content but corrupt it. The one domain where models behave: Python. The worst: prose, recipes, music, financial reports.
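
The edit-then-reverse-edit methodology can be sketched in a few lines. This is a minimal illustration, not the paper's scoring code: the word-level `preservation` metric, the `round_trip_scores` helper, and the way edits are passed in as plain functions are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def preservation(original, current):
    """Fraction of the original's words that survive, in order, in `current`.

    Assumption: a word-level in-order match stands in for the benchmark's
    real metric, which is not reproduced here.
    """
    a, b = original.split(), current.split()
    if not a:
        return 1.0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    kept = sum(block.size for block in matcher.get_matching_blocks())
    return kept / len(a)


def round_trip_scores(original, edit_pairs):
    """Apply each (edit, reverse_edit) pair in turn and score after every pair.

    Each pair is two callables taking and returning the document text.
    A perfect round trip leaves the score at 1.0.
    """
    doc, scores = original, []
    for edit, reverse in edit_pairs:
        doc = reverse(edit(doc))
        scores.append(preservation(original, doc))
    return scores
```

In practice each callable would wrap a model call; here they can be any function from text to text. Keeping the per-step scores, not only the final one, is what makes bursty loss visible later.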
Operator implications
- Long doc-editing chains drift even when each step looks competent. If your skill iterates on a document over 15+ turns, you're losing content silently. The damage doesn't accrue evenly on every turn; it arrives in bursts every few turns.
- Add a content checksum eval. Periodically diff against a known-good snapshot. This is exactly the eval pattern in Ch 25: the skill that fired flawlessly for 6 weeks and silently shipped a $0-pipeline canvas was bursty drift of the same shape.
- Don't reach for tools by default in editing workflows. The paper finds tool use (search, code exec, direct file edit) ADDS ~6% corruption on average. Tools earn their slot in agentic search and code generation, not in long document editing.
- Python is the safest workload: 17 of 19 models stay accurate. Prose, music, recipes, and financial reports are the worst. If you're a newsletter operator (Ch 6 newsletter skill), don't let an agent edit the published draft over 20 turns. Draft → human → ship.
- The 80/20 of corruption hides in 10-30% single-step drops. Average-quality metrics will lie to you. Catch the burst, not the average.
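
A content-checksum eval with burst detection might look like the sketch below. Everything named here is hypothetical: `paragraph_hashes`, `retention`, `find_bursts`, and the 0.10 default threshold (chosen only because the paper's bursts start around 10%). It also assumes the snapshot holds content the edits were not supposed to touch, since any deliberately rewritten paragraph will count as lost.

```python
import hashlib


def paragraph_hashes(text):
    """Hash each whitespace-normalized paragraph so reflowing is not counted as loss."""
    paragraphs = (" ".join(p.split()) for p in text.split("\n\n"))
    return {hashlib.sha256(p.encode("utf-8")).hexdigest() for p in paragraphs if p}


def retention(snapshot, current):
    """Share of snapshot paragraphs still present verbatim in `current`."""
    base = paragraph_hashes(snapshot)
    if not base:
        return 1.0
    return len(base & paragraph_hashes(current)) / len(base)


def find_bursts(scores, max_drop=0.10):
    """Return the step indices where retention fell by more than `max_drop` in one step.

    Assumes the chain starts from full retention (1.0) before step 0.
    """
    bursts = []
    previous = 1.0
    for step, score in enumerate(scores):
        if previous - score > max_drop:
            bursts.append(step)
        previous = score
    return bursts
```

For a run scored `[1.0, 0.98, 0.97, 0.75, 0.74]`, the average loss per step is about five points and looks tolerable, while `find_bursts` singles out step 3, where 22 points vanished at once.
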
Chapters this informs

- Ch 22: sessions are filesystem, not memory; long edit chains are exactly where drift accumulates
- Ch 25: this is why 'evals or hope, pick one': bursty corruption is invisible to a vibes check, visible to a content-diff eval
- Ch 28: silent doc corruption is the seventh failure receipt, the kind of bug that runs for 9 days before anyone notices
- Ch 16: a PostToolUse hook running a content-checksum is the cheapest defense
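
The Ch 16 idea of a PostToolUse hook running a content check could be sketched as a small script. Treat the details as assumptions to verify against your agent's hook documentation: that the hook receives a JSON payload on stdin with the edited path under `tool_input.file_path`, and that a non-zero exit code with a message on stderr is how the failure gets surfaced. The `.snapshots` directory, the 0.90 floor, and the function names are hypothetical.

```python
import json
import sys
from difflib import SequenceMatcher
from pathlib import Path

SNAPSHOT_DIR = Path(".snapshots")  # hypothetical home for known-good copies
MIN_PRESERVED = 0.90               # hypothetical floor; tune per document


def preserved(snapshot, current):
    """Fraction of the snapshot's words that survive, in order, in `current`."""
    a, b = snapshot.split(), current.split()
    if not a:
        return 1.0
    blocks = SequenceMatcher(None, a, b, autojunk=False).get_matching_blocks()
    return sum(block.size for block in blocks) / len(a)


def check(payload, snapshot_dir=SNAPSHOT_DIR, floor=MIN_PRESERVED):
    """Return an error message if the edited file drifted too far, else None."""
    # Assumed payload shape: {"tool_input": {"file_path": "..."}}
    file_path = payload.get("tool_input", {}).get("file_path")
    if not file_path:
        return None
    edited = Path(file_path)
    snapshot = snapshot_dir / edited.name
    if not (edited.is_file() and snapshot.is_file()):
        return None  # no snapshot to compare against, so stay silent
    score = preserved(
        snapshot.read_text(encoding="utf-8"),
        edited.read_text(encoding="utf-8"),
    )
    if score < floor:
        return f"{edited.name}: only {score:.0%} of the snapshot survives (floor {floor:.0%})"
    return None


def main():
    """Entry point for hook use: read the payload from stdin, report on stderr."""
    message = check(json.load(sys.stdin))
    if message:
        print(message, file=sys.stderr)
        return 2  # assumed: a non-zero exit flags the edit as a failure
    return 0
```

To install it as a hook, add `sys.exit(main())` at the bottom of the script and point the hook command at it. Keeping `check` separate from `main` means the drift logic can be tested without a live agent session.
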

Sources: arXiv preprint, dataset, GitHub repo

The stack moves. The book absorbs what matters. If you've read a paper that materially shifts how a chapter's pattern holds up, the inbox is open: email it to Vlad. Operator-grade findings get folded in.

Edition 3 lands when this list says it does.