Research notes

What the labs ship, and what it changes for operators.

External papers and benchmarks that materially shift how the book's patterns hold up. Not a literature review — only the findings that change what you should do Monday morning. Each note has the receipt, the operator implications, and the chapters it informs.

Latest: Anthropic's 81k interviews — what 80,508 Claude users in 159 countries actually want from AI · 2026-05-13

  1. Anthropic's 81k interviews — what 80,508 Claude users in 159 countries actually want from AI

    2026-05-13 · Anthropic · 80,508-respondent qualitative study · Dec 2024 fieldwork · Huang et al., 2026

    Trust is the chokepoint. The leverage flows to operators, not to spreadsheets. "People are afraid they're the horses."

    Sample size: 80,508
    Countries / languages: 159 / 70
    #1 concern (unreliability): 26.7%
    Independent vs institutional empowerment: 50% vs 14% (3.5×)
    AI took steps toward stated vision: 81%
    Global net positive sentiment: 67%
    The finding

    Anthropic ran 80,508 conversational interviews across 159 countries and 70 languages — the largest multilingual qualitative AI study ever conducted. Claude-as-interviewer, Claude-as-classifier, de-identified before analysis. Three signals matter for operators. First: unreliability tops every concern at 26.7% — the highest single number in the whole study, and the only benefit/harm tension where the negative (37%) overshadows the positive (22%). Second: independent workers report economic empowerment at 50% vs 14% for institutional employees — a 3.5× gap that validates the solo-operator framing of this entire book at n = 80,508. Third: the productivity / "acceleration treadmill" tension cuts cleanly — 50% report time gains, 18% feel they're now running faster to stay in the same place, freelancers most affected. The most-quotable line from the dataset, from a US respondent: "In the third industrial revolution, horses disappeared from city streets, replaced by automobiles. Now people are afraid they're the horses." 67% global net positive, but the geographic split is sharp — sub-Saharan Africa, Latin America, Southeast Asia most optimistic (24-28% strong positive); Western Europe, North America, Oceania most skeptical (~35% concerned).

    Operator implications
    • Unreliability is the #1 concern at 26.7% — the same chokepoint DELEGATE-52 identifies from the technical side. Two independent studies, two methods, one answer. The case for content-checksum evals (Ch 25) just gained an n = 80,508 citation. If your prospects or teammates are pushing back on AI adoption, their hesitation sits on this wedge, not on cost.
    • Independent workers report 50% economic empowerment vs 14% for institutional employees — a 3.5× asymmetry. The leverage of AI flows to operators, not to spreadsheets. This is the whole thesis of the book, validated externally. /cfo-case now has an n = 80,508 citation: AI doesn't replace your team, it widens the gap between operators who run it themselves and orgs that watch it from a distance.
    • The acceleration treadmill is real and asymmetric — 50% report time gains, 18% feel the treadmill sped up, freelancers worst affected. Operator move: schedule the gain (Ch 7), but also defend the reclaimed time. Most operators auto-fill the gain with more meetings, which is how 'AI saved me 10 hours' becomes 'I'm working the same hours, just on different things.'
    • Educators report witnessing cognitive atrophy at 2.5-3× the baseline rate. Skills as policy (Ch 26) — your team's CLAUDE.md needs to name "we don't outsource thinking, we outsource gathering" explicitly, or you'll grow a quietly atrophied org. The vault discipline (Ch 4) is the counter: forcing synthesis through the operator's own hands is what stops the atrophy.
    • Sycophancy ranks in the top-10 concerns (10.8%). Reinforces the Ch 2 framing: "Claude pushes back when I'm wrong; GPT will helpfully ship the bad idea you asked for." Operators get more value from disagreement than from agreement at scale — choose tools and prompts that earn the disagreement.
    • Geographic split: emerging markets most optimistic, developed markets most skeptical. The book is written for a Western-operator audience that the data flags as the most-cautious cohort. If you're operating with customers or teams in sub-Saharan Africa, Latin America, or Southeast Asia, expect them to pull harder for AI than your domestic peers — calibrate.
  2. DELEGATE-52 — frontier models corrupt ~25% of a document after 20 edits

    2026-05-12 · Microsoft Research preprint · arXiv · MIT license

    Don't delegate long doc-editing chains. Break them up. Add an eval.

    Top-3 models, content lost after 20 edits: ~25%
    Mean across all 19 models: ~50%
    Best model 'ready' domains: 11 / 52
    Tools added (search / exec / edit): +6% corruption
    Bursty drops' share of total loss: ~80%
    Safest workload: Python (17/19 OK)
    The finding

    Microsoft Research built a benchmark called DELEGATE-52 — 310 work scenarios across 52 domains, from Python and crystallography to recipes and music notation. Methodology: give a model an edit, then the reverse edit; measure how far the file drifts from the original. Across 19 frontier models on documents of 3-5K tokens, the top three (GPT-5.4, Claude 4.6 Opus, Gemini 3.1 Pro) lose ~25% of content after 20 sequential edits. The average across all 19 is ~50%. The best model — Gemini 3.1 Pro — is rated 'ready for delegation' (≥98% preservation) in only 11 of 52 domains. Plugging in agentic tools (search, code-exec, direct file edit) makes it ~6% worse on average, not better. Losses are bursty: ~80% of total corruption comes from rare single-iteration drops of 10-30%. Weak models delete chunks wholesale; top models corrupt the survivors. The one domain where models behave: Python. The worst: prose, recipes, music, financial reports.
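
    The round-trip methodology is cheap to reproduce against your own workloads. A minimal sketch of the edit/reverse-edit loop — the character-level similarity metric and the toy "editor" functions here are stand-ins of mine, not the paper's actual models or scoring:

```python
import difflib

def retention(original: str, current: str) -> float:
    """Fraction of the original still present, via summed matching blocks."""
    sm = difflib.SequenceMatcher(a=original, b=current, autojunk=False)
    matched = sum(block.size for block in sm.get_matching_blocks())
    return matched / max(len(original), 1)

def measure_drift(doc: str, edit_pairs, steps: int = 20):
    """Apply (edit, reverse_edit) pairs for `steps` rounds, recording
    retention after each round trip. A faithful editor makes the round
    trip an identity, so retention stays at 1.0; drift shows up as
    retention decay against the original."""
    history, current = [], doc
    for i in range(steps):
        edit, reverse = edit_pairs[i % len(edit_pairs)]
        current = reverse(edit(current))  # should be a no-op end to end
        history.append(retention(doc, current))
    return history

# Toy stand-in editors: a faithful pair and a lossy one (drops 10%/trip)
faithful = (str.upper, str.lower)
lossy = (lambda s: s[: int(len(s) * 0.9)], lambda s: s)

doc = "the quick brown fox jumps over the lazy dog " * 50
print(measure_drift(doc, [faithful])[-1])  # 1.0 — no drift
print(measure_drift(doc, [lossy])[-1])     # ~0.12 — compounded losses
```

    The point of the harness is the shape of the curve, not the toy editors: swap in your model's edit call and the same loop tells you where your own chains start to bleed.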

    Operator implications
    • Long doc-editing chains drift even when each step looks competent. If your skill iterates on a document over 15+ turns, you're losing content silently — not making it worse on every turn, just bursting every few turns.
    • Add a content-checksum eval. Periodically diff against a known-good snapshot. This is exactly the eval pattern in Ch 25 — the skill that fired flawlessly for six weeks and then silently shipped a $0-pipeline canvas was bursty drift, the same shape.
    • Don't reach for tools by default in editing workflows. The paper finds tool-use (search, code exec, direct file edit) ADDS ~6% corruption on average. Tools earn their slot in agentic search and code generation — not in long document editing.
    • Python is the safest workload — 17 of 19 models stay accurate. Prose, music, recipes, financial reports are the worst. If you're a newsletter operator (Ch 6 newsletter skill), don't let an agent edit the published draft over 20 turns. Draft → human → ship.
    • The 80/20 of corruption hides in 10-30% single-step drops. Average-quality metrics will lie to you. Catch the burst, not the average.
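
    The checksum eval and the burst alarm from the bullets above can live in one loop. A minimal sketch, assuming a character-level similarity metric and a 10% single-step threshold — both are my choices for illustration, not values from the paper:

```python
import difflib

def retention(snapshot: str, current: str) -> float:
    """Fraction of the known-good snapshot still present in the doc."""
    sm = difflib.SequenceMatcher(a=snapshot, b=current, autojunk=False)
    return sum(b.size for b in sm.get_matching_blocks()) / max(len(snapshot), 1)

def check_edit_chain(snapshot: str, versions, burst_drop: float = 0.10):
    """Diff every version in an agent's edit chain against the snapshot.
    Returns single-step drops >= burst_drop (the rare 10-30% bursts that
    carry most of the corruption) plus final retention. An average over
    the chain smooths the burst away; the step-wise diff catches it."""
    alerts, prev = [], retention(snapshot, versions[0])
    for step, version in enumerate(versions[1:], start=1):
        r = retention(snapshot, version)
        if prev - r >= burst_drop:
            alerts.append((step, round(prev - r, 3)))  # catch the burst
        prev = r
    return alerts, prev

snapshot = "alpha bravo charlie delta echo " * 40
clipped = snapshot[: int(len(snapshot) * 0.8)]  # simulate one 20% burst
alerts, final = check_edit_chain(snapshot, [snapshot, snapshot, clipped, clipped])
print(alerts, final)  # [(2, 0.2)] 0.8
```

    Wire the snapshot to your last human-approved draft and run the check after every agent turn; the alert fires on the burst the average would hide.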

The stack moves. The book absorbs what matters. If you've read a paper that materially shifts how a chapter's pattern holds up, the inbox is open: email it to Vlad. Operator-grade findings get folded in.
Stay close

Edition 3 lands when this list says it does.
