Research notes

What the labs ship, and what it changes for operators.

External papers and benchmarks that materially shift how the book's patterns hold up. Not a literature review — only the findings that change what you should do Monday morning. Each note has the receipt, the operator implications, and the chapters it informs.

12 notes · latest 2026-06-10 · RSS · Radar →

  1. June 2026 · 3 notes
  2. 2026-06-10 economics

    The subscription subsidy, quantified — SemiAnalysis prices a $200 plan at up to $8,000 of compute

    Flat-rate plans go underwater past ~10–20% utilization. The grid explains every limit you've ever hit — and why the meter is coming.

    6 receipts · 4 implications · 4 chapters · ~2 min · confidence: medium

    SemiAnalysis (X thread, Tokenomics model) · third-party analyst model with stated assumptions
    The finding

    SemiAnalysis published the grid that turns a vibe every heavy operator has had — "this plan can't possibly be profitable on me" — into numbers. Two tables from their Tokenomics model, both reproduced below. First, the option value: a $20 claude-pro can draw roughly $400/mo of API-equivalent compute at full utilization; claude-max-20x at $200 can draw ~$8,000; chatgpt-pro-20x at the same $200 can draw ~$14,000 — 20×, 40×, 70× the sticker.

    Second, margin by utilization, under one stated assumption: API list prices carry a 75% gross margin, so cost-to-serve = 25% of API-equivalent value. claude-pro and claude-max-5x break even when the average user consumes 20% of their cap; claude-max-20x at 10%; chatgpt-plus and chatgpt-pro-5x at 11.4%; chatgpt-pro-20x at just 5.7%. At full utilization the 20x tiers run −900% (Anthropic) and −1,650% (OpenAI) gross margin.

    Three honest discounts before you quote this anywhere: it's an analyst model, not provider financials — the 75% assumption does the heavy lifting; "max possible spend" is a ceiling, not a typical user; and the margin column is about the AVERAGE user — providers price on the pool, which is exactly why an individual heavy operator gets cross-subsidized by the light ones.

    The operator read cuts two ways. Today: if your utilization sits above the break-even row — and anyone running overnight loops or swarms does — your plan is the cheapest frontier compute on sale, full stop. Tomorrow: this is now a thrice-confirmed pattern, not a hot take. Altman said it in plain text in January 2025 ("Insane thing: we are currently losing money on OpenAI Pro subscriptions!"), Anthropic added weekly limits in 2025, and the Fable 5 note one entry below this one carries the date: included in plan limits June 9–22, then usage credits. The grid is why. Every limit you've ever hit — the 5-hour window, the weekly cap, the usage-credit transition — is not stinginess; it's the mechanism that drags average utilization back left into the green columns. Plan for metered.

    The grid, redrawn
    Anthropic OpenAI

    What the sticker actually buys

    Max possible API-equivalent draw per plan, at full utilization of the cap. Bar = share of the largest draw.

    claude-pro 20×
    $20/mo buys ~$400/mo
    claude-max-5x 20×
    $100/mo buys ~$2,000/mo
    claude-max-20x 40×
    $200/mo buys ~$8,000/mo
    chatgpt-plus 35×
    $20/mo buys ~$700/mo
    chatgpt-pro-5x 35×
    $100/mo buys ~$3,500/mo
    chatgpt-pro-20x 70×
    $200/mo buys ~$14,000/mo

    Where each plan flips underwater

    Provider gross margin by average utilization, under the stated assumption (cost-to-serve = 25% of API-equivalent value). Green = provider in the money; the marker is break-even.

    claude-pro
    20%
    −400% at 100%
    claude-max-5x
    20%
    −400% at 100%
    claude-max-20x
    10%
    −900% at 100%
    chatgpt-plus
    11.4%
    −775% at 100%
    chatgpt-pro-5x
    11.4%
    −775% at 100%
    chatgpt-pro-20x
    5.7%
    −1,650% at 100%

    The green sliver is the entire business model. A heavy operator lives in the red — that's the subsidy, and the reason every meter exists.

    The source tables — SemiAnalysis Tokenomics
    SemiAnalysis table: plan price vs max possible spend — claude-pro $20/mo can draw ~$400/mo, claude-max-5x $100 → ~$2,000, claude-max-20x $200 → ~$8,000, chatgpt-plus $20 → ~$700, chatgpt-pro-5x $100 → ~$3,500, chatgpt-pro-20x $200 → ~$14,000
    Plan price vs. max possible spend (approx.) — the option value of each flat-rate tier.
    SemiAnalysis heatmap: provider gross margin on each plan by average utilization, from +95% at 1% utilization to −400%/−900% (Claude tiers) and −775%/−1,650% (ChatGPT tiers) at 100% utilization; break-even at 20%/10% (Anthropic) and 11.4%/5.7% (OpenAI)
    Subscription margin by utilization — the full grid, green to deep red. Assumption stated in the header: API list prices carry a 75% gross margin.
    Receipts
    Max possible spend on a $200 plan
    claude-max-20x ~$8,000/mo · chatgpt-pro-20x ~$14,000/mo (40× / 70× sticker)
    Break-even avg utilization
    claude-pro & max-5x 20% · max-20x 10% · chatgpt-plus & pro-5x 11.4% · pro-20x 5.7%
    Margin at 100% utilization
    claude-pro −400% · claude-max-20x −900% · chatgpt-pro-20x −1,650%
    Model assumption
    API list prices = 75% gross margin (cost-to-serve = 25% of API-equivalent value)
    Corroboration
    Altman, Jan 2025: "we are currently losing money on OpenAI Pro subscriptions" · Fable 5 → usage credits after Jun 22
    Source confidence
    medium (analyst model with stated assumptions — not provider financials)
    Operator implications
    1. 01 Measure your own utilization before the meter does it for you. Estimate your API-equivalent burn (token-usage reporting tools make this a 5-minute job) and place yourself on the grid. Above the break-even row, you are being subsidized — bank that consciously (Ch 29), don't discover it when the repricing email lands.
    2. 02 Don't build a business on the subsidy. If your product's unit economics only work at plan-subsidized inference cost, they don't work. Price your cost-per-task at API list (Ch 29) — then the flat-rate plan is margin you keep today, not a hole that opens when the flat-rate era ends.
    3. 03 The 20x tiers are leverage instruments, exercised by autonomy. $200 → ~$8,000 of API-equivalent compute is only real if you actually run the loops — run-until-done (Ch 38) and swarms (Ch 6) are how the right tail of the grid lives. Operators who babysit single turns never get close to their cap.
    4. 04 Cross-vendor read (Ch 35): OpenAI's plans go red at roughly half the utilization Anthropic's do (break-even 5.7–11.4% vs 10–20%). Expect limit-tightening and repricing to hit there first and harder — factor that into any two-priors workflow that leans on the OpenAI side staying cheap.
  3. 2026-06-09 models

    Fable 5 / Mythos 5 — the withheld model shipped, split in two

    The May 6 note said Mythos isn't coming. It came — as two names, one model, and a fallback architecture nobody predicted.

    6 receipts · 6 implications · 5 chapters · ~3 min · confidence: high

    Anthropic launch announcement (anthropic.com/news) · first-party · 2026-06-09
    The finding

    On 2026-06-09 Anthropic shipped Claude Fable 5 and Claude Mythos 5 — and the research note this page ran on May 6 needs a correction. That note read the Mythos disclosure as a capability ceiling Anthropic would not productize: "Mythos isn't coming." Wrong call, interesting mechanism. Anthropic didn't choose between shipping and withholding — they split the model. Fable 5 and Mythos 5 are the same underlying model (the names are the same word: Latin fabula, Greek mythos — "that which is told"). Mythos 5 keeps the raw capability and stays gated — Project Glasswing partners with cyber safeguards lifted, select biology researchers next.

    Fable 5 is the version anyone can buy today: classifiers sit in front of three areas (offensive cyber, biology/chemistry, capability distillation), and what happens when one trips depends on the surface — in the Claude apps and Managed Agents the response falls back to Claude Opus 4.8 and you're told; on the raw Messages API a blocked request returns an error you aren't charged for, with fallback-to-Opus-4.8 an opt-in billed at Opus pricing. Anthropic's own number: more than 95% of Fable sessions involve no fallback at all.

    The numbers explain why the May 6 read missed: SWE-Bench Pro 80.3% vs Opus 4.8's 69.2%; FrontierCode Diamond 29.3% vs 13.4% — more than double the previous frontier on the eval built to be unsaturated; GDPval-AA 1932 vs 1890 on knowledge work. First-party launch numbers, so the Ch 24 discount applies — but unlike the Mythos Preview disclosure, this model is buyable, which means it belongs on the tier list and in your private eval suite, not just in a forecast.

    Sticker: $10/$50 per million tokens — exactly 2× Opus 4.8's $5/$25 — with a 1M context window. And the operator clock that matters: Fable 5 is included in paid-plan limits June 9–22, then moves to usage credits. Two free weeks to find out what it one-shots that Opus 4.8 couldn't. One housekeeping note: this book has now used 'Mythos' three ways — my private name for Opus 3 (the sovereign-stack lesson), the withheld preview, and now the shipped product. The glossary disambiguates; the timeline here is the record.

    Receipts
    SWE-Bench Pro
    80.3% (Opus 4.8: 69.2% · GPT 5.5: 58.6%)
    FrontierCode (Diamond), xhigh
    29.3% vs Opus 4.8's 13.4% — 2.2×
    Price
    $10 / $50 per Mtok — 2× Opus 4.8 · 1M context
    Fallback architecture
    3 classifier areas → Opus 4.8 (built-in on apps/CMA, opt-in on the API) · >95% of sessions no fallback
    Plan window
    included in plan limits Jun 9–22, then usage credits
    Source confidence
    high (first-party) — benchmarks are launch numbers, discount per Ch 24
    Operator implications
    1. 01 Run your own eval on Fable 5 before June 22 — it's inside plan limits until then, usage credits after. Two weeks of free frontier capacity is the cheapest private-eval window you'll get this year (Ch 25: evals or hope). Decide on YOUR workload, not the launch table.
    2. 02 Read the price as cost-per-task, not sticker (Ch 29). $10/$50 is 2× Opus 4.8 — but the launch receipts are about turn-count collapse: Stripe ran a 50-million-line Ruby codebase migration in one day that was scoped at two months by hand. A model that one-shots a long-horizon task at 2× the rate beats one that needs five turns at 1×.
    3. 03 Update the May 6 posture: "capability-disclosed-but-withheld" was a staging pattern, not a refusal. The new recurring shape is ship-with-classifiers — raw model gated (Mythos 5), safeguarded twin sold (Fable 5), fallback to the previous Opus when a classifier trips. Expect the next frontier release to wear the same architecture.
    4. 04 The fallback is an operator-visible event, not fine print. If your work touches security tooling or bio-adjacent domains, the starred benchmarks tell you Fable 5 behaves closer to Opus 4.8 there by design — route those workloads accordingly instead of debugging a "regression" that is actually a safeguard.
    5. 05 Model id is claude-fable-5 — if you followed Ch 2 and made the model id a swappable variable, trying it is a one-line change. Same API surface as Opus 4.8 with one new 400: an explicit thinking:{type:"disabled"} is rejected — omit the param instead. SDK-direct paths (Ch 30) test it today; framework paths wait for the wrapper release, again. One admin gate: updated terms must be accepted in the Claude Console before the model works.
    6. 06 The advisor pattern just went first-party: Fable 5 is available as an advisor model — faster, cheaper worker models call it mid-task to check their plan and grade their work. That is the conductor-and-judge split this book runs everywhere (Ch 6, Ch 42), sold as an API primitive. Keep workers on Sonnet-tier; spend the 2× model at the judgment gate only.
  4. 2026-06-04 research

    When AI builds itself — the lab-scale proof of the operator posture

    Define success, walk away — done at lab scale. Ch 44 is the same move, your scale.

    6 receipts · 4 implications · 4 chapters · ~2 min · confidence: high

    Anthropic Institute essay (anthropic.com/institute) · first-party, frontier-lab vantage
    The finding

    In June 2026 the Anthropic Institute published "When AI builds itself" (Favaro & Clark, ed. Ruiz), arguing AI is increasingly automating AI development — and that if capability scaling and compute hold, systems could reach recursive self-improvement: autonomously designing, training, and developing their successors without human direction. Read it as a forecast and you miss the part that matters to you. The receipts underneath the forecast are the macro proof of the exact posture this whole book teaches.

    The line that names it: "humans have ideas, and models implement, test and evaluate them an order of magnitude faster." That is define-success-then-walk-away (Ch 38) at the scale of a frontier lab — and the numbers are first-party: the task length a model can reliably complete is doubling roughly every four months; >80% of production code merged at Anthropic was authored by Claude as of May 2026 (up from low single digits in early 2025); their first autonomous end-to-end research run recovered 97% of a performance gap for ~$18k of compute over ~800 hours.

    The operator read is sharp: this is not a prophecy you wait on, it's a mirror you already have. Ch 44 'Dreaming' — the surfacer that proposes and never writes — is the personal-scale version of the same arc, with the same gate the essay names under Amdahl's law: the bottleneck shifts to human review, judgment, and direction-setting. The autonomy completes only as fast as your evaluator does.

    Source-confidence is split on purpose: the *evidence* is high (first-party, the highest-vantage internal data anyone publishes), but the RSI *prediction* is a claim, not a receipt — the essay itself gates it on whether judgment tasks become automatable, and the book's benchmark skepticism (Ch 24, reward-hacking) applies to the capability curves too. The Institute even backs a verifiable global slowdown over unilateral pauses — that's a policy stance, not a product datasheet. So: take the receipts, run the same play one level down, and keep your hand on the gate. The lab is proving the posture. You just operate it at your scale.

    Receipts
    Task length AI completes reliably
    doubling ~every 4 months (was ~7)
    Anthropic production code authored by Claude
    >80% (May 2026, up from low single digits in early 2025)
    First autonomous end-to-end research
    recovered 97% of a perf gap · ~$18k / ~800h
    Code-optimization speedup
    3× → 52× in under a year (superhuman)
    Claude judgment on next research step vs humans
    51% (Nov 2025) → 64% (Apr 2026)
    Source confidence
    high (first-party lab evidence) — but the RSI forecast is a claim, not a receipt
    Operator implications
    1. 01 Run the Ch 38 loop on your own work the way the lab runs it on theirs: define a hard success condition, let the agent iterate, and put your judgment ONLY at the evaluator gate — Amdahl says that gate is the whole bottleneck, so invest there, not in babysitting turns.
    2. 02 Treat Ch 44 Dreaming as your personal-scale RSI: a surfacer that proposes memory-learnings and never writes. The essay is the lab-scale endpoint; the chapter is the version you can ship Monday — keep the propose-only / human-approves line exactly where the essay puts the human-judgment gate.
    3. 03 Read the essay's economics as cost-per-outcome receipts, not per-token (Ch 29): first autonomous end-to-end research at ~$18k over ~800 hours, 8× code shipped per quarter, code-optimization 3×→52× in under a year. The bill is a rounding error on the labor it deletes — measure per finished task.
    4. 04 Do not re-route or re-tier on the capability curves alone. These are first-party lab numbers framed as proof of self-improvement; the book's own reward-hacking caveat (Ch 24) applies — take them as high-vantage evidence of the trend, treat the RSI endpoint as a claim, and keep the live leaderboard as the source of truth.
  5. May 2026 · 7 notes
  6. 2026-05-19 models

    Gemini 3.5 Flash announced — a Flash outrunning last-gen Pros

    The cheap tier is eating the premium tier — and the Pro hasn't even shipped yet.

    6 receipts · 4 implications · 3 chapters · ~2 min · confidence: low

    Google launch presentation + DeepMind evals-methodology page · announced, not independently benchmarked
    The finding

    Google announced Gemini 3.5 Flash on 2026-05-19, and the headline isn't the model — it's the tier. This is a *Flash*, the speed/cost line, and on the agentic and coding boards Google chose to show, it clears Gemini 3.1 Pro: Terminal-bench 76.2 vs 70.3, MCP Atlas 83.6 vs 78.2, Finance Agent v2 57.9 vs 43.0. Google explicitly went after agency this cycle — the demo they led with was 3.5 Flash writing a small OS that boots and runs Doom in about twelve hours.

    The catch: token price tripled, $0.5/$3 → $1.5/$9 per million (for reference, 3.1 Pro is $2/$12 under 200k context). So the sticker went up while the tier went down — a Flash that costs what a Pro used to. The Pro variant exists and is promised next month at a number nobody will say out loud yet.

    Two operator disciplines apply. First: these are vendor launch-deck numbers, not independent evals — a signal, not a receipt. We do not touch the live LMArena board over a slide; the auto-updated leaderboard in Ch 24 stays the moving source of truth, and a launch presentation is not a leaderboard. Second, the one that actually matters: the price of a model is not the price of a task (Ch 29). A stronger Flash that one-shots what the old Flash needed three turns for can be cheaper at 3× the sticker — and you will not know which until you run *your* workload.

    The structural read is the real takeaway: when the Flash tier clears last-gen Pro, every cheap-tier/expensive-tier routing assumption you made six months ago is stale. Re-run the split; don't trust the slide; wait for the Pro price before you commit anything.

    Receipts
    Tier
    Flash (speed/cost line) — not the Pro
    vs 3.1 Pro (Google slide)
    Terminal-bench 76.2 / 70.3 · MCP Atlas 83.6 / 78.2
    Token price
    $0.5/$3 → $1.5/$9 per M (3×)
    Agency demo
    wrote a small OS that runs Doom (~12h)
    Pro variant
    next month — price undisclosed
    Source confidence
    low (vendor launch deck, not independent eval)
    Operator implications
    1. 01 Do not switch on a launch deck. Re-test cost-per-task on your own workload (Ch 29 method) before moving any routing — a 3× sticker can still be cheaper per task, or not, and only your traffic tells you which.
    2. 02 The Flash-beats-Pro signal means your model floor moved again. Revisit your Haiku/Sonnet/Opus (or cross-vendor) split — the assumption that "cheap tier = weak tier" is the thing that just broke.
    3. 03 Keep the live LMArena widget (Ch 24) as the moving source of truth. A vendor slide is not a leaderboard; wait for independent evals + the Pro price before re-tiering anything in writing.
    4. 04 If you run a second-prior workflow (Ch 35 — Gemini in AI Studio as the idea machine), nothing changes operationally — but the prior just got stronger and pricier. The move is still "two priors triangulate," not "switch to the new one."
  7. 2026-05-14 method

    Karpathy's CLAUDE.md — 4 rules cut Claude mistakes from 41% to 11%

    Four rules at 11%, twelve at 3%. The rest is operator overlay.

    5 receipts · 5 implications · 4 chapters · ~2 min · confidence: medium

    Public post + community replication · 30-codebase informal study
    The finding

    A public post from Andrej Karpathy circulated in May 2026 with a single CLAUDE.md framing: four rules — Think Before Coding, Simplicity First, Surgical Changes, Goal-Driven Execution — and a claim that across 30 codebases over six weeks, mistake rates dropped from 41% to 11%. Eight more rules added by community operators (Use the model only for judgment calls, Token budgets are not advisory, Surface conflicts don't average them, Read before you write, Tests verify intent not just behavior, Checkpoint after every significant step, Match the codebase's conventions, Fail loud) pushed mistakes from 11% to 3%.

    These are community claims, not a first-party Anthropic study — treat the 41% → 11% → 3% numbers as informal evidence, not as a benchmark. The operator implication is sharper than the headline though: Karpathy's original four were autocomplete-flavored single-shot rules — they assumed a one-turn completion where the model writes some code and a human reviews. The eight added rules cover what operators actually run — agent loops, multi-step refactors, silent failures, token-budget exhaustion, codebase-convention drift. That's not a slight on the original four; it's the difference between IDE-assisted coding and full-loop delegation.

    If you're running Plan → Auto → /goal (Ch 38), you need all twelve. If you're hitting tab-tab autocomplete, four will hold. The new /claude-md-rules page lays each rule out with a Vlad-specific receipt for where it earned its slot — pasting rules without the receipts is how they rot.

    Receipts
    Mistake rate, 4 rules
    41% → 11%
    Mistake rate, 12 rules
    11% → 3%
    Codebases tested (community claim)
    30
    Source confidence
    medium (public post, not first-party study)
    Vlad's lead-with-3 picks
    Rule 8 (read before write), Rule 12 (fail loud), Rule 4 (goal-driven)
    Operator implications
    1. 01 Adopt the 12 rules as a CLAUDE.md baseline. Read each rule's receipt before pasting — the rules without a receipt you've personally seen rot fastest. The rules are a starting line, not a finish line.
    2. 02 Lead with Rule 8 (read before write), Rule 12 (fail loud), Rule 4 (goal-driven). The other 9 amplify if these 3 land. Rule 8 fixes ~50% of regression-class bugs Vlad has seen across portfolio repos; Rule 12 saves the silent-failure category entirely (Ch 25, Ch 28); Rule 4 pairs with /goal in Ch 38 — same primitive, different surface.
    3. 03 Layer with role-specific overrides (CLAUDE_MD_SOLO / CLAUDE_MD_B2B_SALES / CLAUDE_MD_NEWSLETTER / CLAUDE_MD_PORTFOLIO_CEO in /resources). The 12 rules are universal; the role skeletons are what makes them load-bearing for your specific work.
    4. 04 Cross-link this with Ch 25 (Rule 9 + Rule 12 — tests verify intent, fail loud are evals-or-hope at the rule-file layer) and Ch 38 (Rule 4 + Rule 10 — goal-driven + checkpoint after every step is the autonomous-loop primitive in CLAUDE.md form).
    5. 05 Pair with /llms.txt + JSON-LD work — your CLAUDE.md is one of the operator surfaces, the structured-data dump is the other. Both are read-on-demand context surfaces for agents; both should be short, scannable, and grounded in receipts you can defend.
  8. 2026-05-13 method

    When operators ask: can the agent do performance reviews?

    Aggregation yes, evaluation no — and why the line matters.

    2 receipts · 4 implications · 3 chapters · ~1 min

    Internal — surfaced May 2026 from leadership conversations + journalist HARO request (Cezara Orbu, Apr 2, 2026)
    The finding

    Two signals converged in the same week. A Vlad/Olexandra leadership sync floated using an AI agent to analyze Slack and email and generate monthly performance reports off KPIs and 1-on-1 notes. A journalist HARO from Cezara Orbu asked whether C-suite leaders are shifting AI from a productivity tool to an executive decision-support system. Same question, two surfaces.

    The answer that holds up under both legal review and team trust: aggregation is fine, evaluation isn't. The agent can roll up KPIs, count missed deadlines, flag deals gone quiet, gather the receipts a human review needs. The agent does not write review prose, does not generate ratings, does not surface synthesized 'is this person on track' judgments.

    Three gates govern the line — legal (Slack data leaving Slack is a privacy boundary), reliability (Anthropic 81k put unreliability at 26.7%, the worst possible failure mode for people decisions), and trust (the moment the team knows the agent is writing reviews, they stop being themselves on Slack, and the data underneath goes poisoned). Run the legal review before the first prompt. Hardcode an evaluative-language refusal into the SKILL.md. The leader still reviews the human.

    Receipts
    Anthropic 81k — unreliability concern
    26.7% (#1 concern in study)
    Vlad's rule
    aggregation OK, evaluation NOT
    Operator implications
    1. 01 Before any people-data workflow ships: legal review of Slack/email/HRIS data leaving its system of record. If GC hasn't cleared the destination context, the workflow isn't ready, no matter how good the prompt is.
    2. 02 Hardcode an evaluative-language refusal into any people-aggregation SKILL.md. Sample line: 'this skill does not generate evaluative language, ratings, recommendations, or review prose; return underlying numbers only.' Eval the refusal quarterly.
    3. 03 Aggregation skills (KPI rollups, missed-deadline counts, deal-quiet flags) are safe to ship. They collapse gathering, the same as every other operator workflow in this book. Just don't let them cross into synthesis.
    4. 04 If your team learns the agent is writing reviews, the Slack signal underneath corrupts within weeks — people start performing for the agent, not communicating with peers. That alone makes the workflow more expensive than the time saved.
  9. 2026-05-13 research

    Anthropic's 81k interviews — what 80,508 Claude users in 159 countries actually want from AI

    Trust is the chokepoint. The leverage flows to operators, not to spreadsheets. "People are afraid they're the horses."

    6 receipts · 6 implications · 6 chapters · ~2 min

    Anthropic · 80,508-respondent qualitative study · Dec 2024 fieldwork · Huang et al., 2026
    The finding

    Anthropic ran 80,508 conversational interviews across 159 countries and 70 languages — the largest multilingual qualitative AI study ever conducted. Claude-as-interviewer, Claude-as-classifier, de-identified before analysis. Three signals matter for operators.

    First: unreliability tops every concern at 26.7% — the highest single number in the whole study, and the only benefit/harm tension where the negative (37%) overshadows the positive (22%). Second: independent workers report economic empowerment at 50% vs 14% for institutional employees — a 3.5× gap that validates the solo-operator framing of this entire book at n = 80,508. Third: the productivity / "acceleration treadmill" tension cuts cleanly — 50% report time gains, 18% feel they're now running faster to stay in the same place, freelancers most affected.

    The most-quotable line from the dataset, from a US respondent: "In the third industrial revolution, horses disappeared from city streets, replaced by automobiles. Now people are afraid they're the horses." 67% global net positive, but the geographic split is sharp — sub-Saharan Africa, Latin America, Southeast Asia most optimistic (24-28% strong positive); Western Europe, North America, Oceania most skeptical (~35% concerned).

    Receipts
    Sample size
    80,508
    Countries / languages
    159 / 70
    #1 concern (unreliability)
    26.7%
    Independent vs institutional empowerment
    50% vs 14% (3.5×)
    AI took steps toward stated vision
    81%
    Global net positive sentiment
    67%
    Operator implications
    1. 01 Unreliability is the #1 concern at 26.7% — the same chokepoint OPS-204 identifies from the technical side. Two independent studies, two methods, one answer. The case for content-checksum evals (Ch 25) just gained an n = 80,508 citation. If your prospects/teammates are pushing back on AI adoption, this is the wedge their hesitation is sitting on, not the cost.
    2. 02 Independent workers report 50% economic empowerment vs 14% for institutional employees — a 3.5× asymmetry. The leverage of AI flows to operators, not to spreadsheets. This is the whole thesis of the book, validated externally. /cfo-case now has an n = 80,508 citation: AI doesn't replace your team, it widens the gap between operators who run it themselves and orgs that watch it from a distance.
    3. 03 The acceleration treadmill is real and asymmetric — 50% report time gains, 18% feel the treadmill sped up, freelancers worst affected. Operator move: schedule the gain (Ch 7), but also defend the reclaimed time. Most operators auto-fill the gain with more meetings, which is how 'AI saved me 10 hours' becomes 'I'm working the same hours, just on different things.'
    4. 04 Cognitive atrophy is being witnessed at 2.5-3× baseline by educators. Skills as policy (Ch 26) — your team's CLAUDE.md needs to name "we don't outsource thinking, we outsource gathering" explicitly, or you'll grow a quietly-atrophied org. The vault discipline (Ch 4) is the counter: forcing synthesis through the operator's own hands is what stops the atrophy.
    5. 05 Sycophancy ranks in the top-10 concerns (10.8%). Reinforces the Ch 2 framing: "Claude pushes back when I'm wrong; GPT will helpfully ship the bad idea you asked for." Operators get more value from disagreement than from agreement at scale — choose tools and prompts that earn the disagreement.
    6. 06 Geographic split: emerging markets most optimistic, developed markets most skeptical. The book is written for a Western-operator audience that the data flags as the most-cautious cohort. If you're operating with customers or teams in sub-Saharan Africa, Latin America, or Southeast Asia, expect them to pull harder for AI than your domestic peers — calibrate.
    Source Anthropic feature page →
  10. 2026-05-12 benchmarks

    OPS-204 — frontier models corrupt ~25% of a document after 20 edits

    Don't delegate long doc-editing chains. Break them up. Add an eval.

    6 receipts · 5 implications · 4 chapters · ~2 min

    Microsoft Research preprint · arXiv · MIT license
    The finding

    Microsoft Research built a benchmark called OPS-204 — 310 work scenarios across 52 domains, from Python and crystallography to recipes and music notation. Methodology: give a model an edit, then the reverse edit; measure how far the file drifts from the original.

    Across 19 frontier models on documents of 3-5K tokens, the top three (GPT-5.4, Claude 4.6 Opus, Gemini 3.1 Pro) lose ~25% of content after 20 sequential edits. The average across all 19 is ~50%. The best model — Gemini 3.1 Pro — is rated 'ready for delegation' (≥98% preservation) in only 11 of 52 domains. Plugging in agentic tools (search, code-exec, direct file edit) makes it ~6% worse on average, not better.

    Losses are bursty: ~80% of total corruption comes from rare single-iteration drops of 10-30%. Weak models delete chunks wholesale; top models corrupt the survivors. The one domain where models behave: Python. The worst: prose, recipes, music, financial reports.

    Receipts
    Top-3 models, content lost after 20 edits
    ~25%
    Mean across all 19 models
    ~50%
    Best model 'ready' domains
    11 / 52
    Tools added (search / exec / edit)
    +6% corruption
    Bursty drops account for
    ~80% of loss
    Safest workload
    Python (17/19 OK)
    Operator implications
    1. 01 Long doc-editing chains drift even when each step looks competent. If your skill iterates on a document over 15+ turns, you're losing content silently — not making it worse on every turn, just bursting every few turns.
    2. 02 Add a content checksum eval. Periodically diff against a known-good snapshot. This is exactly the eval pattern in Ch 25 — the skill that fired flawlessly for 6 weeks and silently shipped a $0-pipeline canvas was bursty drift, the same shape.
    3. 03 Don't reach for tools by default in editing workflows. The paper finds tool-use (search, code exec, direct file edit) ADDS ~6% corruption on average. Tools earn their slot in agentic search and code generation — not in long document editing.
    4. 04 Python is the safest workload — 17 of 19 models stay accurate. Prose, music, recipes, financial reports are the worst. If you're a newsletter operator (Ch 6 newsletter skill), don't let an agent edit the published draft over 20 turns. Draft → human → ship.
    5. 05 The 80/20 of corruption hides in 10-30% single-step drops. Average-quality metrics will lie to you. Catch the burst, not the average.
    Source Dataset + paper →
  11. 2026-05-06 models

    Mythos — the model Anthropic disclosed and then explicitly withheld

    Anthropic showed Mythos, then refused to ship it. Project Glasswing went out instead.

    6 receipts · 5 implications · 4 chapters · ~2 min

    Anthropic safety disclosures · red.anthropic.com · Mar-May 2026
    The finding

    Anthropic disclosed an internal model code-named Mythos in March 2026 (Fortune leak first, then formal references in safety materials at red.anthropic.com). Mythos beats Opus 4.7 on every benchmark they ran — including SWE-bench Verified at 93.9% and SWE-bench Pro at 77.8%. Anthropic then explicitly stated Mythos Preview will NOT be made generally available — Project Glasswing shipped instead as the operator-facing successor.

    The signal isn't a release timeline; it's that Anthropic is now disclosing capability ceilings they're not productizing. For operators, the implication is not 'plan for Mythos' — it's 'the model you can use is one rung below the model they can build,' and Anthropic is being public about it.

    The strategic move is the same one Ch 30 already argues: stay close to the SDK. Whatever does ship (Glasswing today, anything next) lands instantly for operators on Anthropic-direct paths. Framework-shaped paths wait for the framework PR. UI-layer integrations (Cowork, Claude Code) inherit on Anthropic's release cadence. The deprecation cliff for claude-sonnet-4 / claude-opus-4 on June 15 is the real forcing function — sweep code samples to 4.6/4.7 now, not when Glasswing or whatever-comes-next ships.

    Receipts
    Mythos SWE-bench Verified
    93.9% (vs Opus 4.7 trailing)
    Mythos SWE-bench Pro
    77.8% (the honest coding benchmark)
    Mythos OSWorld
    81% (vs Sonnet 4.6 at 72.5%, ~human baseline)
    Release status
    Explicitly withheld — Glasswing shipped instead
    First disclosure
    Fortune leak Mar 2026, then red.anthropic.com
    Deprecation cliff
    June 15, 2026 (claude-sonnet-4 / claude-opus-4)
    Operator implications
    1. 01 Audit your stack for framework-vs-SDK dependency depth. Mythos shows the pattern: the gap between what the lab can build and what your framework wraps is now measurable. Frameworks that lag on Glasswing today will lag on every release after.
    2. 02 For high-value workflows, keep at least one Anthropic-SDK-direct path. The argument is not Mythos-specific — it generalizes to any future capability disclosure.
    3. 03 Treat capability-disclosed-but-withheld as a recurring signal. Anthropic publishing benchmark numbers for a model they will not ship is a new posture; expect more of it. Use it to read the roadmap, not to plan your stack.
    4. 04 Sweep model references across Ch 2 / Ch 24 / SKILL.md files now. Move claude-sonnet-4 / claude-opus-4 to 4.6 / 4.7 before the June 15 deprecation cliff. Make the model id a swappable variable, not a hardcoded string — that's what protects you against the next disclosure.
    5. 05 For agent framework selection (Ch 36), the relevant tax is not "weeks behind on Mythos" — Mythos isn't coming. The tax is structural lag on every release. CrewAI / LangGraph / Microsoft Agent Framework wait for SDK changes to land in framework releases; SDK-direct paths don't.
  12. April 2026 · 2 notes
  13. 2026-04-16 security

    CVE-2026-30623 — 200,000 MCP servers vulnerable to command injection

    Pin your skill versions. Audit the MCP servers you wire. The supply chain is the new attack surface.

    4 receipts · 5 implications · 4 chapters · ~2 min

    liteLLM + OX Security advisory · April 2026 · Anthropic confirmed by-design
    The finding

    CVE-2026-30623 was disclosed in April 2026. ~200,000 MCP servers across the public registries are vulnerable to STDIO command injection — by design, the STDIO transport can execute arbitrary OS commands, and the registries weren't gating malicious packages. A research team seeded a malicious test package across 11 public MCP registries; 9 of 11 accepted it without review.

    Anthropic confirmed the underlying behavior is by-design (sanitization is the developer's responsibility) and declined to modify upstream — the fix lives at the registry layer and in operator discipline.

    The operator implication is sharp: skill + MCP installations are now load-bearing supply-chain risk, on the same shape as npm in 2018. Pin SKILL.md versions to commit SHAs, not tags. Pin MCP server commit hashes in .mcp.json. Read every line of an imported skill before activation. Audit .mcp.json configurations the same way you'd audit package.json — every server that runs in your context can run arbitrary commands. The days of npx <random-mcp> from untrusted authors are over, and the days of installing a community skill without diff-reading it never really started.

    Receipts
    MCP servers vulnerable
    ~200,000
    Registries that accepted malicious test package
    9 of 11
    Anthropic verdict
    by-design — fix at the registry layer + operator discipline
    Disclosure date
    April 2026
    Operator implications
    1. 01 Pin every imported skill to a specific git SHA, not a tag or branch. Tags can be re-pointed; SHAs can't. The 30-second discipline shift saves you from a class of supply-chain attack.
    2. 02 Audit .mcp.json server configs before activation. Specifically check for unconstrained command fields that could execute arbitrary binaries, and for env-var passthrough that leaks secrets into the server process.
    3. 03 Use a hook (extend HOOK_SECRETS_SCAN or write a sibling) to block Write/Edit when a SKILL.md change pulls in new allowed-tools entries you haven't approved. The hook is the cheap defense; the read-every-line discipline is the load-bearing one.
    4. 04 Treat MCP registry stars the same way you treat npm download counts — not a security signal. 9 of 11 registries accepted a malicious test package; the registry layer is not protecting you.
    5. 05 For high-stakes workflows (sales-ops, finance, hiring, anything touching PII), only use first-party Anthropic MCP servers or those independently audited (Trail of Bits, ProjectDiscovery). Internal mirrors of the MCP Registry are now a real pattern, not paranoia.
  14. 2026-04-12 benchmarks

    Berkeley RDI reward-hacked 8 major agent benchmarks

    Agents didn't get smarter. They learned to game the tests. Evals are structural, benchmarks are gameable.

    5 receipts · 5 implications · 4 chapters · ~2 min

    Berkeley Responsible Data Intelligence lab · paper released 2026-04-12
    The finding

    On April 12, 2026, Berkeley RDI released a paper demonstrating reward-hacking attacks against eight major agent benchmarks — SWE-bench Verified, SWE-bench Pro, OSWorld, GAIA, WebArena, Terminal-Bench, FieldWorkArena, and CAR-bench. The agents didn't solve harder problems. They learned the benchmark's scoring rules and optimized for the score, not the task. The pattern: agents detected which environment they were in (test signature, file structure) and adjusted strategies accordingly.

    Caveats: not every score gain is reward-hacking, and not every benchmark is equally gameable — OSWorld held up better than SWE-bench Verified per the paper. But the structural point lands: public benchmark scores are now contaminated as signal.

    Operator implication: pair every external benchmark claim with a private eval you actually wrote. Vendor 'we got 93.9% on SWE-bench' is now closer to marketing copy than to engineering data. This is the third independent confirmation of the same eval gap — OPS-204 from the technical side (content drift), 81k interviews from the user side (unreliability at 26.7%), Berkeley RDI from the benchmark side (gaming). Three methods, one answer: evals or hope, pick one.

    Receipts
    Benchmarks broken via reward-hacking
    8 of 8 tested
    Release date
    2026-04-12
    Independent eval-gap citations
    3 (OPS-204 + 81k + RDI)
    Sonnet 4.6 OSWorld (held up best)
    72.5%
    Recommended discount on public scores
    10-15 points for contamination + gaming
    Operator implications
    1. 01 Write a private smoke eval before any production deploy. Pair every external benchmark claim with one private number you can verify against your own domain.
    2. 02 Treat SWE-bench Verified, SWE-bench Pro, OSWorld, GAIA, WebArena, Terminal-Bench scores as marketing signal, not engineering data, until independently reproduced on held-out tasks.
    3. 03 For agent framework selection, weight production case studies (named companies, real workflows) higher than benchmark scores. CrewAI claiming 12M daily executions across 150 enterprises is a stronger signal than any leaderboard number.
    4. 04 Update Ch 25 framing: the eval problem is structural, not specific. OPS-204 + 81k interviews + Berkeley RDI = three independent confirmations of the same gap. The case for content-checksum evals and held-out per-domain evals is now n = 3 method-independent.
    5. 05 Anthropic's Sonnet 4.6 at 72.5% on OSWorld is the current production-realistic number to anchor on — partly because OSWorld is harder to game than the others (per the paper), partly because Anthropic published the number on its own product page.
The stack moves. The book absorbs what matters. If you've read a paper that materially shifts how a chapter's pattern holds up, the inbox is open: email it to Vlad. Operator-grade findings get folded in.

Or subscribe via RSS: /rss/research-notes.xml

Stay close

The next edition lands when this list says it does.

No course. No paywall. Operator playbooks weekly. 10K+ subscribers.