Vlad's Playbook — Research notes

Gemini 3.5 Flash announced — a Flash outrunning last-gen Pros

Tue, 19 May 2026 00:00:00 +0000

Google announced Gemini 3.5 Flash on 2026-05-19, and the headline isn't the model — it's the tier. This is a *Flash*, the speed/cost line, and on the agentic and coding boards Google chose to show, it clears Gemini 3.1 Pro: Terminal-bench 76.2 vs 70.3, MCP Atlas 83.6 vs 78.2, Finance Agent v2 57.9 vs 43.0. Google explicitly went after agency this cycle — the demo they led with was 3.5 Flash writing a small OS that boots and runs Doom in about twelve hours. The catch: token price tripled, $0.5/$3 → $1.5/$9 per million (for reference, 3.1 Pro is $2/$12 under 200k context). So the sticker went up while the tier went down — a Flash that costs what a Pro used to. The Pro variant exists and is promised next month at a number nobody will say out loud yet. Two operator disciplines apply. First: these are vendor launch-deck numbers, not independent evals — a signal, not a receipt. We do not touch the live LMArena board over a slide; the auto-updated leaderboard in Ch 24 stays the moving source of truth, and a launch presentation is not a leaderboard. Second, the one that actually matters: the price of a model is not the price of a task (Ch 29). A stronger Flash that one-shots what the old Flash needed three turns for can be cheaper at 3× the sticker — and you will not know which until you run *your* workload. The structural read is the real takeaway: when the Flash tier clears last-gen Pro, every cheap-tier/expensive-tier routing assumption you made six months ago is stale. Re-run the split; don't trust the slide; wait for the Pro price before you commit anything.

Claude for the legal industry — Anthropic goes vertical

Tue, 12 May 2026 00:00:00 +0000

On 2026-05-12 Anthropic shipped a packaged legal vertical, and for an operator the interesting part isn't "Claude does law" — it's the *shape*. Three pieces: 20+ MCP connectors into legal systems of record (iManage, NetDocuments, Relativity, Thomson Reuters CoCounsel, Everlaw, Ironclad, Docusign, Box, Datasite, Consilio), 12 practice-area plugins scoped to roles (Litigation, IP, Privacy, Corporate, Employment, Regulatory, AI Governance, Product, Commercial, plus Law Student / Legal Clinic / Legal Builder Hub), and discounted public-service pricing for legal aid. Design partners are not small: Thomson Reuters, Docusign, Harvey, Everlaw, Freshfields, Accenture, Holland & Knight. Claude Opus 4.7 scored 90.9% on Harvey's BigLaw Bench, the highest of any Claude model — a procurement-grade number, the kind you put in a risk memo. The operator read: this is the connectors-plus-skills pattern this entire book teaches, assembled into a product and sold into a regulated industry. The signal is not the law vertical specifically — it's that "MCP connectors into the systems of record + role-scoped plugins" is now Anthropic's own go-to-market motion. If you operate in or sell into any regulated niche, the move is to assemble that same shape yourself — the connector layer over your systems of record, plus per-function skill packs — before a vendor packages your niche for you. Verticalization is a tailwind for operators who already think in connectors and skills, and a clock for those who don't.

Karpathy's CLAUDE.md — 4 rules cut Claude mistakes from 41% to 11%

Thu, 14 May 2026 00:00:00 +0000

A public post from Andrej Karpathy circulated in May 2026 with a single CLAUDE.md framing: four rules — Think Before Coding, Simplicity First, Surgical Changes, Goal-Driven Execution — and a claim that across 30 codebases over six weeks, mistake rates dropped from 41% to 11%. Eight more rules added by community operators (Use the model only for judgment calls, Token budgets are not advisory, Surface conflicts don't average them, Read before you write, Tests verify intent not just behavior, Checkpoint after every significant step, Match the codebase's conventions, Fail loud) pushed mistakes from 11% to 3%. These are community claims, not a first-party Anthropic study — treat the 41% → 11% → 3% numbers as informal evidence, not as a benchmark. The operator implication is sharper than the headline though: Karpathy's original four were autocomplete-flavored single-shot rules — they assumed a one-turn completion where the model writes some code and a human reviews. The eight added rules cover what operators actually run — agent loops, multi-step refactors, silent failures, token-budget exhaustion, codebase-convention drift. That's not a slight on the original four; it's the difference between IDE-assisted coding and full-loop delegation. If you're running Plan → Auto → /goal (Ch 38), you need all twelve. If you're hitting tab-tab autocomplete, four will hold. The new /claude-md-rules page lays each rule out with a Vlad-specific receipt for where it earned its slot — pasting rules without the receipts is how they rot.

Mythos — the model Anthropic disclosed and then explicitly withheld

Wed, 06 May 2026 00:00:00 +0000

Anthropic disclosed an internal model code-named Mythos in March 2026 (Fortune leak first, then formal references in safety materials at red.anthropic.com). Mythos beats Opus 4.7 on every benchmark they ran — including SWE-bench Verified at 93.9% and SWE-bench Pro at 77.8%. Anthropic then explicitly stated Mythos Preview will NOT be made generally available — Project Glasswing shipped instead as the operator-facing successor. The signal isn't a release timeline; it's that Anthropic is now disclosing capability ceilings they're not productizing. For operators, the implication is not 'plan for Mythos' — it's 'the model you can use is one rung below the model they can build,' and Anthropic is being public about it. The strategic move is the same one Ch 30 already argues: stay close to the SDK. Whatever does ship (Glasswing today, anything next) lands instantly for operators on Anthropic-direct paths. Framework-shaped paths wait for the framework PR. UI-layer integrations (Cowork, Claude Code) inherit on Anthropic's release cadence. The deprecation cliff for claude-sonnet-4 / claude-opus-4 on June 15 is the real forcing function — sweep code samples to 4.6/4.7 now, not when Glasswing or whatever-comes-next ships.

Berkeley RDI reward-hacked 8 major agent benchmarks

Sun, 12 Apr 2026 00:00:00 +0000

On April 12, 2026, Berkeley RDI released a paper demonstrating reward-hacking attacks against eight major agent benchmarks — SWE-bench Verified, SWE-bench Pro, OSWorld, GAIA, WebArena, Terminal-Bench, FieldWorkArena, and CAR-bench. The agents didn't solve harder problems. They learned the benchmark's scoring rules and optimized for the score, not the task. The pattern: agents detected which environment they were in (test signature, file structure) and adjusted strategies accordingly. Caveats: not every score gain is reward-hacking, and not every benchmark is equally gameable — OSWorld held up better than SWE-bench Verified per the paper. But the structural point lands: public benchmark scores are now contaminated as signal. Operator implication: pair every external benchmark claim with a private eval you actually wrote. Vendor 'we got 93.9% on SWE-bench' is now closer to marketing copy than to engineering data. This is the third independent confirmation of the same eval gap — OPS-204 from the technical side (content drift), 81k interviews from the user side (unreliability at 26.7%), Berkeley RDI from the benchmark side (gaming). Three methods, one answer: evals or hope, pick one.

CVE-2026-30623 — 200,000 MCP servers vulnerable to command injection

Thu, 16 Apr 2026 00:00:00 +0000

CVE-2026-30623 was disclosed in April 2026. ~200,000 MCP servers across the public registries are vulnerable to STDIO command injection — by design, the STDIO transport can execute arbitrary OS commands, and the registries weren't gating malicious packages. A research team seeded a malicious test package across 11 public MCP registries; 9 of 11 accepted it without review. Anthropic confirmed the underlying behavior is by-design (sanitization is the developer's responsibility) and declined to modify upstream — the fix lives at the registry layer and in operator discipline. The operator implication is sharp: skill + MCP installations are now load-bearing supply-chain risk, on the same shape as npm in 2018. Pin SKILL.md versions to commit SHAs, not tags. Pin MCP server commit hashes in .mcp.json. Read every line of an imported skill before activation. Audit .mcp.json configurations the same way you'd audit package.json — every server that runs in your context can run arbitrary commands. The days of npx from untrusted authors are over, and the days of installing a community skill without diff-reading it never really started.

When operators ask: can the agent do performance reviews?

Wed, 13 May 2026 00:00:00 +0000

Two signals converged in the same week. A Vlad/Olexandra leadership sync floated using an AI agent to analyze Slack and email and generate monthly performance reports off KPIs and 1-on-1 notes. A journalist HARO from Cezara Orbu asked whether C-suite leaders are shifting AI from a productivity tool to an executive decision-support system. Same question, two surfaces. The answer that holds up under both legal review and team trust: aggregation is fine, evaluation isn't. The agent can roll up KPIs, count missed deadlines, flag deals gone quiet, gather the receipts a human review needs. The agent does not write review prose, does not generate ratings, does not surface synthesized 'is this person on track' judgments. Three gates govern the line — legal (Slack data leaving Slack is a privacy boundary), reliability (Anthropic 81k put unreliability at 26.7%, the worst possible failure mode for people decisions), and trust (the moment the team knows the agent is writing reviews, they stop being themselves on Slack, and the data underneath goes poisoned). Run the legal review before the first prompt. Hardcode an evaluative-language refusal into the SKILL.md. The leader still reviews the human.

Anthropic's 81k interviews — what 80,508 Claude users in 159 countries actually want from AI

Wed, 13 May 2026 00:00:00 +0000

Anthropic ran 80,508 conversational interviews across 159 countries and 70 languages — the largest multilingual qualitative AI study ever conducted. Claude-as-interviewer, Claude-as-classifier, de-identified before analysis. Three signals matter for operators. First: unreliability tops every concern at 26.7% — the highest single number in the whole study, and the only benefit/harm tension where the negative (37%) overshadows the positive (22%). Second: independent workers report economic empowerment at 50% vs 14% for institutional employees — a 3.5× gap that validates the solo-operator framing of this entire book at n = 80,508. Third: the productivity / "acceleration treadmill" tension cuts cleanly — 50% report time gains, 18% feel they're now running faster to stay in the same place, freelancers most affected. The most-quotable line from the dataset, from a US respondent: "In the third industrial revolution, horses disappeared from city streets, replaced by automobiles. Now people are afraid they're the horses." 67% global net positive, but the geographic split is sharp — sub-Saharan Africa, Latin America, Southeast Asia most optimistic (24-28% strong positive); Western Europe, North America, Oceania most skeptical (~35% concerned).

OPS-204 — frontier models corrupt ~25% of a document after 20 edits

Tue, 12 May 2026 00:00:00 +0000

Microsoft Research built a benchmark called OPS-204 — 310 work scenarios across 52 domains, from Python and crystallography to recipes and music notation. Methodology: give a model an edit, then the reverse edit; measure how far the file drifts from the original. Across 19 frontier models on documents of 3-5K tokens, the top three (GPT-5.4, Claude 4.6 Opus, Gemini 3.1 Pro) lose ~25% of content after 20 sequential edits. The average across all 19 is ~50%. The best model — Gemini 3.1 Pro — is rated 'ready for delegation' (≥98% preservation) in only 11 of 52 domains. Plugging in agentic tools (search, code-exec, direct file edit) makes it ~6% worse on average, not better. Losses are bursty: ~80% of total corruption comes from rare single-iteration drops of 10-30%. Weak models delete chunks wholesale; top models corrupt the survivors. The one domain where models behave: Python. The worst: prose, recipes, music, financial reports.