It’s 11:42 PM Thursday. The friday-wrapup skill fires again at 5 PM tomorrow, and I’ve just found out it has been failing silently for over a week.
The 9-day silent failure#
Here’s what nine days of silence costs. Nine canvases shipped to the leadership channel. Each one wrong in the same way — pipeline section empty, deal-motion section thin, executive summary contradicting itself. None of my reports flagged it because three of them had stopped reading the canvas closely two weeks earlier (it had become wallpaper) and the fourth assumed the empty pipeline meant a quiet stretch. The skill didn’t crash. It didn’t error. It returned a beautifully formatted canvas with a ghost inside.
The trigger was a HubSpot stage rename — “Qualified” became “Qualified — Round 1” because a new VP of Sales wanted to track a sub-stage. The skill’s filter was hard-coded against dealstage = 'Qualified'. Zero matches, zero drama. The model generated graceful prose around the empty result set. “A measured week with a focus on top-of-funnel motion” — that was the lede. There was no top-of-funnel motion. There was no funnel motion at all because the query returned nothing.
If you’re keeping score: a skill that failed silently for 216 hours, in front of every leader at the company, written by me, owned by me, with no instrumentation between the model and my COO’s screen. That’s not a model failure. That’s an operator failure. I shipped a worker into production and forgot to ship the supervisor.
What an eval actually is#
Strip the jargon. An eval is a function that runs against your skill’s output and answers one question: did this output meet a minimum bar? It returns a boolean and a reason. That’s it.
You don’t need an eval framework. You don’t need a benchmark suite. You don’t need Promptfoo, Braintrust, LangSmith, or anything with a logo. You need three lines:
def eval_friday_wrapup(canvas_text: str) -> tuple[bool, str]:
    if "$0" in canvas_text or "no pipeline" in canvas_text.lower():
        return False, "pipeline section is empty — likely stage filter drift"
    return True, "ok"
That’s the eval that would have caught my nine-day failure on day zero. Three lines. No framework. No vendor. The function takes the artifact the skill produces and asks one structural question: does this look like a working week?
The mistake operators make is treating evals like model evaluation — accuracy on a labeled test set, BLEU scores, factuality benchmarks. That’s research-team work. You’re not evaluating the model. You’re evaluating the workflow. The model can be perfectly fine and the workflow can still be broken because something upstream changed shape. Stage rename. API rate limit. Empty array where there should have been twelve rows. Connector returned an auth error and the skill quietly summarized “no recent activity” instead of escalating.
An eval is a smoke detector for the artifact. Not for the model.
The four eval types every operator needs#
Four shapes cover roughly 90% of what shipped skills actually need. Building one of each, even badly, beats building a perfect framework.
Smoke evals ask “did the artifact arrive and contain the obvious things?” Length over 200 chars. All required sections present. Headers in the right order. Money figures parse as numbers. These catch the dumb failures — empty output, truncated output, malformed JSON. Run them on every output, every time.
Regression evals compare today’s artifact to yesterday’s. Did the canvas length drop 80%? Did the deal count go from 47 to zero? Did the executive summary section disappear? You don’t need ML for this. You need a stored snapshot and a delta function. If today’s pipeline value is less than 10% of last week’s pipeline value, raise a flag — the skill might be right (a genuinely terrible week) or wrong (broken filter), and either way a human should look.
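Here’s a minimal sketch of that delta check, assuming the canvas carries a headline dollar figure you can pull out with a regex; the function name, the regex, and the snapshot path are illustrative, not part of the starter script later in this chapter.

import re
from pathlib import Path

def pipeline_delta(canvas_text: str, snapshot_path: str = "/tmp/pipeline_value.txt") -> tuple[bool, str]:
    """Flag a week-over-week collapse in the headline pipeline figure."""
    match = re.search(r"\$([\d,]+)", canvas_text)  # first dollar amount in the canvas
    if not match:
        return False, "no pipeline figure found in canvas"
    current = float(match.group(1).replace(",", ""))
    snap = Path(snapshot_path)
    previous = float(snap.read_text()) if snap.exists() else None
    snap.write_text(str(current))  # this week's value becomes next week's baseline
    if previous is not None and current < 0.1 * previous:
        return False, f"pipeline value {current:,.0f} is under 10% of last week's {previous:,.0f}; a human should look"
    return True, "ok"

Swap the regex for however your canvas reports the total; the structure is the same as the length-based regression check in the starter script.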
Golden-set evals are the smallest deliberate test data you can write. Three or four hand-built input scenarios with known correct outputs. You ship a skill change, you run it against the golden set, you check that the four answers still look right. This is the eval most operators skip because it feels like overhead. It is overhead. It’s also the cheapest insurance against a CLAUDE.md edit silently changing your pipeline math.
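A sketch of how small this can stay, assuming each scenario lives as a JSON file in a golden/ directory with frozen inputs and the handful of fields you know the right answers for; run_skill here stands in for whatever invokes your skill.

import json
from pathlib import Path

def run_golden_set(run_skill, golden_dir: str = "golden") -> list[str]:
    """Replay each hand-built scenario and report any drift on the known-good fields."""
    failures = []
    for case_file in sorted(Path(golden_dir).glob("*.json")):
        # each file looks like {"input": {...}, "expect": {"pipeline_total": ..., "deal_count": ...}}
        case = json.loads(case_file.read_text())
        output = run_skill(case["input"])  # assumes the skill returns a dict of computed fields
        for field, expected in case["expect"].items():
            if output.get(field) != expected:
                failures.append(f"{case_file.name}: {field}={output.get(field)!r}, expected {expected!r}")
    return failures

Run it after every prompt or CLAUDE.md change; an empty list means ship.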
Adversarial evals assume the upstream world is hostile. Stage names change. APIs return 503. Connectors decide to require new scopes. Empty arrays appear. The adversarial eval feeds your skill the worst plausible inputs — empty result sets, malformed dates, surprise null fields — and confirms it fails loudly instead of producing graceful nonsense. Most silent failures live in the gap between “API returned nothing” and “model wrote graceful prose around the nothing.”
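A sketch of the shape, with the caveat that the hostile payloads, the run_skill hook, and the "NEEDS REVIEW" marker are assumptions about how your skill signals trouble; the only real requirement is that graceful prose over an empty result set counts as a failure.

def adversarial_inputs() -> dict:
    """The worst plausible upstream responses; the shapes are illustrative."""
    return {
        "empty_result_set": {"deals": []},
        "auth_error": {"deals": None, "error": "401 token expired"},
        "malformed_dates": {"deals": [{"amount": 12000, "close_date": "not-a-date"}]},
    }

def fails_loudly(run_skill) -> list[str]:
    """The skill should raise or flag these inputs, never hand back a polished canvas."""
    quiet = []
    for name, payload in adversarial_inputs().items():
        try:
            output = run_skill(payload)  # assumes a dict with a "canvas" key
        except Exception:
            continue  # raising is the loud, correct behavior
        canvas = output.get("canvas", "")
        if canvas and "NEEDS REVIEW" not in canvas:
            quiet.append(f"{name}: produced a canvas instead of escalating")
    return quiet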
You don’t need all four on day one. Build the smoke eval first. It catches half of all failures and takes thirty minutes.
Running evals on cron#
Here’s the second half of the pattern: the eval only earns its keep if it runs before the real artifact ships.
The friday-wrapup skill fires at 5:00 PM ET on Fridays. The eval that watches it now fires at 4:30 PM ET on Fridays. Same skill, same connectors, same prompt — but the artifact gets piped through the eval function instead of into the leadership Slack canvas. If the eval returns True, nothing happens; the real run will fire thirty minutes later and ship for real. If the eval returns False, I get a Telegram ping with the failure reason and a link to the dry-run output. I look at it for sixty seconds, decide whether to suppress the 5 PM run or fix the upstream issue, and move on with my Friday.
This pattern works for every scheduled skill — see Chapter 7 for the cron syntax itself. The dry run plus the thirty-minute lead is the eval’s real job: it’s not there to be smarter than the skill, it’s there to give you a window to intervene before the real artifact lands in front of someone who matters.
The cost is negligible. Tokens for a dry run cost a few cents. The cost of a $0-pipeline canvas in front of your COO is harder to put on a spreadsheet but I assure you it’s more than a few cents.
The eval failure budget#
Evals fire false positives. If you treat every fired eval as a fire drill, you’ll mute the eval inside three weeks and be back to nine-day silent failures. The fix is a failure budget — how often the eval is allowed to be wrong before you change the eval, not the skill.
My rule: an eval that pages me more than once every two weeks gets refined. An eval that pages me less than once a quarter gets dropped or hardened — either it’s not catching anything real, or it’s so loose it’s not actually watching. The two evals I run for friday-wrapup have fired four times in the last six months. One real failure (stage rename), one near-real (HubSpot rate limit cascading into thin output), two false positives (genuinely quiet weeks where pipeline did drop hard). The 50% true-positive rate is on the low end of what I’d accept; if it drops below 25% I’ll tighten the threshold.
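If you want those numbers to come from a log instead of memory, two small helpers are enough; the CSV path and column layout here are mine, not part of the starter script below.

import csv
from datetime import date
from pathlib import Path

EVAL_LOG = Path("/tmp/eval_firings.csv")  # columns: date, eval name, 1 if the page was a real failure

def record_firing(eval_name: str, was_real: bool) -> None:
    """Append one row every time an eval pages you."""
    with EVAL_LOG.open("a", newline="") as f:
        csv.writer(f).writerow([date.today().isoformat(), eval_name, int(was_real)])

def true_positive_rate(eval_name: str) -> float:
    """Share of pages that were real; below 0.25 means tighten the eval, not the skill."""
    if not EVAL_LOG.exists():
        return 1.0
    rows = [r for r in csv.reader(EVAL_LOG.open()) if r and r[1] == eval_name]
    return sum(int(r[2]) for r in rows) / len(rows) if rows else 1.0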
The eval is also a skill. It’s not divine. It can drift. It can be wrong. The thing you’re protecting against is silent failure, not all failure — accept the false positives as the cost of catching the silent ones.
The 30-minute starter eval#
Open a file. Call it eval_yourskill.py. Paste this. Edit the conditions for your skill. Wire it to a cron 30 minutes before the real one fires. Done.
import json
from pathlib import Path
from datetime import datetime
import requests

def run_skill_dry() -> dict:
    """Re-run your skill with the same prompt and inputs, capture the output."""
    # however your skill gets invoked — claude --print, an API call, a Cowork job
    output = your_skill_runner()
    return output

def smoke(output: dict) -> tuple[bool, str]:
    text = output.get("canvas", "")
    if len(text) < 200:
        return False, f"canvas too short: {len(text)} chars"
    required = ["Pipeline", "Deal Motion", "Executive Summary"]
    missing = [s for s in required if s not in text]
    if missing:
        return False, f"missing sections: {missing}"
    if "$0" in text and "pipeline" in text.lower():
        return False, "pipeline section reports $0 — likely upstream filter drift"
    return True, "ok"

def regression(output: dict, baseline_path: str) -> tuple[bool, str]:
    text = output.get("canvas", "")
    baseline = Path(baseline_path).read_text() if Path(baseline_path).exists() else ""
    if not baseline:
        Path(baseline_path).write_text(text)
        return True, "no baseline yet, stored"
    if len(text) < 0.4 * len(baseline):
        return False, f"canvas length dropped {1 - len(text)/len(baseline):.0%} vs baseline"
    return True, "ok"

def page_telegram(reason: str):
    requests.post(
        "https://api.telegram.org/bot$TOKEN/sendMessage",
        json={"chat_id": "YOUR_CHAT_ID", "text": f"[eval failed] {reason}"},
    )

if __name__ == "__main__":
    output = run_skill_dry()
    for name, check in [("smoke", smoke), ("regression", lambda o: regression(o, "/tmp/wrapup.txt"))]:
        ok, reason = check(output)
        if not ok:
            page_telegram(f"{name}: {reason}")
            exit(1)
    print(f"[{datetime.now().isoformat()}] all evals passed")
That’s the whole pattern. No framework. No vendor. Forty lines including imports. Run it on cron 30 minutes before the production skill fires. If it pages you, look. If it doesn’t, the artifact will land and you can keep eating dinner.
If your skill writes to a different artifact (a file, a Slack message, a dashboard instead of a canvas), the checks don’t change; only the line that pulls the text out of the output does.
The closer#
The skill is back. The eval is named friday-wrapup-eval and runs at 4:30 PM, thirty minutes before the real one fires. It checks for $0 pipeline, missing sections, and stage-name drift. It has fired twice. Both times, on a Friday afternoon, in time. The COO doesn’t read the eval. She reads the canvas. That’s how you know it’s working.