Why Is My Bill So High?

Token Math, Caching, Batch, Routing

prompt cachingbatch APImodel routingtoken economicscache read ratiowrite amortization

It’s 8:11 AM Wednesday and I’m staring at a $4,312 Anthropic bill for the prior week. Same workload as the week before, when the bill was $1,108. Nothing in my product code changed. Nothing in the cron schedule changed. The morning briefing still ran at 6:30, the friday-wrapup still fired at 5 PM, the deal-advancement alert still woke up at 4:02 AM Eastern. The numbers were just different by almost 4x.

I spent forty minutes thinking my workers had gone feral and were re-running themselves in some loop I couldn’t see. They hadn’t. The cron logs were clean. The token counts on each individual run looked roughly normal. The only thing that looked off was the ratio of cached input tokens to fresh input tokens — which had cratered.

The thing that changed was a 38-line edit to my the prior Saturday. I’d added some new portfolio context and rearranged a section. on its face. What I’d actually done was move a paragraph that lived inside the cached prefix to a different position, which meant every morning briefing was now sending a prefix the cache server had never seen, paying full input price on roughly 60,000 tokens of system prompt that used to cost a tenth of that.

Prompt caching isn’t a feature you turn on. It’s a contract about what changes between calls, and one paragraph of edits voided the contract.

The fix took 12 minutes. I moved the new section to the bottom of CLAUDE.md, behind the cache breakpoint. The bill went back to $1,108 the following week.

The four costs nobody draws on the same chart#

When you look at the Anthropic pricing page you see a per-million-token rate for input and output. That’s two numbers. The actual bill has four.

The 10x gap between cache read and full input is the entire game. Every operator-grade Claude workload I run leans on it. A morning briefing that pulls in 40K tokens of portfolio context, MCP tool schemas, and prior week’s running summary should pay full price for 40K tokens exactly once a week. The other six mornings, that prefix should be a cache hit costing roughly a tenth of that.

When the cache works you don’t think about it. When it breaks you don’t see it on the per-call view, you see it on the weekly invoice, which is where I learned this lesson the expensive way.

Stable prefixes, and what voids the contract#

Here’s the operator’s model of how prompt caching works. The Anthropic SDK lets you mark a point in your prompt with a cache_control breakpoint:

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=2048,
    system=[
        {
            "type": "text",
            "text": LARGE_SYSTEM_PROMPT,  # 40K tokens of context
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[{"role": "user", "content": user_message}]
)

That cache_control block draws a line. Everything before it is the cached prefix. The first call writes the prefix to a server-side cache keyed by the exact byte content. The next call within ~5 minutes that sends the same bytes pays the cheap cache-read rate for that section, and full price only for what comes after.

What voids the contract:

The mental fix is: treat your system prompt like an append-only log. New stuff goes at the end. The old stuff stays exactly as it was, even if you’d rather rewrite it. Your invoice will thank you.

Reading the caching scorecard#

The two sections above tell you how caching works and what voids it. This one tells you how to know, without waiting for the invoice, whether yours is actually working. The Anthropic console has a Caching tab. Most operators have never opened it. Open it weekly. Two numbers decide whether you’re winning.

Cache read ratio. Of every input token you sent, what fraction was a cheap cache read versus a full-price fresh or write token. This is the single number that would have caught my $4,312 week on the Tuesday instead of the following Monday. Mine over the last seven days: 98.1%. The target isn’t a figure I invented — anything north of ~90% means your prefixes are stable and the contract from the last section is holding. Below ~80% means you’re voiding it somewhere and don’t know it yet. Watch the chart, not just the number: there’s one sharp dip in mine around May 11. That dip is a cache-voiding event with a timestamp on it. You can trace it to the commit. That is the early-warning system the cold open of this chapter didn’t have.

The Caching tab — 7-day scorecard
The Caching tab — 7-day scorecard Real console: 98.1% cache read ratio, 11.7× write amortization, 306M cache-read tokens, per-model breakdown. The single read-ratio dip on May 11 is a traceable cache-voiding event.

Write amortization. How many times you read a cached prefix for each time you paid to write it. This is the number nobody talks about, and it’s the one that actually moves the bill. A cache write costs 1.25× the base input rate — a 25% premium. A cache read costs 0.1×. So if you write a prefix and read it back exactly once, your “discount” is a 35% surcharge wearing a discount’s clothes — you paid the premium and barely used it. The math only works on repetition. Mine, blended: 11.7×. Per model it splits the way the work splits:

The blended math, concretely: at 11.7× the cached portion of a prefix costs roughly (1.25 + 11.7 × 0.1) / 12.7 ≈ 0.19× the base input rate — an ~81% discount on the part of every prompt that never changes. At Haiku’s 25× it is ~86% off. At a 1× “amortization” it is a ~35% markup. Same feature, opposite sign, and the only variable is whether you reuse the prefix.

Input token composition + write amortization climbing
Input token composition + write amortization climbing Green (cache read) dominates the daily composition; amortization curves climb through the week as the same prefixes get reused — the cache paying itself back in real time.

Here is what that scorecard produces on the actual invoice, and this is the proof, not the theory. April direct-API token cost, every model, the whole month: $2,216. The daily bars run ~$220 early in the month and fall to ~$50 by the end — the same optimization arc as the token graphs above, now denominated in dollars. Then May: $556 month-to-date through the 14th — call it a ~$1,200 run-rate against April’s $2,216, the bill itself cut roughly in half with the same class of workload. Web search, code execution, session runtime: $0 in both months, because none of that work needed them. That is the caching receipt in dollars: not “tokens went down,” but “the invoice went down while the work didn’t.” Without the cache, that same workload is a four-figure-per-week number — I know, because I lived one week of it in the cold open.

April direct-API token cost — $2,216, the whole month
April direct-API token cost — $2,216, the whole month Cost view, grouped by model: $2,216 total, daily bars decaying ~$220 → ~$50 as optimization lands. This is the bill a 98% read ratio produces.
May direct-API token cost — $556 month-to-date
May direct-API token cost — $556 month-to-date Same view, May 1–14: $556.02 so far, ~$1,200 run-rate vs April's $2,216. The dollar proof the caching is working — same workload, half the bill.

The discipline is three lines. Open the Caching tab every Monday. If the read ratio cliffs, you voided a prefix — trace it to the commit the way I traced mine in the cold open. If a model’s write amortization sits under ~3× for a week, that cache breakpoint is earning its keep nowhere — consolidate the workload onto it or take the breakpoint out, because you are paying the write premium for a discount you are not collecting.

Batch API — half off if you can wait#

Anthropic’s API runs the same model at 50% off, with the trade-off that the batch returns within 24 hours instead of within seconds. For interactive workflows this is useless. For asynchronous backfills, evals, content generation, and overnight summarization, it’s free money.

Shape of a batch call:

batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"summary-{deal_id}",
            "params": {
                "model": "claude-sonnet-4-5",
                "max_tokens": 1024,
                "messages": [
                    {"role": "user", "content": deal_transcript}
                ]
            }
        }
        for deal_id, deal_transcript in deals.items()
    ]
)

You poll the batch ID. When it returns, you pull results by custom_id. I run our weekly newsletter draft generation and the bulk Folderly deliverability eval on this. Together they were a $600/week line item at on-demand rates. They’re $300/week now, and the only thing that changed is that I submit the batch Friday at 5 PM and pull results Saturday morning.

If you can ask “does this need to be done in the next sixty seconds?” and the answer is no — it’s a batch job. Stop paying interactive prices for non-interactive work.

Haiku for triage, Sonnet for default, Opus when you actually need it#

Model routing is the second-biggest lever after caching, and the one most operators get wrong by defaulting to the smartest model for everything.

ModelStrengthWhen I use it$/quality
HaikuFast, cheap, decent at well-defined tasksClassifying inbound email intent. Tagging meeting transcripts. Yes/no triage. Routing decisions.Highest
SonnetDefault working model. Strong reasoning, tool use, long context.Morning briefings. Deal summarization. Code reviews on PRs. Skill execution.Best balance
OpusHardest reasoning, longest plans, agentic loops with many stepsArchitecture decisions. Multi-step workflows. Anything where a wrong answer costs more than the model run.Lowest, justifiable when stakes scale

The honest curve nobody plots: Haiku is roughly 1/15th the price of Sonnet and gets you 80% of the quality on simple, well-bounded tasks. Sonnet is roughly 1/5th the price of Opus and gets you 90% of Opus’s quality on the working middle of your workload. Using Opus for everything is the AI equivalent of flying first class to the corner store.

The router config I use looks like this:

def pick_model(task_type: str) -> str:
    if task_type in ("triage", "classify", "tag", "extract"):
        return "claude-haiku-4-5"
    if task_type in ("agent_loop", "architecture", "complex_plan"):
        return "claude-opus-4-5"
    return "claude-sonnet-4-5"  # default

Three lines. It cut my bill by roughly 30% the week I shipped it. Not from any one save — from the long tail of triage tasks that no longer ran on the wrong model.

Anthropic console — March 2026, the before picture
Anthropic console — March 2026, the before picture Real console: 2.25B tokens in for the month, grouped by model. The single 450M-token day mid-March is exactly the kind of spike the rest of this chapter is about killing.

The price of a model is not the price of a task#

Every time a new model lands, the first number everyone quotes is the sticker — dollars per million tokens. It is the wrong number to decide on, and the launch decks are built to make you decide on it.

Concrete, current case: on 2026-05-19 Google announced Gemini 3.5 Flash. The token price tripled — $0.5/$3 → $1.5/$9 per million (3.1 Pro, for reference, is $2/$12 under 200k context). Sticker-shock reaction: “the cheap tier got expensive, skip it.” But it’s a Flash that, on the boards Google showed, clears last-gen Gemini 3.1 Pro on agentic and coding work. A model that one-shots what the previous one needed three turns and a retry to get right is cheaper per task at 3× the per-token price — fewer turns, fewer tool round-trips, no failure-and-retry tax, less of your time reading a wrong answer. The sticker went up; the cost of the task may have gone down. The two numbers are not the same number and nothing on the pricing page tells you which way it broke for your workload.

The only honest way to know is to run it:

# cost-per-task, not cost-per-token. Run YOUR top-5 real tasks
# through old vs new, measure the thing that's actually on the invoice.
for task in top_5_real_workloads:
    old = run(task, model="prev")   # tokens in/out + turns + retries + wall-clock
    new = run(task, model="candidate")
    log(task, old.total_billed, new.total_billed, old.turns, new.turns)
# decide on the total-billed-per-completed-task column. Never on $/Mtok.

Three rules that fall out of this:

The token budget per skill#

Once your portfolio of crosses about a dozen, you stop being able to eyeball the bill. I attach a tiny logger to each skill that records (skill_name, input_tokens, output_tokens, cached_tokens, model, ts) to a Postgres table. Once a week I look at the top ten by cost. Some weeks the answer is “morning-briefing is doing what it should.” Some weeks the answer is “competitive-intel-scan is calling Sonnet 60 times in a row when it should be one Sonnet call orchestrating Haiku tool calls.”

You don’t tune what you don’t measure. The Anthropic dashboard tells you the total. It does not tell you which skill burned it. That’s on you.

On a subscription instead of the API, the dashboard speaks a different language — not dollars, but percent of a weekly cap. Same discipline, different units. On a Max (20x) plan the number that bites isn’t a bill, it’s the “all models” weekly bar hitting 88% on a Thursday with the heaviest day still ahead. Watch the weekly bar the way you’d watch the API graph: it’s the same question — what’s burning the budget — wearing a different outfit.

Claude Max (20x) plan usage limits
Claude Max (20x) plan usage limits Settings → Usage on a Max plan: current-session vs weekly limits, all-models at 88% mid-week, daily routine-run allowance. The subscription-side view of the same cost question.

Two subscriptions and a shrinking API line#

Here’s the actual cost structure, not a hypothetical. I run two Max (20x) subscriptions — about $400 a month, total — and for where I am right now that is more than enough. The discipline isn’t “spend less,” it’s “max out the thing with the predictable ceiling before you touch the thing with the open meter.” A Max seat has a wall you hit and a price you already know. The API has no wall and a meter that runs as fast as your worst-written agent loop. Push everything you can onto the subscription. The cap is a feature, not a limitation.

The direct API only comes out for the work the subscription can’t do: API integrations, and the autonomous always-on projects — the Rick-style agents that run unattended against the raw API because they need the SDK, not a chat surface (Chapter 30, Chapter 32). That’s a real cost, separate from the $400, and it’s the line I actually optimize.

And it has been shrinking. The March graph a few sections up ran ~2.25B tokens in, with one ugly 450M-token day in the middle. April held roughly flat at ~2.2B in — but the spikes started flattening as model routing landed. By mid-May the shape had changed entirely: ~590M tokens in over the first half of the month, no spikes, a flat daily baseline running around a tenth of March’s worst day. Same class of workload. The difference is everything earlier in this chapter, applied — Haiku for triage, stable prefixes for the cache, batch for anything that can wait. The line went down because the discipline went up, not because the work got smaller.

April 2026 — the spikes start flattening
April 2026 — the spikes start flattening ~2.2B tokens in, similar volume to March, but the daily peaks are coming down as routing lands. The transition month.
May 2026 — flat, no spikes
May 2026 — flat, no spikes ~590M tokens in over the first half of the month. The 450M single-day spike is gone; the baseline is roughly a tenth of March's worst day. The optimization, landed.

The lesson isn’t the numbers — yours will be different. It’s the structure: a predictable subscription ceiling for the bulk, a ruthlessly optimized API line for the rest, and a monthly look at the graph so a 450M-token day never happens twice without you seeing it coming.

The annual math#

I burn between 3 and 10 billion tokens a month across my stack. At Sonnet pricing with healthy caching, that’s somewhere in the $4,000 to $14,000 a month range depending on the week. Annualized: call it $80K to $150K a year for the entire portfolio.

That number freaks people out until you compare it to what it replaced. Before this stack, the same volume of analytical work — the morning briefings, the Friday wraps, the deal alerts, the mentee prep, the newsletter drafts, the deliverability evals, the competitive scans — was being done by humans I either employed or didn’t have. The lower-bound replacement value is a single mid-level analyst. The honest replacement value is closer to a small team. The bill is a rounding error on what it deletes.

The fact that you’re being charged at all is a feature, not a bug. It means the unit economics of the work are visible, which means they’re tunable. A line item is a thing you can negotiate. A salary is a thing you mostly can’t.

The bill went back to $1,108. I didn’t optimize harder. I just stopped breaking the thing that was already optimized. Most cost wins look like that. You’re not pulling levers. You’re putting the levers back where they were before you fiddled.

Spotted something wrong, missing, or sharper? Email Vlad with feedback on this chapter →
Stay close

The next edition lands when this list says it does.

No course. No paywall. Operator playbooks weekly. 10K+ subscribers.