How to Reduce Your Anthropic Bill

# How to Reduce Your Anthropic Bill

Your Anthropic dashboard shows one number per API key. It shows total tokens in, tokens out, and a dollar total. It does not tell you which feature drove the cost, which model version is burning budget, or whether your cache markers are actually hitting.

This is the default state for most teams running Claude in production. The dashboard is precise about what you spent. It is silent about what you spent it on.

The techniques below are Anthropic-specific. They account for the pricing quirks that make Claude different from OpenAI or Gemini: the output-to-input price ratio (5x), the explicit cache markers, the two-tier write premium, and the way agent loops compound cost per turn. Each one can be applied independently. Applied together they generally cut an Anthropic bill by 50-80% without changing output quality.

Understanding the structure of an Anthropic bill

Before optimising, it helps to know where the money goes. Anthropic's pricing has three features that matter:

Output costs 5x input. Across every Claude model, output tokens are priced at roughly five times the input rate. This means output-token discipline (shorter responses, structured outputs, early stopping) has five times the leverage of input-token reduction. A 20% cut in output tokens saves the same money as a 100% cut in input tokens.

Cache reads are the biggest discount on input. Cache-hit tokens cost 10% of the base input price. Cache writes cost 1.25x (5-minute TTL) or 2x (1-hour TTL). The write premium only matters on the first hit; every subsequent read within the TTL window is at the 0.1x rate. This is functionally a 90% discount on the cached portion of your prompts once the cache is warm.

Input and output stack independently. A single request can have expensive input (long system prompt with no cache hit), expensive output (verbose model), or both. Most optimisation guides focus on input. For Anthropic specifically, output is where the compounding savings live.

1. Prompt caching: the first lever

Prompt caching is the highest-ROI change you can make to a Claude workload. It costs nothing to enable (a few cache markers on your content blocks), requires no model change, and reduces input cost on the cached portion by 90%.

The mechanics: you mark a content block with cache_control: {"type": "ephemeral"}. On the first request within a 5-minute window, Anthropic writes that block to cache at 1.25x the base input price. Every subsequent request that includes the same prefix reads from cache at 0.1x. The TTL resets on each read, so steady traffic keeps the cache warm indefinitely.

The industry number I see consistently is 60-90% input-cost reduction on workloads with stable prefixes longer than 1,024 tokens. My own systems land closer to 85% on agent loops with large tool definitions.

Key practices for Claude specifically:

Stable prefix first, dynamic content last. Put your system prompt, tool definitions, and static instructions at the top of the message array. Tag them all as cacheable. Everything that changes between requests (user messages, retrieved context, timestamps) goes after the last cache breakpoint. This is called the relocation trick in the prompt caching literature, and it is the single biggest cache-hit-rate lever.

Minimum 1,024 tokens. The cache only engages on blocks of at least 1,024 tokens (2,048 for Haiku). If your system prompt is 800 tokens, pad it with stable few-shot examples or merge with tool definitions to cross the threshold.

Up to 4 breakpoints per request. You can mark up to 4 content blocks as cacheable, ordered from least to most volatile. A common pattern: (1) system prompt with tools, (2) RAG document base, (3) conversation history up to the last turn, (4) empty (everything after is dynamic). Each breakpoint adds granularity but also adds write premium if it misses.

1-hour TTL for sparse traffic. The default 5-minute TTL is right for production traffic. For batch jobs or cron workflows with gaps between runs, the 1-hour extended TTL (2x write cost) pays back if you get at least 3 reads within the hour.

Track your hit rate. The usage block in every response shows cache_read_input_tokens and cache_creation_input_tokens. Your target is cache_read > 70% of total input tokens on steady-traffic routes. If you see it below 30%, your prompt structure is the problem.

What happens when you get this right: a 15-turn agent conversation with a 4,000-token system prompt and 8,000-token tool definitions drops from roughly $0.055 per turn to $0.006 per turn. That is an 89% reduction on input cost. On 10,000 such conversations a month, it is the difference between $8,250 and $900 in input tokens.

2. Model downgrading with evals

The second lever is model selection. Claude offers four tiers: Haiku (fastest, cheapest), Sonnet (balanced), Opus (strong), and Fable (frontier). The price spread from Haiku to Fable is roughly 40x on input and 40x on output.

The naive approach is to pick a model and use it for everything. The slightly less naive approach is to hardcode a cheaper model for known-simple routes. The production approach is to downgrade systematically: replay real traffic from your expensive model through a cheaper candidate, evaluate the output with an automated judge, and only apply the swap where quality holds.

This matters because the model that works for your launch demo is almost never the right model for every production route. Classification, extraction, and summarisation tasks can often run on Haiku. Customer-facing chat and code generation need Sonnet or Opus. The ratio varies by product, but in most systems 50-70% of calls can safely downgrade.

The eval step is what makes this safe. Without it, you are guessing. A cheap model that seems fine in a demo can quietly degrade a downstream metric for weeks. The correct pattern is to instrument every route with per-call quality tracking, build a held-out eval set per route type, and only ship a downgrade after the eval passes.

Cost was built for this pattern. It wraps your Anthropic clients, attributes every call to a route, replays real traffic through a candidate model, judges the output with Claude Haiku, and surfaces the swap only after it passes. The approach works for any model pair, not just within the Anthropic family. You can use it to test whether GPT-5-mini handles your summarisation route before committing.

3. Batch API for async work

Anthropic's Batch API offers a flat 50% discount on both input and output tokens for workloads that can tolerate up to 24-hour latency. Same model, same quality, half price.

The Batch API is the least applied technique I see in practice. Most teams treat every LLM call as real-time, but in reality a significant fraction of their traffic does not need synchronous responses: nightly evaluation pipelines, content generation queues, offline data enrichment, reflection passes in agent loops, bulk classification.

The economics are straightforward. If a batch-eligible route currently costs $1,000/month in standard API calls, moving it to Batch saves $500/month with no code change beyond switching from the Messages API to the Batch Messages endpoint. Where prompt caching is already in place, the Batch discount compounds: a cached batch request pays 0.5x (batch) x 0.1x (cache read) = 0.05x the standard input rate.

The caveat is that batch processing does not return streaming responses and results can take minutes to hours depending on queue depth. It is not suitable for user-facing interactions. For everything else, it is free money.

4. Agent-loop discipline

Agent workloads are where Anthropic bills spiral. Each turn adds to the conversation history, the tool definitions are resent every time, and the output tokens accumulate across the full run. The compounding is invisible until you see the per-conversation cost.

Three patterns specifically for Claude agents:

Compact conversation history. Instead of sending the full conversation history on every turn, rewrite the accumulated context into a shorter summary every N turns. Claude can do this itself if you prompt it. A 40K-token turn that compacts to 5K tokens saves 35K of input on every subsequent read, and the compacted summary is more cache-friendly because it is stable across turns.

Exclude tool outputs from the cached prefix. Tool results are dynamic and vary by invocation. If you include them in a cacheable block, every different result invalidates the cache and you pay the write premium repeatedly. Move tool outputs after the last cache breakpoint. Tag only the tool definitions (the schema, not the results) as cacheable.

Route at the session level, not the per-turn level. Prompt caching requires routing identical prefixes to the same physical inference machine. A router that sends some turns to Haiku and others to Sonnet writes the same system prompt to two cache namespaces, paying the write premium on both. The fix is to route at session granularity: pick one model for the whole conversation and stay on it. The cache amortises across the session's turns.

Putting it together

The order matters. Start with prompt caching (highest ROI, lowest effort). Add model downgrading with evals next (biggest absolute savings, but do the eval first). Move async workloads to Batch (free money when it applies). Apply agent discipline last (tailored to complex workloads).

Teams that go through this sequence in this order typically cut their Anthropic bill by 50-80% in the first month without touching product quality. The techniques compound: caching makes model downgrading cheaper (the cheap model's cached input costs even less), Batch compounds with caching, agent discipline protects both.

One question I still get asked: do you start with the expensive model and downgrade routes one by one, or start cheap and upgrade when quality fails? I have seen both work. I prefer starting on the model that matches your product's quality bar and downgrading only the routes the eval clears. It feels slower but it never surprises your users.

What has your experience been with Anthropic's cache markers? Have you found certain workloads where the 1-hour TTL writes are worth the premium?

Start saving today

Know exactly where your LLM money goes.

Cost wraps your Anthropic, OpenAI, and Gemini clients in one line. Free tier covers 100,000 events per month. No card needed.

Start tracking your spend