The problem nobody talks about
Your LLM bill arrived. It's 3x last month's. You open the Anthropic dashboard and see one number: total tokens. Nothing tells you whether the cost came from your customer-facing chat, your nightly batch job, or a single rogue prompt someone pushed on Friday.
This is the default state for most teams. And it's expensive - not just in money, but in the paralysis it creates. You can't optimize what you can't measure.
Here's how to fix it: six levers, from model tiering and per-route attribution through to eval-gated model downgrades.
1. Model tiering: the single biggest lever
Most teams reach for the most capable model out of habit. That's leaving serious money on the table.
Anthropic pricing (approximate, per million tokens):
- Claude Opus 4.8: ~$75 input / $75 output
- Claude Sonnet 4.6: ~$3 input / $15 output
- Claude Haiku 4.5: ~$0.80 input / $4 output
OpenAI:
- GPT-4o: ~$2.50 input / $10 output
- GPT-4o-mini: ~$0.15 input / $0.60 output (roughly 16x cheaper)
Google:
- Gemini 2.5 Pro: ~$1.25 input / $10 output
- Gemini 2.5 Flash: ~$0.075 input / $0.30 output
A classification task that costs $75 per million input tokens on Opus costs $0.80 on Haiku - a 94x difference. For a route firing 50,000 times a day at roughly 1,000 input tokens per call, that's the difference between a $3,750 daily bill and a $40 one.
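The arithmetic behind that comparison, as a minimal sketch. It assumes about 1,000 input tokens per call and ignores output tokens; those assumptions are what make the $3,750 and $40 totals come out. The function name `dailyInputCostUsd` is illustrative.

```typescript
// Daily input-token cost for one route, in USD.
// Assumption: ~1,000 input tokens per call, output tokens ignored.
function dailyInputCostUsd(
  callsPerDay: number,
  inputTokensPerCall: number,
  usdPerMillionInputTokens: number,
): number {
  const tokensPerDay = callsPerDay * inputTokensPerCall;
  return (tokensPerDay / 1_000_000) * usdPerMillionInputTokens;
}

// Prices are the approximate per-million figures from the list above.
const opusDaily = dailyInputCostUsd(50_000, 1_000, 75);
const haikuDaily = dailyInputCostUsd(50_000, 1_000, 0.8);
const ratio = opusDaily / haikuDaily;

console.log({ opusDaily, haikuDaily, ratio: ratio.toFixed(2) });
```

Swapping in your own call volume and average prompt size gives a quick first estimate before any attribution tooling is in place.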
The practical rule: start every new feature on the cheapest model that works. Only escalate when you have evidence the cheaper model fails.
2. Per-route attribution: find the culprit
Before you can optimize, you need to know where the money goes. The pattern is simple: wrap every LLM call with a route tag.
import Anthropic from "@anthropic-ai/sdk";
import { Cost } from "@botzone/cost-sdk";

// Wrap the client so calls made through it carry the route tag.
const anthropic = Cost.wrapAnthropic(new Anthropic(), {
  apiKey: process.env.COST_API_KEY,
  route: "chat/respond", // feature-level tag
  projectId: process.env.COST_PROJECT_ID,
});
Once you have this in place, the dashboard breaks down spend by route. You'll almost always find that 80% of your bill comes from 20% of your routes - and half of those routes are things you forgot about.
Common surprises teams find after adding attribution:
- A debug endpoint using GPT-4o for every request in staging
- A "smart" search feature that fires on every keypress
- A nightly summarisation job that grew 10x as the dataset grew
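To make the breakdown concrete, here is a minimal sketch of the kind of aggregation a per-route view relies on. It is not the Cost SDK's implementation: the `UsageEvent` shape, the `PRICES_USD_PER_MILLION` table (filled from the approximate prices listed earlier, keyed by informal labels rather than API model IDs), and the function names are all assumptions for illustration.

```typescript
// One logged LLM call. Field names are illustrative assumptions.
type UsageEvent = {
  route: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
};

type Price = { input: number; output: number };

// Approximate USD per million tokens; keys are labels, not API model IDs.
const PRICES_USD_PER_MILLION: Record<string, Price> = {
  sonnet: { input: 3, output: 15 },
  haiku: { input: 0.8, output: 4 },
};

function eventCostUsd(e: UsageEvent): number {
  const price = PRICES_USD_PER_MILLION[e.model];
  if (price === undefined) {
    throw new Error(`No price configured for model "${e.model}"`);
  }
  return (e.inputTokens * price.input + e.outputTokens * price.output) / 1_000_000;
}

// Sum spend per route and return routes sorted by cost, highest first.
function spendByRoute(
  events: UsageEvent[],
): Array<{ route: string; usd: number; share: number }> {
  const totals = new Map<string, number>();
  for (const e of events) {
    totals.set(e.route, (totals.get(e.route) ?? 0) + eventCostUsd(e));
  }
  const grandTotal = Array.from(totals.values()).reduce((a, b) => a + b, 0);
  return Array.from(totals.entries())
    .map(([route, usd]) => ({ route, usd, share: grandTotal > 0 ? usd / grandTotal : 0 }))
    .sort((a, b) => b.usd - a.usd);
}

const sampleEvents: UsageEvent[] = [
  { route: "chat/respond", model: "sonnet", inputTokens: 2_000, outputTokens: 500 },
  { route: "search/suggest", model: "sonnet", inputTokens: 300, outputTokens: 50 },
  { route: "jobs/nightly-summary", model: "haiku", inputTokens: 50_000, outputTokens: 2_000 },
];

const breakdown = spendByRoute(sampleEvents);
console.log(breakdown);
```

In this made-up sample the nightly job tops the list even though it runs on the cheapest model, which is the kind of surprise the list above describes.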
3. Prompt caching: free tokens you're already paying for
Both Anthropic and OpenAI offer prompt caching. Repeated prompt prefixes are cached server-side and re-reads cost a fraction of the original.
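Anthropic cache pricing: cache reads cost 10% of the normal input price. Cache writes cost 25% more than normal input, so the write pays for itself on the first cache hit: two requests cost 1.35x the prefix price with caching versus 2x without.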
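Those two multipliers are enough to work out the break-even point. This is a minimal sketch that assumes the 1.25x write and 0.1x read multipliers quoted above and that every request after the first hits the cache. Cost is measured in units of "one uncached read of the prefix", and the function names are illustrative.

```typescript
const CACHE_WRITE_MULTIPLIER = 1.25; // first request writes the prefix to the cache
const CACHE_READ_MULTIPLIER = 0.1; // later requests read it back

// Cost of sending the same prefix `requests` times without caching.
function uncachedCost(requests: number): number {
  return requests;
}

// Cost with caching: one write, then cache reads for every later request.
function cachedCost(requests: number): number {
  if (requests <= 0) return 0;
  return CACHE_WRITE_MULTIPLIER + (requests - 1) * CACHE_READ_MULTIPLIER;
}

// Smallest number of requests at which caching is strictly cheaper.
function breakEvenRequests(maxRequests: number = 1_000): number | null {
  for (let n = 1; n <= maxRequests; n++) {
    if (cachedCost(n) < uncachedCost(n)) return n;
  }
  return null;
}

const breakEven = breakEvenRequests();
console.log({ breakEven, cachedAtTwo: cachedCost(2), uncachedAtTwo: uncachedCost(2) });
```

A single request is more expensive with caching (1.25 versus 1), so the technique only pays off on prefixes that are genuinely reused before the cache entry expires.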
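The pattern: put your system prompt and any static context at the start of the request and mark the end of that static part with a cache breakpoint. Dynamic content (the actual user message) goes last. Prefixes shorter than the provider's minimum cacheable length are not cached, so check that minimum for your model.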
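const response = await anthropic.messages.create({
  model: "claude-haiku-4-5",
  max_tokens: 1024, // required by the Messages API; size it to your replies
  system: [
    {
      type: "text",
      text: LONG_SYSTEM_PROMPT, // 2,000 tokens
      // Cache breakpoint: the prefix up to and including this block is cached.
      cache_control: { type: "ephemeral" },
    },
  ],
  // Dynamic content goes after the cached prefix.
  messages: [{ role: "user", content: userMessage }],
});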
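For a chatbot with a 2,000-token system prompt handling 100,000 messages a day, that prompt alone is 200 million input tokens a day, or about $600 at Sonnet's ~$3 per million. Served from cache at 10% of that price it costs about $60, a saving of up to roughly $540/day (close to $200,000/year) if nearly every request hits the cache. That's from a one-line change.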
4. Prompt compression
Long prompts cost money on every call. A few techniques that reliably reduce token count without hurting output quality:
Remove redundancy. Most system prompts contain the same instruction three ways ("be concise", "keep it brief", "do not ramble"). Pick one.
Replace examples with rules. Five worked examples showing what "good output" looks like can often be replaced by two crisp rules. Fewer tokens, same result.
Trim context aggressively. For RAG pipelines, measure retrieval precision and cut chunks that rarely appear in grounded answers. A context window half the size costs half as much.
Use structured output. Asking the model to "respond in JSON" and defining the schema cuts the output token count significantly compared to natural-language responses that must be parsed later.
Typical prompt compression saves 20-40% of input tokens with no measurable quality drop. On a high-volume route, that's material.
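The "trim context aggressively" step can be sketched as a small analysis over retrieval logs. This is one possible approach under stated assumptions: the `RetrievalLog` shape, the idea of recording which retrieval ranks an answer cited, and the 30% cut-off are all hypothetical choices, not something prescribed above.

```typescript
// One answered query: how many chunks were retrieved, and which ranks
// (0 = top result) the grounded answer actually cited.
type RetrievalLog = { retrieved: number; citedRanks: number[] };

// Fraction of queries in which the chunk at each rank was cited.
function citationRateByRank(logs: RetrievalLog[]): number[] {
  const maxRetrieved = logs.reduce((max, log) => Math.max(max, log.retrieved), 0);
  const seen: number[] = new Array<number>(maxRetrieved).fill(0);
  const cited: number[] = new Array<number>(maxRetrieved).fill(0);
  for (const log of logs) {
    for (let rank = 0; rank < log.retrieved; rank++) {
      seen[rank] = (seen[rank] ?? 0) + 1;
    }
    // De-duplicate so a rank cited twice in one answer counts once.
    for (const rank of Array.from(new Set(log.citedRanks))) {
      if (rank >= 0 && rank < log.retrieved) {
        cited[rank] = (cited[rank] ?? 0) + 1;
      }
    }
  }
  return seen.map((count, rank) => (count === 0 ? 0 : (cited[rank] ?? 0) / count));
}

// Keep every rank up to and including the last one cited at least `minRate` of the time.
function chunksToKeep(rates: number[], minRate: number): number {
  let keep = 0;
  for (let rank = 0; rank < rates.length; rank++) {
    if ((rates[rank] ?? 0) >= minRate) keep = rank + 1;
  }
  return keep;
}

const sampleLogs: RetrievalLog[] = [
  { retrieved: 6, citedRanks: [0, 1] },
  { retrieved: 6, citedRanks: [0] },
  { retrieved: 6, citedRanks: [0, 2] },
  { retrieved: 6, citedRanks: [1, 5] },
];

const rates = citationRateByRank(sampleLogs);
const keepCount = chunksToKeep(rates, 0.3);
console.log({ rates, keepCount });
```

In this made-up sample only the top two ranks clear the cut-off, so retrieval could drop from six chunks to two. The trade-off is visible in the data too: the one answer that cited rank 5 would lose that chunk, which is why any cut should be checked against output quality before it ships.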
5. Batching
If your workload is latency-tolerant (nightly jobs, background analysis, email digests), batching is worth investigating.
Anthropic's Batch API prices batch jobs at 50% of standard input/output prices. For a nightly job processing 10,000 documents on Sonnet, that halves the cost with zero code change to the prompt logic.
The trade-off: results arrive asynchronously, typically within an hour but with a processing window of up to 24 hours. That rules batching out for user-facing features. For background jobs, it is often a good fit.
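A minimal sketch of preparing a nightly job for batching. It only builds the request list; the commented-out call at the end shows where that list would be handed to the Message Batches endpoint of the Anthropic TypeScript SDK. The `Doc` type, the `buildBatchRequests` helper, the prompt wording, the `max_tokens` value, and the Sonnet model string are assumptions for illustration.

```typescript
type Doc = { id: string; text: string };

// Shape of one entry in a Message Batches request list.
type BatchRequest = {
  custom_id: string; // used to match results back to inputs
  params: {
    model: string;
    max_tokens: number;
    messages: Array<{ role: "user"; content: string }>;
  };
};

// The prompt logic is the same as for a synchronous call; only the envelope changes.
function buildBatchRequests(docs: Doc[], model: string): BatchRequest[] {
  return docs.map((doc): BatchRequest => ({
    custom_id: `doc-${doc.id}`,
    params: {
      model,
      max_tokens: 512, // assumed cap for a short summary
      messages: [{ role: "user", content: `Summarise this document:\n\n${doc.text}` }],
    },
  }));
}

const batchRequests = buildBatchRequests(
  [
    { id: "1", text: "First document." },
    { id: "2", text: "Second document." },
  ],
  "claude-sonnet-4-6", // placeholder; use the exact model ID available to you
);

// With an Anthropic client in scope, submission would look like:
//   const batch = await anthropic.messages.batches.create({ requests: batchRequests });
// Results are fetched later, once the batch has finished processing.
console.log(batchRequests.length);
```

Because each entry carries its own `custom_id`, results can be matched back to documents without relying on the order in which they are returned.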
6. Eval-gated model downgrades
The highest-confidence optimization is a systematic downgrade: you identify routes running on expensive models, replay real traffic through a cheaper model, evaluate the output quality with an automated judge, and only apply the swap if quality holds.
This is what Cost's verification feature does. When it surfaces a downgrade recommendation for a route, it has already:
- Replayed the last 50 real requests through the cheaper model
- Scored each response across five dimensions: factual equivalence, instruction compliance, format match, completeness, and tool-use parity
- Checked for critical failure types (hallucinations, refusals, truncations)
- Required at least 95% of replays to pass before recommending the swap
The result: you get the cost reduction of a model downgrade with quantified evidence it doesn't hurt users. No more guessing.
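The gate at the end of that list can be sketched in a few lines. This is not Cost's implementation: the `ReplayResult` shape, the per-dimension pass mark of 0.8, and the rule that any critical failure fails the replay are assumptions. Only the five dimensions, the three failure types, and the 95% threshold come from the list above.

```typescript
type Dimension =
  | "factualEquivalence"
  | "instructionCompliance"
  | "formatMatch"
  | "completeness"
  | "toolUseParity";

const DIMENSIONS: Dimension[] = [
  "factualEquivalence",
  "instructionCompliance",
  "formatMatch",
  "completeness",
  "toolUseParity",
];

type ReplayResult = {
  scores: Record<Dimension, number>; // judge scores in [0, 1]
  criticalFailure: "hallucination" | "refusal" | "truncation" | null;
};

const DIMENSION_PASS_MARK = 0.8; // assumed per-dimension threshold
const REQUIRED_PASS_RATE = 0.95; // threshold stated in the text

// A replay passes only with no critical failure and every dimension at or above the mark.
function replayPasses(r: ReplayResult): boolean {
  if (r.criticalFailure !== null) return false;
  return DIMENSIONS.every((d) => r.scores[d] >= DIMENSION_PASS_MARK);
}

function shouldApplyDowngrade(results: ReplayResult[]): boolean {
  if (results.length === 0) return false; // no evidence, no swap
  const passed = results.filter(replayPasses).length;
  return passed / results.length >= REQUIRED_PASS_RATE;
}

// Synthetic data: `passing` good replays followed by truncated ones.
function sampleResults(passing: number, total: number): ReplayResult[] {
  const good: ReplayResult = {
    scores: {
      factualEquivalence: 0.9,
      instructionCompliance: 0.9,
      formatMatch: 0.9,
      completeness: 0.9,
      toolUseParity: 0.9,
    },
    criticalFailure: null,
  };
  const bad: ReplayResult = { ...good, criticalFailure: "truncation" };
  return Array.from({ length: total }, (_, i) => (i < passing ? good : bad));
}

const applyWith48 = shouldApplyDowngrade(sampleResults(48, 50));
const applyWith47 = shouldApplyDowngrade(sampleResults(47, 50));
console.log({ applyWith48, applyWith47 });
```

With 50 replays, a 95% threshold means at least 48 must pass, since 47 of 50 is only 94%.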
Putting it together
The sequence that works:
- Add attribution first. Tag every route. Run for a week. Let the data show you where the money goes.
- Attack the big routes. Find the top 3 by spend. Check if they're on the right model tier.
- Apply caching. If your system prompt is long enough to be cacheable (Anthropic's minimum is 1,024 tokens or more, depending on the model) and the route fires more than 1,000 times a day, add a cache breakpoint.
- Compress prompts. Audit your top-cost prompts for redundancy. Aim for 20% reduction.
- Batch what you can. Move latency-tolerant workloads to the Batch API.
- Use eval-gated downgrades. Let automated evaluation carry the risk of model switches.
Teams that go through this sequence typically cut their LLM bill by 60-80% in the first month without touching product quality.
Start measuring today
You can't optimize a number you can't see. Cost wraps your existing Anthropic, OpenAI, and Gemini clients in one line, attributes every euro to a route, and flags which prompts to fix.
Free tier covers 100,000 events per month. No card needed. Sign up at cost.botzone.ai and have per-route attribution running before your next deployment.