Technical note
LLM cost optimization
Most cost blowups are not caused by one bad model choice. They come from the interaction between model pricing, prompt growth, retries, context inflation, and weak operational controls.
The cost problem
LLM costs have a nasty habit of growing far faster than usage. A prototype that looks harmless at $200 per day can become a $2,000 per day production problem once usage grows, chats lengthen, and prompts absorb every edge case the team has ever seen.
The mechanics are simple: per-token pricing multiplied by usage, context window inflation, and call amplification from retries. That combination is why teams routinely underestimate production spend by an order of magnitude.
- Context window inflation means every follow-up turn is more expensive than the one before it.
- Timeout retries, parsing retries, and validation retries can turn one logical request into 2-5 model calls.
- Over-prompting is common: system prompts drift into 3,000+ tokens as teams patch behavior reactively.
- Many systems still use GPT-4o for work that GPT-4o mini handles at a tiny fraction of the cost.
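The way these drivers multiply can be made concrete with a back-of-envelope cost model. The numbers below are illustrative, not measured: `daily_cost` and its inputs are hypothetical, with per-million-token pricing assumed.

```python
# Back-of-envelope model of the cost drivers above (illustrative numbers).
# Prices are per million tokens; retries multiply every logical request.

def daily_cost(requests_per_day, input_tokens, output_tokens,
               input_price, output_price, retry_factor=1.0):
    """Estimate daily spend for one workload (prices per 1M tokens)."""
    per_call = (input_tokens * input_price + output_tokens * output_price) / 1e6
    return requests_per_day * per_call * retry_factor

# Early prototype: short prompts, no retries.
week1 = daily_cost(5_000, 800, 300, 2.50, 10.00)

# Later: bloated prompts, long history, ~2.5 calls per logical request.
week7 = daily_cost(5_000, 6_000, 500, 2.50, 10.00, retry_factor=2.5)

print(f"week 1: ${week1:,.0f}/day, week 7: ${week7:,.0f}/day")
```

With these made-up numbers, the same request volume costs roughly 10x more once prompt bloat and retry amplification kick in, which is exactly the "order of magnitude" surprise described above.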
A concrete cost-drift example
Here is a representative support-assistant trajectory. In development, the team saw short conversations and low usage. Production introduced long threads, retries, and prompt bloat. By week 7, the economics were completely different.
| Week | Spend | Users | Primary driver |
|---|---|---|---|
| Week 1 | $200/day | 50 users | Short queries, short chats |
| Week 3 | $800/day | 200 users | Conversation history starts to dominate |
| Week 5 | $1,500/day | 400 users | Retry loops and validation failures multiply calls |
| Week 7 | $2,400/day | 500 users | Prompt bloat + premium-model overuse |
After implementing routing, caching, and prompt compression, the same workload dropped from $2,400 per day to roughly $320 per day at 500 users. That is the right mental model for optimization: not one trick, but a stack of compounding improvements.
Model pricing is the first-order constraint
If you do not know your model price ratios, you cannot reason clearly about optimization. The most important number in the table below is not any single price; it is the spread between tiers.
| Model | Provider | Input | Output | Context | Notes |
|---|---|---|---|---|---|
| GPT-4o | OpenAI | $2.50 | $10.00 | 128K | Strong general-purpose default |
| GPT-4o mini | OpenAI | $0.15 | $0.60 | 128K | 17x cheaper input than 4o |
| Claude Sonnet 4.6 | Anthropic | $3.00 | $15.00 | 200K | Strong reasoning, large context |
| Claude Haiku 4.5 | Anthropic | $0.80 | $4.00 | 200K | Fast, lower-cost classification and extraction |
| Mistral Large 3 | Mistral | $2.00 | $6.00 | 128K | Reasonable API alternative |
| Llama 4 Maverick | Self-hosted | ~$0.30* | ~$0.30* | 1M | GPU cost only, utilization-sensitive |
The key gap is GPT-4o at $2.50 per million input tokens versus GPT-4o mini at $0.15. That 17x delta is why routing matters. For classification, extraction, and straightforward Q&A, the quality difference is often small while the price difference is massive.
Model routing
Model routing is usually the highest-impact optimization. The core idea is simple: route easy work to cheap models, reserve expensive models for the hard tail, and escalate only on failure or low confidence.
| Tier | Classifier score | Model | Input price | Typical work |
|---|---|---|---|---|
| Simple | score < 0.3 | GPT-4o mini | $0.15 / 1M input | FAQ, extraction, lightweight summarization |
| Medium | 0.3 - 0.7 | Claude Haiku 4.5 | $0.80 / 1M input | More nuanced summarization, moderate ambiguity |
| Complex | score > 0.7 | GPT-4o | $2.50 / 1M input | Hard reasoning, long-tail edge cases |
A useful implementation pattern is a cascade router. Start with the cheapest viable model, validate the output, and escalate only if the answer fails a quality gate. In many production systems, 70-80% of traffic is simple enough that a cheap model handles it.
- Customer support example: route 72% of traffic to GPT-4o mini, 20% to Claude Haiku 4.5, and 8% to GPT-4o.
- Result: monthly spend drops from $38,000 to $6,200, an 84% reduction, without measurable eval degradation.
- Common implementation choices: embedding-based classifier, keyword heuristic, or a small verifier model.
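A minimal sketch of the cascade pattern, assuming you supply your own `call_model` client and `passes_quality_gate` validator (both names are hypothetical placeholders, not a real API):

```python
# Cascade router sketch: try the cheapest model first, escalate on failure.
# `call_model` and `passes_quality_gate` are placeholders for your own
# client and validation logic.

TIERS = ["gpt-4o-mini", "claude-haiku-4.5", "gpt-4o"]

def cascade(prompt, call_model, passes_quality_gate):
    """Return (answer, model_used); escalate until a tier passes the gate."""
    for model in TIERS:
        answer = call_model(model, prompt)
        if passes_quality_gate(answer):
            return answer, model
    # If even the top tier fails the gate, return its answer anyway
    # so callers always get the best available output.
    return answer, TIERS[-1]

# Stub demo: pretend only the top tier produces a long-enough answer.
fake_outputs = {"gpt-4o-mini": "?", "claude-haiku-4.5": "maybe",
                "gpt-4o": "a full, validated answer"}
answer, used = cascade("hard question",
                       lambda model, prompt: fake_outputs[model],
                       lambda a: len(a) > 10)
print(used)  # escalates to the top tier
```

The quality gate is the design decision that matters: a length check, a JSON-schema validation, or a small verifier model all work, and a stricter gate trades extra escalations for fewer bad cheap-model answers.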
Semantic caching
If one user asks "What is your return policy?" and another asks "How do I return an item?", the system should not pay twice to discover the same answer. Semantic caching is one of the few optimizations that can improve both latency and cost.
| Approach | Hit rate | Effort | Savings | Best for |
|---|---|---|---|---|
| Exact match cache | 10-20% | Low | Low | Repeated identical inputs |
| Semantic cache | 30-50% | Medium | High | Support or FAQ flows |
| Prompt-aware cache | 40-60% | High | Very high | Stable system prompt plus repeated intents |
| Prefix caching | Automatic | None | Medium | Providers with built-in prompt prefix caching |
A typical implementation is Redis plus embeddings. Embed the incoming query, run cosine similarity search, and return a cached response if the match is above a high threshold such as 0.95. Use separate caches per system prompt to avoid contamination.
- Normalize text before embedding to improve hit rate.
- Cache at the intent level, not raw string level.
- Tune the similarity threshold with real false-positive data, not guesses.
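The lookup logic can be sketched without any infrastructure. In production this would be Redis plus a real embedding model; here a toy bag-of-words vector stands in for both (a deliberate, hypothetical simplification), with the same embed-then-cosine-match flow:

```python
# Semantic cache sketch: embed the query, return a cached response if a
# prior query is similar enough. The bag-of-words "embedding" below is a
# toy stand-in for a real embedding model.
import math
from collections import Counter

def embed(text):
    """Toy embedding: word counts. Swap in a real embedding model."""
    return Counter(text.lower().replace("?", "").split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold=0.95):
        self.entries = []          # list of (embedding, response) pairs
        self.threshold = threshold

    def get(self, query):
        q = embed(query)
        for emb, response in self.entries:
            if cosine(q, emb) >= self.threshold:
                return response    # cache hit: no model call needed
        return None                # cache miss: call the model, then put()

    def put(self, query, response):
        self.entries.append((embed(query), response))

cache = SemanticCache(threshold=0.95)
cache.put("What is your return policy?", "30-day returns.")
print(cache.get("what is your return policy"))   # hit
print(cache.get("How do I reset my password?"))  # miss
```

Note that the normalization step (lowercasing, stripping punctuation) is doing real work here, which is why the hit-rate tips above start with normalization.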
Prompt optimization
Prompt optimization is the lowest-effort, highest-return place to start. Many production prompts carry 30-50% dead weight: verbose instructions, stale examples, repeated policies, and output requirements that the system could enforce structurally instead.
- System prompt compression: 20-40% input token reduction by removing redundancy and consolidating rules.
- Few-shot to zero-shot migration: 50-80% input token reduction when examples are replaced with tighter instructions or fine-tuning.
- Structured outputs: 30-50% output token reduction by using JSON or tool calls instead of verbose prose.
- Context pruning: 40-70% input token reduction by summarizing old turns and only passing relevant history.
- Response length control: 20-60% output token reduction with tighter max-token limits and explicit brevity constraints.
A representative system prompt can shrink from 1,847 tokens to 612 tokens with no quality loss. At 50,000 requests per day on GPT-4o, those 1,235 saved tokens per request are worth roughly $154 per day, or about $4,600 per month, on system prompt tokens alone.
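The savings arithmetic generalizes into a one-line helper, shown here with the figures from the example above:

```python
# Daily dollars saved by shrinking a system prompt, at per-1M-token pricing.
def prompt_savings(old_tokens, new_tokens, requests_per_day, input_price_per_m):
    """Daily savings from sending fewer system-prompt tokens per request."""
    saved_tokens = (old_tokens - new_tokens) * requests_per_day
    return saved_tokens * input_price_per_m / 1e6

# 1,847 -> 612 tokens, 50,000 requests/day, GPT-4o input at $2.50/1M.
daily = prompt_savings(1_847, 612, 50_000, 2.50)
print(f"${daily:,.2f}/day, ${daily * 30:,.0f}/month")
```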
Batch processing
If the workload is not interactive, batch APIs are an immediate economic win. OpenAI, Anthropic, and other providers commonly offer around 50% savings for asynchronous processing.
- Good batch candidates: content generation, backfills, summarization pipelines, evaluation suites, and embedding jobs.
- Bad batch candidates: chat, moderation, streaming UX, and anything with a hard sub-second SLA.
- Mixed workload pattern: use a queue to split real-time traffic from batch-eligible traffic.
The practical architecture is straightforward: front a queue such as Redis or SQS, mark latency-sensitive work as synchronous, and push everything else to batch endpoints.
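A minimal sketch of that split, assuming jobs arrive tagged with their latency requirement (the job shape and `submit` name are hypothetical; a real system would back the queue with Redis or SQS and flush it to a provider batch endpoint):

```python
# Route each job: latency-sensitive work goes through the synchronous
# path, everything else accumulates in a queue for a batch API flush.
from collections import deque

batch_queue = deque()

def submit(job):
    """Dispatch one job to the sync path or the batch queue."""
    if job.get("latency_sensitive"):
        return f"sync:{job['task']}"   # call the model immediately
    batch_queue.append(job)            # flushed later at ~50% discount
    return "queued"

print(submit({"task": "chat-reply", "latency_sensitive": True}))
print(submit({"task": "backfill-summaries", "latency_sensitive": False}))
print(len(batch_queue))  # 1
```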
Fine-tuning economics
Fine-tuning is attractive when you have a narrow task, enough examples, and real call volume. The economic reason to fine-tune is not novelty; it is replacing a large model plus a fat prompt with a smaller model whose behavior is already baked in.
| Approach | Cost / 1K calls | Quality | Latency | Setup cost |
|---|---|---|---|---|
| GPT-4o + detailed prompt | $25.00 | 95% | High | $0 |
| GPT-4o mini + few-shot | $1.50 | 88% | Low | $0 |
| GPT-4o mini fine-tuned | $0.90 | 93% | Low | $50-200 |
| Llama 4 Scout fine-tuned | $0.10 | 90% | Very low | $500-2000 |
- Fine-tune when you have a narrow task, 500+ good examples, and enough traffic that inference savings matter.
- Do not fine-tune when requirements shift weekly or when broad world knowledge is still the main bottleneck.
- In the example above, a fine-tuned GPT-4o mini can approach GPT-4o quality on a narrow task at a fraction of the inference cost.
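A quick break-even check makes the "enough traffic" criterion concrete. The numbers below come from the table above and are illustrative:

```python
# Calls needed before fine-tuning setup cost is repaid by cheaper inference.
def breakeven_calls(baseline_per_1k, tuned_per_1k, setup_cost):
    """Call volume at which inference savings cover the setup cost."""
    savings_per_call = (baseline_per_1k - tuned_per_1k) / 1_000
    return setup_cost / savings_per_call

# GPT-4o + prompt ($25.00/1K calls) vs fine-tuned GPT-4o mini
# ($0.90/1K calls, $200 setup), per the table above.
calls = breakeven_calls(25.00, 0.90, 200)
print(f"break-even after ~{calls:,.0f} calls")
```

With these numbers the setup cost is repaid within roughly ten thousand calls, which is why the volume criterion matters more than the setup fee itself.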
Self-hosting open models
Self-hosting can reduce unit economics dramatically, but only if the workload is large enough and the team is willing to own GPU infrastructure, serving, monitoring, and capacity planning.
| Option | 100K req/mo | 1M req/mo | 10M req/mo | Notes |
|---|---|---|---|---|
| OpenAI API (GPT-4o) | $2,500 | $25,000 | $250,000 | No ops, highest marginal cost |
| GPU rental (A100 80GB) | $2,000 | $2,000 | $6,000 | Step-fixed cost per GPU, real ops burden |
| Owned hardware (H100) | $4,500* | $4,500* | $4,500* | Lowest long-run cost, high capex |
The break-even is rarely at low volume. Below roughly $5,000 per month in API spend, the operational burden usually dominates the savings. A more realistic midpoint is serverless inference on open models before committing to raw GPU operations.
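The shape of that break-even is easy to sketch. The constants below are inferred from the comparison table, not measured: roughly $0.025 per API request (matching $2,500 per 100K) and about 3.4M requests per month of throughput per rented GPU (matching the $2,000/$2,000/$6,000 step pattern).

```python
# API spend scales linearly; GPU rental is a step function of capacity.
# Both constants are illustrative, inferred from the table above.

def api_cost(requests, per_request=0.025):
    """API cost: pure per-request pricing, no fixed component."""
    return requests * per_request

def gpu_cost(requests, per_gpu=2_000, capacity=3_400_000):
    """GPU rental: pay per GPU, add GPUs as volume exceeds capacity."""
    gpus = max(1, -(-requests // capacity))   # ceiling division
    return gpus * per_gpu

for volume in (100_000, 1_000_000, 10_000_000):
    print(f"{volume:>10,} req/mo  api=${api_cost(volume):,.0f}  "
          f"gpu=${gpu_cost(volume):,.0f}")
```

The crossover sits near the point where monthly API spend exceeds one GPU's rental cost, which is why low-volume self-hosting rarely pays for its operational overhead.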
What to do first
The right sequence matters more than theoretical completeness. You do not start with self-hosting. You start by stopping the obvious leaks, instrumenting the system, and making routing decisions based on data instead of intuition.
| Optimization | Effort | Impact | Savings | When to do it |
|---|---|---|---|---|
| Prompt compression | Low | Medium | 20-40% | Always do first |
| Model routing | Medium | Very high | 60-80% | As soon as spend is material |
| Semantic caching | Medium | High | 30-60% | When query patterns repeat |
| Batch processing | Low | Medium | 50% on eligible traffic | When latency is not critical |
| Fine-tuning | High | High | 70-90% | High-volume narrow tasks |
| Self-hosting | Very high | Very high | 80-95% | When spend or data constraints justify ops |
Bottom line
Cost optimization compounds
Start with a baseline of $10,000 per month. Prompt cleanup can plausibly bring that to $7,000. Routing can cut the remainder to roughly $2,100. Caching can take it to about $1,260. Batch APIs can push the total near $1,008. The point is not that every team lands on those exact numbers. The point is that improvements stack.
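That compounding chain is just repeated multiplication, which a few lines make explicit (the percentage cuts are the illustrative ones from the paragraph above):

```python
# The compounding chain above: each optimization cuts a fraction of
# whatever spend remains after the previous one.
baseline = 10_000  # $/month
cuts = [("prompt cleanup", 0.30), ("routing", 0.70),
        ("caching", 0.40), ("batch APIs", 0.20)]

spend = baseline
for step, cut in cuts:
    spend *= (1 - cut)
    print(f"after {step}: ${spend:,.0f}/month")
```

Note that the order of the multiplications does not change the final number, but it does change which optimization looks biggest in isolation, which is one reason teams misattribute savings.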
* Self-hosted costs are approximate and depend heavily on GPU utilization, throughput, and operational overhead.