The cost equation
For any LLM workflow, the bill is approximately:
cost = (input_tokens × input_price + output_tokens × output_price) × requests
Here the token counts are per-request averages and the prices are per token. Output tokens cost 3–5× more than input tokens across most providers. Reasoning/thinking tokens (where supported) are billed at the output rate. Caching reduces effective input cost for repeated prefixes by 50–90%.
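To make the equation concrete, here is a minimal sketch in Python. The prices ($3 and $15 per million tokens) and the traffic numbers are made-up assumptions chosen to sit inside the 3–5× range above, not any provider's actual rates.

```python
def request_cost(input_tokens, output_tokens, input_price, output_price):
    """Dollar cost of one request; prices are dollars per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# Hypothetical workload: 2,000 input and 500 output tokens per request,
# one million requests per month, output priced at 5x input.
per_request = request_cost(2_000, 500, input_price=3.0, output_price=15.0)
monthly = per_request * 1_000_000
output_share = (500 * 15.0) / (2_000 * 3.0 + 500 * 15.0)

print(f"per request: ${per_request:.4f}")       # per request: $0.0135
print(f"per month:   ${monthly:,.0f}")          # per month:   $13,500
print(f"output share of cost: {output_share:.0%}")  # output share of cost: 56%
```

In this example output is only 20% of the tokens but more than half of the bill, which is why output length shows up as its own lever later.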
Your levers, ordered by typical impact:
Lever 1: Pick the right model for the task (40–80% savings)
The single biggest mistake teams make: using a frontier model for everything because "it's smarter." For most production tasks — classification, extraction, formatting, routing — a smaller model is 10–30× cheaper and indistinguishable in quality once you tune the prompt. Reserve the frontier model for the small percentage of requests that actually need its capabilities.
Practical pattern: route requests by complexity. Use a tiny model (or rule) to classify; send hard cases to the big model. Most product workloads I've audited end up with 70–90% of traffic going through a small model and the rest through a frontier one, for a 50–70% cost reduction at equal quality.
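A rough sketch of the routing pattern follows. The model names, the keyword rule, and the 300-word cutoff are all hypothetical placeholders; in practice the `looks_hard` check would be whatever rule or tiny classifier fits your traffic. The `blended_cost_ratio` helper assumes every request uses about the same number of tokens, which is optimistic because hard requests tend to be longer.

```python
SMALL_MODEL = "small-model"        # hypothetical model name
FRONTIER_MODEL = "frontier-model"  # hypothetical model name
HARD_HINTS = {"prove", "debug", "derive"}  # illustrative keywords only

def looks_hard(request):
    """Crude rule-based complexity check; a tiny classifier could replace it."""
    words = request.lower().split()
    return len(words) > 300 or bool(HARD_HINTS & set(words))

def route(request, is_hard=looks_hard):
    """Return the model name that should handle this request."""
    return FRONTIER_MODEL if is_hard(request) else SMALL_MODEL

def blended_cost_ratio(small_share, price_ratio):
    """Cost relative to sending everything to the frontier model.

    small_share: fraction of traffic handled by the small model.
    price_ratio: how many times cheaper the small model is.
    Assumes equal token volume per request on both routes.
    """
    return small_share / price_ratio + (1.0 - small_share)

print(route("Classify this ticket: refund not received"))  # small-model
print(route("Debug this stack trace for me"))               # frontier-model
print(f"{blended_cost_ratio(0.7, 10):.2f}")                 # 0.37
```

With 70% of traffic on a model that is 10× cheaper, the blended bill is about 37% of the all-frontier bill, which lines up with the 50–70% reduction described above.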
Lever 2: Prompt caching (40–90% savings on repeated context)
If your prompts share a large static prefix (system prompt, instructions, few-shot examples, retrieved documents that repeat across users), caching that prefix moves most of your input tokens from full price to a discounted rate, typically around 10% of the full input price on a cache hit.
Two non-obvious tips:
- Cache hit rate is sensitive to prefix structure. Put stable content first, variable content last.
- Most providers cache for ~5 minutes by default. If requests that share a prefix often arrive further apart than that, extend the TTL where the provider allows it.
Caching alone often pays back the engineering effort within a week for any product with traffic above a few thousand requests per day.
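The two ideas above, stable content first and a discounted rate on hits, can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, the 10% cached rate is the typical figure quoted above, and the estimate ignores any surcharge a provider may apply when writing to the cache.

```python
def build_prompt(system, few_shot_examples, user_query):
    """Assemble a prompt with stable content first and variable content last."""
    stable_prefix = "\n\n".join([system, *few_shot_examples])
    return stable_prefix + "\n\n" + user_query

def effective_input_price(full_price, prefix_share, hit_rate, cached_rate=0.10):
    """Average input price after caching.

    prefix_share: fraction of input tokens that belong to the static prefix.
    hit_rate: fraction of requests that hit the cache.
    cached_rate: cached-token price as a fraction of full price (assumed 10%).
    """
    cached_fraction = prefix_share * hit_rate
    return full_price * (cached_fraction * cached_rate + (1.0 - cached_fraction))

system = "You are a support assistant."
examples = ["Q: Where is my order?\nA: Check the tracking link."]
first = build_prompt(system, examples, "How do I reset my password?")
second = build_prompt(system, examples, "Can I change my email?")

# Both prompts start with the same bytes, so a prefix cache can match them.
shared = "\n\n".join([system, *examples])
print(first.startswith(shared) and second.startswith(shared))  # True

price = effective_input_price(3.0, prefix_share=0.8, hit_rate=0.9)
print(f"{price:.3f}")  # 1.056
```

With 80% of input tokens in the prefix and a 90% hit rate, the average input price in this example drops by about 65%. Putting the user query first instead would change the leading bytes on every request and the hit rate would fall toward zero.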
Lever 3: Trim output (20–50% savings)
Output tokens are the expensive ones. If your prompt produces 800-token answers when 200-token answers would be fine, you're spending 4× what you need to on output. Audit a sample of real responses and check:
- Are answers verbose where users want crisp?
- Are there preambles ("Sure! Here's the…") you don't need? A short instruction in the prompt usually eliminates them.
- Are you asking for JSON but the model is including markdown wrappers? That's wasted output tokens.
- Could you use structured output (function calling, JSON mode)? The model produces less filler with it, so use it where you can.
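A small helper along these lines can run the audit over a sample of logged responses. It is a sketch, not a standard tool: the preamble phrases are a guess at common openers, and word count stands in for token count, which is only a rough proxy.

```python
import re
import statistics

# Hypothetical list of filler openers; extend it with what you see in your logs.
PREAMBLE = re.compile(r"^\s*(sure|certainly|of course|here(?:'s| is| are))\b", re.IGNORECASE)
FENCE = re.compile(r"^\s*```")

def audit_responses(responses):
    """Summarize length and filler in a non-empty sample of response strings."""
    if not responses:
        raise ValueError("need at least one response to audit")
    n = len(responses)
    return {
        "count": n,
        "median_words": statistics.median(len(r.split()) for r in responses),
        "preamble_share": sum(bool(PREAMBLE.match(r)) for r in responses) / n,
        "fenced_share": sum(bool(FENCE.match(r)) for r in responses) / n,
    }

sample = [
    "Sure! Here's the summary: revenue rose.",
    "```json\n{\"a\": 1}\n```",
    "Revenue rose 4%.",
]
report = audit_responses(sample)
print(report["median_words"])  # 4
```

If `preamble_share` or `fenced_share` is high, a one-line instruction in the prompt is usually the cheapest fix to try first.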
Lever 4: Cut redundant input (10–30% savings)
Most production prompts accumulate over time — instructions get added, examples pile up, documents get longer. Periodically audit the static portion of your prompts. Typical findings:
- Few-shot examples that no longer match current behavior and can be removed.
- Instructions duplicated across system and user messages.
- Retrieved chunks that overlap with each other (rerank + dedupe before sending).
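For the overlapping-chunks case, one simple approach is to drop any chunk that is too similar to a chunk already kept. The sketch below uses word-set Jaccard similarity with a 0.8 threshold; both the metric and the threshold are assumptions picked for illustration, and it expects the chunks to arrive already sorted by rerank score, best first.

```python
def jaccard(a, b):
    """Jaccard similarity of two word sets; two empty sets count as identical."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)

def dedupe_chunks(chunks, threshold=0.8):
    """Keep a chunk only if it is not a near-duplicate of a chunk kept earlier.

    chunks: strings sorted by rerank score, best first.
    threshold: similarity at or above which a chunk counts as a duplicate.
    """
    kept = []
    kept_word_sets = []
    for chunk in chunks:
        words = set(chunk.lower().split())
        if all(jaccard(words, seen) < threshold for seen in kept_word_sets):
            kept.append(chunk)
            kept_word_sets.append(words)
    return kept

chunks = [
    "Refunds are processed within 5 business days.",
    "Note: refunds are processed within 5 business days.",
    "Shipping takes about two weeks.",
]
print(dedupe_chunks(chunks))
# ['Refunds are processed within 5 business days.', 'Shipping takes about two weeks.']
```

The second chunk shares 7 of 8 distinct words with the first (similarity 0.875), so it is dropped and its tokens never reach the prompt.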
Lever 5: Batch where latency allows (50% savings on async)
Major providers offer batch processing at a 50% discount for async workloads: anything you can run within a 24-hour window. Backfills, evaluations, content generation pipelines, embeddings: all are candidates. Most teams underuse this.
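A quick way to size this lever is to list your jobs with the longest delay each can tolerate and see how much spend qualifies. The job names and dollar figures below are invented for illustration, and the 24-hour window and 50% discount are simply the figures quoted above.

```python
BATCH_WINDOW_HOURS = 24   # assumed turnaround window for batch jobs
BATCH_DISCOUNT = 0.50     # assumed discount on batched requests

def plan_batching(jobs):
    """Return (names of batchable jobs, projected monthly cost after batching).

    jobs: dicts with "name", "monthly_cost", and "max_latency_hours".
    A job is batchable if it can wait at least the full batch window.
    """
    batchable = [j["name"] for j in jobs if j["max_latency_hours"] >= BATCH_WINDOW_HOURS]
    projected = sum(
        j["monthly_cost"] * ((1.0 - BATCH_DISCOUNT) if j["name"] in batchable else 1.0)
        for j in jobs
    )
    return batchable, projected

jobs = [
    {"name": "chat", "monthly_cost": 6000.0, "max_latency_hours": 0.001},
    {"name": "nightly_eval", "monthly_cost": 2000.0, "max_latency_hours": 24},
    {"name": "backfill", "monthly_cost": 2000.0, "max_latency_hours": 72},
]
batchable, projected = plan_batching(jobs)
print(batchable)   # ['nightly_eval', 'backfill']
print(projected)   # 8000.0
```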
What usually doesn't help much
- Aggressive temperature tuning. Doesn't affect cost.
- Switching providers for "cheaper" rates of the same model class. Usually a 10–30% difference, dominated by the levers above.
- Fine-tuning to save tokens. Sometimes valuable, but the engineering cost is high relative to picking the right base model and caching.
- Self-hosting open-weight models. Often more expensive than API at small scale once you account for GPU costs, devops time, and underutilization. Becomes cheaper at large scale, but the crossover point comes at a higher volume than people expect.
The audit playbook
- Export a week of usage by endpoint and model. Identify the top 3 cost drivers.
- For each, sample 50 requests. Look at input and output sizes. Decide which lever applies.
- Pick the highest-impact lever and implement it for that one endpoint.
- Measure for a week. If savings are real, move to the next endpoint.
Three iterations like this typically reduce LLM bills by 50–75% without quality loss.
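The first step of the playbook, finding the top cost drivers, might look like this once the usage export is loaded into a list of rows. The row fields, endpoint names, model names, and prices are all hypothetical; adapt them to whatever your provider's export actually contains.

```python
from collections import defaultdict

def top_cost_drivers(usage_rows, prices, n=3):
    """Rank (endpoint, model) pairs by total cost.

    usage_rows: dicts with "endpoint", "model", "input_tokens", "output_tokens".
    prices: model name -> (input_price, output_price), dollars per million tokens.
    Returns the n most expensive pairs as ((endpoint, model), dollars).
    """
    totals = defaultdict(float)
    for row in usage_rows:
        input_price, output_price = prices[row["model"]]
        cost = (row["input_tokens"] * input_price
                + row["output_tokens"] * output_price) / 1_000_000
        totals[(row["endpoint"], row["model"])] += cost
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]

# Hypothetical prices and one week of usage.
prices = {"frontier-model": (3.0, 15.0), "small-model": (0.25, 1.25)}
rows = [
    {"endpoint": "/chat", "model": "frontier-model", "input_tokens": 1_000_000, "output_tokens": 200_000},
    {"endpoint": "/classify", "model": "small-model", "input_tokens": 4_000_000, "output_tokens": 100_000},
    {"endpoint": "/chat", "model": "frontier-model", "input_tokens": 1_000_000, "output_tokens": 0},
]
drivers = top_cost_drivers(rows, prices)
print(drivers[0])  # (('/chat', 'frontier-model'), 9.0)
```

In this example `/classify` handles more tokens than `/chat` but costs far less, which is the kind of mismatch the audit is meant to surface.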