LLM Cost Optimization: The Levers That Actually Move the Needle

The cost equation

For any LLM workflow, the bill is approximately:

cost = (input_tokens × input_price + output_tokens × output_price) × requests

Output tokens cost 3–5× more than input tokens across most providers. Reasoning/thinking tokens (where supported) are billed at the output rate. Caching reduces the effective input cost of repeated prefixes by 50–90%.
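To make the equation concrete, here is a minimal sketch of it as a function, extended with a cached portion of the input. The prices, token counts, and the 90% cache discount are made-up illustration values, not any provider's actual rates, and the sketch ignores any surcharge a provider may apply when a cache entry is first written.

```python
def request_cost(input_tokens, output_tokens, input_price, output_price,
                 cached_tokens=0, cache_discount=0.0):
    """Dollar cost of one request; prices are in dollars per million tokens.

    cached_tokens is the portion of input_tokens read from cache, billed at
    input_price * (1 - cache_discount).
    """
    uncached_tokens = input_tokens - cached_tokens
    input_cost = (uncached_tokens * input_price
                  + cached_tokens * input_price * (1 - cache_discount))
    output_cost = output_tokens * output_price
    return (input_cost + output_cost) / 1_000_000


# Hypothetical workload: 4,000 input and 500 output tokens per request,
# at $3 (input) and $15 (output) per million tokens.
baseline = request_cost(4000, 500, 3.0, 15.0)
cached = request_cost(4000, 500, 3.0, 15.0, cached_tokens=3500, cache_discount=0.9)

requests = 1_000_000
print(f"baseline:    ${baseline * requests:,.0f}")
print(f"with cache:  ${cached * requests:,.0f}")
```

With these assumed numbers the cached request costs roughly half the baseline, and what remains is mostly output cost.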

Your levers, ordered by typical impact:

Lever 1: Pick the right model for the task (40–80% savings)

The single biggest mistake teams make: using a frontier model for everything because "it's smarter." For most production tasks — classification, extraction, formatting, routing — a smaller model is 10–30× cheaper and indistinguishable in quality once you tune the prompt. Reserve the frontier model for the small percentage of requests that actually need its capabilities.

Practical pattern: route requests by complexity. Use a tiny model (or rule) to classify; send hard cases to the big model. Most product workloads I've audited end up with 70–90% of traffic going through a small model and the rest through a frontier one, for a 50–70% cost reduction at equal quality.
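The routing pattern can be sketched with a plain rule in place of a classifier model. Everything here is a placeholder assumption: the model names, the keyword list, and the length threshold are hypothetical and would need tuning against real traffic.

```python
from dataclasses import dataclass

# Hypothetical model identifiers; substitute the ones your provider offers.
SMALL_MODEL = "small-model"
FRONTIER_MODEL = "frontier-model"

# Illustrative signals of a hard request; tune these against real traffic.
HARD_KEYWORDS = ("prove", "refactor", "diagnose", "multi-step")


@dataclass
class Route:
    model: str
    reason: str


def route(prompt: str, max_easy_chars: int = 2000) -> Route:
    """Send long or keyword-flagged prompts to the frontier model."""
    if len(prompt) > max_easy_chars:
        return Route(FRONTIER_MODEL, "long input")
    lowered = prompt.lower()
    for keyword in HARD_KEYWORDS:
        if keyword in lowered:
            return Route(FRONTIER_MODEL, f"keyword: {keyword}")
    return Route(SMALL_MODEL, "default")


def blended_cost(small_share: float, small_relative_price: float) -> float:
    """Cost relative to sending everything to the frontier model (= 1.0)."""
    return small_share * small_relative_price + (1 - small_share) * 1.0


print(route("Classify this ticket: customer wants a refund"))
print(route("Refactor this module to remove the global state"))
# 70% of traffic on a model 10x cheaper, ignoring the cost of the router itself.
print(f"relative cost: {blended_cost(0.7, 0.1):.2f}")
```

Logging the reason alongside the chosen model makes it easier to check later whether the hard cases really needed the frontier model.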

Lever 2: Prompt caching (40–90% savings on repeated context)

If your prompts share a large static prefix (system prompt, instructions, few-shot examples, retrieved documents that repeat across users), caching that prefix moves most of your input cost from full price to the discounted cached rate (often around 10% of the full input price on a cache hit, though the discount varies by provider).

Two non-obvious tips:

Cache hit rate is sensitive to prefix structure. Put stable content first, variable content last.
Most providers cache for ~5 minutes by default. For long-tail workloads, increase TTL where possible.

Caching alone often pays back the engineering effort within a week for any product with traffic above a few thousand requests per day.
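The "stable content first" tip can be illustrated by measuring how much of two prompts is shared from the start. This is a character-level sketch with invented prompt text; real caches work on tokens and on provider-specific boundaries, so treat the numbers as a proxy only.

```python
import os

# Invented static content standing in for a real system prompt and policy.
SYSTEM = "You are a support assistant. Follow the policy below.\n"
POLICY = "Policy: refunds are accepted within 30 days. " * 20


def stable_first(user_message: str) -> str:
    """Static prefix first, per-request content last."""
    return SYSTEM + POLICY + "\nUser: " + user_message


def variable_first(user_message: str) -> str:
    """Per-request content first, which breaks the shared prefix early."""
    return "User: " + user_message + "\n" + SYSTEM + POLICY


def shared_prefix_chars(a: str, b: str) -> int:
    """Length of the longest common prefix of two prompts, in characters."""
    return len(os.path.commonprefix([a, b]))


first, second = "Where is my order?", "How do I reset my password?"
print(shared_prefix_chars(stable_first(first), stable_first(second)))
print(shared_prefix_chars(variable_first(first), variable_first(second)))
```

In the first layout the whole static block is shared between requests; in the second only the leading "User: " label is, so almost nothing would be cacheable.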

Lever 3: Trim output (20–50% savings)

Output tokens are the expensive ones. If your prompt produces 800-token answers when 200-token answers would be fine, you're spending 4× what you need to on output. Audit a sample of real responses and check:

Are answers verbose where users want crisp?
Are there preambles ("Sure! Here's the…") you don't need? A short instruction in the prompt usually eliminates them.
Are you asking for JSON but the model is including markdown wrappers? That's wasted output tokens.
If you use structured output (function calling, JSON mode), the model produces less filler. Use it where you can.
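The checks above can be run over a sample of logged responses with a few lines of code. This is a rough sketch: the preamble phrases are a guessed starter list, word count is only a crude stand-in for token count, and the fence marker is built from single backticks so it does not appear literally in the example.

```python
import re

TICKS = "`" * 3  # markdown code-fence marker

# Guessed list of filler openers; extend it from your own samples.
PREAMBLE = re.compile(r"^\s*(sure|certainly|of course|great question)\b",
                      re.IGNORECASE)
# A response that is entirely wrapped in a markdown code fence.
FENCED = re.compile(r"^\s*" + TICKS + r"\w*\s*\n(.*?)\n?" + TICKS + r"\s*$",
                    re.DOTALL)


def audit(responses):
    """Share of responses with a preamble or fence wrapper, plus average length."""
    count = len(responses)
    return {
        "preamble_rate": sum(bool(PREAMBLE.match(r)) for r in responses) / count,
        "fenced_rate": sum(bool(FENCED.match(r)) for r in responses) / count,
        "avg_words": sum(len(r.split()) for r in responses) / count,
    }


samples = [
    "Sure! Here's the summary: the order shipped on Monday.",
    TICKS + 'json\n{"status": "shipped"}\n' + TICKS,
    "Refund approved.",
]
print(audit(samples))
```

A high preamble or fence rate points at a prompt fix; a high average length points at a length instruction or an output token cap.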

Lever 4: Cut redundant input (10–30% savings)

Most production prompts accumulate over time — instructions get added, examples pile up, documents get longer. Periodically audit the static portion of your prompts. Typical findings:

Few-shot examples that no longer match current behavior and can be removed.
Instructions duplicated across system and user messages.
Retrieved chunks that overlap with each other (rerank + dedupe before sending).
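For the overlapping-chunks finding, a minimal dedupe pass might look like the following. It assumes the chunks arrive already ranked best-first and uses word-set Jaccard similarity with an arbitrary 0.8 threshold; an embedding-based similarity would be a reasonable substitute.

```python
import re


def word_set(text: str) -> set:
    """Lowercased set of word tokens, ignoring punctuation."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a: set, b: set) -> float:
    """Overlap of two word sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def dedupe_chunks(ranked_chunks, threshold=0.8):
    """Keep a chunk only if it is not too similar to a higher-ranked kept chunk."""
    kept, kept_sets = [], []
    for chunk in ranked_chunks:
        words = word_set(chunk)
        if all(jaccard(words, seen) < threshold for seen in kept_sets):
            kept.append(chunk)
            kept_sets.append(words)
    return kept


chunks = [
    "Refunds are issued within 30 days of purchase.",
    "Refunds are issued within 30 days of the purchase.",
    "Shipping takes 5 business days.",
]
print(dedupe_chunks(chunks))
```

Because the loop walks the list in rank order, the copy that survives is always the one the reranker scored highest.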

Lever 5: Batch where latency allows (50% savings on async)

Major providers offer batch processing at a 50% discount for async workloads: anything you can run within a 24-hour window. Backfills, evaluations, content generation pipelines, and embeddings are all candidates. Most teams underuse this.
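A quick way to size this lever is to list your workloads with the longest wait each can tolerate and see which fit the batch window. The job names and monthly costs below are invented, and the 24-hour window and 50% discount are taken from the text above rather than from any specific provider's terms.

```python
from dataclasses import dataclass

BATCH_WINDOW_HOURS = 24   # assumed turnaround window, per the text above
BATCH_DISCOUNT = 0.5      # assumed discount, per the text above


@dataclass
class Job:
    name: str
    max_wait_hours: float   # longest delay the consumer of the result tolerates
    monthly_cost: float     # dollars at regular synchronous prices


def plan(jobs):
    """Split jobs into batchable and synchronous, and estimate monthly savings."""
    batch = [job for job in jobs if job.max_wait_hours >= BATCH_WINDOW_HOURS]
    sync = [job for job in jobs if job.max_wait_hours < BATCH_WINDOW_HOURS]
    savings = sum(job.monthly_cost for job in batch) * BATCH_DISCOUNT
    return batch, sync, savings


jobs = [
    Job("chat", max_wait_hours=0.001, monthly_cost=6000),
    Job("nightly-evals", max_wait_hours=48, monthly_cost=1500),
    Job("catalog-backfill", max_wait_hours=168, monthly_cost=2500),
]
batch, sync, savings = plan(jobs)
print([job.name for job in batch], f"saves ${savings:,.0f}/month")
```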

What usually doesn't help much

Aggressive temperature tuning. Doesn't affect cost.
Switching providers for "cheaper" rates of the same model class. Usually a 10–30% difference, dominated by the levers above.
Fine-tuning to save tokens. Sometimes valuable, but the engineering cost is high relative to picking the right base model and caching.
Self-hosting open-weight models. Often more expensive than API at small scale once you account for GPU costs, devops time, and underutilization. Becomes cheaper at large scale, but the crossover point is further than people expect.
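The self-hosting crossover is easy to estimate once the fixed cost is written down. The sketch below uses invented figures (a $6,000 monthly fixed cost covering GPUs and operations time, and a $0.50 per million token blended API price) and assumes the marginal cost per self-hosted token is negligible, which flatters self-hosting.

```python
def breakeven_tokens_per_month(fixed_monthly_cost, api_price_per_million):
    """Monthly token volume at which self-hosting matches the API bill."""
    return fixed_monthly_cost / api_price_per_million * 1_000_000


def self_host_price_per_million(fixed_monthly_cost, tokens_per_month):
    """Effective self-hosted price at a given volume; rises as utilization falls."""
    return fixed_monthly_cost / tokens_per_month * 1_000_000


# Hypothetical figures for illustration only.
FIXED_MONTHLY_COST = 6000.0
API_PRICE_PER_MILLION = 0.5

breakeven = breakeven_tokens_per_month(FIXED_MONTHLY_COST, API_PRICE_PER_MILLION)
print(f"break-even volume: {breakeven:,.0f} tokens/month")
print(f"price at 1B tokens/month: "
      f"${self_host_price_per_million(FIXED_MONTHLY_COST, 1_000_000_000):.2f} per million")
```

Under these assumptions the break-even sits in the billions of tokens per month, and below it every idle GPU hour pushes the effective price above the API rate.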

The audit playbook

Export a week of usage by endpoint and model. Identify the top 3 cost drivers.
For each, sample 50 requests. Look at input and output sizes. Decide which lever applies.
Pick the highest-impact lever and implement it for that one endpoint.
Measure for a week. If savings are real, move to the next endpoint.
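The first step of the playbook can be sketched as a small aggregation over exported usage rows. The row fields, endpoint names, and the price table are assumptions for illustration; a real export will have its own schema and real per-model prices.

```python
from collections import defaultdict

# Hypothetical (input, output) prices in dollars per million tokens.
PRICES = {"small": (0.5, 1.5), "frontier": (3.0, 15.0)}


def top_cost_drivers(rows, prices, n=3):
    """Total cost per (endpoint, model), highest first, limited to n entries."""
    totals = defaultdict(float)
    for row in rows:
        input_price, output_price = prices[row["model"]]
        cost = (row["input_tokens"] * input_price
                + row["output_tokens"] * output_price) / 1_000_000
        totals[(row["endpoint"], row["model"])] += cost
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]


# Invented usage rows standing in for a week of exported logs.
usage = [
    {"endpoint": "/chat", "model": "frontier", "input_tokens": 2_000_000, "output_tokens": 400_000},
    {"endpoint": "/classify", "model": "small", "input_tokens": 4_000_000, "output_tokens": 200_000},
    {"endpoint": "/chat", "model": "frontier", "input_tokens": 1_000_000, "output_tokens": 200_000},
    {"endpoint": "/summarize", "model": "frontier", "input_tokens": 500_000, "output_tokens": 100_000},
]
for (endpoint, model), cost in top_cost_drivers(usage, PRICES):
    print(f"{endpoint:12} {model:10} ${cost:,.2f}")
```

Splitting each total into its input and output parts is a natural next step, since that split largely decides which lever applies.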

Three iterations like this typically reduce LLM bills by 50–75% without quality loss.

Most engineering teams that complain about LLM costs leave 60–80% in savings on the table while they work the wrong levers. The cost structure has a few dominant drivers; skipping them and optimizing everything else is the most common pattern. The real levers, ordered by impact:

The cost equation

For any LLM workflow, the bill is approximately:

cost = (input_tokens × input_price + output_tokens × output_price) × request_count

At most providers, output tokens are 3–5× more expensive than input tokens. Reasoning/thinking tokens (on models that support them) are priced as output. Caching cuts the effective input cost of repeated prefixes by 50–90%.

The levers, ordered by typical impact:

Lever 1: Pick the right model for the task (40–80% savings)

The single biggest mistake teams make: using a frontier model for everything because it is "smarter." For most production tasks (classification, extraction, formatting, routing) a small model is 10–30× cheaper and, once the prompt is tuned, indistinguishable in quality. Save the frontier model for the small percentage of requests that truly needs its capabilities.

Practical pattern: route by complexity. Use a tiny model (or a rule) to classify, and send the hard cases to the big model. In most of the product workloads I have audited, 70–90% of traffic goes to the small model and the rest to the frontier one, for a 50–70% cost reduction at equal quality.

Lever 2: Prompt caching (40–90% savings on repeated context)

If your prompts share a large static prefix (system prompt, instructions, few-shot examples, documents that repeat across users), caching that prefix pulls most of your input cost from full price down to the discounted rate (about 10% of the full input price after a cache hit).

Two non-obvious tips:

Cache hit rate is sensitive to prefix structure. Stable content first, variable content last.
Most providers cache for ~5 minutes by default. For long-tail workloads, increase the TTL if possible.

Caching alone pays back the engineering effort within a week for any product with traffic above a few thousand requests per day.

Lever 3: Trim output (20–50% savings)

Output tokens are the expensive ones. If your prompt produces 800-token answers when 200-token answers would do, you are paying an unnecessary 4×. Take a sample of real responses and audit them:

Are answers long when the user wants them short?
Is there a preamble you do not want ("Sure, here's…")? A short instruction in the prompt usually clears it up.
Are you asking for JSON while the model adds a markdown wrapper? Those are wasted output tokens.
If you use structured output (function calling, JSON mode), the model produces less filler. Use it where possible.

Lever 4: Cut redundant input (10–30% savings)

Production prompts accumulate over time: instructions get added, examples pile up, documents grow. Periodically audit the static part of your prompt. Typical findings:

Few-shot examples that no longer match current behavior and can be deleted.
Instructions duplicated between system and user messages.
Retrieved chunks that overlap with each other (rerank + dedupe before sending).

Lever 5: Batch where latency allows (50% savings on async)

Major providers offer batch processing at a 50% discount for async workloads: anything that can fit in a 24-hour window. Backfills, evaluations, content generation pipelines, and embeddings are all candidates. Most teams underuse this.

What usually doesn't help much

Aggressive temperature tuning. It does not affect cost.
Switching to a "cheaper" provider for the same model class. Usually a 10–30% difference; the levers above dominate.
Fine-tuning to save tokens. Sometimes valuable, but the engineering cost is high compared with picking the right base model and caching.
Self-hosting open-weight models. Once GPU cost, devops time, and underutilization are counted, it is usually more expensive than an API at small scale. It gets cheaper at large scale, but the crossover point is further out than expected.

The audit playbook

Export a week of usage by endpoint and model. Identify the top 3 cost drivers.
For each, take a sample of 50 requests. Look at input and output sizes. Decide which lever applies.
Apply the highest-impact lever to that endpoint.
Measure for a week. If the savings are real, move on to the next endpoint.

Three such iterations typically reduce LLM bills by 50–75% without quality loss.