LLM Rate Calc

Plan your API capacity before hitting 429 errors in production.

Supports OpenAI GPT-5.4, GPT-4.1, o3, o4-mini, Anthropic Claude Opus 4.7, Sonnet 4.6, Haiku 4.5, Google Gemini 3.1, Gemini 2.5, Groq Llama 4, DeepSeek V3, Mistral, xAI Grok. Calculate RPM, TPM, concurrent users, and monthly API costs.

Provider & model

Provider
Model
Tier / plan: Your tier sets the per-minute request and token limits your provider grants you. It usually upgrades automatically as your cumulative spend increases. Check your provider's console to confirm your current tier.

Concurrent users: Users actively firing API requests at the same time, not your total user base. For 1,000 registered users, real concurrency is typically 5–15% of that.
Input tokens / msg: Tokens in your request: system prompt + user message combined. Rule of thumb: 1 token ≈ 0.75 English words. A 500-token message ≈ 375 words.
Output tokens / msg: Tokens the model generates per response. Short answers: 50–150, detailed replies: 300–800. Output tokens are typically 3–5x more expensive than input tokens.
Avg. request duration (seconds): Time from sending the request to receiving the full response. This determines how many requests per minute each user generates. For streaming, use total duration, not time-to-first-token.
Typical: GPT-4o mini ~1–2 s · GPT-5.4 ~2–5 s · Claude Sonnet ~2–5 s · Gemini Flash ~1–3 s · Groq ~0.5–1 s

Required RPM: RPM = Requests Per Minute. How many API calls your users generate each minute. Each message sent = 1 request. If your required RPM exceeds the tier limit, you'll get 429 errors.
Required TPM: TPM = Tokens Per Minute. Total input tokens processed per minute across all users. Longer prompts or more users burn through TPM faster. TPM is separate from RPM: either one can trigger a 429 first.
Max concurrent users on this tier
RPM usage: How much of your tier's RPM limit you're consuming. Above 90% means traffic spikes will cause 429 errors.
TPM usage (input): How much of your tier's input TPM limit you're consuming. Often the binding constraint for apps with long system prompts or large context windows.
Bottleneck: Which limit you'll hit first, RPM or TPM. Apps with many short messages are RPM-bound. Apps with long prompts or large contexts are TPM-bound.
Estimated cost / minute (USD)

Data from official provider docs — April 2026. Always verify at your provider's console before finalizing architecture decisions.

Monthly budget (USD): How much you're willing to spend on this model per month. The planner will calculate the maximum concurrent users this budget supports at your chosen usage profile.
Input tokens / msg: Tokens in your request including system prompt + user message. 1 token ≈ 0.75 English words.
Output tokens / msg: Tokens the model generates per response. Output is typically 3–5x more expensive than input.
Messages per user / day: How many API messages an average active user sends per day. For a chat app: 10–30 typical. For an agent running tasks: could be 100+.
Active days / month: How many days per month each user is active. 22 = weekdays only, 30 = daily usage.

With your budget you can support
monthly active users
Cost per user / month
Cost per message
Messages per user / month
Total messages / month (at max users)

Cost estimate only — excludes prompt caching savings, batch API discounts, and infrastructure costs. Always verify pricing at your provider's documentation.
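
The budget figures above reduce to a few multiplications. The following is a minimal sketch of that arithmetic, not the planner's actual code: the function name, the parameter names, and the example prices ($3/M input, $15/M output) are illustrative assumptions, so substitute your provider's current rates.

```python
def max_users_for_budget(
    monthly_budget_usd: float,
    input_tokens: int,
    output_tokens: int,
    messages_per_day: float,
    active_days: int,
    input_price_per_m: float,   # USD per 1M input tokens (placeholder value)
    output_price_per_m: float,  # USD per 1M output tokens (placeholder value)
) -> dict:
    """Estimate how many monthly active users a budget supports."""
    cost_per_message = (
        input_tokens * input_price_per_m + output_tokens * output_price_per_m
    ) / 1_000_000
    messages_per_user = messages_per_day * active_days
    cost_per_user = cost_per_message * messages_per_user
    # Round down: a partially funded user does not count.
    max_users = int(monthly_budget_usd // cost_per_user) if cost_per_user > 0 else 0
    return {
        "cost_per_message": cost_per_message,
        "messages_per_user": messages_per_user,
        "cost_per_user": cost_per_user,
        "max_users": max_users,
        "total_messages": max_users * messages_per_user,
    }


# Hypothetical profile: $1,000/month, 500 input + 300 output tokens per message,
# 20 messages per day, 22 active days, placeholder prices of $3/M and $15/M.
plan = max_users_for_budget(1_000, 500, 300, 20, 22, 3.0, 15.0)
print(plan)
```

Like the planner, this sketch ignores prompt caching, batch discounts, and infrastructure costs.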

How it works

LLM APIs enforce two independent rate limits: RPM (Requests Per Minute) and TPM (Tokens Per Minute). Hitting either one returns a 429 error and blocks your users until the window resets.

This planner calculates how much of your tier capacity your application will consume based on concurrent users, average message size, and request duration. It tells you which limit you will hit first — and how far away you are from the ceiling.

1. Select your model
Choose your provider, model, and current API tier.
2. Enter usage profile
Set concurrent users, tokens per message, and request duration.
3. Read the results
See required RPM/TPM, your bottleneck, and estimated cost per minute.
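
The calculation behind these steps can be sketched in a few lines. This is an illustrative reconstruction under one assumption, namely that each concurrent user sends the next request as soon as the previous one completes; `estimate_capacity` and `CapacityEstimate` are hypothetical names, and the limits in the example are the ones quoted in the document-processing scenario on this page.

```python
from dataclasses import dataclass


@dataclass
class CapacityEstimate:
    required_rpm: float
    required_tpm: float
    rpm_usage: float  # fraction of the RPM limit (1.0 = 100%)
    tpm_usage: float  # fraction of the input TPM limit
    bottleneck: str   # "RPM" or "TPM"


def estimate_capacity(
    concurrent_users: int,
    input_tokens: int,
    request_seconds: float,
    rpm_limit: float,
    tpm_limit: float,
) -> CapacityEstimate:
    # Assumption: a user fires the next request right after the last one ends.
    requests_per_user_per_minute = 60 / request_seconds
    required_rpm = concurrent_users * requests_per_user_per_minute
    required_tpm = required_rpm * input_tokens
    rpm_usage = required_rpm / rpm_limit
    tpm_usage = required_tpm / tpm_limit
    # The limit with the higher utilisation is the one reached first.
    bottleneck = "RPM" if rpm_usage >= tpm_usage else "TPM"
    return CapacityEstimate(required_rpm, required_tpm, rpm_usage, tpm_usage, bottleneck)


# 30 concurrent jobs, 8,000 input tokens, 10 s per request, 2,000 RPM / 4M TPM.
doc_pipeline = estimate_capacity(30, 8_000, 10, rpm_limit=2_000, tpm_limit=4_000_000)
print(doc_pipeline)
```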

Example scenarios

Early-stage SaaS — 50 concurrent users, GPT-4o mini, OpenAI Tier 1
50 users × (60 s ÷ 3 s per request) = 1,000 RPM needed, or 500,000 input TPM at 500 tokens per request. Tier 1 allows 500 RPM. Result: critical. You will hit the RPM limit immediately. Upgrade to Tier 2 or reduce concurrency.
Production chatbot — 200 concurrent users, Claude Sonnet 4.6, Anthropic Tier 3
200 users × (60 s ÷ 4 s per request) = 3,000 RPM needed, or 2.4M input TPM at 800 tokens per request. Tier 3 limit is 2,000 RPM. TPM usage is only 40%. The bottleneck is RPM: consider upgrading to Tier 4 or batching requests.
Document processing pipeline — 30 concurrent jobs, Gemini 2.0 Flash, Pay-as-you-go
30 jobs × (60 s ÷ 10 s per request) = 180 RPM; 180 requests × 8,000 input tokens = 1.44M TPM. Pay-as-you-go limit is 2,000 RPM / 4M TPM. Result: comfortable, with plenty of headroom on both dimensions.
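
Running the same arithmetic in reverse gives the "max concurrent users on this tier" figure. The sketch below assumes back-to-back requests per user and uses only the limits quoted in the scenarios above; `max_concurrent_users` is a hypothetical helper, not the planner's own code.

```python
import math
from typing import Optional


def max_concurrent_users(
    request_seconds: float,
    rpm_limit: float,
    input_tokens: Optional[int] = None,
    tpm_limit: Optional[float] = None,
) -> int:
    """Largest user count that stays within the RPM (and optionally TPM) limit."""
    per_user_rpm = 60 / request_seconds
    by_rpm = rpm_limit / per_user_rpm
    if input_tokens is None or tpm_limit is None:
        return math.floor(by_rpm)
    by_tpm = tpm_limit / (per_user_rpm * input_tokens)
    # The tighter of the two limits decides.
    return math.floor(min(by_rpm, by_tpm))


saas = max_concurrent_users(3, rpm_limit=500)       # early-stage SaaS scenario
chatbot = max_concurrent_users(4, rpm_limit=2_000)  # production chatbot scenario
pipeline = max_concurrent_users(10, 2_000, input_tokens=8_000, tpm_limit=4_000_000)
print(saas, chatbot, pipeline)  # prints: 25 133 83
```

In the first scenario, for instance, staying on the quoted 500 RPM limit would mean capping concurrency at 25 users rather than 50.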

Frequently asked questions

What is a 429 error?
A 429 Too Many Requests error means you have exceeded your API provider's rate limit. Many providers include a Retry-After header indicating how long to wait, though it is not guaranteed to be present. Your application should honor that header when it is there and otherwise fall back to exponential backoff, so retries are handled gracefully.
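
A minimal sketch of that retry logic, using only the standard library. The `send` callable is a hypothetical stand-in for your HTTP client: it is assumed to return a `(status, headers, body)` tuple with headers as a plain dict, which is not any particular SDK's interface. The jittered delay is one common backoff variant, chosen here for illustration.

```python
import random
import time


def call_with_backoff(send, max_retries=5, base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Retry `send()` on HTTP 429, honoring Retry-After when it is a number of seconds."""
    for attempt in range(max_retries + 1):
        status, headers, body = send()
        if status != 429:
            return status, body
        if attempt == max_retries:
            break
        try:
            # Retry-After may be absent or an HTTP date; both fall through to backoff.
            delay = float(headers.get("Retry-After"))
        except (TypeError, ValueError):
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            delay = random.uniform(0, ceiling)  # jitter spreads out retrying clients
        sleep(delay)
    raise RuntimeError("rate limited: retries exhausted")
```

Passing `sleep` as a parameter keeps the function easy to test without real waiting. Note that the dict lookup here is case-sensitive, while real HTTP header names are not, so adapt the lookup to your client's header object.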
What is the difference between RPM and TPM?
RPM (Requests Per Minute) limits how many API calls you can make, regardless of size. TPM (Tokens Per Minute) limits the total volume of text processed. Either limit can trigger a 429 — whichever you hit first is your bottleneck. Apps with many short messages are RPM-bound. Apps with long system prompts or large context windows are typically TPM-bound.
How do I find my current tier?
OpenAI: platform.openai.com → Settings → Limits. Anthropic: console.anthropic.com → Settings → Limits. Google: aistudio.google.com → API keys. Tiers are usually assigned automatically based on cumulative spend.
What is "concurrent users" — is it the same as total users?
No. Concurrent users means users actively making API requests at the same moment. For most SaaS apps, real concurrency is 5–15% of your total user base. If you have 1,000 registered users, expect 50–150 concurrent at peak hours.
How often is the pricing data updated?
We update the model and pricing data manually when providers make changes. Data was last verified in April 2026. Always cross-check with your provider's official documentation before making architecture or budget decisions.
Does prompt caching affect TPM limits?
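It can, depending on the provider. Anthropic documents that input tokens read from the cache do not count toward input TPM limits on most current models. OpenAI's documentation indicates that cached tokens still count toward TPM, so verify the current rules for your provider. Where cached tokens are excluded and your system prompt is large and consistent across requests, prompt caching can effectively multiply your TPM capacity by 5–10x. This planner does not account for caching, so real-world headroom may be higher if you use it.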
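
To see where a 5–10x figure can come from, consider the share of each request that is served from cache. The sketch below assumes the provider excludes cached input tokens from TPM entirely and that every request hits the cache; `effective_tpm_multiplier` is a hypothetical helper and the token counts are illustrative.

```python
def effective_tpm_multiplier(input_tokens: int, cached_tokens: int) -> float:
    """How many times more requests fit in the same TPM limit when cached tokens are not counted."""
    counted = input_tokens - cached_tokens
    if counted <= 0:
        raise ValueError("at least one input token must be uncached")
    return input_tokens / counted


# A 5,000-token request with a 4,000-token cached prefix counts as 1,000 tokens.
print(effective_tpm_multiplier(5_000, 4_000))  # 5.0
# With a 4,500-token cached prefix only 500 tokens count.
print(effective_tpm_multiplier(5_000, 4_500))  # 10.0
```

In other words, a 5–10x gain corresponds to 80–90% of each request's input being a cached prefix.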


About the LLM Rate Calculator


This tool estimates total cost for an LLM-based workflow. You provide model choice, average input/output token sizes, requests per day, and (optionally) cache hit rate. The calculator multiplies through and gives daily, monthly, and annual cost — with separate lines for input, output, and cached input.

For most production workloads, the cost structure is dominated by 2–3 factors: model choice (frontier vs smaller), output token volume (output is 3–5× input price), and cache utilization (typically 90% discount on cached prefixes). Optimizing these three covers most savings.

The calculator helps with three real decisions: deciding whether to migrate to a different model class, deciding whether prompt caching engineering is worth the effort, and estimating cost before launching a new feature so you can size the budget conversation honestly.
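
The multiplication described above can be sketched as follows. This is an illustration rather than the tool's implementation: it treats the cache hit rate as the share of input tokens billed at the cached rate, applies the 90% cached discount mentioned above, and assumes a 30-day month and 365-day year. All names and example prices are placeholders.

```python
def llm_cost_breakdown(
    requests_per_day: int,
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float,    # USD per 1M input tokens (placeholder)
    output_price_per_m: float,   # USD per 1M output tokens (placeholder)
    cache_hit_rate: float = 0.0,   # assumed: share of input tokens read from cache
    cached_discount: float = 0.90, # assumed: discount on cached input tokens
) -> dict:
    total_input = requests_per_day * input_tokens
    cached_tokens = total_input * cache_hit_rate
    fresh_tokens = total_input - cached_tokens

    input_cost = fresh_tokens * input_price_per_m / 1_000_000
    cached_cost = cached_tokens * input_price_per_m * (1 - cached_discount) / 1_000_000
    output_cost = requests_per_day * output_tokens * output_price_per_m / 1_000_000

    daily = input_cost + cached_cost + output_cost
    return {
        "input": input_cost,
        "cached_input": cached_cost,
        "output": output_cost,
        "daily": daily,
        "monthly": daily * 30,
        "annual": daily * 365,
    }


# Hypothetical workload: 50k requests/day, 5,000 input and 200 output tokens,
# $3/M input, $15/M output, half of the input tokens served from cache.
costs = llm_cost_breakdown(50_000, 5_000, 200, 3.0, 15.0, cache_hit_rate=0.5)
for line, usd in costs.items():
    print(f"{line}: ${usd:,.2f}")
```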


Where LLM costs come from


Migration sanity check. A workflow currently using a frontier model at $15/M output tokens, with 2k average output and 100k requests/day, spends 200M output tokens × $15/M = $3,000/day on output. Switching to a smaller model at $1/M output saves about $2,800/day, or roughly $1M/year, if quality holds.

Cache payback. A 5k-token system prompt repeated across 50k daily requests at $3/M input = $750/day at full price. Caching that prefix at 90% discount drops it to $75/day; over a year, $250k saved.

Output trimming. Cutting average output from 800 to 200 tokens (by tightening the prompt instructions) divides output cost by 4. At $15/M output and 50k req/day, that takes output spend from $600/day to $150/day, saving $450/day.
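
Each of the three scenarios above is a single multiplication: tokens per request × requests per day × price per million tokens. The short sketch below works them through with the prices and volumes stated in the text; `daily_cost` is a hypothetical helper name.

```python
def daily_cost(tokens_per_request: int, requests_per_day: int, price_per_m: float) -> float:
    """Daily spend in USD for one token stream (input or output)."""
    return tokens_per_request * requests_per_day * price_per_m / 1_000_000


# Migration: 2k output tokens, 100k requests/day, $15/M vs $1/M output.
frontier = daily_cost(2_000, 100_000, 15.0)
smaller = daily_cost(2_000, 100_000, 1.0)
migration_saving = frontier - smaller

# Cache payback: 5k-token prefix, 50k requests/day, $3/M input, 90% cached discount.
full_price = daily_cost(5_000, 50_000, 3.0)
cached_price = full_price * (1 - 0.90)
cache_saving = full_price - cached_price

# Output trimming: 800 -> 200 output tokens, 50k requests/day, $15/M output.
trim_saving = daily_cost(800, 50_000, 15.0) - daily_cost(200, 50_000, 15.0)

for name, saving in [
    ("migration", migration_saving),
    ("caching", cache_saving),
    ("trimming", trim_saving),
]:
    print(f"{name}: ${saving:,.0f}/day, ${saving * 365:,.0f}/year")
```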


Cost estimation mistakes


Frequently asked questions


Are prices always per million tokens?
Most providers price per million tokens, separated into input and output. Some have additional tiers for reasoning, image inputs, audio inputs, etc. Check the current pricing page of your provider.
How accurate is this estimate?
For steady-state workloads, within 5–15% of actual. The biggest sources of error are inaccurate input/output size averages and unaccounted retries.
Does it handle multi-model workflows?
Compute each model leg separately and sum. The tool helps with one leg at a time.
What about batch pricing?
Most providers discount batch (async) at 50%. If your workload tolerates a 24-hour turnaround, run that estimate at half rate.
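
For a multi-model workflow with a batch leg, the two answers above combine into a per-leg sum. The sketch below is illustrative only: the leg definitions and prices are made-up placeholders, and the 50% batch discount is the typical figure quoted above, not a guarantee for any particular provider.

```python
def leg_cost(
    requests_per_day: int,
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
    batch: bool = False,
    batch_discount: float = 0.5,  # assumed typical discount for async batch jobs
) -> float:
    """Daily USD cost of one model leg, optionally at the batch rate."""
    cost = requests_per_day * (
        input_tokens * input_price_per_m + output_tokens * output_price_per_m
    ) / 1_000_000
    return cost * (1 - batch_discount) if batch else cost


# Hypothetical two-leg pipeline: a small real-time classifier, then a
# frontier-model summarizer that tolerates batch turnaround.
classifier = leg_cost(100_000, 300, 20, 0.5, 1.5)
summarizer = leg_cost(10_000, 4_000, 1_000, 3.0, 15.0, batch=True)
total_daily = classifier + summarizer
print(f"classifier ${classifier:.2f} + summarizer ${summarizer:.2f} = ${total_daily:.2f}/day")
```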