LLM Rate Calc

Provider & model

Provider

Model

Tier / plan: Your tier sets the per-minute request and token limits your provider grants you. It usually upgrades automatically as your cumulative spend increases. Check your provider's console to confirm your current tier.

Usage profile

Concurrent users: Users actively firing API requests at the same time, not your total user base. For 1,000 registered users, real concurrency is typically 5–15% of that.

Input tokens / msg: Tokens in your request, system prompt and user message combined. Rule of thumb: 1 token ≈ 0.75 English words, so a 500-token message ≈ 375 words.

Output tokens / msg: Tokens the model generates per response. Short answers: 50–150; detailed replies: 300–800. Output tokens are typically 3–5x more expensive than input tokens.

Avg. request duration (seconds): Time from sending the request to receiving the full response. This determines how many requests per minute each user generates. For streaming, use total duration, not time-to-first-token.

Typical: GPT-4o mini ~1–2 s · GPT-5.4 ~2–5 s · Claude Sonnet ~2–5 s · Gemini Flash ~1–3 s · Groq ~0.5–1 s

Required RPM: RPM = Requests Per Minute, the number of API calls your users generate each minute. Each message sent = 1 request. If your required RPM exceeds the tier limit, you'll get 429 errors.

Required TPM: TPM = Tokens Per Minute, the total input tokens processed per minute across all users. Longer prompts or more users burn through TPM faster. TPM is separate from RPM, and either one can trigger a 429 first.
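The two required-rate figures can be sketched as follows. This is a minimal illustration, assuming each concurrent user sends a new request as soon as the previous one finishes; the function names are hypothetical and not taken from the tool itself.

```python
def required_rpm(concurrent_users: int, request_seconds: float) -> float:
    """Requests per minute when every concurrent user sends requests back to back."""
    if request_seconds <= 0:
        raise ValueError("request_seconds must be positive")
    # Each user completes 60 / request_seconds requests per minute.
    return concurrent_users * 60.0 / request_seconds


def required_input_tpm(rpm: float, input_tokens_per_msg: int) -> float:
    """Input tokens per minute: every request carries the same input size."""
    return rpm * input_tokens_per_msg


rpm = required_rpm(concurrent_users=50, request_seconds=3)
tpm = required_input_tpm(rpm, input_tokens_per_msg=500)
print(f"{rpm:,.0f} RPM, {tpm:,.0f} input TPM")  # 1,000 RPM, 500,000 input TPM
```

Real traffic is burstier than this steady-state model, so treat the result as a baseline rather than a peak.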

Max concurrent users on this tier

RPM usage: How much of your tier's RPM limit you're consuming. Above 90% means traffic spikes will cause 429 errors.

TPM usage (input): How much of your tier's input TPM limit you're consuming. Often the binding constraint for apps with long system prompts or large context windows.

Bottleneck: Which limit you'll hit first, RPM or TPM. Apps with many short messages are RPM-bound. Apps with long prompts or large contexts are TPM-bound.

Estimated cost / minute (USD)
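A rough sketch of how usage, bottleneck, and maximum concurrent users relate to the tier limits. The helper names are hypothetical, and the limits in the usage lines are the illustrative 2,000 RPM / 4M TPM figures from the document-processing scenario, not verified provider data.

```python
def utilization(required: float, limit: float) -> float:
    """Fraction of a tier limit consumed (1.0 means 100%)."""
    return required / limit


def bottleneck(rpm_used: float, tpm_used: float) -> str:
    """The limit with the higher utilization is the one reached first."""
    return "RPM" if rpm_used >= tpm_used else "TPM"


def max_concurrent_users(rpm_limit: float, tpm_limit: float,
                         request_seconds: float, input_tokens_per_msg: int) -> int:
    """Largest user count that stays within both limits at this usage profile."""
    per_user_rpm = 60.0 / request_seconds
    per_user_tpm = per_user_rpm * input_tokens_per_msg
    return int(min(rpm_limit / per_user_rpm, tpm_limit / per_user_tpm))


# 30 jobs, 10 s per request, 8,000 input tokens: 180 RPM and 1.44M input TPM.
rpm_used = utilization(180, 2_000)
tpm_used = utilization(1_440_000, 4_000_000)
print(bottleneck(rpm_used, tpm_used))                    # TPM
print(max_concurrent_users(2_000, 4_000_000, 10, 8_000))  # 83
```

Even when both utilizations are low, the bottleneck still tells you which limit to watch as traffic grows.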

Data from official provider docs — April 2026. Always verify at your provider's console before finalizing architecture decisions.

Usage profile

Monthly budget (USD): How much you're willing to spend on this model per month. The planner calculates the maximum number of monthly active users this budget supports at your chosen usage profile.

Input tokens / msg: Tokens in your request, including system prompt and user message. 1 token ≈ 0.75 English words.

Output tokens / msg: Tokens the model generates per response. Output is typically 3–5x more expensive than input.

Messages per user / day: How many API messages an average active user sends per day. For a chat app, 10–30 is typical. For an agent running tasks, it could be 100+.

Active days / month: How many days per month each user is active. 22 = weekdays only, 30 = daily usage.

Results: the number of monthly active users your budget can support, plus cost per user / month, cost per message, messages per user / month, and total messages / month (at max users).
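The budget outputs follow from a short chain of multiplications. The sketch below is one plausible way to compute them; `plan_budget` and `BudgetPlan` are hypothetical names, and the prices in the usage example ($1/M input, $4/M output) are placeholders rather than any provider's actual rates.

```python
import math
from dataclasses import dataclass


@dataclass
class BudgetPlan:
    cost_per_message: float
    messages_per_user_month: int
    cost_per_user_month: float
    max_users: int
    total_messages_month: int


def plan_budget(monthly_budget_usd: float, input_tokens: int, output_tokens: int,
                messages_per_user_day: int, active_days: int,
                input_price_per_m: float, output_price_per_m: float) -> BudgetPlan:
    # Prices are quoted per million tokens, so divide by 1,000,000.
    cost_per_message = (input_tokens * input_price_per_m
                        + output_tokens * output_price_per_m) / 1_000_000
    messages_per_user_month = messages_per_user_day * active_days
    cost_per_user_month = cost_per_message * messages_per_user_month
    # Round down: a fractional user cannot be supported.
    max_users = math.floor(monthly_budget_usd / cost_per_user_month)
    return BudgetPlan(cost_per_message, messages_per_user_month,
                      cost_per_user_month, max_users,
                      max_users * messages_per_user_month)


# Placeholder prices, not real provider rates.
plan = plan_budget(500, input_tokens=500, output_tokens=250,
                   messages_per_user_day=20, active_days=22,
                   input_price_per_m=1.0, output_price_per_m=4.0)
print(plan.max_users, plan.total_messages_month)  # 757 333080
```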

Cost estimate only — excludes prompt caching savings, batch API discounts, and infrastructure costs. Always verify pricing at your provider's documentation.

How it works

LLM APIs enforce two independent rate limits: RPM (Requests Per Minute) and TPM (Tokens Per Minute). Hitting either one returns a 429 error and blocks your users until the window resets.

This planner calculates how much of your tier capacity your application will consume based on concurrent users, average message size, and request duration. It tells you which limit you will hit first — and how far away you are from the ceiling.

1. Select your model

Choose your provider, model, and current API tier.

2. Enter usage profile

Set concurrent users, tokens per message, and request duration.

3. Read the results

See required RPM/TPM, your bottleneck, and estimated cost per minute.

Example scenarios

Early-stage SaaS — 50 concurrent users, GPT-4o mini, OpenAI Tier 1

50 users × (60 s ÷ 3 s per request) = 1,000 RPM needed, at 500 input tokens per message. Tier 1 allows 500 RPM. Result: critical. You will hit the RPM limit immediately. Upgrade to Tier 2 or reduce concurrency.

Production chatbot — 200 concurrent users, Claude Sonnet 4.6, Anthropic Tier 3

200 users × (60 s ÷ 4 s per request) = 3,000 RPM needed, at 800 input tokens per message. Tier 3 limit is 2,000 RPM. TPM usage is only 40%. Bottleneck is RPM: consider upgrading to Tier 4 or batching requests.

Document processing pipeline — 30 concurrent jobs, Gemini 2.0 Flash, Pay-as-you-go

30 jobs × (60 s ÷ 10 s per request) = 180 RPM; 180 RPM × 8,000 input tokens = 1.44M TPM. Pay-as-you-go limit is 2,000 RPM / 4M TPM. Result: comfortable, with plenty of headroom on both dimensions.

Frequently asked questions

What is a 429 error?

A 429 Too Many Requests error means you have exceeded your API provider's rate limit. The response often includes a Retry-After header indicating how long to wait. Your application should implement exponential backoff to handle these errors gracefully.
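One common way to implement this is exponential backoff with jitter that defers to Retry-After when the server supplies it. The sketch below is provider-agnostic: `RateLimitError` is a hypothetical stand-in for whatever exception your SDK raises on a 429, and `send` is any callable that performs the request.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical stand-in for a provider SDK's 429 exception."""

    def __init__(self, retry_after: Optional[float] = None):
        super().__init__("429 Too Many Requests")
        self.retry_after = retry_after  # seconds, parsed from Retry-After if present


def backoff_delay(attempt: int, retry_after: Optional[float] = None,
                  base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        return retry_after  # the server's hint takes priority
    ceiling = min(cap, base * 2 ** attempt)
    return random.uniform(0, ceiling)  # "full jitter" spreads out retries


def call_with_backoff(send: Callable[[], T], max_attempts: int = 5,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `send`, retrying on RateLimitError; max_attempts must be >= 1."""
    for attempt in range(max_attempts):
        try:
            return send()
        except RateLimitError as err:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error
            sleep(backoff_delay(attempt, err.retry_after))
    raise RuntimeError("max_attempts must be at least 1")
```

The `sleep` parameter is injectable so the retry logic can be tested without real waiting. The jitter keeps many clients from retrying in lockstep after a shared rate-limit window resets.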

What is the difference between RPM and TPM?

RPM (Requests Per Minute) limits how many API calls you can make, regardless of size. TPM (Tokens Per Minute) limits the total volume of text processed. Either limit can trigger a 429 — whichever you hit first is your bottleneck. Apps with many short messages are RPM-bound. Apps with long system prompts or large context windows are typically TPM-bound.

How do I find my current tier?

OpenAI: platform.openai.com → Settings → Limits. Anthropic: console.anthropic.com → Settings → Limits. Google: aistudio.google.com → API keys. Tiers are usually assigned automatically based on cumulative spend.

What is "concurrent users" — is it the same as total users?

No. Concurrent users means users actively making API requests at the same moment. For most SaaS apps, real concurrency is 5–15% of your total user base. If you have 1,000 registered users, expect 50–150 concurrent at peak hours.

How often is the pricing data updated?

We update the model and pricing data manually when providers make changes. Data was last verified in April 2026. Always cross-check with your provider's official documentation before making architecture or budget decisions.

Does prompt caching affect TPM limits?

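It depends on the provider. Anthropic's rate-limit documentation states that, for most current models, input tokens read from the cache do not count toward the input TPM limit; OpenAI's prompt caching documentation states that cached tokens still count toward TPM. Where cached tokens are excluded and your system prompt is large and consistent across requests, enabling prompt caching can effectively multiply your TPM capacity by 5–10x. This planner does not account for caching, so real-world limits may be higher if you use it.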
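The 5–10x figure corresponds to roughly 80–90% of each request's input being served from cache. A small sketch of that arithmetic, assuming a provider that excludes cached tokens from its input TPM count (the function names are illustrative):

```python
def counted_input_tokens(input_tokens: int, cached_fraction: float) -> float:
    """Input tokens that still count toward TPM when cached reads are excluded."""
    if not 0.0 <= cached_fraction < 1.0:
        raise ValueError("cached_fraction must be in [0, 1)")
    return input_tokens * (1.0 - cached_fraction)


def capacity_multiplier(cached_fraction: float) -> float:
    """How many times more requests fit under the same TPM limit."""
    if not 0.0 <= cached_fraction < 1.0:
        raise ValueError("cached_fraction must be in [0, 1)")
    return 1.0 / (1.0 - cached_fraction)


for fraction in (0.8, 0.9):
    print(f"{fraction:.0%} cached -> about {capacity_multiplier(fraction):.0f}x capacity")
```

Check your provider's rate-limit documentation before relying on this, since not every provider or model treats cached tokens this way.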

About the LLM Rate Calculator

This tool estimates total cost for an LLM-based workflow. You provide model choice, average input/output token sizes, requests per day, and (optionally) cache hit rate. The calculator multiplies through and gives daily, monthly, and annual cost — with separate lines for input, output, and cached input.

For most production workloads, the cost structure is dominated by three factors: model choice (frontier vs. smaller), output token volume (output is typically 3–5× the input price), and cache utilization (cached prefixes are often discounted by up to 90%). Optimizing these three covers most of the savings.

The calculator helps with three real decisions: deciding whether to migrate to a different model class, deciding whether prompt caching engineering is worth the effort, and estimating cost before launching a new feature so you can size the budget conversation honestly.
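The multiplication described above might look like the following sketch. It is not the tool's actual implementation: the names are hypothetical, cache hit rate is treated as the fraction of input tokens billed at the cached rate, cached tokens are assumed to cost 10% of the normal input price, and a month is taken as 30 days.

```python
from dataclasses import dataclass

DAYS_PER_MONTH = 30   # assumption; the tool may use a different convention
DAYS_PER_YEAR = 365


@dataclass
class DailyCost:
    input_usd: float
    cached_input_usd: float
    output_usd: float

    @property
    def total(self) -> float:
        return self.input_usd + self.cached_input_usd + self.output_usd


def estimate_daily_cost(requests_per_day: int, input_tokens: int, output_tokens: int,
                        input_price_per_m: float, output_price_per_m: float,
                        cache_hit_rate: float = 0.0,
                        cached_price_ratio: float = 0.1) -> DailyCost:
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    total_input = requests_per_day * input_tokens
    cached = total_input * cache_hit_rate   # tokens billed at the cached rate
    fresh = total_input - cached            # tokens billed at full price
    return DailyCost(
        input_usd=fresh * input_price_per_m / 1_000_000,
        cached_input_usd=cached * input_price_per_m * cached_price_ratio / 1_000_000,
        output_usd=requests_per_day * output_tokens * output_price_per_m / 1_000_000,
    )


cost = estimate_daily_cost(10_000, input_tokens=2_000, output_tokens=500,
                           input_price_per_m=3.0, output_price_per_m=15.0,
                           cache_hit_rate=0.5)
print(f"daily ${cost.total:,.2f}, monthly ${cost.total * DAYS_PER_MONTH:,.2f}, "
      f"annual ${cost.total * DAYS_PER_YEAR:,.2f}")
```

Keeping the three cost lines separate makes it easy to see which lever (model price, output volume, or caching) dominates a given workload.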

Where LLM costs come from

Migration sanity check. A workflow currently using a frontier model at $15/M output tokens, with 2k average output tokens and 100k requests/day, costs $3,000/day in output. Switching to a smaller model at $1/M output cuts that to $200/day, saving ~$2,800/day or roughly $1M/year, if quality holds.

Cache payback. A 5k-token system prompt repeated across 50k daily requests at $3/M input = $750/day at full price. Caching that prefix at a 90% discount drops it to $75/day; over a year, that saves about $246k.

Output trimming. Cutting average output from 800 to 200 tokens (by tightening the prompt's length instructions) divides output cost by 4. At $15/M output and 50k req/day, that drops output cost from $600/day to $150/day, saving $450/day.
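These scenarios all reduce to price × tokens per request × requests per day. A short sketch that recomputes each one from its stated inputs (the helper name is illustrative):

```python
def daily_token_cost(price_per_m: float, tokens_per_request: int,
                     requests_per_day: int) -> float:
    """Daily USD cost for one token category priced per million tokens."""
    return price_per_m * tokens_per_request * requests_per_day / 1_000_000


# Migration: $15/M vs $1/M output, 2k output tokens, 100k requests/day.
migration_before = daily_token_cost(15, 2_000, 100_000)   # 3000.0
migration_after = daily_token_cost(1, 2_000, 100_000)     # 200.0

# Cache payback: 5k-token prefix, 50k requests/day, $3/M input, 90% discount.
cache_full = daily_token_cost(3, 5_000, 50_000)           # 750.0
cache_discounted = cache_full / 10                        # 75.0

# Output trimming: 800 -> 200 output tokens at $15/M, 50k requests/day.
trim_saving = (daily_token_cost(15, 800, 50_000)
               - daily_token_cost(15, 200, 50_000))       # 450.0

print(migration_before - migration_after, cache_full - cache_discounted, trim_saving)
```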

Cost estimation mistakes

Estimating from peak instead of average. Peak requests are 2–3× average. Cost is dominated by average, not peak.
Ignoring reasoning tokens. Some models charge separately for internal reasoning tokens, which can dwarf visible output for hard tasks.
Overestimating cache savings. 90% discount applies only to the cached portion. If 80% of input is variable, only 20% sees the discount.
Forgetting failed requests. Retries and refused requests still consume input tokens. Add 5–10% buffer.
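Two of the mistakes above are easy to quantify. This is a minimal sketch with hypothetical helper names; the 90% discount and the buffer rate are the illustrative figures from the list, not provider guarantees.

```python
def overall_input_saving(cacheable_fraction: float, cache_discount: float = 0.9) -> float:
    """Fraction of total input cost saved when only part of the input is cacheable."""
    return cacheable_fraction * cache_discount


def with_retry_buffer(estimated_cost: float, retry_rate: float = 0.10) -> float:
    """Inflate an estimate to cover retried and refused requests."""
    return estimated_cost * (1.0 + retry_rate)


# If 80% of input is variable, only 20% is cacheable: an 18% saving, not 90%.
print(f"{overall_input_saving(0.20):.0%}")
# A $1,000 estimate with a 5% retry buffer.
print(f"${with_retry_buffer(1_000, retry_rate=0.05):,.2f}")
```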

Frequently asked questions

Are prices always per million tokens?

Most providers price per million tokens, separated into input and output. Some have additional tiers for reasoning, image inputs, audio inputs, etc. Check the current pricing page of your provider.

How accurate is this estimate?

For steady-state workloads, the estimate is typically within 5–15% of actual cost. The biggest sources of error are inaccurate input/output size averages and unaccounted retries.

Does it handle multi-model workflows?

Compute each model leg separately and sum. The tool helps with one leg at a time.

What about batch pricing?

Several major providers discount batch (asynchronous) requests by about 50%. If your workload tolerates a turnaround of up to 24 hours, halve the estimate.
