Free AI API Cost Calculator
Calculate per-call and monthly cost across GPT-5/4o/o1, Claude Opus/Sonnet/Haiku 4.x, Gemini 2.5/2.0, DeepSeek V3. Models prompt caching and Batch API discounts. 100% browser-based.
How LLM API pricing actually works
Every commercial LLM API bills the same way: per token, per million, with a different rate for input and output. There is no per-call fee, no monthly subscription, no minimum commitment on the standard tier — you pay for exactly the tokens you send and receive. This calculator multiplies your token counts by the published rates from OpenAI, Anthropic, Google, and DeepSeek to surface the real-world dollar number for any workload size.
The catch is that real bills are dominated by patterns the per-million-token rate hides. A chatbot that looks cheap on paper at $0.0008 per call becomes $7,200/month at 300k requests/day. A frontier model that costs $0.02 per call sounds harmless until the agent loop calls it 12 times per user task. Budgeting for AI requires plugging the token math into the actual call volume — which is what this tool does in one panel.
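The core math behind the calculator is just two multiplications and an add. A minimal sketch, with illustrative rates and token counts (check the vendor pages for current figures):

```python
# $ per 1M tokens: (input_rate, output_rate) -- illustrative snapshot, not live pricing
RATES = {
    "gpt-4o-mini": (0.15, 0.60),
    "claude-sonnet-4.6": (3.00, 15.00),
}

def per_call_cost(model, input_tokens, output_tokens):
    """Cost of a single API call in dollars."""
    in_rate, out_rate = RATES[model]
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

def monthly_cost(model, input_tokens, output_tokens, calls_per_day):
    """Per-call cost scaled to a 30-day month."""
    return per_call_cost(model, input_tokens, output_tokens) * calls_per_day * 30

# A "cheap-looking" chatbot call: 1,500 tokens in, 400 out on GPT-4o mini
call = per_call_cost("gpt-4o-mini", 1500, 400)          # ≈ $0.000465 per call
month = monthly_cost("gpt-4o-mini", 1500, 400, 300_000)  # ≈ $4,185/month at 300k calls/day
```

Sub-millicent per-call numbers still compound into four-figure monthly bills at production volume, which is why the calculator leads with the monthly figure.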
Input vs output: the 3–5× rule
Every model on this calculator charges more for output than input. The ratio is broadly consistent across vendors (4–5× for most models, with Gemini 2.5 Pro an outlier at 8×):
| Model | Input $/1M | Output $/1M | Out:in ratio |
|---|---|---|---|
| GPT-5 | $5.00 | $20.00 | 4× |
| o1 | $15.00 | $60.00 | 4× |
| Claude Opus 4.7 | $15.00 | $75.00 | 5× |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 5× |
| Gemini 2.5 Pro | $1.25 | $10.00 | 8× |
| GPT-4o mini | $0.15 | $0.60 | 4× |
| Gemini 2.0 Flash | $0.10 | $0.40 | 4× |
| DeepSeek V3 | $0.27 | $1.10 | 4× |
Two practical consequences: (1) for chatbots that produce long answers, output dominates the bill — capping max_tokens is the highest-leverage cost lever; (2) for RAG and summarization workloads, where you stuff lots of context in and pull a short answer out, input dominates — and prompt caching becomes the highest-leverage lever.
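That split can be made concrete. A sketch using the Claude Sonnet 4.6 rates from the table above, with made-up token counts for the two workload shapes:

```python
def cost_split(input_tokens, output_tokens, in_rate, out_rate):
    """Return (input_share, output_share) of total call cost."""
    cost_in = input_tokens / 1e6 * in_rate
    cost_out = output_tokens / 1e6 * out_rate
    total = cost_in + cost_out
    return cost_in / total, cost_out / total

# Chatbot shape: short prompt, long answer -> output dominates (~8% / ~92%)
chat_in, chat_out = cost_split(500, 1200, 3.00, 15.00)

# RAG shape: big stuffed context, short answer -> input dominates (~89% / ~11%)
rag_in, rag_out = cost_split(12_000, 300, 3.00, 15.00)
```

Knowing which direction dominates tells you which lever to pull first: max_tokens caps for the chatbot shape, prompt caching for the RAG shape.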
Prompt caching can cut input cost by 90%
All three major providers now support prompt caching: when a long prefix repeats across calls (system prompt, tool definitions, retrieved documents, few-shot examples), the provider stores the precomputed KV cache and bills cached tokens at a steep discount on subsequent calls.
- Anthropic — 90% off cached input. Claude Opus drops from $15 to $1.50 per 1M.
- OpenAI — 50–90% off cached input depending on model. Automatic on prompts > 1024 tokens.
- Google — ~75% off cached input on Gemini 2.5 Pro / 2.0 Flash. Explicit cache creation API.
Switch the calculator's Pricing mode selector to Prompt cache hit to see the upper-bound savings (assumes 100% of input tokens are cache hits). Real production cache hit rates typically land in the 40–80% range — scale the savings accordingly.
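The blended input rate at a realistic hit rate is a weighted average. A sketch assuming Anthropic's 90% cached-input discount; the hit rate and model are illustrative:

```python
def effective_input_rate(list_rate, cached_rate, hit_rate):
    """Blended $/1M input rate when only a fraction of input tokens hit the cache."""
    return hit_rate * cached_rate + (1 - hit_rate) * list_rate

# Claude Opus: $15 list, $1.50 cached, 70% of input tokens served from cache
rate = effective_input_rate(15.00, 1.50, 0.70)   # $5.55 per 1M -> 63% off list
```

This is why a 70% hit rate on a 90%-discount cache yields the ~63% input saving quoted elsewhere on this page: 0.70 × 0.90 = 0.63.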
Batch API: 50% off if you can wait
For workloads that don't need a real-time response, the Batch API gives a flat 50% discount on both input and output across OpenAI, Anthropic, and Google: you submit a JSONL file of requests, and the provider returns results within 24 hours.
Good fits: nightly summarization runs, embeddings backfills on historical data, LLM-as-judge evaluation sweeps, content moderation passes, document classification jobs. Bad fits: anything user-facing, agent loops, customer chat, on-call alerts.
Switching the calculator's Pricing mode to Batch API applies the discount across both directions. A workflow that costs $20k/month in real-time becomes $10k/month batched — pure margin recovery if you can tolerate the delay.
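For mixed workloads, the blended cost is a one-liner. A sketch assuming the flat 50% batch discount and an illustrative 60/40 offline/real-time split:

```python
def blended_monthly(standard_monthly, batch_fraction, batch_discount=0.50):
    """Monthly cost when batch_fraction of the workload moves to Batch pricing."""
    realtime = standard_monthly * (1 - batch_fraction)
    batched = standard_monthly * batch_fraction * (1 - batch_discount)
    return realtime + batched

# $20k/month workload where 60% of calls can tolerate a 24h turnaround
cost = blended_monthly(20_000, 0.60)   # $14,000/month -> $6,000 recovered
```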
Pick the right tier — frontier vs mid vs fast
The 30–150× price gap between fast and frontier models is the most important tradeoff in production AI. The fast tier (GPT-4o mini, Claude Haiku 4.5, Gemini 2.0 Flash, DeepSeek V3) is good enough for the vast majority of business workflows: classification, extraction, structured output, customer Q&A on a known knowledge base.
Frontier models (Claude Opus 4.7, o1, GPT-5) earn their cost on hard reasoning: multi-step planning, code generation across files, ambiguous edge cases, novel problem framings. Sending routine work to a frontier model is the most common avoidable cost mistake — a routing layer that picks the model based on task complexity typically cuts spend by 70–90%.
| Tier | Best for | Avoid for | Per-1M output cost range |
|---|---|---|---|
| Fast | Classification, extraction, simple Q&A, embeddings | Hard reasoning, code refactoring across files | $0.40–$5 |
| Mid | Most product features, RAG, structured generation | Hardest research / planning loops | $10–$15 |
| Frontier | Agent loops, novel reasoning, codegen on hard bugs | High-volume routine work | $20–$75 |
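A back-of-envelope sketch of routing economics. The escalation rate and per-call costs below are assumptions, and escalated calls are charged for both the failed fast attempt and the frontier retry:

```python
def routed_cost(calls, escalate_fraction, fast_per_call, frontier_per_call):
    """Total cost when a router sends most calls to the fast model."""
    fast = calls * (1 - escalate_fraction) * fast_per_call
    # Escalated calls pay for the fast attempt *and* the frontier retry
    escalated = calls * escalate_fraction * (fast_per_call + frontier_per_call)
    return fast + escalated

always_frontier = 1_000_000 * 0.02                        # $20,000: every call on frontier
routed = routed_cost(1_000_000, 0.10, 0.0005, 0.02)       # ≈ $2,500 with 10% escalation
savings = 1 - routed / always_frontier                    # ≈ 87.5%, inside the 70–90% band
```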
Cost-cutting checklist
- Cap max_tokens per route. Output is the expensive direction. A max_tokens of 4000 when 200 would do is a 20× overcharge waiting to happen on long-tail responses. Set per-route caps at roughly 1.5× the typical output length.
- Trim the system prompt. Every call re-sends it. Cutting 500 tokens of unused boilerplate from a prompt sent 200k times/month removes 100M input tokens: about $1,500/month on Claude Opus, proportionally less on cheaper tiers.
- Cache repeated prefixes. OpenAI caches long prefixes automatically; Anthropic caches the prefixes you mark. A 5k-token RAG context cached at a 70% hit rate cuts input cost on that workload by ~63%.
- Route by complexity. A classifier picks fast vs frontier per request. Most queries land on the cheap model; only the hard ones escalate. Typical savings: 70–90% vs always-frontier.
- Batch the offline work. Anything tolerant of 24h latency (embeddings backfills, evals, content audits) runs at 50% off. Real margin uplift, zero quality loss.
- Summarize chat history. Long histories are expensive to resend verbatim. Compress older turns into a compact running summary and keep the last 4–6 messages full-fidelity.
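A rough sketch of the history-summarization saving; the turn count, tokens per turn, and summary size below are assumptions:

```python
def context_tokens(turns, tokens_per_turn, keep_full=6, summary_tokens=300):
    """Input tokens per call when older turns are replaced by a running summary."""
    if turns <= keep_full:
        return turns * tokens_per_turn
    return keep_full * tokens_per_turn + summary_tokens

verbatim = 40 * 250              # 40-turn chat resent in full: 10,000 input tokens/call
trimmed = context_tokens(40, 250)  # 6 full turns + 300-token summary: 1,800 tokens (~82% smaller)
```

Because input cost scales linearly with tokens, an 82% smaller context is an 82% smaller input bill on every subsequent turn of that conversation.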
How to read the calculator
Pick a workload preset to seed realistic token sizes, or punch in your own numbers from the Token Counter. The headline card shows your selected model's per-month and per-call cost; the comparison table sorts every supported model from cheapest to most expensive and shows the multiplier vs the cheapest option. Click any row to make it the headline. Toggle Pricing mode to see how prompt caching and Batch API change the picture.
All math runs in your browser — no requests leave your device. Pricing data is a static snapshot dated next to the volume strip; cross-check against the vendor's console for binding production figures.
FAQ
Is anything I type sent to a server?
No. The calculator does pure arithmetic on the token counts and per-million-token rates you see on the page — there is no inference, no API call, and no telemetry. Open DevTools → Network and you will see zero outbound requests while you change the inputs. Safe to use with confidential workload sizes and internal volume estimates.
How is the per-call cost computed?
(input_tokens / 1,000,000 × input_rate) + (output_tokens / 1,000,000 × output_rate). Input and output are billed at different rates — output is almost always 3–5× more expensive per token than input. Monthly cost multiplies the per-call number by calls_per_day × 30. Switching pricing mode swaps the rates: Standard uses list prices, Prompt cache hit uses the cached-input rate (where the provider supports it), and Batch API uses the 50%-off rate available from OpenAI, Anthropic, and Google for non-real-time workloads.
Where does the pricing data come from?
The rates come from each vendor's official pricing page (OpenAI Platform, Anthropic Pricing, Google Vertex AI / AI Studio, DeepSeek Open Platform) and are stored as a flat constant per model in this site's code. The "as of" date shown next to the volume strip is when the table was last refreshed. If a vendor adjusts pricing between releases, the calculator may briefly lag — always cross-check against the vendor's console for binding numbers.
Why is output so much more expensive than input?
Generating tokens is more compute-intensive than reading them. For input, the model runs a single forward pass over the prompt (highly parallelizable, batched across requests). For output, the model runs one autoregressive step per token, with KV-cache state and sampling — each token needs a fresh forward pass. Providers reflect this in pricing: GPT-5 charges 4× more for output than input, Claude Opus charges 5×, DeepSeek charges 4×. The output:input ratio is the single biggest driver of cost for chatbot-style workloads.
What does Prompt cache hit mode actually save?
When a long prefix of your prompt repeats across calls — system prompt, few-shot examples, tool definitions, retrieved documents — the provider can store the KV cache and skip recomputation. Anthropic gives a 90% discount on cached input (Claude Opus 4.7 drops from $15 to $1.50 per 1M input tokens). OpenAI gives 50–90% on cached prefixes. Google Gemini gives ~75%. The mode in this calculator assumes 100% of the input is cached, which is the upper bound — real cache hit rates are 40–80% for production workloads, so scale the savings by your real hit rate.
When should I use the Batch API instead?
The Batch API is 50% cheaper on both input and output across OpenAI, Anthropic, and Google. The trade-off: results return within 24 hours instead of seconds. Good fits: nightly summarization, embeddings backfills, A/B prompt evaluation, classification of historical documents, content moderation passes. Bad fits: anything user-facing, agent loops, real-time chat. For mixed workloads, run the live tier on the Standard rate and the offline tier on Batch — the calculator lets you compute each separately.
How do I estimate my actual input and output token counts?
Use our Token Counter to measure a representative sample of your real prompts (system + few-shot + user message) and a typical model response. Average across 5–10 examples to smooth variance. For the input field, include the entire prompt the API will see, not just the user's message — system prompts and history can be 10× larger than the user input. For output, look at your actual completions, not your max_tokens cap.
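When you cannot run a real tokenizer, a characters-per-token heuristic gets you in the ballpark. The 4-characters-per-token figure below is a common rule of thumb for English prose, not a vendor guarantee; always re-measure with the actual tokenizer before committing to a budget:

```python
def estimate_tokens(text, chars_per_token=4.0):
    """Rough token estimate for English prose (~4 chars/token rule of thumb)."""
    return round(len(text) / chars_per_token)

# Average over several representative prompts to smooth variance
samples = [
    "You are a helpful support agent. Answer using only the provided context.",
    "Summarize the following ticket in two sentences for the on-call engineer.",
]
avg_tokens = sum(estimate_tokens(s) for s in samples) / len(samples)
```

Code and non-Latin scripts tokenize much less efficiently, so treat this as an English-only lower bound.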
Are tokens the same across providers?
No, and that matters. The same English paragraph might be 250 tokens in GPT-4o, 270 in Claude, and 220 in Gemini 2.5 — Gemini's 256k-token vocabulary tokenizes English very efficiently, while Claude's smaller vocab fragments more. On code or non-Latin scripts (Chinese, Cyrillic, Arabic), the gap can hit 2–3×. So a "cheaper per-million-token" model isn't automatically cheaper for your workload — measure on your actual inputs. The calculator uses your token counts as-is; it does not auto-translate between tokenizers.
Does this include image, audio, or vision tokens?
No — the calculator covers text-only inference. Vision and audio inputs are billed at different per-token rates that vary by image resolution and audio length. For workloads with multimodal content, compute the text portion here and add the multimodal cost from the vendor's pricing page separately. As of 2026-04, GPT-4o images cost ~$0.001875 per low-detail tile, Claude images cost ~$4.80 per 1k images, Gemini audio costs $0.10 per minute on the Pro tier.
Why is the same workload sometimes 100× more expensive on a frontier model?
Frontier models (Claude Opus 4.7, o1, GPT-5) sit at the top of the price curve because they target hard reasoning tasks where the value per call justifies the cost. For routine work — classification, extraction, summarization, customer Q&A — a fast-tier model (GPT-4o mini, Claude Haiku 4.5, Gemini 2.0 Flash) is typically 50–150× cheaper and benchmarks within a few points. The right strategy for cost-sensitive products is to route by complexity: cheap model first, escalate to frontier only when the cheap model fails a confidence check.
Why does my real bill differ from the estimate?
Common reasons: (1) you forgot to include system prompt and history tokens, which dominate input volume; (2) cache hit rate is lower than 100% in production; (3) max_tokens cap was higher than typical output, inflating worst-case cost; (4) request retries on 5xx errors double-charge; (5) tool-calling round-trips multiply per-conversation calls; (6) some providers charge minimum-call fees on small batches. Treat the calculator as an order-of-magnitude planning tool — track real spend in the provider dashboard for binding numbers.
How do I budget for an unpredictable workload?
Calculate three scenarios — pessimistic (worst-case input + max output + zero cache hits + Standard mode), expected (median observed values + 60% cache hit rate), and optimistic (small inputs + Batch API + 80% cache hits). Set your monthly budget cap at 1.5× the expected number. Wire spend alerts at 50% / 80% / 100% in your provider dashboard. For products with viral risk, add a per-user daily cap so a single bad actor cannot drain your budget overnight.
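The three scenarios can be scripted so they stay in sync as your measurements change. Every input below is a placeholder to replace with your own numbers; the cache discount is modeled Anthropic-style at 90%, and the rates are Sonnet-like ($3 in / $15 out per 1M):

```python
def scenario_monthly(input_tok, output_tok, calls_per_day, in_rate, out_rate,
                     cache_hit=0.0, cache_discount=0.90, batch=False):
    """Monthly cost under a given cache hit rate and optional Batch pricing."""
    eff_in = in_rate * (1 - cache_hit * cache_discount)   # blended input rate
    per_call = input_tok / 1e6 * eff_in + output_tok / 1e6 * out_rate
    if batch:
        per_call *= 0.5   # flat Batch API discount on both directions
    return per_call * calls_per_day * 30

pessimistic = scenario_monthly(8000, 2000, 50_000, 3, 15)                    # worst case, no cache
expected    = scenario_monthly(5000, 800, 30_000, 3, 15, cache_hit=0.60)
optimistic  = scenario_monthly(3000, 500, 20_000, 3, 15, cache_hit=0.80, batch=True)

budget_cap = expected * 1.5   # set the monthly cap at 1.5x the expected scenario
```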