AI Model Comparison: GPT, Claude, Gemini, Llama & More
Compare 19 LLMs across 7 providers — pricing, context windows, capabilities (vision, tool use, thinking, batch API), and open-weights status. Filter by tier, sort by cost, select up to 3 models for side-by-side breakdown.
How to use this tool
The AI Model Comparison tool covers 19 large language models across 7 providers as of April 2026. Use the filter chips to narrow by tier (Frontier, Mid, Fast / Cheap), reasoning models, or open-weights models. Sort by input price, output price, context window, or model name. Switch between card grid and table view depending on whether you want a visual overview or a dense sortable table.
To compare models side-by-side: click up to 3 model cards (or table rows) to select them — a bar appears at the bottom of the screen. Once 2 or more are selected, click Compare → to see a full attribute comparison with ★ best highlights on the winning value for each metric (cheapest price, largest context window).
Model tiers explained
Models in this table are grouped into three tiers, which reflect a rough cost-capability tradeoff rather than a strict benchmark ranking.
- Frontier — GPT-5, o3, Claude Opus 4.7, Gemini 2.5 Pro, Grok 3. These models lead on capability benchmarks (MMLU, HumanEval, MATH, GPQA). They charge premium rates: $1.25–$15 per million input tokens, $10–$75 per million output tokens. Use them when you need the best possible quality on hard reasoning, complex code, or long multi-document tasks.
- Mid-tier — GPT-4o, o4-mini, Claude Sonnet 4.6, Gemini 2.5 Flash, Grok 3 mini, DeepSeek V3/R1, Llama 4 Maverick, Mistral Large. These are within 10–20 percentage points of frontier on most benchmarks, but cost 3–10× less. For the majority of production workloads (coding assistants, summarization, RAG retrieval, customer service bots), a mid-tier model is the right default choice.
- Fast / Cheap — GPT-4o mini, Claude Haiku 4.5, Gemini 2.0 Flash, Llama 4 Scout, Mistral Small. Under $1 per million input tokens. Ideal for high-volume classification, keyword extraction, structured output generation, or any task where you can measure quality empirically and it passes your bar. At 50–150× cheaper than frontier, the economics for large-scale offline workloads are very compelling.
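The cost spread between tiers can be sketched in a few lines. The rates below are illustrative mid-range values consistent with the tier descriptions above, not any specific model's published pricing:

```python
# Rough cost comparison across tiers for a high-volume workload.
# Rates are illustrative mid-range values, not real model pricing.
TIER_RATES = {                      # ($ / 1M input tokens, $ / 1M output tokens)
    "frontier":   (5.00, 25.00),
    "mid":        (1.00,  4.00),
    "fast_cheap": (0.15,  0.60),
}

def workload_cost(tier, calls, in_tokens, out_tokens):
    """Total cost in dollars for `calls` requests of the given shape."""
    in_rate, out_rate = TIER_RATES[tier]
    per_call = in_tokens / 1e6 * in_rate + out_tokens / 1e6 * out_rate
    return calls * per_call

# 1M classification calls: 500 input tokens, 20 output tokens each.
for tier in TIER_RATES:
    print(f"{tier:10s} ${workload_cost(tier, 1_000_000, 500, 20):,.2f}")
```

At these example rates, the same million-call job costs $3,000 on frontier, $580 on mid, and $87 on fast/cheap, which is why tier choice dominates everything else for offline volume.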
Reasoning and thinking models
A subset of models in this table are "reasoning" models — they internally generate extended chains of thought before producing a final answer. This dramatically improves performance on multi-step math, competitive programming, scientific research questions, and complex logical deduction. The models with this capability in this table are:
- OpenAI o3 and o4-mini — OpenAI's dedicated reasoning series. o3 benchmarks at or above PhD level on GPQA Diamond. o4-mini is the cost-efficient reasoning option at $1.10/M input.
- Claude Opus 4.7 and Sonnet 4.6 — Anthropic's extended thinking mode lets Claude spend tokens on visible reasoning steps. Particularly useful for agentic tasks where intermediate steps matter.
- Gemini 2.5 Pro and Flash — Google's thinking mode is configurable (budget tokens). Pro is the strongest on long-context reasoning tasks; Flash is cheaper with comparable thinking capability.
- DeepSeek R1 — Open-weights reasoning model that competes with o1 on AIME and MATH at a fraction of the cost. Weights are publicly available for self-hosting.
- Grok 3 mini — xAI's efficient reasoning model with thinking mode, priced very competitively at $0.30/M input, $0.50/M output.
The trade-off: reasoning models are slower (latency can be 10–30s for hard problems) and charge per thinking token generated, which can significantly inflate cost. Always measure on your real task before committing to a reasoning model — for simpler tasks, a mid-tier non-reasoning model often performs just as well at 5–20× lower cost.
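To see how thinking tokens inflate per-call cost, here is a minimal sketch. The rates and token counts are hypothetical, and it assumes thinking tokens are billed at the output rate, which is how providers typically price them:

```python
# How thinking tokens inflate per-call cost for a reasoning model.
# Rates and token counts are hypothetical, for illustration only.
def call_cost(in_tokens, out_tokens, thinking_tokens, in_rate, out_rate):
    """Cost in dollars; thinking tokens are assumed billed at the
    output rate, as most providers do."""
    billed_output = out_tokens + thinking_tokens
    return in_tokens / 1e6 * in_rate + billed_output / 1e6 * out_rate

# Same prompt, with and without extended thinking ($1.10/M in, $4.40/M out).
plain    = call_cost(2_000, 500, 0,     1.10, 4.40)
thinking = call_cost(2_000, 500, 8_000, 1.10, 4.40)
print(f"plain: ${plain:.4f}  thinking: ${thinking:.4f}  "
      f"ratio: {thinking / plain:.1f}x")
```

In this example, 8K thinking tokens make the call 9× more expensive than the non-thinking version, even though the visible answer is identical in length.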
Open-weights models: self-host or use an API?
Three providers in this table have released weights publicly: Meta (Llama 4 Scout and Maverick), Mistral (Mistral Small), and DeepSeek (DeepSeek R1). This means anyone can download the weights and run the model on their own infrastructure — a GPU cluster, a cloud VM, or a local workstation for smaller models.
Advantages of self-hosting: zero per-token cost at scale, complete data privacy (no third party sees your prompts), no rate limits, the ability to fine-tune on proprietary data, and no vendor lock-in. Trade-offs: infrastructure cost (A100/H100 GPUs or equivalent), engineering overhead to deploy and scale, and responsibility for security, updates, and monitoring.
If self-hosting is not practical, all major open-weights models are available via third-party API providers — Groq (fastest inference), Fireworks, Together AI, AWS Bedrock, and Azure AI. Prices shown in this tool are indicative of typical third-party API rates.
Context windows: when they matter
Context window is the maximum combined length of your input plus the model's output, measured in tokens. Most models in this table support 128K–200K tokens — roughly 90,000–140,000 words, or a short novel. This is sufficient for the vast majority of tasks. Where it breaks down:
- Long conversations — chat history accumulates fast. At 128K tokens, a busy 8-hour customer service session can overflow.
- Large codebases — feeding an entire repository as context requires 200K+ tokens for mid-sized projects.
- Legal and scientific documents — contracts, research papers, or clinical trial filings can exceed 100K tokens each.
- Multi-document RAG — if you retrieve 20 documents of 5K tokens each, plus a long system prompt, you can exceed 128K quickly.
For these workloads, Gemini 2.5 Pro (2M tokens) and Llama 4 Scout (10M tokens) are in a different league. Gemini 2.5 Pro is the strongest large-context commercial option; Llama 4 Scout provides open-weights access at a massive context depth. Note that larger context also typically increases latency and cost.
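A cheap pre-flight check can catch overflows before the API rejects the request. This sketch uses the rough 4-characters-per-token rule of thumb; real tokenizers vary, so treat the estimate as approximate:

```python
# Quick context-budget check before sending a request.
# Uses the ~4 characters per token rule of thumb; real tokenizers vary.
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def fits_context(system: str, history: list[str], user: str,
                 max_output: int, context_window: int) -> bool:
    """True if estimated input plus the reserved output budget
    fits inside the model's context window."""
    input_tokens = (estimate_tokens(system)
                    + sum(estimate_tokens(m) for m in history)
                    + estimate_tokens(user))
    return input_tokens + max_output <= context_window

# A ~600K-character document blows past a 128K window with 4K output reserved.
doc = "x" * 600_000
print(fits_context("You are a summarizer.", [doc], "Summarize.", 4_096, 128_000))
```

For accurate counts, swap the heuristic for the provider's actual tokenizer; the structure of the check stays the same.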
Vision and multimodal capabilities
All frontier models and most mid-tier models now accept images alongside text. Vision-capable models can analyze charts and graphs, understand document layout, debug UI screenshots, classify product images, and answer questions about visual content. Models without vision in this table as of April 2026 are DeepSeek V3, DeepSeek R1 (API), Mistral Large, Mistral Small, and Grok 3 mini.
Vision is billed separately from text tokens. OpenAI charges per image tile based on resolution; Anthropic charges a flat per-image rate plus additional tokens for the image content; Google Gemini charges tokens at the same rate as text. For workloads with many images, these costs can add up significantly — benchmark your specific use case rather than relying on text-only estimates.
Capabilities: tool use, JSON mode, batch API, fine-tuning
- Tool use (function calling) — the model can request calls to functions you define, emitting structured arguments; your code executes the call and returns the result for the model to act on. This is the foundation of agent and RAG architectures. Nearly all models in this table support it.
- JSON mode / structured output — forces the model to emit syntactically valid JSON matching a schema you specify. Essential for reliable programmatic parsing. All models shown here support this either via a dedicated mode or via prompt engineering.
- Batch API — submit jobs for asynchronous processing (up to 24 hours) at 50% off standard pricing. Available from OpenAI, Anthropic, and Google. Not yet available from xAI, DeepSeek, Meta, or Mistral as of April 2026. Ideal for large offline workloads where latency is not a constraint.
- Fine-tuning — adapt the base model to your domain using labeled examples. OpenAI offers fine-tuning on GPT-4o and GPT-4o mini; Meta's Llama 4 series can be fine-tuned since weights are available. Fine-tuning typically improves consistency on narrow tasks but does not reliably improve general reasoning — consider it a polish step after prompt engineering is mature.
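The tool-use bullet above reduces to a small dispatch loop on your side: the model emits a function name plus JSON-encoded arguments, and your code runs the matching function. The `model_request` dict here is a hypothetical stand-in for whatever shape your provider's API returns:

```python
import json

# Minimal shape of a tool-use dispatch: the model emits a function name
# and JSON arguments; your code looks up the tool, runs it, and returns
# the result. `model_request` is a hypothetical stand-in for an API response.
TOOLS = {
    "get_weather": lambda city: {"city": city, "temp_c": 18},  # dummy tool
}

def dispatch(model_request: dict):
    """Run the tool the model asked for and return its result."""
    fn = TOOLS[model_request["name"]]
    args = json.loads(model_request["arguments"])  # arguments arrive as a JSON string
    return fn(**args)

print(dispatch({"name": "get_weather",
                "arguments": '{"city": "Lisbon"}'}))
```

In a real agent loop, the returned value is serialized and appended to the conversation as a tool-result message so the model can continue reasoning with it.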
How to choose: a practical framework
Rather than picking the "best" model, match the model to the task:
- Hard reasoning, math, competitive code, research → o3, Claude Opus 4.7, Gemini 2.5 Pro, DeepSeek R1
- General-purpose coding assistant → Claude Sonnet 4.6, GPT-4o, Gemini 2.5 Flash
- High-volume classification or extraction → GPT-4o mini, Gemini 2.0 Flash, Claude Haiku 4.5, Llama 4 Scout
- Long document analysis (200K+ tokens) → Gemini 2.5 Pro (2M), Llama 4 Scout (10M)
- Data privacy / self-hosted inference → Llama 4 Maverick, Llama 4 Scout, Mistral Small, DeepSeek R1
- Budget-constrained reasoning → DeepSeek R1, o4-mini, Grok 3 mini, Gemini 2.5 Flash
- Multimodal (image + text) → GPT-5, GPT-4o, Claude Opus/Sonnet, Gemini 2.5 Pro/Flash, Llama 4
- European data sovereignty → Mistral Large (France-based, GDPR-native)
A pragmatic approach: start with a mid-tier model (Claude Sonnet 4.6 or GPT-4o), measure quality on your real task using a representative eval set, then decide whether upgrading to frontier is worth the cost or downgrading to fast-tier still passes your quality bar. The right model is the cheapest one that meets your quality threshold — not necessarily the most capable one.
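The selection rule above — the cheapest model that meets your quality threshold — is simple enough to encode directly. Scores and per-call costs here are made-up examples:

```python
# "Cheapest model that passes the quality bar." Scores and costs are
# made-up examples; plug in your own eval results and pricing.
def pick_model(candidates, min_quality):
    """candidates: list of (name, eval_score, cost_per_call) tuples.
    Returns the cheapest passing model's name, or None if none pass."""
    passing = [c for c in candidates if c[1] >= min_quality]
    if not passing:
        return None  # escalate: no candidate meets the bar
    return min(passing, key=lambda c: c[2])[0]

candidates = [
    ("frontier-model", 0.95, 0.0300),
    ("mid-model",      0.91, 0.0060),
    ("fast-model",     0.82, 0.0004),
]
print(pick_model(candidates, min_quality=0.90))
```

With a 0.90 bar this picks the mid-tier model; lower the bar to 0.80 and the fast model wins, which is exactly the downgrade decision described above.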
Related tools
Use the AI API Cost Calculator to estimate per-call and monthly spend for any model in this table — enter your token counts, call volume, and pricing mode (Standard / Prompt cache / Batch). Use the AI Token Counter to measure exactly how many tokens your prompts consume across GPT, Claude, and Gemini tokenizers.
FAQ
How is the comparison data sourced?
Pricing comes from each vendor's official pricing page and is stored as constants in this site's code. Capability data (vision, tool use, thinking, batch API) comes from the provider's API documentation. The "as of" date shown below the tool is when data was last refreshed. Always cross-check the provider's current docs before making a production decision — models can be deprecated, capabilities updated, and prices cut without warning.
What does "Frontier" vs "Mid" vs "Fast" mean?
"Frontier" models sit at the top of the capability curve — highest benchmark scores, best reasoning, largest context. They also cost the most. "Mid" models offer strong general performance at a significant price reduction, typically within 10–20 percentage points of frontier on most benchmarks. "Fast / Cheap" models are optimized for throughput and cost — ideal for classification, extraction, summarization, and high-volume tasks where raw reasoning depth matters less. For most real workloads, a mid-tier model handles 80% of tasks well; escalate to frontier only for the hard 20%.
What is a "reasoning" or "thinking" model?
Reasoning models (OpenAI o-series, DeepSeek R1, Grok 3 mini) internally generate long chains of thought before producing their final answer. They're significantly better at multi-step logic, math, and scientific problems, but also considerably slower and more expensive — o3 charges $40/M output tokens vs. $20 for GPT-5. The "Thinking / CoT" capability in this table refers to extended chain-of-thought that the model can optionally reveal. Use reasoning models when accuracy on hard problems matters more than speed or cost.
What is an "open weights" model?
Open-weights models (Meta's Llama 4 series, Mistral Small, DeepSeek R1) have their weights publicly released, meaning you can download and run them yourself on your own hardware or cloud infrastructure. This gives you: zero per-token cost at scale, full data privacy (nothing leaves your servers), no rate limits, and the ability to fine-tune on proprietary data. The trade-off is infrastructure cost and engineering overhead. The API pricing shown is from popular providers (Groq, Fireworks, Together); self-hosting has no per-token fees, only compute costs.
Why is output so much more expensive than input?
Generating tokens requires one full autoregressive forward pass per token, with KV-cache state maintained throughout. Reading input tokens can be parallelized across the whole prompt in a single pass. Providers reflect this in pricing: output is typically 3–8× more expensive than input. For chatbot or agent workloads where output is large, the output rate dominates total cost. For RAG or classification where you send large prompts but get short answers, the input rate matters more.
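Which rate dominates depends entirely on workload shape. A short sketch with illustrative rates ($3/M input, $15/M output, a typical 5× output premium) makes the point:

```python
# Which rate dominates depends on workload shape.
# Illustrative rates: $3/M input, $15/M output (a typical 5x premium).
def cost_split(in_tokens, out_tokens, in_rate=3.0, out_rate=15.0):
    """Return (input_cost, output_cost) in dollars for one call."""
    in_cost = in_tokens / 1e6 * in_rate
    out_cost = out_tokens / 1e6 * out_rate
    return in_cost, out_cost

# Chatbot: short prompt, long answer -> output rate dominates.
print(cost_split(300, 1_200))
# RAG: huge retrieved context, short answer -> input rate dominates.
print(cost_split(20_000, 150))
```

In the chatbot case the output cost is 20× the input cost; in the RAG case the relationship flips, so comparing models on input price alone would be misleading for the first workload and on output price alone for the second.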
What is the Batch API and when should I use it?
OpenAI, Anthropic, and Google all offer a Batch API that processes requests asynchronously (within 24 hours) at 50% off standard prices on both input and output. This is ideal for offline workloads: nightly report generation, historical data classification, content moderation runs, A/B prompt evaluation, embeddings backfills. It's not suitable for user-facing or real-time tasks. If you have a mixed workload, run live traffic at standard rates and batch your offline tasks — the savings add up quickly at scale.
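The mixed-workload split described above is easy to quantify. Volumes and rates in this sketch are hypothetical; only the 50% batch discount comes from the published pricing model:

```python
# Savings from routing offline traffic through a 50%-off batch API.
# Call volumes and per-token rates are hypothetical.
def monthly_cost(calls, in_tok, out_tok, in_rate, out_rate, batch_share=0.0):
    """batch_share: fraction of calls eligible for batch (billed at 50%)."""
    per_call = in_tok / 1e6 * in_rate + out_tok / 1e6 * out_rate
    live = calls * (1 - batch_share) * per_call
    batched = calls * batch_share * per_call * 0.5
    return live + batched

base  = monthly_cost(3_000_000, 1_000, 300, 2.50, 10.00)
mixed = monthly_cost(3_000_000, 1_000, 300, 2.50, 10.00, batch_share=0.6)
print(f"all live: ${base:,.0f}   60% batched: ${mixed:,.0f}")
```

At these example numbers, batching 60% of traffic cuts the monthly bill from $16,500 to $11,550 — a 30% overall saving with zero quality impact on the batched jobs.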
How do I choose between GPT-5, Claude Opus 4.7, and Gemini 2.5 Pro?
All three are top-tier frontier models. In practice, differences come down to: (1) Context window — Gemini 2.5 Pro's 2M token window is in a different league for large document processing. (2) Output pricing — Claude Opus 4.7 charges $75/M output (the highest), so it's expensive for verbose tasks. (3) Thinking — all three support CoT/extended thinking. (4) Ecosystem — GPT-5 fits naturally in OpenAI-native stacks; Claude is best-in-class for code in agentic loops; Gemini integrates with Google Workspace and Vertex AI. (5) Price/performance — Gemini 2.5 Pro at $1.25 input is significantly cheaper than the other two for the same capability tier.
What capabilities are not covered in this comparison?
This table covers text/multimodal (vision) inference. It does not cover: audio input/output (GPT-4o audio, Gemini 2.5 audio mode), real-time streaming voice APIs, image generation (DALL-E, Imagen), embeddings models (text-embedding-3, Gecko), code execution (Gemini Code Execution tool), or search/grounding integrations. For a complete picture of what each platform offers, consult the official API documentation. Prices and capabilities for add-on features like web search grounding or interpreter tools are charged separately.
Why does context window size matter?
Context window is the maximum combined length of your input (system prompt + history + user message + tool results) plus the model's output. Smaller windows (32K–128K) are fine for most chat and single-document tasks. You hit limits with: long conversations that accumulate history, large codebases fed as context, multi-document RAG retrieval, legal or scientific documents, or long agent traces. Gemini 2.5 Pro's 2M window (≈1,500 pages of text) and Llama 4 Scout's 10M window are purpose-built for these use cases. Larger context also typically means higher input cost, so don't over-provision.
What is vision/multimodal capability?
Vision-capable models can accept images (and in some cases video, audio, PDFs) as part of the input. This enables: document understanding with visual layout, chart/graph analysis, screenshot debugging, product image classification, OCR on complex layouts, and visual question answering. All frontier models and most mid-tier models now support vision. Exceptions in this table are DeepSeek V3/R1 and Mistral models, which are text-only as of April 2026. Vision input is billed separately from text tokens — typically per image tile or resolution bracket.
How do I estimate the actual cost of my workload?
Use the AI API Cost Calculator (linked in the sidebar) to input your expected token counts and call volume. For token counts, measure a representative sample of your actual prompts and responses using the Token Counter — the same text can vary 10–20% in token count across providers due to different vocabularies. Key inputs: (1) average input tokens per call (include system prompt + history + user message), (2) average output tokens per call, (3) calls per day, (4) whether you can use Batch API for offline tasks, (5) whether your prompt prefix is stable enough to benefit from prompt caching.
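The five inputs listed above combine into a single estimate. This is a sketch of that arithmetic; all rates and discount fractions are placeholders to be replaced with your model's published pricing:

```python
# Sketch of the monthly estimate described above. All rates and
# discount fractions are placeholders; substitute real pricing.
def monthly_estimate(in_tok, out_tok, calls_per_day,
                     in_rate, out_rate,
                     batch_fraction=0.0, cached_fraction=0.0,
                     cache_discount=0.5):
    """Monthly cost in dollars (30-day month).
    batch_fraction:  share of calls routed through a 50%-off batch API.
    cached_fraction: share of input tokens served from prompt cache,
                     billed at (1 - cache_discount) of the input rate."""
    cached_in = in_tok * cached_fraction
    fresh_in = in_tok - cached_in
    per_call = (fresh_in * in_rate
                + cached_in * in_rate * (1 - cache_discount)
                + out_tok * out_rate) / 1e6
    live = calls_per_day * (1 - batch_fraction) * per_call
    batch = calls_per_day * batch_fraction * per_call * 0.5
    return (live + batch) * 30

# 3K in / 400 out per call, 10K calls/day, 30% batched, 50% cache hits.
print(f"${monthly_estimate(3_000, 400, 10_000, 2.50, 10.00, 0.3, 0.5):,.2f}")
```

Running the numbers this way before launch also shows which lever matters most for your shape: for prompt-heavy workloads the cache fraction dominates, for output-heavy ones the output rate does.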
Are all models available everywhere?
Not necessarily. Some models have geographic restrictions, require enterprise agreements, or have capacity limits. xAI/Grok is available via api.x.ai with an API key. DeepSeek is available via platform.deepseek.com — subject to export controls for some jurisdictions. Open-weights models (Llama 4, Mistral Small, DeepSeek R1) can be accessed via third-party providers (Groq, Fireworks, Together AI, AWS Bedrock, Azure AI) in addition to self-hosting. Always check the provider's terms of service and your jurisdiction's AI regulations before deploying to production.