Context Window Cost Calculator
Long context costs add up fast — and the per-request math is what surprises most teams the first time they run it. Plug in your context size and see what one request costs across every major model, sorted cheapest first.
Last updated: April 2026
Your context
System prompt + context (e.g. RAG docs, conversation history). 100k tokens ≈ 75k words.
Cost per request and per month
Sorted by monthly cost. Input-heavy workloads favor cheaper input pricing.
| Model | Provider | Cost per request | Cost per month |
|---|---|---|---|
| Gemini 2.0 Flash | google | $0.010 | $30.60 |
| GPT-4o mini | openai | $0.015 | $45.90 |
| GPT-5.4 nano | openai | $0.021 | $61.88 |
| Claude Haiku 3 | anthropic | $0.026 | $76.88 |
| Gemini 2.5 Flash | google | $0.031 | $93.75 |
| GPT-4.1 mini | openai | $0.041 | $122.40 |
| GPT-5.4 mini | openai | $0.077 | $231.75 |
| Claude Haiku 4.5 | anthropic | $0.103 | $307.50 |
| o4-mini | openai | $0.112 | $336.60 |
| Gemini 2.5 Pro | google | $0.130 | $390.00 |
| GPT-4.1 | openai | $0.204 | $612.00 |
| o3 | openai | $0.204 | $612.00 |
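The arithmetic behind figures like these can be sketched in a few lines. The page does not state the calculator's inputs, so everything here is an assumption: the listed per-request and monthly costs appear consistent with 100,000 input tokens, 500 output tokens, and 3,000 requests per month, and the per-million-token prices in `PRICES` are illustrative values chosen to match. The names `Price`, `PRICES`, `cost_per_request`, and `monthly_costs` are hypothetical, not part of any provider's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    input_per_m: float   # USD per 1M input tokens (assumed, not from the page)
    output_per_m: float  # USD per 1M output tokens (assumed, not from the page)


# Illustrative prices only; check each provider's current pricing page.
PRICES = {
    "GPT-4.1": Price(2.00, 8.00),
    "Gemini 2.5 Pro": Price(1.25, 10.00),
    "GPT-4o mini": Price(0.15, 0.60),
    "Gemini 2.0 Flash": Price(0.10, 0.40),
}


def cost_per_request(price: Price, input_tokens: int, output_tokens: int) -> float:
    """USD cost of one request: tokens times the per-million price, per side."""
    total = input_tokens * price.input_per_m + output_tokens * price.output_per_m
    return total / 1_000_000


def monthly_costs(input_tokens: int, output_tokens: int, requests_per_month: int):
    """Return (model, per-request cost, monthly cost) tuples, cheapest first."""
    rows = []
    for name, price in PRICES.items():
        per_request = cost_per_request(price, input_tokens, output_tokens)
        rows.append((name, per_request, per_request * requests_per_month))
    return sorted(rows, key=lambda row: row[2])


if __name__ == "__main__":
    for name, per_request, per_month in monthly_costs(100_000, 500, 3_000):
        print(f"{name:<18} ${per_request:.3f}/request  ${per_month:,.2f}/mo")
```

Under these assumed inputs the sketch gives $30.60/mo for Gemini 2.0 Flash and $612.00/mo for GPT-4.1, matching the figures above. Changing the output token count is the quickest way to see how the ranking shifts for output-heavy workloads.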
Long-context features are usually used by a few power users.
The pattern we see most often: a small group of customers sends 80%+ of the long-context requests, and they're rarely on your top tier. PerUnit shows you who's sending the long prompts and which features they're hitting — so the gating and pricing decisions are obvious.
Get early access to PerUnit
Frequently asked questions
- Why does long context cost so much more?
- Because input is the bulk of the bill at long context lengths. A 100K-token prompt with a 500-token response is 200× more input than output — even though input is priced cheaper per token, the volume swamps everything else. At 80K input + 500 output on a flagship model, input typically accounts for 90%+ of the per-request cost.
- Does prompt caching reduce long-context costs?
- A lot. OpenAI caches repeated prompt prefixes automatically, Anthropic offers explicit prompt caching, and Gemini does an implicit version on Flash. For workloads where the same large context (system prompt, RAG corpus, conversation history) repeats across many requests, caching can drop the input portion of the bill by 50–90%. It only helps if the cacheable portion is genuinely repeated; one-off long-context requests get no benefit.
- Which model is cheapest for 200K-token requests?
- On price alone, Gemini 2.0 Flash at the cheap tier ($0.10/M input ≈ $0.020/request for input alone) and Gemini 2.5 Pro at the flagship tier ($1.25/M input ≈ $0.250/request for input alone) win. Claude tops out at 200K context and is more expensive per token but consistently strong on reasoning over long context. GPT-4.1 sits in the middle on both. Use the calculator above to plug in your output volume and see which one wins for your specific workload.
- What's the largest context window available right now?
- Gemini 2.5 Pro and Flash support 1M tokens across all tiers, with some configurations going to 2M. GPT-4.1 supports just over 1M. Claude Sonnet/Opus/Haiku 4.x cap at 200K. GPT-4o caps at 128K. Bigger isn't automatically better — long-context reasoning quality varies by model, and many models degrade past a certain depth even when they technically accept the input.
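The input-share and caching answers above both come down to simple arithmetic. Here is a minimal sketch, assuming a hypothetical flagship price of $2.00/M input and $8.00/M output. The functions `input_share` and `cached_input_cost` and the `cache_discount` parameter are illustrative names, and the model ignores cache-write surcharges and storage fees, which vary by provider.

```python
def input_share(input_tokens: int, output_tokens: int,
                input_per_m: float, output_per_m: float) -> float:
    """Fraction of the per-request cost that comes from input tokens."""
    input_cost = input_tokens * input_per_m / 1_000_000
    output_cost = output_tokens * output_per_m / 1_000_000
    return input_cost / (input_cost + output_cost)


def cached_input_cost(input_tokens: int, input_per_m: float,
                      cached_fraction: float, cache_discount: float) -> float:
    """Input cost in USD when part of the prompt is served from cache.

    cached_fraction: share of input tokens that hit the cache (0 to 1).
    cache_discount:  price reduction on cached tokens (0.9 means cached
                     tokens are billed at 10% of the list price).
    """
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    if not 0.0 <= cache_discount <= 1.0:
        raise ValueError("cache_discount must be between 0 and 1")
    cached_tokens = input_tokens * cached_fraction
    fresh_tokens = input_tokens - cached_tokens
    billed = fresh_tokens * input_per_m + cached_tokens * input_per_m * (1.0 - cache_discount)
    return billed / 1_000_000


if __name__ == "__main__":
    share = input_share(80_000, 500, input_per_m=2.00, output_per_m=8.00)
    print(f"input share of cost: {share:.1%}")

    uncached = cached_input_cost(100_000, 2.00, cached_fraction=0.0, cache_discount=0.9)
    cached = cached_input_cost(100_000, 2.00, cached_fraction=0.9, cache_discount=0.5)
    print(f"uncached input: ${uncached:.3f}  cached input: ${cached:.3f}")
```

At the assumed prices, 80K input plus 500 output puts the input share at roughly 0.976, in line with the 90%+ figure. A one-off request corresponds to `cached_fraction=0.0`, which leaves the input cost unchanged.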