How to Reduce OpenAI API Costs: We Cut Our Bill by 40%
Last quarter our OpenAI bill crossed $12k. Not crazy for a product with AI at its core, but it was growing 15% month-over-month and we had no idea where it was going. The dashboard said "usage by model" — useful, but not enough to make decisions. So we tried a few things. Here's what actually moved the needle.
1. We stopped using GPT-4o for everything
Our default was GPT-4o. It's great. It's also $2.50 per million input tokens. Our support team was using it to classify customer emails — "billing", "bug", "feature request". That doesn't need GPT-4o. We switched to GPT-4o mini ($0.15/M) and accuracy stayed the same. Saved about $800/month on that one feature.
The lesson: audit your features. We had six different AI-powered flows and four of them were defaulting to the flagship model. Classification, extraction, simple Q&A — none of those need GPT-4o. We built a small routing layer that sends complex reasoning to GPT-4o and everything else to mini. Took a week. Paid for itself in the first month.
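The routing layer can be as simple as a lookup table keyed by task type. A minimal sketch of the idea, with illustrative task names and model tiers (the real layer would plug into your API client):

```python
# Pick the model per task, not one flagship default.
# Task names and the tier map below are illustrative.

CHEAP_MODEL = "gpt-4o-mini"   # $0.15 / M input tokens
FLAGSHIP_MODEL = "gpt-4o"     # $2.50 / M input tokens

# Which flows actually need the flagship model.
MODEL_BY_TASK = {
    "classification": CHEAP_MODEL,
    "extraction": CHEAP_MODEL,
    "simple_qa": CHEAP_MODEL,
    "complex_reasoning": FLAGSHIP_MODEL,
    "document_analysis": FLAGSHIP_MODEL,
}

def pick_model(task: str) -> str:
    # Default to the cheap model; upgrade only where it's needed.
    return MODEL_BY_TASK.get(task, CHEAP_MODEL)
```

Defaulting unknown tasks to the cheap model keeps the bill honest: someone has to make the case for the flagship, not the other way round.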
2. Prompt caching for our 2,000-word system prompt
Every chat request was sending the same 2,000-word system prompt. We enabled prompt caching and saw input costs drop by roughly 40% on that flow. Took an afternoon to implement. No product changes.
If you're not familiar: prompt caching gives you a discounted rate on input tokens that repeat across requests (system prompts, RAG chunks that recur). For us, that 2,000-word prompt was going out on every single request. At 50k requests a month, that's roughly 100 million input tokens just for the system prompt. Caching cut the cost of those tokens in half. The setup is straightforward: put the static parts at the start of the prompt and keep them identical across requests, and OpenAI caches the repeated prefix automatically. OpenAI's docs walk you through it.
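The back-of-the-envelope maths above can be sketched as a tiny cost model. Assumptions: the 2,000-word prompt is treated as roughly 2,000 tokens (real token counts run a bit higher), and cached input tokens are billed at 50% of the normal GPT-4o input rate:

```python
# Rough cost model for the system-prompt tokens.
# Assumes ~1 token per word and a 50% discount on cached input tokens.

REQUESTS_PER_MONTH = 50_000
SYSTEM_PROMPT_TOKENS = 2_000
INPUT_RATE = 2.50 / 1_000_000   # $/token, GPT-4o input
CACHED_DISCOUNT = 0.5           # cached tokens billed at half rate

def monthly_system_prompt_cost(cache_hit_rate: float) -> float:
    tokens = REQUESTS_PER_MONTH * SYSTEM_PROMPT_TOKENS  # 100M tokens/month
    cached = tokens * cache_hit_rate
    uncached = tokens - cached
    return uncached * INPUT_RATE + cached * INPUT_RATE * CACHED_DISCOUNT
```

With no cache hits that's $250/month just for the system prompt; with perfect hits it halves. Real hit rates land somewhere in between, which is consistent with the ~40% drop we saw on that flow.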
3. We truncated conversation history
Our chat feature was sending the full thread on every request. A power user with 50 messages in a thread? We were paying for all 50 every time they sent a new one. We switched to a sliding window — last 10 exchanges only — and added optional summarisation for longer threads. Users didn't notice. Our token usage on chat dropped by about 25%.
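The sliding window itself is a few lines. A sketch, assuming messages are the usual `{"role", "content"}` dicts and one exchange is a user turn plus an assistant turn (the summarisation step for longer threads is left out here):

```python
# Keep the system prompt plus only the last N exchanges of a thread.
MAX_EXCHANGES = 10

def truncate_history(history: list[dict]) -> list[dict]:
    system = [m for m in history if m["role"] == "system"]
    rest = [m for m in history if m["role"] != "system"]
    # One exchange = one user turn + one assistant turn,
    # so the window is the last 2 * MAX_EXCHANGES messages.
    return system + rest[-2 * MAX_EXCHANGES:]
```

Keeping the system prompt outside the window matters twice over: it preserves behaviour, and it keeps the static prefix identical so prompt caching still hits.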
4. The real unlock: we finally saw who was spending what
The technical fixes helped. But the biggest shift came when we could actually attribute spend to customers. Turns out three enterprise accounts were each burning $2k+/month on our document analysis feature — way more than we'd modelled. And our free tier? 35% of total spend. Zero revenue.
We'd been optimising in the dark. We knew total spend. We knew which models were used. We didn't know which customers were driving it. Once we had that breakdown, we could have real conversations: raise limits for the heavy enterprise users (and charge for it), cap the free tier, and route simpler workflows to cheaper models. We ended up cutting the bill by about 40% while improving margins on our best customers.
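The core of that attribution is unglamorous: tag every API call with the customer and feature, log the token counts, and aggregate. A minimal sketch (field names and the rate table are illustrative, not PerUnit's actual schema):

```python
# Aggregate spend per customer from a usage log of tagged API calls.
from collections import defaultdict

# $/M tokens; illustrative rates for the two models we route between.
RATES = {
    "gpt-4o":      {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    r = RATES[model]
    return (input_tokens * r["input"] + output_tokens * r["output"]) / 1_000_000

def spend_by_customer(usage_log) -> dict:
    # usage_log: iterable of dicts with customer_id, model, token counts.
    totals = defaultdict(float)
    for row in usage_log:
        totals[row["customer_id"]] += request_cost(
            row["model"], row["input_tokens"], row["output_tokens"])
    return dict(totals)
```

Group the same log by feature or pricing tier instead of `customer_id` and you get the other two breakdowns for free. This is the view that surfaced our three $2k/month accounts and the free-tier drain.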
The technical optimisations got us part of the way. Visibility into who was driving spend got us the rest. If you're in a similar spot — bill growing, dashboard not telling you enough — we built PerUnit to solve exactly that: cost per customer, per feature, per tier. So you can see the numbers before you start cutting or repricing. Want to estimate costs first? Try our free AI cost calculator or token counter — no sign-up required.