When GPT-4o Mini Is Good Enough (And When It Quietly Isn't)
Our document tagging feature was costing $1,200/month on GPT-4.1. We ran the same prompts through GPT-4o mini for a week and the outputs looked, to a human reader, identical. We switched. The next month it cost $180 — an 85% cut on one feature.
Riding the high, we tried the same trick on our support chatbot. That one was a mistake. The first week we shipped it, mini confidently told a customer they could cancel and get a full refund. Our policy said otherwise. The refund we issued, plus the churn that followed, was bigger than a year of mini savings on that workload.
The gap, in numbers
GPT-4o mini is $0.15/1M input and $0.60/1M output. GPT-4.1 is $2.00/1M input and $8.00/1M output. That's roughly a 13× difference on both input and output, so the gap holds whatever the input/output mix. On any high-volume feature, the savings are large enough that the question isn't "is mini cheaper" but "does mini hold up here?"
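To make the arithmetic concrete, here is a minimal Python sketch that turns the per-token prices quoted above into a monthly bill. The prices come from this post; the function names and the token volumes in the usage note are hypothetical, so substitute your own measured usage.

```python
# USD per 1M tokens, copied from the prices quoted in this post.
PRICES = {
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "gpt-4.1": {"input": 2.00, "output": 8.00},
}


def monthly_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the cost in USD for a given monthly token volume."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


def price_ratio(input_tokens: int, output_tokens: int) -> float:
    """How many times more GPT-4.1 costs than GPT-4o mini for the same volume."""
    return (
        monthly_cost("gpt-4.1", input_tokens, output_tokens)
        / monthly_cost("gpt-4o-mini", input_tokens, output_tokens)
    )
```

For a hypothetical feature that uses 300M input tokens and 50M output tokens a month, `monthly_cost` gives about $1,000 on GPT-4.1 and about $75 on GPT-4o mini. Real bills can differ from this estimate if prompts, caching, or retries differ between models.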
Where mini was clearly fine
Classification, extraction, summarisation of straightforward text, basic tag-this-email tasks. Document tagging, intent detection, "is this a support question or a billing question" routing. We moved three features. Thousands of requests a day. Quality complaints from users on those features: zero.
Where mini quietly wasn't
Anywhere a wrong answer was expensive. Support chat (the refund story above). Code generation where the output goes straight into a PR. Anything answering compliance or pricing questions. The mistakes mini made on these workloads weren't visible in spot-checking — they showed up later as support tickets, refunds, or a Slack thread starting with "did anyone else see this?"
The pattern was: mini fails gracefully on tasks where "close enough" is fine, and fails catastrophically on tasks where the model is the last line of defence before the customer.
How we decided
For each candidate feature we routed 10% of traffic to mini for a week and watched three things: support ticket volume tied to that feature, completion rate (did the user accept the AI output or redo it), and any human-review escalations. If all three held, we shipped. If any moved meaningfully in the wrong direction, we rolled back. Two of the three tests passed. One didn't. Same A/B framework, two different answers, which is the whole point.
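The test described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code behind these results: the hash-based bucketing, the `ArmMetrics` fields, and the 10% relative tolerance are all hypothetical, since "meaningfully" is not pinned to a number here.

```python
import hashlib
from dataclasses import dataclass


def assign_model(user_id: str, feature: str, mini_share: float = 0.10) -> str:
    """Deterministically send roughly `mini_share` of users to mini.

    Hashing feature + user keeps each user on the same arm for the whole test.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "gpt-4o-mini" if bucket < mini_share else "gpt-4.1"


@dataclass
class ArmMetrics:
    requests: int
    support_tickets: int      # tickets tied to this feature
    accepted_outputs: int     # user kept the AI output instead of redoing it
    review_escalations: int   # outputs escalated to human review


def rate(count: int, requests: int) -> float:
    return count / requests if requests else 0.0


def should_ship(control: ArmMetrics, mini: ArmMetrics, tolerance: float = 0.10) -> bool:
    """Ship only if all three signals hold within the (assumed) relative tolerance."""
    tickets_ok = rate(mini.support_tickets, mini.requests) <= (
        rate(control.support_tickets, control.requests) * (1 + tolerance)
    )
    completion_ok = rate(mini.accepted_outputs, mini.requests) >= (
        rate(control.accepted_outputs, control.requests) * (1 - tolerance)
    )
    escalations_ok = rate(mini.review_escalations, mini.requests) <= (
        rate(control.review_escalations, control.requests) * (1 + tolerance)
    )
    return tickets_ok and completion_ok and escalations_ok
```

A relative tolerance is only one way to define "held". On low-volume features a week of data at 10% of traffic may be too noisy for a fixed threshold, and a rare but expensive failure may not show up in these rates at all.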
To estimate what each feature would cost on each model before you run the test, the model comparison tool will give you the per-feature numbers. Once you know what each feature costs, PerUnit shows you which customers are actually using each — so the routing decision is based on real usage, not assumptions.