Emilio Carrión
The Bill Nobody Saw Coming: Business Case and Tokens to Scale Your AI Feature Without Going Broke
Integrating AI without a rigorous business case can turn a great feature into a cost nightmare. Here's how to estimate better and optimize tokens without sacrificing quality.
Monday, 9:07 AM. Slack message: "Hey, can we review the cost of the AI feature?" That message landed just one week after deployment, and it caught us still riding the high. During the pilot everything had gone smoothly: few users, contained consumption, and very positive feedback. It looked like one of those stories where everything clicks on the first try.
But when real traffic hit, the picture changed. More users showed up than expected, with higher usage frequency and prompts considerably longer than in the controlled environment. Consumption stopped looking anything like the pilot and, almost without noticing, we went from talking about the value of the feature to talking about the bill.
The frustrating part was discovering it wasn't a bug or a stability issue. The implementation worked. What failed was the economic design of the decision: we didn't have a solid business case for scaling.
In 60 seconds: Without a business case, an AI feature isn't a product: it's a bet. If you don't model cost per invocation, scaling scenarios, and spending limits, it'll blow up in production. The good news: with a couple of design decisions (prompt, context, and routing), you can reduce tokens aggressively without tanking quality.
The Business Case Isn't Bureaucracy, It's Protection
With AI there's a very common trap: a brilliant demo makes you feel like the problem is solved. And it makes sense, because when you see good, fast responses, the team goes into "let's ship it" mode.
The problem is that a demo doesn't pay for production, and a pilot almost never represents the behavior of a full user base. On top of that, the cost of an LLM isn't fixed like a traditional license: it depends on actual usage and the tokens you send in each call.
That's why the business case needs to answer three questions before deployment:
- How much does each invocation cost in a real scenario?
- What happens if usage doubles or triples?
- Where does it stop being profitable?
If you can't answer those three questions in writing, you're not ready to scale without surprises.
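To make the first question concrete, it helps to write the cost model down as code, however small. A minimal sketch; the prices and token counts below are hypothetical placeholders, not real vendor pricing:

```python
# Minimal per-invocation cost model. All prices and token counts are
# illustrative assumptions -- plug in your vendor's actual rates.

def invocation_cost(input_tokens: int, output_tokens: int,
                    price_in_per_m: float, price_out_per_m: float) -> float:
    """Cost of a single call, given per-million-token prices."""
    return (input_tokens / 1e6) * price_in_per_m \
         + (output_tokens / 1e6) * price_out_per_m

# Example: 2,800 input tokens (system prompt + context), 400 output tokens,
# at $3 / $15 per million tokens (illustrative numbers only).
cost = invocation_cost(2_800, 400, price_in_per_m=3.0, price_out_per_m=15.0)
print(f"${cost:.4f} per call")  # ~$0.0144
```

Once this function exists, the other two questions become multiplications over it instead of guesses.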
Three Mistakes We Made (and You Can Avoid)
1. Extrapolating from the Pilot Without Adjusting
We assumed that "if it works in the pilot, production will be similar." That was the first mistake.
In production, power users showed up, with more interactions per session and more context per request. Consumption didn't grow linearly; it started climbing in steps, and that quickly broke our projections.
Practical rule: don't scale with a single estimate. Work with three scenarios (baseline, high, and stress) and make decisions based on the high one.
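The three scenarios can live in a few lines of code next to the cost model. A sketch, with user counts and multipliers that are purely illustrative:

```python
# Three-scenario projection: decide against the "high" scenario, not baseline.
# All numbers are illustrative assumptions.

COST_PER_CALL = 0.0144          # from your per-invocation cost model
CALLS_PER_USER_DAY = 6          # observed in the pilot

scenarios = {
    "baseline": {"users": 5_000,  "usage_multiplier": 1.0},
    "high":     {"users": 10_000, "usage_multiplier": 2.0},  # power users show up
    "stress":   {"users": 15_000, "usage_multiplier": 3.0},
}

for name, s in scenarios.items():
    monthly = (s["users"] * CALLS_PER_USER_DAY * s["usage_multiplier"]
               * COST_PER_CALL * 30)
    print(f"{name:8s} ~${monthly:,.0f}/month")
```

If the "high" number already makes you uncomfortable, you have your answer before deployment, not after.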
2. Not Breaking Down Cost by Component
At first we were only looking at an aggregate token metric and, from a distance, it seemed reasonable. When we forced ourselves to decompose by component, the real problem appeared: a huge system prompt and too much history in every call.
That's where we found an immediate improvement: we went from ~2,000 system prompt tokens to ~800, while maintaining very high perceived quality.
Practical rule: always separate system input, user context, and output. Otherwise, you don't know what to optimize.
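The decomposition itself can be trivial; what matters is doing it at all. A sketch with illustrative token counts:

```python
# Decompose tokens per call by component to see where the money actually goes.
# Counts are illustrative.

components = {
    "system_prompt": 2_000,   # charged on every single call
    "user_context":  1_500,   # history + retrieved documents
    "user_input":      300,
    "output":          400,
}

total = sum(components.values())
for name, tokens in components.items():
    print(f"{name:14s} {tokens:5d} tokens ({tokens / total:5.1%})")
```

In a breakdown like this, the system prompt alone is nearly half of every call, which is exactly the kind of signal an aggregate metric hides.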
3. Not Defining a Spending Ceiling
We also hadn't defined a "beyond this point it's not worth it." And without that number there are no clear alerts or automatic decisions; just reaction, discussion, and scrambling.
Practical rule: define unit economics per user and per transaction, and tie alerts to those limits.
Playbook to Cut Token Costs Without Killing Quality
The business case tells you how much you can spend; this section is about how to spend less without losing response quality.
Be Surgical with the System Prompt
Every system prompt token gets charged on every call. That's why it's one of the most powerful levers. If you're sending 2,000 system tokens across 100,000 calls/day, you're paying for 200 million tokens daily just in base instructions.
We trimmed from ~2,000 to ~800 tokens and perceived quality stayed virtually the same, with a clear cost reduction from day one.
When not to apply this: when you depend on extensive legal/compliance instructions that can't be summarized.
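The arithmetic behind that lever is worth making explicit. Using the numbers above and an illustrative input price:

```python
# Savings from trimming the system prompt: 2,000 -> 800 tokens
# across 100,000 calls/day. The price per million tokens is an assumption.

CALLS_PER_DAY = 100_000
PRICE_IN_PER_M = 3.0  # $ per million input tokens (illustrative)

def daily_prompt_cost(prompt_tokens: int) -> float:
    return CALLS_PER_DAY * prompt_tokens / 1e6 * PRICE_IN_PER_M

before, after = daily_prompt_cost(2_000), daily_prompt_cost(800)
print(f"${before:.0f}/day -> ${after:.0f}/day ({1 - after / before:.0%} less)")
```

A 60% reduction on the single largest component of every call, with no model change and no infrastructure work.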
Manage Conversation History Wisely
Don't send the full history on every turn by default. In many cases it works better to maintain a cumulative summary of the context and add only the last 3-4 complete interactions.
When not to apply this: in flows where complete text traceability is a functional requirement.
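The pattern is easy to sketch: the message list is built from the cumulative summary plus only the most recent turns. The message structure below is a generic sketch, not any specific vendor's API:

```python
# Build the message list from a running summary plus the last N full turns,
# instead of resending the entire history on every call.

RECENT_TURNS = 4

def build_messages(system_prompt: str, summary: str,
                   history: list[dict], user_input: str) -> list[dict]:
    recent = history[-RECENT_TURNS:]          # last few complete interactions
    context = f"{system_prompt}\n\nConversation so far (summary): {summary}"
    return ([{"role": "system", "content": context}]
            + recent
            + [{"role": "user", "content": user_input}])

history = [{"role": "user", "content": f"q{i}"} for i in range(20)]
msgs = build_messages("You are a support assistant.",
                      "User is debugging a billing issue.",
                      history, "And the invoice date?")
print(len(msgs))  # 1 system + 4 recent + 1 new user = 6
```

Updating the summary itself is one extra (cheap) call every few turns, which usually costs far less than the history it replaces.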
Classify Before You Process
Not every request deserves the same model or the same context. If you introduce a lightweight router (rules or a small model), you can reserve the expensive model for complex cases and route simple queries through a much more efficient path.
This isn't just technical; it's product prioritization applied to cost.
When not to apply this: at the start of a POC, when simplicity matters more than efficiency.
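A rules-first router can be almost embarrassingly simple and still capture most of the savings. A sketch; the model names and heuristics are assumptions, not recommendations:

```python
# Rules-first router: send simple queries to a cheap model, reserve the
# expensive one for complex requests. Names and rules are illustrative.

CHEAP_MODEL, EXPENSIVE_MODEL = "small-fast-model", "large-capable-model"

COMPLEX_HINTS = ("analyze", "compare", "explain why", "step by step")

def route(query: str) -> str:
    q = query.lower()
    if len(q) > 500 or any(hint in q for hint in COMPLEX_HINTS):
        return EXPENSIVE_MODEL
    return CHEAP_MODEL

print(route("What's my current plan?"))                    # small-fast-model
print(route("Analyze last quarter's churn step by step"))  # large-capable-model
```

Start with rules like these; graduate to a small classifier model only when the rules visibly misroute traffic.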
Cache What You Can
If there are repetitive queries, a semantic cache can avoid full LLM calls entirely. Implementing it well isn't trivial, but when there's real repetition it pays for itself quickly and significantly reduces bill variability.
When not to apply this: if your inputs are highly volatile or hyper-personalized.
Measure, Measure, Measure
Without observability, you're flying blind. At minimum you should have tokens by request type, cost per active user, consumption percentiles, and threshold alerts. If you're not measuring that, you only find out about problems when they're already expensive.
Where Product and Engineering Meet
This is the big learning: AI costs aren't just a backend topic or just a product topic. They're a meeting point between both.
When engineering understands unit economics, they design better; when product understands tokens, they prioritize better. In our case, by trimming context and routing smarter, we didn't just cut costs: the responses also improved, with less noise and more focus.
What I'd Do Differently Next Time
If I were starting over, before shipping an AI feature to production I'd put this in writing:
About the business: What value does this feature generate per user? How much am I willing to pay for that value? What's my acceptable monthly spending ceiling? At what point does it stop being profitable?
About scale: How many users will use this? How frequently? What happens if usage is double or triple the estimate? Do I have control mechanisms (rate limiting, per-user quotas)?
About technical optimization: Can I reduce my system prompt? Am I sending unnecessary context? Can I classify requests to use different models? Do I have real-time consumption metrics?
Back to the 9:07 Slack message. I wish that heads-up had reached us before deployment, when it was still cheap to correct course.
Without a business case, an AI feature isn't a product. It's a bet.
Question for you: What's the metric that has helped you the most in controlling AI costs in production?
