The Architecture Tax: Why Falling Token Prices Won't Save Your AI Budget

Token prices are dropping. Enterprise AI bills are climbing. If that sounds contradictory, you're not looking at the right layer.
A CTO I work with told me her AI coding bill comes out of her headcount budget. Every dollar on tokens is a dollar she can't spend on people. So she pulled the data. Her engineers fell into three tiers: low, medium, and high token consumers. Her gut said the medium users were her best engineers. The low users weren't adopting fast enough. The high users were letting it rip with no discipline.
She was right. But knowing that didn't fix anything. Because the problem wasn't who was spending. It was that her systems had no way to tell productive consumption from waste. Most companies are in the same position. They either cap everyone, punishing their best people, or let it run and explain the bill to the CFO every month.
The real issue isn't pricing. It's architecture. The design decisions teams make in the first weeks of building agentic systems compound into cost structures that no vendor discount will fix. I've seen this across every agentic deployment we've built. The expensive mistakes happen early and surface late.
Your context window is not a free resource
Bloated context is the biggest silent cost driver in agentic systems. Most teams default to passing full conversation history into every agent call, treating context like it's free. It's not. A context window filled to capacity on every call is a design failure.
The fix is simple in concept: separate long-term memory from working memory. Long-term memory lives in a retrieval layer. Embeddings, summaries, structured indexes. Working memory is the scoped, minimal context the agent needs for the task in front of it right now.
Instead of passing 40 turns of raw conversation into a follow-up call, summarize the prior state into a compact representation. The model doesn't need the full transcript. It needs the current intent, the relevant constraints, and the last decision made. We've seen this single change drop per-call token counts by 10x.
One agent doing everything is your most expensive decision
A monolithic agent that handles planning, execution, validation, and error recovery in a single call works great in a demo. It burns tokens in production.
The reason is structural: a single agent with broad scope needs broad context. It carries instructions for tasks it won't perform on this call. It reasons through routing decisions that should happen upstream. Every token spent on irrelevant reasoning is waste, and it compounds across every invocation.
Break it up. A planning agent that decides what to do. An execution agent that does it. A validation agent that checks the output. Each one gets a narrow prompt, a scoped context window, and a lower per-call cost. The orchestration layer decides who gets called and when. Not the agent itself.
When we built an autonomous healthcare agent that routes patient interactions, the cost difference between the first monolithic version and the decomposed production version wasn't incremental. It was structural. A fundamentally different cost curve at scale.
Cache the comprehension, not just the data
If an agent is analyzing the same codebase, the same regulatory document, or the same dataset on every invocation, you're paying for the same comprehension work over and over.
Caching in agentic systems works at multiple layers: embedding caches prevent re-vectorizing unchanged content; tool output caches store deterministic results; intermediate reasoning caches capture the output of expensive analytical steps so downstream agents consume the conclusion without re-deriving it.
Same principle every backend engineer learns early. If you're hitting the same resource with the same query and getting the same result, cache it. The only difference is the "resource" is a model that charges per token, which makes caching economics even more favorable than traditional systems.
Your agents need to know when to stop
Always-on agents running autonomous loops without explicit exit conditions will consume tokens whether or not they're producing value. I've seen teams discover workflows that ran for weeks after the business need ended. Still looping, still burning budget.
Every autonomous workflow needs three things designed before it ships: an iteration cap, a cost checkpoint that triggers review at defined thresholds, and an explicit exit condition. If your agent's only definition of "done" is "the task is complete," you've given the model control over your spend. That's a governance problem, not a technical one.
Not every task needs your best model
A formatting task, a classification step, a structured extraction. These don't require frontier-model reasoning. Yet most teams route everything through their most capable (and most expensive) model by default, because that's what they prototyped with and never revisited.
Build model routing as a first-class architectural decision. Define task categories. Route simple tasks to fast, cheap models. Save frontier reasoning for multi-step analysis, ambiguous inputs, and novel problems. Then instrument and measure: push tasks down to cheaper models until quality degrades, and you've found your efficiency frontier.
We've done this on our own internal tooling. We use a cheap model for structured data extraction and reserve expensive models for scoring, prioritization, and judgment. The scan cost dropped 60-70% with no quality loss on the part that matters.
You can't optimize what you can't see
Most teams have better observability into their database queries than their token consumption. They can tell you which SQL query is slowest but can't tell you which agent is burning the most tokens, on which task, in which workflow.
Tag every API call with the workflow, the agent, and the task from day one. This gives you a per-agent, per-workflow cost map that answers the questions your finance team will eventually ask: where is the spend concentrated, which agents are inefficient, and which workflows cost more than the value they produce.
This isn't optional instrumentation. It's what makes everything else in this post possible. You can't right-size context windows without knowing which agents carry bloated context. You can't optimize routing without knowing which tasks go to expensive models unnecessarily. Build the attribution layer first. Optimize second.
The bottom line
Token cost efficiency is not a procurement problem. It's a design problem. Scoped context, agent decomposition, caching, exit conditions, model routing, instrumentation. These are structural decisions that compound over time. Get them right early and costs scale sublinearly with capability. Get them wrong and no pricing discount will save you.
You don't negotiate your way out of a bad architecture. You design your way out of an expensive one.



