Cheaper Tokens, Bigger Bill

Jul 1

The price of an AI token is falling fast, and your AI bill is still going up. If you set this year's budget by watching headline prices drop, you set it wrong. Price isn't the number that decides your bill. Consumption is. Consumption is climbing faster than price is falling, because the work is shifting from chat to agents, and an agent burns tokens at a rate a chat window never touched. We call that jump the consumption cliff: the point where a task moves from something you ask to something that runs, and its token draw climbs by orders of magnitude on a single design decision.

Why is my AI bill rising when token prices are falling?

Because you're buying far more tokens than the price drop can offset. Blended per-token prices have fallen nearly 90% in under three years, from roughly $17 per million to around $2. Good news, on paper. Over that same stretch, enterprise AI spend didn't fall to match. It climbed, often several times over. The unit got cheaper and the invoice got bigger. You can watch this at the vendor level right now. Amazon is being moved off its old wholesale compute-hour rate onto token-based pricing like every other large customer, AWS just raised the price of its GPU capacity blocks, and contract rates for committed capacity keep rising even as spot prices soften. Part of what made AI feel cheap was a subsidy, and that subsidy is ending. Price is the reassuring number. It's also the wrong one to build a budget on.
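The arithmetic is worth running once. Here is a minimal sketch using the two rough price points above; the token volumes are made-up numbers for illustration, not measurements.

```python
OLD_PRICE = 17.0  # dollars per million tokens (rough figure from the text)
NEW_PRICE = 2.0   # dollars per million tokens (rough figure from the text)


def breakeven_volume_multiple(old_price: float, new_price: float) -> float:
    """How many times more tokens you can buy before the bill passes the old one."""
    return old_price / new_price


def bill(million_tokens: float, price_per_million: float) -> float:
    """Spend in dollars for a given volume at a given price."""
    return million_tokens * price_per_million


breakeven = breakeven_volume_multiple(OLD_PRICE, NEW_PRICE)

# Hypothetical volumes: 1,000M tokens before, 20x that after agents ship.
old_bill = bill(1_000, OLD_PRICE)
new_bill = bill(1_000 * 20, NEW_PRICE)

print(f"Break-even volume multiple: {breakeven:.1f}x")
print(f"Old bill: ${old_bill:,.0f}  New bill: ${new_bill:,.0f}")
```

At these prices, any growth in consumption beyond the break-even multiple means the invoice is larger than it was before the price fell.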

What actually drives the cost of an agent?

Repetition. An agent doesn't answer once and stop. It reads the task, calls a tool, reads what came back, then re-reads the entire history before it decides the next move, and it does that over and over until the job is finished. Every step re-sends everything before it. The context snowballs, and since you pay for input tokens on each call, the cost compounds loop by loop. A single chat exchange is one round trip. A real agentic task can be dozens, sometimes hundreds. That's why measured token use for complex agent work runs anywhere from tens of times to roughly a thousand times that of a single chat exchange, depending on how many loops the agent takes before it finishes. Same model, same per-token price. A bill that looks nothing alike. The consumption cliff has nothing to do with prices moving. It's the shape of the work changing underneath you. And it doesn't hold still. Run the same task twice and the token bill can swing hard, because the agent takes a different path each time and reads a different amount along the way. A cost you measured on a clean Tuesday won't hold when the same job runs against messier inputs on Friday.
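The compounding is easy to see in a toy model. The sketch below assumes every call re-sends the full history, counts input tokens only, and ignores any caching discount a provider may offer. The function name and the token counts are hypothetical.

```python
def agent_input_tokens(task_tokens: int, added_per_step: int, steps: int) -> int:
    """Total input tokens billed when each call re-sends the whole history."""
    total = 0
    context = task_tokens
    for _ in range(steps):
        total += context           # the full context goes in as input on this call
        context += added_per_step  # tool output and model reply join the history
    return total


# Hypothetical sizes: a 2,000-token task, 500 tokens appended per step.
one_chat_turn = agent_input_tokens(2_000, 500, steps=1)
forty_step_agent = agent_input_tokens(2_000, 500, steps=40)

print(one_chat_turn, forty_step_agent, forty_step_agent / one_chat_turn)
```

The total grows with the square of the step count, not in a straight line, which is why doubling the number of loops does much more than double the bill.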

Didn't cheaper AI mean cheaper operations?

No, and the reason catches people off guard. Falling prices don't only lower the cost of what you already run. They make new work economical that wasn't before, so you go do it. A workflow that made no sense at $17 per million pencils out at $2, so you build the agent, and that agent eats a hundred times what the old chat feature did. This is the move from the assistant to the workforce, and it carries a different cost signature. The assistant answered questions at a seat price you could predict. The workforce does jobs, and it meters by how hard each job runs. Every efficiency you booked at the chat layer can vanish at the agent layer, because the agent consumes on a different order entirely. Same price tag, much larger appetite.
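Put the two figures from that example side by side. This is a rough sketch, assuming the agent really does use a hundred times the tokens and that both prices are the blended rates quoted above.

```python
def spend_multiple(old_price: float, new_price: float, consumption_multiple: float) -> float:
    """New spend divided by old spend when price and consumption both change."""
    return consumption_multiple * new_price / old_price


# The agent at $2 per million versus the old chat feature at $17 per million.
agent_vs_chat = spend_multiple(17.0, 2.0, consumption_multiple=100)
print(f"Agent spend is about {agent_vs_chat:.1f}x the old chat spend")
```

Even measured against the old, higher price, the agent in this example costs more than eleven times what the chat feature did.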

How do I know if I'm about to walk off the cliff?

Look at how you quote your own AI costs. If the internal number you carry is a per-seat license or a price per million tokens, you're tracking the input and ignoring the part that actually scales. The figure that survives contact with production is cost per outcome: what one completed job costs, end to end, at real volume, once the loops and re-reads are counted. Most native billing dashboards won't hand you that. They show total spend, not spend per workflow or per outcome, so one runaway process hides inside an aggregate that looks fine until the quarter closes. If you can't attribute spend down to the workflow that caused it, you can't govern it, and you can't forecast it either. There's a second multiplier stacked on top of the first. One agent is a single workflow. A few hundred people each running agents across a dozen workflows is that per-task draw multiplied across the whole floor, every working day. The cliff repeats every time another workflow goes live, and the total climbs in steps you never see on a price sheet.
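If your billing dashboard won't give you cost per outcome, you can approximate it from call logs. The sketch below assumes you can tag every model call with a workflow and a job ID, which is a logging convention you would have to add yourself; the record shape, prices, and sample data are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Call:
    workflow: str
    job_id: str
    input_tokens: int
    output_tokens: int


def cost_per_outcome(calls, completed, in_price, out_price):
    """Dollars spent per completed job, by workflow.

    `completed` is a set of (workflow, job_id) pairs. Spend on failed or
    abandoned jobs still counts, because you paid for it.
    Prices are dollars per million tokens.
    """
    spend = defaultdict(float)
    for c in calls:
        spend[c.workflow] += (c.input_tokens * in_price + c.output_tokens * out_price) / 1_000_000
    outcomes = defaultdict(int)
    for workflow, _job_id in completed:
        outcomes[workflow] += 1
    # A workflow with spend but no completed jobs has no finite cost per outcome.
    return {wf: (total / outcomes[wf] if outcomes[wf] else float("inf"))
            for wf, total in spend.items()}


calls = [
    Call("triage", "j1", 1_000_000, 100_000),
    Call("triage", "j2", 500_000, 0),        # this job never finished
    Call("summary", "s1", 250_000, 50_000),
]
completed = {("triage", "j1"), ("summary", "s1")}

result = cost_per_outcome(calls, completed, in_price=2.0, out_price=10.0)
print(result)
```

Dividing by completed jobs rather than attempted ones is the design choice that matters here: a workflow that retries or stalls often shows up as expensive per outcome even when each individual call looks cheap.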

What should I do about the consumption cliff?

Ask your team this before the next agent ships: what does a single completed outcome cost, end to end, at production volume, counting every retry and every re-read? If nobody can answer, you don't have a budget yet, and the agent shouldn't scale until you do. Get the number first on a small, controlled run, then make it the gate every agent clears before it goes wide. Put a ceiling on how many steps a task may take and a kill condition that halts a loop that blows past it, so one stuck process can't run your bill into next month on its own. After that, watch cost per outcome as a live figure rather than a year-end surprise. The teams that get hurt here budgeted off the price sheet and never checked how much the work itself consumes. The token price was never the problem.
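A step ceiling and a kill condition can be a few lines wrapped around the agent loop. This is a minimal sketch, not any framework's API: `run_with_limits`, `RunawayTask`, and the `step_fn` callback are hypothetical names, and the token ceiling is an extra guard added here on top of the step limit described above.

```python
class RunawayTask(RuntimeError):
    """Raised when a task blows past its step or token ceiling."""


def run_with_limits(step_fn, max_steps: int, max_tokens: int):
    """Run an agent loop until it finishes or hits a ceiling.

    step_fn(step_number) must return (done, tokens_used_this_step).
    Returns (steps_taken, total_tokens) on success.
    """
    total_tokens = 0
    for step in range(1, max_steps + 1):
        done, used = step_fn(step)
        total_tokens += used
        if done:
            return step, total_tokens
        if total_tokens > max_tokens:
            raise RunawayTask(f"token ceiling exceeded at step {step}: {total_tokens}")
    raise RunawayTask(f"step ceiling of {max_steps} reached without finishing")


# A stand-in task that finishes on its third step, using 100 tokens per step.
steps_taken, tokens_spent = run_with_limits(
    lambda step: (step == 3, 100), max_steps=10, max_tokens=10_000
)
print(steps_taken, tokens_spent)
```

In practice the exception handler is where you would log the workflow and job ID, so the halted run shows up in your cost-per-outcome figures instead of disappearing.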

Want more insight on your AI costs and how to track them? Reach us at contact@theyor.com

Nik Mercado