The Token Shortage Is Here. The Companies That Win Will Waste the Least.

Jun 4

For two years the implicit advice to every company adopting AI was the same. Use more of it. Drive adoption. Get your people to lean in. Some firms ran internal leaderboards rewarding employees for burning the most tokens. That advice made sense in a world where AI was effectively subsidized and the only real risk was using too little. That world ended this spring. We are now in a structural token shortage, and the strategy that worked in the era of abundance is exactly the strategy that will sink you in the era of scarcity.

The shift showed up everywhere in May. Axios ran a piece titled "AI sticker shock hits corporate America." Companies that had set up token-maxing programs started quietly scrapping them, Amazon among the most recent. They did not do it because employees were gaming the leaderboards. They did it because the math stopped working. The new constraint is not adoption. It is cost per unit of value. Here is what the token shortage actually is, why it is not going away soon, and how to compete inside it.

What is the AI token shortage?

It is a structural mismatch between how much AI the world wants to use and how much compute exists to produce it.

The simple version is that demand for AI is growing far faster than the supply of compute that serves it. Inference capacity is expanding quickly, but demand is expanding much faster, and that gap does not close just because everyone wishes it would. The result is that tokens, the units in which AI input and output are metered and billed, are scarce and therefore expensive. This is not a temporary spike caused by one busy quarter. It is the defining condition of the period we are entering, and it changes the fundamental economics of every AI decision a company makes.

For two years those economics were hidden. The major labs subsidized usage heavily, sometimes giving power users ten to twenty times the token value they paid for. That subsidy let companies treat AI as if it were close to free, which encouraged exactly the behavior you would expect. Use as much as possible. Do not think about the cost. The subsidy is now ending, the true cost of heavy usage is becoming visible, and a lot of AI strategies that were quietly built on subsidized inference are about to discover what they actually cost.

Why won't the token shortage just resolve itself?

Because you cannot build compute as fast as people are finding new ways to consume it.

Everything that made AI more useful in the last year also made it consume dramatically more tokens. The move from simple chat to agentic workflows is the big one. An agent that reasons for ten seconds, holds a large context, calls tools, reads the results, and verifies its own work consumes orders of magnitude more tokens than a quick question and answer. As agents spread out of coding and into every kind of knowledge work, that consumption multiplies across the entire economy at once. Supply is racing to catch up through enormous infrastructure buildouts, but new data centers, chips, and power take years to deliver. Demand can spike the moment a company turns on a new agent.
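To make the "orders of magnitude" claim concrete, here is a rough token-count sketch in Python. It assumes, hypothetically, that an agent re-sends its whole accumulated context on every call and then appends its own reasoning and each tool result to that context. The function names and every number below are illustrative. None of them are measurements from a real product.

```python
def chat_tokens(prompt: int, answer: int) -> int:
    """Tokens for a single question-and-answer exchange."""
    return prompt + answer


def agent_tokens(base_context: int, steps: int,
                 reasoning_per_step: int, tool_output_per_step: int) -> int:
    """Rough token total for a multi-step agent loop (hypothetical model).

    Assumes every call re-reads the full context and generates
    reasoning_per_step tokens. That reasoning and the tool result are then
    appended to the context for the next call.
    """
    total = 0
    context = base_context
    for _ in range(steps):
        total += context + reasoning_per_step                 # input read + output generated
        context += reasoning_per_step + tool_output_per_step  # context grows every step
    return total


chat = chat_tokens(prompt=200, answer=400)
agent = agent_tokens(base_context=8_000, steps=10,
                     reasoning_per_step=1_000, tool_output_per_step=2_000)
print(chat, agent, agent // chat)  # 600 225000 375
```

The context grows on every step, so under these assumptions the total grows roughly quadratically with the number of steps. That is why long agent runs get expensive so quickly.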

This is why the token shortage has a long runway. It is not a supply chain hiccup that clears in a quarter. It is the predictable result of a technology whose appetite grows faster than the infrastructure that feeds it. Even the largest players are rationing. There were reports in May that the US government weighed in on access to the most powerful models in part because it understood the scarcity and wanted first claim on the tokens. When governments start thinking about who gets the compute, you are not in a temporary shortage. You are in a new normal.

How does scarcity change the AI advantage?

It flips the question from how much AI you use to how much value you get per token.

In the abundance era, the companies that looked like they were winning were the ones using the most AI. Adoption was the metric. That was always a flawed proxy, because tokens consumed is an input, not an output, and inputs tell you nothing about value created. The subsidy hid the flaw. Scarcity exposes it brutally. When every token has a real and rising cost, using more AI is not a sign of sophistication. It is a sign of waste unless that usage is producing proportionate value.

The new advantage is efficiency. Not using less AI for its own sake, but extracting more outcome from each token you spend. Two companies can run the same agent on the same model and get wildly different economics depending on how well the system is designed. One wastes tokens on bloated context, redundant calls, and work that never gets checked for value. The other is engineered so that every token is pointed at something that matters. In the abundance era those two companies looked the same, because the waste was free. In the scarcity era the efficient one has a structural cost advantage that compounds on every invoice, and the wasteful one pays a premium on every invoice that its competitor does not.

What does token efficiency actually look like in practice?

It is mostly engineering and measurement, not magic.

Match the model to the task. The biggest single source of waste is running expensive frontier models on work that a cheaper model handles just as well. The cheaper alternatives stopped being toys this spring. Cursor's Composer 2.5 matches frontier coding models on several benchmarks at roughly a tenth of the cost per task. Routing the easy majority of work to a cheap capable model and reserving the frontier for the genuinely hard cases can cut spend dramatically with no loss in quality where it counts.
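Here is a minimal sketch of model-task routing in Python. The two model tiers, their per-million-token prices, and the 0 to 1 difficulty score are all hypothetical placeholders. In a real system the score might come from a small classifier or from simple rules. You would also need your own evaluations to confirm that the cheap tier really holds quality on the easy cases.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    usd_per_million_tokens: float


# Hypothetical tiers and prices, for illustration only.
CHEAP = Model("cheap-capable-model", 1.0)
FRONTIER = Model("frontier-model", 10.0)


def route(difficulty: float, threshold: float = 0.8) -> Model:
    """Send only tasks scored at or above the threshold to the frontier tier."""
    return FRONTIER if difficulty >= threshold else CHEAP


def workload_cost(tasks: list[tuple[float, int]], router) -> float:
    """Total dollars for (difficulty, tokens) pairs under a routing policy."""
    return sum(tokens * router(difficulty).usd_per_million_tokens / 1_000_000
               for difficulty, tokens in tasks)


# Eight easy tasks and two hard ones, 50k tokens each (made-up workload).
tasks = [(0.2, 50_000)] * 8 + [(0.9, 50_000)] * 2
routed = workload_cost(tasks, route)
all_frontier = workload_cost(tasks, lambda _difficulty: FRONTIER)
print(f"routed ${routed:.2f} vs all-frontier ${all_frontier:.2f}")  # routed $1.40 vs all-frontier $5.00
```

The threshold is the main dial. Lowering it sends more work to the frontier tier and raises cost, so it is worth tuning against measured quality rather than intuition.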

Control the context. Tokens are consumed by what you feed the model as much as by what it produces. Systems that stuff entire documents and long histories into every call are burning money on context the model does not need. Tight, deliberate context is one of the highest-leverage efficiency moves available, and almost nobody does it well by default.
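One way to keep context tight is to give every call a token budget and fill it only with the material most relevant to the request. The Python sketch below is deliberately crude, and every piece of it is an assumption. It counts whitespace-separated words as a stand-in for real tokenizer counts. It also scores relevance by keyword overlap, where a production system would more likely use embeddings or a retriever.

```python
def approx_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def build_context(query: str, chunks: list[str], budget: int) -> list[str]:
    """Pick the most query-relevant chunks that fit within a token budget."""
    query_terms = set(query.lower().split())

    def relevance(chunk: str) -> int:
        # Number of distinct query words that also appear in the chunk.
        return len(query_terms & set(chunk.lower().split()))

    selected, used = [], 0
    for chunk in sorted(chunks, key=relevance, reverse=True):
        if relevance(chunk) == 0:
            break  # remaining chunks share no words with the query
        cost = approx_tokens(chunk)
        if used + cost > budget:
            continue  # skip chunks that would overflow the budget
        selected.append(chunk)
        used += cost
    return selected


chunks = [
    "refund policy allows returns within 30 days",
    "company history founded in 1998",
    "refund requests need an order number",
]
print(build_context("how do I request a refund", chunks, budget=10))
```

With a budget of 10, only the refund-policy chunk makes it in. The company-history chunk is dropped because it is irrelevant, so the model never pays to read it.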

Measure value per token, not tokens consumed. If your dashboards show usage going up and you read that as success, you have the wrong dashboard for this era. The number that matters is the outcome you got divided by the tokens you spent to get it. You cannot optimize what you do not measure, and most companies are still measuring the input.
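Here is a sketch of the metric in Python. Deciding what counts as an "outcome" is the hard part, and that is left to you. The `PeriodUsage` record, the monthly figures, and the per-million scaling are hypothetical choices. They exist only to show how a usage dashboard and a value-per-token dashboard can tell opposite stories.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeriodUsage:
    label: str
    tokens: int
    outcomes: int  # whatever the team counts as value: tickets resolved, PRs merged, etc.


def value_per_million_tokens(usage: PeriodUsage) -> float:
    """Outcomes delivered per million tokens spent in a period."""
    if usage.tokens <= 0:
        raise ValueError("tokens must be positive")
    return usage.outcomes / (usage.tokens / 1_000_000)


# Made-up figures for two consecutive months.
april = PeriodUsage("April", tokens=40_000_000, outcomes=800)
may = PeriodUsage("May", tokens=90_000_000, outcomes=900)
for period in (april, may):
    print(period.label, value_per_million_tokens(period))
```

A usage dashboard would show May as a 125 percent jump in adoption. This one shows each token delivering half the value it did in April.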

Stop rewarding consumption. If you set up incentives that reward your people for using more AI, you built an engine for waste and it is now running against you. Amazon and others spent May unwinding exactly these programs. The incentive should reward outcomes, which is harder to measure and far more valuable to get right.

Is the token shortage actually bad news?

For most companies, it is the opposite, and that is the part the panic coverage misses.

A token shortage is a sorting mechanism. It separates the companies that built real value on AI from the ones that were running on subsidized hype, and it does so by making waste expensive. If your AI use was genuinely creating value, scarcity barely touches you, because value comfortably exceeds the rising cost. If your AI use was theater funded by someone else's subsidy, scarcity exposes it. The companies that figure out how to operate efficiently in this period will pull away from the ones that cannot, and they will do it while their competitors are distracted by sticker shock and bubble talk.
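You can check the sorting effect with simple per-task arithmetic. In the Python sketch below, the dollar value of each task, the tokens it uses, and both token prices are invented for illustration. The point is only that one price increase can trim one workflow's margin while pushing another's below zero.

```python
def margin_per_task(value_usd: float, tokens_per_task: int,
                    usd_per_million_tokens: float) -> float:
    """Value a task creates minus what its tokens cost."""
    cost = tokens_per_task * usd_per_million_tokens / 1_000_000
    return value_usd - cost


# Two hypothetical workflows, each using 200k tokens per task.
workflows = {"high-value": 5.00, "low-value": 0.50}
for price in (2.0, 6.0):  # token price before and after a 3x increase
    for name, value in workflows.items():
        margin = margin_per_task(value, 200_000, price)
        print(f"{name} at ${price:.0f} per million tokens: {margin:+.2f} USD per task")
```

After the increase, the high-value workflow still clears a healthy margin and the low-value one loses money on every task. Which side of zero a workflow lands on is the sorting described above.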

There is a real opportunity in the discomfort. Scarcity forces the discipline that abundance let everyone skip. The teams that learn to get frontier-level outcomes on efficient budgets right now are building a capability their competitors will need and not have. When everyone else is opting out for a few months because the bills scared them, the company that leans in and learns to do AI efficiently is buying advantage at a discount.

What should you do about the token shortage this quarter?

Forget sorting yourself into a maturity bucket. The work is the same regardless of where you are. It just gets more urgent the more you spend. Here it is, in the order that pays off fastest.

Start by changing the number you watch. If your AI dashboard shows tokens consumed and you read a rising line as progress, you are measuring the wrong thing for this era. Replace it with value per token, the outcome you got over what you spent to get it. Almost nobody tracks this, which is exactly why it is the highest-leverage change available. You cannot cut waste you cannot see, and right now the waste is invisible.

Then go after the two biggest sources of that waste, in this order. First, model-task fit. Most companies run their most expensive model on work a far cheaper one would handle identically. Move the easy majority of your workload down to a cheaper capable model and reserve the frontier for the cases that genuinely need it. Second, context discipline. Tokens are burned by what you feed the model as much as by what it returns, and most systems stuff far more into every call than the work requires. Tightening that is unglamorous, and it is where a surprising share of the bill hides.

Last, kill any incentive that rewards consumption. If you built a program that celebrates people for using more AI, you built an engine for waste, and it is now running against you while you read this. Amazon and others spent May quietly switching theirs off. Reward outcomes instead. It is harder to measure and it is the only thing worth measuring.

Do those four things and you will look, by the end of the year, like AI got cheaper for you. It did not. You just stopped wasting it.

The abundance era rewarded the companies that used the most AI. The scarcity era rewards the companies that waste the least. That is not a smaller game. It is a better one, and it favors the disciplined over the loud.

YOR.AI builds AI systems engineered for value per token, so your AI spend tracks your outcomes instead of your enthusiasm. If your costs are climbing faster than the results you can name, reach us at contact@theyor.com

Nik Mercado