Stop Buying Bigger Context Windows
The default response to rising agent cost is predictable: buy a model with a bigger context window.
It sounds sensible. If the agent forgets things, give it more room. If quality degrades over long runs, perhaps the model just needs to see more.
Usually it does not.
Most token burn is not a model-capability problem. It is an architecture problem. Teams stuff entire files into prompts, replay full transcripts by default, let every sub-agent inherit the same global sludge, and pass raw tool output straight through to the model. Then they act surprised when cost spikes and quality falls off a cliff.
The real bottleneck is often not the hard context limit. It is context rot: the gradual degradation that happens when relevant information is diluted by stale, duplicated, low-signal junk. Long before you hit the advertised million-token ceiling, your agent is already less reliable because you made it read too much nonsense.
That is what most “just buy a bigger window” thinking misses. Bigger windows are useful. They are not a substitute for memory design, tool discipline, and state management.
The wrong diagnosis
There is a comforting story that says agents are expensive because models are still too small. It shifts blame onto vendors and benchmarks. If only the window were larger, the architecture could stay sloppy.
But look at where the tokens actually go in production systems.
They go into repeated prompt boilerplate. They go into verbose reasoning traces no downstream step needs. They go into giant tool descriptions the model never uses. They go into raw JSON payloads, logs, HTML, search results, and full documents that should have been filtered before they ever touched the prompt. They go into conversations where every new step drags the whole past behind it like a caravan of broken furniture.
That is not intelligence. It is bad systems design.
Phil Schmid’s writing on context engineering gets this right: long-running agents are less about maximizing visible context and more about selecting, isolating, and compressing the right context for the current step. The model needs useful state, not the entire archaeological record.
The “Production-Grade Agentic AI Workflows” guidance lands in the same place from a different angle: keep systems simple, use tools deliberately, externalize state, and give components single responsibilities.
Context rot matters before hard limits
Teams talk about context windows like disk capacity: if you have more, you can safely store more. That is the wrong mental model.
Context behaves more like a cluttered workbench. Once it fills with half-useful debris, performance gets worse even if there is still technically space left. The model can attend to all of it in theory. In practice, attention is not free, relevance is uneven, and long prompts introduce noise, contradiction, and drift.
This is why context rot shows up before hard failure. The agent starts missing key instructions, over-weighting stale details, or anchoring on earlier tool outputs that are no longer relevant. It becomes less precise while still looking busy.
Zylos’s work on context compression is useful here because it treats compression as a control system, not a one-off summarization trick. You do not wait until the prompt is a landfill and then ask the model for a heroic summary. You compress iteratively, preserve anchors, and trigger compaction before drift compounds. That is much closer to how competent systems should behave.
A large context window can delay the crash. It does not prevent it.
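Treating compaction as a control loop rather than a one-off summary can be sketched in a few lines. Everything here is an illustrative assumption: the soft limit, the chars-per-token heuristic, and the `summarize` callback stand in for whatever tokenizer and compaction model a real system would use.

```python
# Sketch of compaction as a control loop: trigger well before the hard
# window, keep recent turns verbatim, and leave an anchor behind so the
# compacted material can be recovered. All names and thresholds are
# hypothetical.

SOFT_LIMIT = 8_000   # compact long before the advertised ceiling
KEEP_RECENT = 6      # most recent turns stay verbatim

def estimate_tokens(messages):
    # Crude stand-in for a real tokenizer: roughly 4 chars per token.
    return sum(len(m["content"]) for m in messages) // 4

def compact(history, summarize):
    """Compress older turns into one anchored summary; keep recent turns."""
    if estimate_tokens(history) < SOFT_LIMIT:
        return history                        # below threshold: do nothing
    old, recent = history[:-KEEP_RECENT], history[-KEEP_RECENT:]
    summary = summarize(old)                  # model- or rule-based compaction
    anchor = {
        "role": "system",
        "content": f"[compacted {len(old)} turns; sources retained] {summary}",
    }
    return [anchor] + recent
```

The important property is that the trigger fires early and repeatedly, so no single compaction pass has to be heroic.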
Raw inputs
(chat history, files, tool output, retrieved docs)
|
v
+-------------------------------+
| Selection / compaction layer |
| - prune noise |
| - compress history |
| - retrieve relevant memory |
| - load structured state |
+-------------------------------+
|
v
Active prompt / working set
|
v
Model step
|
v
New outputs -> tools / memory updates / state changes
What the memory evidence actually says
The most interesting evidence in this area does not support brute-force replay. It supports memory hierarchies.
The paper Evaluating Long-Term Memory for Long-Context Question Answering is especially awkward for the “just stuff everything in” camp. The headline result is the one to remember: systems using external memory and retrieval can cut token usage by more than 90 percent while staying competitive on long-context QA. That is not a rounding error. That is a different operating model.
MemGPT provides the conceptual frame. It treats model context like virtual memory: a limited fast tier backed by larger external stores. The trick is not pretending everything belongs in the active prompt. The trick is managing what gets promoted into working memory, what gets compacted, and what stays in cheaper storage until needed.
This is how competent systems work everywhere else. CPUs have caches. Databases have indexes. Operating systems page memory. Nobody keeps all state in the fastest, most expensive tier at all times. Yet agent systems routinely do exactly that with tokens.
Context should be the hot path, not the warehouse.
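The virtual-memory analogy can be made concrete with a two-tier sketch: a small working set that reaches the prompt, backed by a larger external store. This is in the spirit of MemGPT's framing, not its actual API; the class, method names, and FIFO eviction policy are all simplifying assumptions.

```python
# Minimal two-tier memory: a bounded hot tier (goes into the prompt) backed
# by a cheap cold tier. Promotion, not inclusion-by-default, decides what
# the model sees. A hypothetical sketch, not MemGPT's real interface.

class TieredMemory:
    def __init__(self, working_capacity=4):
        self.working = {}        # hot tier: becomes part of the active prompt
        self.store = {}          # cold tier: external storage
        self.capacity = working_capacity

    def write(self, key, value):
        self.store[key] = value  # everything lands in cheap storage first

    def promote(self, key):
        """Pull a fact into working memory, evicting the oldest if full."""
        if key not in self.store:
            raise KeyError(key)
        if len(self.working) >= self.capacity:
            evicted = next(iter(self.working))   # FIFO eviction for the sketch
            del self.working[evicted]
        self.working[key] = self.store[key]

    def active_context(self):
        return dict(self.working)    # only the hot tier reaches the model
```

Nothing is lost on eviction: the cold tier still holds the fact, and it can be promoted again when a later step actually needs it.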
The architecture patterns that actually reduce token burn
1. Compaction beats repeated lossy summarization
Summarization is often treated as a cleanup button. The problem is that repeated naive summarization compounds information loss. Each pass strips nuance, removes anchors, and increases the odds that future steps inherit a neat but wrong version of the past.
Compaction is a better pattern. Keep durable facts, decisions, open threads, constraints, and references in a structured compact form. Drop chatter, duplicates, and completed branches. Preserve anchors that let the system recover source material when needed.
Good compaction is not “make this shorter.” It is “keep what remains operationally useful.”
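As a sketch, compaction looks less like summarization and more like selective retention with anchors. The event `kind` labels and field names below are assumptions made for illustration.

```python
# Compaction as selective retention: keep durable event kinds, drop chatter
# and completed branches, and preserve source anchors so dropped detail can
# be reloaded later. Event schema is hypothetical.

DURABLE = {"decision", "constraint", "open_thread", "reference"}

def compact_events(events):
    """Keep operationally useful events; drop everything else."""
    kept = [e for e in events if e["kind"] in DURABLE and not e.get("resolved")]
    # Anchors survive even for dropped events, so nothing is unrecoverable.
    anchors = [e["source"] for e in events if "source" in e]
    return {"state": kept, "anchors": sorted(set(anchors))}
```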
2. Retrieval beats replay
If a fact might matter later, store it and retrieve it when relevant. Do not replay the entire prior conversation or dump full documents into every prompt just because something in there could matter.
This is where the long-term memory literature and MemGPT line up neatly. External memory is not just a cost optimization. It improves relevance selection. The current step gets the fragments that matter, not the entire haystack.
Retrieval only works if the stored state is sane. Raw transcript chunks are better than nothing, but structured memory is better still: decisions, entities, deadlines, assumptions, source links, unresolved questions. A retrieval system can work with that.
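The retrieval step itself can be trivially small. In the sketch below, naive keyword overlap stands in for a real embedding search; the point is the shape of the operation, not the scoring function.

```python
# Retrieval instead of replay: score stored fragments against the current
# step and return only the top matches. Keyword overlap is a deliberately
# crude stand-in for embedding similarity.

def retrieve(memory, query, k=3):
    """Return the k fragments most relevant to the current step."""
    q = set(query.lower().split())
    scored = sorted(
        memory,
        key=lambda frag: len(q & set(frag.lower().split())),
        reverse=True,
    )
    return scored[:k]
```

The current step gets a handful of relevant fragments; the haystack stays in storage.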
3. Context isolation is not optional
One of the worst habits in agent design is shared global context across every actor. The planner sees everything. The worker sees everything. The reviewer sees everything. Soon every prompt is bloated with state only one step needed.
Context isolation fixes that. Give each agent or step only the local state required to do its job. If a planning agent is deciding next actions, it does not need the raw HTML from a scraper. If a coding worker is editing one file, it does not need the full chat history that led to the ticket.
Schmid makes this point well: agent-as-tool and bounded contexts are usually better than one monolithic do-everything loop. Smaller, purpose-built contexts outperform giant shared ones because signal stays dense.
4. Split planners from workers
Planner-worker separation is one of the cleanest token-saving moves available.
The planner should think in goals, decomposition, acceptance criteria, and routing. Workers should execute narrow tasks with narrow context. If you let the planner drag all worker transcripts back into its own prompt every turn, you have built a bureaucracy simulator. If you let workers inherit the planner’s entire deliberation history, you have built a memory leak.
A planner should consume compact status updates, not raw transcripts. A worker should receive a scoped brief, not institutional history.
This is also where ReAct, Tree of Thoughts, and CoALA are useful as design influences, not doctrine. The practical takeaway is that reasoning, action, and memory can be separated into components with different context needs.
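The planner-worker boundary above can be pinned down as two small record types: a scoped brief going down, a compact status coming back, and the worker's transcript never crossing the boundary. The shapes and field names are assumptions for illustration.

```python
# Planner/worker interface sketch: the worker receives a scoped brief and
# returns a compact status object, never its transcript. Record shapes are
# hypothetical.

from dataclasses import dataclass, field

@dataclass
class Brief:
    task: str
    inputs: dict            # only what this task needs, not global history
    acceptance: str

@dataclass
class Status:
    task: str
    outcome: str            # "done" | "blocked" | "failed"
    summary: str            # a few sentences, not the transcript
    artifacts: list = field(default_factory=list)

def run_worker(brief: Brief, execute) -> Status:
    transcript, artifacts = execute(brief)   # transcript stays local
    return Status(brief.task, "done", transcript[-1], artifacts)
```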
5. Use structured state for durable facts
If a fact matters across steps, it should not live only in prose.
Store decisions as decisions. Store constraints as constraints. Store tasks, entities, owners, timestamps, and unresolved issues in structured state. That makes compaction easier, retrieval better, and prompt assembly cheaper. It also reduces the model’s need to repeatedly infer the same facts from old text.
Unstructured transcripts are terrible databases. Stop using them as one.
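What "store decisions as decisions" means in practice is a typed record rather than a sentence buried in a transcript. The schema below is an illustrative assumption.

```python
# Durable facts as typed records instead of prose. The schema is an
# illustrative assumption, not a prescribed format.

from dataclasses import dataclass

@dataclass(frozen=True)
class Decision:
    what: str
    why: str
    decided_at: str      # ISO timestamp
    source: str          # anchor back to the originating message

@dataclass(frozen=True)
class Constraint:
    rule: str
    owner: str

state = {
    "decisions": [Decision("use postgres", "managed backups", "2024-05-01", "msg:12")],
    "constraints": [Constraint("no friday deploys", "sre")],
}
```

A record like this is cheap to compact, trivial to retrieve, and never needs to be re-inferred from old text.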
6. Prune tool output at the source
This is the most common and least excusable failure mode.
Teams spend weeks tuning prompts while shoveling raw tool output into the model: full search result pages, full API responses, full logs, entire files, complete DOM trees. Then they wonder why every step costs a fortune.
The fix is upstream. Tools should return only what the model needs for the next decision. If the task is to confirm whether a deployment failed, return the failing job, error summary, and relevant log excerpt. Not 30,000 lines of everything. If the task is to read a document, chunk and filter it before prompt time. If the task is to inspect code, provide the target function and nearby context, not the whole repository.
Token discipline starts upstream.
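The deployment example above can be sketched as a tool that trims at its own boundary. The log format, keywords, and function name are all assumptions; the point is that the reduction happens before the prompt, not inside it.

```python
# Trimming at the tool boundary: the tool returns only what the next
# decision needs, never the full log. Log format and keywords are
# hypothetical.

def deployment_status(raw_log: str, max_excerpt_lines: int = 20) -> dict:
    """Reduce a CI log to a verdict, one-line summary, and short excerpt."""
    lines = raw_log.splitlines()
    errors = [l for l in lines if "ERROR" in l or "FAILED" in l]
    excerpt = errors[:max_excerpt_lines]
    return {
        "failed": bool(errors),
        "error_summary": excerpt[0] if excerpt else None,
        "log_excerpt": excerpt,     # never the 30,000-line firehose
    }
```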
Anti-pattern
Planner
-> inherits full history
-> reads raw worker transcripts
-> forwards giant tool payloads
-> asks worker with bloated brief
-> more noise every turn
Better pattern
Planner
-> reads compact goals + current state
-> sends scoped brief to worker
Worker
-> uses local context + trimmed tool output
-> returns concise status / artifacts
Planner
-> updates compact state, not full transcript replay
Anti-patterns that quietly wreck quality and cost
These show up everywhere:
- Dumping full files into prompts because selection feels hard
- Replaying full history by default instead of retrieving relevant state
- Keeping giant tool registries in every system prompt
- Writing verbose scratchpads that no later step will ever use
- Letting all agents share one global context blob
- Passing raw tool output directly to the model
- Summarizing over and over without preserving source anchors
- Waiting until the prompt is nearly full before compressing anything
Every one of these creates the same outcome: more tokens, less signal, worse decisions.
A practical checklist for next week
If you are running agents in production, do this:
- Measure token use by category: prompt boilerplate, tool schemas, history, retrieved context, tool output, model response
- Set compression triggers early, while prompts are still clean, not once the window is nearly full
- Replace transcript replay with retrieval for durable information
- Introduce a compact state object for facts, decisions, constraints, and open tasks
- Separate planner prompts from worker prompts
- Bound each agent to a narrow context and a small toolset
- Trim tool responses at the source; do not rely on the model to clean up your mess
- Prefer compaction with anchors over repeated free-form summarization
- Keep source references so lost detail can be reloaded on demand
- Treat token efficiency as a product requirement, not a billing concern
That last point matters. Token efficiency is not just about saving money. It is about maintaining accuracy under load and reducing context rot. Wasteful prompts are usually brittle prompts.
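The first checklist item, measuring token use by category, takes very little code to start. The categories and the chars-per-token heuristic below are assumptions; swap in a real tokenizer for production numbers.

```python
# Token accounting by category, the first checklist item. rough_tokens is a
# crude stand-in for a real tokenizer; categories are illustrative.

from collections import Counter

def rough_tokens(text: str) -> int:
    return max(1, len(text) // 4)    # rough chars-per-token heuristic

def token_report(prompt_parts: dict) -> Counter:
    """prompt_parts maps category -> text, e.g. 'history', 'tool_output'."""
    return Counter({cat: rough_tokens(t) for cat, t in prompt_parts.items()})
```

Even this crude report usually makes the argument for you: the biggest bucket is rarely the model's response, and almost always replayed history or raw tool output.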
Efficient agents are designed, not purchased
There is nothing wrong with using bigger context windows when the task genuinely needs them. Some workloads do. But too many teams are buying headroom to avoid fixing architecture.
The better pattern is clear by now.
Use context as working memory, not storage. Compact before rot sets in. Retrieve instead of replaying. Isolate contexts across roles. Split planners from workers. Keep durable facts in structured state. Prune tool output before it reaches the model.
Do that, and larger windows become a helpful capability instead of a crutch.
Ignore it, and you will keep paying premium prices to feed models stale logs, duplicated transcripts, and irrelevant junk. That is not sophistication. It is just expensive laziness.
Sources
- Evaluating Long-Term Memory for Long-Context Question Answering
- MemGPT
- A Practical Guide for Designing, Developing, and Deploying Production-Grade Agentic AI Workflows
- Context Engineering for AI Agents: Part 2
- AI Agent Context Compression: Strategies for Long-Running Sessions
- ReAct
- Tree of Thoughts
- CoALA