Stop Buying Bigger Context Windows
The default response to rising agent cost is predictable: buy a model with a bigger context window.
It sounds sensible. If the agent forgets things, give it more room. If quality degrades over long runs, perhaps the model just needs to see more.
Usually it does not.
Most token burn is not a model-capability problem. It is an architecture problem. Teams stuff entire files into prompts, replay full transcripts by default, let every sub-agent inherit the same global sludge, and pass raw tool output straight through to the model. Then they act surprised when cost spikes and quality falls off a cliff.
The real bottleneck is often not the hard context limit. It is context rot: the gradual degradation that happens when relevant information is diluted by stale, duplicated, low-signal junk. Long before you hit the advertised million-token ceiling, your agent is already less reliable because you made it read too much nonsense.
That is what most “just buy a bigger window” thinking misses. Bigger windows are useful. They are not a substitute for memory design, tool discipline, and state management.
The wrong diagnosis
There is a comforting story that says agents are expensive because models are still too small. It shifts blame onto vendors and benchmarks. If only the window were larger, the architecture could stay sloppy.
But look at where the tokens actually go in production systems.
They go into repeated prompt boilerplate. They go into verbose reasoning traces no downstream step needs. They go into giant tool descriptions the model never uses. They go into raw JSON payloads, logs, HTML, search results, and full documents that should have been filtered before they ever touched the prompt. They go into conversations where every new step drags the whole past behind it like a caravan of broken furniture.
That is not intelligence. It is bad systems design.
Phil Schmid’s writing on context engineering gets this right: long-running agents are less about maximizing visible context and more about selecting, isolating, and compressing the right context for the current step. The model needs useful state, not the entire archaeological record.
The “Production-Grade Agentic AI Workflows” guidance lands in the same place from a different angle: keep systems simple, use tools deliberately, externalize state, and give components single responsibilities.
Context rot matters before hard limits
Teams talk about context windows like disk capacity: if you have more, you can safely store more. That is the wrong mental model.
Context behaves more like a cluttered workbench. Once it fills with half-useful debris, performance gets worse even if there is still technically space left. The model can attend to all of it in theory. In practice, attention is not free, relevance is uneven, and long prompts introduce noise, contradiction, and drift.
This is why context rot shows up before hard failure. The agent starts missing key instructions, over-weighting stale details, or anchoring on earlier tool outputs that are no longer relevant. It becomes less precise while still looking busy.
Zylos’s work on context compression is useful here because it treats compression as a control system, not a one-off summarization trick. You do not wait until the prompt is a landfill and then ask the model for a heroic summary. You compress iteratively, preserve anchors, and trigger compaction before drift compounds. That is much closer to how competent systems should behave.
A large context window can delay the crash. It does not prevent it.
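Treating compaction as a control loop rather than a one-off summary can be sketched in a few lines. Everything here is an illustrative assumption: the soft limit, the chars-per-token heuristic, and the `summarize` callback stand in for whatever tokenizer and compaction model a real system would use.

```python
# Sketch of compaction as a control loop: trigger well before the hard
# window, keep recent turns verbatim, and leave an anchor behind so the
# compacted material can be recovered. All names and thresholds are
# hypothetical.

SOFT_LIMIT = 8_000   # compact long before the advertised ceiling
KEEP_RECENT = 6      # most recent turns stay verbatim

def estimate_tokens(messages):
    # Crude stand-in for a real tokenizer: roughly 4 chars per token.
    return sum(len(m["content"]) for m in messages) // 4

def compact(history, summarize):
    """Compress older turns into one anchored summary; keep recent turns."""
    if estimate_tokens(history) < SOFT_LIMIT:
        return history                        # below threshold: do nothing
    old, recent = history[:-KEEP_RECENT], history[-KEEP_RECENT:]
    summary = summarize(old)                  # model- or rule-based compaction
    anchor = {
        "role": "system",
        "content": f"[compacted {len(old)} turns; sources retained] {summary}",
    }
    return [anchor] + recent
```

The important property is that the trigger fires early and repeatedly, so no single compaction pass has to be heroic.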
Raw inputs
(chat history, files, tool output, retrieved docs)
|
v
+-------------------------------+
| Selection / compaction layer |
| - prune noise |
| - compress history |
| - retrieve relevant memory |
| - load structured state |
+-------------------------------+
|
v
Active prompt / working set
|
v
Model step
|
v
New outputs -> tools / memory updates / state changes
What the memory evidence actually says
The most interesting evidence in this area does not support brute-force replay. It supports memory hierarchies.
The paper Evaluating Long-Term Memory for Long-Context Question Answering is especially awkward for the “just stuff everything in” camp. The headline result is the one to remember: systems using external memory and retrieval can cut token usage by more than 90 percent while staying competitive on long-context QA. That is not a rounding error. That is a different operating model.
MemGPT provides the conceptual frame. It treats model context like virtual memory: a limited fast tier backed by larger external stores. The trick is not pretending everything belongs in the active prompt. The trick is managing what gets promoted into working memory, what gets compacted, and what stays in cheaper storage until needed.
This is how competent systems work everywhere else. CPUs have caches. Databases have indexes. Operating systems page memory. Nobody keeps all state in the fastest, most expensive tier at all times. Yet agent systems routinely do exactly that with tokens.
Context should be the hot path, not the warehouse.
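The virtual-memory analogy can be made concrete with a two-tier sketch: a small working set that reaches the prompt, backed by a larger external store. This is in the spirit of MemGPT's framing, not its actual API; the class, method names, and FIFO eviction policy are all simplifying assumptions.

```python
# Minimal two-tier memory: a bounded hot tier (goes into the prompt) backed
# by a cheap cold tier. Promotion, not inclusion-by-default, decides what
# the model sees. A hypothetical sketch, not MemGPT's real interface.

class TieredMemory:
    def __init__(self, working_capacity=4):
        self.working = {}        # hot tier: becomes part of the active prompt
        self.store = {}          # cold tier: external storage
        self.capacity = working_capacity

    def write(self, key, value):
        self.store[key] = value  # everything lands in cheap storage first

    def promote(self, key):
        """Pull a fact into working memory, evicting the oldest if full."""
        if key not in self.store:
            raise KeyError(key)
        if len(self.working) >= self.capacity:
            evicted = next(iter(self.working))   # FIFO eviction for the sketch
            del self.working[evicted]
        self.working[key] = self.store[key]

    def active_context(self):
        return dict(self.working)    # only the hot tier reaches the model
```

Nothing is lost on eviction: the cold tier still holds the fact, and it can be promoted again when a later step actually needs it.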
The architecture patterns that actually reduce token burn
1. Compaction beats repeated lossy summarization
Summarization is often treated as a cleanup button. The problem is that repeated naive summarization compounds information loss. Each pass strips nuance, removes anchors, and increases the odds that future steps inherit a neat but wrong version of the past.
Compaction is a better pattern. Keep durable facts, decisions, open threads, constraints, and references in a structured compact form. Drop chatter, duplicates, and completed branches. Preserve anchors that let the system recover source material when needed.
Good compaction is not “make this shorter.” It is “keep what remains operationally useful.”
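As a sketch, compaction looks less like summarization and more like selective retention with anchors. The event `kind` labels and field names below are assumptions made for illustration.

```python
# Compaction as selective retention: keep durable event kinds, drop chatter
# and completed branches, and preserve source anchors so dropped detail can
# be reloaded later. Event schema is hypothetical.

DURABLE = {"decision", "constraint", "open_thread", "reference"}

def compact_events(events):
    """Keep operationally useful events; drop everything else."""
    kept = [e for e in events if e["kind"] in DURABLE and not e.get("resolved")]
    # Anchors survive even for dropped events, so nothing is unrecoverable.
    anchors = [e["source"] for e in events if "source" in e]
    return {"state": kept, "anchors": sorted(set(anchors))}
```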
2. Retrieval beats replay
If a fact might matter later, store it and retrieve it when relevant. Do not replay the entire prior conversation or dump full documents into every prompt just because something in there could matter.
This is where the long-term memory literature and MemGPT line up neatly. External memory is not just a cost optimization. It improves relevance selection. The current step gets the fragments that matter, not the entire haystack.
Retrieval only works if the stored state is sane. Raw transcript chunks are better than nothing, but structured memory is better still: decisions, entities, deadlines, assumptions, source links, unresolved questions. A retrieval system can work with that.
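The retrieval step itself can be trivially small. In the sketch below, naive keyword overlap stands in for a real embedding search; the point is the shape of the operation, not the scoring function.

```python
# Retrieval instead of replay: score stored fragments against the current
# step and return only the top matches. Keyword overlap is a deliberately
# crude stand-in for embedding similarity.

def retrieve(memory, query, k=3):
    """Return the k fragments most relevant to the current step."""
    q = set(query.lower().split())
    scored = sorted(
        memory,
        key=lambda frag: len(q & set(frag.lower().split())),
        reverse=True,
    )
    return scored[:k]
```

The current step gets a handful of relevant fragments; the haystack stays in storage.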
3. Context isolation is not optional
One of the worst habits in agent design is shared global context across every actor. The planner sees everything. The worker sees everything. The reviewer sees everything. Soon every prompt is bloated with state only one step needed.
Context isolation fixes that. Give each agent or step only the local state required to do its job. If a planning agent is deciding next actions, it does not need the raw HTML from a scraper. If a coding worker is editing one file, it does not need the full chat history that led to the ticket.
Schmid makes this point well: agent-as-tool and bounded contexts are usually better than one monolithic do-everything loop. Smaller, purpose-built contexts outperform giant shared ones because signal stays dense.
4. Split planners from workers
Planner-worker separation is one of the cleanest token-saving moves available.
The planner should think in goals, decomposition, acceptance criteria, and routing. Workers should execute narrow tasks with narrow context. If you let the planner drag all worker transcripts back into its own prompt every turn, you have built a bureaucracy simulator. If you let workers inherit the planner’s entire deliberation history, you have built a memory leak.
A planner should consume compact status updates, not raw transcripts. A worker should receive a scoped brief, not institutional history.
This is also where ReAct, Tree of Thoughts, and CoALA are useful as design influences, not doctrine. The practical takeaway is that reasoning, action, and memory can be separated into components with different context needs.
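The planner-worker boundary above can be pinned down as two small record types: a scoped brief going down, a compact status coming back, and the worker's transcript never crossing the boundary. The shapes and field names are assumptions for illustration.

```python
# Planner/worker interface sketch: the worker receives a scoped brief and
# returns a compact status object, never its transcript. Record shapes are
# hypothetical.

from dataclasses import dataclass, field

@dataclass
class Brief:
    task: str
    inputs: dict            # only what this task needs, not global history
    acceptance: str

@dataclass
class Status:
    task: str
    outcome: str            # "done" | "blocked" | "failed"
    summary: str            # a few sentences, not the transcript
    artifacts: list = field(default_factory=list)

def run_worker(brief: Brief, execute) -> Status:
    transcript, artifacts = execute(brief)   # transcript stays local
    return Status(brief.task, "done", transcript[-1], artifacts)
```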
5. Use structured state for durable facts
If a fact matters across steps, it should not live only in prose.
Store decisions as decisions. Store constraints as constraints. Store tasks, entities, owners, timestamps, and unresolved issues in structured state. That makes compaction easier, retrieval better, and prompt assembly cheaper. It also reduces the model’s need to repeatedly infer the same facts from old text.
Unstructured transcripts are terrible databases. Stop using them as one.
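What "store decisions as decisions" means in practice is a typed record rather than a sentence buried in a transcript. The schema below is an illustrative assumption.

```python
# Durable facts as typed records instead of prose. The schema is an
# illustrative assumption, not a prescribed format.

from dataclasses import dataclass

@dataclass(frozen=True)
class Decision:
    what: str
    why: str
    decided_at: str      # ISO timestamp
    source: str          # anchor back to the originating message

@dataclass(frozen=True)
class Constraint:
    rule: str
    owner: str

state = {
    "decisions": [Decision("use postgres", "managed backups", "2024-05-01", "msg:12")],
    "constraints": [Constraint("no friday deploys", "sre")],
}
```

A record like this is cheap to compact, trivial to retrieve, and never needs to be re-inferred from old text.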
6. Prune tool output at the source
This is the most common and least excusable failure mode.
Teams spend weeks tuning prompts while shoveling raw tool output into the model: full search result pages, full API responses, full logs, entire files, complete DOM trees. Then they wonder why every step costs a fortune.
The fix is upstream. Tools should return only what the model needs for the next decision. If the task is to confirm whether a deployment failed, return the failing job, error summary, and relevant log excerpt. Not 30,000 lines of everything. If the task is to read a document, chunk and filter it before prompt time. If the task is to inspect code, provide the target function and nearby context, not the whole repository.
Token discipline starts upstream.
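The deployment example above can be sketched as a tool that trims at its own boundary. The log format, keywords, and function name are all assumptions; the point is that the reduction happens before the prompt, not inside it.

```python
# Trimming at the tool boundary: the tool returns only what the next
# decision needs, never the full log. Log format and keywords are
# hypothetical.

def deployment_status(raw_log: str, max_excerpt_lines: int = 20) -> dict:
    """Reduce a CI log to a verdict, one-line summary, and short excerpt."""
    lines = raw_log.splitlines()
    errors = [l for l in lines if "ERROR" in l or "FAILED" in l]
    excerpt = errors[:max_excerpt_lines]
    return {
        "failed": bool(errors),
        "error_summary": excerpt[0] if excerpt else None,
        "log_excerpt": excerpt,     # never the 30,000-line firehose
    }
```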
Anti-pattern
Planner
-> inherits full history
-> reads raw worker transcripts
-> forwards giant tool payloads
-> asks worker with bloated brief
-> more noise every turn
Better pattern
Planner
-> reads compact goals + current state
-> sends scoped brief to worker
Worker
-> uses local context + trimmed tool output
-> returns concise status / artifacts
Planner
-> updates compact state, not full transcript replay
Anti-patterns that quietly wreck quality and cost
These show up everywhere:
- Dumping full files into prompts because selection feels hard
- Replaying full history by default instead of retrieving relevant state
- Keeping giant tool registries in every system prompt
- Writing verbose scratchpads that no later step will ever use
- Letting all agents share one global context blob
- Passing raw tool output directly to the model
- Summarizing over and over without preserving source anchors
- Waiting until the prompt is nearly full before compressing anything
Every one of these creates the same outcome: more tokens, less signal, worse decisions.
A practical checklist for next week
If you are running agents in production, do this:
- Measure token use by category: prompt boilerplate, tool schemas, history, retrieved context, tool output, model response
- Set compression triggers early, while prompts are still clean, not once the window is nearly full
- Replace transcript replay with retrieval for durable information
- Introduce a compact state object for facts, decisions, constraints, and open tasks
- Separate planner prompts from worker prompts
- Bound each agent to a narrow context and a small toolset
- Trim tool responses at the source; do not rely on the model to clean up your mess
- Prefer compaction with anchors over repeated free-form summarization
- Keep source references so lost detail can be reloaded on demand
- Treat token efficiency as a product requirement, not a billing concern
That last point matters. Token efficiency is not just about saving money. It is about maintaining accuracy under load and reducing context rot. Wasteful prompts are usually brittle prompts.
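The first checklist item, measuring token use by category, takes very little code to start. The categories and the chars-per-token heuristic below are assumptions; swap in a real tokenizer for production numbers.

```python
# Token accounting by category, the first checklist item. rough_tokens is a
# crude stand-in for a real tokenizer; categories are illustrative.

from collections import Counter

def rough_tokens(text: str) -> int:
    return max(1, len(text) // 4)    # rough chars-per-token heuristic

def token_report(prompt_parts: dict) -> Counter:
    """prompt_parts maps category -> text, e.g. 'history', 'tool_output'."""
    return Counter({cat: rough_tokens(t) for cat, t in prompt_parts.items()})
```

Even this crude report usually makes the argument for you: the biggest bucket is rarely the model's response, and almost always replayed history or raw tool output.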
Efficient agents are designed, not purchased
There is nothing wrong with using bigger context windows when the task genuinely needs them. Some workloads do. But too many teams are buying headroom to avoid fixing architecture.
The better pattern is clear by now.
Use context as working memory, not storage. Compact before rot sets in. Retrieve instead of replaying. Isolate contexts across roles. Split planners from workers. Keep durable facts in structured state. Prune tool output before it reaches the model.
Do that, and larger windows become a helpful capability instead of a crutch.
Ignore it, and you will keep paying premium prices to feed models stale logs, duplicated transcripts, and irrelevant junk. That is not sophistication. It is just expensive laziness.
Sources
- Evaluating Long-Term Memory for Long-Context Question Answering
- MemGPT
- A Practical Guide for Designing, Developing, and Deploying Production-Grade Agentic AI Workflows
- Context Engineering for AI Agents: Part 2
- AI Agent Context Compression: Strategies for Long-Running Sessions
- ReAct
- Tree of Thoughts
- CoALA