Memory for Agents Is a Systems Problem, Not a Context Window Problem | Matthew Gribben
Memory failures in agents are rarely solved by larger context windows. The real problem is systems design: how information is stored, selected, retrieved, compacted, and promoted into durable forms.
March 26, 2026 · 11 min read
An agent that forgets why it opened a ticket, repeats a dead-end plan from yesterday, and drags six thousand tokens of irrelevant chat into every prompt does not have a memory shortage. It has a memory design problem.
A lot of current agent work still treats memory as one of two things: keep more transcript, or retrieve more text. Both help. Neither is a real memory architecture. Long context can delay forgetting, and retrieval can patch over it, but neither tells a system what should be remembered, how it should be represented, when it should be updated, what should decay, or which memories are safe to trust.
That is the shift in the recent memory literature for agents. The important question is not how many tokens a model can ingest. It is how an agent should maintain different kinds of memory over time: what belongs in the active state for the current task, what should persist as experience, what should become durable world knowledge, and what should be compiled into reusable know-how. Memory is not one bucket. It is a stack with different roles, write paths, retrieval rules, and failure modes.
The papers are converging on that point from different angles. Memory in the Age of AI Agents separates memory by form, function, and temporal dynamics. AI Meets Brain: Memory Systems from Cognitive Neuroscience to Autonomous Agents gives a lifecycle view: encode, consolidate, retrieve, update, forget, secure. Evaluating Long-Term Memory for Long-Context Question Answering adds an empirical reminder that bigger context windows are a wasteful substitute for good memory design. MemGPT frames the problem as tiered memory management. Reflexion shows how compact post hoc self-critique can serve as a useful episodic trace. Memp makes the procedural point explicit: durable performance gains often come from storing and reusing successful methods, not just facts or transcripts. CoALA helps tie these into a modular agent architecture rather than a single prompt trick.
If you want a long-running agent that gets better instead of noisier, you need at least four distinct memory types. The useful way to think about them is as a layered system rather than one giant context bucket.
Working memory is the short-horizon active state. Episodic memory stores notable experiences. Semantic memory stores durable facts. Procedural memory stores reusable ways of doing things. Those layers should interact, but they should not collapse into one undifferentiated store.
Working memory: small, typed, and aggressively curated
Working memory is not “whatever fits in the prompt.” It is the minimal active state needed to do the current step well. Plans, subgoals, the current user objective, a few constraints, and a short scratchpad belong here. Everything else is a liability.
Many agent systems go wrong here first. They treat the prompt as an evidence dump and hope the model will sort it out. That is cheap to build and expensive to run. More importantly, it degrades behavior. Irrelevant context raises the chance of distraction, contradiction, and instruction bleed. The agent looks forgetful because it cannot keep the right things salient.
Working memory should therefore be typed and bounded. A production agent should know the difference between:
current objective
current plan
blocking issues
decisions already made in this session
tools and resources currently in play
transient observations worth using for the next step only
That state should be rewritten continuously, not appended forever. In human terms, this is not autobiography. It is the whiteboard. Wipe it often.
ReAct and Tree of Thoughts are useful here mostly as contrast. They help structure reasoning and action inside a task, but they do not by themselves solve long-term memory. They are control patterns for deliberation, not memory systems. If you keep confusing the two, you get elegant traces and bad continuity.
Episodic memory: experiences, outcomes, and reflections
Episodic memory is where the agent stores what happened: a task it attempted, the context it saw, what actions it took, what outcome followed, and what lesson was extracted. Not every turn deserves an episode. Most should vanish. The point is to preserve events that are likely to matter again.
This is where Reflexion remains useful. Its core contribution is not mystical self-improvement. It is the mundane idea that a compact reflection after success or failure can become a much better future cue than a raw transcript. “Last time the API returned 401 because the token belonged to staging” is more valuable than three pages of trial and error.
Good episodic memory is selective and outcome-linked. Useful fields include:
task or goal
relevant entities and environment
action sequence summary
result
confidence or quality signal
extracted lesson
timestamp and scope
That last part matters. Episodes age. A workaround for a broken dependency last week may be wrong after a deploy. Episodic memory should therefore support decay, archival, and explicit invalidation. If you only ever append, you are not building memory. You are building sediment.
A useful way to picture the write path is that raw trajectories do not become long-term memory directly. They get filtered, compressed, and split by purpose.
That is why transcript retention is not the same as memory. The valuable artifact is usually the compact event summary, the lesson, or the promoted fact or procedure, not the verbatim interaction trace.
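The filter-and-split write path can be sketched as a single routing function. The dictionary keys (`outcome`, `lesson`, `observed_facts`, `repeat_count`) are hypothetical field names for the sketch, and the thresholds are arbitrary placeholders.

```python
def route_trajectory(trajectory: dict) -> dict:
    """Filter and split a raw trajectory by purpose instead of storing it verbatim."""
    out = {"episode": None, "fact_candidates": [], "procedure_candidate": None}

    # Only notable outcomes become episodes, and only as compact summaries.
    if trajectory.get("outcome") in ("failure", "surprising_success"):
        out["episode"] = {
            "task": trajectory["task"],
            "summary": f"{len(trajectory['steps'])} steps -> {trajectory['outcome']}",
            "lesson": trajectory.get("lesson", ""),
        }

    # Observed facts are candidates only; promotion to semantic memory happens downstream.
    out["fact_candidates"] = trajectory.get("observed_facts", [])

    # Repeated clean successes can nominate a procedure for distillation.
    if trajectory.get("outcome") == "success" and trajectory.get("repeat_count", 0) >= 3:
        out["procedure_candidate"] = {"task": trajectory["task"],
                                      "steps": trajectory["steps"]}

    return out
```

Everything the function does not route is simply dropped: most turns should vanish.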
Semantic memory: durable facts with provenance
Semantic memory is the agent’s store of durable facts: user preferences, environment details, system configurations, stable business rules, product facts, and other world knowledge that should outlive a single session.
This is where generic RAG tends to overpromise. Retrieval over unstructured notes can surface relevant facts, but semantic memory needs stronger guarantees than “the chunk looked similar.” A usable fact memory should carry provenance, timestamps, scope, confidence, and conflict handling. “Matthew prefers short answers in WhatsApp” is not the same kind of item as “the staging database host is db-stg.internal,” and neither should be stored with the same write policy as a one-off observation from a failed run.
Semantic memory needs explicit promotion rules. Facts should usually graduate into it only after repeated observation, trusted source confirmation, or human instruction. This is one place where the neuroscience framing is genuinely helpful: encoding and consolidation should be separate. Do not let every passing claim become part of the agent’s long-term model of the world.
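A promotion gate that keeps encoding and consolidation separate might look like this. The source labels and the three-observation threshold are assumptions for illustration; real policies will differ.

```python
def should_promote(fact: dict) -> bool:
    """Gate a candidate fact before it enters durable semantic memory.

    Field names ('source', 'observations') are illustrative, not a standard schema.
    """
    if fact.get("source") in ("human_instruction", "trusted_document"):
        return True                        # trusted sources graduate immediately
    if fact.get("observations", 0) >= 3:   # repeated independent observation
        return True
    return False                           # a single model inference stays a candidate
```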
The long-context QA work reinforces the point from the other side. If relevant information can be stored in a structured long-term memory and retrieved precisely, you do not need to keep reloading giant histories just to preserve continuity. Token budgets fall, latency improves, and recall can actually get better because retrieval is sharper.
Procedural memory: the missing piece
Procedural memory is reusable know-how: how to perform a task, not just what happened before or what facts are true. This is where a lot of agent systems are still weak.
An agent that has solved a task ten times should not merely remember ten episodes. It should distill a procedure. That is the contribution of Memp: procedural memory deserves first-class treatment. If an agent learns that the reliable way to debug a flaky CI failure is to check the failed job, isolate the earliest deterministic error, verify whether the workspace changed, and only then rerun, that pattern should become reusable policy.
This matters because facts alone do not compound into competence. Experience only turns into better future behavior when the system can abstract successful strategies, attach them to contexts, and decide when to apply them. CoALA is useful here because it pushes toward a modular architecture in which procedural memory sits alongside semantic and episodic memory rather than being smuggled into a giant prompt.
Procedural memory also has the sharpest failure mode. Bad procedures can fossilize. If the agent learns an overfit workaround, or internalizes a pattern that “usually passes,” it may lock in brittle or unsafe behavior. Procedural updates therefore need a higher bar than episodic writes: repeated success, environment checks, provenance, and often human review.
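The higher bar for procedural writes can be made concrete: a procedure becomes a usable default only after repeated success, and a falling success rate suspends it. The thresholds below are illustrative, not recommendations.

```python
class ProcedureStore:
    """Procedures graduate only after repeated success; failures demote them."""

    PROMOTE_AFTER = 5    # successful runs required before reuse as a default
    DEMOTE_BELOW = 0.8   # success rate under which a procedure is suspended

    def __init__(self) -> None:
        self.candidates: dict[str, dict] = {}

    def record_run(self, name: str, steps: list[str], success: bool) -> None:
        entry = self.candidates.setdefault(
            name, {"steps": steps, "successes": 0, "runs": 0}
        )
        entry["runs"] += 1
        entry["successes"] += int(success)

    def usable(self, name: str) -> bool:
        entry = self.candidates.get(name)
        if not entry or entry["successes"] < self.PROMOTE_AFTER:
            return False
        return entry["successes"] / entry["runs"] >= self.DEMOTE_BELOW
```

In practice the `usable` check is also where environment verification and human review would hook in.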
Where graph knowledge bases help, and where they do not
Graph knowledge bases are useful and often misapplied.
They help when the world has stable entities and relations that need consistent querying over time. Users, projects, services, repositories, ownership, dependencies, permissions, and canonical document links fit well. Graphs are strong when you care about identity resolution, provenance, constraints, and multi-hop retrieval.
They also help when facts need to be merged rather than duplicated. A graph can represent one user preference with update history instead of fifty similar notes.
But graph KBs are overkill for noisy or narrative memory. Raw session traces, speculative reflections, one-off troubleshooting notes, and ephemeral working state do not naturally want to become triples. Forcing them into a graph often creates brittle schemas and false precision.
Use graphs for stable entities and durable relations; documents or event logs for episodes; compact typed state for working memory; and separate policy stores or skill artifacts for procedures. Do not make the graph pretend to be all four.
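The merge-instead-of-duplicate idea is simple to show: one canonical fact per subject-predicate pair, with its update history attached, rather than fifty near-duplicate notes. Field names here are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class FactNode:
    """One canonical fact per (subject, predicate), with update history."""
    subject: str
    predicate: str
    value: str
    history: list[tuple[str, float]] = field(default_factory=list)

    def update(self, new_value: str, at: float) -> None:
        # Supersede the old value but keep it as provenance, not as a duplicate.
        self.history.append((self.value, at))
        self.value = new_value
```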
What breaks in production
The common failures are not subtle.
Stale memory: a fact was once true and is now wrong, but still outranks newer evidence because it looks authoritative.
Prompt pollution: too much retrieved memory, poorly filtered, drowns the current task.
Retrieval mismatch: the right memory exists but is indexed or scoped badly, so the system fetches something adjacent instead.
Privacy and security leakage: sensitive memories are stored too broadly, retrieved into the wrong context, or retained longer than they should be.
Procedural drift: the agent starts reusing a strategy that worked locally but should never have become a default.
These failures get worse when the system has no clear write criteria. If every turn can update long-term memory, the agent becomes gullible. If nothing is ever removed, the agent becomes superstitious. If memory items have no provenance, operators cannot tell whether a fact came from a trusted document, user instruction, model inference, or another hallucinated memory.
Production guidance
A workable memory system needs explicit policy, not just storage.
Start with separate stores, or at least separate schemas, for working, episodic, semantic, and procedural memory. Give each one a different write path and review threshold. Make writes sparse by default.
Require metadata on durable memories: source, timestamp, scope, confidence, sensitivity, and last verification time. Support conflict states instead of forcing a premature merge. Build retrieval as a ranking problem with type-aware filters, not a blind similarity search.
The retrieval path should also be explicit. The agent should not dump all candidate memory into the prompt. It should select, compact, and stage only what the current task can actually use.
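Type-aware ranked retrieval, as opposed to a blind similarity search, can be sketched as follows. The task-to-memory-type eligibility map, the scoring weights, and the field names are all assumptions made for the example.

```python
def rank_memories(query_terms: set[str], task_type: str,
                  memories: list[dict], budget: int = 3) -> list[dict]:
    """Select, rank, and cap candidate memories with type-aware filters."""
    # Which memory types are even eligible for this kind of task.
    eligible = {
        "debugging": {"episodic", "procedural"},
        "question_answering": {"semantic"},
    }.get(task_type, {"semantic", "episodic", "procedural"})

    def score(m: dict) -> float:
        if m["kind"] not in eligible:
            return -1.0
        overlap = len(query_terms & set(m.get("keywords", [])))
        recency = m.get("freshness", 0.5)      # 0..1, higher is fresher
        confidence = m.get("confidence", 0.5)
        return overlap + 0.5 * recency + 0.5 * confidence

    ranked = sorted((m for m in memories if score(m) > 0), key=score, reverse=True)
    return ranked[:budget]  # stage only what the prompt can actually use
```

The budget cap is doing real work here: it forces the selection step to happen before the prompt, not inside it.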
Add lifecycle controls. Memories should be refreshable, suppressible, and deletable. Security rules should be part of retrieval, not an afterthought. Sensitive memory should be scoped to the right user, tenant, or environment before the model ever sees it.
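Scoping as part of retrieval, rather than an afterthought, amounts to a filter that runs before any ranking or prompting. The `tenant`, `user`, and `suppressed` fields are hypothetical names for the sketch.

```python
def visible_memories(memories: list[dict], tenant: str, user: str) -> list[dict]:
    """Apply scope and suppression rules before the model ever sees a memory."""
    return [
        m for m in memories
        if not m.get("suppressed")            # suppressible
        and m.get("tenant") == tenant          # tenant isolation
        and m.get("user") in (user, "shared")  # user scope or explicitly shared
    ]
```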
Finally, evaluate memory as a system. Measure whether the agent retrieves the right memory type for the task, whether memory helps more than it distracts, whether procedures improve success rates over time, and how quickly stale information is corrected. The point is not to maximize remembrance. It is to maximize useful continuity.
The full lifecycle is broader than retrieval. Memory systems also need gates around what gets written, where it lands, and when it should disappear.
That is the underlying lesson across the recent work. Agent memory is not a bigger context window, and it is not a vector database with nicer branding. It is the machinery by which a system decides what to keep, abstract, trust, and forget. Teams that get this right will build agents that become more reliable with time. Teams that do not will keep shipping agents with excellent recall for the wrong things.
Chief Technology Officer writing about AI systems, software architecture, cyber security, cryptography, and the practical realities of technology leadership.