Why agent memory should start with work trajectories, not conversations
Most agent memory systems still inherit a chatbot worldview, where you take a transcript, chunk it, embed it, maybe summarize it, and call that memory. That mostly works for preferences, biographies, and long support threads. It works far less well for actual work.
Work is not a transcript, and an agent completing a task does not just exchange messages. It inspects files, runs commands, hits APIs, retries after failures, asks for approval, gets blocked by permissions, switches plans, checkpoints state, and resumes later. The useful thing to remember is not the chat history that happened to surround that work, but the trajectory of the work itself.
That is the substrate agent memory should be built on: structured work trajectories, segmented into episodes, then distilled into durable memory.
Not bigger context windows, not “just store all the messages”, and not another vector index full of chat chunks with no causal structure.
Chat logs are the wrong substrate
A chat transcript is a lossy view of the task. Sometimes it contains too much: pleasantries, repeated instructions, partial thoughts, malformed attempts. Sometimes it contains too little: the real work happened in tools, files, shell commands, HTTP requests, or side effects outside the visible dialogue.
If an agent spends 20 minutes debugging a deployment, the transcript may only show:
- “checking logs”
- “found the issue”
- “fixed it”
- “redeployed”
That is useless as memory. What mattered was:
- which service failed
- which command exposed the issue
- what the error string was
- what change resolved it
- whether approval was needed
- what remained risky afterward
Raw chat history is optimized for turn-taking, while memory is optimized for reuse.
Those are different jobs, which is also why simply throwing more tokens at the prompt is not a real answer. Research on long-horizon memory keeps landing in roughly the same place: long context helps, but it does not solve retrieval, abstraction, salience, or consistency by itself. MemGPT framed this clearly by treating context as a constrained working-memory tier rather than pretending the whole problem goes away with a larger window. MemoryBank and later Mem0 make a similar point from the opposite direction: explicit memory extraction and management beat naïve replay on long conversational tasks. LoCoMo exists because very long-term conversational coherence is still hard, even when the full history is technically available.
The bottleneck is not just capacity but representation.
Agent work is a trajectory, not just a conversation
Distributed systems figured this out years ago. If you want to understand what happened during a request, you do not stare at an undifferentiated pile of logs; you capture a trace. A trace gives you a causal path through the system: spans for units of work, events for notable moments, timing, hierarchy, metadata, and links across boundaries.
OpenTelemetry’s trace model is a very good starting point for agents. Not because agents are web servers, but because both have the same core problem: reconstructing what happened across a sequence of dependent actions.
A practical mapping looks like this:
- a user task becomes a trace
- each agent step becomes a span
- prompt assembly, tool calls, approvals, retries, warnings, and exceptions become events
- bulky things like full prompts, command stdout, screenshots, fetched pages, diffs, or retrieved docs become artifacts
- state transitions and outputs become attributes
- cross-task dependencies become links
That model is already close to what observability tooling for LLM apps has converged on. OpenTelemetry defines the general trace structure. GenAI semantic conventions and adjacent efforts such as OpenInference add AI-specific fields for model calls, messages, tool use, retrieval, and token accounting. Phoenix, Langfuse, LangSmith, OpenAI’s tracing stack, and OpenLLMetry all circle the same idea: if you want to debug or understand an agent, capture a structured execution trace, not just the final answer.
That should not stop at debugging, because the same structure can become the source material for memory.
Traces are the raw material, not the final memory
A raw trajectory is still too detailed to stuff back into context, and if you store every span forever and retrieve them wholesale, you have built observability, not memory. Useful memory requires compression, segmentation, and selection.
This is where many systems get sloppy by treating memory as a uniform blob store. In practice, different kinds of memory should come out of a trajectory.
At minimum, you want three distinct products.
1. Episode summaries
These capture what happened during a bounded chunk of work. An episode summary is not “the chat from 2:00 to 2:15 PM.” It is something like:
Investigated failing deploy for api-worker. Root cause was missing env var after secret rotation. Verified via startup logs and local repro. Fixed secret mapping in Helm values, redeployed successfully, left note that staging still has drift.
This is compact, chronological, and tied to a task arc.
2. Durable facts
These are stable facts extracted from one or more episodes. Examples:
- api-worker depends on PAYMENTS_WEBHOOK_SECRET
- staging secrets drift from production after manual rotations
- the user prefers deploy confirmations with exact service names and commit SHAs
These should be versioned, attributable, and revocable. Facts are not just shorter summaries. They have a different shape and should survive many episode boundaries.
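A minimal sketch of that shape, assuming subject-predicate-object facts with episode IDs as provenance (the field names and `supersede` helper are illustrative):

```python
import time
from dataclasses import dataclass, field

@dataclass
class Fact:
    # A durable assertion distilled from one or more episodes.
    subject: str
    predicate: str
    obj: str
    version: int = 1
    sources: list[str] = field(default_factory=list)  # episode IDs (provenance)
    revoked: bool = False
    updated_at: float = field(default_factory=time.time)

    def supersede(self, new_obj: str, source: str) -> "Fact":
        # Revoke this version and return the replacement; history is kept,
        # so the system can explain when and why a belief changed.
        self.revoked = True
        return Fact(self.subject, self.predicate, new_obj,
                    version=self.version + 1,
                    sources=self.sources + [source])

fact = Fact("api-worker", "depends_on", "PAYMENTS_WEBHOOK_SECRET",
            sources=["ep-017"])
v2 = fact.supersede("PAYMENTS_WEBHOOK_SECRET_V2", "ep-031")
```

Revocation rather than deletion is the point: the old version stays attributable while the new one takes over.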
3. Lessons and procedures
This is the part most memory systems underbuild, because a lot of agent value comes from procedural learning:
- when you see error X, inspect file Y first
- this repo requires approval before touching production configs
- for this CLI, instrument the internal loop instead of scraping terminal output
- checkpoint before running migrations so resume can pick up cleanly after approval or failure
These are reusable policies, heuristics, and playbooks. They matter more than remembering that a particular conversation happened on Tuesday.
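One plausible record shape for a procedure, with usage counts so it can harden or decay over time (the fields and `reliability` heuristic are assumptions, not a standard):

```python
from dataclasses import dataclass

@dataclass
class Procedure:
    # A reusable playbook: when `trigger` matches, try `steps` in order.
    trigger: str           # e.g. an error signature or task pattern
    steps: list[str]
    requires_approval: bool = False
    uses: int = 0          # reinforcement: hardened each time it helps
    failures: int = 0

    def reliability(self) -> float:
        # Unproven procedures get a neutral prior rather than zero.
        total = self.uses + self.failures
        return self.uses / total if total else 0.5

playbook = Procedure(
    trigger="CrashLoopBackOff after secret rotation",
    steps=["inspect startup logs", "diff Helm values against secret store",
           "fix mapping", "redeploy"],
    requires_approval=True,
)
playbook.uses += 1
```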
Segment by episodes, not token windows
The natural boundary for memory is not “every 2,000 tokens” but the episode: a coherent slice of work with a purpose, local state, and some outcome.
Sometimes one trace maps to one episode. Sometimes a long trace has several. A bug hunt might split into initial triage, failed fix, second investigation, and final repair. A task resumed after six hours and a permission approval is probably not one uninterrupted episode just because the trace ID stayed constant.
Episode segmentation should use structural signals, not only textual similarity:
- task and subtask boundaries
- plan revisions
- long idle gaps
- approval gates
- tool-mode changes
- error bursts followed by strategy change
- checkpoint and resume markers
- explicit outcome transitions like blocked, completed, abandoned
This is one place where treating agent activity as a trace pays off. You already have timestamps, parent-child structure, event types, and artifacts. You are not guessing boundaries from chat prose alone.
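A sketch of structural segmentation over such a trace. The event kinds, the idle-gap threshold, and the boundary set are all illustrative assumptions:

```python
# Split a trace into episodes using structural signals rather than
# token counts or textual similarity.
IDLE_GAP_SECONDS = 30 * 60
BOUNDARY_KINDS = {"plan_revised", "approval_gate", "resume", "task_completed"}

def segment_episodes(events: list[tuple[float, str]]) -> list[list[tuple[float, str]]]:
    episodes: list[list[tuple[float, str]]] = []
    current: list[tuple[float, str]] = []
    last_ts: float | None = None
    for ts, kind in events:
        idle = last_ts is not None and ts - last_ts > IDLE_GAP_SECONDS
        if current and (idle or kind in BOUNDARY_KINDS):
            episodes.append(current)
            current = []
        current.append((ts, kind))
        last_ts = ts
    if current:
        episodes.append(current)
    return episodes

trace = [(0, "goal_received"), (60, "tool_call"), (120, "error"),
         (150, "plan_revised"), (200, "tool_call"),
         (2600, "resume"), (2630, "task_completed")]
```

Here the plan revision, the long idle gap, and the completion marker each open a new episode, with no guessing from chat prose.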
Salience scoring is not optional
Not everything deserves to become memory. This sounds obvious, but a lot of memory pipelines still behave like hoarders. If the system sees it, it stores it. Later it tries to retrieve from a junk drawer.
A decent salience model should ask at least:
- is this likely to matter again?
- is it stable or ephemeral?
- did it affect outcome, cost, latency, or correctness?
- was it surprising?
- did it require human approval or intervention?
- does it update an existing fact or contradict one?
- is it specific enough to retrieve usefully later?
The Generative Agents work used a combination of recency, relevance, and importance for retrieval and reflection. That basic shape still holds up. For work agents, “importance” should lean toward operational consequence: failures, fixes, irreversible actions, repeated patterns, and human overrides should score high. Verbose but low-stakes chatter should not.
You also want decay. MemoryBank’s use of reinforcement and forgetting is directionally right. Some memories should weaken unless they keep proving useful. Others, especially procedures or durable environment facts, should harden over time.
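A composite score in that shape might look like the following. The weights, half-life, and importance features are illustrative placeholders, not tuned values:

```python
# Salience = weighted recency + relevance + importance, Generative-Agents style,
# with importance biased toward operational consequence.
W_RECENCY, W_RELEVANCE, W_IMPORTANCE = 0.3, 0.3, 0.4
HALF_LIFE_DAYS = 14.0  # assumed decay rate

def recency(age_days: float) -> float:
    # Exponential decay; reinforcement elsewhere can reset the age.
    return 0.5 ** (age_days / HALF_LIFE_DAYS)

def importance(affected_outcome: bool, needed_approval: bool,
               surprising: bool, contradicts_existing: bool) -> float:
    score = 0.0
    score += 0.4 if affected_outcome else 0.0
    score += 0.3 if needed_approval else 0.0
    score += 0.2 if surprising else 0.0
    score += 0.1 if contradicts_existing else 0.0
    return score

def salience(age_days: float, relevance: float, imp: float) -> float:
    return (W_RECENCY * recency(age_days)
            + W_RELEVANCE * relevance
            + W_IMPORTANCE * imp)

fresh_fix = salience(0.0, 0.8, importance(True, True, False, False))
old_chatter = salience(60.0, 0.2, importance(False, False, False, False))
```

A fresh fix that affected the outcome and needed approval scores far above stale low-stakes chatter, which is exactly the ordering a retrieval layer should see.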
Retrieval should be hybrid: semantic, temporal, relational
Pure semantic search is not enough. If an agent is resuming a deploy issue from yesterday, time matters. If it is trying to reuse a prior fix in the same repo, relationships matter. If it is debugging an error message, lexical overlap matters. If it is handling “that thing we did after the OAuth redirect broke,” event chains matter.
So retrieval should combine at least three signals:
- semantic: embedding similarity against summaries, facts, procedures, artifacts
- temporal: recency, session proximity, task chronology, before/after relations
- relational: same repo, same service, same user, same tool, same trace lineage, same dependency graph
Mem0’s graph-flavoured direction is interesting here because some memories are naturally relational, not just textual. “Service A depends on secret B” is a graph edge. “This procedure supersedes the older one” is a graph edge. “This episode resolved the incident introduced by that change” is a graph edge.
A serious memory system should stop pretending a single vector database is enough.
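A toy scoring function that blends the three signals. The equal weighting, the seven-day half-life, and entity-overlap relational signal are all assumptions; a real system would re-rank on top of this:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_score(query_vec: list[float], mem_vec: list[float], age_days: float,
                 mem_entities: set[str], query_entities: set[str]) -> float:
    semantic = cosine(query_vec, mem_vec)
    temporal = 0.5 ** (age_days / 7.0)  # recency half-life, assumed
    overlap = len(mem_entities & query_entities)
    relational = overlap / max(len(query_entities), 1)
    # Equal weighting is a placeholder for a learned or tuned combination.
    return (semantic + temporal + relational) / 3.0

score = hybrid_score([1.0, 0.0], [0.9, 0.1], age_days=1.0,
                     mem_entities={"api-worker", "helm"},
                     query_entities={"api-worker"})
```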
For terminal-first agents, instrument the loop, not just the terminal
This matters more than people think. If you are building terminal-first agents or CLIs like Pi, the temptation is to capture terminal output and call it observability. That gives you a recording rather than an understanding.
The interesting structure usually lives one layer deeper, inside the agent loop:
- goal received
- plan updated
- tool selected
- command prepared
- approval requested
- command executed
- output classified
- retry policy triggered
- checkpoint saved
- memory write proposed
- resume invoked
If you only instrument stdin and stdout around the shell, you miss intent, decision points, and state transitions. You also make privacy and redaction much harder because the terminal stream is a giant mixed blob.
Instrument the runtime loop itself, then attach terminal I/O as artifacts or span events where needed. You want the shell transcript available, but not as the primary semantic object.
TTY capture is good for demos. Structured events are what make debugging, memory, and recovery actually work.
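A minimal sketch of what instrumenting the loop rather than the terminal looks like. The event names and the in-memory `EVENTS` list stand in for a real exporter; the key design choice is that terminal output enters only as a content-addressed artifact reference:

```python
import hashlib
import time

EVENTS: list[dict] = []  # stand-in for an exporter / event bus

def emit(kind: str, **attrs) -> None:
    # One structured record per loop transition; terminal output is not
    # inlined, only referenced by hash, so redaction stays tractable.
    EVENTS.append({"ts": time.time(), "kind": kind, **attrs})

def attach_terminal_output(raw: bytes) -> str:
    digest = hashlib.sha256(raw).hexdigest()
    # A real system would upload `raw` to object storage keyed by digest.
    emit("artifact_attached", sha256=digest, size_bytes=len(raw))
    return digest

emit("goal_received", goal="fix failing deploy")
emit("tool_selected", tool="kubectl")
emit("approval_requested", command="kubectl rollout restart deploy/api-worker")
digest = attach_terminal_output(b"deployment.apps/api-worker restarted\n")
emit("command_executed", exit_code=0, stdout_sha256=digest)
```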
Checkpoints and resume are part of memory
A lot of useful agent work is interrupted. The user goes away. Approval is pending. The network dies. A long build is still running. The model chooses to stop and ask. If the agent cannot checkpoint and resume cleanly, memory becomes partly fictional because the system reconstructs continuity from whatever text it can find.
Checkpointing should capture enough state to continue a trajectory without replaying the world:
- current plan
- open subgoals
- pending approvals
- relevant handles or resource IDs
- partial outputs
- working files and diffs
- tool state
- cursor positions in long-running tasks
This is one of the places LangGraph’s persistence model is worth paying attention to. It treats checkpointing as a first-class part of agent execution rather than a convenience feature bolted on later. For coding agents and operator-style assistants, that is the right instinct.
Observability tells you what happened. Checkpoints let you re-enter the work.
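The checkpoint contents above can be sketched as a small serializable record. Field names are illustrative; the important property is that metadata stays small and queryable while bulky payloads live behind artifact references:

```python
import json
from dataclasses import asdict, dataclass, field

@dataclass
class Checkpoint:
    # Enough state to re-enter the trajectory without replaying the world.
    trace_id: str
    plan: list[str]
    open_subgoals: list[str]
    pending_approvals: list[str] = field(default_factory=list)
    resource_ids: dict[str, str] = field(default_factory=dict)
    artifact_refs: list[str] = field(default_factory=list)  # hashes/URIs, not blobs

    def serialize(self) -> str:
        return json.dumps(asdict(self))

    @staticmethod
    def resume(blob: str) -> "Checkpoint":
        return Checkpoint(**json.loads(blob))

ckpt = Checkpoint(trace_id="task-42",
                  plan=["fix secret mapping", "redeploy", "verify"],
                  open_subgoals=["verify"],
                  pending_approvals=["redeploy to production"])
restored = Checkpoint.resume(ckpt.serialize())
```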
Where this data should actually live
Once you buy the idea that trajectories are the raw material for memory, the next question is obvious: where do you put all of this stuff? The short answer is that it should not live in one place. Trajectories, artifacts, checkpoints, and distilled memory have very different shapes. Trying to force them into a single store usually produces something expensive, awkward, and hard to govern. A good system splits them by function.
A practical layout looks like this:
- trace store for execution structure
- run IDs, span IDs, timing, parent-child relationships, status, token and cost counters
- object store for bulky immutable artifacts
- prompts, model outputs, screenshots, files, diffs, tool payloads, serialized checkpoints
- relational database for indexed metadata
- run records, artifact metadata, retention tags, hashes, URIs, summaries, lineage edges
- checkpoint store for resume and time-travel
- usually checkpoint metadata in a database, payloads in object storage
- optional event log for fan-out and replay
- useful once ingestion, analytics, and downstream consumers need decoupling
That is not overengineering; it is just respecting the data model.
Execution traces want causal structure, artifacts want cheap durable blob storage, metadata wants indexed queries and transactions, and checkpoints want fast resume without pretending every intermediate byte belongs in your main database.
If you are already using OpenTelemetry-style traces, this split becomes pretty natural. Traces give you the execution graph. Postgres gives you practical indexing and retrieval. An S3-compatible object store gives you a cheap home for the big immutable payloads. If the system is small, you can stop there. If it grows, you can add an append-only event log for replay, fan-out, and recomputation.
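The split can be made concrete with a routing rule over record shapes. The store names and inline-size threshold are assumptions, not a prescribed layout:

```python
# Route records to stores by shape, not by age or origin.
INLINE_LIMIT_BYTES = 8 * 1024  # assumed cutoff for inlining payloads

def route(record: dict) -> str:
    kind = record["kind"]
    if kind == "span":
        return "trace_store"        # causal structure, timing, status
    if kind == "checkpoint":
        return "checkpoint_store"   # metadata in DB, payload in objects
    if kind == "artifact" or record.get("size_bytes", 0) > INLINE_LIMIT_BYTES:
        return "object_store"       # bulky immutable payloads
    return "relational_db"          # indexed metadata, lineage, retention tags
```

Usage: `route({"kind": "span"})` lands in the trace store, while a 10 MB artifact lands in object storage regardless of kind.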
The retention story also gets cleaner once these layers are separated.
- hot tier: recent runs, full traces, full artifacts, and dense checkpoints
- warm tier: compacted summaries, metadata, sparse checkpoints, colder object storage
- cold tier: final outputs, provenance anchors, hashes, and analytics-friendly archives
That lets you delete or redact sensitive payloads without destroying your indexes, your summaries, or your provenance graph.
Git is a ledger, not a warehouse
There is a tempting version of this idea where agent memory simply lives “alongside the code” in git. That instinct is not completely wrong, but it does need discipline.
Git is very good at a few things:
- storing small, reviewable, textual state
- linking information to commits, branches, and pull requests
- preserving durable decisions in a tamper-evident history
- carrying lightweight provenance close to the code it describes
Git is bad at a different set of things:
- high-frequency event streams
- raw token-by-token trajectories
- large binary artifacts
- mutable checkpoints
- retention-heavy telemetry
- query patterns that look more like analytics than version control
So the right pattern is not “put the whole run in git”. It is:
- keep code in git
- keep thin provenance in git
- keep heavy artifacts and fast-changing run data outside git
- link them together with commit SHAs, content hashes, run IDs, and artifact digests
That gives you the part people actually want when they say “store it in git”: code-adjacent memory with strong provenance. Useful things to keep in git include:
- agent task summaries
- evaluation deltas that affect merge decisions
- small provenance manifests
- pointers to artifact bundles
- occasional reviewed notes on what changed and why
Useful things to keep out of git include:
- raw trajectories
- screenshots and video traces
- embeddings and model checkpoints
- noisy telemetry
- ephemeral working state
There are a few patterns here that are worth mentioning explicitly.
- thin manifest in git, blob out of git
- probably the default answer for most teams
- git notes for commit-linked provenance
- good when you want metadata attached to a commit without polluting the tree
- sidecar branches or refs for coarse run summaries
- reasonable if you really want git transport and ACLs, but still not a great fit for high-volume logs
- content-addressed artifact refs
- a clean bridge between git history and external artifact storage
The useful mental model is simple: git is the ledger of durable decisions, not the RAM dump an agent leaves behind.
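The thin-manifest pattern can be sketched in a few lines. The manifest fields and the bucket path are illustrative; the point is that only digests and pointers enter git, never the blobs:

```python
import hashlib
import json

def manifest_for(commit_sha: str, run_id: str,
                 artifacts: dict[str, bytes]) -> str:
    # Hash each bulky artifact; the manifest committed to git carries only
    # content digests and pointers to external storage.
    entries = {
        name: {"sha256": hashlib.sha256(data).hexdigest(),
               # hypothetical bucket layout, swap in your artifact store
               "uri": f"s3://agent-artifacts/{run_id}/{name}"}
        for name, data in artifacts.items()
    }
    return json.dumps({"commit": commit_sha, "run": run_id,
                       "artifacts": entries}, indent=2, sort_keys=True)

manifest = manifest_for("3f2a9c1", "run-0042",
                        {"trajectory.jsonl": b"{}",
                         "deploy.log": b"rollout ok\n"})
```

Because entries are content-addressed, a reviewer can verify an external artifact against the digest recorded next to the commit that produced it.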
Privacy matters fast
Trajectory capture gets sensitive very quickly. Prompt text can contain internal documents. Shell output can leak secrets. Retrieved artifacts can include credentials, personal data, or proprietary code. Full transcripts are often more dangerous than people realize because they blend everything together.
So the memory pipeline needs structure and restraint:
- keep raw traces separate from distilled memory
- redact at multiple layers
- store bulky payloads as referenced artifacts, not inline everywhere
- use hashes and pointers where full payloads are not needed
- apply TTLs to raw artifacts if long-term retention is unnecessary
- avoid storing chain-of-thought by default
The goal is not total retention but useful retention.
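As one concrete layer, redaction before storage can start with pattern substitution. These patterns are illustrative only; real redaction needs a maintained ruleset plus entropy-based secret detection on top of regexes:

```python
import re

# Illustrative patterns, not a complete ruleset.
PATTERNS = [
    (re.compile(r"(?i)(api[_-]?key|token|secret)\s*[=:]\s*\S+"), r"\1=[REDACTED]"),
    (re.compile(r"AKIA[0-9A-Z]{16}"), "[REDACTED_AWS_KEY]"),
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[REDACTED_EMAIL]"),
]

def redact(text: str) -> str:
    # Applied before a payload is written to any store; raw traces that
    # bypass this layer should carry a TTL instead.
    for pattern, replacement in PATTERNS:
        text = pattern.sub(replacement, text)
    return text

clean = redact("export API_KEY=sk-live-12345 sent to ops@example.com")
```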
A practical architecture
If I were building this today, I would separate it into four layers.
1. Capture layer
This is where the agent emits structured execution data.
- instrument the agent loop
- emit a canonical event stream
- map that stream to traces, spans, events, and artifacts
- export through OpenTelemetry or a compatible pipeline
- store checkpoints alongside traces, not in some unrelated side channel
2. Processing layer
This is where raw trajectories turn into memory candidates.
- segment traces into episodes
- extract facts, lessons, and procedures
- score salience
- detect updates and contradictions
- attach provenance links back to the raw trajectory
3. Memory layer
This is where the outputs live in distinct stores.
- raw trace store for audit and replay
- episode store for task-shaped summaries
- fact memory store for durable assertions
- procedure store for reusable workflows
- vector index for semantic retrieval
- timeline or graph index for temporal and relational retrieval
4. Inference layer
This is where the system decides what to pull back in.
- classify the recall need
- retrieve semantically similar memories
- retrieve temporally nearby episodes
- retrieve linked entities, services, repos, or tools
- re-rank by salience and reliability
- assemble a compact, provenance-aware context
That is a much better shape than “stuff the whole chat into embeddings and hope.”
What this buys you
This approach is more work than summarising transcripts, but it buys you something concrete:
- better personalization
- fewer repeated mistakes
- stronger long-horizon continuity
- clearer debugging
- better post-hoc analysis
- more reliable human handoff
- actual procedural learning instead of vague “memory”
The same infrastructure that helps you debug an agent run can also help the agent remember what mattered.
The useful part is that observability and memory are not separate problems so much as adjacent views of the same underlying data.
The common failure modes
There are a few predictable ways to get this wrong:
- storing everything and retrieving noise
- over-summarising and losing edge cases
- treating token windows as episode boundaries
- embedding raw tool chatter and chain-of-thought indiscriminately
- having no delete or update path for stale memories
- ignoring contradiction handling
- relying on vector search alone
- losing provenance during summarisation
- never evaluating whether memory improved actual task performance
A lot of “agent memory” systems look impressive right up until you ask them to explain what changed, when it changed, and why the current belief should be trusted. That is also why traces matter: they preserve the path back to evidence.
The real point
The future of agent memory is not “remember the chat” but “understand the work,” and that means capturing trajectories instead of transcripts, segmenting them into episodes instead of token windows, and turning those episodes into facts, procedures, and lessons that can actually guide future behaviour.
Bigger context windows will help, better retrieval will help, and better summarisation will help, but none of those fix a bad substrate.
If the thing you store is a transcript, the memory system built on top of it will stay shallow, whereas if the thing you store is a structured record of work, memory starts to look less like a chatbot trick and more like a real system.
References
- OpenTelemetry traces: https://opentelemetry.io/docs/concepts/signals/traces/
- OpenTelemetry GenAI semantic conventions: https://opentelemetry.io/docs/specs/semconv/gen-ai/
- OpenInference / Phoenix tracing: https://arize.com/docs/phoenix/tracing/llm-traces
- LangSmith observability: https://docs.langchain.com/langsmith/observability
- Langfuse observability: https://langfuse.com/docs/observability/overview
- OpenAI Agents tracing: https://openai.github.io/openai-agents-python/tracing/
- OpenLLMetry: https://www.traceloop.com/docs/openllmetry/introduction
- Generative Agents: https://arxiv.org/abs/2304.03442
- MemoryBank: https://arxiv.org/abs/2305.10250
- MemGPT: https://arxiv.org/abs/2310.08560
- LoCoMo: https://arxiv.org/abs/2402.17753
- Mem0 research: https://mem0.ai/research