Why agent memory should start with work trajectories, not conversations
Most agent memory systems still inherit a chatbot worldview, where you take a transcript, chunk it, embed it, maybe summarize it, and call that memory. That mostly works for preferences, biographies, and long support threads. It works far less well for actual work.
Work is not a transcript, and an agent completing a task does not just exchange messages. It inspects files, runs commands, hits APIs, retries after failures, asks for approval, gets blocked by permissions, switches plans, checkpoints state, and resumes later. The useful thing to remember is not the chat history that happened to surround that work, but the trajectory of the work itself.
That is the substrate agent memory should be built on: structured work trajectories, segmented into episodes, then distilled into durable memory.
Not bigger context windows, not “just store all the messages”, and not another vector index full of chat chunks with no causal structure.
Chat logs are the wrong substrate
A chat transcript is a lossy view of the task. Sometimes it contains too much: pleasantries, repeated instructions, partial thoughts, malformed attempts. Sometimes it contains too little: the real work happened in tools, files, shell commands, HTTP requests, or side effects outside the visible dialogue.
If an agent spends 20 minutes debugging a deployment, the transcript may only show:
- “checking logs”
- “found the issue”
- “fixed it”
- “redeployed”
That is useless as memory. What mattered was:
- which service failed
- which command exposed the issue
- what the error string was
- what change resolved it
- whether approval was needed
- what remained risky afterward
Raw chat history is optimized for turn-taking, while memory is optimized for reuse.
Those are different jobs, which is also why simply throwing more tokens at the prompt is not a real answer. Research on long-horizon memory keeps landing in roughly the same place: long context helps, but it does not solve retrieval, abstraction, salience, or consistency by itself. MemGPT framed this clearly by treating context as a constrained working-memory tier rather than pretending the whole problem goes away with a larger window. MemoryBank and later Mem0 make a similar point from the opposite direction: explicit memory extraction and management beat naïve replay on long conversational tasks. LoCoMo exists because very long-term conversational coherence is still hard, even when the full history is technically available.
The bottleneck is not just capacity but representation.
Agent work is a trajectory, not just a conversation
Distributed systems figured this out years ago. If you want to understand what happened during a request, you do not stare at an undifferentiated pile of logs; you capture a trace. A trace gives you a causal path through the system: spans for units of work, events for notable moments, timing, hierarchy, metadata, and links across boundaries.
OpenTelemetry’s trace model is a very good starting point for agents. Not because agents are web servers, but because both have the same core problem: reconstructing what happened across a sequence of dependent actions.
A practical mapping looks like this:
- a user task becomes a trace
- each agent step becomes a span
- prompt assembly, tool calls, approvals, retries, warnings, and exceptions become events
- bulky things like full prompts, command stdout, screenshots, fetched pages, diffs, or retrieved docs become artifacts
- state transitions and outputs become attributes
- cross-task dependencies become links
That model is already close to what observability tooling for LLM apps has converged on. OpenTelemetry defines the general trace structure. GenAI semantic conventions and adjacent efforts such as OpenInference add AI-specific fields for model calls, messages, tool use, retrieval, and token accounting. Phoenix, Langfuse, LangSmith, OpenAI’s tracing stack, and OpenLLMetry all circle the same idea: if you want to debug or understand an agent, capture a structured execution trace, not just the final answer.
That should not stop at debugging, because the same structure can become the source material for memory.
Traces are the raw material, not the final memory
A raw trajectory is still too detailed to stuff back into context, and if you store every span forever and retrieve them wholesale, you have built observability, not memory. Useful memory requires compression, segmentation, and selection.
This is where many systems get sloppy by treating memory as a uniform blob store. In practice, different kinds of memory should come out of a trajectory.
At minimum, you want three distinct products.
1. Episode summaries
These capture what happened during a bounded chunk of work. An episode summary is not “the chat from 2:00 to 2:15 PM.” It is something like:
Investigated failing deploy for api-worker. Root cause was missing env var after secret rotation. Verified via startup logs and local repro. Fixed secret mapping in Helm values, redeployed successfully, left note that staging still has drift.
This is compact, chronological, and tied to a task arc.
2. Durable facts
These are stable facts extracted from one or more episodes. Examples:
- api-worker depends on PAYMENTS_WEBHOOK_SECRET
- staging secrets drift from production after manual rotations
- the user prefers deploy confirmations with exact service names and commit SHAs
These should be versioned, attributable, and revocable. Facts are not just shorter summaries. They have a different shape and should survive many episode boundaries.
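A minimal sketch of that shape, assuming subject-predicate-object facts with episode IDs as provenance (the field names and `supersede` helper are illustrative):

```python
import time
from dataclasses import dataclass, field

@dataclass
class Fact:
    # A durable assertion distilled from one or more episodes.
    subject: str
    predicate: str
    obj: str
    version: int = 1
    sources: list[str] = field(default_factory=list)  # episode IDs (provenance)
    revoked: bool = False
    updated_at: float = field(default_factory=time.time)

    def supersede(self, new_obj: str, source: str) -> "Fact":
        # Revoke this version and return the replacement; history is kept,
        # so the system can explain when and why a belief changed.
        self.revoked = True
        return Fact(self.subject, self.predicate, new_obj,
                    version=self.version + 1,
                    sources=self.sources + [source])

fact = Fact("api-worker", "depends_on", "PAYMENTS_WEBHOOK_SECRET",
            sources=["ep-017"])
v2 = fact.supersede("PAYMENTS_WEBHOOK_SECRET_V2", "ep-031")
```

Revocation rather than deletion is the point: the old version stays attributable while the new one takes over.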
3. Lessons and procedures
This is the part most memory systems underbuild, because a lot of agent value comes from procedural learning:
- when you see error X, inspect file Y first
- this repo requires approval before touching production configs
- for this CLI, instrument the internal loop instead of scraping terminal output
- checkpoint before running migrations so resume can pick up cleanly after approval or failure
These are reusable policies, heuristics, and playbooks. They matter more than remembering that a particular conversation happened on Tuesday.
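One plausible record shape for a procedure, with usage counts so it can harden or decay over time (the fields and `reliability` heuristic are assumptions, not a standard):

```python
from dataclasses import dataclass

@dataclass
class Procedure:
    # A reusable playbook: when `trigger` matches, try `steps` in order.
    trigger: str           # e.g. an error signature or task pattern
    steps: list[str]
    requires_approval: bool = False
    uses: int = 0          # reinforcement: hardened each time it helps
    failures: int = 0

    def reliability(self) -> float:
        # Unproven procedures get a neutral prior rather than zero.
        total = self.uses + self.failures
        return self.uses / total if total else 0.5

playbook = Procedure(
    trigger="CrashLoopBackOff after secret rotation",
    steps=["inspect startup logs", "diff Helm values against secret store",
           "fix mapping", "redeploy"],
    requires_approval=True,
)
playbook.uses += 1
```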
Segment by episodes, not token windows
The natural boundary for memory is not “every 2,000 tokens” but the episode: a coherent slice of work with a purpose, local state, and some outcome.
Sometimes one trace maps to one episode. Sometimes a long trace has several. A bug hunt might split into initial triage, failed fix, second investigation, and final repair. A task resumed after six hours and a permission approval is probably not one uninterrupted episode just because the trace ID stayed constant.
Episode segmentation should use structural signals, not only textual similarity:
- task and subtask boundaries
- plan revisions
- long idle gaps
- approval gates
- tool-mode changes
- error bursts followed by strategy change
- checkpoint and resume markers
- explicit outcome transitions like blocked, completed, abandoned
This is one place where treating agent activity as a trace pays off. You already have timestamps, parent-child structure, event types, and artifacts. You are not guessing boundaries from chat prose alone.
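A sketch of structural segmentation over such a trace. The event kinds, the idle-gap threshold, and the boundary set are all illustrative assumptions:

```python
# Split a trace into episodes using structural signals rather than
# token counts or textual similarity.
IDLE_GAP_SECONDS = 30 * 60
BOUNDARY_KINDS = {"plan_revised", "approval_gate", "resume", "task_completed"}

def segment_episodes(events: list[tuple[float, str]]) -> list[list[tuple[float, str]]]:
    episodes: list[list[tuple[float, str]]] = []
    current: list[tuple[float, str]] = []
    last_ts: float | None = None
    for ts, kind in events:
        idle = last_ts is not None and ts - last_ts > IDLE_GAP_SECONDS
        if current and (idle or kind in BOUNDARY_KINDS):
            episodes.append(current)
            current = []
        current.append((ts, kind))
        last_ts = ts
    if current:
        episodes.append(current)
    return episodes

trace = [(0, "goal_received"), (60, "tool_call"), (120, "error"),
         (150, "plan_revised"), (200, "tool_call"),
         (2600, "resume"), (2630, "task_completed")]
```

Here the plan revision, the long idle gap, and the completion marker each open a new episode, with no guessing from chat prose.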
Salience scoring is not optional
Not everything deserves to become memory. This sounds obvious, but a lot of memory pipelines still behave like hoarders. If the system sees it, it stores it. Later it tries to retrieve from a junk drawer.
A decent salience model should ask at least:
- is this likely to matter again?
- is it stable or ephemeral?
- did it affect outcome, cost, latency, or correctness?
- was it surprising?
- did it require human approval or intervention?
- does it update an existing fact or contradict one?
- is it specific enough to retrieve usefully later?
The Generative Agents work used a combination of recency, relevance, and importance for retrieval and reflection. That basic shape still holds up. For work agents, “importance” should lean toward operational consequence: failures, fixes, irreversible actions, repeated patterns, and human overrides should score high. Verbose but low-stakes chatter should not.
You also want decay. MemoryBank’s use of reinforcement and forgetting is directionally right. Some memories should weaken unless they keep proving useful. Others, especially procedures or durable environment facts, should harden over time.
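A composite score in that shape might look like the following. The weights, half-life, and importance features are illustrative placeholders, not tuned values:

```python
# Salience = weighted recency + relevance + importance, Generative-Agents style,
# with importance biased toward operational consequence.
W_RECENCY, W_RELEVANCE, W_IMPORTANCE = 0.3, 0.3, 0.4
HALF_LIFE_DAYS = 14.0  # assumed decay rate

def recency(age_days: float) -> float:
    # Exponential decay; reinforcement elsewhere can reset the age.
    return 0.5 ** (age_days / HALF_LIFE_DAYS)

def importance(affected_outcome: bool, needed_approval: bool,
               surprising: bool, contradicts_existing: bool) -> float:
    score = 0.0
    score += 0.4 if affected_outcome else 0.0
    score += 0.3 if needed_approval else 0.0
    score += 0.2 if surprising else 0.0
    score += 0.1 if contradicts_existing else 0.0
    return score

def salience(age_days: float, relevance: float, imp: float) -> float:
    return (W_RECENCY * recency(age_days)
            + W_RELEVANCE * relevance
            + W_IMPORTANCE * imp)

fresh_fix = salience(0.0, 0.8, importance(True, True, False, False))
old_chatter = salience(60.0, 0.2, importance(False, False, False, False))
```

A fresh fix that affected the outcome and needed approval scores far above stale low-stakes chatter, which is exactly the ordering a retrieval layer should see.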
Retrieval should be hybrid: semantic, temporal, relational
Pure semantic search is not enough. If an agent is resuming a deploy issue from yesterday, time matters. If it is trying to reuse a prior fix in the same repo, relationships matter. If it is debugging an error message, lexical overlap matters. If it is handling “that thing we did after the OAuth redirect broke,” event chains matter.
So retrieval should combine at least three signals:
- semantic: embedding similarity against summaries, facts, procedures, artifacts
- temporal: recency, session proximity, task chronology, before/after relations
- relational: same repo, same service, same user, same tool, same trace lineage, same dependency graph
Mem0’s graph-flavoured direction is interesting here because some memories are naturally relational, not just textual. “Service A depends on secret B” is a graph edge. “This procedure supersedes the older one” is a graph edge. “This episode resolved the incident introduced by that change” is a graph edge.
A serious memory system should stop pretending a single vector database is enough.
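A toy scoring function that blends the three signals. The equal weighting, the seven-day half-life, and entity-overlap relational signal are all assumptions; a real system would re-rank on top of this:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_score(query_vec: list[float], mem_vec: list[float], age_days: float,
                 mem_entities: set[str], query_entities: set[str]) -> float:
    semantic = cosine(query_vec, mem_vec)
    temporal = 0.5 ** (age_days / 7.0)  # recency half-life, assumed
    overlap = len(mem_entities & query_entities)
    relational = overlap / max(len(query_entities), 1)
    # Equal weighting is a placeholder for a learned or tuned combination.
    return (semantic + temporal + relational) / 3.0

score = hybrid_score([1.0, 0.0], [0.9, 0.1], age_days=1.0,
                     mem_entities={"api-worker", "helm"},
                     query_entities={"api-worker"})
```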
For terminal-first agents, instrument the loop, not just the terminal
This matters more than people think. If you are building terminal-first agents or CLIs like Pi, the temptation is to capture terminal output and call it observability. That gives you a recording rather than an understanding.
The interesting structure usually lives one layer deeper, inside the agent loop:
- goal received
- plan updated
- tool selected
- command prepared
- approval requested
- command executed
- output classified
- retry policy triggered
- checkpoint saved
- memory write proposed
- resume invoked
If you only instrument stdin and stdout around the shell, you miss intent, decision points, and state transitions. You also make privacy and redaction much harder because the terminal stream is a giant mixed blob.
Instrument the runtime loop itself, then attach terminal I/O as artifacts or span events where needed. You want the shell transcript available, but not as the primary semantic object.
TTY capture is good for demos. Structured events are what make debugging, memory, and recovery actually work.
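A minimal sketch of what instrumenting the loop rather than the terminal looks like. The event names and the in-memory `EVENTS` list stand in for a real exporter; the key design choice is that terminal output enters only as a content-addressed artifact reference:

```python
import hashlib
import time

EVENTS: list[dict] = []  # stand-in for an exporter / event bus

def emit(kind: str, **attrs) -> None:
    # One structured record per loop transition; terminal output is not
    # inlined, only referenced by hash, so redaction stays tractable.
    EVENTS.append({"ts": time.time(), "kind": kind, **attrs})

def attach_terminal_output(raw: bytes) -> str:
    digest = hashlib.sha256(raw).hexdigest()
    # A real system would upload `raw` to object storage keyed by digest.
    emit("artifact_attached", sha256=digest, size_bytes=len(raw))
    return digest

emit("goal_received", goal="fix failing deploy")
emit("tool_selected", tool="kubectl")
emit("approval_requested", command="kubectl rollout restart deploy/api-worker")
digest = attach_terminal_output(b"deployment.apps/api-worker restarted\n")
emit("command_executed", exit_code=0, stdout_sha256=digest)
```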
Checkpoints and resume are part of memory
A lot of useful agent work is interrupted. The user goes away. Approval is pending. The network dies. A long build is still running. The model chooses to stop and ask. If the agent cannot checkpoint and resume cleanly, memory becomes partly fictional because the system reconstructs continuity from whatever text it can find.
Checkpointing should capture enough state to continue a trajectory without replaying the world:
- current plan
- open subgoals
- pending approvals
- relevant handles or resource IDs
- partial outputs
- working files and diffs
- tool state
- cursor positions in long-running tasks
This is one of the places LangGraph’s persistence model is worth paying attention to. It treats checkpointing as a first-class part of agent execution rather than a convenience feature bolted on later. For coding agents and operator-style assistants, that is the right instinct.
Observability tells you what happened. Checkpoints let you re-enter the work.
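The checkpoint contents above can be sketched as a small serializable record. Field names are illustrative; the important property is that metadata stays small and queryable while bulky payloads live behind artifact references:

```python
import json
from dataclasses import asdict, dataclass, field

@dataclass
class Checkpoint:
    # Enough state to re-enter the trajectory without replaying the world.
    trace_id: str
    plan: list[str]
    open_subgoals: list[str]
    pending_approvals: list[str] = field(default_factory=list)
    resource_ids: dict[str, str] = field(default_factory=dict)
    artifact_refs: list[str] = field(default_factory=list)  # hashes/URIs, not blobs

    def serialize(self) -> str:
        return json.dumps(asdict(self))

    @staticmethod
    def resume(blob: str) -> "Checkpoint":
        return Checkpoint(**json.loads(blob))

ckpt = Checkpoint(trace_id="task-42",
                  plan=["fix secret mapping", "redeploy", "verify"],
                  open_subgoals=["verify"],
                  pending_approvals=["redeploy to production"])
restored = Checkpoint.resume(ckpt.serialize())
```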
Where this data should actually live
Once you buy the idea that trajectories are the raw material for memory, the next question is obvious: where do you put all of this stuff? The short answer is that it should not live in one place. Trajectories, artifacts, checkpoints, and distilled memory have very different shapes. Trying to force them into a single store usually produces something expensive, awkward, and hard to govern. A good system splits them by function.
A practical layout looks like this:
- trace store for execution structure
- run IDs, span IDs, timing, parent-child relationships, status, token and cost counters
- object store for bulky immutable artifacts
- prompts, model outputs, screenshots, files, diffs, tool payloads, serialized checkpoints
- relational database for indexed metadata
- run records, artifact metadata, retention tags, hashes, URIs, summaries, lineage edges
- checkpoint store for resume and time-travel
- usually checkpoint metadata in a database, payloads in object storage
- optional event log for fan-out and replay
- useful once ingestion, analytics, and downstream consumers need decoupling
That is not overengineering; it is just respecting the data model.
Execution traces want causal structure, artifacts want cheap durable blob storage, metadata wants indexed queries and transactions, and checkpoints want fast resume without pretending every intermediate byte belongs in your main database.
If you are already using OpenTelemetry-style traces, this split becomes pretty natural. Traces give you the execution graph. Postgres gives you practical indexing and retrieval. An S3-compatible object store gives you a cheap home for the big immutable payloads. If the system is small, you can stop there. If it grows, you can add an append-only event log for replay, fan-out, and recomputation.
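The split can be made concrete with a routing rule over record shapes. The store names and inline-size threshold are assumptions, not a prescribed layout:

```python
# Route records to stores by shape, not by age or origin.
INLINE_LIMIT_BYTES = 8 * 1024  # assumed cutoff for inlining payloads

def route(record: dict) -> str:
    kind = record["kind"]
    if kind == "span":
        return "trace_store"        # causal structure, timing, status
    if kind == "checkpoint":
        return "checkpoint_store"   # metadata in DB, payload in objects
    if kind == "artifact" or record.get("size_bytes", 0) > INLINE_LIMIT_BYTES:
        return "object_store"       # bulky immutable payloads
    return "relational_db"          # indexed metadata, lineage, retention tags
```

Usage: `route({"kind": "span"})` lands in the trace store, while a 10 MB artifact lands in object storage regardless of kind.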
The retention story also gets cleaner once these layers are separated.
- hot tier: recent runs, full traces, full artifacts, and dense checkpoints
- warm tier: compacted summaries, metadata, sparse checkpoints, colder object storage
- cold tier: final outputs, provenance anchors, hashes, and analytics-friendly archives
That lets you delete or redact sensitive payloads without destroying your indexes, your summaries, or your provenance graph.
Git is a ledger, not a warehouse
There is a tempting version of this idea where agent memory simply lives “alongside the code” in git. That instinct is not completely wrong, but it does need discipline.
Git is very good at a few things:
- storing small, reviewable, textual state
- linking information to commits, branches, and pull requests
- preserving durable decisions in a tamper-evident history
- carrying lightweight provenance close to the code it describes
Git is bad at a different set of things:
- high-frequency event streams
- raw token-by-token trajectories
- large binary artifacts
- mutable checkpoints
- retention-heavy telemetry
- query patterns that look more like analytics than version control
So the right pattern is not “put the whole run in git”. It is:
- keep code in git
- keep thin provenance in git
- keep heavy artifacts and fast-changing run data outside git
- link them together with commit SHAs, content hashes, run IDs, and artifact digests
That gives you the part people actually want when they say “store it in git”: code-adjacent memory with strong provenance. Useful things to keep in git include:
- agent task summaries
- evaluation deltas that affect merge decisions
- small provenance manifests
- pointers to artifact bundles
- occasional reviewed notes on what changed and why
Useful things to keep out of git include:
- raw trajectories
- screenshots and video traces
- embeddings and model checkpoints
- noisy telemetry
- ephemeral working state
There are a few patterns here that are worth mentioning explicitly.
- thin manifest in git, blob out of git
- probably the default answer for most teams
- git notes for commit-linked provenance
- good when you want metadata attached to a commit without polluting the tree
- sidecar branches or refs for coarse run summaries
- reasonable if you really want git transport and ACLs, but still not a great fit for high-volume logs
- content-addressed artifact refs
- a clean bridge between git history and external artifact storage
The useful mental model is simple: git is the ledger of durable decisions, not the RAM dump an agent leaves behind.
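The thin-manifest pattern can be sketched in a few lines. The manifest fields and the bucket path are illustrative; the point is that only digests and pointers enter git, never the blobs:

```python
import hashlib
import json

def manifest_for(commit_sha: str, run_id: str,
                 artifacts: dict[str, bytes]) -> str:
    # Hash each bulky artifact; the manifest committed to git carries only
    # content digests and pointers to external storage.
    entries = {
        name: {"sha256": hashlib.sha256(data).hexdigest(),
               # hypothetical bucket layout, swap in your artifact store
               "uri": f"s3://agent-artifacts/{run_id}/{name}"}
        for name, data in artifacts.items()
    }
    return json.dumps({"commit": commit_sha, "run": run_id,
                       "artifacts": entries}, indent=2, sort_keys=True)

manifest = manifest_for("3f2a9c1", "run-0042",
                        {"trajectory.jsonl": b"{}",
                         "deploy.log": b"rollout ok\n"})
```

Because entries are content-addressed, a reviewer can verify an external artifact against the digest recorded next to the commit that produced it.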
Privacy matters fast
Trajectory capture gets sensitive very quickly. Prompt text can contain internal documents. Shell output can leak secrets. Retrieved artifacts can include credentials, personal data, or proprietary code. Full transcripts are often more dangerous than people realize because they blend everything together.
So the memory pipeline needs structure and restraint:
- keep raw traces separate from distilled memory
- redact at multiple layers
- store bulky payloads as referenced artifacts, not inline everywhere
- use hashes and pointers where full payloads are not needed
- apply TTLs to raw artifacts if long-term retention is unnecessary
- avoid storing chain-of-thought by default
The goal is not total retention but useful retention.
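As one concrete layer, redaction before storage can start with pattern substitution. These patterns are illustrative only; real redaction needs a maintained ruleset plus entropy-based secret detection on top of regexes:

```python
import re

# Illustrative patterns, not a complete ruleset.
PATTERNS = [
    (re.compile(r"(?i)(api[_-]?key|token|secret)\s*[=:]\s*\S+"), r"\1=[REDACTED]"),
    (re.compile(r"AKIA[0-9A-Z]{16}"), "[REDACTED_AWS_KEY]"),
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[REDACTED_EMAIL]"),
]

def redact(text: str) -> str:
    # Applied before a payload is written to any store; raw traces that
    # bypass this layer should carry a TTL instead.
    for pattern, replacement in PATTERNS:
        text = pattern.sub(replacement, text)
    return text

clean = redact("export API_KEY=sk-live-12345 sent to ops@example.com")
```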
A practical architecture
If I were building this today, I would separate it into four layers.
1. Capture layer
This is where the agent emits structured execution data.
- instrument the agent loop
- emit a canonical event stream
- map that stream to traces, spans, events, and artifacts
- export through OpenTelemetry or a compatible pipeline
- store checkpoints alongside traces, not in some unrelated side channel
2. Processing layer
This is where raw trajectories turn into memory candidates.
- segment traces into episodes
- extract facts, lessons, and procedures
- score salience
- detect updates and contradictions
- attach provenance links back to the raw trajectory
3. Memory layer
This is where the outputs live in distinct stores.
- raw trace store for audit and replay
- episode store for task-shaped summaries
- fact memory store for durable assertions
- procedure store for reusable workflows
- vector index for semantic retrieval
- timeline or graph index for temporal and relational retrieval
4. Inference layer
This is where the system decides what to pull back in.
- classify the recall need
- retrieve semantically similar memories
- retrieve temporally nearby episodes
- retrieve linked entities, services, repos, or tools
- re-rank by salience and reliability
- assemble a compact, provenance-aware context
That is a much better shape than “stuff the whole chat into embeddings and hope.”
What this buys you
This approach is more work than summarising transcripts, but it buys you something concrete:
- better personalization
- fewer repeated mistakes
- stronger long-horizon continuity
- clearer debugging
- better post-hoc analysis
- more reliable human handoff
- actual procedural learning instead of vague “memory”
The same infrastructure that helps you debug an agent run can also help the agent remember what mattered.
The useful part is that observability and memory are not separate problems so much as adjacent views of the same underlying data.
The common failure modes
There are a few predictable ways to get this wrong:
- storing everything and retrieving noise
- over-summarising and losing edge cases
- treating token windows as episode boundaries
- embedding raw tool chatter and chain-of-thought indiscriminately
- having no delete or update path for stale memories
- ignoring contradiction handling
- relying on vector search alone
- losing provenance during summarisation
- never evaluating whether memory improved actual task performance
A lot of “agent memory” systems look impressive right up until you ask them to explain what changed, when it changed, and why the current belief should be trusted. That is also why traces matter: they preserve the path back to evidence.
The real point
The future of agent memory is not “remember the chat” but “understand the work,” and that means capturing trajectories instead of transcripts, segmenting them into episodes instead of token windows, and turning those episodes into facts, procedures, and lessons that can actually guide future behaviour.
Bigger context windows will help, better retrieval will help, and better summarisation will help, but none of those fix a bad substrate.
If the thing you store is a transcript, the memory system built on top of it will stay shallow, whereas if the thing you store is a structured record of work, memory starts to look less like a chatbot trick and more like a real system.
References
- OpenTelemetry traces: https://opentelemetry.io/docs/concepts/signals/traces/
- OpenTelemetry GenAI semantic conventions: https://opentelemetry.io/docs/specs/semconv/gen-ai/
- OpenInference / Phoenix tracing: https://arize.com/docs/phoenix/tracing/llm-traces
- LangSmith observability: https://docs.langchain.com/langsmith/observability
- Langfuse observability: https://langfuse.com/docs/observability/overview
- OpenAI Agents tracing: https://openai.github.io/openai-agents-python/tracing/
- OpenLLMetry: https://www.traceloop.com/docs/openllmetry/introduction
- Generative Agents: https://arxiv.org/abs/2304.03442
- MemoryBank: https://arxiv.org/abs/2305.10250
- MemGPT: https://arxiv.org/abs/2310.08560
- LoCoMo: https://arxiv.org/abs/2402.17753
- Mem0 research: https://mem0.ai/research