Bounded Execution and Selective Memory

TL;DR

Ambient, long-lived "agents" whose continuity lives in a chat buffer or a growing pile of previous outputs are convenient for demos and structurally hostile to everything that compounds.
The alternative is to model work as explicit, bounded, typed units (Objectives and Jobs), executed in discrete Runs that produce inspectable evidence (Steps, ToolCalls, Artefacts, Traces).
Context for each unit of work is assembled deliberately — preferably from pre-compiled, policy-stamped artefacts plus a minimal run-local delta — rather than inherited.
Memory is not "everything we have ever touched." It is the result of a deliberate promotion step that turns selected evidence into typed, provenance-backed, applicability-scoped knowledge objects.
Applicability (the coordinates under which a claim is intended to hold: repository, branch, commit, package version, environment) is not metadata. It is the primary mechanism that stops the system from serving the model plausible but false facts when the world has moved on.

If you only change one thing about how agentic work is structured, change the unit of continuity.

The Ambient Agent Anti-Pattern

The default mental model for a "powerful agent" is still a long-running entity with a persistent conversation or memory buffer. It wakes up, sees the current state of "its" context, does some work, appends the result, and goes back to sleep. Over time the buffer grows or gets summarized by the same mechanism that wrote it.

This shape has several fatal properties for any work whose value lasts longer than the current session:

Continuity is tied to the model's ability to interpret an ever-larger, ever-less-structured transcript.
There is no first-class boundary between one piece of work and the next. Dependencies, readiness, and completion are social conventions the model is supposed to maintain.
Evidence of what actually happened is mixed with the model's narrative of what it thinks it was doing.
Memory and state are the same thing: whatever survived the last compaction or whatever the model decided was worth writing down this time.

The system feels alive and continuous. It also becomes impossible to audit, impossible to resume cleanly after a policy change, and increasingly likely to act on conclusions that were true under conditions that no longer hold.

Bounded Units + Explicit Evidence

The corrective shape decomposes work along natural boundaries that exist independently of any one model invocation:

An Objective captures the intended change and completion criteria.
A Job is a bounded, typed, executable unit of work in service of exactly one Objective. It has a clear start state and exit condition.
A Run is one attempt to execute a Job. It is evidence, not planning structure.
Inside the Run are Steps, ToolCalls, Artefacts, and Traces — the observable record of what was done, with what inputs, producing what outputs.

The model (or an agent loop) operates inside a Run, given an explicit ContextPack. When the Run ends, its outputs become candidates for later use. The Job may spawn child jobs or depend on other jobs. The Objective is closed when the acceptance criteria are satisfied, not when a conversation feels done.

This is more structure than "just let the agent work on the ticket." It is also the only structure that lets you answer, later, "what was the exact context this action was based on, what policy was in force, what evidence was produced, and which of those outcomes were later promoted into durable knowledge?"

Explicit Context per Unit

Every Run receives a ContextPack that was assembled for it. The preferred path is to load a previously compiled artefact (the output of the context compiler for a matching purpose and applicability) and append only the run-local delta (new artefacts or observations since that compilation). Staged retrieval is a fallback for when no suitable compiled artefact exists yet, not the normal operating mode.

The model does not "have access to the workspace." It is given a deliberately shaped, policy-checked, provenance-carrying bundle for this specific execution.

This is the practical application of the compilation thesis at runtime. It is also what makes resumption, replay, and post-incident analysis tractable.

Evidence Before Narrative

Inside the Run, what happened is recorded by the shell:

Every tool invocation is a ToolCall with its arguments, result, and timing.
Significant outputs are Artefacts (with content-addressable storage, not just "the model mentioned a file").
The model's own messages and reasoning are part of the trace, but they are not the source of truth for what the system did.

Later, when someone (human or another process) wants to understand why a particular state exists or why a decision was made, they look at the evidence graph, not at the most fluent summary the model produced at the time.

This inverts the usual priority. The model is still allowed — encouraged — to be fluent and synthetic. The system around it is required to be boringly literal about what actually occurred.

Selective Memory Promotion

Not everything that happens in a Run deserves to become durable memory. Most of it is transient execution detail.

The act of promotion — turning a subset of the evidence into typed Memory objects (Decision, Fact, Pattern, Constraint, FailureInsight, TechnicalPlan, etc.) — is a distinct step with its own gates.

Promoted memory carries:

Provenance back to the Runs and Artefacts that grounded it.
Applicability scope (the coordinates under which the claim is asserted to hold).
Supersession and status lifecycle.

This is the deliberate, lossy compression step that turns an ever-growing execution trace into a queryable, trustworthy substrate for future work. It is also the place where human judgment (or another deterministic policy) gets to say "this conclusion is now foundational for us; treat it accordingly."

Without this step, "memory" is just more data the next agent has to wade through, with the same trustworthiness problems as the original execution.

Applicability Is Not Optional

In software delivery — and in most real domains — truth is conditional on state that changes faster than the calendar.

A fact that was true on feature/auth-v2 before the merge may be actively misleading on main. A constraint that holds for consumers of [email protected] does not automatically hold for 3.0. A design decision made for the reporting service may not apply to the new edge component.

A memory system that does not model applicability as a first-class dimension will, with high probability, cause future agents to act on the most recently asserted version of a claim even when the conditions under which it was asserted no longer obtain.

The retrieval policy must therefore prefer exact applicability match, then lineage-compatible matches, then explicit global knowledge — and must make the match class visible rather than silently blending incompatible truth.

This is not a niche requirement for one vertical. It is what stops the compounding flywheel from turning into a compounding source of plausible errors.

The Flywheel Only Spins When the Layers Are Present

With bounded jobs, explicit per-run context packs drawn from compiled artefacts, first-class evidence, gated memory promotion, and applicability-aware retrieval, the loop becomes:

Loading diagram…

Better compiled context and stronger evidence → higher-signal promoted memory → future jobs and runs start from a substrate that is both richer and more precisely scoped → the next round of evidence is higher quality still.

The system gets better at the actual work you care about, not merely better at producing longer, more confident traces.

The alternative — ambient agents + accumulating unstructured memory + retrieval as the only context discipline — produces the opposite curve. More data, more surface area for drift, more opportunities for the model to be locally brilliant while the overall state of the system slowly becomes less reliable.

That is the architectural difference between impressive agent demos and agentic systems that are still coherent and improving after a year of real use.

Part 5 of "The Deterministic Shell." (final post in the arc)

The primitives described here (Job as the executable boundary, Run as evidence, ContextPack assembled from compiled artefacts, Memory promotion as an explicit act, ApplicabilityContext as a core property of every claim) are the concrete shape of the deterministic shell in the control plane we have been building. They are what make the stochastic contributions usable rather than merely novel.

Bounded Execution and Selective Memory

TL;DR

Ambient, long-lived "agents" whose continuity lives in a chat buffer or a growing pile of previous outputs are convenient for demos and structurally hostile to everything that compounds.
The alternative is to model work as explicit, bounded, typed units (Objectives and Jobs), executed in discrete Runs that produce inspectable evidence (Steps, ToolCalls, Artefacts, Traces).
Context for each unit of work is assembled deliberately — preferably from pre-compiled, policy-stamped artefacts plus a minimal run-local delta — rather than inherited.
Memory is not "everything we have ever touched." It is the result of a deliberate promotion step that turns selected evidence into typed, provenance-backed, applicability-scoped knowledge objects.
Applicability (the coordinates under which a claim is intended to hold: repository, branch, commit, package version, environment) is not metadata. It is the primary mechanism that stops the system from serving the model plausible but false facts when the world has moved on.

If you only change one thing about how agentic work is structured, change the unit of continuity.

The Ambient Agent Anti-Pattern

This shape has several fatal properties for any work whose value lasts longer than the current session:

Continuity is tied to the model's ability to interpret an ever-larger, ever-less-structured transcript.
There is no first-class boundary between one piece of work and the next. Dependencies, readiness, and completion are social conventions the model is supposed to maintain.
Evidence of what actually happened is mixed with the model's narrative of what it thinks it was doing.
Memory and state are the same thing: whatever survived the last compaction or whatever the model decided was worth writing down this time.

Bounded Units + Explicit Evidence

The corrective shape decomposes work along natural boundaries that exist independently of any one model invocation:

An Objective captures the intended change and completion criteria.
A Job is a bounded, typed, executable unit of work in service of exactly one Objective. It has a clear start state and exit condition.
A Run is one attempt to execute a Job. It is evidence, not planning structure.
Inside the Run are Steps, ToolCalls, Artefacts, and Traces — the observable record of what was done, with what inputs, producing what outputs.

Explicit Context per Unit

The model does not "have access to the workspace." It is given a deliberately shaped, policy-checked, provenance-carrying bundle for this specific execution.

This is the practical application of the compilation thesis at runtime. It is also what makes resumption, replay, and post-incident analysis tractable.

Evidence Before Narrative

Inside the Run, what happened is recorded by the shell:

Every tool invocation is a ToolCall with its arguments, result, and timing.
Significant outputs are Artefacts (with content-addressable storage, not just "the model mentioned a file").
The model's own messages and reasoning are part of the trace, but they are not the source of truth for what the system did.

This inverts the usual priority. The model is still allowed — encouraged — to be fluent and synthetic. The system around it is required to be boringly literal about what actually occurred.

Selective Memory Promotion

Not everything that happens in a Run deserves to become durable memory. Most of it is transient execution detail.

Promoted memory carries:

Provenance back to the Runs and Artefacts that grounded it.
Applicability scope (the coordinates under which the claim is asserted to hold).
Supersession and status lifecycle.

Without this step, "memory" is just more data the next agent has to wade through, with the same trustworthiness problems as the original execution.

Applicability Is Not Optional

In software delivery — and in most real domains — truth is conditional on state that changes faster than the calendar.

This is not a niche requirement for one vertical. It is what stops the compounding flywheel from turning into a compounding source of plausible errors.

The Flywheel Only Spins When the Layers Are Present

With bounded jobs, explicit per-run context packs drawn from compiled artefacts, first-class evidence, gated memory promotion, and applicability-aware retrieval, the loop becomes:

Loading diagram…

The system gets better at the actual work you care about, not merely better at producing longer, more confident traces.

That is the architectural difference between impressive agent demos and agentic systems that are still coherent and improving after a year of real use.

Part 5 of "The Deterministic Shell." (final post in the arc)

Bounded Execution and Selective Memory

Bounded Execution and Selective Memory

TL;DR

The Ambient Agent Anti-Pattern

Bounded Units + Explicit Evidence

Explicit Context per Unit

Evidence Before Narrative

Selective Memory Promotion

Applicability Is Not Optional

The Flywheel Only Spins When the Layers Are Present

Matthew Gribben

Bounded Execution and Selective Memory

Bounded Execution and Selective Memory

TL;DR

The Ambient Agent Anti-Pattern

Bounded Units + Explicit Evidence

Explicit Context per Unit

Evidence Before Narrative

Selective Memory Promotion

Applicability Is Not Optional

The Flywheel Only Spins When the Layers Are Present

Matthew Gribben