The Overstochastic Default
TL;DR
- The industry is in a gold rush on the stochastic layer: better models, cleverer prompting, longer contexts, richer tool use, more elaborate "agent frameworks."
- Very little comparable investment is going into the deterministic machinery required to make any of that useful beyond a single impressive trace.
- LLMs are powerful exactly because they are stochastic. That is also why they make a poor foundation for reliable, auditable, long-lived behaviour. This is not a temporary limitation of current models. Treating them as that foundation is a category error.
- The systems that will compound are the ones that treat the LLM as an extraordinarily capable component inside a real control plane, not as the control plane itself.
Everyone building agents right now is optimizing the part that is easiest to demo and hardest to trust.
We have an explosion of runtimes, skill libraries, multi-agent orchestration frameworks, memory stores, and "agent operating systems." The conversation is almost entirely about making the model do more, remember more, call tools more reliably, or stay coherent over longer horizons.
This is not a bad thing. The stochastic engine is genuinely impressive. The problem is what we are not building at the same rate: the surrounding deterministic system that decides what the model is allowed to see, what it is allowed to do, what counts as evidence, what survives across time, and what must be re-verified before it is allowed to influence the next decision.
Call it the overstochastic default: the assumption, usually implicit, that if we just make the LLM half of the system better and more convenient, the rest of the properties we care about (safety, auditability, repeatability, correct composition over months, graceful degradation) will emerge as pleasant side effects.
They don't.
The Local Maximum
The current pattern looks like this:
You have a powerful model. You give it tools. You give it "memory" (usually retrieval over some growing pile of previous outputs or documents). You write some system instructions or a set of skills (frequently in markdown). You add a loop, a graph, or a cron job that wakes the thing up. You add human approval gates where the risk feels obvious.
On a good day with a well-scoped task and fresh context, this produces magic. The model proposes the right decomposition, calls the right tools in the right order, and writes a reasonable summary.
On a normal day, six weeks later, the magic is harder to find. The markdown skills have drifted from the actual behaviour of the tools they describe. The accumulated "memory" contains statements that were true on a different branch, for a different version, under different constraints. The long-running "agent" has state scattered across chat histories, sidecar notes, and whatever the last successful run happened to write to disk. An approval happened, but nobody can easily point to the exact compiled context and policy version that was in force when the decision was made.
The system hasn't become less intelligent. The model is probably better. What has happened is that the deterministic invariants the work actually depends on were never first-class.
This is not a prompt engineering problem. It is an architecture problem.
Stochastic Is a Feature, Not a Bug in the Wrong Place
The power of large language models comes from their ability to operate productively in the space of uncertainty, partial information, and conflicting signals. They are excellent at proposing plans, extracting structure from messy input, synthesizing across sources, and doing useful work when the correct next action is not mechanically derivable from the inputs.
Those same properties make them poor foundations for the parts of a system that must be reliable by construction:
- Maintaining the boundary between "what this agent is allowed to know for this purpose" and everything else.
- Enforcing that a tool is only ever called with arguments that match its declared contract.
- Guaranteeing that provenance is preserved when claims move from source artefact through multiple transformations into a final decision.
- Knowing, definitively, which previous conclusions are still valid under the current branch, commit, package version, and policy set.
- Producing an audit trail that can be explained without asking the model to narrate what it thinks it did.
When you ask the stochastic component to also perform these functions, you are not "leveraging the model." You are using the part of the model that makes it useful for one set of problems to paper over the absence of real machinery for a different set.
The result is systems whose correctness depends on the model staying lucky in exactly the same way every time the context shifts.
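To make the tool-contract point concrete, here is a minimal sketch of a check that runs outside the model. Everything in it is an assumption made for illustration: the names `ToolContract`, `ContractViolation`, and `dispatch`, the proposal shape (`{"tool": ..., "args": ...}`), and the choice to model a contract as exact argument names with exact primitive types. A real contract would likely be richer (ranges, enums, path allowlists).

```python
from dataclasses import dataclass
from typing import Any, Callable, Mapping


class ContractViolation(Exception):
    """Raised when a proposed tool call does not match its declared contract."""


@dataclass(frozen=True)
class ToolContract:
    """Hypothetical contract: exact argument names and exact primitive types."""
    name: str
    params: Mapping[str, type]
    handler: Callable[..., Any]

    def check(self, args: Mapping[str, Any]) -> None:
        unknown = sorted(set(args) - set(self.params))
        missing = sorted(set(self.params) - set(args))
        if unknown or missing:
            raise ContractViolation(
                f"{self.name}: unknown={unknown} missing={missing}"
            )
        for key, expected in self.params.items():
            # Exact type match, so True is not silently accepted as an int.
            if type(args[key]) is not expected:
                raise ContractViolation(
                    f"{self.name}.{key}: expected {expected.__name__}, "
                    f"got {type(args[key]).__name__}"
                )


def dispatch(registry: Mapping[str, ToolContract], proposal: Mapping[str, Any]) -> Any:
    """Run a model-proposed call only if it passes the contract check."""
    contract = registry.get(proposal.get("tool"))
    if contract is None:
        raise ContractViolation(f"unknown tool: {proposal.get('tool')!r}")
    args = proposal.get("args", {})
    contract.check(args)
    return contract.handler(**args)


# Illustrative registry; the tool and its handler are made up for this sketch.
registry = {
    "read_file": ToolContract(
        name="read_file",
        params={"path": str, "max_bytes": int},
        handler=lambda path, max_bytes: f"would read {max_bytes} bytes of {path}",
    )
}
```

The model can propose anything it likes. Whether the call happens is decided by `dispatch`, which behaves the same way on every run regardless of what the prompt said.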
What "Deterministic Shell" Actually Means
The phrase is not aesthetic. It describes a concrete set of responsibilities that sit around the stochastic core:
- Explicit, typed boundaries for what constitutes a unit of work.
- First-class representation of intent, execution, and evidence as separate concerns.
- A compilation or assembly process for context that can enforce invariants (provenance, policy, contradictions) before the model ever sees the tokens.
- Tool contracts that are real contracts, not suggestions the model is supposed to remember to follow.
- Governance that is evaluated by the system, not performed by the model describing its own behaviour.
- Memory that is selectively promoted from evidence rather than accumulated indiscriminately.
- Applicability and temporal scoping treated as first-class properties of knowledge, not footnotes.
None of these are "the model will be better at this next year" problems. They are systems problems. Solving them requires the same kind of deliberate, often unglamorous engineering that went into operating systems, type systems, database transaction boundaries, and configuration management.
The people who treat these as afterthoughts (or as something that can be approximated by writing more detailed instructions in markdown) are optimizing for the local maximum of "the demo looked incredible."
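As a rough illustration of "a compilation or assembly process for context that can enforce invariants," the sketch below rejects a context before any tokens are produced if a claim lacks provenance, carries a label the policy forbids, or contradicts another claim. The types `Claim` and `CompiledContext`, the function `compile_context`, the label scheme, and the example file paths are all hypothetical; this is not the kernel described in this series, only a toy with the same shape.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Iterable, Optional


class CompileError(Exception):
    """Raised when a context invariant fails before the model sees anything."""


@dataclass(frozen=True)
class Claim:
    key: str                       # what the claim is about, e.g. "db.engine"
    value: str
    source: Optional[str]          # provenance: where the claim came from
    labels: frozenset = frozenset()


@dataclass(frozen=True)
class CompiledContext:
    policy_version: str
    claims: tuple
    digest: str                    # stable identifier for audit records


def compile_context(
    claims: Iterable[Claim], allowed_labels: frozenset, policy_version: str
) -> CompiledContext:
    seen = {}
    for claim in claims:
        if not claim.source:
            raise CompileError(f"missing provenance: {claim.key}")
        if not claim.labels <= allowed_labels:
            raise CompileError(f"policy forbids labels on: {claim.key}")
        prior = seen.get(claim.key)
        if prior is not None and prior.value != claim.value:
            raise CompileError(f"contradiction on: {claim.key}")
        seen[claim.key] = claim
    # Sort so the same set of claims always yields the same artefact.
    ordered = tuple(sorted(seen.values(), key=lambda c: c.key))
    payload = json.dumps(
        {
            "policy": policy_version,
            "claims": [[c.key, c.value, c.source] for c in ordered],
        },
        sort_keys=True,
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return CompiledContext(policy_version, ordered, digest)


# Illustrative inputs; the paths are invented for the example.
context = compile_context(
    [
        Claim("runtime", "python3.12", "pyproject.toml"),
        Claim("db.engine", "postgres", "docs/adr-7.md"),
    ],
    allowed_labels=frozenset({"public"}),
    policy_version="policy-v1",
)
```

The digest is the useful by-product: because the artefact is ordered and stamped with a policy version, "which context was in force" becomes a lookup rather than a reconstruction.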
The Expensive Work Nobody Wants to Do
Building the deterministic half is expensive in the short term. It means:
- Defining real aggregates and lifecycles instead of letting everything live in one long context window.
- Implementing lowering pipelines and invariant checks instead of "just embed it."
- Making approval and confirmation first-class domain events with their own state machines instead of UI buttons that set a flag.
- Treating context as a compiled, versioned, policy-stamped artefact rather than whatever the retrieval layer happened to return this time.
- Accepting that some of the most valuable work an agent does will produce structured evidence and candidate memory that a human or another deterministic process must still ratify before it becomes durable truth.
This work is invisible in a product demo. It shows up months later when the system is still coherent, when you can explain why a particular action was taken, when a change in policy or a branch merge does not silently poison future reasoning, and when the "agent" gets better at the actual work you care about instead of just getting better at sounding plausible.
Most teams are not doing it because the reward function of the current moment heavily favours the stochastic surface.
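One way to picture "approval as a first-class domain event with its own state machine" is the sketch below. The states, the transition table, and the rule that an approval must name the same context digest and policy version as the original request are assumptions chosen for illustration, not a description of any particular system.

```python
from dataclasses import dataclass
from enum import Enum


class ApprovalState(Enum):
    REQUESTED = "requested"
    APPROVED = "approved"
    REJECTED = "rejected"
    EXPIRED = "expired"


# Hypothetical transition table: anything not listed here is illegal.
_ALLOWED = {
    ApprovalState.REQUESTED: {
        ApprovalState.APPROVED,
        ApprovalState.REJECTED,
        ApprovalState.EXPIRED,
    },
    ApprovalState.APPROVED: {ApprovalState.EXPIRED},
    ApprovalState.REJECTED: set(),
    ApprovalState.EXPIRED: set(),
}


@dataclass(frozen=True)
class ApprovalEvent:
    state: ApprovalState
    actor: str
    context_digest: str   # which compiled context the actor was looking at
    policy_version: str   # which policy set was in force


class Approval:
    """Append-only record of an approval's lifecycle."""

    def __init__(self, actor: str, context_digest: str, policy_version: str):
        self._events = [
            ApprovalEvent(
                ApprovalState.REQUESTED, actor, context_digest, policy_version
            )
        ]

    @property
    def state(self) -> ApprovalState:
        return self._events[-1].state

    @property
    def history(self) -> tuple:
        return tuple(self._events)

    def transition(
        self,
        new_state: ApprovalState,
        actor: str,
        context_digest: str,
        policy_version: str,
    ) -> None:
        if new_state not in _ALLOWED[self.state]:
            raise ValueError(f"illegal transition: {self.state} -> {new_state}")
        request = self._events[0]
        if new_state is ApprovalState.APPROVED and (
            context_digest,
            policy_version,
        ) != (request.context_digest, request.policy_version):
            # Approving against a different context or policy is a new decision.
            raise ValueError("approval does not match the requested context/policy")
        self._events.append(
            ApprovalEvent(new_state, actor, context_digest, policy_version)
        )
```

Compared with a UI button that sets a flag, the difference is that the history can answer "who approved what, against which context, under which policy" without asking anyone to remember.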
We Started from the Other End
The work that produced the context compilation kernel and the control plane for agentic software delivery began with a different question: what would have to be true, structurally, for agentic work to remain trustworthy and compounding over long periods instead of decaying after the first few impressive traces?
The answers led to a small number of non-negotiable ideas:
- Context must be a compiled artefact with a typed intermediate representation, not assembled at prompt time from whatever is convenient.
- Governance of what may enter context is a compile-time concern, distinct from governance of what actions may be taken at runtime.
- Work must be decomposed into bounded, typed, executable units that carry explicit context rather than inheriting ambient history.
- Execution must produce inspectable evidence. That evidence is the input to selective memory promotion.
- Truth in software systems is usually scoped (by repository, branch, commit, version, environment). Any memory system that does not treat applicability as first-class will eventually serve the model plausible but incorrect facts.
These are not constraints we placed on the model. They are the deterministic machinery we placed around the model so that its stochastic strengths could be used safely and repeatedly.
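The last two ideas, evidence-gated promotion and scoped truth, can be sketched in a few lines. The names `Scope`, `Memory`, `promote`, and `applicable` are hypothetical, and the scope is reduced to repository, branch, and version for brevity; a fuller version would presumably also cover commit and environment.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Set


@dataclass(frozen=True)
class Scope:
    repository: str
    branch: Optional[str] = None    # None means "any branch"
    version: Optional[str] = None   # None means "any version"

    def covers(self, current: "Scope") -> bool:
        """True if a fact recorded under this scope applies to `current`."""
        return (
            self.repository == current.repository
            and self.branch in (None, current.branch)
            and self.version in (None, current.version)
        )


@dataclass(frozen=True)
class Memory:
    statement: str
    scope: Scope
    evidence_id: Optional[str]      # the evidence this memory was derived from


def promote(store: List[Memory], candidate: Memory, ratified: Set[str]) -> bool:
    """Make a candidate durable only if its evidence has been ratified."""
    if candidate.evidence_id is None or candidate.evidence_id not in ratified:
        return False
    store.append(candidate)
    return True


def applicable(store: Sequence[Memory], current: Scope) -> List[Memory]:
    """Select only memories whose scope covers the current working scope."""
    return [m for m in store if m.scope.covers(current)]
```

With this shape, a statement recorded on a feature branch is never offered when the current scope is `main`, however semantically similar it looks to the query. The filter is a property of the data, not of the retrieval ranking.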
The rest of this series unpacks what that machinery looks like in practice and why the popular alternatives keep producing systems that rot.
Next: why markdown-defined workflows, pure LLM knowledge bases, and similar patterns are not merely immature implementations of the right idea. They are structurally mismatched to the problem.
This is Part 1 of "The Deterministic Shell."
The views here are shaped by actually building the layers: a context compilation kernel that treats provenance, policy, and graph invariants as first-class during lowering, and a control plane that models intent, bounded execution, evidence, and promoted memory as separate first-class concerns. The argument is not "the model isn't good enough yet." The argument is that we have been building on the exciting half while the necessary half was left as an exercise for the reader.