Build a Deterministic AI Agent With Structural Gates

TL;DR: A deterministic AI agent gets its reliability from structure, not from a better prompt. You move the control flow into graph edges the model cannot override, add a critic gate that checks status instead of asking the model to self-check, checkpoint every step so a crash resumes cleanly, and constrain output with a grammar instead of a retry loop. This guide shows the exact patterns and code.

The phrase “deterministic AI agent” sounds like a contradiction, since the model underneath is probabilistic. The trick is that you do not make the model deterministic. You take the important decisions away from it.

I have lost more hours than I want to admit to agents that worked in a demo and fell apart in production. The fix that finally stuck was not a smarter prompt. It was moving the decisions that mattered out of the prompt entirely and into code the model has no vote in.

This guide walks through the four structural patterns that do that: a critic gate the model cannot talk its way past, phase isolation with checkpoints, grammar-constrained output, and a spec that drives the whole thing.

Each section has working code you can adapt. Run through all four and you will stop relying on the model to behave and start enforcing it.

Why Prompt-Based Agents Drift

Prompt-based agents drift because the prompt is a suggestion, and the model is free to ignore it the moment it gets confident or hyper-focused.

You can write “always run the tests before finishing” ten different ways and still watch the agent skip the step.

The problem gets worse on smaller local models. A frontier model can usually self-determine when to stop researching and start writing. A 30B-class local model like Qwen3-30B lacks the thinking scope to redirect itself, so it gets hyper-focused and single-minded, repeating the same tool call or barreling past a failed check.

What I have come to believe is that this is not a prompt-quality problem. It is a control-flow problem. If the decision to proceed lives inside the model’s reasoning, the decision is only as reliable as the model on its worst run.

The reasons agents fail in production almost always trace back to a decision that should have been structural sitting in the prompt instead.

Move the Control Flow Out of the Prompt

Determinism starts when the code, not the model, owns the sequence of steps, the branching, and the error handling. The model becomes one modular step inside a flow it does not orchestrate.

The way I think about it now is a role swap. In a prompt-driven agent, the model is the manager deciding what happens next. In a deterministic agent, the model is a worker that gets handed one well-scoped task, returns a result, and the surrounding code decides what happens next.

A graph framework like LangGraph makes this concrete: nodes are steps, edges are the allowed transitions, and the model runs inside a node rather than choosing the edges.

If you would rather not write graph code, a low-code platform like Make.com can enforce some of the same routing through its scenario logic, though you trade fine control for speed of setup. Either way, the principle holds: the path is defined in the system, not improvised by the model.

This is the same lesson behind making an agent reliable. Structure is cheaper than retries.

Build the Structural Critic Gate

A structural critic gate is a conditional edge that inspects an objective status field and routes accordingly, so the model cannot approve its own work. This is the single most important pattern in the whole setup.

Structural critic gate conditional edge routing

Here is the difference in practice, because I used to think a strong prompt did the same job.

Before: Your system prompt says “Always verify the output and do not proceed if the tests fail.” The model agrees, then proceeds anyway on a run where it decides the failure is minor.
After: A conditional edge reads a status field your verify step wrote. If it is not PASSED, the graph routes to a review state and stops. The model has no say in the routing.

The pattern, modeled on the SPINE harness that runs SPECIFY then PLAN then CRITIC_PLAN then IMPLEMENT then VERIFY, looks like this:

MAX_RETRIES = 2

def critic_gate(state):
    # The model does not vote here. The code reads the status.
    if state["phase_result"].status == "PASSED":
        return "next_phase"
    if state["retries"] >= MAX_RETRIES:
        return "needs_review"   # routes to END, a human, not back to the model
    return "retry"

graph.add_conditional_edges("verify", critic_gate)

Follow this sequence when you add a gate to any phase:

Make the work step write a structured PhaseResult with an explicit status, never a free-text “looks good”.
Add a conditional edge after the step that reads only that status.
Cap retries with a counter in state, not a prompt instruction.
Route exhausted retries to a terminal needs_review state, not back into the model loop.
Log the status at every gate so you can see exactly where a run stopped.

The reason this works is that verification becomes a hard constraint enforced by the hosting code. A prompted self-check is a request the model can rationalize past. A gate is a wall.

Isolate Phases and Checkpoint Every Step

Phase isolation with checkpointing means each stage runs as its own subgraph with its own saved state, so a crash in one phase never corrupts another and any run can resume.

This is what turns a fragile script into something you can trust overnight.

Phase isolation checkpoint resume flow diagram

LangGraph saves a snapshot of graph state at every super-step, the point where all parallel nodes in a tick finish. If a node fails mid-step, it stores the outputs of the nodes that already completed, so on resume those are not re-run. You keep the tokens you already spent.

from langgraph.checkpoint.sqlite import SqliteSaver

graph = builder.compile(checkpointer=SqliteSaver.from_conn_string("plan.sqlite"))

# After a crash, resume the same thread from the last good super-step:
graph.invoke(None, config={"configurable": {"thread_id": "build-42"}})

What I would do beyond the default is give each phase its own checkpoint file. PLAN and IMPLEMENT writing to separate stores means a corrupted implement run cannot poison your plan state.

The production infrastructure patterns for agents lean on exactly this kind of isolation. The official LangGraph persistence docs cover the checkpoint API in full.

Constrain the Output With a Grammar, Not a Retry Loop

Grammar-constrained generation forces the model to produce valid structured output by filtering invalid tokens during sampling, which removes an entire class of parse-and-retry failures.

Most builders reach for a retry loop here, and that is the slower, less reliable path.

The counterintuitive part is the speed. A library like Outlines compiles your schema into a state machine and filters tokens at the logit level, and because it fast-forwards through positions where only one token is valid, it often runs faster than unconstrained generation, not slower.

There is no post-generation validation step because malformed output is impossible to generate.

import outlines

model = outlines.models.transformers("Qwen/Qwen3-30B")
generator = outlines.generate.json(model, schema='''
{"type":"object","properties":{
  "action":{"enum":["search","write","stop"]}},
  "required":["action"]}
''')

result = generator(prompt)   # action is always one of three values

From my experience, this single change removes most of the “the model returned almost-JSON” bugs that plague tool-calling agents. If you are still fighting malformed tool calls, the structural fix beats the four-section prompt approach because the model physically cannot emit a bad call.

Drive It With a Spec, Not Vibes

Spec-driven development makes a written specification the source of truth and has the agent generate against it, which gives the model an unambiguous contract instead of a vague request.

In its strictest form, called spec-as-source, the human edits only the spec and never hand-edits the generated code.

What sold me on this is that it eliminates spec drift by design. When behavior needs to change, you change the spec and regenerate, so the documentation and the implementation can never disagree. The academic case for this is laid out in a 2026 spec-driven development paper that frames the spec as the new source code.

A spec the agent can act on is concrete and verifiable, not a paragraph of intent:

# spec.md  (the only file a human edits)
FEATURE:  invoice parser
INPUT:    a PDF file path
OUTPUT:   JSON {vendor, total, date}
RULE:     total must equal sum(line_items)
VERIFY:   run tests/test_invoice.py, every test passes

Feeding that to the agent as the contract for the IMPLEMENT phase gives the critic gate something objective to check against in the VERIFY phase. The spec and the gate reinforce each other.

A Quick Comparison of the Four Determinism Levers

Each lever removes a different source of randomness, and a reliable agent uses all four together. Here is how they line up so you can decide what to add first.

Lever	What it removes	Tool to start with
Structural critic gate	The model approving its own bad work	LangGraph conditional edges
Phase isolation plus checkpoints	Crashes corrupting state, lost progress	LangGraph SqliteSaver
Grammar-constrained output	Malformed JSON and bad tool calls	Outlines or Guidance
Spec-driven development	Vague requests and spec drift	A plain spec.md contract

If I could only add one this week, it would be the critic gate. It is the cheapest to retrofit and the one that catches the most embarrassing failures.

Frequently Asked Questions

How do I stop my AI agent from looping forever?

Move the exit decision out of the prompt and into a structural gate. Cap retries with a counter in your graph state, and route exhausted retries to a terminal review state instead of back into the model loop, so a stuck model cannot keep itself alive.

Can I get guaranteed valid JSON from a local model?

Yes. Use grammar-constrained generation with a library like Outlines, which filters invalid tokens at the logit level during sampling. The model physically cannot produce output that violates your schema, so you do not need a parse-and-retry loop.

Why does my 30B model keep repeating the same step?

Smaller local models lack the thinking scope to redirect themselves and get hyper-focused. The fix is a supportive harness that owns the transitions between phases, so the structure decides when to move on rather than the model.

What is a structural critic gate?

It is a conditional edge in your agent graph that reads an objective status field and decides routing in code. Unlike a prompted self-check the model can rationalize past, the gate is enforced by the surrounding code, so the model cannot approve its own failed work.

How does an agent resume after a crash?

Use checkpointing, such as LangGraph’s SqliteSaver, which saves state at every super-step boundary. On resume, completed nodes are not re-run, so the agent picks up from the last good step without redoing work or burning tokens twice.

Is spec-driven development worth it for a solo builder?

Yes, even solo. A concrete spec gives the agent an unambiguous contract and gives your critic gate something objective to verify against, which removes the back-and-forth of vague requests and keeps your code and intent aligned.

Quick Takeaways

A deterministic AI agent gets its reliability from structure, not from a better-worded prompt.

Build a critic gate as a conditional edge that reads a status field, so the model cannot approve its own work.

Checkpoint every super-step and isolate each phase so a crash resumes cleanly instead of corrupting state.

Constrain output with a grammar like Outlines, which is often faster than unconstrained generation, not slower.

Drive the whole flow from a written spec so the contract and the verification gate reinforce each other.