Reflection for RAG and Agents: Evidence-gated answers in regulated systems
In regulated and enterprise contexts, a plausible but unsupported answer is worse than a slow answer because it creates audit failure and downstream decision risk. You cannot defend it, you cannot reproduce it, and you often learn it is wrong only after an end user or subject matter expert highlights it.
In these environments, “close enough” is not a harmless failure mode. The pattern is consistent: coherent prose that cites the wrong page, misses the primary document, or blends an outdated policy snapshot with a current one. The output reads well, but it does not survive a basic reviewer question: “Where did that come from, exactly?”
Evidence-gated RAG turns that failure mode into a bounded engineering problem by enforcing a simple contract: key claims must map to primary sources the system actually retrieved.
The Pattern
The root cause is architectural: in a single-generation pass, the model must generate and implicitly validate at the same time, using whatever context it happened to retrieve. In baseline runs, retrieval pipelines often surface secondary summaries because they are shorter, denser, and rank more highly under embedding similarity. The model then preferentially uses them because they are easier to paraphrase.
Evidence-gated RAG is a control-layer pattern where a grounded critic (restricted to tool outputs rather than model priors) enforces claim-to-evidence alignment before an answer is emitted. If the critic can justify that the evidence contract (every required claim must map to a primary source ID and a retrievable span) is met, the system finalises the answer. If it can name a concrete evidence gap, the system runs one targeted follow-up retrieval and regenerates. You incur extra tokens and latency, accepting that cost in exchange for a measurable lift in factual reliability from the additional retrieval.
Self-critique prompting reviews the model’s own draft without adding new evidence, so it tends to improve structure more than correctness. ReAct interleaves reasoning and acting, which helps the model use tools, but does not by itself enforce claim-to-evidence alignment or a stop condition. Planner and executor systems introduce orchestration complexity, but you can layer them on later. Evidence-gated RAG is the smallest architectural addition that turns retrieval into an enforceable contract.
The pattern generalises and is useful anywhere documents drift over time and readers care about provenance: legal document summarisation, policy comparison across versions, enterprise search with version drift, and safety-critical copilots where the cost of a single unsupported claim is higher than one additional retrieval round.
Control Loop
In production terms, reflection is a control loop with a stop condition. The generator runs first, followed by a critic that inspects what the agent actually observed from tools, not how fluent the prose is. Essentially the critic asks: "Do I have enough information to complete the task?" If the critic can justify finalising, the system returns the answer. If it finds a specific information gap, it asks for one targeted follow-up retrieval, and the system regenerates and finalises. The mechanics are simple and the discipline is in the constraints: the loop must be grounded, budgeted, and stoppable.
The primary weakness in this architecture is correlated model failure: the critic is also an LLM. There is a non-zero probability it hallucinates the existence of evidence to satisfy the generator, which is equivalent to grading your own work. The mitigation is two-fold: use a different model for the critic, and restrict draft visibility until context sufficiency is validated. The critic first evaluates the query against the retrieved context and returns a structured list of missing facts and missing primary sources (the authoritative PDF/page that defines the policy, not commentary). Only if the context is sufficient should it review the draft, and even then the draft review is constrained to evidence mapping, not prose quality.
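Here is a minimal sketch of that two-phase critic, assuming a generic call_model(model_name, prompt) -> str client; the model name, prompt wording, and function names are placeholders, not a prescribed API. The point is that the draft is withheld until the context-sufficiency check passes, and the critic prompt contains only the question and tool outputs.
import json

CRITIC_MODEL = "your-critic-model"  # placeholder: deliberately not the generator model


def critique(question: str, tool_outputs: str, draft: str | None, call_model) -> dict:
    # Phase 1: context sufficiency, judged without seeing the draft.
    sufficiency_prompt = (
        "Use ONLY the tool outputs below. Do not use prior knowledge.\n"
        "List the missing facts and missing primary sources needed to answer.\n"
        "Return JSON with keys: requires_more_context, reason, "
        "follow_up_instruction, suggested_query.\n\n"
        f"Question: {question}\n\nTool outputs:\n{tool_outputs}"
    )
    decision = json.loads(call_model(CRITIC_MODEL, sufficiency_prompt))
    if decision["requires_more_context"] or draft is None:
        return decision
    # Phase 2: the critic earns the right to see the draft, and reviews only
    # claim-to-evidence mapping, never prose quality.
    mapping_prompt = (
        "For each claim in the draft, name the tool output that supports it "
        "or mark it unsupported. Do not comment on style.\n\n"
        f"Draft:\n{draft}\n\nTool outputs:\n{tool_outputs}"
    )
    decision["claim_audit"] = call_model(CRITIC_MODEL, mapping_prompt)
    return decision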
The separation of concerns matters because it keeps the system reimplementable. If you blur roles, you get a clever demo and a brittle deployment.
| Component | Responsibility | Does NOT Do |
|---|---|---|
| Generator | Draft an answer and propose tool calls using the retrieved context | Final approval |
| Retriever | Fetch relevant primary sources and return passages plus IDs | Judge correctness |
| Critic | Validate claim-to-evidence alignment and name missing facts or sources | Generate new claims |
| Loop controller | Enforce budgets and stop conditions, decide retry or halt | Rewrite answers |
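If it helps to keep those boundaries honest, the same table can be expressed as interfaces. A minimal sketch using Python protocols; the method names and signatures are illustrative rather than a prescribed API.
from typing import Any, Protocol


class Generator(Protocol):
    def draft(self, question: str, retrieved: list[dict[str, Any]]) -> str: ...


class Retriever(Protocol):
    def fetch(self, queries: list[str]) -> list[dict[str, Any]]: ...  # passages plus source IDs


class Critic(Protocol):
    def audit(self, question: str, retrieved: list[dict[str, Any]], draft: str | None) -> dict[str, Any]: ...


class LoopController(Protocol):
    def should_continue(self, passes: int, decision: dict[str, Any], new_primary_sources: int) -> bool: ...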
The stop conditions should be explicit, not emergent behaviour. This is what I ship:
- Max two passes, default one.
- Halt if the critic cannot name a new primary source to fetch.
- Halt if retrieval returns no new primary sources.
- Halt when evidence coverage meets the configured threshold across required claims.
flowchart LR
Q["User question"] --> LC["Loop controller (budgets + stop conditions)"]
LC --> G["Generator (draft answer + tool calls)"]
G --> R["Retriever and tools (search, fetch, execution)"]
R --> C["Critic (claim to evidence audit)"]
C -->|Contract met| F["Final answer"]
C -->|Named evidence gap| LC
Here is the loop in the smallest implementable form. The key is that the critic is grounded in tool history and the controller enforces stop conditions, rather than “reflecting” indefinitely.
// question and retrieved come from the surrounding request context (retrieved is mutable).
const MAX_PASSES = 2;
for (let pass = 0; pass < MAX_PASSES; pass++) {
  const draft = generator({ question, retrieved });
  const decision = critic({
    question,
    retrieved,
    draft: pass === 0 ? null : draft, // earn the right to see the draft
  });
  if (decision.contractMet) return draft;
  if (!decision.missingEvidenceQueries.length) break; // stop: critic cannot name a new source to fetch
  const next = retriever(decision.missingEvidenceQueries);
  if (!next.addedPrimarySources) break; // stop: retrieval returned no new primary sources
  retrieved = merge(retrieved, next);
}
throw new Error("Evidence threshold not met");
What changes is not how answers sound, but whether the system finalises on secondary summaries when primary documents are available. The trade-off is one additional retrieval round and a predictable latency increase, which can be capped via a single reflection pass.
This is rational, not expensive. In regulated deployments, latency is a UX variable. Unsupported claims are a liability variable. Evidence-gated RAG trades a small, bounded increase in compute for a large reduction in audit exposure.
There are a few variants of this pattern and they are not interchangeable. A basic self-critique loop can improve structure and formatting, but it rarely moves factual accuracy because the critic is still constrained by missing evidence. Grounded reflection is where things get interesting: the critic sees the tool history and flags what is missing, then the system retrieves that missing piece. For harder tasks you can go further and score multiple action paths before committing to one, but that pushes you into much higher compute and complexity, so I treat it as a specialised tool rather than the default.
If you want a quick pointer to the research direction, it is consistent with how the field evolved. ReAct established the value of interleaving reasoning and acting (ReAct), Reflexion showed strong gains from explicit feedback loops (Reflexion), and CRITIC focused on verifying and correcting outputs using external signals rather than self-confidence (CRITIC). None of that implies every reflection loop works. It does imply that critique improves outcomes when it is grounded in observations and constrained by an operational contract.
That contract is the part most people skip. I do not want free-form critique. I want a machine-actionable decision that answers one question: do we have enough evidence to finalise?
{
  "requires_more_context": true,
  "reason": "Coverage gap in the latest policy update.",
  "follow_up_instruction": "Fetch primary source and verify effective date.",
  "suggested_query": "official policy update effective date site:gov"
}
This keeps the loop deterministic and safe to automate. The generator cannot talk the critic into accepting a weak answer, and the critic cannot quietly drift into style feedback when what you need is evidence.
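That structured output is only safe to automate if you validate it strictly before acting on it. A minimal standard-library sketch; the field names match the JSON above, and the failure handling is an assumption (treat a malformed decision as a halt, not another retry):
import json

REQUIRED_FIELDS = {"requires_more_context", "reason", "follow_up_instruction"}
OPTIONAL_FIELDS = {"suggested_query"}


def parse_critic_decision(raw: str) -> dict:
    decision = json.loads(raw)  # raises on malformed JSON
    keys = set(decision)
    if not REQUIRED_FIELDS <= keys:
        raise ValueError(f"critic output missing fields: {REQUIRED_FIELDS - keys}")
    if keys - REQUIRED_FIELDS - OPTIONAL_FIELDS:
        raise ValueError(f"critic output has unexpected fields: {keys - REQUIRED_FIELDS - OPTIONAL_FIELDS}")
    if not isinstance(decision["requires_more_context"], bool):
        raise ValueError("requires_more_context must be a boolean")
    return decision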
Worked Example
Here is a synthetic trace that mirrors a real production run. The point is the mechanics: evidence gap detection, targeted follow-up retrieval, then a revision that is actually supported by sources.
Suppose the user asks: “What is the effective date of the new policy and what changed?”. The agent does an initial web search:
{
  "tool": "web_search",
  "arguments": { "query": "new policy changes effective date" }
}
The tool returns two secondary summaries:
<result https://example.com/blog/policy-summary id=1>...summary of changes...</result>
<result https://example.com/news/policy-announcement id=2>...mentions policy...</result>
At this point, the generator can write a plausible answer, but it cannot satisfy the evidence contract for the effective date claim. That is a correctness risk, not a writing problem. The critic should therefore reject it and request a targeted follow-up that forces primary evidence:
{
  "requires_more_context": true,
  "reason": "No primary source for the effective date. Only secondary summaries are present.",
  "follow_up_instruction": "Find the primary policy document and extract the effective date and change list.",
  "suggested_query": "policy effective date filetype:pdf site:example.gov"
}
The agent runs the follow-up retrieval:
{
  "tool": "web_search",
  "arguments": { "query": "policy effective date filetype:pdf site:example.gov" }
}
The follow-up retrieval now returns a primary document:
<result https://example.gov/policy/2026-02-policy.pdf id=1>...effective date: ...; change list: ...</result>
On regeneration, the effective date and change list claims map directly to the primary source. The citations are now evidential rather than decorative.
Implementation and Evaluation
This is also why I keep the implementation minimal. I separate roles into generator and critic. I pass the critic only what it needs (question and tool history). I only loop when the critic can name a concrete missing input. The code below is runnable in the sense that you can copy it and execute it, but it uses stubs for the LLM and tools. Swap the stubs for your provider client and your tool layer.
Evidence gating becomes real when you define a contract that the loop can enforce. I treat “claim” as an object that must map to a primary source, plus a traceable span that a reviewer can locate later. The exact fields vary with your document store, but the shape matters.
{
  "claim": "Eligibility requires 12 months continuous service.",
  "evidence": {
    "doc_id": "policy_2025_01.pdf",
    "doc_version_ts": "2025-01-18",
    "page": 14,
    "quote": "Applicants must have completed 12 months of continuous service..."
  }
}
from dataclasses import dataclass
from typing import Any


@dataclass
class ReflectionDecision:
    requires_more_context: bool
    reason: str
    follow_up_instruction: str
    suggested_query: str | None = None


def llm_generate(question: str, tool_context: str) -> str:
    # Stub generator: swap for your provider client.
    return f"Draft answer for: {question}\n\nEvidence:\n{tool_context}"


def reflect(question: str, tool_history: list[dict[str, Any]]) -> ReflectionDecision:
    # Stub critic: grounded in tool history only. In production this is an LLM call
    # constrained to tool outputs and validated against the decision schema.
    evidence = "\n".join(item["output_preview"] for item in tool_history if item.get("output_preview"))
    missing_primary = "example.gov" not in evidence
    if missing_primary:
        return ReflectionDecision(
            requires_more_context=True,
            reason="Only secondary sources present. Primary source missing for key claims.",
            follow_up_instruction="Retrieve the primary document and extract effective date and change list.",
            suggested_query="policy effective date filetype:pdf site:example.gov",
        )
    return ReflectionDecision(
        requires_more_context=False,
        reason="Primary source present for key claims.",
        follow_up_instruction="Finalise.",
        suggested_query=None,
    )


def web_search(query: str) -> str:
    # Stub tool: returns a primary source only when the query targets it.
    if "site:example.gov" in query:
        return "<result https://example.gov/policy/2026-02-policy.pdf id=1>...effective date...; change list...</result>"
    return "\n".join(
        [
            "<result https://example.com/blog/policy-summary id=1>...summary...</result>",
            "<result https://example.com/news/policy-announcement id=2>...announcement...</result>",
        ]
    )


def run_agent(question: str, max_reflection_rounds: int = 1) -> str:
    tool_history: list[dict[str, Any]] = []
    rounds = 0
    while True:
        if not tool_history:
            # Initial retrieval pass.
            tool_output = web_search("new policy changes effective date")
            tool_history.append(
                {
                    "name": "web_search",
                    "arguments": {"query": "new policy changes effective date"},
                    "output_preview": tool_output[:400],  # cap how much tool output is re-injected
                }
            )
        tool_context = "\n".join(item["output_preview"] for item in tool_history)
        draft_answer = llm_generate(question, tool_context)
        if rounds >= max_reflection_rounds:
            return draft_answer  # budget exhausted
        decision = reflect(question, tool_history)
        rounds += 1
        if not decision.requires_more_context:
            return draft_answer  # contract met
        if not decision.suggested_query:
            return draft_answer  # critic cannot name a follow-up
        follow_up_output = web_search(decision.suggested_query)
        tool_history.append(
            {
                "name": "web_search",
                "arguments": {"query": decision.suggested_query},
                "output_preview": follow_up_output[:400],
            }
        )
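The stub critic above only checks for a primary-source domain. A fuller gate evaluates each claim object from the contract shown earlier against what was actually retrieved. A minimal sketch, assuming claims carry the doc_id and quote fields from that contract and that retrieved primary sources are available as a mapping from source ID to extracted text:
def claim_is_supported(claim: dict, retrieved_docs: dict[str, str]) -> bool:
    # retrieved_docs maps primary source ID -> extracted text of the retrieved spans.
    evidence = claim.get("evidence") or {}
    doc_text = retrieved_docs.get(evidence.get("doc_id", ""), "")
    quote = evidence.get("quote", "")
    return bool(doc_text) and bool(quote) and quote in doc_text


def evidence_coverage(claims: list[dict], retrieved_docs: dict[str, str]) -> float:
    # Fraction of required claims backed by a retrieved primary source and span.
    if not claims:
        return 1.0
    return sum(claim_is_supported(c, retrieved_docs) for c in claims) / len(claims)
The loop controller compares this fraction against the configured coverage threshold from the stop conditions above.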
The guardrails are what stop this from becoming an expensive confidence amplifier. I cap reflection rounds (usually one, occasionally two in higher-risk flows). I cap how much tool output can be re-injected into context. I set time budgets per tool call and per overall turn. I validate the critic output against a strict schema, and I stop the loop when there is no suggested query, no new evidence, or the budget is exhausted.
Reflection without evaluation is theatre: if a variant does not beat baseline on evidence coverage, remove it. The quickest way to check is to build a small golden set that matches your real distribution: straightforward factual questions, multi-hop retrieval questions, prompts that trigger stale-information traps, and prompts where a primary source exists and must be cited. I store both expected-answer notes and required-evidence notes per prompt, because the latter catches the failure mode that matters most in RAG and agent workflows: key claims that are not backed by primary evidence.
When I compare variants, I measure unsupported claim rate, incorrect citation rate, primary-source coverage for required claims, latency delta versus baseline, and reflection loop hit rate. Evidence coverage is the fraction of required claims that are backed by primary source IDs from your tool results. Citation precision is whether the claim text is supported by the cited passage, not merely adjacent to the same topic. For latency, I care about median and P95, because reflection primarily impacts tail latency.
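These metrics roll up directly from golden-set runs. A minimal sketch, assuming each run record carries per-claim support flags, a latency in milliseconds, and whether the reflection loop fired; the record shape is illustrative:
import statistics


def summarise_runs(runs: list[dict]) -> dict:
    # Each run: {"claims_supported": [bool, ...], "latency_ms": float, "reflected": bool}
    flags = [flag for run in runs for flag in run["claims_supported"]]
    latencies = sorted(run["latency_ms"] for run in runs)
    p95_index = max(0, int(round(0.95 * len(latencies))) - 1)  # nearest-rank approximation
    return {
        "evidence_coverage": sum(flags) / len(flags),
        "unsupported_claim_rate": 1 - sum(flags) / len(flags),
        "median_latency_ms": statistics.median(latencies),
        "p95_latency_ms": latencies[p95_index],
        "reflection_hit_rate": sum(run["reflected"] for run in runs) / len(runs),
    }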
I print a before and after snapshot for the team because it turns architecture arguments into a trade-off you can reason about. The exact numbers are workload-specific, so treat this table as the shape, not a benchmark.
| Metric | Baseline RAG | Evidence-gated RAG |
|---|---|---|
| Unsupported effective-date claims (golden set) | Non-zero | 0 |
| Primary-source coverage (policy-style prompts) | Inconsistent | Near-complete |
| Incorrect citation rate | Your metric here | Your metric here |
| Median and P95 latency | Your metric here | Baseline + bounded extra round |
| Reflection loop hit rate | Your metric here | Your metric here |
In ablation tests, grounded reflection consistently improves evidence coverage, while ungrounded self-critique mostly improves structure without moving factual accuracy. I run a tight ablation: baseline with no reflection, self-critique only, grounded reflection using tool history, then grounded reflection with a single follow-up retrieval. The most complex version only survives if the lift is large enough to justify the extra runtime cost.
This pattern has clear fit boundaries. Evidence-gated RAG is justified when the cost of an unsupported claim exceeds the cost of one additional retrieval round. It is usually a poor trade in ultra-low-latency UX, low-risk tasks where occasional inaccuracy is acceptable, and workflows where there is no external feedback signal to ground critique.
Use Evidence-gated RAG when
- Claims must be traceable and reviewable.
- Primary sources exist and can be fetched.
- Multi-hop retrieval risk is real and costly.
Avoid it when
- The UX must be ultra-low-latency.
- The domain is low-risk and occasional errors are acceptable.
- There is no external feedback signal to ground critique.
There are a few common ways this fails. If the critic cannot inspect tool evidence, it produces plausible feedback that does not change correctness. If the rubric is too rigid, the agent learns to satisfy checklists instead of outcomes. If you rely on judge-only evaluation, a fluent wrong answer can score well when evidence mapping is weak. If you apply reflection everywhere, you inflate cost and latency without moving accuracy on simple queries.
The critic hallucination vector deserves to be treated as first-class risk, not a footnote. The mitigations are boring and effective: the critic is restricted to tool outputs, it is explicitly forbidden from using external knowledge, it runs under strict schema validation, and it can be duplicated with an independent second critic model in the highest-risk flows.
If you are running reflection at scale, cost control becomes an inference problem, and it is worth treating your model and quantisation choices as part of the system design. See my post on Model Quantization, Fine-Tuning, and Picking the Right GGUF.
In a multi-tenant platform, evidence thresholds, reflection rounds, and latency budgets become configuration parameters, not hard-coded behaviour. Once you treat them as first-class configuration, reliability becomes something you can productise and reason about rather than something you hope emerges from prompt craftsmanship.
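Concretely, that can be a per-tenant settings object resolved at request time rather than constants in the agent code. A minimal sketch; the field names and defaults are placeholders:
from dataclasses import dataclass


@dataclass(frozen=True)
class ReflectionConfig:
    max_reflection_rounds: int = 1            # 0 disables the loop for low-risk tenants
    evidence_coverage_threshold: float = 1.0  # fraction of required claims that must be backed
    tool_call_timeout_s: float = 10.0
    turn_latency_budget_s: float = 30.0


def config_for_tenant(tenant_id: str, overrides: dict[str, dict]) -> ReflectionConfig:
    # overrides: tenant_id -> field overrides loaded from your configuration store.
    return ReflectionConfig(**overrides.get(tenant_id, {}))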
My default operating model is straightforward: one reflection round, structured critic output, explicit evidence gaps, one targeted follow-up retrieval when needed, and ablation-driven evaluation to prove that the extra calls are buying accuracy rather than theatre. Enterprise AI systems should be engineered as bounded control systems with explicit feedback loops, enforced evidence gates, and deterministic stop conditions.