The category error

When a language model produces output that is wrong, out of scope, structurally broken, or confidently incorrect, the instinct in most teams is to improve the model’s inputs. Add more context. Retrieve better documents. Enrich the prompt with relevant examples. This instinct is reasonable — better inputs often produce better outputs — but it treats a generation problem as a retrieval problem, and that substitution has limits that are worth understanding precisely.

RAG (retrieval-augmented generation) solves one problem well: it gives the model access to information it was not trained on, at query time. A model that does not know your internal product documentation can be given relevant passages before it answers. A model that might hallucinate facts about a recent event can be grounded with retrieved text. RAG addresses the gap between what the model learned during training and what it needs to know right now.

What RAG does not do is constrain what the model is allowed to conclude. You can give it perfect context and it will still reason outside the boundaries you intended, produce output in the wrong format, contradict domain rules it was not explicitly given, or confidently synthesise something plausible but wrong from the documents you provided. Hallucination from missing context and hallucination from unconstrained generation are different failure modes. RAG addresses the first. It does not touch the second.

What output control actually requires

Controlling generation requires a different set of mechanisms, and they operate at different points in the pipeline.

Constrained decoding works at the token level. You specify a grammar, schema, or format — JSON, a specific set of categories, a defined response structure — and the model is prevented from generating tokens that would violate it. Libraries such as Outlines and Guidance implement this, and most modern LLM inference servers support it. If your problem is output format reliability, constrained decoding solves it more reliably than prompt engineering. The model cannot produce malformed JSON if malformed JSON is not in the generation space.

Fine-tuning shifts the model’s behaviour distribution. If you need the model to consistently stay within a domain, adopt a tone, or follow patterns that prompting alone does not reliably produce, fine-tuning on examples of correct behaviour trains the model toward those patterns. The tradeoff is cost and the risk of degrading general capability. Fine-tuning for output control works when you have a well-defined correct behaviour and enough examples of it.

Rules-based validation operates after generation. The model produces output; a separate logic layer evaluates whether that output satisfies the constraints of the problem. If it does not, the output is rejected, the model is prompted to try again, or the pipeline escalates. This is the mechanism that produces genuine output reliability — not by shaping what the model generates, but by enforcing whether what it generates is acceptable. The model reasons freely; the validator enforces correctness.
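The generate-validate-retry loop is simple enough to sketch directly. In the sketch below, `generate` and `validate` are hypothetical stand-ins: `generate(prompt)` would call the model, and `validate(output)` returns a list of rule violations (empty means the output is accepted):

```python
# Post-generation validation with retry and escalation. The model reasons
# freely inside `generate`; correctness is enforced entirely by `validate`.

def generate_with_validation(generate, validate, prompt, max_attempts=3):
    """Run generate -> validate, re-prompting with the violations on failure."""
    attempt_prompt = prompt
    for _ in range(max_attempts):
        output = generate(attempt_prompt)
        violations = validate(output)
        if not violations:
            return output  # accepted: passed every rule
        # Feed the deterministic failure report back to the model and retry.
        attempt_prompt = (
            f"{prompt}\n\nYour previous answer violated these rules:\n"
            + "\n".join(f"- {v}" for v in violations)
        )
    # Escalate: no attempt satisfied the rules.
    raise ValueError(f"no valid output after {max_attempts} attempts")
```

Note that the retry prompt includes the concrete violations, so the model gets specific, machine-generated feedback rather than a generic "try again", and the escalation path is explicit rather than silent.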

The distinction matters because rules-based validation can express constraints that are difficult or impossible to encode in a prompt. Domain rules that involve logical relationships, constraint satisfaction across multiple output fields, or formal correctness criteria are better expressed as logic than as natural language instructions. A prompt that says “ensure field A is consistent with field B given condition C” is a probabilistic instruction. A logic layer that checks the same relationship is a deterministic one.
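To make the contrast concrete, here is the instruction "ensure field A is consistent with field B given condition C" expressed as logic rather than prose. The field names and the rule itself are hypothetical, chosen only for illustration:

```python
# A deterministic cross-field check: if the order's status is "refunded"
# (condition C), the refunded amount (field A) must equal the amount paid
# (field B). Fields and rule are hypothetical examples.

def check_refund(output: dict) -> list[str]:
    """Return a list of violations; empty means the output is consistent."""
    errors = []
    if output.get("status") == "refunded":  # condition C
        if output.get("amount_refunded") != output.get("amount_paid"):
            errors.append("refunded order must refund the full amount paid")
    return errors
```

The same relationship stated in a prompt is followed most of the time; stated as a check, it holds every time the check is run.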

Why prompting alone has a ceiling

Brute-force prompting — adding more instructions, more examples, more explicit constraints — works up to a point. For simple, well-defined tasks with limited output space, a well-written prompt can produce reliable results. The ceiling becomes visible when the task involves:

  • Logical consistency across multiple outputs. A model that must produce ten related fields that all satisfy interdependent constraints will eventually violate one of them, no matter how carefully you specify the rules in the prompt. The model does not reason about constraint satisfaction the way a rules engine does — it predicts plausible tokens.
  • Hard correctness requirements. Applications where a wrong answer is not just unhelpful but actively harmful — medical, legal, financial, safety-critical — need correctness verification that goes beyond “the model usually gets this right.”
  • Domain rules that are complex or numerous. Prompts that enumerate dozens of constraints become fragile. Rules change, exceptions accumulate, and the model’s compliance degrades as the instruction set grows. Encoding those rules in a separate layer that the model’s output is verified against is more maintainable and more reliable.
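When rules are numerous, keeping them as data rather than prompt text is what makes them maintainable. A minimal sketch, with hypothetical rules: each rule is a named predicate over the model's output, so rules can be added, removed, or versioned without touching the prompt, and every violation is reported by name:

```python
# Domain rules as a table of (name, predicate) pairs applied to the model's
# output. The specific rules below are hypothetical examples.

RULES = [
    ("priority_in_range",
     lambda o: o.get("priority") in {"low", "medium", "high"}),
    ("escalated_needs_owner",
     lambda o: not o.get("escalated") or bool(o.get("owner"))),
    ("summary_nonempty",
     lambda o: bool(o.get("summary", "").strip())),
]

def violations(output: dict) -> list[str]:
    """Names of every rule the output breaks."""
    return [name for name, ok in RULES if not ok(output)]
```

Adding a twentieth rule here is one more table entry; adding a twentieth constraint to a prompt degrades compliance with the other nineteen.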

This is why the architecture that most production AI systems eventually converge on looks less like "LLM with a good prompt" and more like "LLM in the middle, logic on both sides."

The architecture that actually works

A pattern that generalises well: constrain the problem space before the model sees it, let the model reason within that space, and validate the output against rules before it leaves the system.

The input side defines what the model is working with — structured context, relevant retrieval, problem framing. RAG lives here, and it is genuinely useful here. If the model needs domain knowledge to reason correctly, retrieval provides it. GraphRAG specifically is useful when the context involves relational knowledge — entities and their relationships — rather than just similar documents. The retrieval layer is doing real work: giving the model a well-formed view of the problem.

The generation step is where the model does what language models do well: synthesising, reasoning, producing natural language that integrates the provided context. This step should be given room to operate. Constraining it too tightly defeats the purpose of using a generative model.

The output validation step checks whether the result satisfies the problem’s correctness requirements. This is where rules-based logic earns its place. Not prompting the model to check its own work — models are unreliable self-evaluators — but a separate layer that applies deterministic criteria to the output. If the output passes, it goes forward. If it does not, the pipeline handles the failure.

The separation of concerns is the point. Retrieval improves what the model knows. Generation produces the response. Validation enforces whether the response is correct. These are three different problems, and mixing them into a single prompting step is why output reliability is hard to push past a certain ceiling.
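The three-stage separation can be sketched as a pipeline skeleton. Here `retrieve`, `generate`, and `validate` are hypothetical stand-ins for the real components; `validate` returns a list of violations, with an empty list meaning the output passes:

```python
# Input side -> generation -> output validation, as separate concerns.
# `retrieve`, `generate`, and `validate` are hypothetical stand-ins.

def answer(query, retrieve, generate, validate, max_attempts=2):
    context = retrieve(query)              # input side: what the model knows
    for _ in range(max_attempts):
        output = generate(query, context)  # generation: free to reason
        if not validate(output):           # output side: deterministic rules
            return output                  # empty violation list = accepted
    raise ValueError("validation failed; escalate to a human or fallback path")
```

Each stage can be improved, tested, and replaced independently, which is exactly what collapsing all three into a single prompt prevents.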

Where RAG earns its place

None of this is an argument against RAG. Retrieval is the right solution to the problem it solves. A model that has access to accurate, relevant context will reason better than one operating from training data alone. For applications where the primary failure mode is “the model didn’t know the right answer,” RAG directly addresses the root cause.

The mistake is reaching for RAG when the failure mode is “the model knew the right answer but produced the wrong output anyway” — inconsistent format, violated domain constraints, logically incoherent results. More retrieval does not fix generation. A validation layer does.

The question to ask before adding retrieval infrastructure is: what is the failure mode I’m trying to fix? If the answer is “the model lacks context,” RAG is the right lever. If the answer is “the model’s output isn’t reliable,” look at constrained decoding, validation, or fine-tuning first. Those mechanisms work closer to the actual problem.

Building for output reliability by improving retrieval is like proof-reading by improving your library. Better sources help. They do not catch the errors you introduce yourself.