The fastest reliable AI QA loop is the one that doesn't think
A coding agent that ships needs a check, every turn, that says: "this is wrong, fix it here." Without that check the agent generates plausible code, declares victory, and hands you a broken page.
There are three reasonable places to put that check, and they're not equivalent. You can put it in the prompt — rules, skills, guidelines. You can put it in a QA subagent — another LLM whose whole job is to grade the output. Or you can put it in static analysis — code that runs over the generated code and flags things deterministically.
For the slice of quality issues that have a deterministic shape, static analysis wins on the only two axes that actually matter inside an agent loop: it's reliable, and it's fast. Everything else follows.
Three places to put a guardrail
1. In the prompt of the main agent
The cheapest move. The agent makes a mistake, you add a line to the system prompt. No infra, no code, just words. For a while it works. A snippet of what this looks like in real life:
## Code style rules
- Never hardcode colors. Always reference theme tokens like
var(--color-accent), var(--color-muted).
- Every <Button> must have an onClick handler.
- Use Chakra v3 syntax: `open` not `isOpen`, `disabled` not `isDisabled`,
`colorPalette` not `colorScheme`.
- Don't nest <a> inside <a>. Don't put block elements inside <p>.
- For useEffect, every value used inside the callback must appear in the
dependency array.
- ...
The flow is: the rules go in once at the start of the conversation, the agent generates, and that's it. There's no checkpoint that says "did you follow rule #3 just now?" The rules are passive context.
This breaks in two predictable ways. First, compliance degrades with prompt length. The agent reads the rules at turn one, and by turn fifteen they're buried under tool results and intermediate code. The third rule gets forgotten. The tenth rule was never internalized in the first place. Second, even when the agent does follow a rule, it's silent — there's no signal back to you about which rules fired and which didn't. You only find out the rule failed when something visibly breaks downstream.
You can scale this up to skill files, guideline docs, retrieval-augmented rule packs. They all have the same shape. The check is the agent's own attention, and attention drifts.
2. A QA subagent
The next move, when prompt rules stop being enough, is to spin up a second LLM call whose whole job is to grade the first one's output. The subagent prompt looks something like:
You are a code reviewer for a frontend coding agent. You will be shown
the files that were just generated. Identify any issues with:
- Accessibility (color contrast, missing aria labels, keyboard nav)
- Component correctness (missing handlers, broken props)
- Framework usage (Chakra v3 syntax, valid CSS properties)
- Visual quality (layout issues, overflow, broken nesting)
Return a JSON list of findings. Each finding must include:
{ "path": "...", "line": <number>, "severity": "error" | "warning",
"message": "<what's wrong and how to fix>" }
Only report concrete issues. Do not editorialize.
The flow is two-pass: main agent generates, you invoke the subagent with the diff, the subagent reads the code and returns a verdict, the main agent reads the verdict and revises. It's genuinely smarter than a prompt rule — the subagent can reason about intent, catch things you didn't think to write down, and explain itself in natural language.
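On the harness side, that verdict has to be parsed and validated before it goes back to the main agent. A minimal sketch of that boundary, with the finding shape mirroring the prompt above and everything else illustrative:

```ts
// Validate the subagent's JSON verdict before feeding it back to the main agent.
// The Finding shape mirrors the prompt above; the helper itself is a sketch.
interface Finding {
  path: string;
  line: number;
  severity: "error" | "warning";
  message: string;
}

function parseVerdict(raw: string): Finding[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return []; // the model returned something that isn't JSON at all
  }
  if (!Array.isArray(parsed)) return [];
  // Keep only entries that actually match the schema the prompt asked for.
  return parsed.filter(
    (f): f is Finding =>
      typeof f?.path === "string" &&
      typeof f?.line === "number" &&
      (f?.severity === "error" || f?.severity === "warning") &&
      typeof f?.message === "string"
  );
}
```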
But it's still an LLM, and that means three things you can't argue with. The verdict is non-deterministic: same input, different result on a different sample. The latency is seconds per check, sometimes more if the diff is large, which means you can't run it on every turn — you batch it, and the agent piles more turns of code on top of unflagged mistakes in the meantime. And it's expensive: every QA pass is a full inference call, and you're doing it on every revision.
The harder problem is that when the subagent is wrong, you're debugging an LLM that's debugging an LLM. There's no ground truth. The subagent flags something the main agent disagrees with, the main agent argues back, and you're now reading two model outputs trying to decide who's right.
3. Static analysis
The boring third option. A rule looks at the code, applies a pattern or an AST query, and returns a finding or doesn't. Same answer every time. Runs in milliseconds. Points at the exact line and column. No sampling, no context window, no temperature.
The thing worth saying out loud: everyone already does this. Every TypeScript build is static analysis. Every ESLint pass is static analysis. The whole frontend ecosystem already runs a deterministic check pipeline on every code change — the build fails on a type error and the agent reads the message and fixes it. That loop already works. It works because the check is fast, the verdict doesn't drift, and the error message points at the line.
The question this post is really asking is: what else has the same shape as a type error? What other classes of mistake are deterministic enough to deserve their own check, instead of living as a hopeful line in the system prompt or a judgment call for the QA subagent?
The surprising answer is: a lot of things you wouldn't immediately classify as "lint." Design system guidelines (theme token usage, contrast ratios, spacing scales). SDK and API contract checks — does this column exist on this table, does this endpoint accept this field, each with a hand-written error message that tells the agent exactly what's valid. Functionality checks: a button without a handler, a form without a submit, a `<Select>` whose value isn't one of its options. Once you start looking, the boundary of "this is deterministic enough to lint" sits a lot further out than the conventional ESLint ruleset would suggest.
The flow is the same as the type-check flow you already have. The agent generates, the analyzer runs, findings come back as structured records:
src/components/Settings.jsx:42:7
vibe/no-hardcoded-colors
Color "#3366ff" is hardcoded. Use var(--color-accent) instead.
src/components/Settings.jsx:58:3
vibe/button-without-onclick
<Button> is missing an onClick handler.
src/components/Settings.jsx:84:11
vibe/deprecated-chakra-props
"isDisabled" is deprecated in Chakra v3. Use "disabled" instead.
The agent reads that the same way it reads a TypeScript error. Path, line, what's wrong, how to fix. No prompt language, no judgment, no inference call. The fix is unambiguous, so the next turn converges instead of negotiating.
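For completeness, here is roughly what the harness side of that can look like when the custom rules ship as an ordinary ESLint plugin. The function and the output formatting are illustrative; the ESLint Node API calls are real:

```ts
import { ESLint } from "eslint";

// Run the analyzer over the files the agent just touched and render each
// message in the path / rule / fix-hint format shown above.
async function collectFindings(changedFiles: string[]): Promise<string[]> {
  const eslint = new ESLint(); // picks up the project config, including the custom rules
  const results = await eslint.lintFiles(changedFiles);
  const findings: string[] = [];
  for (const result of results) {
    for (const msg of result.messages) {
      findings.push(
        `${result.filePath}:${msg.line}:${msg.column}\n${msg.ruleId ?? "parse-error"}\n${msg.message}`
      );
    }
  }
  return findings; // empty means the turn is clean
}
```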
The shift isn't that static analysis is new — you already trust it for types. The shift is treating it as the default home for any deterministic quality check, and only escalating to prompt rules or subagents when the check actually needs reasoning.
Why reliable beats smart in a feedback loop
The agent loop runs the check on every iteration. That changes the math.
A guardrail that's correct 95% of the time sounds great in isolation. Inside a loop that runs ten times, there's roughly a 40% chance it's wrong at least once somewhere in the run (1 − 0.95^10 ≈ 0.4). And the agent has no way to tell whether the finding it's looking at is the real one or the phantom one. So it either chases ghosts and rewrites working code, or learns to mistrust the check entirely and ignores real findings.
A prompt rule's reliability degrades with context length. A subagent's verdict drifts with sampling temperature. A regex doesn't drift. An AST query doesn't drift. The 100th run of `validateThemeTokens` returns exactly what the first run returned.
That's not a small advantage. In a feedback loop, it's the difference between convergence and oscillation. The agent can trust the finding, fix it, and move on — instead of relitigating whether the finding is real.
Speed is structural, not just a perf number
The other thing that changes inside a loop is what speed buys you.
A check that runs in 50ms can run every turn, on every file, before the agent moves on. A subagent check that runs in 3 seconds can't — you start batching it, sampling it, deferring it to the end of the run. Now the agent has piled ten more turns of code on top of a mistake before anyone catches it, and the fix has to unwind all of them.
The fast check changes the shape of the loop. Instead of generate a lot → grade → rewind, you get generate → catch → fix → generate. Tight cycles. The agent self-corrects mid-task instead of after. That's not a cost optimization — it's a different architecture. The QA subagent never gets to operate this way no matter how good your eval scores are, because the latency forbids it.
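A sketch of that loop shape, with `generateTurn`, `runStaticChecks`, and `agentSaysDone` as hypothetical stand-ins for whatever the agent framework and analyzer wrapper actually provide:

```ts
// The tight loop: generate, catch, fix, generate. The structure is the point;
// the declared functions below are hypothetical stand-ins, not a real API.
declare function generateTurn(task: string, feedback: string[]): Promise<string[]>; // returns changed files
declare function runStaticChecks(files: string[]): Promise<string[]>;               // returns findings
declare function agentSaysDone(): Promise<boolean>;

const MAX_TURNS = 20;

async function runTask(task: string): Promise<void> {
  let feedback: string[] = [];
  for (let turn = 0; turn < MAX_TURNS; turn++) {
    const changedFiles = await generateTurn(task, feedback); // main agent writes code
    feedback = await runStaticChecks(changedFiles);          // milliseconds, so it runs every single turn
    if (feedback.length === 0 && (await agentSaysDone())) break;
    // Non-empty findings go straight into the next turn's context, before more
    // code piles on top of the mistake.
  }
}
```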
What this looks like in practice
The kinds of mistakes that have a deterministic shape are more common than you'd think. A few we run today, as illustration:
Contract checks. When the agent calls a method on an SDK, it can invent arguments that look plausible — column names that don't exist on the user's board, fields that aren't on the schema. The type system can't catch these because the schema is dynamic. A short rule that parses the SDK calls (`.withColumns()`, `.where({...})`, `.orderBy(...)`, aggregate functions) and cross-references the actual board schema catches every one of them deterministically. The agent gets back: "line 42, column priority does not exist on board X — valid columns are A, B, C." That's a fix the agent can make on the next turn without thinking.
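A sketch of the cross-referencing step, assuming the column references have already been extracted from the AST. The schema shape and names here are hypothetical; only the SDK call names come from the rule described above:

```ts
// Cross-reference column names used in generated SDK calls against the board's
// real schema. Shapes are illustrative; the real rule pulls the references out
// of .withColumns() / .where() / .orderBy() calls in the AST.
interface BoardSchema {
  boardName: string;
  columns: string[]; // the columns that actually exist on the user's board
}

interface ColumnReference {
  path: string;
  line: number;
  name: string; // the column name the generated code referenced
}

function checkColumnReferences(refs: ColumnReference[], schema: BoardSchema): string[] {
  const valid = new Set(schema.columns);
  return refs
    .filter((ref) => !valid.has(ref.name))
    .map(
      (ref) =>
        `${ref.path}:${ref.line} column "${ref.name}" does not exist on board ` +
        `"${schema.boardName}". Valid columns are ${schema.columns.join(", ")}.`
    );
}
```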
Clickable buttons. A `<Button>` rendered without an `onClick` handler is almost always a bug — the agent meant to wire it up and forgot, or stubbed it out and moved on. A four-line AST check finds them all. No prompt rule gets you the same coverage; the agent will write a button, write a few more, forget which ones it wired up.
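That check, written as a custom ESLint rule. A sketch: it assumes a JSX-aware parser, and a production version would also account for spread props that might carry the handler:

```ts
// vibe/button-without-onclick, sketched as an ESLint rule.
const buttonWithoutOnClick = {
  meta: {
    type: "problem",
    messages: { missing: "<Button> is missing an onClick handler." },
    schema: [],
  },
  create(context: any) {
    return {
      // Fires on every opening JSX tag in the file.
      JSXOpeningElement(node: any) {
        if (node.name?.name !== "Button") return;
        const hasOnClick = node.attributes.some(
          (attr: any) => attr.type === "JSXAttribute" && attr.name?.name === "onClick"
        );
        if (!hasOnClick) context.report({ node, messageId: "missing" });
      },
    };
  },
};

export default buttonWithoutOnClick;
```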
Color contrast. Whether two theme tokens meet WCAG AA contrast is a math problem on the two colors' relative luminance, not a judgment call. Same with detecting "neutrals" that are pure gray when the design system expects a tinted gray. You can put "remember accessibility" in the prompt forever and it won't catch a 3.2:1 ratio. A 30-line rule will, every time.
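The math itself fits in a few lines. A sketch, assuming the theme tokens have already been resolved to hex values upstream:

```ts
// WCAG contrast as pure math: relative luminance from sRGB, then the ratio.
function relativeLuminance(hex: string): number {
  const [r, g, b] = [1, 3, 5].map((i) => parseInt(hex.slice(i, i + 2), 16) / 255);
  const lin = (c: number) => (c <= 0.03928 ? c / 12.92 : ((c + 0.055) / 1.055) ** 2.4);
  return 0.2126 * lin(r) + 0.7152 * lin(g) + 0.0722 * lin(b);
}

function contrastRatio(fg: string, bg: string): number {
  const [lighter, darker] = [relativeLuminance(fg), relativeLuminance(bg)].sort((a, b) => b - a);
  return (lighter + 0.05) / (darker + 0.05);
}

// AA for normal text is 4.5:1, so a 3.2:1 pairing fails, every time.
const meetsAA = (fg: string, bg: string) => contrastRatio(fg, bg) >= 4.5;
```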
Framework-specific syntax. Chakra v3 renamed half its props from v2. `isOpen` became `open`. `isDisabled` became `disabled`. `leftIcon` and `rightIcon` became child elements. `colorScheme` became `colorPalette`. The LLM was trained on a lot of v2 code and keeps reverting. You can put the migration table in the prompt — it's a few hundred tokens, and the agent still writes `<Button isDisabled>` half the time. A custom ESLint rule with a lookup table makes the regression structurally impossible. The agent gets back: "isDisabled is deprecated in Chakra v3, use disabled," and fixes it. The system prompt never has to mention Chakra.
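A sketch of that lookup-table rule. The table only covers the straight renames mentioned above (`leftIcon`/`rightIcon` moving to child elements needs a slightly different check), and it again assumes a JSX-aware parser:

```ts
// vibe/deprecated-chakra-props, sketched as an ESLint rule with a lookup table.
const RENAMED_PROPS: Record<string, string> = {
  isOpen: "open",
  isDisabled: "disabled",
  colorScheme: "colorPalette",
};

const deprecatedChakraProps = {
  meta: { type: "problem", schema: [] },
  create(context: any) {
    return {
      JSXAttribute(node: any) {
        const oldName = node.name?.name;
        const newName = oldName ? RENAMED_PROPS[oldName] : undefined;
        if (!newName) return;
        context.report({
          node,
          message: `"${oldName}" is deprecated in Chakra v3. Use "${newName}" instead.`,
        });
      },
    };
  },
};

export default deprecatedChakraProps;
```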
Hardcoded colors, invalid CSS properties, invalid DOM nesting (`<a>` inside `<a>`, block elements inside `<p>`), `useEffect` with missing dependencies — all the same shape. Patterns the agent gets wrong, that have a deterministic answer, that don't belong in the prompt.
A quick aside on tooling: most of the rules above are cheap to write because of ast-grep. It's a polyglot, tree-sitter-based code search and lint tool — you write patterns that look like the code you want to match, with metavariables for the parts that vary. A rule for "button without onClick" is basically `<Button $$$PROPS>$$$</Button>` with a `not` + `has` clause for the `onClick` attribute, expressed as a few lines of YAML. No ESLint plugin scaffold, no AST visitor, no Babel parser config. It runs in Rust so it's fast enough to invoke per-turn on hundreds of files, and because it's pattern-on-AST instead of regex, you don't get bitten by whitespace, attribute order, or string-literal vs template-literal quirks. The reason "write a new rule" stays cheap as the rule list grows is largely that this tool exists.
The point isn't any specific rule. It's that adding one is a code change, not a prompt change. No regression risk to other behaviors. No re-eval of the whole system prompt. Just: write the rule, run the loop, watch the finding go away.
When this is the wrong tool
Not everything has a deterministic shape, and pretending otherwise is the failure mode. Static analysis can't tell you whether the agent picked the right component for the user's intent, whether the empty state copy is any good, whether a layout makes sense for the data being shown. Those are judgment calls. They need a brain. They're the QA subagent's job.
The right framing isn't static analysis vs. subagent — it's that the subagent should only be deciding the things that actually need a brain. Every deterministic check you move out of the subagent and into a rule makes the subagent faster, cheaper, and more focused. The two layers compose.
The shape of the answer
If a quality issue has a deterministic shape, write a rule. If it doesn't, let an LLM grade it. Don't put either in the system prompt and hope.
The reason static analysis keeps winning the deterministic slice isn't that it's clever. It's that it doesn't think. It can't drift, can't get distracted, can't decide today is the day to be creative. In a loop that runs ten times, that's not a limitation. That's the whole point.