
Emilio Carrión

Your AI Agent Doesn't Need to Think Better. It Needs to Know When It Screwed Up.

Teams getting real value from agents don't have magical models. They have verification loops that catch failures fast and force correction with external signals.

ai · testing · software quality

This week Karpathy published autoresearch. He left an agent running solo on a GPU overnight. 126 experiments. It discarded 102 changes that improved nothing, kept 23 that did. By morning, the model had improved more than a researcher would have achieved in weeks of manual tuning.

And the first thing I thought was: this isn't about the agent being smarter. It's about having a verification loop that doesn't let it cheat.

I've been thinking for a while about why some teams get real results with agents while others end up in a cycle of "try it, looks like it works, ship it, it breaks." And I believe the difference isn't the model they use or the framework. It's whether they've built a system that tells the agent "this is wrong, try again" with real data. Not with good intentions.

Reflection Isn't What You Think It Is

There's an idea that sounds great: "have the model review its own work and it'll improve." Like telling a student to double-check the exam before handing it in. The problem is that a survey published in TACL reviewed the literature on self-correction and reached a blunt conclusion: there's no evidence that LLMs self-correct successfully using only their own introspection. In Huang et al.'s paper (2023), asking GPT-4 to review its answers on reasoning tasks made results worse. The model changed correct answers to incorrect ones more often than it fixed anything.

And someone will say: "but coding agents self-correct all the time." They're right. But those agents aren't "reflecting." They're receiving something they didn't have before: a stack trace, a failing test, a compilation error. That's not introspection. That's verification. They're different things.

Vadim Nicolai explained it well a few days ago: the industry has conflated two concepts. Introspection is the model re-reading its output. Verification is the model reacting to an external signal. Remove the compiler, remove the test suite, remove the search engine, and the improvements vanish.

I don't want to be absolutist about this. There's recent research from DeepMind showing that a model can verify its own plan step by step against explicit task rules, without external signal, and that raised planning success from 50% to 89%. But notice: even there, what works is checking against concrete rules. It's not the model asking itself "is this right?" It's the model checking a list.

The Numbers That Explain Why This Matters

A moment of honesty with the math, because I think this is the root of the issue.

If each step in an agentic workflow has 95% reliability (and that's already generous), a 20-step process has a 36% chance of completing correctly. It's Lusser's law: the reliability of a chain is the product of individual reliabilities. Even at 99% per step, 20 steps give you 82%. One in five processes fails. In production. With real data.
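The arithmetic is trivial to check yourself:

```python
def chain_reliability(per_step: float, steps: int) -> float:
    """Probability that every step in a chain succeeds (Lusser's law):
    the product of the individual per-step reliabilities."""
    return per_step ** steps

# 95% per step over 20 steps: barely one run in three completes cleanly.
print(f"{chain_reliability(0.95, 20):.2f}")  # → 0.36
# Even at 99% per step, roughly one process in five still fails.
print(f"{chain_reliability(0.99, 20):.2f}")  # → 0.82
```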

And the worst part isn't that it fails. It's how it fails. An agent that picks the wrong tool at step 3 drags that error through the entire chain. Everything that follows operates on a broken foundation. Hassabis from DeepMind calls it "compound interest in reverse." A recent O'Reilly article puts numbers on it: with agents at 98% individual accuracy, after 5 steps you're already down to roughly 90%.

Spotify: 1,500 PRs and a Judge That Vetoes the Agent

The best-documented case I've seen of verification loops in production is Spotify's. They published three articles detailing their agent "Honk," which has merged over 1,500 PRs.

What I find interesting isn't the agent itself, but the scaffolding they've built around it. They have independent verifiers that activate automatically based on the project (a Maven verifier kicks in when it detects a pom.xml, for example). The agent doesn't know what each verifier does internally. It just knows it can call "verify" and gets pass or fail.

One detail I liked: the verifiers parse logs with regular expressions to return only the relevant errors. Very short success messages. All to avoid burning the agent's context window with noise.
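Spotify hasn't published Honk's internals, but the pattern is easy to sketch. Everything below (the marker files, the commands, the regex) is my own guess at the shape, not their implementation:

```python
import re
import subprocess
from pathlib import Path

# Hypothetical sketch: which verifier runs is decided by marker files
# in the repo, not by the agent.
VERIFIERS = {
    "pom.xml": ["mvn", "--batch-mode", "verify"],  # Maven projects
    "package.json": ["npm", "test"],               # Node projects
}

# Only surface lines that look like real errors, so the agent's
# context window isn't burned on log noise.
ERROR_PATTERN = re.compile(r"(ERROR|FAILED|error:)", re.IGNORECASE)

def verify(repo: Path) -> tuple[bool, str]:
    """The agent only sees this interface: call verify, get pass or fail."""
    for marker, command in VERIFIERS.items():
        if (repo / marker).exists():
            result = subprocess.run(command, cwd=repo, capture_output=True, text=True)
            if result.returncode == 0:
                return True, "ok"  # deliberately terse on success
            errors = [l for l in result.stdout.splitlines() if ERROR_PATTERN.search(l)]
            return False, "\n".join(errors[:20])
    return False, "no verifier available for this project"
```

The point of the design is the asymmetry: the agent never learns what Maven is, it only learns whether its change survived.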

But the best part comes next. Spotify discovered that some agents got "creative": they refactored code nobody asked them to touch, or disabled flaky tests on their own. So they added an LLM as a judge that compares the diff against the original prompt and vetoes changes that go out of scope. Roughly 25% of proposed changes get vetoed. In half of those cases, the agent manages to course-correct.
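The judge is the same idea in miniature. A minimal sketch, where `call_llm` is a placeholder for whatever client you use and the prompt wording is my own invention, not Spotify's:

```python
# Hypothetical scope judge: an LLM compares the diff against the original
# task and vetoes anything that goes beyond it.
JUDGE_PROMPT = """You are reviewing an automated code change.

Task given to the agent:
{task}

Diff produced:
{diff}

Does the diff stay within the scope of the task? Unrelated refactors or
disabled tests are out of scope. Answer exactly IN_SCOPE or OUT_OF_SCOPE
on the first line, then one line explaining why."""

def judge(task: str, diff: str, call_llm) -> tuple[bool, str]:
    verdict = call_llm(JUDGE_PROMPT.format(task=task, diff=diff))
    in_scope = verdict.splitlines()[0].strip() == "IN_SCOPE"
    # On a veto, the explanation goes back to the agent so it can
    # course-correct — which, per Spotify, works about half the time.
    return in_scope, verdict
```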

Sound familiar? An agent that gets too clever and touches things it shouldn't?


Autoresearch: The Walls Matter More Than the Agent

Back to Karpathy, because I think his design teaches something that goes beyond ML.

Autoresearch works with three constraints: a single file the agent can touch (train.py), a single metric that decides whether a change was an improvement (val_bpb), and a fixed budget of 5 minutes per experiment. No partial credit. The change improves the metric or gets discarded entirely.

If the agent could modify both the code and the definition of success, the loop would be useless. A system that can rewrite both the exam and the answers always passes. The valuable part of this design isn't the agent's intelligence. It's the walls Karpathy put up.

And this applies beyond ML. If you have a measurable fitness signal, an experiment you can repeat, and an automatic keep/discard criterion, you can build this loop. A/B testing. CI pipeline optimization. Infrastructure configuration tuning. The pattern is the same.
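The whole loop fits in a dozen lines. A sketch, assuming `propose_change` asks the agent for a mutation and `run_experiment` runs it under a fixed budget and returns the single metric (lower is better, as with val_bpb):

```python
# Sketch of the autoresearch-style loop. `propose_change` and
# `run_experiment` are placeholders: the agent proposes, the
# experiment measures, the criterion decides. No partial credit.
def optimization_loop(baseline, propose_change, run_experiment, iterations=100):
    best_state, best_score = baseline, run_experiment(baseline)
    for _ in range(iterations):
        candidate = propose_change(best_state)
        score = run_experiment(candidate)
        if score < best_score:   # the change improves the metric...
            best_state, best_score = candidate, score
        # ...or it gets discarded entirely.
    return best_state, best_score
```

Notice the agent never touches `run_experiment` or the comparison. That separation is the wall.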

When You Don't Have Tests, the Loop Doesn't Start

Here's the truth: for content, strategy, documentation, or any task where there's no compiler to say "this fails," the picture gets murkier. AWS documents a pattern where one agent generates and another evaluates against a rubric, iterating until convergence. It works reasonably well as a first filter, but an LLM evaluating another LLM has its own biases. And after two or three iterations, returns drop fast.
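A sketch of that generator-evaluator pattern, with `generate` and `evaluate` as placeholder LLM calls; the threshold and the round cap are my own assumptions, not AWS's numbers:

```python
# Sketch of a generate/critique loop against a rubric. `evaluate`
# returns (score, feedback); feedback flows back into the next draft.
def refine(task, rubric, generate, evaluate, threshold=0.8, max_rounds=3):
    draft = generate(task, feedback=None)
    for _ in range(max_rounds):   # returns drop fast past two or three rounds
        score, feedback = evaluate(draft, rubric)
        if score >= threshold:
            break
        draft = generate(task, feedback=feedback)
    return draft
```

It's a first filter, not a verifier: the evaluator inherits the biases of the model family judging it, which is exactly why the returns flatten.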

But notice what this implies for code: if your project has a good test suite, you already have a verification loop for free. The agent generates, the tests say whether it works, the agent corrects. It's good old TDD, except the one implementing isn't a human. As I wrote a while back, every automated test is like a robot working for you for free for the entire life of the project. Well, it turns out those robots now also work supervising the agent.

And conversely: if you don't have tests, the agent has nothing to verify against. No loop. No self-correction. You just have a code generator that sounds convincing and sometimes gets it right. A few months ago I wrote about the verification debt that accumulates when we generate code with AI without verifying it. Well, this is the flip side: teams that were already investing in testing are in a much better position to leverage agents than those that weren't.

The practical question before adopting an agent isn't "which model should I use?" It's "can I set up automated verification for this task?" If you can't, you're going to need a human in the loop. And that's fine. But knowing it upfront saves you months of frustrating attempts.

As an Anthropic report on agentic coding puts it: engineers use AI for tasks that are easily verifiable. Verifiability is the constraint that filters where agents work and where they don't.

In My Day-to-Day

I support seven engineering teams. And what I see in practice is exactly this: the teams getting the most out of AI assistants are the ones that already had solid test suites and CI before agents came along. They didn't do anything special to "adapt to AI." Their verification infrastructure was already there. The tests they wrote two years ago to validate business logic now also validate what the agent generates.

Teams that were weaker on testing are having a very different experience. They generate more code, yes. But they spend more time reviewing it manually and more time chasing bugs that slip through. An article in TechTarget put it plainly: organizations with good engineering practices channel agent velocity into productivity. Those without them generate chaos faster.

What I'm sure of is that 2025 was the year of seeing how fast agents could go. 2026 is the year of asking whether what they produce can be deployed with confidence. I'm not claiming this is the definitive answer on how it gets solved, but I'm confident in the direction.

Question for you: What's the riskiest task you've handed off to an agent without automated verification?

This content was first sent to my newsletter.

About the author

Emilio Carrión

Staff Engineer at Mercadona Tech. I help engineers think about product and build systems that scale. Obsessed with evolutionary architecture and high-performance teams.