Emilio Carrión
Generating Is Easy. Verifying Is the Work.
Anthropic separated the agent that generates from the one that evaluates, and quality skyrocketed. That pattern describes the future of software engineering: generation is commodity. Verification is craft.
1. The Code Nobody Understands Is Already in Production
2. Invisible Heuristics: What Seniors Know but Can't Explain
3. AI Won't Replace the Software Engineer. It Will Replace the One Who Only Wrote Code.
4. The Selfish Senior
5. Generating Is Easy. Verifying Is the Work.
6. Discipline Doesn't Scale. Verification Needs Infrastructure.
A study by METR had 16 experienced developers complete real tasks in their own repositories, with random assignment deciding whether they could use AI. Small sample, but a rigorous experimental design. Those who used AI took 19% longer. The interesting part: before starting, they estimated AI would make them 24% faster. And after finishing, having been objectively slower, they still believed they'd been 20% faster.
They believed they were faster while being slower. And these weren't juniors. They were developers with years of experience in those specific repositories.
A single data point proves nothing. But when the same pattern shows up in experienced developers, in agents evaluating their own work, and in automated code reviews, it starts to carry weight. An experiment by Alexey Pelykh confirms it: across 449 code reviews done by AI, the model judged 98.6% of its own reviews valid. When an independent evaluator checked them against the actual code, the number dropped to 68.9%.
I don't think it's a data point about AI tools. It's a data point about a fundamental human bias: we're terrible at evaluating the quality of what we just produced. And that bias is about to reshape our profession.
Humans and machines arrive at a similar practical outcome, even if the causes differ. LLMs can't self-evaluate by design. Humans theoretically can, but the overconfidence bias METR documented suggests we're bad at it in practice. If that's not an argument for separating the generator from the verifier, I don't know what is.
The Separation That Changed Everything
A few days ago, Anthropic published a post on harness design that connects directly to this. They built a three-agent system (one that plans, one that generates code, and one that evaluates) to create complete applications autonomously over hours.
When they asked the agent to evaluate its own work, it gave itself high marks even when quality was mediocre. The fix was separating the generator from the evaluator. One agent creates, another judges. And they found that tuning the evaluator to be skeptical is much easier than getting the generator to be self-critical.
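To make the shape of that pattern concrete, here's a minimal sketch of the loop, assuming a generic `llm(system, input)` helper as a hypothetical stand-in for whatever model API you use. This isn't Anthropic's harness, just the structure: one role produces, a separate and deliberately skeptical role judges.

```ts
// A minimal sketch of the generator/evaluator split, not Anthropic's actual
// harness. `llm` is a hypothetical stand-in for your model client.

type Verdict = { pass: boolean; issues: string[] };

async function llm(system: string, input: string): Promise<string> {
  throw new Error("plug in your model client here"); // hypothetical stand-in
}

async function generate(spec: string): Promise<string> {
  return llm("You are a code generator. Satisfy the spec exactly.", spec);
}

async function evaluate(spec: string, artifact: string): Promise<Verdict> {
  // The whole trick: this role never grades its own work. Tuning this prompt
  // to be skeptical is far easier than making the generator self-critical.
  const raw = await llm(
    'You are a skeptical evaluator. List every way the artifact fails the contract. Answer as JSON: {"pass": boolean, "issues": string[]}.',
    `Contract:\n${spec}\n\nArtifact:\n${artifact}`,
  );
  return JSON.parse(raw) as Verdict;
}

async function run(spec: string, maxRounds = 3): Promise<string> {
  let artifact = await generate(spec);
  for (let round = 0; round < maxRounds; round++) {
    const verdict = await evaluate(spec, artifact);
    if (verdict.pass) return artifact;
    // Feed the evaluator's findings back to the generator, never the reverse.
    artifact = await generate(`${spec}\n\nFix these issues:\n- ${verdict.issues.join("\n- ")}`);
  }
  return artifact; // still failing after maxRounds: escalate to a human
}
```

The design choice that matters is in `evaluate`: it judges an artifact it has no stake in, against a contract it didn't write.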
Their evaluator didn't read the code line by line. It navigated the application with Playwright, used it like a real user, tested flows, and evaluated against predefined criteria. Each sprint had a "contract" (a prior agreement on what "done" meant) and the evaluator verified against that contract.
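As a sketch of what that contract-based evaluation can look like (the URL, selectors, and flow below are hypothetical, not Anthropic's actual checks):

```ts
// Sketch of an evaluator that exercises the app like a user and verifies
// against an agreed contract. All names and selectors are illustrative.
import { test, expect } from "@playwright/test";

// The sprint "contract": agreed before generation, verified after.
const contract = {
  feature: "checkout flow",
  done: [
    "a user can add an item and complete payment",
    "an order confirmation is shown with an order id",
  ],
};

test(`contract: ${contract.feature}`, async ({ page }) => {
  await page.goto("https://app.example.test");

  // Exercise the flow as a real user would, not line by line.
  await page.getByRole("button", { name: "Add to cart" }).first().click();
  await page.getByRole("link", { name: "Checkout" }).click();
  await page.getByLabel("Card number").fill("4242 4242 4242 4242");
  await page.getByRole("button", { name: "Pay" }).click();

  // Verify against the contract, not against "looks reasonable".
  await expect(page.getByText(/order confirmed/i)).toBeVisible();
  await expect(page.getByTestId("order-id")).not.toBeEmpty();
});
```

Note that nothing here inspects the implementation; the contract is expressed entirely in user-visible behavior, which is what makes it cheap to hand to a separate evaluator.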
That pattern (generator and evaluator as separate roles) is becoming a standard. AWS documents it as an architecture pattern. OpenAI applies it in their harness engineering model, where a small team built a million lines of code without writing a single line by hand.
But I'm not just interested in it as an agent pattern. I'm interested because it describes where our profession is heading.
The Contract That's Breaking
For decades, the implicit contract of software engineering was: you generate and you verify. You write code, you test it, you review it, you deploy it. Understanding the system comes bundled with the act of building it.
That contract is breaking. And I'm watching it happen in real time.
Now AI generates. A PM with Claude Code ships a feature. A junior with Cursor accepts suggestions they don't fully understand. An autonomous agent implements an entire sprint without supervision. The artifact arrives the same (a PR, a deploy, a working feature), but the understanding behind it is radically different.
I saw this a few weeks ago in one of the teams I support. An AI-assisted PR, clean, well-structured, tests passing. Looked perfect. A senior engineer blocked it because they knew that inventory query endpoint depended on an external service that, under the load of marketing campaigns, goes from responding in 50ms to accumulating multi-second latencies that don't show up in any test. They knew because they lived through the incident two years ago, on a Friday at 11 PM. That context wasn't in the code, wasn't in the documentation, wasn't anywhere an agent could find it.
That's verification. And it's what AI, today, cannot do.
The Three Layers of Verification
A useful way to think about it, drawn from observing both agent systems and human teams, is in three layers:
Functional verification: does it do what it should? The most obvious and the most automatable. Tests, CI/CD, linters, type checking. Anthropic's evaluator agents use Playwright to navigate the application and verify that flows work. Necessary but not sufficient.
Criteria verification: does it do it well? "It works" is not the same as "it's good." Is it maintainable? Does it scale? Is it secure? Anthropic found they had to define explicit criteria ("design quality", "originality", "craft", "functionality") because without criteria, the evaluator tends to approve whatever looks reasonable. In human teams, these criteria are rarely explicit. They live in seniors' heads as intuition. As invisible heuristics.
Context verification: does it fit the real system? This is the hardest layer to solve. Will this code behave well alongside the rest of the system? Is there something in the system's history (a past incident, a fragile dependency, a business constraint) that this code ignores? That's exactly what happened with that inventory endpoint PR. The first layer said "all green." The second would say "clean code." Only the third caught the problem. This layer requires architectural memory, exactly the kind of knowledge that seniors have and don't know they have.
Tools cover the first layer. Explicit criteria cover the second. The third is only covered by people with experience and context. And that's where the real engineering work lives.
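To make that division of labor concrete, here's a sketch of the three layers written down as data, so a team can see which checks a pipeline can own and which must be routed to a human. The questions are illustrative, not a prescribed checklist:

```ts
// Illustrative sketch: encode which verification questions a pipeline can
// answer and which need a human with context. All checks here are examples.

type Layer = "functional" | "criteria" | "context";

interface Check {
  layer: Layer;
  question: string;
  automated: boolean; // can a tool answer this reliably?
}

const checks: Check[] = [
  // Layer 1: does it do what it should? Tools answer this.
  { layer: "functional", question: "Do the tests, types, and end-to-end flows pass?", automated: true },
  // Layer 2: does it do it well? Answerable only once criteria are written down.
  { layer: "criteria", question: "Does every new query set an explicit limit?", automated: true },
  { layer: "criteria", question: "Do public endpoints follow our naming conventions?", automated: true },
  // Layer 3: does it fit the real system? No tool holds this context.
  { layer: "context", question: "Does this endpoint lean on services that degrade under campaign load?", automated: false },
  { layer: "context", question: "Does it contradict a decision made after a past incident?", automated: false },
];

// CI runs what it can; everything else must land on a reviewer who has
// operated the system. That routing is the verification infrastructure.
const forCI = checks.filter((c) => c.automated);
const forHumans = checks.filter((c) => !c.automated);

console.log(`automated: ${forCI.length}, human judgment: ${forHumans.length}`);
```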
The Asymmetry That Defines Us
Generating is a production problem: given a specification, produce something that satisfies it. LLMs are extraordinarily good at this. They produce code that compiles, passes tests, looks reasonable. That's what's been commoditized. What hasn't: decomposing the problem, choosing the right abstractions, designing the specification. That's still craft.

Verifying is a judgment problem: given something that appears to work, determine whether it truly works in every case that matters, whether it's maintainable, whether the design decisions are sustainable.
And verifying requires something that generating doesn't: context. Knowing that endpoint will get 10x more traffic at Christmas. Knowing that service has a fragile dependency on a legacy system. Knowing that "it works" is not the same as "it works in production at 3 AM under real load."
CodeRabbit's VP of AI puts it this way: AI-generated code is more cognitively demanding to review than human-written code. And existing QA pipelines were built for the human pace, not for the AI-amplified pace.
The data confirms the asymmetry. A report by Sonar with over 1,100 developers found that 42% of all committed code already has significant AI assistance, but 96% don't fully trust that code to be functionally correct. And only 48% always verify it before committing. A study by Harness with 900 engineers completes the picture: 63% of organizations ship code to production faster with AI, but 45% of deploys tied to AI-generated code cause issues. They call it the "AI Velocity Paradox": we go faster, but we break more things.
Both reports come from vendors with skin in the game, with the caveats that implies. But the numbers are consistent with what I'm seeing in the teams I support. We've 10x'd the speed at which we generate, but the building inspectors are still reviewing at the same pace. Something's going to give.
The Map That Makes It Visible
When I plot each role on a generation vs. verification axis, the pattern becomes obvious. It's an observational model, not a study. But when I put it in front of a team, the conversation it sparks is always productive. The roles that generate the most are the ones that verify the least, and vice versa. AI has pushed everyone to the right (more generation) without anyone moving upward (more verification). The empty diagonal is the problem.
[Figure: Generate vs Verify map. The diagonal marks equilibrium; to its right, generation outpaces verification. That distance is the asymmetry this article discusses, and where future incidents live.]
The Engineer as Evaluator
A line from the Anthropic post: "The space of interesting harness combinations doesn't shrink as models get better. It moves."
I think the same applies to engineers. The work doesn't shrink. It moves. From generating to verifying. From writing code to defining the criteria that determine whether code deserves to exist.
I won't pretend there isn't a part of me that resists this. I got into this profession because I loved writing code. But what I'm discovering is that verifying with judgment is intellectually more demanding work, not less. It requires more experience, more context, more judgment. It's not a step back. It's a step up.
That said, there's a tension I don't want to ignore. The context that makes someone a good verifier comes from having built things. The senior who caught the problem in that PR caught it because they lived through the incident two years ago. If we stop generating entirely, where does the judgment of future verifiers come from? I don't have the full answer. But I think it has to do with generation and verification not being fixed roles, but modes of work that every engineer needs to practice.
What I am clear on is this:
Generation is commodity. Verification is craft.
The invisible heuristics of seniors, the quality criteria nobody has written down, the architectural memory that lives only in the heads of those who've been operating the system for years — all of that is verification infrastructure. And it's the most valuable infrastructure your team has. If your quality criteria live only in the heads of three seniors, you don't have a verification system. You have three single points of failure.
So here's what I'm proposing you do this week. Not next quarter, not when you have time. This week:
Pick one quality criterion that lives only in someone's head and write it down. Not as documentation nobody reads, but as an item in your PR template checklist or a rule in your linter. One criterion, made explicit, already changes how the team verifies.
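As an example of what "writing it down" can look like: suppose the criterion living in someone's head is "never ship an unbounded findMany". A minimal custom ESLint rule makes it explicit. Everything here is hypothetical (the rule, the incident in the message), and the AST matching is deliberately simplified:

```ts
// Sketch of one implicit criterion turned into an explicit check.
// Hypothetical team rule: every findMany call must set an explicit `take`.
import type { Rule } from "eslint";

const rule: Rule.RuleModule = {
  meta: {
    type: "problem",
    docs: {
      description: "findMany calls must pass an explicit `take` to avoid unbounded result sets",
    },
    schema: [],
  },
  create(context) {
    return {
      CallExpression(node) {
        // Match `<anything>.findMany(...)`.
        const callee = node.callee;
        if (callee.type !== "MemberExpression") return;
        if (callee.property.type !== "Identifier" || callee.property.name !== "findMany") return;

        // Look for a `take` property in the first (options) argument.
        const [arg] = node.arguments;
        const hasTake =
          arg?.type === "ObjectExpression" &&
          arg.properties.some(
            (p) => p.type === "Property" && p.key.type === "Identifier" && p.key.name === "take",
          );

        if (!hasTake) {
          context.report({
            node,
            // The lint message is where the senior's context gets written down.
            message: "findMany without an explicit `take`: unbounded queries caused the campaign-load incident. Set a limit.",
          });
        }
      },
    };
  },
};

export default rule;
```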
Define "done" before the next feature starts, not after. Sit with the team for 10 minutes and agree: what does this feature need to ship? Write it in the ticket. That's your verification contract.
In your next code review, ask a question that tests can't answer. "What happens when this service has 500ms of latency?" "Does this design decision conflict with something we decided six months ago?" That's the third layer of verification. And right now, only you can provide it.
None of this requires new tools. None of this requires budget approval. It requires deciding that verification is work that deserves the same intentionality you give to building features.
This is the fifth article in the Operating Blind series. So far I've spent more time diagnosing the problem than showing how to build the solution. I want to change that — there's a lot to explore about how to build verification infrastructure in practice. If you want to follow along, the newsletter is right below.
