11 min read

Emilio Carrión

Generating Is Easy. Verifying Is the Work.

Anthropic separated the agent that generates from the one that evaluates, and quality skyrocketed. That pattern describes the future of software engineering: generation is commodity. Verification is craft.

ai engineering · architecture · leadership · verification

A study by METR had 16 experienced developers complete real tasks in their own repositories, randomizing whether they could use AI or not. Those who used AI took 19% longer. But the devastating part isn't that. It's this: before starting, they estimated AI would make them 24% faster. And after finishing (having been objectively slower), they still believed they had been 20% faster.

Read that sentence again. They believed they were faster while being slower. And these weren't juniors. They were developers with years of experience in those specific repositories.

You know what? That data point has been haunting me for weeks. Because I don't think it's a data point about AI tools. It's a data point about a fundamental human bias: we are incapable of properly evaluating the quality of what we just produced. And that bias is about to reconfigure our profession.

A few days ago, Anthropic published a post on harness design that connects exactly with this. They built a three-agent system (one that plans, one that generates code, and one that evaluates) to create complete applications autonomously over hours.

When they asked the agent to evaluate its own work, it gave itself a high score even when quality was mediocre. It trusted what it had generated.

An independent experiment confirms it: 449 AI-made code reviews, self-evaluated at 98.6% validity. When an external evaluator reviewed them against the actual code, that number dropped to 68.9%. A 30-point gap between "looks correct" and "is correct."

Humans and machines, the same blind spot. If that's not an argument for separating the generator from the verifier, I don't know what is.

The Separation That Changed Everything

Anthropic's solution was to separate the generator from the evaluator. One agent creates, another judges. And they discovered that tuning the evaluator to be skeptical is much easier than getting the generator to be self-critical.

Their evaluator didn't read the code line by line. It navigated the application with Playwright, used it like a user, tested flows, and evaluated against predefined criteria. Each sprint had a "contract" (a prior agreement on what "done" meant) and the evaluator verified against that contract.

That pattern (generator and evaluator as separate roles) is becoming a standard. AWS documents it as an architecture pattern. OpenAI applies it in their harness engineering model, where a small team built a million lines of code without writing a single line by hand.

But I'm not just interested in it as an agent pattern. I'm interested because it describes where our profession is heading.

The Contract That's Breaking

For decades, the implicit contract of software engineering was: you generate and you verify. You write code, you test it, you review it, you deploy it. The same person (or team) that creates is the one that ensures quality. Understanding the system comes bundled with the act of building it.

That contract is breaking. And I'm watching it happen in real time.

Now AI generates. A PM with Claude Code ships a feature. A junior with Cursor accepts suggestions they don't fully understand. An autonomous agent implements an entire sprint without supervision. The artifact arrives the same (a PR, a deploy, a working feature), but the understanding behind it is radically different.

And someone has to verify that what arrived is production-quality.

I saw it a few weeks ago in one of the teams I support. An AI-assisted PR, clean, well-structured, tests passing. It looked perfect. A senior engineer stopped it because they knew that endpoint depended on a service that, under real load, has intermittent latency issues that don't show up in any test. They knew because they lived through the incident two years ago. That context wasn't in the code, wasn't in the documentation, wasn't anywhere an agent could find it. It lived in the head of someone who had been operating the system for years.

That is verification. And it's what AI, today, cannot do.

The Three Layers of Verification

Observing both agent systems and human teams, I see that effective verification operates in three layers:

Functional verification: Does it do what it should? The most obvious and the most automatable. Tests, CI/CD, linters, type checking. Anthropic's evaluator agents use Playwright to navigate the application and verify that flows work. In human teams, it's what we already have (or should have). Necessary but not sufficient.

Judgment verification: Does it do it well? This is where things get interesting. "It works" is not the same as "it's good." Is it maintainable? Does it scale? Is it secure? Do the abstractions make sense? Anthropic found they had to define explicit criteria for this ("design quality," "originality," "craft," "functionality") because without criteria, the evaluator (whether agent or human) tends to approve whatever looks reasonable. In human teams, these criteria are rarely explicit. They live in seniors' heads as intuition. As invisible heuristics.

Context verification: Does it fit the real system? This is the layer that keeps me up at night. No tool can automate it yet. Will this code behave well alongside the rest of the system? Are the design decisions coherent with the existing architecture? Is there something in the system's history (a past incident, a fragile dependency, a business constraint) that this code ignores? It's exactly what happened with that PR the senior stopped. The first layer said "all green." The second would say "clean code." Only the third caught the problem. This layer requires architectural memory, exactly the kind of knowledge that seniors have and don't know they have.

Tools cover the first layer. Explicit criteria cover the second. The third is only covered by people with experience and context. And that's where the real engineering work is.
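The second layer only works once the criteria are scorable. A minimal sketch in Python of what an explicit evaluation gate could look like; the criterion names are the ones mentioned above, but the 0-10 scale and the threshold are assumptions for illustration:

```python
from dataclasses import dataclass

# Explicit, scorable criteria. The names follow the article's example;
# the scale (0-10) and passing threshold are illustrative assumptions.
CRITERIA = ("design quality", "originality", "craft", "functionality")

@dataclass
class Evaluation:
    scores: dict  # criterion name -> score from 0 to 10

    def passes(self, threshold: float = 7.0) -> bool:
        # Refuse to give a verdict on a partial evaluation: every
        # agreed criterion must have been scored.
        missing = [c for c in CRITERIA if c not in self.scores]
        if missing:
            raise ValueError(f"unscored criteria: {missing}")
        # Simple mean; a real rubric might weight criteria differently.
        return sum(self.scores.values()) / len(self.scores) >= threshold

ev = Evaluation({"design quality": 8, "originality": 6,
                 "craft": 9, "functionality": 8})
print(ev.passes())  # True: mean is 7.75, above the 7.0 bar
```

The point isn't the arithmetic; it's that "is this design good?" becomes a question with a consistent, repeatable answer instead of a vibe.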

What This Means for Your Team

If verification is the new core job, there are practical consequences. And they start tomorrow, not in some abstract future.

Quality criteria must be explicit. Anthropic couldn't make their evaluator work until they defined concrete, scorable criteria. "Is this design good?" is a question that can't be answered consistently. "Does this design follow our principles of design quality, originality, craft, and functionality?" can.

The same applies to human teams. A concrete example: instead of reviewing a PR thinking "does this look alright?", define three questions that every review must answer. Something like: "Could someone diagnose this at 3 AM without help?", "Are the design decisions consistent with the domain architecture?", "Are the business edge cases covered?"

If your quality criteria live only in your seniors' intuition, nobody else can verify anything. You need to make them explicit. Not as bureaucratic documentation, but as operational contracts that everyone understands.
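Making criteria operational can be as small as encoding the review questions as a gate. A minimal sketch; the questions are the illustrative ones from the example above, and the function is hypothetical:

```python
# Explicit review gate: a PR is approved only when every agreed
# question has been answered, and answered "yes". Names illustrative.
REVIEW_QUESTIONS = (
    "Could someone diagnose this at 3 AM without help?",
    "Are the design decisions consistent with the domain architecture?",
    "Are the business edge cases covered?",
)

def review_verdict(answers: dict) -> str:
    # An unanswered question is not a pass; it's an incomplete review.
    unanswered = [q for q in REVIEW_QUESTIONS if q not in answers]
    if unanswered:
        return f"incomplete review: {len(unanswered)} question(s) unanswered"
    failed = [q for q in REVIEW_QUESTIONS if not answers[q]]
    return "approved" if not failed else f"blocked on: {failed[0]}"

print(review_verdict(dict.fromkeys(REVIEW_QUESTIONS, True)))  # approved
```

The code is trivial on purpose: the hard work is agreeing on the three questions, not automating the checklist.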

The "done" contract is defined before generating. In Anthropic's system, the generator and the evaluator negotiate what "done" means before a single line of code is written. That eliminates ambiguity.

In human teams, we rarely do this. Code arrives, gets reviewed on the fly, and "done" means "I don't see anything weird." Imagine the difference if, before implementing a feature, the team agreed: "This feature is done when it passes load tests at 5x expected traffic, it falls back correctly when the external service degrades, and an engineer who didn't implement it can explain the flow by reading only the code and comments."

That turns verification into a checklist against an agreement, not a subjective opinion.

Verification has to scale beyond people. Explicit criteria and "done" contracts are the first step. But the next is codifying that verification as infrastructure: contract testing between services, property-based testing, mutation testing. Verification that doesn't depend on a senior being available at 3 AM but is embedded in the system. The direction is clear: if generation accelerates 10x, verification must be automated in everything that can be automated, so that humans can focus on the third layer — the one that requires context and judgment.
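The property-based piece can be sketched without any framework: sample many random inputs and assert an invariant on all of them. A minimal stdlib-only sketch (a real project would more likely use a library such as Hypothesis); `fallback_price` and the cached value are hypothetical:

```python
import random

CACHED_PRICE = 10.0  # hypothetical last-known-good value

def fallback_price(fetch):
    """Return the live price, falling back to the cached value on timeout."""
    try:
        return fetch()
    except TimeoutError:
        return CACHED_PRICE

# Property: whatever the upstream service does -- answer or time out --
# the function always returns a sane, non-negative price.
random.seed(7)  # reproducible sampling
for _ in range(1000):
    price = random.uniform(0, 500)
    flaky = random.random() < 0.3  # ~30% of calls time out

    def fetch():
        if flaky:
            raise TimeoutError("upstream latency spike")
        return price

    result = fallback_price(fetch)
    assert result >= 0
    assert result == (CACHED_PRICE if flaky else price)

print("1000 sampled cases passed")
```

This is exactly the kind of check that doesn't need a senior awake at 3 AM: the degradation scenario from the "done" contract is verified on every commit, across a thousand sampled conditions instead of one hand-picked test case.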

Why Verifying Is Harder Than Generating

This seems counterintuitive. Generating a complete system should be harder than reviewing it. But it's not, and the reason matters.

Generating is a production problem: given a specification, produce something that satisfies it. LLMs are extraordinarily good at this. They produce code that compiles, passes tests, looks reasonable.

Verifying is a judgment problem: given something that appears to work, determine if it really works in all the cases that matter, if it's maintainable, if it will behave well under pressure, if the design decisions are sustainable.

And verifying requires something that generating doesn't: context. Knowing that endpoint will receive 10x more traffic at Christmas. Knowing that service has a fragile dependency on a legacy system. Knowing that design decision was made for a reason that isn't in the code. Knowing that "works" is not the same as "works in production at 3 AM under real load."

The VP of AI at CodeRabbit describes it this way: AI-generated code is more cognitively demanding to review than human-written code. And existing QA pipelines were built for the human pace, not for the AI-amplified pace.

That context is exactly what AI doesn't have. And it's exactly what seniors have been accumulating for years without being aware of it. It's what I described in The Code Nobody Understands Is Already in Production: AI doesn't charge for writing code, but someone is going to pay for operating it.

The Asymmetry That Defines Us

We're in a temporal asymmetry: the capacity to generate has taken a 10x leap, but the capacity to verify has barely moved. It's as if we multiplied the speed at which we build buildings tenfold, but the building inspectors are still reviewing at the same pace.

The data backs this up. A report by Sonar with over 1,100 developers found that 42% of all committed code already has significant AI assistance, and they expect it to reach 65% by 2027. But 96% of those same developers don't fully trust that code to be functionally correct. And only 48% always verify it before committing.

A study by Harness with 900 engineers completes the picture: 63% of organizations say they ship code to production faster since adopting AI. But 45% of deploys linked to AI-generated code cause problems. They call it the "AI Velocity Paradox": we go faster, but we break more things.

Something has to give. And I think what's going to give is the current model where everyone generates and nobody has time to verify.

This is what I see when I map each role:

[Figure: "Generate vs Verify Map". Roles (Senior Engineer, Staff Engineer, Junior Engineer, Junior pre-AI, PM with Claude Code, Autonomous Agent) plotted by capacity to generate versus capacity to verify, with markers for where each sits today and where the trend is taking it.]
The diagonal marks equilibrium. To its right, generation outpaces verification. That distance is the asymmetry the article discusses, and where future incidents live.

The Engineer as Evaluator

There's a line from Anthropic's post that stuck with me: "The space of interesting harness combinations doesn't shrink as models improve. It moves."

I think the same applies to engineers. The work doesn't shrink. It moves. From generating to verifying. From writing code to defining the criteria that determine whether code deserves to exist. From implementing to designing the quality contracts that others (humans or machines) must fulfill.

I won't pretend there isn't a part of me that resists. I got into this profession because I loved writing code. But what I'm discovering is that verifying with judgment is intellectually more demanding work, not less. It requires more experience, more context, more judgment. It's not a step back. It's a step up.

That said, there's a tension I don't want to ignore. The context that makes a good verifier comes from having built things. The senior who caught the problem in that PR caught it because they lived through the incident two years ago. If we stop generating entirely, where does the judgment of future verifiers come from? How do you build architectural memory if you've never operated a system that breaks at 3 AM?

I don't have the complete answer. But I know that naming the question is more honest than pretending the path is clear. And I believe the answer has to do with generating and verifying not being fixed roles, but modes of work that every engineer needs to exercise, in a proportion that depends on the moment and their maturity.

What I am sure of is this:

Generation is commodity. Verification is craft.

The invisible heuristics of seniors, the quality criteria nobody has written down, the architectural memory that lives only in the heads of those who have been operating the system for years — all of that is verification infrastructure. And it's the most valuable infrastructure your team has.

If your quality criteria live only in the heads of three seniors, you don't have a verification system. You have three single points of failure. And in a world where generation accelerates every quarter, those three points will become the bottleneck that determines whether your team can scale or drowns reviewing code it doesn't understand.

Treating that infrastructure as what it is — making it explicit, codifying it, scaling it. That's the engineering work of the coming years.

This content was first sent to my newsletter.

About the author

Emilio Carrión

Staff Engineer at Mercadona Tech. I help engineers think about product and build systems that scale. Obsessed with evolutionary architecture and high-performance teams.