TL;DR

A May 2026 Google whitepaper argues that software engineering is shifting from writing code toward expressing intent and verifying machine-generated work. The paper, as summarized by Thorsten Meyer AI, says the model itself may account for only about 10% of agent behavior, with tooling, context, tests and oversight carrying the rest.

A new Google whitepaper by Addy Osmani, Shubham Saboo and Sokratis Kartakis argues that AI-assisted software development is being reshaped by the verification, tooling and human judgment that surround coding agents, not by model upgrades alone. The claim matters as teams rely on AI for a growing share of new code.

The paper, titled The New SDLC With Vibe Coding, says the main shift in software engineering is from writing code directly to expressing intent and letting machines produce working software. According to figures cited in the paper, 85% of professional developers were regularly using AI coding agents as of early 2026, 51% used them daily, and about 41% of all new code was AI-generated.

The whitepaper draws a line between casual “vibe coding” and what it calls agentic engineering. In the paper’s framing, vibe coding means loose prompts, limited review and a reliance on whether the output appears to work. Agentic engineering means formal specifications, automated tests, evals, CI gates, tool controls and human review of architecture and risk.

Its most pointed claim is that the model is only about 10% of an agent system, while the surrounding harness accounts for about 90% of behavior. The paper cites benchmark and experiment results in support of that view, including a Terminal Bench 2.0 case in which an agent reportedly moved from outside the top 30 to the top five by changing the harness while keeping the same model.
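One way to picture the harness claim is as a data structure in which the model is a single swappable field and everything else is configuration the team owns. The sketch below is illustrative only: `Harness`, `Agent` and the stub `echo_model` are hypothetical names, not anything defined in the whitepaper or in a real agent framework.

```python
from dataclasses import dataclass, field
from typing import Callable

# Assumption: a "model" is any function from prompt text to output text.
Model = Callable[[str], str]


@dataclass
class Harness:
    """Everything around the model that the team controls (illustrative fields)."""
    system_prompt: str
    tools: dict[str, Callable[[str], str]] = field(default_factory=dict)
    context: list[str] = field(default_factory=list)

    def build_prompt(self, task: str) -> str:
        # Assemble rules, context and tool names into one prompt, one item per line.
        parts = [self.system_prompt, *self.context]
        if self.tools:
            parts.append("Available tools: " + ", ".join(sorted(self.tools)))
        parts.append("Task: " + task)
        return "\n".join(parts)


@dataclass
class Agent:
    model: Model
    harness: Harness

    def run(self, task: str) -> str:
        return self.model(self.harness.build_prompt(task))


def echo_model(prompt: str) -> str:
    """Stub standing in for a real model call: reports how much prompt it received."""
    return f"saw {len(prompt.splitlines())} prompt lines"


# Same model, two harnesses: only the surrounding configuration differs.
bare = Agent(echo_model, Harness(system_prompt="You are a coding agent."))
tuned = Agent(echo_model, Harness(
    system_prompt="You are a coding agent.",
    tools={"run_tests": lambda cmd: "ok"},
    context=["Repo uses pytest.", "Never edit generated files."],
))

print(bare.run("fix the failing test"))
print(tuned.run("fix the failing test"))
```

With the stub model, the bare agent reports 2 prompt lines and the tuned agent reports 5. The point is only that the model field is identical in both cases while the input it receives is not.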

AI Dispatch · Field Notes

Google · Osmani, Saboo & Kartakis · May 2026

The model is only 10%

A Google whitepaper argues software’s biggest shift is from writing code to expressing intent. Its sharpest claim: the model you obsess over is the smallest part of the system — the scaffolding around it does the real work.

A spectrum, not a binary — the differentiator is how outputs get verified

Vibe Coding: casual prompts · “does it seem to work?” · disposable code · high risk

Structured AI-Assisted: detailed prompts + constraints · manual testing · features in real codebases

Agentic Engineering: formal specs · automated tests + evals + CI gates · production scale · low risk

Tests verify the deterministic; evals verify the rest. Without both, it’s vibe coding — however clever the prompt.

The idea worth building your strategy around

Agent = Model + Harness

MODEL: ~10%

HARNESS: ~90% (prompts · tools · context · hooks · sandboxes · observability)

The ~90% is your surface area, not the provider’s

Outside Top 30 → Top 5 on Terminal Bench 2.0 by changing only the harness — same model.

“Most agent failures, examined honestly, are configuration failures” — a missing tool, a vague rule, a noisy context.

The economics: it’s a token-cost problem (CapEx vs OpEx)

Vibe Coding (Low CapEx · High OpEx): looks free but hides debt in token burn (fix-it loops), maintenance tax (AI spaghetti) and security remediation. Cost eventually crosses over to 3–10× more per feature.

Agentic Engineering (High CapEx · Low OpEx): pay upfront (specs, evals, context), then ship cheaply. Levers: context engineering for first-pass success, plus intelligent model routing that sends the easy work to cheap models.

85% of devs use AI coding agents (51% daily)

41% of all new code is AI-generated

~90% of agent behavior is the harness, not the model

+19% longer on some tasks (METR): verification is the cost

The read

The clearest map yet of how serious AI development works — and mostly tool-agnostic. But it’s a Google funnel: the concepts are neutral, the on-ramps point to Gemini, Jules & the ADK. If the harness is 90% and it’s yours, your moat and your costs both live there — so own your scaffolding, route across models, and remember: AI amplifies whatever engineering culture it lands in.

Source: Osmani, Saboo & Kartakis, “The New SDLC With Vibe Coding,” Google (May 2026). Figures are the paper’s own, incl. METR & LangChain. Analysis is the author’s.

thorstenmeyerai.com

Verification Becomes The Cost Center

The argument matters for engineering leaders because it shifts spending and management attention away from model selection alone. If the paper’s claim is right, teams that buy better models but underinvest in tests, context management, tool permissions, sandboxes and observability may see limited gains and higher maintenance costs.

Thorsten Meyer AI’s analysis of the paper frames the issue as an economics problem: low upfront process investment can look cheap but may lead to repeated fix loops, security remediation and harder maintenance. The same analysis says disciplined agentic engineering has higher upfront costs but can lower the cost per feature when specifications, evals and routing are in place.
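The crossover described here is simple arithmetic, and a short sketch makes the trade-off concrete. All dollar figures below are hypothetical and chosen only for illustration; the 5× per-feature ratio sits inside the 3–10× range the analysis cites, but none of these numbers come from the paper.

```python
import math
from typing import Optional


def cumulative_cost(upfront: float, per_feature: float, features: int) -> float:
    """Total spend after shipping a given number of features."""
    return upfront + per_feature * features


def break_even_features(
    vibe_upfront: float,
    vibe_per_feature: float,
    agentic_upfront: float,
    agentic_per_feature: float,
) -> Optional[int]:
    """Smallest feature count at which the agentic approach costs no more in total.

    Returns 0 if it is never more expensive upfront, and None if its
    per-feature cost is not lower, so it never catches up.
    """
    extra_upfront = agentic_upfront - vibe_upfront
    saving_per_feature = vibe_per_feature - agentic_per_feature
    if extra_upfront <= 0:
        return 0
    if saving_per_feature <= 0:
        return None
    return math.ceil(extra_upfront / saving_per_feature)


# Hypothetical inputs: no upfront process vs. 20,000 spent on specs, evals and context.
n = break_even_features(
    vibe_upfront=0, vibe_per_feature=500,
    agentic_upfront=20_000, agentic_per_feature=100,
)
print(n)
print(cumulative_cost(0, 500, n), cumulative_cost(20_000, 100, n))
```

Under these assumed inputs the two approaches cost the same at 50 features (25,000 each), and the disciplined approach is cheaper from then on. A team that ships only a handful of features would never reach that point, which is why the upfront investment is a judgment call and not a universal rule.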

Vibe Coding Gets Narrower

The paper responds to the broad use of “vibe coding,” a phrase popularized by Andrej Karpathy in February 2025 to describe accepting AI-generated code through feel and repeated prompting. The term has since been used loosely across many AI-assisted workflows.

Google’s paper treats vibe coding as one end of a spectrum rather than the whole category. At the other end, it places agentic engineering, where AI generates code inside a controlled process with tests for deterministic behavior and evals for less predictable agent decisions.
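The split between tests and evals can be sketched in a few lines. A test asserts one exact answer for deterministic code; an eval runs a non-deterministic agent many times, grades each output and gates on the pass rate. The names here (`slugify`, `make_flaky_agent`, `defines_add`) and the 0.8 threshold are assumptions made for the example, not taken from the paper.

```python
import random
from typing import Callable


def slugify(title: str) -> str:
    """Deterministic helper: the same input always gives the same output."""
    return "-".join(title.lower().split())


def test_slugify() -> None:
    # A test: one input, one exact expected output.
    assert slugify("Hello  World") == "hello-world"


def pass_rate(
    agent: Callable[[str], str],
    grader: Callable[[str], bool],
    prompt: str,
    trials: int,
) -> float:
    """An eval: the fraction of repeated runs whose output the grader accepts."""
    passed = sum(grader(agent(prompt)) for _ in range(trials))
    return passed / trials


def make_flaky_agent(seed: int, success_prob: float) -> Callable[[str], str]:
    """Stub for a non-deterministic agent; a real one would call a model."""
    rng = random.Random(seed)

    def agent(prompt: str) -> str:
        if rng.random() < success_prob:
            return "def add(a, b):\n    return a + b"
        return "TODO"

    return agent


def defines_add(output: str) -> bool:
    """Hypothetical grader: accepts any output that defines an add() function."""
    return "def add(" in output


test_slugify()
rate = pass_rate(make_flaky_agent(seed=0, success_prob=0.9), defines_add,
                 "write add()", trials=50)
gate_open = rate >= 0.8  # assumed threshold for a CI gate
print(f"pass rate {rate:.2f}, gate {'open' if gate_open else 'closed'}")
```

The test either passes or fails. The eval produces a rate, and the team has to decide what rate is good enough, which is where human judgment re-enters the loop.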

The source material also notes a commercial angle: while the ideas are described as broadly applicable, the analysis says the on-ramps point toward Google’s Gemini, Jules and Agent Development Kit ecosystem.

“generation is solved; verification, judgment, and direction are the new craft”
— Osmani, Saboo and Kartakis, in the Google whitepaper

Methods Still Need Scrutiny

The adoption figures and benchmark examples are attributed to the whitepaper and cited sources, including METR and LangChain, but the supplied source material does not provide the full methodology behind every number. It is also not yet clear how broadly the 10% model and 90% harness framing applies across different products, teams, languages and regulated environments.

The paper’s commercial implications are also open to interpretation. The concepts may be tool-agnostic, but the analysis says Google’s examples and suggested paths point toward its own AI developer stack.

Teams Test The Harness Thesis

The next test is whether software teams change how they evaluate AI coding systems. The paper points toward more investment in specifications, automated tests, evals, observability, context engineering and model routing, rather than waiting for a single model upgrade to fix workflow problems.

For readers running engineering teams, the practical milestone is measurable: whether AI coding agents can improve first-pass success, reduce repair cycles and pass production checks without creating hidden debt.
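Two of those measures can be computed from a simple log of agent runs. The sketch below assumes a hypothetical `TaskRun` record with an attempt count and a pass flag; real teams would pull these fields from their own CI or agent telemetry. Hidden debt is not captured here and would need separate signals, such as later defect or rework rates.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class TaskRun:
    task_id: str
    attempts: int         # total agent attempts, including the first
    passed_checks: bool   # whether the final attempt passed production checks


def first_pass_rate(runs: Sequence[TaskRun]) -> float:
    """Share of tasks that passed checks on the very first attempt."""
    if not runs:
        raise ValueError("no runs to measure")
    return sum(r.passed_checks and r.attempts == 1 for r in runs) / len(runs)


def mean_repair_cycles(runs: Sequence[TaskRun]) -> float:
    """Average number of extra attempts after the first, across all tasks."""
    if not runs:
        raise ValueError("no runs to measure")
    return sum(r.attempts - 1 for r in runs) / len(runs)


# Made-up sample data for illustration.
runs = [
    TaskRun("a", attempts=1, passed_checks=True),
    TaskRun("b", attempts=3, passed_checks=True),
    TaskRun("c", attempts=2, passed_checks=False),
    TaskRun("d", attempts=1, passed_checks=True),
]
print(first_pass_rate(runs), mean_repair_cycles(runs))
```

For this sample the first-pass rate is 0.5 and the mean is 0.75 repair cycles per task. Tracking both before and after a harness change gives a team a like-for-like comparison that does not depend on switching models.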

Key Questions

What is the actual news here?

Google has published a May 2026 whitepaper arguing that AI-driven software development should be judged by the full system around the model, including prompts, tools, tests, evals and human oversight.

Does the paper say models no longer matter?

No. The paper’s claim is that models matter, but they are only one part of a working agent system. It argues that the surrounding harness has a larger effect on real-world behavior.

What is the difference between vibe coding and agentic engineering?

In the paper’s framing, vibe coding relies on loose prompts and surface-level checks. Agentic engineering uses formal specs, automated tests, evals, CI gates and human review before AI-generated code reaches production.

Why should developers care about the 10% claim?

If the claim holds, teams may get more value by improving tests, context, tools and review systems than by switching models alone. It also means engineering culture and process shape AI output.

What remains unconfirmed?

The broad direction is clear from the whitepaper, but the exact strength of the 10% and 90% split across different teams and workloads remains to be tested outside the cited examples.

Source: Thorsten Meyer AI

The Model Is Only 10%: The Real Lesson of the New SDLC

Author

The Blissful Studio Team
