CampeloLabs

AI Testing for the AI Coding Era

Cicero Campelo, CISSP
June 27, 2026 · 8 min read

Part of our guide to AI for startups.

[Image: A single founder at a laptop watching AI agents run automated tests across a web app before shipping]

In a YC Root Access interview, Wei-Wei Wu and Jeff An, co-founders of Momentic, put words to a problem every team now feels: AI writes code faster than any human can check it. Their company builds the verification layer for that world. Momentic went through Y Combinator in winter 2024, raised a $15 million Series A led by Standard Capital, and powers testing for Notion, Bilt, Quora, Xero, and Webflow. The founders say it runs more than a million tests a day.

The bet underneath all of that is simple. As code generation gets cheap, the bottleneck moves to verification, and AI testing is how a lean team keeps shipping without breaking things. If you are a founder pointing coding agents at your product and merging more than you can read, this is the part of the stack you cannot skip.

What AI testing actually is

Start with the plain version. Testing is how you answer one question: when I ship this code, is my app still working? That question gets harder as the app grows, more people touch it, and more product lines pile up. The old answers were a human clicking around before every release, or a suite of brittle automated scripts that someone has to maintain.

AI testing is a third answer. Momentic does what the founders call functional testing: an agent impersonates one of your users, goes through the app, clicks through the flows, and confirms that everything a real person would do still works. You describe the flow the way you would describe it to a manual tester, in plain English, and the agent runs it in a real browser. Wu says the average step runs in under 300 milliseconds, which is the difference between a test suite you run on every change and one you run overnight.
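To make the shape of a plain-English flow concrete, here is a minimal sketch. It is not Momentic's API: `Flow`, `run_flow`, and `make_fake_agent` are hypothetical names, and the fake agent is a stand-in for the real agent driving a real browser. The flow itself reuses the "log in, create a document, share it" example from the FAQ in this article.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set


@dataclass
class Flow:
    name: str
    steps: List[str]  # plain-English steps, written as you would brief a manual tester


@dataclass
class FlowResult:
    flow: str
    passed: bool
    failed_step: Optional[str] = None


def run_flow(flow: Flow, perform_step: Callable[[str], bool]) -> FlowResult:
    """Run the steps in order and stop at the first one that fails."""
    for step in flow.steps:
        if not perform_step(step):
            return FlowResult(flow.name, False, step)
    return FlowResult(flow.name, True)


def make_fake_agent(broken_steps: Set[str]) -> Callable[[str], bool]:
    """Hypothetical stand-in for an agent in a browser: every step
    succeeds unless it is listed as broken."""
    return lambda step: step not in broken_steps


share_flow = Flow(
    name="share a document",
    steps=["log in", "create a document", "share it"],
)

healthy = run_flow(share_flow, make_fake_agent(set()))
regressed = run_flow(share_flow, make_fake_agent({"share it"}))

print(healthy)
print(regressed)
```

The useful property is that the test is written in terms of what the user does, so the same three steps stay valid when the markup underneath them changes.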

That is distinct from the checks you may already have. Linters scan your code for bad patterns. Code review, human or AI, looks at the change before it merges. Those operate on the code. AI testing operates on the running product, which is the only place you find out whether the thing actually works for a user. Andrej Karpathy made the general version of this point in his Sequoia conversation: AI is most reliable in domains where the output can be verified. Testing is how you create that verifiable signal for your own software.

Why more code means more verification, not less

Developers have always disliked writing tests, and the reasons are honest. Wu saw it up close at Robinhood, where he watched the team grow from 300 engineers to over 1,000. "My entire job was basically managing eight people and trying to get those eight people to convince the other thousand people to write and maintain tests," he said. The goal was 80 percent coverage and a 90 percent pass rate, and hitting it was "basically impossible." Tests do not feel like productive work. The customer never sees them, you cannot put them in a flashy demo, and they do not show up on a performance review.

Now set that against the present. The amount of code being written per day is growing fast as teams point Cursor, Claude Code, and Codex at their products. The work of typing code is getting cheaper every month. The work of confirming that code is correct is not. So the bottleneck moves. As Wu put it, as code output increases you get massive bottlenecks in verifying the work that may not have existed before. The same point came up in a YC discussion of building with Claude Code, where an engineer described the unglamorous reality of ending up doing QA by hand, the least fun part of the job. AI testing exists to take that part off your plate so the verification step does not become the thing that caps how fast you ship.

How AI testing fits a lean team's stack

There are two places it plugs in. The first is your developer loop. Momentic exposes an MCP integration, so a coding agent like Cursor or Claude Code can write and run tests as a tool call while it builds. The agent makes a change, then calls out to a real browser to confirm the flow still works before it claims it is done.
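The loop described above can be sketched in a few lines. This is an illustration under stated assumptions, not Momentic's MCP interface: the tool call is represented by an ordinary function, and `build_until_verified`, `propose_change`, and `verify_in_browser` are hypothetical names with stubbed behavior.

```python
from typing import Callable, Dict, List


def build_until_verified(
    propose_change: Callable[[int], str],
    verify_in_browser: Callable[[str], bool],
    max_attempts: int = 3,
) -> Dict[str, object]:
    """Propose a change, verify it with an external check, and only
    report 'done' once that check passes."""
    attempts: List[str] = []
    for attempt in range(1, max_attempts + 1):
        change = propose_change(attempt)
        attempts.append(change)
        if verify_in_browser(change):
            return {"done": True, "change": change, "attempts": attempts}
    # Out of attempts: report failure instead of claiming success.
    return {"done": False, "change": None, "attempts": attempts}


# Stubs for illustration only: the second patch is the one that works.
def propose_change(attempt: int) -> str:
    return f"patch-v{attempt}"


def verify_in_browser(change: str) -> bool:
    return change == "patch-v2"


outcome = build_until_verified(propose_change, verify_in_browser)
gave_up = build_until_verified(propose_change, lambda change: False, max_attempts=2)

print(outcome)
print(gave_up)
```

The design choice worth noticing is that the agent never decides for itself that it is finished. The only path to `done` runs through a check it does not control.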

That raises the obvious question: why not let the coding agent test itself? Two reasons, both worth internalizing. The first is reliability: the agent often assumes its own work is correct when it is not, and general agents are not tuned for browser use, especially on complex apps with rich text editors, drag-and-drop, or canvases. The second is maintenance. You can ask Cursor to generate Playwright tests today, but as Wu notes, once you have a hundred thousand lines of Playwright and you change a feature, you now have tens of thousands of lines to find and update. A purpose-built layer maintains that test suite for you, and even suggests updates when it notices a new UI component, so you are not burning a session of agent tokens keeping tests in sync.

The second place it plugs in is your ship gate. On Notion's team, Momentic tests must pass before an engineer can merge a pull request. That is the same pattern strong AI-native teams are converging on: humans write a spec and a set of tests that define success, and agents generate code until those tests pass, a loop YC has described as the software factory. The test suite stops being a chore and becomes the gate that lets you trust faster output.
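A ship gate is simple enough to sketch directly. The version below is an assumption about how such a gate could look, not a description of Notion's or Momentic's setup; `ship_gate` and the flow names are hypothetical. A required flow that did not run is treated the same as one that failed.

```python
from typing import Dict, List, Tuple


def ship_gate(results: Dict[str, bool], required: List[str]) -> Tuple[bool, List[str]]:
    """Allow a merge only if every required flow ran and passed.

    A flow that is missing from the results counts as blocking.
    """
    blocking = [name for name in required if not results.get(name, False)]
    return (len(blocking) == 0, blocking)


# Illustrative results from one test run on a pull request.
results = {"login": True, "checkout": False}
required = ["login", "checkout", "share a document"]

allowed, blocking = ship_gate(results, required)

# In CI, a non-zero exit code is what stops the merge.
exit_code = 0 if allowed else 1
print(allowed, blocking, exit_code)
```

The gate does not care whether a human or an agent wrote the change. It only asks whether the flows that matter still pass.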

Truth-driven development: specs become the source of truth

The deeper idea in the interview is what the founders call truth-driven development, and it reframes what an engineer does. There are two schools of thought, Wu says. One holds that your code is the source of truth: whatever is in production is, by definition, how the product behaves. The gap is obvious once you say it out loud. Code has bugs, and you would not call a bug part of the spec.

The other school says the source of truth is the spec: the detailed user journeys, success criteria, and edge cases a human writes, usually with AI. "Your code is just an implementation of that source of truth," Wu says. In that world the engineer's job moves up a level. You become, in his words, a requirements gatherer and a truth finder, deciding what should be built out of a thousand feature requests, while agents handle the implementation and a testing layer confirms it matches the spec. "I would be disappointed if in 3 to 6 months I'm still reviewing TypeScript or React code," he said. Code becomes an implementation detail. The spec and the tests that enforce it become the thing you actually own.
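One way to picture "code is just an implementation of the spec" is to write the spec as data and check any implementation against it. The sketch below is hypothetical throughout: the sharing feature, the `Criterion` type, and both `share_v1` and `share_v2` are invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Criterion:
    description: str                   # the sentence a human wrote in the spec
    check: Callable[[Callable], bool]  # how to test any implementation against it


def raises_value_error(fn: Callable, *args) -> bool:
    """True if calling fn(*args) raises ValueError."""
    try:
        fn(*args)
    except ValueError:
        return True
    return False


# The spec: success criteria and edge cases, independent of any implementation.
SHARE_SPEC: List[Criterion] = [
    Criterion(
        "sharing adds the invitee",
        lambda share: "b@example.com" in share(["a@example.com"], "b@example.com"),
    ),
    Criterion(
        "a blank email is rejected",
        lambda share: raises_value_error(share, ["a@example.com"], "  "),
    ),
    Criterion(
        "sharing twice does not duplicate a collaborator",
        lambda share: share(["a@example.com"], "a@example.com").count("a@example.com") == 1,
    ),
]


def violations(implementation: Callable, spec: List[Criterion]) -> List[str]:
    """Return the spec sentences that the implementation fails to satisfy."""
    return [c.description for c in spec if not c.check(implementation)]


def share_v1(collaborators: List[str], email: str) -> List[str]:
    # First attempt: works on the happy path, ignores the edge cases.
    return collaborators + [email]


def share_v2(collaborators: List[str], email: str) -> List[str]:
    # Second attempt: written against the full spec.
    if not email.strip():
        raise ValueError("email is required")
    if email in collaborators:
        return list(collaborators)
    return collaborators + [email]


v1_violations = violations(share_v1, SHARE_SPEC)
v2_violations = violations(share_v2, SHARE_SPEC)
print(v1_violations)
print(v2_violations)
```

Both implementations are disposable. The list of criteria is the part a human owns, and it reads as plain sentences even to someone who never opens the code.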

You do not have to fully buy the timeline to act on the structure. Writing down what good looks like, in plain language, before you point an agent at a feature is the highest-return habit you can build right now. It is what lets you delegate the typing and still sleep at night.

What Notion's switch shows about the payoff

The clearest proof point is Notion. Before Momentic, they ran a mix of manual testing and a large Selenium suite the team had to maintain. Selenium is notoriously flaky, breaking on XPaths and selectors every time the page shifts, and Notion is a hard app to test: a flexible rich text editor where everything is a database. The story of how they started is almost a founder cliche in the best way. A Notion engineer tweeted that he wished he could just describe a test and have it run. People in the replies recommended Momentic. Wu, in San Francisco at 10 p.m., direct-messaged him, sent a Loom of the tool running on his own Notion workspace, and onboarded him that night.
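The flakiness of position-based selectors is easy to reproduce. The sketch below uses Python's standard `xml.etree.ElementTree` as a stand-in for a browser DOM, so it is not Selenium, and the label-based locator is only a less brittle illustration, not a description of how Momentic finds elements. The two page snippets are invented.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# The same page before and after a banner is added at the top.
PAGE_V1 = (
    "<html><body>"
    "<div><button>Save</button></div>"
    "<div><button>Share</button></div>"
    "</body></html>"
)
PAGE_V2 = (
    "<html><body>"
    "<div>New banner</div>"
    "<div><button>Save</button></div>"
    "<div><button>Share</button></div>"
    "</body></html>"
)

POSITIONAL = "./body/div[2]/button"  # "the button in the second div"
BY_LABEL = ".//button[.='Share']"    # "the button labelled Share"


def button_text(page: str, locator: str) -> Optional[str]:
    """Return the text of the first element the locator matches, if any."""
    element = ET.fromstring(page).find(locator)
    return None if element is None else element.text


print(button_text(PAGE_V1, POSITIONAL))  # the Share button
print(button_text(PAGE_V2, POSITIONAL))  # now the Save button: silently wrong
print(button_text(PAGE_V2, BY_LABEL))    # still the Share button
```

The positional locator does not even fail loudly after the layout change. It quietly points at a different button, which is the kind of breakage that makes a large selector-based suite expensive to keep honest.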

Today Notion executes nearly half a million test runs a day, and those tests gate every merge. The way the founders frame the return is worth copying. The easy lens is developer hours saved versus a legacy tool like Selenium, Cypress, or Playwright. But the North Star, Wu says, is how many regressions and incidents the tests prevent from ever reaching customers. Tests are not the goal. Shipping more without shipping bugs is the goal, and the count of incidents you avoided is the number that actually maps to it.
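Tracking that number takes very little machinery. The sketch below assumes you log each regression with who found it first; the `found_by` field, the `prevention_summary` function, and the sample records are all hypothetical and made up for illustration.

```python
from collections import Counter
from typing import Dict, List


def prevention_summary(regressions: List[Dict[str, str]]) -> Dict[str, float]:
    """Count regressions caught by tests versus those customers hit first."""
    found_by = Counter(r["found_by"] for r in regressions)
    prevented = found_by["tests"]
    escaped = found_by["customers"]
    total = prevented + escaped
    return {
        "prevented": prevented,
        "escaped": escaped,
        "prevention_rate": prevented / total if total else 0.0,
    }


# Illustrative log, not real data.
sample = [
    {"flow": "login", "found_by": "tests"},
    {"flow": "checkout", "found_by": "tests"},
    {"flow": "share a document", "found_by": "customers"},
    {"flow": "checkout", "found_by": "tests"},
]

summary = prevention_summary(sample)
print(summary)
```

Every entry under `escaped` is also a candidate for a new test, which is how the suite grows toward the flows that actually hurt when they break.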

What to do this week

  • Pick your single most important user flow, the one that would cost you customers if it broke, and write it out in plain English as a test, step by step.
  • Add a ship gate. Make at least that one flow pass before any code merges, whether a human or an agent wrote it.
  • If you use coding agents, wire a real browser check into the loop so every change is verified against the running app, not just checked to compile.
  • Stop trusting agents to grade their own homework. Treat verification as a separate step with its own source of truth.
  • Write the spec before the code. For your next feature, define success criteria and edge cases first, then let the agent implement against them.
  • Track the right number. Count the regressions your tests caught before customers did, not just the hours saved.

Getting this right is less about a single tool and more about how your company decides to ship: what you specify, what you verify, and where you put the gate. That operating model, building with AI without losing control of quality, is exactly what we teach in the AI Operating System for Startups. For the model-quality half of the same problem, see our piece on LLM evals for founders.

Frequently asked questions

What is AI testing?

AI testing uses AI agents to verify that software works the way it should, by impersonating a real user and clicking through the actual app instead of asserting against code. You describe a user flow in plain English, like "log in, create a document, share it", and the agent runs it in a real browser and reports whether it still works. It sits alongside linters and code review as the check that confirms the running product behaves correctly for a user, not just that the code compiles. Momentic, the company in this article, runs this kind of functional testing for companies like Notion and Quora.

What is the difference between AI testing and LLM evals?

They verify different things. AI testing checks that your software works: the login flow runs, the checkout completes, the page does not crash. LLM evals check that an AI feature produces good output: the summary is accurate, the answer is safe, the agent picked the right tool. If your product has an AI feature inside it, you need both. AI testing is the software-quality layer; evals are the model-quality layer. See our companion piece on LLM evaluation for the second half of that picture.

Can AI coding agents just write their own tests?

They can write test code, but Momentic's founders argue you should not trust them to verify their own work. A coding agent often assumes what it built is correct when it is not, and general agents are not optimized for driving a browser, especially on complex apps with rich text editors, drag-and-drop, or canvases. There is also a maintenance trap: an agent can generate a hundred thousand lines of Playwright tests, but the next feature change leaves you hunting through tens of thousands of lines to update. A purpose-built testing layer maintains that source of truth for you.

Does AI testing replace QA engineers?

It changes the job more than it removes it. The manual click-through bug bash before every release does not scale on a lean team, and that is the part AI testing automates. What stays human is deciding what should be tested in the first place: which user journeys matter, what success looks like, and which edge cases are worth catching. Momentic's founders describe future engineers as requirements gatherers and truth finders, people who specify what good means and let agents handle the mechanical verification.
