What is an LLM eval?

An LLM eval is a repeatable test that scores what your AI system produces against what you want it to produce. You give it a set of inputs, run them through your prompt, model, and tools, then grade the outputs with a scorer: an exact match, a rule, a smaller model acting as a judge, or a human. The score tells you whether a change made the product better or worse. Evals are to AI products what unit and integration tests are to normal software, except the output is open-ended, so the grading is the hard part.

Why are evals so important when building an AI product?

Because an AI product is non-deterministic: the same input can produce different output, and a prompt or model change that fixes one case can quietly break ten others. Without evals you are shipping on vibes and you cannot tell whether you are improving. Braintrust founder Ankur Goyal puts it bluntly: if you are building an AI product, there is no point doing anything other than evals, and everything should revolve around them. Evals are how you turn a demo that works sometimes into a product that works reliably.

How do you build an eval for an LLM application?

Start with a dataset of real inputs, ideally pulled from actual usage rather than invented. Define what a good output looks like for each, which is a judgment call that usually needs a subject-matter expert, not just an engineer. Pick a scorer for each dimension you care about (correctness, format, safety, tone), run your system over the dataset, and look at where it fails. Then make a change, re-run, and compare scores. The first version can be a spreadsheet and a few hand-written checks. The discipline matters more than the tooling.
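As a rough illustration of that loop, here is a minimal Python sketch. Everything in it is an assumption made for the example: `system_under_test` is a hypothetical stand-in for your prompt, model, and tools, and the three cases stand in for real inputs with expert-defined expected outputs.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Case:
    input: str
    expected: str  # what a subject-matter expert says a good output is


def system_under_test(text: str) -> str:
    # Hypothetical stand-in for prompt + model + tools.
    return text.strip().upper()


def correctness(output: str, case: Case) -> float:
    # Exact-match scorer: 1.0 if the output equals the expected answer.
    return 1.0 if output == case.expected else 0.0


def format_ok(output: str, case: Case) -> float:
    # Rule-based scorer: output must be non-empty and a single line.
    return 1.0 if output and "\n" not in output else 0.0


SCORERS = {"correctness": correctness, "format": format_ok}

# Illustrative data only; in practice these come from actual usage.
DATASET = [
    Case("refund policy", "REFUND POLICY"),
    Case(" hours ", "HOURS"),
    Case("pricing", "Pricing"),
]


def run_eval(system, dataset, scorers):
    """Run every case through the system and average each scorer."""
    rows = []
    for case in dataset:
        output = system(case.input)
        rows.append({name: fn(output, case) for name, fn in scorers.items()})
    return {name: mean(row[name] for row in rows) for name in scorers}


if __name__ == "__main__":
    print(run_eval(system_under_test, DATASET, SCORERS))
```

Re-running `run_eval` after each change and comparing the per-dimension averages is the whole workflow. The same structure works just as well in a spreadsheet.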

Do evals make it easier to switch AI models?

Yes, and that is one of their biggest payoffs. The model you pick today is very unlikely to be the one you use tomorrow, as Braintrust's Ankur Goyal puts it. A solid eval suite lets you swap in a new model, re-run the same tests, and see in minutes whether quality went up or down for your specific use case. Without evals, every model migration is a leap of faith. With them, it is a measurement.

What is the difference between offline evals and production monitoring?

Offline evals run before you ship: you score a fixed dataset to decide whether a change is safe to release. Production monitoring runs after: you log real traffic and score a sample of live outputs to catch regressions and new failure modes you did not anticipate. You need both. Offline evals are your gate before deploy; monitoring is your smoke detector once real users are in the system. As an agent generates more steps and more data, both get harder and matter more.

LLM evaluation: how founders ship reliable AI

Most AI products break in the same place. The demo works, the founder ships, and then real users send inputs nobody tested, the model does something slightly wrong, and there is no way to tell whether yesterday's prompt tweak helped or hurt. The missing piece is evals. In a Greylock conversation, Ankur Goyal, founder and CEO of Braintrust and previously head of the ML platform at Figma, makes the case that LLM evaluation is not a step in building an AI product but the core of it. Here is what that means and how to start.

What is an LLM eval?

An LLM eval is a repeatable test that scores what your AI system produces against what you actually want. You take a set of inputs, run them through your prompt, model, and tools, then grade the outputs with a scorer. The scorer can be an exact match, a rule, a smaller model used as a judge, or a person. The grade tells you whether a change made the product better or worse.
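To make the scorer types concrete, here is a hedged Python sketch of three of them. The function names are invented for this example, and `fake_judge` is a hypothetical stub standing in for a call to a smaller model. A human grade would simply be a recorded number in place of any of these functions.

```python
import json
import re
from typing import Callable


def exact_match(output: str, expected: str) -> float:
    # Exact-match scorer, ignoring surrounding whitespace.
    return float(output.strip() == expected.strip())


def has_required_json_keys(output: str, required: set) -> float:
    # Rule scorer: output must parse as a JSON object with the required keys.
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    return float(isinstance(data, dict) and required.issubset(data.keys()))


def judge_score(output: str, rubric: str, judge: Callable[[str], str]) -> float:
    # Model-as-judge scorer: ask the judge for a 1-5 grade, map it to 0..1.
    prompt = f"Rubric: {rubric}\nAnswer: {output}\nReply with a score from 1 to 5."
    reply = judge(prompt)
    match = re.search(r"[1-5]", reply)
    return (int(match.group()) - 1) / 4 if match else 0.0


def fake_judge(prompt: str) -> str:
    # Hypothetical stub; a real judge would be a model call.
    return "Score: 4"


if __name__ == "__main__":
    print(exact_match(" Paris ", "Paris"))
    print(has_required_json_keys('{"answer": "x", "source": "y"}', {"answer", "source"}))
    print(judge_score("Happy to help with that.", "Is the tone polite?", fake_judge))
```

Passing the judge in as a plain callable keeps the scorer testable without a network call, which is a design choice of this sketch and not something the source prescribes.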

If that sounds like testing, it is, with one twist. In normal software the output is deterministic, so a test is a simple assertion. In an AI product the output is open-ended and changes from run to run, so the hard part is not running the test but deciding what good looks like and how to measure it. Goyal frames evals as focusing on what the AI should produce, not how it works under the hood. That shift, from the mechanism to the outcome, is the whole idea.

Why evals are the core of building an AI product

Goyal does not hedge on this. In his words, if you are building an AI product, there is no point doing anything other than evals, and everything should revolve around your evals. That sounds extreme until you have shipped one. A model is non-deterministic, so the same input can give different answers, and a prompt change that fixes one customer's complaint can silently break ten other cases you were not looking at. Without a way to measure quality across many cases at once, you are flying blind.

Evals are what turn that chaos into a feedback loop. They are the difference between "this seems better" and "this scores higher on the cases our customers actually care about." Founders coming from traditional software underrate this because they are used to deterministic systems where a passing test means the feature works. In AI, the eval suite is the product surface where you define and defend quality.
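One way to see the "fixes one case, breaks others" problem is to diff per-case results between two runs of the same dataset. The sketch below is illustrative only: the case IDs and pass/fail values are made up, and `diff_runs` is a name invented for this example.

```python
def diff_runs(baseline: dict, candidate: dict) -> dict:
    """Compare per-case pass/fail results from two runs of the same dataset."""
    fixed = sorted(
        case for case, ok in baseline.items()
        if not ok and candidate.get(case, False)
    )
    broken = sorted(
        case for case, ok in baseline.items()
        if ok and not candidate.get(case, False)
    )
    return {"fixed": fixed, "broken": broken}


def pass_rate(run: dict) -> float:
    # Share of cases that passed in one run.
    return sum(run.values()) / len(run)


# Made-up results: the prompt tweak fixes one refund case
# and silently breaks two billing cases.
BASELINE = {
    "refund-01": True, "refund-02": False,
    "billing-01": True, "billing-02": True, "login-01": True,
}
CANDIDATE = {
    "refund-01": True, "refund-02": True,
    "billing-01": False, "billing-02": False, "login-01": True,
}

if __name__ == "__main__":
    print(diff_runs(BASELINE, CANDIDATE))
    print(pass_rate(BASELINE), pass_rate(CANDIDATE))
```

Looking only at the complaint that prompted the change, the tweak looks like a win. The diff shows the overall pass rate went down.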

A good eval starts with taste, not code

The instinct is to treat evals as an engineering task. The harder and more important part is judgment. Goyal describes the craft of deciding how you want the model to behave as fundamental to building evals. Someone has to say what a good answer is, and that someone is often not an engineer.

This is where subject-matter experts come in. Goyal points to healthcare, where doctors help shape and review the prompts so the model performs well on real cases, and argues that pulling experts into AI development is essential for solving hard problems. The practical takeaway for a founder: the people who know what good output looks like (the clinician, the lawyer, the support lead) belong in your eval process, not just your engineers. A separate Greylock discussion makes the same point from the buyer side: the most discerning AI customers optimize for the quality of the output, not for cost. If quality is what wins the market, quality is what you have to measure.

The hard part is the infrastructure

Writing your first eval is easy: a spreadsheet of inputs and a few checks gets you started. Running evals well at scale is not. In a separate Greylock segment, the Braintrust team lays out why this is genuinely hard infrastructure. Production AI systems generate enormous volumes of trace data, on the order of megabytes per second. Some evals take a long time to run, sometimes days. And you have to orchestrate those long-running jobs and capture every output in one place where you can actually visualize it.

It gets worse as you move from a single model call to an agent. An agent takes many steps, calls tools, and generates far more data per task, which multiplies the eval and observability load. You do not have to build this plumbing yourself (that is the gap tools like Braintrust fill), but you should know it exists, because "we will add evals later" usually means bolting them onto a system that was never instrumented to be measured.
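What "instrumented to be measured" can look like at the smallest scale is sketched below, assuming a toy two-step agent. This is not Braintrust's API. `Trace`, `Span`, `search_tool`, and `draft_answer` are hypothetical names made up for the example.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str       # which step ran
    input: str      # what went in
    output: str     # what came out
    seconds: float  # how long it took


@dataclass
class Trace:
    task: str
    spans: list = field(default_factory=list)

    def record(self, name, fn, arg):
        # Run one step and keep its input, output, and duration.
        start = time.perf_counter()
        output = fn(arg)
        self.spans.append(Span(name, arg, output, time.perf_counter() - start))
        return output


def search_tool(query: str) -> str:
    # Hypothetical tool; a real agent would call an API here.
    return f"docs about {query}"


def draft_answer(context: str) -> str:
    # Hypothetical model call.
    return f"Answer based on: {context}"


def run_agent(task: str) -> Trace:
    trace = Trace(task)
    context = trace.record("search", search_tool, task)
    trace.record("answer", draft_answer, context)
    return trace


if __name__ == "__main__":
    for span in run_agent("refunds").spans:
        print(span.name, "->", span.output)
```

Every step an agent takes becomes a span you can score later, which is also why the data volume grows so quickly as agents take more steps.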

Evals are what make you model-agnostic

Here is the strategic reason to build evals early, beyond quality. The model landscape changes every few months. As Goyal puts it, the only thing you can be certain of is that whatever model you pick today is very unlikely to be the one you use tomorrow. New releases land constantly, and the best model for your task this quarter may be beaten by a cheaper or smarter one next quarter.

If you have a real eval suite, that churn is an opportunity instead of a risk. You drop in the new model, re-run the same tests, and within minutes you know whether quality went up or down for your specific use case. Without evals, every model swap is a gut-feel gamble, and most teams freeze, stuck on an old model because they cannot prove a new one is safe to adopt. This is the same discipline I wrote about in "Building for the next AI model, not this one": evals are the mechanism that lets you actually do it.
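A model swap then reduces to running the same dataset through each candidate and comparing scores. In the sketch below, `current_model` and `candidate_model` are hypothetical stubs with canned answers. In practice each would wrap a call to a real model behind the same function signature.

```python
# Illustrative dataset of (input, expected output) pairs.
DATASET = [
    ("2+2", "4"),
    ("capital of France", "Paris"),
    ("3*3", "9"),
]


def current_model(prompt: str) -> str:
    # Hypothetical stub: gets two of the three cases right.
    answers = {"2+2": "4", "capital of France": "Paris", "3*3": "6"}
    return answers.get(prompt, "")


def candidate_model(prompt: str) -> str:
    # Hypothetical stub: gets all three right.
    answers = {"2+2": "4", "capital of France": "Paris", "3*3": "9"}
    return answers.get(prompt, "")


def accuracy(model, dataset) -> float:
    # Fraction of cases where the model's answer matches exactly.
    hits = sum(model(question).strip() == expected for question, expected in dataset)
    return hits / len(dataset)


def compare(models: dict, dataset) -> dict:
    # Same dataset, same scorer, one score per model.
    return {name: accuracy(fn, dataset) for name, fn in models.items()}


if __name__ == "__main__":
    scores = compare({"current": current_model, "candidate": candidate_model}, DATASET)
    print(scores)
    print("best:", max(scores, key=scores.get))
```

Because the dataset and scorer stay fixed, the only variable is the model, which is what makes the comparison a measurement.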

Evals are also a safety gate

As a CISSP, I keep coming back to one framing: evals are a control, not just a quality tool. A prompt change or a model upgrade is a change to production behavior, and untested changes to production are how you get incidents. Your eval suite is the gate that change has to pass before it ships, the same way an enterprise security review gates a deal.

That means running evals offline before you deploy (does this change regress correctness, format, or safety on the cases we care about?) and monitoring a sample of real outputs in production (what is failing now that we did not anticipate?). Treat safety and policy violations as scored dimensions in the suite, not as an afterthought. Evals catch the quiet regressions that no human is going to notice until a customer does. This is the same verification mindset behind moving from vibe coding to agentic engineering: the check is what makes the speed safe.
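The production half can be sketched as sampling logged outputs and scoring them on a policy dimension. The log format, the 10 percent rate, and the email-leak rule below are all assumptions chosen for illustration. A real policy check would be defined with your domain and security experts.

```python
import hashlib
import re

# Rough pattern for something that looks like an email address.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")


def in_sample(request_id: str, rate: float) -> bool:
    # Deterministic sampling: hash the request ID into a 64-bit bucket,
    # so the same request is always in or out of the sample.
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") < rate * 2**64


def leaks_email(output: str) -> bool:
    # Illustrative policy rule: flag outputs that contain an email address.
    return bool(EMAIL.search(output))


def monitor(log: list, rate: float) -> dict:
    """Score a sample of logged outputs and report policy violations."""
    sampled = [record for record in log if in_sample(record["id"], rate)]
    violations = [r["id"] for r in sampled if leaks_email(r["output"])]
    return {"sampled": len(sampled), "violations": violations}


# Made-up log records for the example.
LOG = [
    {"id": "req-1", "output": "Your order shipped yesterday."},
    {"id": "req-2", "output": "Contact jane@example.com for a refund."},
]

if __name__ == "__main__":
    print(monitor(LOG, rate=0.10))
```

Hashing the request ID instead of calling a random number generator means a flagged request stays in the sample when you re-run the check, which makes failures easier to reproduce.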

What to do this week

Write down, in plain language, what a good output looks like for your single most important use case. If you cannot, that is your first problem to solve.
Collect 20 to 50 real inputs from actual usage and save the outputs your current system produces for them.
Grade those outputs (by hand is fine) on the one or two dimensions that matter most. Correctness and safety are a good start.
Pull in the person who actually knows the domain to define and check what good means. Do not leave it to engineering alone.
Wire that dataset into a scorer you can re-run, so the next prompt or model change is a measurement, not a guess.
Before your next deploy, run the eval as a gate. Make passing it the rule, not the exception.
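The last step in the list above, running the eval as a gate, can be as small as the sketch below. The threshold values are assumptions for illustration, not recommendations, and the scores would come from whatever eval run you already have.

```python
# Illustrative minimums per dimension; pick your own with your domain expert.
THRESHOLDS = {"correctness": 0.90, "safety": 1.00}


def gate(scores: dict, thresholds: dict) -> list:
    """Return one message per dimension that falls below its minimum."""
    failures = []
    for dimension, minimum in thresholds.items():
        score = scores.get(dimension, 0.0)  # a missing dimension counts as failing
        if score < minimum:
            failures.append(f"{dimension}: {score:.2f} < {minimum:.2f}")
    return failures


def exit_code(scores: dict) -> int:
    # 0 means the change may ship; 1 means the deploy should stop.
    failures = gate(scores, THRESHOLDS)
    for line in failures:
        print("FAIL", line)
    return 1 if failures else 0


if __name__ == "__main__":
    print(exit_code({"correctness": 0.85, "safety": 1.0}))
```

In a CI pipeline you would pass the return value of `exit_code` to `sys.exit`, so a failing eval blocks the deploy the same way a failing unit test does.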

Evals are not the glamorous part of building with AI, but they are what separate a demo that works in the meeting from a product that works for customers. Building those measurement loops into how your company operates is exactly what the AI Operating System for Startups is about.

Sources

Braintrust's Ankur Goyal on Why Evals Are the Core of AI Development (Greylock), the interview this article distills.
How Braintrust Tackles the Hard Infrastructure Problems Behind Evals (Greylock), on the data volume, long-running jobs, and agent load behind evals.
The Most Discerning AI Customers Optimize for One Thing: Quality (Greylock), on buyers optimizing for output quality over cost.
Background on Ankur Goyal and Braintrust: his LinkedIn, and the Impira (acquired by Figma) to Figma to Braintrust path covered by First Round Review.