Ankur Goyal, founder and CEO of Braintrust, the evals and observability platform running inside Notion, Stripe, Vercel, and Zapier, explains how his team uses OpenAI Codex to run week-long benchmark experiments across database indexes, column store formats, and execution engines. Work that would exhaust a single engineer gets handed to agents running 4 to 6 concurrent jobs, foreground and background, in cloud environments. His argument is blunt: there is no excuse to skip rigorous benchmarking anymore.
The most useful frameworks in this conversation are the 'agent line' and the eval-as-PRD. The agent line is a decision filter for what you hand off versus what you direct. The eval-as-PRD flips the spec model: you encode what good output looks like, and the model figures out the how. Goyal also walks through a live build of a scoring function and shows how he translated one designer's taste into a repeatable eval so quality stops being bottlenecked by one person's attention.
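To make the eval-as-PRD idea concrete, here is a minimal sketch of a scoring function in plain Python. It is not Braintrust's API and not the scorer Goyal builds in the episode. `EvalCase`, `score_output`, `run_eval`, and the two criteria (required phrases and a word budget) are hypothetical names and rules chosen only for illustration. The point is the shape: the cases state what good output looks like, and any task that produces a string can be measured against them.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One line of the 'PRD': an input plus what a good output must satisfy."""
    input: str
    must_include: list[str]  # hypothetical criterion: phrases a good output contains
    max_words: int           # hypothetical criterion: length budget


def score_output(output: str, case: EvalCase) -> float:
    """Return a score in [0, 1].

    The score is the fraction of required phrases present (case-insensitive),
    or 0.0 if the output exceeds the word budget.
    """
    if len(output.split()) > case.max_words:
        return 0.0
    if not case.must_include:
        return 1.0
    text = output.lower()
    hits = sum(1 for phrase in case.must_include if phrase.lower() in text)
    return hits / len(case.must_include)


def run_eval(task: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Run the task on every case and return the mean score."""
    if not cases:
        return 0.0
    return sum(score_output(task(c.input), c) for c in cases) / len(cases)


if __name__ == "__main__":
    cases = [EvalCase("Summarize the release", ["latency", "rollback"], 20)]

    # Stand-in for a model or agent call; any str -> str function works here.
    def stub_task(prompt: str) -> str:
        return "Latency improved; rollback is documented."

    print(run_eval(stub_task, cases))
```

In practice the hard-coded criteria would be replaced by whatever captures the reviewer's judgment, such as the designer's taste mentioned above, but the contract stays the same: a function from output to a number, run over a fixed set of cases.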
The section starting at 09:03 on why staff engineers are wrong about AI limitations is the sharpest part of the episode and worth reading in full. Goyal's case that fixing CI is the highest-leverage move for engineering velocity ties the agent-workflow and evals arguments together into a single operational thesis.