How Braintrust uses AI agents, evals, and CI to ship better software

Summarized by Context Window AI Agent

Ankur Goyal, founder and CEO of Braintrust, the evals and observability platform running inside Notion, Stripe, Vercel, and Zapier, explains how his team uses OpenAI Codex to run week-long benchmark experiments across database indexes, column store formats, and execution engines. Work that would exhaust a single engineer gets handed to agents running 4 to 6 concurrent jobs, foreground and background, in cloud environments. His argument is blunt: there is no excuse to skip rigorous benchmarking anymore.

The two most useful frameworks in this conversation are the 'agent line' and the eval-as-PRD. The agent line is a decision filter for what you hand off to an agent versus what you direct yourself. The eval-as-PRD flips the usual spec model: instead of writing a product requirements document that describes how to build something, you encode what good output looks like, and the model figures out the how. Goyal also walks through a live build of a scoring function and shows how he translated one designer's taste into a repeatable eval, so that quality is no longer bottlenecked by one person's attention.
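To make the eval-as-PRD idea concrete, here is a minimal sketch of a scoring function in plain Python. It is not Braintrust's API and not the scorer Goyal builds in the episode; the rubric, the criterion names, the weights, and the 40-word limit are all hypothetical stand-ins for whatever "good output" means in a given product.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Criterion:
    """One testable statement about what good output looks like."""
    name: str
    weight: float
    check: Callable[[str], bool]


# Hypothetical rubric: each entry encodes one piece of a reviewer's taste.
RUBRIC = [
    Criterion("non_empty", 1.0, lambda out: bool(out.strip())),
    Criterion("concise", 2.0, lambda out: len(out.split()) <= 40),
    Criterion("no_filler", 1.0, lambda out: "as an ai" not in out.lower()),
]


def score(output: str, rubric: list[Criterion] = RUBRIC) -> float:
    """Return the weighted fraction of criteria the output satisfies (0.0 to 1.0)."""
    total = sum(c.weight for c in rubric)
    earned = sum(c.weight for c in rubric if c.check(output))
    return earned / total


def failed_criteria(output: str, rubric: list[Criterion] = RUBRIC) -> list[str]:
    """List the names of the criteria the output misses, for debugging a low score."""
    return [c.name for c in rubric if not c.check(output)]
```

Once the rubric exists, the same checks can run against every candidate output, which is the sense in which an eval can stand in for a single reviewer's attention.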

The section starting at 09:03, on why staff engineers are wrong about AI limitations, is the sharpest part of the episode and worth reading in full. Goyal's case that fixing CI is the highest-leverage move for engineering velocity ties the agent-workflow and evals arguments together into a single operational thesis.
