GPT-5.5 running inside Codex is now state-of-the-art for enterprise coding workflows, according to Databricks engineer Arnav Singhvi. The headline number: a 46% reduction in errors on OfficeQA, a benchmark designed to simulate realistic, end-to-end enterprise tasks. That is not a lab-only result; it holds under operational conditions.

Singhvi describes the improvement as a step function, not an incremental gain. GPT-5.5 is specifically stronger at multi-step and agentic workflows, the category of tasks where models have historically fallen apart, losing context or compounding errors across tool calls. The framing of "knowledge lift" suggests the gains are showing up in how the model reasons through domain-specific enterprise problems, not just in code-generation throughput.

The full conversation is worth watching for Singhvi's specifics on where the model breaks down versus where it holds up, and on how Databricks is actually integrating Codex into production pipelines. The 46% error reduction is the hook, but the mechanism behind it is the story.

[WATCH ON YOUTUBE →]