GPT-5.4 is out, and the reviewer runs it against real engineering work, not toy prompts. The model ships with a new, more efficient tool-search feature and is benchmarked on SWE-Bench Pro across two configurations: Medium and Extra High. Extra High costs more compute but handles large refactors better; Medium is faster but cuts corners on complex bugs.
The honest part is the weaknesses section. GPT-5.4 struggles with Convex and UI-specific tasks, and the video flags a concrete prompt-injection security concern that most first-look reviews skip entirely. The direct comparison against Claude Opus gives you a practical decision framework rather than a spec sheet.
The full video is worth watching for the live bug-fixing demo at 09:44 and the large-refactor test at 15:15. Those sequences show where the model actually breaks down, which is more useful than any benchmark number. The security section at 16:44 alone justifies the runtime if you are deploying this in production.
[WATCH ON YOUTUBE →]