GitHub Copilot CLI now ships 'Rubber Duck' in experimental mode: a second AI model from a different family that reviews the primary agent's plans and code before execution. When Claude Sonnet 4.6 is the orchestrator, GPT-5.4 runs as the reviewer. On SWE-Bench Pro, that pairing closed 74.7% of the performance gap between Sonnet and Opus, with a 4.8% improvement on the hardest multi-file problems spanning 70-plus steps.
The design targets three checkpoints where bad decisions compound fastest: after planning, after complex implementation, and after writing tests but before running them. The agent can also trigger Rubber Duck reactively when it detects it is stuck in a loop. Real catches from evaluation include a scheduler that would have started and immediately exited, running zero jobs; a loop silently overwriting the same dict key across four Solr facet categories; and three files reading a Redis key that new code had stopped writing. None of these threw an error.
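The dict-overwrite catch is a good example of why these failures are invisible to a runtime: writing the same key repeatedly is perfectly legal Python. A minimal sketch of the bug class (the variable names are illustrative, not from the original report):

```python
# BUG: the key is constant, so each pass clobbers the previous entry.
# No exception is raised -- the loop runs "successfully" and three
# of the four facet categories are silently lost.
facet_params = {}
for category in ["brand", "color", "size", "material"]:
    facet_params["facet.field"] = category

print(facet_params)       # {'facet.field': 'material'} -- only the last survives

# One possible fix: make the key unique per category so all entries persist.
facet_params = {f"facet.field.{c}": c for c in ["brand", "color", "size", "material"]}
print(len(facet_params))  # 4
```

A same-family self-review can miss this because nothing in the output looks malformed; a second reviewer checking intent against code is more likely to ask why four iterations produce one entry.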
The technical argument worth reading in full is not the benchmark number. It is the section on why self-reflection fails: a model reviewing its own output is still bounded by its own training biases and blind spots. Cross-family review is the proposed fix. Enable it via the /experimental command in GitHub Copilot CLI with any Claude model selected in the model picker.
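The cross-family idea reduces to a small control loop: produce an artifact at a checkpoint, have a model from a different family review it, and revise until approved or a budget runs out. A hypothetical sketch of that loop (the function names, `Review` shape, and revision budget are assumptions for illustration, not Copilot's internals):

```python
from dataclasses import dataclass
from typing import Callable

# The article's three proactive checkpoints.
CHECKPOINTS = ("plan", "implementation", "tests")

@dataclass
class Review:
    approved: bool
    notes: str

def cross_family_step(
    checkpoint: str,
    produce: Callable[[str], str],         # primary agent (one model family)
    review: Callable[[str, str], Review],  # reviewer from a different family
    max_rounds: int = 2,
) -> str:
    """Generate an artifact, get an other-family review, and revise
    until approved or the revision budget is exhausted."""
    artifact = produce(checkpoint)
    for _ in range(max_rounds):
        verdict = review(checkpoint, artifact)
        if verdict.approved:
            break
        artifact = produce(f"{checkpoint} (revise: {verdict.notes})")
    return artifact

# Stub callables standing in for two model families: the reviewer
# rejects the first draft, and the revised draft passes.
plan = cross_family_step(
    "plan",
    produce=lambda prompt: f"draft for {prompt}",
    review=lambda cp, art: Review(approved="revise" in art, notes="missing edge case"),
)
```

The key property is that `review` is not the same function (or family) as `produce`, so the reviewer's blind spots are uncorrelated with the generator's, which is the paper's argument against self-reflection.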
[READ ORIGINAL →]