GitHub Copilot CLI "Rubber Duck": Catching Blind Spots with a Second Opinion From a Different Model Family
Audience and Key Points
Audience: Developers and tech leads already using (or evaluating) GitHub Copilot CLI, and anyone interested in how to raise the quality of AI coding agents.
Key Points:
- Rubber Duck is an experimental feature where a model from a different AI family automatically critiques the agent at three points: after planning, after complex implementation, and after writing tests.
- On SWE-Bench Pro, Claude Sonnet 4.6 + GPT-5.4 closed 74.7% of the performance gap between Sonnet and Opus.
- The effect grows with problem difficulty — on the hardest problems it delivers a +4.8% improvement.
Why "Same-Model Self-Review" Isn't Enough
The standard loop for a coding agent is plan → implement → test → fix, repeated. Self-reflection — the agent reviewing its own output inside this loop — is already common practice, but it has a structural limit.
When the same model reviews its own code, it can't detect blind spots that come from the same training data and the same reasoning characteristics. A wrong assumption made at the planning stage only becomes more entrenched as it flows downstream into implementation and tests. By the time anyone notices, you're not fixing one small mistake — you're fixing everything that was built on top of it.
Adding more rounds of self-reflection hits diminishing returns quickly. If you don't change the model's field of view, what was invisible stays invisible.
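The argument above can be made concrete with a toy model. This is only a sketch under loud assumptions: issues and blind spots are plain sets, a reviewer flags everything outside its own blind spot, and every name and set below is invented for illustration rather than taken from the source.

```python
# Toy model: a reviewer can only flag issues outside its own blind spot.
# All issue names and blind-spot sets are hypothetical.

def review(issues: set[str], blind_spot: set[str]) -> set[str]:
    """Return the issues this reviewer is able to flag."""
    return issues - blind_spot


def self_review(issues: set[str], blind_spot: set[str], rounds: int) -> set[str]:
    """Repeat review with the same model; its blind spot never changes."""
    caught: set[str] = set()
    for _ in range(rounds):
        caught |= review(issues - caught, blind_spot)
    return caught


issues = {"off_by_one", "stale_cache_key", "wrong_plan_assumption", "missing_null_check"}
blind_a = {"stale_cache_key", "wrong_plan_assumption"}  # hypothetical orchestrator blind spot
blind_b = {"wrong_plan_assumption", "off_by_one"}       # hypothetical reviewer blind spot

one_round = self_review(issues, blind_a, rounds=1)
five_rounds = self_review(issues, blind_a, rounds=5)
cross = one_round | review(issues, blind_b)

missed_self = issues - five_rounds   # everything in blind_a, however many rounds
missed_cross = issues - cross        # only what both reviewers are blind to
```

In this model, five rounds of self-review catch exactly what one round catches, while a single pass from a second reviewer shrinks the missed set to the overlap of the two blind spots. The overlap is not empty here on purpose: a second family narrows the gap but does not guarantee full coverage.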
How Rubber Duck Works
Rubber Duck is a dedicated review agent powered by a model from a complementary family to whatever drives the main session.
| Main model (orchestrator) | Rubber Duck (reviewer) |
|---|---|
| Claude Opus 4.6 | GPT-5.4 |
| Claude Sonnet 4.6 | GPT-5.4 |
| Claude Haiku 4.5 | GPT-5.4 |
Rubber Duck's job is to hand back a short, focused list of details the main agent overlooked, assumptions that deserve scrutiny, and edge cases to consider. It is intentionally not designed to rewrite everything — it returns only the high-value concerns.
Under the hood, it's invoked through the same task-tool infrastructure Copilot already uses for other sub-agents.
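As a rough sketch of the two ideas in this section, the snippet below encodes the pairings listed above as a lookup and models a critique as three short lists. The names `REVIEWER_FOR`, `Critique`, `top_concerns`, and `pick_reviewer` are hypothetical and chosen for illustration; nothing here reflects Copilot's actual configuration format or internal API.

```python
from dataclasses import dataclass, field

# Pairings as listed in the article; the dict is an illustrative stand-in,
# not Copilot's real configuration.
REVIEWER_FOR = {
    "Claude Opus 4.6": "GPT-5.4",
    "Claude Sonnet 4.6": "GPT-5.4",
    "Claude Haiku 4.5": "GPT-5.4",
}


@dataclass
class Critique:
    """Hypothetical shape of a reviewer's reply: short lists, no rewrites."""
    overlooked_details: list[str] = field(default_factory=list)
    questionable_assumptions: list[str] = field(default_factory=list)
    edge_cases: list[str] = field(default_factory=list)

    def top_concerns(self, limit: int = 5) -> list[str]:
        """Flatten the three lists and keep only the first `limit` items."""
        combined = (
            self.questionable_assumptions
            + self.overlooked_details
            + self.edge_cases
        )
        return combined[:limit]


def pick_reviewer(orchestrator: str) -> str:
    """Look up the reviewer; fail loudly when no pairing is known."""
    try:
        return REVIEWER_FOR[orchestrator]
    except KeyError:
        raise ValueError(f"no reviewer pairing known for {orchestrator!r}") from None
```

Capping the output with `limit` mirrors the stated design goal: the reviewer returns a handful of high-value concerns instead of a rewrite.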
Three Automatic Checkpoints
Rubber Duck runs proactively, reactively, and on manual request. The three automatic trigger points are placed where the return on feedback is highest.
Checkpoint 1: After drafting a plan. This is where the biggest wins live. Planning mistakes compound downstream, so flipping a single assumption here can prevent a full round of implementation rework.
Checkpoint 2: After a complex implementation. Cross-file dependencies and edge cases get a second look, this time from a model with different reasoning characteristics.
Checkpoint 3: After writing tests, before executing them. This is the last chance to catch coverage holes and incorrect assertions before the agent runs the suite, sees it pass, and self-reinforces.
There is also a reactive path: if the agent is stuck in a loop and not making progress, Rubber Duck can be called in to break the deadlock. Users can request a critique manually at any time, and Copilot shows the diff reflecting Duck's feedback.
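The trigger logic described in this section could be sketched as a small decision function. This is an assumption-laden illustration, not Copilot's implementation: the stage names, the `is_complex` flag, and especially `stuck_threshold` are hypothetical, since the source does not say how complexity or a stuck loop is detected.

```python
from enum import Enum
from typing import Optional


class Checkpoint(Enum):
    AFTER_PLAN = "after drafting a plan"
    AFTER_COMPLEX_IMPL = "after a complex implementation"
    BEFORE_RUNNING_TESTS = "after writing tests, before executing them"
    STUCK = "agent is looping without progress"   # reactive path
    MANUAL = "user asked for a critique"          # manual path


def should_critique(
    stage: str,
    *,
    is_complex: bool = False,
    failed_attempts: int = 0,
    user_requested: bool = False,
    stuck_threshold: int = 3,  # hypothetical; the source gives no number
) -> Optional[Checkpoint]:
    """Return the checkpoint that fires for this stage, or None."""
    if user_requested:
        return Checkpoint.MANUAL
    if failed_attempts >= stuck_threshold:
        return Checkpoint.STUCK
    if stage == "plan":
        return Checkpoint.AFTER_PLAN
    if stage == "implement" and is_complex:
        return Checkpoint.AFTER_COMPLEX_IMPL
    if stage == "tests_written":
        return Checkpoint.BEFORE_RUNNING_TESTS
    return None
```

Note that a simple implementation step returns `None` in this sketch: the article places the automatic review after complex implementations only, which keeps the number of reviewer calls low.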
SWE-Bench Pro Results
Evaluation used SWE-Bench Pro — a benchmark of real-world coding problems pulled from large, high-difficulty open-source repositories.
| Configuration | SWE-Bench Pro Result |
|---|---|
| Claude Sonnet 4.6 alone | Baseline |
| Claude Sonnet 4.6 + Rubber Duck (GPT-5.4) | Closed 74.7% of the Sonnet→Opus gap |
| Claude Opus 4.6 alone | Top performance |
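The article does not spell out how "closed 74.7% of the gap" is computed. A common reading, assumed here, is the share of the baseline-to-top difference that the paired setup recovers. The function name `gap_closed` and the scores in the example are placeholders, not benchmark data.

```python
def gap_closed(baseline: float, paired: float, top: float) -> float:
    """Fraction of the baseline-to-top gap recovered by the paired setup."""
    gap = top - baseline
    if gap <= 0:
        raise ValueError("top score must exceed the baseline score")
    return (paired - baseline) / gap


# Placeholder scores chosen for easy arithmetic. They are NOT the benchmark's
# numbers; the article reports only the resulting ratio.
example = gap_closed(baseline=40.0, paired=47.0, top=50.0)
```

With these placeholder values the paired setup recovers 7 of the 10 points separating the baseline from the top model, which is a ratio of 0.7.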
Breaking it down by difficulty makes the effect gradient even clearer.
| Problem category | Improvement vs. Sonnet alone |
|---|---|
| All problems (average) | 74.7% of the gap closed |
| Hard problems (3+ files, 70+ steps) | +3.8% |
| Hardest problems (identified over three trials) | +4.8% |
The pattern makes sense: the more files and steps a problem has, the more likely a planning error compounds into downstream damage — and the more valuable an early critique becomes.
Three Real Bugs Rubber Duck Caught
The three examples in the official blog post all passed the main model's self-reflection and were only caught by Rubber Duck. The common thread: all three fail without raising any error — they pass tests. These silent-failure bugs are exactly where a reviewer with a different perspective earns its keep.
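To show what a silent failure can look like, here is a hypothetical example written for this article. It is not one of the bugs from the blog post; the function names and data are invented. The buggy version raises no error and passes a weak test, and only a sharper assertion exposes it.

```python
def index_authors_buggy(records: list[dict]) -> dict[str, str]:
    """Intended: map each record id to its author. Bug: the key never changes."""
    index: dict[str, str] = {}
    for record in records:
        index["author"] = record["author"]  # same key overwritten on every pass
    return index


def index_authors_fixed(records: list[dict]) -> dict[str, str]:
    """Key each entry by the record's own id so nothing is overwritten."""
    return {record["id"]: record["author"] for record in records}


records = [
    {"id": "r1", "author": "Ada"},
    {"id": "r2", "author": "Grace"},
]


def weak_test(fn) -> bool:
    """Only checks for a non-empty dict, so the buggy version passes."""
    result = fn(records)
    return isinstance(result, dict) and len(result) > 0


def strong_test(fn) -> bool:
    """Checks for one entry per input record, which the buggy version fails."""
    return len(fn(records)) == len(records)
```

A model that wrote both the loop and the weak test is likely to see a green run and move on. The stronger assertion is the kind of suggestion that fits the third checkpoint, before the suite is executed.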
How to Enable It
Rubber Duck ships as an experimental feature.
# Enable experimental mode in Copilot CLI
/experimental
# Pick a Claude model in the model picker
# → Rubber Duck is automatically assigned to GPT-5.4
Rubber Duck is available to all users who run Claude Opus 4.6, Claude Sonnet 4.6, or Claude Haiku 4.5 as the orchestrator. Access to GPT-5.4 is required.
Highest-value use cases:
- Complex refactors and architectural changes
- Tasks where a mistake is expensive
- Verifying test coverage
- Getting a second opinion before committing to a plan
Takeaway: The Case for Cross-Family Critique
The real lesson in the Rubber Duck architecture is this: combining models with different characteristics can be more cost-effective than pushing a single model harder.
If Sonnet + Duck (a modest cost bump on top of Sonnet) gets you close to Opus alone (several times the Sonnet cost), then "run the strongest single model" is no longer the obvious engineering choice.
The checkpoint order (plan, then implementation, then tests) also aligns with a long-standing software engineering rule of thumb: the cost of fixing a defect grows steeply, often described as exponentially, the later the phase in which it is found. The same rule that governs human code review turns out to apply, almost unchanged, to AI agent workflows.
GitHub says it is also exploring which reviewer families to use when GPT-5.4 is the orchestrator. "Which model should I use?" may be giving way to a more interesting question: which two models should I pair? [1]
Summary
- Rubber Duck is a structured second opinion that catches bugs which are "obvious from a different training lineage."
- Placing the automatic checkpoint after planning mirrors the well-known exponential growth of defect fix cost across phases.
- The future of agent design may be less about picking a single best model and more about picking the right pair.
[1] GitHub Blog, "GitHub Copilot CLI combines model families for a second opinion" (April 6, 2026). https://github.blog/ai-and-ml/github-copilot/github-copilot-cli-combines-model-families-for-a-second-opinion/