Automating the Claude Code × Codex Review Loop in Three Levels: SKILL.md, Plugin, and Pipeline

Audience: Mid-to-senior developers exploring cross-model code review automation

Key Points

Level 1: Start with a single SKILL.md file. The /codex-review command triggers an automated Codex review loop.
Level 2: Auto-trigger via Stop Hook. Codex reviews fire automatically on task completion, eliminating missed reviews.
Level 3: Pipeline for team governance. Gate the entire workflow, enforcing quality standards across the team.

In the previous article, we verified that "implement with Claude Code → review with Codex" is a rational cross-model division of labor, grounded in benchmark differences.

But running this division manually is impractical. Every time Claude Code finishes an implementation, you'd launch Codex CLI in a separate terminal, copy-paste plans and diffs, and manually relay review results back — the human overhead negates the quality gains.

As of late February 2026, multiple developers have independently published approaches to automate this review loop. This article organizes them by adoption cost, covering three levels from "quickest to deploy" to "team-grade operations."

Prerequisites: Constraints When Calling Codex CLI from Claude Code

Before diving into automation, one technical constraint needs attention.

Claude Code's bash environment is non-interactive. Therefore, the standard approach is to use the codex exec subcommand (non-interactive mode) when calling Codex CLI. While TUI-dependent operations — session pickers, interactive approval dialogs — are unavailable, global flags like -a (approval policy), -m (model selection), and -s (sandbox) propagate to codex exec¹. Global flags are placed after the subcommand (e.g., codex exec -m gpt-5.3-codex ...).

# Review (read-only sandbox)
codex exec -m gpt-5.3-codex -s read-only \
  "Review the plan in /tmp/plan.md. End with VERDICT: APPROVED or VERDICT: REVISE"

# Resume session (preserve prior context)
codex exec resume <session-id> "Re-review the updated plan"

# Pass prompt via stdin (use - as PROMPT)
cat /tmp/plan.md | codex exec -m gpt-5.3-codex -s read-only -

The key capability is codex exec resume, which reopens a previous session. This lets Codex remember its prior findings and verify whether issues were actually addressed. However, the -o (file output) flag is unavailable with resume, so stdout must be captured instead².
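One way to keep a record of a resumed review without -o is to redirect stdout to a file and echo it back. The following is a minimal sketch, assuming that codex exec prints the reviewer's final message on stdout; `resume_review` and the `CODEX_BIN` override are hypothetical names introduced here for illustration and are not part of Codex CLI.

```shell
#!/bin/sh
# Hypothetical helper: resume a Codex session and keep what it prints.
# CODEX_BIN is an assumption added for testability; it defaults to the real CLI.
resume_review() (
  session_id=$1   # session ID obtained from the initial review
  prompt=$2       # follow-up instruction for the reviewer
  out_file=$3     # file that receives the captured stdout

  # -o is not used (unavailable with resume, per the text above),
  # so stdout is redirected to the file instead.
  status=0
  "${CODEX_BIN:-codex}" exec resume "$session_id" "$prompt" > "$out_file" || status=$?

  cat "$out_file"   # echo the captured review so the caller still sees it
  exit "$status"    # leaves only this subshell, returning Codex's exit status
)
```

A call such as `resume_review "$SESSION_ID" "Re-review the updated plan" /tmp/review-round2.md` would then leave the second-round findings on disk for later parsing.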

codex exec also supports the --output-schema option, which constrains the final output to a JSON structure¹. VERDICT: string matching is currently the dominant approach for termination conditions, but a schema enables structured output such as "verdict + issue list + severity", which is a natural bridge to Level 2 and Level 3 automation.
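As an illustration of the structured alternative, the sketch below writes a small JSON Schema from the shell. The field names (`verdict`, `issues`, `severity`, `description`) and the file location are assumptions made for this example; none of the cited tools is known to use this exact shape.

```shell
#!/bin/sh
# Write a hypothetical review schema to a temp file. Field names are
# illustrative assumptions, not taken from any of the cited tools.
SCHEMA_FILE="${TMPDIR:-/tmp}/codex-review-schema.json"

cat > "$SCHEMA_FILE" <<'EOF'
{
  "type": "object",
  "properties": {
    "verdict": { "type": "string", "enum": ["APPROVED", "REVISE"] },
    "issues": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "severity": { "type": "string", "enum": ["low", "medium", "high"] },
          "description": { "type": "string" }
        },
        "required": ["severity", "description"],
        "additionalProperties": false
      }
    }
  },
  "required": ["verdict", "issues"],
  "additionalProperties": false
}
EOF

# Usage (not executed here):
#   codex exec -s read-only --output-schema "$SCHEMA_FILE" \
#     "Review the plan in /tmp/plan.md"
```

With a schema like this, the termination check can read a `verdict` field instead of scanning free text for a marker string.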

Windows note: Codex CLI runs natively from PowerShell on Windows, with Windows sandbox support included. WSL is also a viable option¹.

Level 1: Drop a Single SKILL.md File (`/codex-review`)

The lowest-cost approach is the SKILL.md method published by Aseem Shrey².

How It Works

Place a single Markdown file at .claude/skills/codex-review/SKILL.md, and Claude Code recognizes the /codex-review slash command. On invocation, the following loop runs automatically:

Plan export: Claude writes the current plan to a temp file (/tmp/claude-plan-<uuid>.md)
Initial review: codex exec launches Codex in a read-only sandbox to review the plan
Verdict: If Codex returns VERDICT: REVISE, Claude revises the plan and resumes the same session via codex exec resume
Convergence: Repeats until VERDICT: APPROVED or a maximum of 5 rounds

Round 1: Claude writes plan → Codex reviews → VERDICT: REVISE (8 issues)
Round 2: Claude revises   → Codex re-reviews (resume) → VERDICT: REVISE (6 issues)
Round 3: Claude revises   → Codex re-reviews (resume) → VERDICT: APPROVED ✅
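The control flow of the steps above can be sketched in plain shell. This is not Shrey's SKILL.md, which is a set of Markdown instructions that Claude follows; `REVIEW_FN` and `REVISE_FN` are hypothetical hooks standing in for the Codex call and for Claude's revision step.

```shell
#!/bin/sh
# Sketch of the bounded review loop. REVIEW_FN and REVISE_FN are hypothetical
# hooks: the first prints the reviewer's output for a round, the second
# updates the plan. In a real workflow they would wrap codex exec and Claude.
MAX_ROUNDS=5

# Print APPROVED, REVISE, or UNKNOWN based on the last VERDICT line on stdin.
extract_verdict() {
  verdict=$(grep -E '^VERDICT: (APPROVED|REVISE)' | tail -n 1)
  case "$verdict" in
    "VERDICT: APPROVED"*) echo APPROVED ;;
    "VERDICT: REVISE"*)   echo REVISE ;;
    *)                    echo UNKNOWN ;;
  esac
}

# Returns 0 on approval, 1 if the round limit is hit, 2 on a missing verdict.
run_review_loop() {
  round=1
  while [ "$round" -le "$MAX_ROUNDS" ]; do
    result=$("$REVIEW_FN" "$round" | extract_verdict)
    echo "Round $round: $result"
    case "$result" in
      APPROVED) return 0 ;;
      REVISE)   "$REVISE_FN" "$round" ;;
      *)        return 2 ;;   # stop rather than guess when no verdict is found
    esac
    round=$((round + 1))
  done
  return 1
}
```

Treating a missing verdict as its own exit code keeps a malformed reviewer response from being mistaken for either approval or a revision request.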

Design Highlights

Session ID for concurrency safety. Each review session generates a UUID, binding the temp file path to the Codex session ID. The --last flag (auto-fetch most recent session) can grab the wrong session when multiple Claude Code instances run reviews simultaneously, so explicit session IDs are used².
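A minimal sketch of the ID-per-review idea follows. The `/tmp/claude-plan-<uuid>.md` path pattern comes from the workflow described above; the `new_review_id` helper and its fallbacks are assumptions made for this example.

```shell
#!/bin/sh
# Generate a per-review ID so the plan file (and later the Codex session)
# can be addressed explicitly instead of via "most recent".
new_review_id() {
  # uuidgen is common but not universal; fall back to the Linux /proc
  # interface, then to a timestamp-plus-PID value (weaker, last resort only).
  uuidgen 2>/dev/null \
    || cat /proc/sys/kernel/random/uuid 2>/dev/null \
    || printf '%s-%s\n' "$(date +%s)" "$$"
}

REVIEW_ID=$(new_review_id)
PLAN_FILE="/tmp/claude-plan-${REVIEW_ID}.md"
echo "Plan for this review goes to: $PLAN_FILE"
```

Two Claude Code instances running this at the same time write to different plan files, which is the property that a "latest session" lookup cannot guarantee.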

Read-only sandbox. Codex can read the codebase for contextual review but cannot modify files. The principle that "the reviewer doesn't touch the implementation" is enforced at the technical level.

Claude's "active revision." The SKILL.md instruction explicitly states "Claude should make real improvements." Rather than passively relaying Codex's feedback, Claude rewrites the plan before resubmitting to Codex. This is the fundamental difference between a message relay and a true review loop.

Results

Shrey reports detecting 14 issues across 3 rounds. These included missing authentication models, shell script quoting bugs, schema field inconsistencies (status vs column state drift), and missing concurrency handling. Specification quality reached production level with zero manual reviews².

Setup

# 1. Install Codex CLI if needed
npm install -g @openai/codex

# 2. Place SKILL.md in the project
mkdir -p .claude/skills/codex-review
# Download from the Gist or copy the SKILL.md content

# 3. Launch in Claude Code
/codex-review

The SKILL.md is published at the following Gist:

https://gist.github.com/LuD1161/84102959a9375961ad9252e4d16ed592

Note that the model name in the SKILL.md is gpt-5.3-codex, while the Codex CLI official reference uses gpt-5-codex in examples. Available model names vary by environment and subscription, so verify at runtime.

When to Use

Best suited for reviewing plans involving authentication, data models, concurrency — tasks that take days to implement. Overkill for bug fixes or small changes; Shrey himself recommends skipping it when speed takes priority².

Level 1.5: Adding a Final Audit Review ("Fix Loop + Fresh Session")

Level 1's SKILL.md approach has one structural weakness: Codex may be influenced by context accumulated through the session.

Two design schools exist in cross-model review automation:

Resume school (Aseem Shrey approach): Continues the session so Codex remembers prior findings and verifies whether fixes were actually applied. High traceability, but Codex may resist re-raising issues it previously deemed "minor"².
Fresh session school (Kim Major approach): Starts a new session each time, ensuring reviewer independence from prior context. However, tracking fixes becomes the human's responsibility. Major states: "Independent reviews only become meaningful when you define the scope of context you allow"³.

This article proposes a hybrid of these two approaches. Use resume during the fix loop for traceability, then run a fresh session for a final audit after convergence — combining fix verification with independent holistic review.

The implementation is simple: after the fix loop converges, run exactly one final audit review in a fresh session.

Implementation

Insert the following step before Step 7 (presenting final results) in the SKILL.md:

# After fix loop converges, run fresh session final audit
codex exec \
  -m gpt-5.3-codex \
  -s read-only \
  -o /tmp/codex-audit-${REVIEW_ID}.md \
  "You are performing a FINAL AUDIT of an implementation plan.
This plan has already been through iterative review. Your job is NOT to repeat prior feedback,
but to check for:
1. Systemic issues that incremental reviews might miss
2. Consistency across the entire plan (naming, error handling patterns, state management)
3. Anything that only becomes visible when reading the plan as a whole

The plan: $(cat /tmp/claude-plan-${REVIEW_ID}.md)

End with: AUDIT: PASS or AUDIT: CONCERNS (with specific items)"

Three key design decisions:

Redefine the prompt's role. The fix-loop Codex acts as a "reviewer pointing out individual issues," while the final-audit Codex serves as an "auditor checking overall consistency." The prompt explicitly requests "the big picture that incremental reviews might miss."

The fresh session itself is the value. By using a fresh session instead of resume, Codex reads the plan from a blank slate. The "this is acceptable" bias gradually formed during the fix loop is eliminated.

Differentiate the verdict. The fix loop uses VERDICT: APPROVED/REVISE while the final audit uses AUDIT: PASS/CONCERNS, enabling Claude to clearly distinguish processing flows.

As Major notes, ambiguous exit conditions can cause loops that never terminate³. The final audit therefore runs exactly once, with no loop, and any CONCERNS are handed to a human for a decision. This keeps the exit condition unambiguous.
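The run-once handling could look like the following sketch. It assumes the audit output was saved to a file and that the AUDIT marker appears at the start of a line; `handle_audit_result` and its exit-code convention are hypothetical.

```shell
#!/bin/sh
# Classify the one-shot audit output. Exit codes are an illustrative
# convention: 0 = PASS, 1 = CONCERNS (human decides), 2 = no marker found.
handle_audit_result() {
  audit_file=$1
  marker=$(grep -E '^AUDIT: (PASS|CONCERNS)' "$audit_file" 2>/dev/null | tail -n 1)
  case "$marker" in
    "AUDIT: PASS"*)
      echo "Audit passed."
      return 0 ;;
    "AUDIT: CONCERNS"*)
      # No retry here: print the report and stop so a person can decide.
      echo "Audit raised concerns; deferring to a human reviewer:"
      cat "$audit_file"
      return 1 ;;
    *)
      echo "No AUDIT marker found in $audit_file" >&2
      return 2 ;;
  esac
}
```

Because the function never calls the reviewer again, the audit step cannot turn into a second loop regardless of what Codex returns.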

Level 2: Auto-trigger via Stop Hook (`claude-review-loop` Plugin)

Level 1 requires the user to explicitly invoke /codex-review. If you forget, the review is skipped.

The claude-review-loop plugin by Hamel Husain eliminates this problem using Claude Code's Stop Hook⁴.

How It Works

A Stop Hook fires when Claude Code finishes responding and is about to stop, which is typically the moment it considers a task complete. The hook can block that stop in two ways⁹:

(A) Exit code 2 method: The hook script returns exit code 2, forcefully blocking Claude Code's exit. stdout and JSON output are ignored; stderr content is passed as feedback to Claude.
(B) JSON decision method: Exits with code 0 and returns {"decision":"block","reason":"..."} to stdout. Enables structured control where the reason field becomes Claude's feedback.

claude-review-loop adopts method (B), with stop-hook.sh outputting {"decision":"block", ...} to stdout and exiting with code 0⁴.
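The two mechanisms can be contrasted in a few lines of shell. This is a sketch and not the plugin's actual stop-hook.sh; `emit_block_decision` and `block_with_exit_code` are hypothetical helper names.

```shell
#!/bin/sh
# Method (B): print a JSON decision on stdout and exit 0.
# emit_block_decision flattens newlines and escapes backslashes and double
# quotes. Other control characters are not handled, so a real hook may
# prefer a JSON tool such as jq.
emit_block_decision() {
  reason=$(printf '%s' "$1" | tr '\n' ' ' | sed 's/\\/\\\\/g; s/"/\\"/g')
  printf '{"decision":"block","reason":"%s"}\n' "$reason"
}

# Method (A), for comparison: feedback goes to stderr and the status is 2.
block_with_exit_code() {
  printf '%s\n' "$1" >&2
  return 2   # a standalone hook script would use: exit 2
}
```

For example, `emit_block_decision "Codex review pending: see reviews/ for findings"` produces a single JSON line whose reason field becomes Claude's feedback.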

1. /review-loop to start the task phase
2. Claude completes the task → attempts to end the session
3. Stop Hook intercepts → auto-launches Codex
4. Codex runs up to 4 parallel sub-agents for review
5. Review results written to reviews/review-<id>.md
6. Feedback returned to Claude, transitioning to fix phase

How It Differs from Level 1

Automatic triggering. Once /review-loop starts the task phase, reviews run automatically every time Claude Code determines "work complete." The human error of "forgetting to review" is eliminated.

Parallel reviews. Up to 4 Codex sub-agents run in parallel depending on the project type, covering security, performance, and design consistency simultaneously.

Persistent state management. Review state is tracked in .claude/review-loop.local.md, and results are saved to reviews/review-<id>.md with timestamps, Codex exit codes, and elapsed time⁴.

Setup

# Install as a Claude Code plugin
claude plugin marketplace add hamelsmu/claude-review-loop
claude plugin install review-loop@hamel-review

# Start a task
/review-loop

# Cancel if needed
/cancel-review

Caveats

Watch the default permission settings. claude-review-loop's default configuration uses the --dangerously-bypass-approvals-and-sandbox flag for Codex execution⁴. This skips both sandbox and approval processes — Codex's official docs explicitly warn to "use only in externally isolated environments"¹. For production or shared repositories, override via the REVIEW_LOOP_CODEX_FLAGS environment variable with --sandbox read-only or --sandbox workspace-write. If full access is truly needed, run in a disposable container or CI/CD isolated runner.
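A shell profile or wrapper script could set the override and add a pre-flight check, as in the sketch below. REVIEW_LOOP_CODEX_FLAGS is the variable named above; `check_codex_flags` and the `ALLOW_UNSANDBOXED_REVIEW` opt-in are hypothetical additions and are not part of the plugin.

```shell
#!/bin/sh
# Override the plugin's Codex flags with a sandboxed setting.
# REVIEW_LOOP_CODEX_FLAGS is the variable named in the text above;
# ALLOW_UNSANDBOXED_REVIEW is a hypothetical opt-in invented for this sketch.
export REVIEW_LOOP_CODEX_FLAGS="--sandbox read-only"

# Refuse to continue if the bypass flag is present without an explicit opt-in.
check_codex_flags() {
  case "$REVIEW_LOOP_CODEX_FLAGS" in
    *--dangerously-bypass-approvals-and-sandbox*)
      if [ "${ALLOW_UNSANDBOXED_REVIEW:-0}" != "1" ]; then
        echo "Refusing unsandboxed Codex run outside an isolated environment" >&2
        return 1
      fi ;;
  esac
  return 0
}
```

The opt-in variable would be set only inside a disposable container or an isolated CI runner, so a developer machine fails closed by default.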

Stop Hook firing timing. The Stop Hook timeout is set to 900 seconds (15 minutes). Adjust in hooks/hooks.json if Codex reviews take longer⁴.

As Nick Tune's analysis points out, Stop Hooks don't always fire at ideal moments⁵. They may trigger when Claude pauses for requirements clarification, starting reviews on incomplete work. When Claude commits before stopping, the diff under review becomes unclear. Tune suggests "a CodeReadyForReview hook aligned with development workflow semantics would be ideal"⁵.

Infinite loop prevention. The Stop hook input includes a stop_hook_active flag, which is true when Claude Code is already continuing as a result of an earlier Stop hook block⁹. Check this flag in the hook script and let Claude stop instead of blocking again. Without this guard, a review → block → review → block infinite loop can occur.
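A guard of this kind might be sketched as follows, assuming the hook receives its input as JSON on stdin. The flag check is a naive text match and not a JSON parser, and `stop_hook_main` is a hypothetical hook body, not the plugin's implementation.

```shell
#!/bin/sh
# Guard against re-blocking. The hook input arrives as JSON on stdin.
# NOTE: this is a naive text match, not a JSON parser; a real hook would
# more likely use jq (for example: jq -r '.stop_hook_active').
stop_hook_already_active() {
  grep -Eq '"stop_hook_active"[[:space:]]*:[[:space:]]*true'
}

# Hypothetical hook body: allow the stop if an earlier block is still in
# effect, otherwise block once and ask for a review.
stop_hook_main() {
  if stop_hook_already_active; then
    return 0   # exit 0 with no decision lets Claude stop
  fi
  printf '{"decision":"block","reason":"%s"}\n' "Run the Codex review before finishing."
  return 0
}
```

When the flag is true the function prints nothing, so the second stop attempt goes through and the loop ends after one review pass.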

Level 3: Multi-AI Pipeline for Team Operations and Governance

Levels 1–2 are designed for individual developers integrating into their own workflows. When teams need unified review standards and enforced processes, a pipeline approach is necessary.

`claude-codex` Plugin (Z-M-Huang)

Implements the Claude (planner + coder) and Codex (reviewer) division of labor as a pipeline⁶. The workflow — requirements-gatherer → planner → plan-reviewer → implementer → code-reviewer — is managed by scripts that loop until all reviewers approve.
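The "loop until all reviewers approve" gate can be illustrated with a small check over per-reviewer reports. This is a sketch under assumptions: the report files and the VERDICT marker are borrowed from Level 1, and the claude-codex plugin's real scripts may work differently.

```shell
#!/bin/sh
# Gate sketch: pass only if every reviewer's report ends in approval.
# The report files and the VERDICT marker are assumptions borrowed from
# Level 1; the claude-codex plugin's real scripts may work differently.
all_reviewers_approved() {
  [ "$#" -gt 0 ] || return 1   # no reports means nothing was reviewed
  for report in "$@"; do
    # Only the last verdict in each report counts.
    last=$(grep -E '^VERDICT: ' "$report" 2>/dev/null | tail -n 1)
    if [ "$last" != "VERDICT: APPROVED" ]; then
      echo "Not approved: $report" >&2
      return 1
    fi
  done
  return 0
}
```

A pipeline script would call this after each review stage and move on to the next stage only when it returns 0.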

Team operations perspective:

The strength is process standardization and enforcement. By elevating reviews from "personal habit" to "pipeline gate," every team member passes through the same quality bar.

However, licensing requires attention. In addition to GPL-3.0, attribution requirements are stated at the top of the file⁶. Corporate legal review is recommended before adoption. Also, since agents execute sequentially, even simple tasks run through the entire pipeline. As verified in the previous article, this approach becomes excessive when task complexity doesn't justify orchestration overhead.

`Claude-Code-Workflow` (catlog22)

A JSON-driven multi-agent framework that orchestrates multiple CLI tools — Codex, Gemini, Qwen, and others — in a unified manner⁷. The README defines skill variants like /workflow:lite-plan, /workflow:plan, and /workflow:brainstorm, with installation via the ccw command. Skills are selected based on task complexity, supporting everything from simple tasks to complex multi-role collaboration.

The distinctive feature is support for multi-model parallel execution — "collaborative review by Gemini and Codex" or "architecture analysis using all available CLIs" — going beyond Codex-only reviews.

GitHub Agent HQ

GitHub's Agent HQ, released in February 2026, achieves platform-level cross-model integration⁸. Within a repository, you can directly launch Copilot, Claude, and Codex — assigning multiple agents to the same issue and comparing results. Follow-up tasks are as simple as mentioning @Claude or @Codex in a PR comment.

Unlike local CLI automation, it natively integrates with GitHub's CI/CD and code review workflows, offering the highest team-operations compatibility. However, it requires a Copilot Pro+ or Copilot Enterprise subscription and is currently in public preview.

Selection Criteria Across Three Levels

| Criterion | Level 1: SKILL.md | Level 2: Plugin + Hook | Level 3: Pipeline |
|---|---|---|---|
| Adoption cost | Single file | Plugin install | Framework setup |
| Automation | Manual trigger; the loop itself runs automatically | Fires automatically on task completion | Full workflow management |
| Parallel reviews | None (sequential) | Up to 4 sub-agents | Model and perspective parallelism |
| Team governance | None (personal habit) | Persistent logging | Gate enforcement / unified standards |
| Scale | Individual / small team | Individual / mid-size team | Mid-to-large team |
| Token cost | Low–Medium | Medium | High |

The selection guideline is simple. Start with Level 1's SKILL.md to validate whether cross-model review delivers value for your project. Once confirmed, adopt Level 2 to eliminate missed reviews. Only consider Level 3 when you need to unify quality standards across an entire team.

The Microsoft Cloud Adoption Framework principle cited in the previous article — "prove value with a single agent before investing in multi-agent coordination" — applies directly here.

Where We Are and What's Next

Among the automation approaches covered here, the SKILL.md method and Plugin are production-ready as of February 2026. Pipeline-level tools remain in active development, with Stop Hook behavioral inconsistencies and team-specific review criteria integration as open challenges.

On March 30, 2026, OpenAI released the official plugin codex-plugin-cc, changing the landscape. A single slash command now runs a Codex review, making it lower-friction than the DIY approaches in this article. However, the DIY methods still offer superior customization for review loop stop conditions and criteria. For a comparison with the official plugin, see "OpenAI Releases Official Claude Code Plugin — What codex-plugin-cc Means."

To be precise, codex-plugin-cc is not Aseem Shrey's SKILL.md promoted unchanged into an official product. It is a separate OpenAI implementation that lets Claude Code call the local Codex CLI / Codex app server from inside the Claude Code workflow¹⁰. Functionally, it covers part of Level 1's manually triggered review flow and part of Level 2's Stop Hook based review gate.

This article focuses on cross-model work where Claude Code implements and Codex reviews. If Codex is the primary coding agent and a separate Codex session or subagent reviews the work, see Codex Reviewing Codex: Second-Pass Review with Independent Sessions.

Install codex-plugin-cc from inside Claude Code:

/plugin marketplace add openai/codex-plugin-cc
/plugin install codex@openai-codex
/reload-plugins
/codex:setup

Use /codex:review for normal reviews and /codex:adversarial-review when you want Codex to challenge design choices and risk areas. The README also documents /codex:setup --enable-review-gate for a Stop Hook based review gate, with a warning that it can create long-running Claude/Codex loops and drain usage limits, so it should be enabled only when the session is actively monitored¹⁰.

The direction is clear. The era of manually switching between agents is ending. Approaches that define the entire "plan → implement → review → fix" cycle as inter-agent protocols are being rapidly established. Platform-level entrants like GitHub Agent HQ will only accelerate this trend.

Start with a single SKILL.md file and see for yourself.

Related Articles

OpenAI Releases Official Claude Code Plugin: What codex-plugin-cc Means. Covers official plugin features, setup, and how it compares to the DIY approaches in this article
Codex Reviewing Codex: Second-Pass Review with Independent Sessions — Same-model review via separate sessions and subagents when Codex is the main coding agent
What Benchmarks Reveal About Cross-Model AI Coding
Codex vs Claude Code: 2026 Benchmark Comparison
Claude Code Agent Teams Guide

Sources

1. OpenAI Developers, "Codex CLI: Command Line Options" (global flags, codex exec, --output-schema, Windows support); "Non-interactive Mode"
2. Aseem Shrey, "I Made Claude and Codex Argue Until My Code Plan Was Actually Good," February 20, 2026; SKILL.md on GitHub Gist
3. Kim Major, "AI-Assisted Coding: Automating Plan Reviews with Claude Code and Codex for Higher Quality Plans," January 18, 2026 (Flow Specialty)
4. Hamel Husain, "claude-review-loop" (Claude Code plugin: automated code review loop with Codex)
5. Nick Tune, "Auto-Reviewing Claude's Code," Medium, 2026 (also featured in O'Reilly Radar)
6. Z-M-Huang, "claude-codex" (Multi-AI orchestration plugin for Claude Code, GPL-3.0 + Attribution clause)
7. catlog22, "Claude-Code-Workflow" (JSON-driven multi-agent framework)
8. GitHub Blog, "Pick your agent: Use Claude and Codex on Agent HQ," February 2026
9. Anthropic, "Hooks reference — Claude Code Docs" (Stop hook, exit code 2, JSON decision control, stop_hook_active)
10. OpenAI, "codex-plugin-cc" (official plugin for using Codex from Claude Code; documents /codex:review, /codex:adversarial-review, review gate, and local Codex CLI authentication)
