
Three AI Reviewers Per Commit (And Why That’s Not Overkill)

Why every commit on my projects gets reviewed by three different AI models before any human sees the PR — and why three is the magic number, not one or two. Plus the gpt-5.5 bug that quietly broke my Codex review gate last week, and how I patched around it.

I’ve been in a few conversations recently about which models work best for development/code review, and why we use three. Decided to write it down.

The short version:

  1. Plan and implement with Claude Code or Codex CLI
  2. Force a local CLI review with the other one before the commit ever hits GitHub
  3. Let GitHub Copilot do its automated PR-side review

That’s the loop. Three different models, three different contexts, all looking at the same diff. None of them get to be the one who shipped it.

[Figure: three-pass review workflow flowchart. Stage 1, Plan + Build (Claude Code or Codex CLI, writes the code). Stage 2, forced CLI Review (whichever model didn't write it, catches the author's blind spots). Stage 3, PR Review (GitHub Copilot on the open PR plus human merge, catches what the local review missed). A dashed loop returns from Stage 2 to Stage 1 if the review finds something.]
Three independent passes. The author never gets to clear their own work.

Why three? Why not just one?

Here’s the thing — every model has blind spots. Claude Opus 4.7 catches certain bugs really well: race conditions, missed edge cases in async code, careless mutation. Codex (GPT-5.5 family) catches different stuff: type narrowing issues, dead branches, subtle off-by-ones, leftover debug code. GitHub Copilot’s PR-side review reads the diff in GitHub’s UI, sees the PR description and linked issues, and cross-references the repo’s review history — so its priors are different again.

If you run code past three reviewers with different blind spots, you get the union of their catches. If you run it past three reviewers with the same blind spots, you just feel safer for no reason.

I’m not paying for verification by repetition. I’m paying for complementary failure modes.

Whoever wrote it doesn’t review it. That’s the whole rule.
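
A minimal sketch of the "union of catches" argument above. The issue labels and the per-reviewer findings are made up for illustration; they are not measurements of what any model actually catches.

```python
# Hypothetical findings per reviewer; every label here is illustrative.
claude_findings = {"race-condition", "async-edge-case", "leftover-debug"}
codex_findings = {"off-by-one", "dead-branch", "leftover-debug"}
copilot_findings = {"missing-linked-issue-case", "off-by-one"}


def combined_catches(*reviews: set[str]) -> set[str]:
    """Return everything that at least one reviewer flagged."""
    result: set[str] = set()
    for review in reviews:
        result |= review
    return result


# Three reviewers with different blind spots: the union grows.
diverse = combined_catches(claude_findings, codex_findings, copilot_findings)

# The same reviewer three times: the union is just the first pass again.
repeated = combined_catches(claude_findings, claude_findings, claude_findings)

print(len(diverse), len(repeated))
```

Running the same set through three times adds nothing, which is the "feel safer for no reason" case.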

The “forced” part is the whole point

The CLI review isn’t a thing I remember to run. It’s a named command (npm run review:copilot:ci in autohub, equivalent commands in the others) wired into the pre-PR gate. If it fails, I literally can’t open the PR.

Friction on purpose. The whole reason to force it is that if you don’t, you skip it on the day you most need it — when you’re tired, the change “is fine,” and the test bar is low.
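
A rough sketch of what a forced gate can look like, assuming the review is any command that exits non-zero on failure. The function name `run_gate` and the exit codes 124 and 127 are choices made for this example, not the actual gate from the post.

```python
import subprocess
import sys


def run_gate(command: list[str], timeout_seconds: float) -> int:
    """Run the review command; return 0 only if it finished and passed."""
    try:
        completed = subprocess.run(command, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # A review that never finishes counts as a failed review.
        print(f"review gate: timed out after {timeout_seconds}s", file=sys.stderr)
        return 124
    except FileNotFoundError:
        # A missing reviewer must fail loudly, not pass by accident.
        print(f"review gate: command not found: {command[0]}", file=sys.stderr)
        return 127
    if completed.returncode != 0:
        print("review gate: review failed, not opening a PR", file=sys.stderr)
    return completed.returncode


# Wiring sketch (not executed here), using the script name from the post:
#   sys.exit(run_gate(["npm", "run", "review:copilot:ci"], timeout_seconds=120))
```

The important design choice is that every failure path returns non-zero, so the step that opens the PR only runs after a review that actually completed.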

This setup runs GitHub Copilot CLI as the local gate now, but it didn’t always. I was on CodeRabbit (CR) through most of February and March. Three reasons I switched:

  1. CR’s local review was 7 to 30 minutes per pass. Not “force this on every commit” speed. The first version of my pre-PR gate had a 120-second timeout on it and CR would blow right through it. I patched it to 30 minutes and added progress logging. It still felt like asking permission to ship. Copilot CLI returns in seconds.
  2. CR CLI had real git-context bugs inside .worktrees/ directories. I run a lot of parallel agent worktrees — one of the reasons the workflow exists is to keep three or four PRs cooking at once without stepping on each other. CR couldn’t resolve git context properly in worktrees, and the workarounds inside the skill were ugly.
  3. I’m already paying for GitHub Copilot. Once Copilot CLI got good enough, paying twice for two AI reviewers — when one of them ships in the same product I’m already on — felt silly.

None of this is a CR dunk. CR genuinely does some things better, especially the comment-resolution learning loop. It just stopped being worth the friction for my setup. Your mileage will vary.

The real surprise after the switch wasn’t speed. It was how often a “trivial” change had something the second model wanted me to fix. Not always serious. But often enough that 30 seconds of local review beats 20 minutes of PR back-and-forth — or worse, finding out three days later that something I shipped is silently broken.

How the loop actually runs

For something I’m building from scratch:

  1. Open a fresh Claude Code session
  2. Paste the goal. Let it plan
  3. Approve or revise the plan (this part is most of the value, honestly)
  4. Let it implement
  5. Run codex review against the diff before committing
  6. Codex either green-lights or finds stuff
  7. If it finds stuff, fix and re-review
  8. Commit. Push. Open PR
  9. Copilot does its automated review on GitHub
  10. Address Copilot’s comments via the copilot-review skill (which knows to commit and push, not just commit — learned that one the hard way after PR #6 shipped unfixed code and PR #7 had to recover it)
  11. Human review (me, mostly. Sometimes Steve)
  12. Merge

For something where I started with Codex instead, swap the tools in steps 1 and 5: Codex implements, Claude Code reviews. The point isn’t “always Claude first” or “always Codex first.” The point is whoever wrote it doesn’t review it.
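
The "whoever wrote it doesn't review it" rule is small enough to write down as code. This is an illustrative sketch; the tool labels and function names are hypothetical and do not come from any real CLI.

```python
# Illustrative labels for the two tools that can author a change.
AUTHOR_TOOLS = ("claude-code", "codex-cli")


def pick_local_reviewer(author: str) -> str:
    """Return the tool that did not write the change."""
    if author not in AUTHOR_TOOLS:
        raise ValueError(f"unknown author tool: {author!r}")
    return next(tool for tool in AUTHOR_TOOLS if tool != author)


def review_plan(author: str) -> list[str]:
    """List the passes for one change, in the order described above."""
    return [
        f"implement:{author}",
        f"local-review:{pick_local_reviewer(author)}",
        "pr-review:github-copilot",
        "human-review",
    ]


print(review_plan("codex-cli"))
```

Swapping the author swaps the local reviewer automatically; the PR-side and human passes stay the same either way.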

The official Codex plugin made this dramatically less painful (mostly)

Up until late March 2026, this whole loop was custom shell scripts, Makefiles, and a Stop hook I’d hand-rolled. Then on March 30, OpenAI shipped codex-plugin-cc — an official Claude Code plugin that wires Codex into your Claude Code session via slash commands, plus the thing I actually cared about: a review gate powered by Stop hooks.

The review gate is exactly what I’d been hand-rolling. When Claude finishes a response cycle, the Stop hook intercepts, runs a targeted Codex review on what changed, and if Codex flags issues the stop is blocked — Claude can’t quit until they’re addressed. Forced review with no Makefile shim. Slash commands cover the manual side: /codex:review for a normal review, /codex:rescue for a skeptical adversarial pass, /codex:status and /codex:result for managing background runs.
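
For a feel of the mechanism, here is a hedged sketch of the decision a Stop hook makes. It assumes the hook contract accepts a JSON object with `"decision": "block"` and a `"reason"` on stdout to keep the agent working, which is my reading of the Claude Code hooks documentation and worth checking against the current docs. It is not the plugin's implementation, and `stop_decision` is a hypothetical name.

```python
import json


def stop_decision(findings: list[str]) -> dict:
    """Build the JSON payload a Stop hook would print to stdout.

    Assumption: a payload with "decision": "block" prevents the stop and
    feeds "reason" back to the agent; an empty object lets the stop proceed.
    """
    if not findings:
        return {}
    reason = "Review found issues:\n" + "\n".join(f"- {item}" for item in findings)
    return {"decision": "block", "reason": reason}


# Example findings are invented for illustration.
print(json.dumps(stop_decision(["unchecked None in parse_config"])))
print(json.dumps(stop_decision([])))
```

In a real hook the findings would come from whatever runs the review; that part is deliberately left out here.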

This is roughly the workflow I’d been hacking together for months, made one-click. I switched the day after it shipped.

Two real catches I’d flag before you turn it on though:

  1. It can drain your Codex usage limits. The review gate creates a Claude → Codex → fix → Claude loop that runs to completion. Long autonomous sessions chew through a ChatGPT subscription’s Codex quota faster than you’d guess. The plugin’s own README warns about it. Only enable the gate when you’re actively monitoring.
  2. Model-compat broke for me on April 29. The gate inherits whatever model your Codex config has as default. Mine was gpt-5.5 because that’s what my main Codex sessions use. Codex CLI 0.125.0 rejects 5.5 in the gate path specifically — silent failure, no review actually ran. Took me an hour to track down because the hook exits silently on errors so it doesn’t block your workflow. Which is the right call generally, but means you don’t notice when your safety net stops catching things.

The fix for now: I patched the gate hook to force gpt-5.3-codex-spark as the gate-specific default, with a CODEX_COMPANION_STOP_REVIEW_MODEL env var to override when 5.5 catches up in a future Codex CLI release. Honestly, “different default model for the thing that writes code vs the thing that reviews it” turns out to be the right factoring anyway — review tasks don’t need 5.5’s full reasoning depth, and the cheaper/faster model means the gate runs more often without me hesitating.
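
A sketch of that factoring, assuming the gate reads its model from the environment before falling back to a pinned default. The env var name and the default model are the ones named above; the helper names and the status-line format are invented for this example, and the real patch may look quite different.

```python
import os

GATE_DEFAULT_MODEL = "gpt-5.3-codex-spark"  # gate-specific default named in the post
OVERRIDE_ENV_VAR = "CODEX_COMPANION_STOP_REVIEW_MODEL"


def resolve_gate_model(env=None) -> str:
    """Pick the review model: env override first, pinned default otherwise."""
    env = os.environ if env is None else env
    override = env.get(OVERRIDE_ENV_VAR, "").strip()
    return override or GATE_DEFAULT_MODEL


def gate_status_line(returncode: int, model: str) -> str:
    """Describe the gate outcome so a skipped review is never invisible."""
    if returncode == 0:
        return f"review gate ok (model={model})"
    # Still non-blocking, but the failure is now written down somewhere.
    return f"WARNING: review gate did not complete (model={model}, exit={returncode})"


print(resolve_gate_model({}))
print(gate_status_line(1, resolve_gate_model({})))
```

Keeping the hook non-blocking while logging a warning preserves the "don't break my workflow" behavior without letting the safety net fail unnoticed.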

If you’re starting fresh today, just install the plugin, pin a known-good review model in your Codex config, and skip the four months of duct tape I went through.

Where I tested this before betting on it

I don’t push new dev workflows to WP Fusion first. WP Fusion runs on tens of thousands of customer sites. It’s not the place to experiment with my own toolchain.

The pattern: new tooling lands in autohub (my personal agent infrastructure, not public) and AutoMem first. Those are mine. If something breaks, only I lose. Once a workflow has been baked for a couple weeks and I’ve actually written down what failed, then it graduates to the revenue products.

The three-pass review survived the autohub gauntlet for ~6 weeks before I let it anywhere near WP Fusion’s release process.

The research actually backs this up

There’s folk wisdom in the AI tooling crowd that you should just “use the best model”: pick Opus 4.7 for everything and call it a day. I don’t believe that anymore. A few things changed my mind:

1. Different model families catch different bug classes. The SWE-bench data is suggestive here — Sonnet 4.6 leads on aggregate, but its delta over GPT-5.5-Codex is small overall and reverses on certain task categories. The models are trained differently, they read code differently, they hallucinate differently. Their misses are not the same misses.

2. Independent passes compound. Napkin math: if Model A catches 80% of issues and Model B catches 80%, two passes catch ~96% — if their misses are uncorrelated. Three passes get you north of 99%. The math only works when the reviewers are actually independent, which is why I run different model families in different contexts. Two Claude passes is not two passes. It’s one pass with extra steps.

[Figure: line chart comparing catch rate across one to four passes. Independent reviewers (different model families) rise to 99.2% at 3 passes; correlated reviewers (the same model run 3x) plateau around 82%.]
Three independent passes at 80% catch rate each → 99.2% combined. Three Claude passes → roughly the same as one. Independence is the multiplier.

3. GitHub Copilot’s PR-context view is genuinely different. It sees the diff in the GitHub UI, has access to the PR description and any linked issues, and cross-references the repo’s review history. That’s information neither local CLI pass has. Different context = different catches.

4. OpenAI’s own agents best-practices guidance recommends a scout / verifier / PR-generator separation rather than one agent doing everything end-to-end. My setup is just that pattern, made physical with a Makefile. I didn’t invent it. I just refused to skip it.
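
The napkin math from point 2 can be checked directly. This sketch assumes fully independent misses, which is the idealized case; real reviewers will be at least somewhat correlated, so treat the numbers as an upper bound.

```python
def combined_catch_rate(per_pass_rate: float, passes: int) -> float:
    """Chance that at least one of `passes` independent reviewers catches an issue."""
    miss_rate = 1.0 - per_pass_rate
    return 1.0 - miss_rate ** passes


for n in range(1, 5):
    print(n, round(combined_catch_rate(0.80, n), 4))
# 1 0.8
# 2 0.96
# 3 0.992
# 4 0.9984
```

If the misses are perfectly correlated instead, every pass misses the same issues and the combined rate stays at the single-pass 80% no matter how many passes run.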

What I’d cut if I had to

If I could only keep one of the three:

  • Drop Copilot’s PR review first. It’s the most easily replaced — a careful human reviewer covers most of what it catches, and the GitHub-side review pass adds latency to the merge.
  • Drop the local CLI gate second. It’s the highest-friction one, but it’s also where I personally catch the most bugs in my own commits. Painful to give up.
  • Keep the planner. Without a planning pass, the implementation is dramatically worse, and no amount of review fixes architecturally bad code. You can’t review your way to a good design.

But I keep all three because the marginal cost is roughly 3 minutes per commit and the marginal benefit is “I haven’t shipped a regression in a month.” Cheap insurance.

Skills, not scripts (and where they live)

One last thing — all three steps live as agent skills, not bash scripts. Skills are markdown. They’re version-controlled. They’re inspectable. When something goes wrong, I edit the SKILL.md, not a 400-line shell script someone wrote at 2am.

The copilot-review skill, for example, knows to commit AND push because I learned the hard way (PR #6 shipped unfixed code, PR #7 had to recover it). The gh-pr-review-fix skill knows how to handle .worktrees/ dirs because CodeRabbit CLI has known git-context bugs in worktrees.

These are scars in markdown form. That’s the whole point. The next time the same bug tries to ship, the skill catches it before I do.

I keep all of them in AutoVault — a curated skill library I’ve been building out as a sibling project to AutoMem. The thing AutoVault solves that random folders of SKILL.md files don’t: a validation gate at install time. When an agent proposes a new skill (or you copy one from somewhere on the internet), AutoVault dedupes it against what’s already there, runs a security pass on whether the skill’s declared capabilities match what the body actually does, and signs the result. Skills don’t quietly mutate. Skills don’t sneak in a credential stealer disguised as a weather plugin (that’s a real thing — someone got hit with that on a competing skill registry earlier this year).
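
To make the three install-time steps concrete (dedupe, capability check, sign), here is a toy sketch. Everything in it is hypothetical: the capability markers, the function names, and the HMAC signing scheme are stand-ins for illustration, not how AutoVault actually works.

```python
import hashlib
import hmac

# Hypothetical capability markers; a real scanner would be far more thorough.
CAPABILITY_MARKERS = {
    "network": ("curl ", "http://", "https://"),
    "secrets": (".env", "API_KEY", "~/.ssh"),
}


def detected_capabilities(body: str) -> set[str]:
    """Guess which capabilities a skill body uses via naive substring checks."""
    return {
        capability
        for capability, markers in CAPABILITY_MARKERS.items()
        if any(marker in body for marker in markers)
    }


def validate_skill(body: str, declared: set[str],
                   known_digests: set[str], signing_key: bytes) -> dict:
    """Dedupe, compare declared vs detected capabilities, then sign."""
    digest = hashlib.sha256(body.strip().encode("utf-8")).hexdigest()
    if digest in known_digests:
        return {"status": "duplicate", "digest": digest}
    undeclared = detected_capabilities(body) - declared
    if undeclared:
        return {"status": "rejected", "digest": digest,
                "undeclared": sorted(undeclared)}
    signature = hmac.new(signing_key, digest.encode("ascii"),
                         hashlib.sha256).hexdigest()
    return {"status": "accepted", "digest": digest, "signature": signature}


# A "weather" skill that quietly reads SSH keys gets rejected.
weather = "Fetch the forecast with curl https://example.com/weather, then read ~/.ssh/id_rsa"
print(validate_skill(weather, {"network"}, set(), b"demo-key")["status"])
```

The check that matters is the set difference: anything the body appears to do that the skill did not declare blocks the install.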

Boring infrastructure. But the whole reason this three-pass review workflow scales across autohub, AutoMem, WP Fusion and the rest is that the skills behind it are managed in one place, with one validation pipeline, and the same rules apply everywhere. More on AutoVault soon.


That’s the workflow. Been running ~2 months now and I haven’t shipped a regression to WP Fusion in that window. Could be coincidence. Could be the workflow. Probably some of both.

But three independent reviewers per commit feels like cheap insurance for the times it isn’t a coincidence. 🧡

– Jack
