This is one team's experience moving from human-only PR review to a Mesrai-augmented workflow. The team has asked to stay anonymous, so identifying details are obscured. The numbers are exact, drawn from their internal dashboards and reviewed with them before publishing.

Their before-and-after, in one chart, looks like this: median time-to-first-review fell from 18 hours to 4 hours, median time-to-merge fell from 3 days to 14 hours, and 23 critical bugs were caught at review that the prior process would have caught only after merge (or after deploy). The team added zero headcount.

The team

40 engineers across four product squads. B2B SaaS, payment-adjacent (which matters — security finds are non-negotiable). TypeScript backend on Node, Python data services, React + Next.js frontend. Monorepo on GitHub. Two-reviewer-required policy on `main`.

Before Mesrai, their review process was straightforward: PR opens, reviewer slack-pings happen, one or two engineers eventually pick it up, review notes get exchanged, the PR merges some hours or days later. Standard.

The problem they were solving

Two pains, both showing up in their internal engineering survey:

Senior reviewers were the bottleneck. Half the PRs were waiting on the same six engineers. Reviewer fatigue was real and people were asking out of the rotation.

Bugs were slipping through. Their post-mortems for the prior quarter pointed at "review missed it" four times. Not catastrophic, but enough that the engineering lead was concerned about whether the bar would hold as the team grew to 50.

The setup

They installed the Mesrai GitHub App on the entire monorepo. Severity set to medium, all four rule packs enabled, BYO key with their existing Anthropic contract. Total setup time: under an hour including the internal Slack announcement.

Policy decision they made up front, and it mattered: Mesrai's review does not count as one of the two required human reviews. Mesrai posts findings, the author addresses them, then two humans approve. Mesrai is additive, not substitutive. This single decision is responsible for most of the quality gains they saw.

Week-by-week numbers

text

                       Before   W1      W2      W3      W4      W5      W6
--------------------------------------------------------------------------
Median time-to-first   18h     8h      6h      5h      4h      4h      4h
Median time-to-merge   72h     48h     32h     22h     17h     15h     14h
Mesrai findings/PR     —       4.2     3.1     2.4     2.0     1.7     1.6
Reviewer satisfaction  3.1/5   3.4     3.6     4.0     4.2     4.4     4.5
Bugs caught at review  12/qtr  3/wk    4/wk    5/wk    6/wk    4/wk    3/wk

The interesting patterns:

Time-to-first-review dropped immediately because Mesrai's three-minute first pass shows up in the metric. The human reviewer still arrives at their normal pace.

Time-to-merge took six weeks to land. The early weeks were noisier — Mesrai surfaced things the team had been letting slide, and they spent time addressing real findings that had accumulated.

Findings per PR dropped over time. Same severity, same rules — the codebase got cleaner, so there was less to find.

Reviewer satisfaction climbed sharply after week three. The qualitative comments converge on: "I am reviewing better code now, my comments are more interesting."

The 23 caught bugs

Six weeks, 23 findings their team agrees would have shipped under the old process. Sample (anonymized):

A payments retry handler with `Promise.all` and no `await` — would have charged customers twice on a retry. Severity: critical.

Three separate cases of unsanitized input concatenated into raw SQL — would have been SQL injection vectors. Severity: critical.

An auth middleware change that bypassed token expiry for one route. Severity: critical.

Nine performance regressions — N+1 queries inside `.map` loops, mostly. Severity: high.

Nine architecture violations — service-layer code importing from the infrastructure layer, breaking the established boundary. Severity: medium.

Their CTO's framing: any one of the four critical finds was worth more than the year's subscription cost. The performance finds were nice. The architecture finds were the long-term gift.

What did not change

The team did not remove the two-human-reviewer requirement. They did not reduce headcount. They did not change their merge policy or branch protections. The intervention was strictly additive.

Senior engineers initially worried Mesrai would dilute their judgement role. After six weeks the consensus reversed: by the time they touched a PR, the noise was gone, and they could focus on the calls only they could make. The judgement role got more concentrated, not less.

What we learned helping them

Three things, generalizable:

The substitutive-vs-additive distinction matters more than any tooling choice. Teams that let AI count as a human reviewer saw quality drop. The team that kept the human bar saw quality climb.

Findings count drops over time. Six weeks is a fair window to see the asymptote. Do not panic when week-one numbers look noisy.

Reviewer satisfaction is a leading indicator. If your team is feeling better about reviews after week three, the merge-time wins are coming.

Honest caveats

Three:

This is one team. We have seen similar numbers across several others, but "one team" is not a study. Treat this as an existence proof, not a forecast.

Their starting baseline was on the slow side — 3 days to merge is room for improvement. A team already at 1-day median would see a smaller absolute gain.

They paid for the bundled Anthropic-BYOK plan. If you cannot get senior-model quality, the numbers would be lower; the lighter models still help but the depth gap matters on a payments codebase.

The takeaway

78% faster time-to-first-review, 81% faster time-to-merge, 23 critical bugs caught pre-merge, zero headcount added, six weeks. Same engineering team, same merge bar, same product. The change was a tool that absorbed the throughput layer so the human review could focus on judgement.

Whether your team would see the same numbers depends on your starting baseline, your codebase, and your discipline about keeping AI additive. The math is favorable enough that the experiment is worth running.

More essays3

// try

See it on your next PR.

Free for individuals. Install in two minutes. Mesrai reviews every commit.

Start free Back to blog