Bartek Czyż
  • ai
  • code review

I Had to Review a 330-File PR. Here's How AI Actually Helped

by Bartek

Some time ago I had to review a monster.

It was a pull request in our Next.js app: 330 application files changed, plus 110 visual regression baselines. Fifteen thousand lines added, over two thousand removed. It cut across multiple business domains, shared libraries, utilities, and every kind of test we have - unit, e2e, visual. The kind of PR where GitHub politely gives up rendering the diff.

I know what you’re thinking. “Just split it into smaller PRs.” Believe me, we tried. This one genuinely couldn’t be split at the time - the changes were interdependent in a way that made partial merges either broken or meaningless. And it touched critical parts of the app, so “merge it and see what happens” was not on the menu. Tests were green and the visual baselines showed the changes clearly, but our codebase isn’t perfect. Some of the older areas were under-tested, and there was plenty of room for bugs to hide.

So we had to review it properly.

The human pass

Reviewing this by hand was about as fun as it sounds. It took hours just to read through everything and build a mental model of what the PR actually did. Then came the second-order work: jumping between files to leave comments like “this is a duplicate of what we do over there” or “these two places solve the same problem differently.” Cross-referencing 330 files in your head is not what human brains were built for.

The obvious move in 2026 is to throw AI at it. So we did. Three times, in fact, and the differences between the attempts taught me more about AI-assisted code review than any blog post I’d read before. Including, probably, this one.

Attempt one: the built-in review

A colleague ran Claude Code’s built-in /review command on the PR. It read through the changes, narrating its thought process along the way - including a moment of “wow, this is a large PR, you should have split this into chunks” before getting to work. Fair point, machine. Fair point.

Eventually it produced a tidy document: strengths, weaknesses, actionable items sorted by severity. My colleague was happy. The job was done automagically, the report looked professional, and he called it a day.

Attempt two: ultrareview

Another colleague ran /ultrareview - Claude Code’s heavier review mode that spins up a fleet of reviewer agents in a cloud sandbox. It took noticeably longer and found more issues than the plain review. That was expected; it’s the more thorough tool.

But it also planted an uncomfortable thought in my head: the first run found some things, the second run found other things. What if there’s way more to find, and we just don’t know it yet?

That question is worth sitting with, because it points at something fundamental about how these tools fail.

The real problem: context rot

Here’s the mental exercise. Imagine you personally had to hold the contents of roughly 300 files in your head - before and after the changes - then figure out what each change means, cross-reference them against each other, and evaluate everything in the context of the whole application. You’d fail. Not because you’re a bad engineer, but because that’s simply too much to hold at once.

Language models have the same problem, just with a bigger buffer. A model’s context window is finite, and worse, its quality degrades long before the window is actually full. Research on this is pretty unambiguous: as input length grows, models get measurably worse at finding and using information buried in the middle of all that text. Chroma’s “context rot” study tested 18 frontier models and found that every single one degrades as context grows. The information is technically there; the model just pays less attention to it.

Now feed a single AI session a 15,000-line diff plus all the surrounding code it needs to read for context, plus documentation lookups, plus its own intermediate reasoning. By the time it’s reviewing file 200, the careful analysis it did on file 12 has dissolved into noise. You end up with conclusions that are incomplete, shallow, or occasionally just wrong - and the report still looks complete, which is the dangerous part.
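To put rough numbers on that, here is a back-of-the-envelope sketch. Only the line counts come from the PR (with "over two thousand removed" rounded down to 2,000); the tokens-per-line figure, the context multiplier, and the window size are illustrative assumptions, not measurements.

```python
# Rough estimate of how much context a single-session review would consume.
# Only the line counts come from the PR; every other number is an assumption.

ADDED_LINES = 15_000
REMOVED_LINES = 2_000       # "over two thousand", rounded down

TOKENS_PER_LINE = 12        # assumption: average tokens per line of code
CONTEXT_MULTIPLIER = 3      # assumption: surrounding code read per diff line
WINDOW_TOKENS = 200_000     # assumption: size of the model's context window


def estimate_tokens(diff_lines, tokens_per_line, multiplier):
    """Tokens for the diff itself plus the surrounding code read for context."""
    diff_tokens = diff_lines * tokens_per_line
    return diff_tokens + diff_tokens * multiplier


total = estimate_tokens(ADDED_LINES + REMOVED_LINES, TOKENS_PER_LINE, CONTEXT_MULTIPLIER)
print(f"estimated tokens: {total:,}")
print(f"share of window:  {total / WINDOW_TOKENS:.0%}")
```

Under these assumptions the session would need several times the window before counting documentation lookups or the model's own reasoning. The exact figures matter less than the shape: the budget is spent long before the last file is read.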

AI helps a lot with reviews. But it has real limitations, and if you’re aware of them, you can work around them instead of being quietly burned by them.

Attempt three: divide and conquer

My solution was based on exactly that awareness, and it was almost embarrassingly simple. My prompt looked more or less like this:

I have an insanely huge PR that I want you to review: <link>. You must
split it into chunks and for each chunk use a separate, dedicated
subagent. Each chunk should be a logical slice of the app, small enough
to be reviewable, but large enough to be a standalone PR. Ignore .png
files, I already reviewed them. Pay extra attention to whether the rules
from <our-rules.md> were applied (critical). When all subagents are done
reviewing, put together a report grouped by severity and index issues
1-n. Present the output in a readable form (separate .md file) so that
I can share this review report with my team.

In other words: I virtually split the PR into multiple smaller PRs, purely for review purposes, and gave each slice to its own subagent.
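In my case the model chose the split itself. As a purely mechanical approximation of the same idea, a sketch might group changed files by their leading directories and skip the `.png` baselines. The function name, the `depth` and `max_files` parameters, and the example paths are all hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def split_into_chunks(changed_files, depth=2, max_files=25):
    """Group changed files by leading directory segments, skipping .png files."""
    groups = defaultdict(list)
    for path in changed_files:
        p = PurePosixPath(path)
        if p.suffix.lower() == ".png":
            continue  # visual baselines were already reviewed by hand
        # Use the first `depth` directory segments as the slice name.
        key = "/".join(p.parent.parts[:depth]) or "(root)"
        groups[key].append(path)

    chunks = []
    for key in sorted(groups):
        files = sorted(groups[key])
        # Cap each chunk so it stays small enough to review in one sitting.
        for start in range(0, len(files), max_files):
            chunks.append({"slice": key, "files": files[start:start + max_files]})
    return chunks


# Hypothetical file list, for illustration only.
changed = [
    "src/checkout/cart.ts",
    "src/checkout/cart.test.ts",
    "src/auth/login.ts",
    "e2e/__screenshots__/home.png",
    "package.json",
]
chunks = split_into_chunks(changed)
for chunk in chunks:
    print(chunk["slice"], len(chunk["files"]))
```

A directory-based split is cruder than a split by business domain, which is what the prompt asks for, but it shows the shape of the output: named slices, each with a bounded list of files.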

The trick is in how Claude Code’s subagents work. Each subagent runs in its own isolated context window. The main session never reads the diff itself - its only job is to figure out a sensible split and distribute the work. All the heavy lifting happens inside the subagents: reading the changed files, pulling library docs through context7, running web searches, cross-referencing against our internal rules. None of that pollutes the main thread. When a subagent finishes, only its conclusions flow back to the orchestrator, which assembles everything into the final report.
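The pattern can be sketched in a few lines. This is an analogy, not how Claude Code implements subagents: `review_chunk` is a hypothetical stand-in for a subagent, the chunks are made up, and the "isolation" here is just a local variable that never leaves the function.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical chunks, in the shape an orchestrator might produce.
chunks = [
    {"slice": "src/checkout", "files": ["src/checkout/cart.ts", "src/checkout/total.ts"]},
    {"slice": "src/auth", "files": ["src/auth/login.ts"]},
]


def review_chunk(chunk):
    """Stand-in for one subagent; a real one would call a model here."""
    # Local working context: file contents, docs, intermediate notes.
    # It lives only inside this call and is never returned.
    context = [f"<contents of {path}>" for path in chunk["files"]]
    # Only a compact conclusion goes back to the orchestrator.
    return {"slice": chunk["slice"], "files_reviewed": len(context), "findings": []}


def orchestrate(chunks, max_workers=4):
    """Fan the chunks out to workers and collect their conclusions in order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(review_chunk, chunks))


results = orchestrate(chunks)
for result in results:
    print(result["slice"], result["files_reviewed"])
```

The orchestrator only ever holds the small result dictionaries, so its own working set grows with the number of chunks rather than with the size of the diff.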

Each reviewer effectively got a small, focused PR and a fresh, empty head to review it with. Which, when you say it out loud, is exactly how you’d want humans to do it too.
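The last step of the prompt, a report grouped by severity with issues indexed 1-n, might be assembled along these lines. The severity labels, the finding fields, and the example findings are assumptions made for illustration; the actual report was written by the model.

```python
SEVERITY_ORDER = ["critical", "major", "minor"]  # assumption: the labels in use


def build_report(findings):
    """Return a plain-text report grouped by severity, numbered 1-n overall."""
    # sorted() is stable, so findings keep their original order within a severity.
    ordered = sorted(findings, key=lambda f: SEVERITY_ORDER.index(f["severity"]))
    lines = []
    current = None
    for number, finding in enumerate(ordered, start=1):
        if finding["severity"] != current:
            current = finding["severity"]
            lines.append(f"## {current.capitalize()}")
        lines.append(f"{number}. {finding['file']}: {finding['note']}")
    return "\n".join(lines)


# Hypothetical findings, as they might come back from separate subagents.
findings = [
    {"severity": "minor", "file": "src/auth/login.ts", "note": "example note A"},
    {"severity": "critical", "file": "src/checkout/cart.ts", "note": "example note B"},
    {"severity": "minor", "file": "src/checkout/cart.ts", "note": "example note C"},
]
report = build_report(findings)
print(report)
```

Numbering the issues across the whole report, rather than per section, is what makes the result easy to discuss with a team: "item 2" means the same thing to everyone.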

The results

It burned through my entire 5-hour usage limit. It also found roughly 300 items to fix.

Three hundred. In a PR touching about 330 code files. I didn’t believe it at first, so I spot-checked the findings - and they held up. Duplicated logic, inconsistent implementations of the same pattern, violations of our internal rules, genuine bugs in under-tested corners of the old code. Sure, some of it was nitpicking. But it was exactly the kind of nitpicking I’d happily do myself if the PR were twenty lines long.

And that’s the part that stuck with me. There’s an old joke in our industry: submit a 10-line PR and you’ll get 50 comments; submit a 5,000-line PR and you’ll get “LGTM.” Review quality has always been inversely proportional to PR size, because human attention doesn’t scale. The subagent approach quietly breaks that law. Every slice of the monster got the scrutiny of a tiny PR.

I sent the report to the PR’s author, who fed it back into his own AI workflow and fixed the items in no time. The PR merged - huge, but now correct - and we never had to revert it for breaking something in production.

What I took away from this

The lesson isn’t “AI is great at code review” or “AI is bad at code review.” It’s that context is a budget, and how you spend it determines what you get.

A single AI pass over a massive diff produces a report that looks thorough but is shaped by everything the model had already forgotten by the end. Two different passes producing two different lists of findings isn’t a quirk - it’s a symptom telling you that neither pass saw the whole picture clearly.

The fix isn’t a smarter model or a bigger context window. It’s architecture: one orchestrator that plans, many isolated workers that each see a small, clean slice, and a final report assembled from their conclusions. The same reason we prefer small PRs from humans applies to machines, and when reality hands you a PR that can’t be small, you can at least make the review small - many times over, in parallel.

Know your tools’ limitations, and they stop being limitations. They become design constraints. And design constraints, as every engineer knows, are just architecture waiting to happen.
