Reviewing code you didn’t write, at scale, without reading every line: this isn’t a new problem. If you’ve been a tech lead or senior developer for any length of time, you’ve already developed this skill. You don’t read every line your team writes; you can’t. A team of five developers produces more code than you could comprehensively review, and a team of ten makes it impossible even to pretend, so you’ve learned where to look. You’ve calibrated to patterns: this person is meticulous about error handling but sometimes over-abstracts; that person writes clear code but occasionally misses performance implications. You know which modules are fragile regardless of who touches them. You’ve built heuristics for what deserves deep attention and what you can skim. This is existing practice, refined over years of tech leadership. You already allocate finite attention across large volumes of work you didn’t produce. The question is whether that skill transfers to a different source of output, at a different scale, on a compressed timeline.
It does, more directly than you might expect, though the calibration targets change. A team of ten developers might produce the equivalent of a few thousand lines of meaningful code across a sprint; an AI agent working overnight produces that same volume before you’ve finished your morning coffee. The work you used to review across two weeks now arrives in a single pull request. But the core judgments are the same: is this change high-risk or routine, does it follow established patterns or introduce something new, does the scope match the stated intent, are the edge cases covered? What adapts is who you’re calibrating to. Instead of individual developers with known tendencies, you’re calibrating to models, to agents, to different prompting strategies. Different models have different blind spots, just like different developers do. Claude 4.6 might handle architectural reasoning well but miss certain edge cases; GPT 5.2 might be thorough but occasionally over-engineer; Gemini might excel at pattern matching but struggle with business context. The metacognitive skill, knowing that different sources have different strengths and adjusting your review focus accordingly, is the same skill you’ve used for years with human teams. The source changed; the skill didn’t.
The practical framework transfers too. High-risk changes deserve deep attention regardless of who or what produced them: anything touching critical paths like authentication, payment processing, or data persistence; changes to abstractions that other code depends on; modifications to error handling or edge case logic; new patterns being introduced to the codebase; code that crosses module boundaries or affects multiple layers. Low-risk changes can be skimmed or trusted more readily: internal refactoring that preserves interfaces; changes that follow established patterns exactly; modifications well-covered by comprehensive tests; purely mechanical transformations. The risk profile doesn’t change based on the source. What changes is your confidence in identifying which category a given change falls into, because you don’t have the same background context you’d have if you’d been in the room when a human developer was thinking through the problem. Certain signals help: a diff that’s mismatched to the stated goal deserves skepticism; inconsistency with existing conventions suggests the agent didn’t fully understand the codebase patterns; test coverage that dropped or changed in ways that suggest the tests are now less meaningful is a warning; dependencies that changed without obvious reason warrant closer examination. Conversely, changes that match documented patterns exactly, tests that verify behaviour rather than implementation details, scope bounded to a single module: these lower the attention priority. You’re not making binary decisions; you’re placing bets on where limited attention will have the most impact.
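One way to make those signals concrete is to write them down as a checklist that runs against a summary of each change. The sketch below is a minimal illustration, not a recommended tool: the `Change` fields, the `CRITICAL_PREFIXES` paths, and the thresholds in `attention_level` are all assumptions invented for this example, and a real version would need tuning against your own codebase and your own misses.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical path prefixes for critical code; adjust to your repository.
CRITICAL_PREFIXES = ("auth/", "payments/", "db/")


@dataclass
class Change:
    """A simplified, hypothetical summary of one pull request."""
    files: list[str]
    introduces_new_pattern: bool = False
    modifies_shared_abstraction: bool = False
    coverage_delta: float = 0.0  # percentage points; negative means coverage dropped
    dependencies_changed: bool = False
    matches_stated_goal: bool = True


def top_level_modules(files: list[str]) -> set[str]:
    """Treat the first path segment as the module name (an assumption)."""
    return {path.split("/", 1)[0] for path in files}


def risk_signals(change: Change) -> list[str]:
    """Collect the risk signals from the paragraph above that apply to a change."""
    signals = []
    if any(path.startswith(CRITICAL_PREFIXES) for path in change.files):
        signals.append("touches critical path")
    if change.modifies_shared_abstraction:
        signals.append("modifies shared abstraction")
    if change.introduces_new_pattern:
        signals.append("introduces new pattern")
    if len(top_level_modules(change.files)) > 1:
        signals.append("crosses module boundaries")
    if change.coverage_delta < 0:
        signals.append("test coverage dropped")
    if change.dependencies_changed:
        signals.append("dependencies changed")
    if not change.matches_stated_goal:
        signals.append("diff does not match stated goal")
    return signals


def attention_level(change: Change) -> str:
    """Map the number of signals to a review depth. The cut-offs are arbitrary."""
    count = len(risk_signals(change))
    if count == 0:
        return "skim"
    if count == 1:
        return "read"
    return "deep review"


if __name__ == "__main__":
    change = Change(files=["auth/session.py", "billing/invoice.py"], coverage_delta=-3.0)
    print(attention_level(change), risk_signals(change))
```

The output is a bet, not a verdict. A change with no signals can still break something, which is why the signal list is worth revisiting whenever it turns out to be wrong.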
The bet is what makes this psychologically uncomfortable. Every review is a bet that what you chose to examine deeply was actually the high-risk area, and what you skimmed was actually low-risk. Sometimes you’re wrong, and something breaks in production from a section you didn’t scrutinise carefully enough. The cost of a wrong bet varies dramatically: waste attention over-scrutinising low-risk changes and you’ve burned time you could have spent elsewhere, which is costly but not catastrophic; miss a critical issue in a high-risk change and you’ve shipped a bug that affects users, which is significantly more costly. This asymmetry biases you toward reviewing more than necessary, which leads to burnout when the volume grows faster than your attention can keep up. The skill isn’t avoiding the bet. The skill is calibrating the bet better over time, tracking your misses and your over-scrutiny and adjusting your heuristics so the next bet is more accurate. When something breaks that you reviewed, trace it back: where in the change was the bug, did you look at that area, what signal would have directed your attention there? If you missed it because you trusted a low-risk signal that turned out to be wrong (perhaps you assumed a purely mechanical refactoring was safe when it actually introduced a subtle behavioural change), then your heuristic for that category needs adjustment. When you deep-review something that turns out to be completely fine, notice that too: what made you think it was high-risk, what signals did you overweight? If you consistently spend an hour reviewing changes that turn out to be straightforward, you’re wasting attention you could allocate elsewhere. The calibration is bidirectional, catching more real issues while spending less time on non-issues, and it develops through practice rather than through reading about it. You can’t shortcut it. You have to make the bets, see which ones were right and which were wrong, and adjust.
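If you want the tracking to be more than a feeling, a small log of outcomes is enough to start. The sketch below assumes you record, for each reviewed change, which category you placed it in, whether you deep-reviewed it, whether the review found a real issue, and whether it later broke in production. The `ReviewOutcome` record and the two rates are hypothetical definitions chosen for illustration; you may well prefer different ones.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ReviewOutcome:
    """One reviewed change and what happened afterwards (hypothetical record)."""
    category: str  # e.g. "mechanical refactor", "new pattern"
    deep_reviewed: bool
    issue_found_in_review: bool
    broke_in_production: bool


def calibration_report(outcomes: list[ReviewOutcome]) -> dict[str, dict[str, float | None]]:
    """Per category, compute two rates:

    miss_rate: share of skimmed changes that later broke in production.
    over_scrutiny_rate: share of deep reviews that found nothing and where
        nothing broke afterwards.
    A rate is None when there is no data to compute it from.
    """
    counts: dict[str, dict[str, int]] = {}
    for outcome in outcomes:
        c = counts.setdefault(
            outcome.category,
            {"skimmed": 0, "missed": 0, "deep": 0, "uneventful": 0},
        )
        if outcome.deep_reviewed:
            c["deep"] += 1
            if not outcome.issue_found_in_review and not outcome.broke_in_production:
                c["uneventful"] += 1
        else:
            c["skimmed"] += 1
            if outcome.broke_in_production:
                c["missed"] += 1

    report: dict[str, dict[str, float | None]] = {}
    for category, c in counts.items():
        report[category] = {
            "miss_rate": c["missed"] / c["skimmed"] if c["skimmed"] else None,
            "over_scrutiny_rate": c["uneventful"] / c["deep"] if c["deep"] else None,
        }
    return report


if __name__ == "__main__":
    log = [
        ReviewOutcome("mechanical refactor", False, False, True),
        ReviewOutcome("mechanical refactor", False, False, False),
        ReviewOutcome("new pattern", True, False, False),
        ReviewOutcome("new pattern", True, True, False),
    ]
    for category, rates in calibration_report(log).items():
        print(category, rates)
```

A high miss rate for a category you routinely skim suggests the low-risk signal is less trustworthy than you assumed. A high over-scrutiny rate suggests the opposite. Neither number means much on a handful of changes, so treat them as prompts for the trace-back questions above rather than as answers.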
Some teams are experimenting with AI-assisted review, and the early results raise interesting questions about how this might scale differently. The core insight is straightforward: if peer review helps catch issues that the original author missed, why does the peer have to be human? Some teams are trying multi-model review: generate code with one model, review it with another. Claude and GPT don’t always agree on implementation choices, and when they disagree, that disagreement often points to a genuine tradeoff worth examining. The economics are appealing; an AI reviewer can keep pace with AI generation in a way a human reviewer can’t, and if it can filter changes down to just the parts that need human judgment, that could shift where attention gets allocated. But the failure modes aren’t well understood yet. What does an AI reviewer miss? When does it over-flag low-risk changes and waste human attention? How do you calibrate it to your specific codebase patterns? These are open questions.
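To illustrate the disagreement idea without tying it to any particular vendor, the sketch below assumes each reviewer’s output has already been reduced to a set of findings, where a finding is just a file path and a short label. The `Finding` type and the `triage_findings` function are hypothetical, and how you would obtain or normalise findings from a real model is deliberately left out.

```python
from __future__ import annotations

from typing import NamedTuple


class Finding(NamedTuple):
    """A normalised review finding (hypothetical shape)."""
    path: str
    kind: str  # e.g. "error-handling", "naming", "missing-index"


def triage_findings(
    reviewer_a: set[Finding], reviewer_b: set[Finding]
) -> dict[str, list[Finding]]:
    """Split findings into those both reviewers raised and those only one raised.

    Disputed findings are the candidates for human judgment, on the assumption
    that disagreement tends to mark a real tradeoff.
    """
    return {
        "agreed": sorted(reviewer_a & reviewer_b),
        "disputed": sorted(reviewer_a ^ reviewer_b),
    }


if __name__ == "__main__":
    a = {Finding("api/orders.py", "error-handling"), Finding("api/orders.py", "naming")}
    b = {Finding("api/orders.py", "error-handling"), Finding("db/migrate.py", "missing-index")}
    for bucket, findings in triage_findings(a, b).items():
        print(bucket, findings)
```

Agreement between two reviewers is not proof that a finding is real, and silence from both is not proof that a change is safe. The sketch only shows where disagreement could be surfaced, which is the part the open questions above are about.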
Some of the specialised approaches seem more tractable than others. Security-focused scanning (checking for SQL injection patterns, XSS vulnerabilities, authentication bypasses) feels like a relatively contained problem where an AI agent could be thorough in ways that are difficult for a human at scale; it can check every function, every input path, every boundary condition without fatigue. Performance analysis (flagging algorithmic complexity issues or missing indexes) similarly feels automatable. Pattern consistency checking is harder: an AI agent can verify whether code matches documented conventions, but how many teams have their conventions documented clearly enough for an agent to check against? And conventions evolve; an agent checking against yesterday’s patterns might flag today’s intentional evolution as a violation. Test quality evaluation raises similar questions: can an AI agent reliably distinguish tests that verify behaviour from tests that check implementation details? The distinction is subtle, because a test coupled to implementation details can pass today and still break on the next harmless refactoring. What’s interesting about this direction, whether it works reliably or not, is how it might change what you delegate versus what requires your direct attention. If security scanning is genuinely automatable, you stop spending attention on manually checking for SQL injection and spend it on questions the scanner can’t answer: does this architectural change fit the system’s direction, does this abstraction clarify or obscure, does this tradeoff make sense given business context? The shift would be from comprehensive first-pass review to evaluating what automated tools flagged plus making judgment calls about things that can’t be automated. Whether that works in practice depends on how reliably the tools surface real issues without drowning you in false positives.
The fundamental question hasn’t changed. It’s the same question you’ve been asking for years every time you approve a pull request: did I look at the right things? The constraint hasn’t changed either; your attention is finite, and the output you need to review exceeds what you can comprehensively examine. That was true with a team of human developers; it’s more acutely true now that agents generate at higher volume on compressed timelines. The tools around that constraint are evolving, with AI-assisted review being tried by some teams, though it’s not yet clear how reliably it works or what its failure modes are, and if those tools prove useful they might change what you delegate versus what requires your direct attention. But the constraint being tighter and the tooling being different doesn’t mean the problem is fundamentally different. It means the skill you already developed, knowing where to look, recognising risk patterns, allocating attention strategically rather than uniformly, becomes even more important. You’re not learning something new. You’re adapting something you already know to conditions that are still shifting.