A year ago, if a codebase accumulated three different ways to do the same thing, somebody would usually notice.
A reviewer might leave a comment. A senior engineer would mutter something about "yet another helper," and eventually the team would clean it up. Maybe they'd write a short doc. People would learn, collectively, that this was not the way.
This was not a perfect system. Human review can be slow. People miss things. Pattern docs get stale. Engineers develop opinions that are 70% wisdom and 30% scar tissue.
Still, the friction did something useful. It kept codebases from drifting too quickly.
Coding agents change that.
Once agents are writing a meaningful share of code, the question is no longer just "can we get working code faster?" Obviously, yes. We can. Sometimes hilariously so.
The more interesting question is: what does the codebase teach the next agent?
Because it does teach.
Every merged PR becomes a precedent. If the repository contains one clear pattern, the next agent has a decent shot at following it. If it contains three slightly different patterns, the next agent may extend one, combine two, or invent a fourth for reasons that are, in technical terms, vibes.
The funhouse builds itself
Coding agents, like some of us, have a chaotic inner goth teenager.
Their non-deterministic nature means they can be inconsistent, often in surprising ways. They can generate the weird little leap that solves the problem, or they can generate novelty where nobody asked for it.
The dangerous thing is that most of the code is fine, if you look at it in isolation.
A PR passes tests. The implementation makes sense. The file is clean enough. You look at the diff and think, yeah, sure, that seems okay.
But zoom out, and the codebase as a whole has started to get weird:
- three near-duplicate utilities
- multiple data-fetching patterns
- parallel abstractions solving almost the same problem
The codebase remains locally cohesive, while slowly losing global coherence.
That matters more than it used to, because agents are even more responsive than humans to the signals around them. A clear codebase gives them clear precedent. A muddled codebase gives them confusion.
You can put principles in CLAUDE.md or docs, but they're... advisory. They compete with task instructions and inline comments for attention, get summarized when context windows compact, and depend on the agent attending to the right instruction at the right moment.
If your codebase contains three slightly different ways to solve a problem, the next agent has to infer which one is canonical.
From the agent's point of view, this is not irrational. The repository gave it mixed evidence.
The scary outcome is not code that obviously fails. It is code that superficially keeps working while the shape of the system keeps getting worse.
Tests are nice
Most discussions about AI-native development jump from this problem – agents' tendency to accumulate tech debt – directly to tests.
And yes, across the industry, teams are writing dramatically more tests than they ever have. Agents have made high test coverage affordable in a way it never used to be.
Recently, Garry Tan argued that the primary way to keep AI agents on track is 90% test coverage:
"Tests are the ratchet. 90% coverage, every PR, no exceptions."
And indeed, test coverage is the simplest kind of ratchet: a mechanism that allows motion in one direction only, like a socket wrench that turns the bolt forward but never lets it spin back. Once a test locks in a behaviour, it becomes difficult to accidentally regress.
But it's worth being specific about what tests actually do.
Unit tests check that a function still returns what it returned before. Integration tests check that pieces still wire together the same way. E2E tests reach further: they check whether the product still does what it used to, and choosing which assertions to prioritize there is an important act of human judgment.
But they all share the same basic shape: tests verify that code does what it did before.
Whether what it did was the right way to do the thing is a separate question.
Tests ratchet behavioural sameness.
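To make that concrete, here is a minimal sketch. The two helpers and the check are hypothetical, invented for illustration rather than taken from any real codebase. Both helpers satisfy the same behavioural check, and nothing in that check can notice that the codebase now has two of them.

```typescript
// Two hypothetical near-duplicate helpers, written at different times.
function truncate(text: string, max: number): string {
  return text.length <= max ? text : text.slice(0, max - 1) + "…";
}

function shorten(value: string, limit: number): string {
  if (value.length <= limit) return value;
  return `${value.substring(0, limit - 1)}…`;
}

// A behavioural check: it pins what a helper returns, not how many helpers exist.
function behavesLikeTruncation(fn: (s: string, n: number) => string): boolean {
  return fn("hello world", 6) === "hello…" && fn("hi", 6) === "hi";
}

// Both helpers pass, so coverage stays green while the duplication goes unseen.
const bothPass = behavesLikeTruncation(truncate) && behavesLikeTruncation(shorten);
```

Adding a test for each helper would raise coverage and make either one harder to delete.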
Meanwhile, docs and evals can pin down reasoning and set bars for behaviour, which is sometimes more important. But none of these looks directly at the shape of the system.
90% coverage of good patterns
A coverage ratchet only improves the codebase if the patterns it's locking in are already good ones.
If a questionable pattern already exists in the code, the next agent is more likely to extend it. The tests generated around that implementation only reinforce it further.
Over time, high coverage can unintentionally fossilize architectural drift, rather than prevent it in the first place.
At first, especially during prototyping, this can be easy to miss, because nothing looks obviously broken. Coverage stays high. The system works. But over time, certain parts of the codebase become strangely resistant to cleanup.
Tests that protect the wrong behaviour create drag against coherence. The harder you push for coverage, the more deeply you can lock in the shape you happened to start with.
This isn't a criticism of tests. Tests are necessary. But tests primarily validate behaviour, while many of the emerging failure modes in agent-generated systems, especially as they grow, are failures of shape.
Fitness functions protect the shape of the system
Unit tests ask whether the code still does what it did before. Fitness functions ask a different question: is the system still shaped the way we want?
Most codebases already have a few simple versions of this. Linters, type checks, and import-boundary rules all encode some idea of acceptable shape, at the micro level.
But in agent-maintained codebases, we need an additional kind of macro fitness function. It has a couple of important jobs:
- catch bad agent choices at the moment of work
- redirect agents with the right amount of friction for the violation
At Forestwalk, for example, we have a fitness check that prevents tool handlers from calling back into our AI services. Tool handlers run inside the agent's own loop, so any handler that kicks off more inference can stack model calls inside model calls and freeze a live meeting. Agents sometimes accidentally introduce this pattern, so our fitness check flags it and ensures it's not committed.
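A check like this can be quite small. What follows is a minimal sketch and not Forestwalk's implementation: the directory layout (`src/toolHandlers/`, an `ai` module folder), the type names, and the regex-based import scan are all assumptions made for illustration. A real check would more likely lean on an import-boundary lint rule or a proper parser, since a regex scan can be fooled by comments and string literals.

```typescript
// Hypothetical layout (not from the article): tool handlers live under
// src/toolHandlers/, AI service modules under an ai/ folder.
type SourceFile = { path: string; text: string };
type Violation = { file: string; importPath: string; message: string };

const HANDLER_PATH = /^src\/toolHandlers\//;
const AI_SERVICE_IMPORT = /(^|\/)ai(\/|$)/;

// Static `from "mod"` clauses and dynamic `import("mod")` calls.
const IMPORT_PATTERNS = [
  /\bfrom\s+["']([^"']+)["']/,
  /\bimport\s*\(\s*["']([^"']+)["']/,
];

function importSpecifiers(text: string): string[] {
  const found: string[] = [];
  for (const pattern of IMPORT_PATTERNS) {
    const re = new RegExp(pattern.source, "g");
    let match: RegExpExecArray | null;
    while ((match = re.exec(text)) !== null) {
      if (match[1]) found.push(match[1]);
    }
  }
  return found;
}

// Flags every tool handler that imports from the AI services folder.
function findHandlerViolations(sources: SourceFile[]): Violation[] {
  const violations: Violation[] = [];
  for (const file of sources) {
    if (!HANDLER_PATH.test(file.path)) continue;
    for (const importPath of importSpecifiers(file.text)) {
      if (!AI_SERVICE_IMPORT.test(importPath)) continue;
      violations.push({
        file: file.path,
        importPath,
        message:
          `${file.path} imports ${importPath}. Tool handlers run inside the agent's loop, ` +
          `so a handler that starts more inference can stack model calls inside model calls.`,
      });
    }
  }
  return violations;
}

// Example input: only the first file is a handler that reaches into AI services.
const exampleSources: SourceFile[] = [
  { path: "src/toolHandlers/lookup.ts", text: 'import { summarize } from "../ai/summarize";' },
  { path: "src/toolHandlers/notes.ts", text: 'import { db } from "../db";' },
  { path: "src/routes/chat.ts", text: 'import { summarize } from "../ai/summarize";' },
];
const handlerViolations = findHandlerViolations(exampleSources);
```

Run as a test or pre-commit step, a non-empty result would fail the check and show the agent the message.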
Other fitness tests are ratchets that apply pressure to measurable properties – heuristics for shape like file size, bundle growth, prompt size, import fanout, and dependency count.
For these, the ceiling starts at the current value (e.g. 837 KB), and the check flags to agents and humans any PR that would unduly increase it. For instance, when an agent adds an unnecessary new dependency that blows up our frontend bundle, the check fails.
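The core of a ratchet like this is only a few lines. The sketch below is an illustration under assumptions: the `Budget` shape and function name are hypothetical, and letting the ceiling fall automatically when the metric shrinks is one plausible policy, not something the checks described here are confirmed to do. The numbers are borrowed from the bundle example further down (700 KiB ceiling, failure at +5 KiB of growth).

```typescript
type Budget = { ceiling: number; failAtGrowth: number };

type RatchetResult = {
  ok: boolean;
  growth: number;
  // Ceiling to record after this change. It can fall with the metric, but it
  // never rises on its own (assumed policy): raising it takes a deliberate edit.
  nextCeiling: number;
};

function checkRatchet(measured: number, budget: Budget): RatchetResult {
  const growth = measured - budget.ceiling;
  return {
    ok: growth < budget.failAtGrowth,
    growth,
    nextCeiling: Math.min(budget.ceiling, measured),
  };
}

const bundleBudget: Budget = { ceiling: 700, failAtGrowth: 5 };

const bigJump = checkRatchet(729, bundleBudget);   // growth 29: fails
const smallStep = checkRatchet(704, bundleBudget); // growth 4: passes, ceiling stays at 700
const cleanup = checkRatchet(690, bundleBudget);   // growth -10: passes, ceiling drops to 690
```

Because small growth is measured against a ceiling that does not move up, a run of +4 KiB changes cannot creep past the threshold unnoticed: the second one would measure +8 against the same 700.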
The important part isn't just the limit itself. It's the “just in time” agent coaching.
At first, we found agents would sometimes over-optimize against whatever metric the checks exposed, with great resourcefulness and in ways we did not want. But refining the flags to give common-sense guidance helped steer the agents toward weighing the tradeoffs correctly.
For example, an agent might see this just in time:
⎯⎯⎯⎯⎯⎯⎯ Failed Tests: 1 ⎯⎯⎯⎯⎯⎯⎯
FAIL src/frontendBundleBudget.test.ts > Frontend bundle size budget > keeps built JS growth below the failure threshold
AssertionError: Frontend bundle grew +29 KiB (700 -> 729 KiB; fails at +5 KiB).
This ratchet is review pressure, not a mandate to shrink the number at any cost.
Diagnose first:
- Is the growth accidental (dependency, duplicate/dead code, broad import, avoidable scope creep)?
- Does the added JS unlock user-visible value?
- Can cleanup reduce it without hurting UX, accessibility, tests, or maintainability?
Quality-preserving fixes:
- remove duplicated rendering paths or dead production code
- move repeated styling or static configuration out of lazy-loaded JavaScript
- lazy-load rarely used flows instead of loading them on the main path
- replace an accidentally heavy dependency with an existing local utility
- import only the needed module or icon instead of a whole package
Do not fix this by:
- omitting requested scope or deferring needed UI work
- removing aria-label/title text or visible affordances
- replacing styled in-app controls with native/browser controls
- deleting helpful empty/error-state copy
- cutting corners that hurt code quality, UX, accessibility, tests, or maintainability
For intentional growth, raise FRONTEND_BUNDLE_BUDGET.totalJsKiB in architecture/src/allowlists.ts to the measured value and explain why in the PR: expected 29 to be less than 5
If you simply tell agents to "reduce bundle size," they will sometimes do naive things like:
- delete empty-state copy
- replace styled controls with browser-native ones
- inline dependency code to avoid import costs
- remove accessibility affordances
Instead, the idea is to explain not just what a given fitness test is protecting, but also what must not be sacrificed to satisfy it, and to do so at the moment the agent is working.
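One way to keep that coaching next to the check is to treat it as data and render it into the failure message. This is a rough sketch, not the actual test file: the `Coaching` shape and function names are hypothetical, and the guidance text is abridged from the message quoted above.

```typescript
// Guidance shipped with the failure. Wording abridged from the quoted message;
// the Coaching shape and the function names are hypothetical.
type Coaching = {
  diagnose: string[];
  safeFixes: string[];
  doNot: string[];
  escapeHatch: string;
};

const BUNDLE_COACHING: Coaching = {
  diagnose: [
    "Is the growth accidental (dependency, duplicate/dead code, broad import)?",
    "Does the added JS unlock user-visible value?",
  ],
  safeFixes: [
    "lazy-load rarely used flows instead of loading them on the main path",
    "import only the needed module or icon instead of a whole package",
  ],
  doNot: [
    "omitting requested scope or deferring needed UI work",
    "removing aria-label/title text or visible affordances",
  ],
  escapeHatch:
    "For intentional growth, raise the budget to the measured value and explain why in the PR.",
};

function bulletSection(title: string, items: string[]): string {
  return [title].concat(items.map((item) => `- ${item}`)).join("\n");
}

function formatBudgetFailure(
  beforeKiB: number,
  afterKiB: number,
  failAtKiB: number,
  coaching: Coaching,
): string {
  return [
    `Frontend bundle grew +${afterKiB - beforeKiB} KiB (${beforeKiB} -> ${afterKiB} KiB; fails at +${failAtKiB} KiB).`,
    "This ratchet is review pressure, not a mandate to shrink the number at any cost.",
    bulletSection("Diagnose first:", coaching.diagnose),
    bulletSection("Quality-preserving fixes:", coaching.safeFixes),
    bulletSection("Do not fix this by:", coaching.doNot),
    coaching.escapeHatch,
  ].join("\n");
}

// Throws with the coaching attached once growth reaches the failure threshold.
function assertBundleBudget(beforeKiB: number, afterKiB: number, failAtKiB: number): void {
  if (afterKiB - beforeKiB >= failAtKiB) {
    throw new Error(formatBudgetFailure(beforeKiB, afterKiB, failAtKiB, BUNDLE_COACHING));
  }
}
```

In a real suite the message would more likely be attached to a test-runner assertion than to a bare `Error`, but the effect is the same: the agent reads the "do not" list in the same output that tells it the check failed.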
The same guidance in CLAUDE.md is often skimmed past or applied inconsistently by agents. A just-in-time check, by contrast, puts deliberate friction between the agent and a finished task. Where humans get tired and occasionally creative about routing around friction like this, agents are quite willing to schlep through feedback and loop until the check passes.
What the checks can't see
Fitness functions are good at catching things precise enough to encode.
That is useful, but coherence is not always that crisp. Some drift is obvious only in relation to the rest of the codebase: this new thing is too close to an existing thing, this test is protecting a pattern we no longer want, this change is making the next change harder.
So, on top of the deterministic checks, we use a few non-deterministic ones.
One runs before a PR is opened. It scans the proposed change against the existing codebase, and asks whether the agent has introduced or expanded a competing pattern. When it finds one, it points to the existing pattern, explains why the new one looks redundant, and asks the agent to either extend the original or justify why this case is genuinely different.
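The model-backed part of such a check is hard to show in a few lines, but the feedback it hands back can be sketched. The example below assumes the reviewer returns structured findings. The `DriftFinding` shape, the file paths, and the wording are all hypothetical. It renders the three things described above: a pointer to the existing pattern, the reason the new one looks redundant, and the request to extend or justify.

```typescript
// Hypothetical shape for what a pattern-drift reviewer might return.
type DriftFinding = {
  newPattern: string;      // where the change introduces or expands a pattern
  existingPattern: string; // where the established pattern already lives
  reason: string;          // why the two look redundant
};

function renderDriftFeedback(findings: DriftFinding[]): string {
  if (findings.length === 0) return "No competing patterns found.";
  const blocks = findings.map((finding, index) =>
    [
      `${index + 1}. ${finding.newPattern} looks like a competing version of ${finding.existingPattern}.`,
      `   Why it looks redundant: ${finding.reason}`,
      `   Either extend ${finding.existingPattern}, or explain in the PR why this case is genuinely different.`,
    ].join("\n"),
  );
  return ["Possible pattern drift:"].concat(blocks).join("\n");
}

// Example finding with made-up paths.
const driftFeedback = renderDriftFeedback([
  {
    newPattern: "src/lib/fetchJson.ts",
    existingPattern: "src/lib/httpClient.ts",
    reason: "both wrap fetch with JSON parsing and retry handling",
  },
]);
```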
Most of the time, the agent course-corrects quickly.
We run a similar check again at the PR review level, where it has more context and can catch subtler pattern drift.
Another check runs nightly and looks for tests that have fossilized patterns we have since decided against. Its mission is to find tests that protect old implementation choices that, if left alone, would teach future agents to preserve patterns we no longer want.
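The nightly check described here is non-deterministic, so the following is not a sketch of it. It shows only a deterministic first pass one could imagine running beforehand: scanning test files for markers of patterns the team has retired. The registry shape, the `legacyFetch` and `httpClient` names, and the file paths are all invented for illustration.

```typescript
// Hypothetical registry of patterns the team has decided against.
type RetiredPattern = { label: string; marker: RegExp; replacement: string };
type TestSource = { path: string; text: string };
type FossilCandidate = { testPath: string; label: string; note: string };

function findFossilCandidates(
  tests: TestSource[],
  retired: RetiredPattern[],
): FossilCandidate[] {
  const candidates: FossilCandidate[] = [];
  for (const test of tests) {
    for (const pattern of retired) {
      // String.search ignores lastIndex, so markers with the g flag are safe here.
      if (test.text.search(pattern.marker) === -1) continue;
      candidates.push({
        testPath: test.path,
        label: pattern.label,
        note: `${test.path} still exercises ${pattern.label}; prefer ${pattern.replacement}.`,
      });
    }
  }
  return candidates;
}

const retiredPatterns: RetiredPattern[] = [
  { label: "legacyFetch", marker: /\blegacyFetch\(/, replacement: "httpClient" },
];

const fossilCandidates = findFossilCandidates(
  [
    { path: "src/api/users.test.ts", text: 'expect(legacyFetch("/users")).toBeDefined();' },
    { path: "src/api/teams.test.ts", text: 'expect(httpClient.get("/teams")).toBeDefined();' },
  ],
  retiredPatterns,
);
```

A list like this only says where to look. Deciding whether a given test is protecting behaviour worth keeping or merely an old implementation choice is the judgment call the nightly check and its human reviewers are there to make.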
These checks are not magic. They miss things. They can be noisy. They still need human judgment around them.
But the goal is not to eliminate judgment. The goal is to move more human judgment into places where it can be highest leverage.
Making coherence the easy path
Fitness ratchets are not the whole answer to maintaining coherence in agent-generated codebases.
Fitness functions are one layer. Pattern drift checks are another. Product verification, code review, evals, documentation, repository structure, and the actual taste of the humans steering the system all matter too.
However, it's worth investing regularly in the early layers: the ones closest to the moment an agent is making an implementation choice.
That is where we have found these types of fitness functions useful. They do not solve coherence by themselves, but they add friction in the right places. Bad patterns get interrupted earlier. Good patterns have a better chance of becoming precedent.
The same feedback loop that spreads bad patterns can be used to spread good ones.
So there's hope. Seeing this pattern work soothes some of the worry we all have about agents causing chaos. Agents are a lot more useful when the codebase gives them an obvious next move. The right pattern is easier to find. The weird fourth version of the thing has a harder time looking reasonable.
It's not perfect. Keeping a growing codebase coherent still requires human judgment, architectural intuition, and social coordination. The system still needs to be built.
But that system, itself, increasingly feels like the work.
If the codebase trains the next agent, then the job is not just to write better code.
It is to build the environment where better code is the obvious thing to write.