A while ago, Anthropic used Claude to rewrite a C compiler from scratch for around $20,000. The compiler worked well enough to compile the Linux kernel. Rewrites that used to take years now take days.

If you’ve been using AI coding tools seriously for the past year, you’ve probably had your own version of this experience — a feature that would have been a sprint shipped over an afternoon, a refactor you’d been putting off for months done before lunch.

So here’s the awkward bit — when creating code costs almost nothing, what’s the job?

The Harness

Software used to be handwritten from scratch. More recently, it’s been mostly glued together from libraries, cloud services and SaaS platforms. Now, increasingly, it’s generated by AI.

What is coding then? Writing a prompt? We quickly realized one prompt isn’t enough. We need to manage a lot of context, which requires attention to detail — and so our prompts grew into specs. Then we noticed the spec doesn’t run on its own; it needs rules, skills, commands, tests, lints, builds — a harness — to actually produce working software.

The most valuable part of a repo, it seems, is no longer the code. It’s the encoded knowledge (edge cases, UX decisions, performance tricks, implementation nuances) that sits in the spec and the harness. The code is one rendering of that knowledge. Given a good spec and harness, you can re-render it for the price of tokens, perhaps even in a different language.

The C compiler rewrite is a useful illustration of why the harness matters as much as the code, or more. The rewrite was only viable because the original project had a solid test suite. Even so, the result was not perfect: the system didn’t boot, because the new compiler lacked optimizations and the compiled bootloader was too large to fit in the boot sector. This shows that specs are more than high-level requirements. They encode the non-obvious things that make software actually work, including performance constraints that aren’t visible until something fails to fit. In the case of the C compiler, no such spec existed.
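A constraint like "the bootloader must fit" can be captured in the harness as an explicit check that fails as soon as an artifact outgrows its budget. Below is a minimal Python sketch; the function name, the artifact path you would pass it, and the use of the classic 512-byte x86 boot sector as the budget are illustrative assumptions, not details from the compiler project.

```python
from pathlib import Path

# Classic x86 BIOS boot sector size in bytes, used here only as an example budget.
BOOT_SECTOR_BYTES = 512


def check_size_budget(artifact: Path, limit_bytes: int) -> None:
    """Fail loudly if a build artifact exceeds its size budget.

    Hypothetical harness check: the artifact path and the limit are
    assumptions, chosen to show how a non-obvious constraint can be encoded.
    """
    size = artifact.stat().st_size
    if size > limit_bytes:
        raise AssertionError(
            f"{artifact} is {size} bytes, over its {limit_bytes}-byte budget "
            f"by {size - limit_bytes} bytes"
        )
```

Called from the test suite, for example as `check_size_budget(Path("build/boot.bin"), BOOT_SECTOR_BYTES)` with whatever path your build actually produces, a check like this turns tribal knowledge into feedback the agent gets on every run.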

So what does the coding process look like in this new reality?

Planning

Planning used to involve product work, graphic design, team gatherings with story estimation, whiteboarding sessions, design docs, occasional spikes for high-uncertainty work, and finally tracking it all as tickets in an issue tracker. It involved a lot of human coordination.

Now, it seems, planning is mostly a spec activity, done with the agent as a collaborator. It compresses from weeks to hours.

The typical loop looks like this: the agent reviews the relevant parts of the codebase, researches good practices around the topic at hand (architecture, security, UX), drafts a spec, and iterates with you over edge cases, nuance, and the technical detail that matters. A good rule of thumb: by the end, you should have something detailed enough that the agent can run uninterrupted for around an hour — writing code and tests, both manual and automated — and produce something worth reviewing.

Planning often includes design, which can now be done by the same person who writes the code, using modern design tools.

Many other useful tools can be part of planning — e.g. you can include UML diagrams in the spec, generated with Mermaid.

For a detailed take on what good planning looks like, see this great post from Boris Tane.

Reviewing the AI

Code review splits into two distinct activities now:

  • First, the developer reviews what the AI produced.
  • Second (optional), other developers review the combined output of the AI and the developer.

When I’m reviewing AI-generated code, my attention goes to a few specific things.

Unexpected changes. Files that shouldn’t have changed, scope creep, hidden refactors. The agent sometimes does more than it was asked to. I try to catch this early.
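Part of this check can be automated. The Python sketch below compares the files a change touched against the paths the spec says are in scope, and lists everything else for a closer look. The `allowed` glob patterns are a hypothetical convention for recording scope in a spec; the list of changed files would come from something like `git diff --name-only main`.

```python
from fnmatch import fnmatch


def out_of_scope(changed: list[str], allowed: list[str]) -> list[str]:
    """Return changed paths that match none of the allowed glob patterns.

    `allowed` is a hypothetical list of globs copied from the spec, e.g.
    ["src/billing/*", "tests/billing/*"]. fnmatch's "*" also matches "/",
    so "src/billing/*" covers nested files as well.
    """
    return sorted(
        path for path in changed
        if not any(fnmatch(path, pattern) for pattern in allowed)
    )


changed_files = [
    "src/billing/invoice.py",
    "tests/billing/test_invoice.py",
    "src/auth/session.py",  # not mentioned in the spec
]
print(out_of_scope(changed_files, ["src/billing/*", "tests/billing/*"]))
```

Anything the function returns is not necessarily wrong, but it is exactly the kind of unrequested change worth reading first.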

Tests. This is where most of my time goes.

  • Tests need to actually test something meaningful. Unit tests should be easy to read and exercise real logic, especially for side-effect-free functions, which should dominate the logic, model, and domain layers.
  • Watch out for tests with heavy mocking that create the illusion of coverage without real verification. The agent will happily write these if you don’t push back.
  • Make sure integration tests cover all database and external service interactions.
  • Good e2e tests are also worth the cost. Still on my list is recording videos so I can review results quickly, without reading the test code.
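To make the contrast concrete, here is a Python sketch built around a hypothetical pricing domain. The first test exercises a side-effect-free function directly, boundaries included; the second is an integration test that runs real SQL against an in-memory SQLite database through the standard `sqlite3` module instead of mocking the connection. Names like `apply_discount` and `save_order` are illustrative only.

```python
import sqlite3
from decimal import Decimal


def apply_discount(total: Decimal, percent: int) -> Decimal:
    """Pure domain logic: no I/O, so it can be tested without any mocks."""
    if not 0 <= percent <= 100:
        raise ValueError(f"discount must be between 0 and 100, got {percent}")
    return (total * (100 - percent) / 100).quantize(Decimal("0.01"))


def save_order(conn: sqlite3.Connection, order_id: str, total: Decimal) -> None:
    """Persistence code: worth an integration test against a real database."""
    conn.execute("INSERT INTO orders (id, total) VALUES (?, ?)", (order_id, str(total)))


def test_apply_discount() -> None:
    # Real logic, including the boundaries and an invalid input.
    assert apply_discount(Decimal("80.00"), 25) == Decimal("60.00")
    assert apply_discount(Decimal("19.99"), 0) == Decimal("19.99")
    assert apply_discount(Decimal("19.99"), 100) == Decimal("0.00")
    try:
        apply_discount(Decimal("10.00"), 101)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for a 101% discount")


def test_save_order_roundtrip() -> None:
    # Real SQL against an in-memory SQLite database instead of a mocked connection.
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, total TEXT NOT NULL)")
        save_order(conn, "o-1", Decimal("60.00"))
        row = conn.execute("SELECT total FROM orders WHERE id = ?", ("o-1",)).fetchone()
        assert row is not None and Decimal(row[0]) == Decimal("60.00")
    finally:
        conn.close()
```

Under pytest, both `test_` functions are collected automatically. A mock-heavy version of the second test would swap `conn` for a `unittest.mock.Mock` and only assert that `execute` was called, which would still pass if the SQL were wrong.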

I pay a lot of attention to whether all edge cases are covered.

With good tests in place, I only skim the implementation, looking for anything that seems off and paying attention to architecture and software design.

UI. I mostly don’t review UI by reading code. Instead, I ask for visual evidence: for web development, I want the different UI states covered in Storybook. For UI regressions across changes, Chromatic tracks the diffs.

Rule of AI reviews

The single most important rule: don’t fix AI output manually or with one-off prompts. When you spot a problem, improve the spec, the skills, the rules, or other parts of the harness. Then ask the agent to update the code to meet the new expectations. Patching the output is a trap: you fix it once, and the next PR reintroduces the same class of issues. Improving the harness fixes it for the long term.
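As an illustration of moving a review finding into the harness, suppose reviews keep catching the agent calling `print` in library code instead of using a logger. Rather than fixing each occurrence, you could add a small check like the Python sketch below, which walks the syntax tree with the standard `ast` module and flags such calls. The rule and the function name are hypothetical; in a real project an existing linter rule, or a line in the agent’s rules, may be a better home for it.

```python
import ast


def find_print_calls(source: str, filename: str = "<string>") -> list[str]:
    """Return 'file:line' locations of bare print(...) calls.

    Hypothetical project rule captured from a recurring review finding:
    library code should log through `logging`, not print.
    """
    tree = ast.parse(source, filename=filename)
    return [
        f"{filename}:{node.lineno}"
        for node in ast.walk(tree)
        if isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id == "print"
    ]


sample = "import logging\n\ndef charge(amount):\n    print('charging', amount)\n    return amount\n"
print(find_print_calls(sample, "billing.py"))
```

Once a check like this runs alongside the other lints, the next PR can’t quietly reintroduce the same class of issue.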

If the implementation is far from what I expected, I would rather improve the spec and harness and regenerate the PR from scratch. Code is cheap, after all.

A few smaller habits that go with this:

  • Commit often, in small chunks, as you accept specific parts of the feature and adjust the harness.
  • Build your own code-review command and improve it over time, to reduce the number of things you have to look at manually. Such a command usually starts in a fresh context, often spawning multiple subagents that use different models. It reviews both the spec and the implementation.
  • Add an AI security review as well.
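The shape of such a review command can be sketched independently of any particular agent tool. In the Python sketch below, each reviewer is just a callable that receives only the spec and the diff and returns findings; in practice these would be subagents backed by different models. The `Reviewer` type, the reviewer names, and the stub implementations are hypothetical placeholders, not a real API.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable

# Hypothetical reviewer signature: (spec, diff) -> findings. In a real command,
# each reviewer would wrap a subagent, ideally backed by a different model.
Reviewer = Callable[[str, str], list[str]]


@dataclass(frozen=True)
class Finding:
    reviewer: str
    message: str


def run_review(spec: str, diff: str, reviewers: dict[str, Reviewer]) -> list[Finding]:
    """Run every reviewer in parallel on the spec and the diff only (a fresh
    context), then merge their findings, dropping exact duplicates."""
    with ThreadPoolExecutor(max_workers=max(1, len(reviewers))) as pool:
        futures = {name: pool.submit(fn, spec, diff) for name, fn in reviewers.items()}
        results = {name: future.result() for name, future in futures.items()}
    seen: set[str] = set()
    merged: list[Finding] = []
    for name, messages in results.items():
        for message in messages:
            if message not in seen:
                seen.add(message)
                merged.append(Finding(name, message))
    return merged


# Stub reviewers standing in for model-backed subagents.
def spec_reviewer(spec: str, diff: str) -> list[str]:
    if "refund" in spec and "refund" not in diff:
        return ["Spec requires refunds, but the diff never touches them"]
    return []


def security_reviewer(spec: str, diff: str) -> list[str]:
    return ["eval() on user input"] if "eval(" in diff else []


findings = run_review(
    spec="Users can request a refund within 30 days.",
    diff="+ result = eval(user_input)",
    reviewers={"spec": spec_reviewer, "security": security_reviewer},
)
for finding in findings:
    print(f"[{finding.reviewer}] {finding.message}")
```

Keeping reviewers as plain callables makes it easy to add one (say, for a new class of issue you keep spotting) without touching the rest of the command.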

Team Review

Here’s the puzzle a lot of teams hit quickly: coding got an order of magnitude faster, and throughput barely budged. Why?

Three causes:

  1. Weak planning that produces vague specs. Covered in the planning section above.
  2. Manual prompt-fixing instead of investing in the harness. Covered in the previous section.
  3. Async GitHub review is too slow for the new pace. When the coding loop takes hours and the review loop takes days, review becomes the bottleneck. This section covers it.
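A back-of-the-envelope calculation shows why faster coding alone barely moves throughput. Treat lead time per change as coding time plus review time, and assume, purely for illustration, 8 hours of coding and 40 hours of waiting on async review, with coding then becoming ten times faster:

```latex
\begin{aligned}
T &= t_{\text{code}} + t_{\text{review}} \\
T_{\text{before}} &= 8\,\text{h} + 40\,\text{h} = 48\,\text{h} \\
T_{\text{after}} &= \tfrac{8}{10}\,\text{h} + 40\,\text{h} = 40.8\,\text{h} \\
T_{\text{before}} / T_{\text{after}} &= 48 / 40.8 \approx 1.18
\end{aligned}
```

Under these assumed numbers, a tenfold coding speedup yields less than a 1.2x speedup end to end; once review dominates, only faster review moves the total.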

You can forget the async PR-comment ritual we know from GitHub — it was designed for a world where writing code was slow.

Sync review

Instead of commenting on GitHub, review in person, or remotely over a screen share. During the review, you can ask a colleague or Claude Code to explain any piece of code.

Read the spec first. Get a quick understanding of the PR’s intent and catch anything conceptually wrong before diving into code.

Fix things as you go, with the agent. Of course, improve the harness first, not the code itself. Don’t forget to extend your review command with whatever your reviews turn up.

Calibrate review depth to project maturity. Early in a project, I review slowly and thoroughly. I speed up as the harness matures and I put guardrails in place — both traditional (tests, linters, stories, Chromatic) and modern (skills, rules, commands).

Make sure the agent runs all review tools locally before pushing: lints, tests, builds, anything that could surface in CI. Agents are reasonably good at picking the checks relevant to the change rather than running the whole suite. Adding specific skills can help if they get lost. Don’t waste time cycling small improvements between GitHub and your local machine.
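If you want a single entry point the agent can call, a small script that runs the same checks CI runs, in order, and stops at the first failure is often enough. The Python sketch below assumes the checks are listed by hand; the `npm` commands in `CHECKS` are placeholders for whatever your project actually uses.

```python
import subprocess
import sys
from typing import Sequence

# Placeholder commands: replace with the lint/test/build steps your CI runs.
CHECKS: list[tuple[str, list[str]]] = [
    ("lint", ["npm", "run", "lint"]),
    ("test", ["npm", "test"]),
    ("build", ["npm", "run", "build"]),
]


def run_checks(checks: Sequence[tuple[str, list[str]]]) -> bool:
    """Run each check in order; stop and report at the first failure."""
    for name, cmd in checks:
        print(f"==> {name}: {' '.join(cmd)}", flush=True)
        try:
            result = subprocess.run(cmd)
        except FileNotFoundError:
            print(f"FAILED: {name} ({cmd[0]} not found)")
            return False
        if result.returncode != 0:
            print(f"FAILED: {name} (exit code {result.returncode})")
            return False
    print("All checks passed.")
    return True
```

From a script entry point, `sys.exit(0 if run_checks(CHECKS) else 1)` gives CI-style exit codes. Failing fast keeps the feedback loop short; running everything and collecting all failures is a reasonable alternative when checks are slow.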

Don’t cut testing corners. Verify the AI-written tests actually test something. Good tests are the best quick-feedback loop for AI — both for developing new features and for catching regressions. They’re also a great tool for quick reviews.
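One cheap way to verify that a test actually tests something is to break the code on purpose and confirm the test notices. This is the idea behind mutation testing; tools such as Stryker (JavaScript/TypeScript) or mutmut (Python) automate it, but the hand-rolled Python sketch below shows the principle with a hypothetical `is_adult` function and a single hand-written mutant.

```python
from typing import Callable


def is_adult(age: int) -> bool:
    """Hypothetical function under test."""
    return age >= 18


def mutant_is_adult(age: int) -> bool:
    """A deliberate bug: the boundary operator flipped from >= to >."""
    return age > 18


def weak_test(fn: Callable[[int], bool]) -> bool:
    """Only checks values far from the boundary."""
    return fn(30) is True and fn(5) is False


def strong_test(fn: Callable[[int], bool]) -> bool:
    """Also checks both sides of the boundary."""
    return fn(30) is True and fn(5) is False and fn(18) is True and fn(17) is False


def kills_mutant(test: Callable[[Callable[[int], bool]], bool]) -> bool:
    """A useful test passes on the real code and fails on the mutant."""
    return test(is_adult) and not test(mutant_is_adult)


print(kills_mutant(weak_test))    # False: the weak test misses the bug
print(kills_mutant(strong_test))  # True: the strong test catches it
```

Both tests pass against the real code, which is exactly why coverage alone can look reassuring; only the mutant reveals that the weak test never checks the boundary.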

Split PRs. Lightweight changes (UI tweaks, copy, cosmetics) can skip team review; core changes require it. Use CODEOWNERS to mark which paths are core.

When in doubt, throw it out and start again. Rewrite the spec and regenerate. Code is cheap. A solid review of cleanly regenerated code is often faster than picking through a flawed PR.

The New Programming Paradigm

Two broader shifts are worth naming.

Back to architecture and code design. Whole categories of “developer convenience” SaaS exist because integrating common functionality used to be expensive. When that cost collapses, their value proposition narrows. Clerk is a clear example: building a login system used to mean weeks of work or a paid integration; now you can ship most of what you need in a day or two. Integrating multiple SaaS providers also carries a complexity cost and limits customisation. We will see more in-house code and fewer integrations.

A new way of working. The new process is faster than agile by another order of magnitude, and notably less collaborative. The unit of production becomes the generalist — one person who talks to the customer, decides the UX, writes the spec, and ships across the stack, without handoffs. Less ceremony, fewer meetings, more direct connection between customer need and shipped code.

Agentic coding isn’t hard to learn, but there are a lot of good practices and new habits to pick up. It can be overwhelming, but if you take it step by step, significant improvements are within reach.