Claude Code and Codex in Production: What We've Learned After 50 Deployments

Neil Simpson · 9 September 2025 · Updated 16 May 2026

Tags: claude-code, codex, case-study

We've shipped over 50 production systems using Claude Code and Codex as core parts of our workflow. Not side projects. Not experiments. Real systems handling real traffic for paying clients.

Here's what we've learned — the patterns that work, the mistakes we've made, and why the methodology matters far more than the tool.

Key takeaways

The single highest-leverage investment is a well-maintained CLAUDE.md (or AGENTS.md). Every project gets one. It's not human documentation — it's context injection for the model.
Test-first works exceptionally well with AI. Humans define what should be tested; the AI implements both the tests and the code. Coverage becomes the default, not the afterthought.
AI should never make architectural decisions. It's an exceptional implementer and a poor architect. The local-correctness bias produces code that works in isolation but doesn't fit the larger system.
Every line of generated code gets human review. No exceptions. AI output looks more correct than hand-written code even when it isn't — that's the trap.
Scope sessions tightly. "Build the auth flow" produces better results than "build the backend". 1–2 hour sessions with clear acceptance criteria outperform open-ended ones.
The tool matters less than the process. An engineer with a mediocre tool and a great process beats an engineer with a great tool and no process. Every time.

What Works

`CLAUDE.md` and `AGENTS.md` as living architecture documents

Every project we ship gets an agent-facing markdown file at the repo root. Claude Code reads CLAUDE.md automatically on session start; Codex reads AGENTS.md. The contents are the same kind of thing in both cases: project conventions, file layout, stack decisions, golden rules, common mistakes.

A real example of the kind of section that pays for itself within a week:

## Golden Rules
 
### Always
 
- Use animation presets from `src/lib/animations.ts` — never inline animation configs
- Add `data-section-theme` to every section for theme-aware styling
- Use `next/image` for all images with explicit width/height
- Use semantic color tokens (`bg-background`, `text-foreground`) — never hardcode colors
 
### Never
 
- Animate layout properties (`width`, `height`, `top`, `left`) — use `transform` and `opacity`
- Use `"use client"` unless the component needs interactivity
- Commit filler content (dummy text, fake URLs, TODO markers)

Without that file, the model defaults to "generic React app" patterns and you spend the review pass undoing them. With it, the first draft already matches your house style and review time drops by an order of magnitude.

We update these files as the project evolves. New convention → add a rule. Pattern you keep correcting → add a "Common mistakes" entry. Treat it as the canonical source of project context, not as documentation.
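One way to keep such a file from drifting is a small check that fails when an expected section disappears. The following is a minimal sketch, not a feature of Claude Code or Codex: the function name `missingSections` and the required section names are assumptions made for illustration.

```typescript
// Hypothetical helper, not part of Claude Code or Codex: reports which
// required headings are missing from an agent-facing markdown file.
function missingSections(markdown: string, required: string[]): string[] {
  const headings: string[] = [];
  for (const line of markdown.split(/\r?\n/)) {
    // Matches ATX headings such as "## Golden Rules". Lines inside fenced
    // code blocks are not excluded, which is acceptable for a rough check.
    const match = /^#{1,6}\s+(.+?)\s*$/.exec(line);
    if (match && match[1]) headings.push(match[1].toLowerCase());
  }
  return required.filter((section) => headings.indexOf(section.toLowerCase()) === -1);
}

// The required section names below are assumptions for illustration.
const agentFile = ["# Project", "## Golden Rules", "### Always", "### Never"].join("\n");
console.log(missingSections(agentFile, ["Golden Rules", "Common Mistakes"]));
// reports "Common Mistakes" as the only missing section
```

Run in CI or a pre-commit hook, a check like this would flag a CLAUDE.md that has lost its "Common mistakes" section before the next session starts without it.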

Test-first development, AI-assisted

We write the test specification first — usually as a plain-English describe block or a function signature with a docstring — and let the model implement both the tests and the code. The human defines what should be tested. The AI handles the how.

A typical session looks like this:

// Step 1: human writes the contract
describe("calculateInvoiceTotal", () => {
  it("applies tier discount when customer is in a discount tier", () => {});
  it("ignores discount when order is unconfirmed", () => {});
  it("rounds to 2 decimals (banker's rounding)", () => {});
  it("returns subtotal when no discount applies", () => {});
});
 
// Step 2: AI implements the tests, then the function
// Step 3: human reviews both

This catches a huge proportion of issues before they reach code review. More importantly, it ensures every feature ships with meaningful coverage — not the kind of coverage you get when tests are bolted on after the fact.
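For illustration, here is a sketch of what Step 2 might produce for the contract above. The `Order` shape, the use of integer cents, and discounts expressed in basis points are all assumptions; the real function would follow whatever the project's domain model defines. Working in integer minor units keeps the banker's rounding exact instead of depending on floating-point ties.

```typescript
// All names and shapes here are illustrative assumptions.
type Order = {
  subtotalCents: number; // integer minor units, so ties are detected exactly
  confirmed: boolean;
  discountBasisPoints: number; // 0 when no discount tier applies; 5000 = 50%
};

// Divides two non-negative integers, rounding exact halves to the even neighbour.
function divideRoundHalfEven(numerator: number, denominator: number): number {
  const remainder = numerator % denominator;
  const quotient = (numerator - remainder) / denominator;
  const twice = remainder * 2;
  if (twice > denominator) return quotient + 1;
  if (twice < denominator) return quotient;
  return quotient % 2 === 0 ? quotient : quotient + 1;
}

// Returns the invoice total in cents.
function calculateInvoiceTotal(order: Order): number {
  const discountApplies = order.confirmed && order.discountBasisPoints > 0;
  if (!discountApplies) return order.subtotalCents;
  const keptBasisPoints = 10_000 - order.discountBasisPoints;
  return divideRoundHalfEven(order.subtotalCents * keptBasisPoints, 10_000);
}

console.log(calculateInvoiceTotal({ subtotalCents: 1005, confirmed: true, discountBasisPoints: 5000 })); // 502 (502.5 rounds to even)
console.log(calculateInvoiceTotal({ subtotalCents: 1015, confirmed: true, discountBasisPoints: 5000 })); // 508 (507.5 rounds to even)
console.log(calculateInvoiceTotal({ subtotalCents: 1005, confirmed: false, discountBasisPoints: 5000 })); // 1005
```

The two tie cases are exactly the kind of thing the review in Step 3 should check by hand: naive `Math.round` would give 503 for the first one.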

Domain context injection

The biggest lever for code quality is feeding the model deep context about the business domain. When building a compliance system, we put the actual regulations in docs/regulations/ and reference them. For a logistics platform, we provide the domain model and the business rules.

The more domain context the AI has, the less generic — and less wrong — its output becomes. A model that knows your shipments have a Region field, a Carrier relationship, and a "Confirmed" status will produce code that handles those concepts correctly the first time. A model that doesn't will produce something plausible-looking that fails the first acceptance test.
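As a rough sketch of what that context can look like in code form: only the Region field, the Carrier relationship, and the "Confirmed" status come from the description above. The other statuses, the region values, and the `canDispatch` rule are assumptions invented for the example.

```typescript
// Hypothetical domain model. Region values, the extra statuses, and the
// dispatch rule are assumptions, not taken from a real project.
type Region = "EU" | "UK" | "US";
type ShipmentStatus = "Draft" | "Confirmed" | "Delivered";

interface Carrier {
  id: string;
  servesRegions: Region[];
}

interface Shipment {
  id: string;
  region: Region;
  carrier: Carrier;
  status: ShipmentStatus;
}

// Assumed business rule: a shipment can be dispatched only when it is
// Confirmed and its carrier serves the shipment's region.
function canDispatch(shipment: Shipment): boolean {
  return (
    shipment.status === "Confirmed" &&
    shipment.carrier.servesRegions.indexOf(shipment.region) !== -1
  );
}

const euCarrier: Carrier = { id: "carrier-1", servesRegions: ["EU", "UK"] };
console.log(canDispatch({ id: "s-1", region: "EU", carrier: euCarrier, status: "Confirmed" })); // true
console.log(canDispatch({ id: "s-2", region: "US", carrier: euCarrier, status: "Confirmed" })); // false
```

Types like these double as context: a model that can read them has no reason to invent a free-text status or a carrier name string.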

Review-everything culture

Every line of AI-generated code gets human review. Every single line.

This isn't optional and it isn't a bottleneck. It's the quality gate that separates teams shipping reliable software from teams shipping AI-generated chaos. The review pass also feeds back into CLAUDE.md: every recurring correction becomes a rule the model will follow next time.

The thing to internalise: review is the highest-leverage human activity in an AI-augmented workflow. It's not the boring part. It's the most important part.

What Doesn't Work

Blind trust in generated code

Early on, we caught ourselves approving generated code because it "looked right". Clean formatting, reasonable variable names, correct-seeming logic. Then it would fail on an edge case that a five-second read would have caught.

AI-generated code has a dangerous quality: it looks more correct than hand-written code, even when it isn't. The visual polish hides logic gaps. A junior engineer's first attempt at handling an edge case usually looks scrappy and forces a review conversation. An AI's first attempt looks production-ready, which suppresses the review instinct.

The fix is process, not vigilance: every line gets reviewed, no matter how polished. Cultural enforcement matters more than individual discipline.

Skipping tests because "it looks right"

Related to the above. If you don't have tests, you don't have confidence. Full stop.

The speed of AI generation makes it tempting to skip the test step: the model produces code faster than a human can review and test it, so dropping tests looks like the way to keep up. That's the wrong direction. We've seen teams ship broken features that passed visual inspection but failed under real-world conditions.

The right answer is to let the AI generate the tests too, alongside the implementation, and review both. You gain back the time you'd have spent typing tests, not the time you'd have spent thinking about correctness.

Letting AI make architectural decisions

Claude Code and Codex are exceptional implementers. They are not architects.

When we let the AI choose between architectural approaches — "should this be a queue or a webhook?", "monorepo or polyrepo?", "RPC or REST?" — it optimised for local correctness. The code worked. But it didn't fit the bigger picture, because the model couldn't see the bigger picture.

Architectural decisions require understanding trade-offs across the entire system lifecycle: maintenance cost, team familiarity, ops surface, future flexibility, blast radius of mistakes. Those trade-offs depend on context the model doesn't have and can't easily be given.

The reliable pattern: humans make architectural decisions and document them (often in CLAUDE.md as constraints). The AI implements within those constraints.

Long, unscoped sessions

AI works best with clear, bounded tasks. "Build the authentication flow" produces better results than "build the backend".

We break every project into focused implementation sessions — usually 1–2 hours — with specific objectives and acceptance criteria. A typical session: "Implement the /api/auth/login route handler, returning a session cookie on success and 401 with a standard error body on failure, with tests for both paths and rate limiting per IP."
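To show how much a brief like that pins down, here is a framework-agnostic sketch of a handler that would satisfy it. Everything specific is an assumption: the function names, the error body shape, the 429 status for rate-limited requests, the limit of 5 attempts per minute, and the injected `verifyCredentials` dependency. A real route handler would wrap this logic in the project's web framework.

```typescript
// Sketch only: names, error shape, and rate-limit numbers are assumptions.
type LoginResult = {
  status: 200 | 401 | 429;
  headers: Record<string, string>;
  body: { ok: true } | { error: { code: string; message: string } };
};

type LoginDeps = {
  // Returns a session id on success, or null when credentials are invalid.
  verifyCredentials: (email: string, password: string) => string | null;
  now: () => number; // current time in milliseconds, injected for testing
};

const WINDOW_MS = 60_000;
const MAX_ATTEMPTS = 5;

function errorResult(status: 401 | 429, code: string, message: string): LoginResult {
  return { status, headers: {}, body: { error: { code, message } } };
}

function createLoginHandler(deps: LoginDeps) {
  // Fixed-window attempt counter per IP, held in memory for the sketch.
  const attempts: Record<string, { windowStart: number; count: number }> = {};

  return function handleLogin(ip: string, email: string, password: string): LoginResult {
    const currentTime = deps.now();
    const entry = attempts[ip];
    if (!entry || currentTime - entry.windowStart >= WINDOW_MS) {
      attempts[ip] = { windowStart: currentTime, count: 1 };
    } else {
      entry.count += 1;
      if (entry.count > MAX_ATTEMPTS) {
        return errorResult(429, "rate_limited", "Too many login attempts");
      }
    }

    const sessionId = deps.verifyCredentials(email, password);
    if (sessionId === null) {
      return errorResult(401, "invalid_credentials", "Invalid email or password");
    }
    return {
      status: 200,
      headers: { "Set-Cookie": `session=${sessionId}; HttpOnly; Secure; SameSite=Lax; Path=/` },
      body: { ok: true },
    };
  };
}
```

Each acceptance criterion maps to one branch, which is what makes the session easy to review: success sets the cookie, failure returns 401 with the standard body, and the per-IP counter guards both paths.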

When sessions run longer than 2 hours or scope creep sets in, output quality degrades sharply. The model loses track of the original objective, makes increasingly fuzzy decisions, and starts adding code that wasn't asked for. Tighter scope = better results.

Treating AI as a code-completion tool

The teams who get the least value from Claude Code and Codex treat them as fancier autocomplete. They use them inside their editor, line-by-line, for the same kind of work they did before — just with less typing.

That's not where the leverage is. The leverage is in delegating whole tasks: "write this service end-to-end including tests and the deployment YAML, here's the spec". The model is good at that. It's mediocre at predicting your next character.

If your team's AI workflow looks identical to their pre-AI workflow plus inline suggestions, you're leaving the multiplier on the floor.

A Worked Example: Shipping a Feature in 90 Minutes

To make this concrete, here's what a typical AI-augmented feature ship actually looks like for us — from a written spec to deployed code with monitoring.

| Time | Step | Who |
|---|---|---|
| 0–10 min | Read spec, locate affected files, sketch acceptance criteria | Human |
| 10–15 min | Write test spec (describe blocks, function signatures) | Human |
| 15–45 min | Implement tests + code; iterate until tests pass | AI under direction |
| 45–60 min | Human review of generated code, test edits, refactor passes | Human |
| 60–75 min | Run full test suite; fix any regressions; verify against acceptance criteria | Both |
| 75–85 min | Commit, open PR, manual smoke-test in preview deploy | Human |
| 85–90 min | Merge, watch deploy, verify in production | Human |

By the table, that is roughly 45 minutes of hands-on human work, 30 minutes of AI-driven implementation under human direction, and 15 minutes shared between the two. The whole thing in an hour and a half. A traditional ship of the same scope is usually a half-day to a day, mostly because the tests and the implementation are written by hand, one after the other, rather than generated together.

This isn't theoretical — it's roughly how every feature in our Production-Grade Systems engagements ships.

The Tool Matters Less Than You Think

Teams ask us which AI tool we use as if the answer is the secret. It's not.

Claude Code and Codex are excellent and they are our primary tools, but the methodology around them is what produces results. An engineer with a mediocre AI tool and a great process will outperform an engineer with a great AI tool and no process. Every time.

The process is simple to describe and hard to internalise: define clear context, specify precise intent, review rigorously, test thoroughly. That works whether you're using Claude Code, Codex, GitHub Copilot, Cursor, or whatever ships next quarter.

If you switched tools tomorrow, you'd keep 90% of your productivity gain. If you abandoned the process tomorrow, you'd lose all of it.

The Real Shift

The 50 deployments taught us something bigger than tooling tips.

AI-augmented engineering isn't about writing code faster. It's about spending human attention on the things that actually matter — architecture, domain modelling, edge cases, user experience — and delegating the mechanical work. The bottleneck moves from typing speed to clarity of intent.

The engineers who thrive in this model are the ones who think of AI as a junior engineer with unlimited typing speed and zero judgment. You still need the judgment. You always will.