Deep dive · Engineering

From Vibe-Coded Prototype to Production-Ready Product

A hands-on engineering playbook for turning an AI-generated prototype into something you can ship and operate — tests, structure, resilience, and scale — using AI agent skills from Clean Code, Refactoring, Release It!, and more.

20 min read · 9 skills · Free & open-source
  1. Clean Code (clean-code): Writing readable, maintainable code
  2. Pragmatic Programmer (pragmatic-programmer): Practical software craftsmanship
  3. Refactoring (refactoring-patterns): Improving code design through systematic refactoring
  4. Legacy Code (working-with-legacy-code): Safely change untested code: seams and characterization tests
  5. Software Design (software-design-philosophy): Reducing complexity through thoughtful software design
  6. Clean Architecture (clean-architecture): Building maintainable, testable software architectures
  7. Release It! (release-it): Production-ready software patterns for stability and resilience
  8. System Design (system-design): Scalable system design patterns and trade-offs
  9. Data-Intensive Apps (ddia-systems): Designing reliable, scalable data systems

You have a thing that works. You described what you wanted, an AI agent generated a few thousand lines, and now there’s a running app: it logs you in, it saves data, the demo lands. On your machine, it is real. The problem is that “works on my machine” and “production-ready” are separated by a gap most founders discover at the worst possible moment — when a paying customer’s data gets corrupted, when a third-party API hangs and takes the whole app down with it, when a single popular user’s query melts the database, or when you open the codebase to fix a bug and cannot find the bug because nothing is where it should be and there are zero tests to tell you whether your fix broke anything else.

That gap is not a character flaw and it is not a sign your prototype was bad. Fast-generated code is tactical code: it optimized for getting something running, not for surviving contact with reality. The good news is that the move from prototype to product is a well-trodden path with a real engineering literature behind it — and the same AI agent that generated the prototype can walk that path with you if you point it at the right principles. That is what this guide does.

The approach here is sequenced, not scattered. You will pin the current behavior so you can change code without fear, clean the worst of the names and functions, restructure the design so logic stops bleeding into the framework, draw an architecture boundary around your business rules, harden every integration point against failure, and finally make the data and scaling decisions that let the thing grow. Each phase is driven by a specific skill — a packaged framework from a canonical engineering book — that you install once and invoke by telling the agent to use it. The skills stack: Working with Legacy Code gives you the safety net, Clean Code and Refactoring clean the surface, Software Design and Clean Architecture fix the structure, Pragmatic Programmer sets the engineering habits that hold it all together, and Release It!, System Design, and Data-Intensive Apps make it operable at scale.

A prototype proves the idea can work. A product proves it keeps working when you are asleep.

One sequencing note before we start. Counterintuitively, the first skill you reach for is the legacy-code one — even though your code was written last week. The reason is the definition that skill runs on: legacy code is not old code, it is code without tests. Your week-old prototype has no tests, so it is legacy code, and every edit you make to it is a gamble until you change that. So we begin by stopping the gambling.

Phase 1 — Build a safety net before you touch anything

The very first instinct after the demo is to start cleaning. Resist it. If you rename a variable, extract a function, or “just fix this one thing” in untested code, you have no way to know whether you preserved behavior or quietly broke a path you weren’t even looking at. The Working with Legacy Code skill exists to break this exact trap. Its core move is cover and modify, never edit and pray: get the code you’re about to change into a test harness, pin down what it actually does right now, and only then change it.

The skill’s central technique is the characterization test. This is not a test of what the code should do — it’s a test of what the code actually does, quirks included. The recipe is mechanical and oddly liberating: call the code, assert something you know is wrong (expect(total).toBe(-1)), run it, read the failure message to learn the real output, then change the assertion to pin that real value. You are not judging the behavior, you are photographing it, so that later, when you refactor, any unintended change announces itself as a red test on your machine instead of as a support ticket.
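A minimal sketch of that recipe, under stated assumptions: legacyTotal is a hypothetical stand-in for a piece of prototype code, and pin is a tiny hand-rolled helper used here in place of a real test runner (with Jest or Vitest you would write expect(...).toBe(...) instead).

```typescript
// Hypothetical prototype function. Its quirks (tax applied before the bulk
// discount, result truncated instead of rounded) are what we want to pin.
function legacyTotal(priceCents: number, quantity: number): number {
  const withTax = priceCents * quantity * 1.08;
  const discounted = quantity >= 10 ? withTax * 0.9 : withTax;
  return Math.trunc(discounted);
}

// Stand-in for a test runner assertion: throws a message showing the real value.
function pin(label: string, actual: number, pinned: number): void {
  if (actual !== pinned) {
    throw new Error(`${label}: expected ${pinned}, got ${actual}`);
  }
}

// Step 1: assert something known to be wrong and read the failure message.
function discoverActual(): string {
  try {
    pin("total", legacyTotal(999, 3), -1);
    return "unexpectedly passed";
  } catch (error) {
    return (error as Error).message;
  }
}
console.log(discoverActual());

// Step 2: replace the wrong value with what the failure reported.
// These assertions judge nothing; they only record current behavior.
pin("total for 3 items", legacyTotal(999, 3), 3236);
pin("total with bulk discount", legacyTotal(999, 10), 9710);
```

If a later refactoring changes the rounding or the discount order, one of the pinned assertions throws, which is the whole point of taking the photograph first.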

The obstacle you will hit immediately is that vibe-coded prototypes love to instantiate their dependencies inline: a route handler that does new Database() in its own body, a service whose constructor opens a connection and news up a payment client. That code can’t be tested because you can’t substitute a fake for the real database. The skill’s answer is seams: places where you can change behavior without editing in that place. The least-invasive seam is almost always Parameterize Constructor with a production default, constructor(db = new ProdDb()), so every existing caller keeps compiling untouched while your tests pass in a fake. For a single dangerous call buried in a method, use Extract and Override Call so a test subclass can stub it.
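Here is one way the Parameterize Constructor seam might look. Everything except the constructor(db = new ProdDb()) shape is an assumption made for illustration: the Db interface, the saveUser method, and the FakeDb are hypothetical, and the database calls are synchronous only to keep the sketch short.

```typescript
interface Db {
  saveUser(email: string): number; // returns the new user id
}

// Stand-in for the real client. In a real app this would need a live
// connection, which is why tests must be able to avoid using it.
class ProdDb implements Db {
  saveUser(_email: string): number {
    throw new Error("ProdDb needs a live database");
  }
}

class UserService {
  private readonly db: Db;

  // The seam: callers that still write `new UserService()` get ProdDb,
  // while tests pass in a fake.
  constructor(db: Db = new ProdDb()) {
    this.db = db;
  }

  signup(email: string): number {
    return this.db.saveUser(email.trim().toLowerCase());
  }
}

// In-memory fake injected through the seam.
class FakeDb implements Db {
  readonly saved: string[] = [];
  saveUser(email: string): number {
    this.saved.push(email);
    return this.saved.length;
  }
}

const fakeDb = new FakeDb();
const service = new UserService(fakeDb);
const newUserId = service.signup("  Ada@Example.com ");
console.log(newUserId, fakeDb.saved);
```

The default argument is only evaluated when the caller omits it, so production code paths are unchanged and the test never touches a real database.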

Prompt

Use the working-with-legacy-code skill to get my UserService class under test before I change it: it news up a Postgres pool and a Stripe client in its constructor, so identify the seams, parameterize the constructor with production defaults so existing callers still compile, and write characterization tests that pin the current signup and billing behavior

Legacy Code

When the change you need to make is genuinely urgent and you cannot get the whole area under test in time, the skill gives you two escape hatches that keep the risk contained to a single line. Sprout Method / Sprout Class: write the new behavior as fresh, fully tested code and call it from one line in the untested host. Wrap Method: rename the old function aside (rawPay) and add your new behavior — logging, validation, a feature gate — in a new wrapper that calls it. Both leave the old behavior byte-for-byte intact and let you test the new code properly, at the honest cost of leaving the host untested for now. Track those as debt and cover them on the next touch.
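A small sketch of Wrap Method combined with a sprouted validation function. The rawPay name comes from the text above; the Payment shape, the ledger and auditLog arrays, and validateAmount are hypothetical stand-ins for whatever the untested host really does.

```typescript
type Payment = { customerId: string; amountCents: number };

const ledger: string[] = []; // stand-in for the old function's side effect
const auditLog: string[] = [];

// The old function, renamed aside and otherwise left untouched.
function rawPay(payment: Payment): string {
  ledger.push(`${payment.customerId}:${payment.amountCents}`);
  return "charged";
}

// Sprouted code: new behavior written fresh so it can be tested on its own.
function validateAmount(payment: Payment): void {
  if (!Number.isInteger(payment.amountCents) || payment.amountCents <= 0) {
    throw new RangeError(
      `Invalid amount ${payment.amountCents} for customer ${payment.customerId}`
    );
  }
}

// The wrapper takes over the old name, so existing callers pick up the
// new validation and logging without any edit to rawPay itself.
function pay(payment: Payment): string {
  validateAmount(payment);
  auditLog.push(`pay requested by ${payment.customerId}`);
  return rawPay(payment);
}

const receipt = pay({ customerId: "c1", amountCents: 500 });
console.log(receipt, ledger, auditLog);
```

The old body is unchanged, and the only untested code left is the host you already had.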

Prompt

Use the working-with-legacy-code skill to add referral-code validation to our 300-line checkout handler without rewriting it: use Sprout Class to put the new validation logic in a fully tested ReferralValidator, call it from a single line in the existing flow, and leave the rest of the handler untouched for now

Legacy Code

By the end of this phase you do not have a clean codebase. You have something more valuable: a codebase whose current behavior is pinned, so that everything you do from here is verifiable.

Phase 2 — Make the code readable, function by function

Now that changes are safe, attack legibility — because you (and your agent) are about to read this code far more than you write it, and the read-to-write ratio is well over 10:1. This is the Clean Code skill’s home turf. Its core principle is that code is read more often than written, so optimize for the reader, and it scores code 0–10 against six disciplines: meaningful names, small single-purpose functions, comments that explain why not what, clean error handling, clean tests, and a catalog of smells.

Start with names and functions, because that’s where generated code is weakest. AI-generated prototypes are full of data, result, handleStuff, temp2, and 120-line functions that validate, transform, call an API, format a response, and log, all in one breath. Clean Code’s rules are specific and enforceable: names reveal intent (elapsedTimeInDays, not d); booleans read as predicates (isActive, hasPermission); functions do one thing at one level of abstraction, ideally a handful of lines, with zero to two arguments; flag arguments are a smell because a function called as render(isPrint) is really two functions. Command-Query Separation matters here too: a function either changes state or returns a value, never both, so a name never hides a side effect.
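To make two of those rules concrete, here is a short sketch of splitting a flag argument and of Command-Query Separation. The function and class names are invented for illustration, as is the lockout threshold.

```typescript
// Before (for contrast): function render(body: string, isPrint: boolean): string
// After: two intent-revealing functions, no flag argument.
function renderForScreen(body: string): string {
  return `<main>${body}</main>`;
}

function renderForPrint(body: string): string {
  return `<main class="print">${body}</main>`;
}

const MAX_FAILED_LOGINS = 3; // hypothetical limit

class LoginAttempts {
  private failedCount = 0;

  // Command: changes state, returns nothing.
  recordFailure(): void {
    this.failedCount += 1;
  }

  // Query: returns a value, changes nothing, and reads as a predicate.
  isLockedOut(): boolean {
    return this.failedCount >= MAX_FAILED_LOGINS;
  }
}

const attempts = new LoginAttempts();
attempts.recordFailure();
attempts.recordFailure();
const lockedAfterTwo = attempts.isLockedOut();
attempts.recordFailure();
const lockedAfterThree = attempts.isLockedOut();
console.log(lockedAfterTwo, lockedAfterThree);
```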

Prompt

Use the clean-code skill to review the orders module and score it 0-10 on Clean Code: flag every name that hides intent, every function over 20 lines or doing more than one thing, every flag argument, and every place a function both mutates state and returns a value, then give me the top ten fixes in priority order

Clean Code

The second front is error handling, where prototypes are often genuinely dangerous. Generated code tends to either swallow everything in a bare catch that hides real bugs, or return null on failure so that null-checks metastasize across every caller and one missing check crashes far from the source. Clean Code’s prescriptions: prefer exceptions to return codes so the happy path stays uncluttered; catch specific exceptions and let the rest propagate; never return or pass null — use an empty collection, an Optional, or a Null Object with safe default behavior; and put context in every error you throw (Failed to save invoice #1234 for customer Acme) so debugging starts with a fact, not a hunt.
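A sketch of those prescriptions in one place, assuming a hypothetical in-memory invoice store: a specific error type that carries context, and a lookup that returns an empty array instead of null.

```typescript
type Invoice = { id: number; customer: string; totalCents: number };

// A specific error type whose message names the operation and the state.
class InvoiceSaveError extends Error {
  readonly invoiceId: number;

  constructor(invoice: Invoice, reason: string) {
    super(`Failed to save invoice #${invoice.id} for customer ${invoice.customer}: ${reason}`);
    this.name = "InvoiceSaveError";
    this.invoiceId = invoice.id;
  }
}

const invoiceStore = new Map<string, Invoice[]>(); // stand-in for a database

function saveInvoice(invoice: Invoice): void {
  if (invoice.totalCents < 0) {
    throw new InvoiceSaveError(invoice, "total is negative");
  }
  const existing = invoiceStore.get(invoice.customer) ?? [];
  invoiceStore.set(invoice.customer, [...existing, invoice]);
}

// Returns an empty array, never null, so callers need no null check.
function invoicesFor(customer: string): Invoice[] {
  return invoiceStore.get(customer) ?? [];
}

saveInvoice({ id: 1, customer: "Acme", totalCents: 5000 });
console.log(invoicesFor("Acme").length, invoicesFor("Nobody").length);
```

Callers can iterate over invoicesFor(...) unconditionally, and a failed save reports which invoice and which customer were involved.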

Prompt

Use the clean-code skill to audit error handling across our API layer: find every bare catch that swallows exceptions, every function that returns null on failure, and every error thrown without context, then refactor them to throw specific exceptions with operation and state details and replace null returns with Optionals or empty collections

Clean Code

A crucial discipline from this skill, reinforced by every other skill in the stack: never refactor and add a feature in the same step. Wear one hat at a time. Clean for clarity, run the characterization tests from Phase 1, commit. Then, separately, change behavior. When a commit is purely structural, a red test can only mean you broke something — which makes the failure trivial to diagnose. Mixing the two makes every failure ambiguous.

Phase 3 — Apply named refactorings to the structure

Clean Code tells you what good looks like; the Refactoring skill gives you the named, mechanical transformations to get there without breaking behavior. The distinction matters: this skill is a catalog. Each code smell maps to a specific refactoring with known mechanics, so instead of vaguely “cleaning up,” you say “this is Feature Envy” or “this is a Long Method” and apply the prescribed move. That precision is exactly what makes an AI agent effective here — it can name the smell, cite the transformation, and execute it in small steps.

The single most important transformation is Extract Method, and it’s the one your prototype needs most. The heuristic is beautiful in its simplicity: if you feel the urge to write a comment explaining what a block of code does, extract that block into a method and use the comment as its name. A 100-line generated function becomes a short, readable sequence of named steps — validateInput(), applyDiscounts(), persistOrder() — each of which you can now understand, test, and reuse. From there the catalog branches: Replace Magic Number with Symbolic Constant for the bare 5 and 86400 scattered everywhere; Decompose Conditional and Replace Nested Conditional with Guard Clauses for the pyramids of if/else that generated code produces; Replace Conditional with Polymorphism when you see a switch on a type code; Introduce Parameter Object when the same three or four arguments travel together through every function.
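As an illustration of where those refactorings tend to land, here is a sketch of a processOrder function after Extract Method, guard clauses, and symbolic constants. The Order shape, the limits, and the discount rate are assumptions made for the example, not values from any real codebase.

```typescript
type Order = {
  items: { priceCents: number; qty: number }[];
  couponCode?: string;
};

// Replace Magic Number with Symbolic Constant.
const MAX_ITEMS_PER_ORDER = 5;
const COUPON_DISCOUNT_RATE = 0.1;

// Guard clauses replace a nested if/else pyramid.
function validateInput(order: Order): void {
  if (order.items.length === 0) {
    throw new Error("Order has no items");
  }
  if (order.items.length > MAX_ITEMS_PER_ORDER) {
    throw new Error(`Order exceeds ${MAX_ITEMS_PER_ORDER} items`);
  }
}

function subtotalCents(order: Order): number {
  return order.items.reduce((sum, item) => sum + item.priceCents * item.qty, 0);
}

function applyDiscounts(order: Order, subtotal: number): number {
  if (order.couponCode === undefined) {
    return subtotal;
  }
  return Math.round(subtotal * (1 - COUPON_DISCOUNT_RATE));
}

// After Extract Method the top-level function reads as a list of named steps.
function processOrder(order: Order): number {
  validateInput(order);
  return applyDiscounts(order, subtotalCents(order));
}

const discountedTotal = processOrder({
  items: [{ priceCents: 1000, qty: 2 }],
  couponCode: "WELCOME",
});
console.log(discountedTotal);
```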

Prompt

Use the refactoring-patterns skill to refactor the processOrder function in our billing service using named refactorings: name each smell you find, then apply Extract Method to break it into readable steps, Replace Nested Conditional with Guard Clauses to flatten the validation pyramid, and Replace Magic Number with Symbolic Constant for the hardcoded limits, running tests after each single transformation

Refactoring

The skill is strict about workflow, and you should hold your agent to it: run tests (green), apply one small transformation, run tests (green), commit. If the tests go red, you revert — you do not debug a half-finished refactoring. Small steps make the cause of any failure obvious (it was the last thing you did) and reverting costs seconds. Two patterns from the catalog are especially useful when prototype code is tangled: Preparatory Refactoring (“make the change easy, then make the easy change” — clean the insertion point before you add a feature there) and the Rule of Three (tolerate a duplication once, notice it twice, extract it on the third occurrence, so you don’t abstract prematurely).

Prompt

Use the refactoring-patterns skill to do a preparatory refactoring before I add subscription pricing alongside our one-time pricing: restructure the existing pricing code to make the new path easy to add, extract the duplicated tax-calculation logic that already appears in three places, and keep every step behavior-preserving with tests green between transformations

Refactoring

Phase 4 — Reduce complexity: deep modules, not more classes

At this point the code is readable and locally well-structured, but you may have a subtler problem — one that AI agents, left unsupervised, actively create. Asked to “clean up” or “make it modular,” an agent will often shatter a function into a dozen tiny classes and interfaces, proud of the apparent decomposition. The Software Design skill, from Ousterhout’s A Philosophy of Software Design, is the antidote. Its governing principle: complexity is the enemy, and the goal is to minimize the complexity a module imposes on the rest of the system.

The skill’s most important idea is deep vs. shallow modules. A module’s depth is the functionality it provides divided by the complexity of its interface. A deep module hides significant machinery behind a simple interface — Unix file I/O is the canonical example: open, read, write, close conceal disk blocks, buffering, caching, and encoding. A shallow module has an interface nearly as complex as its implementation, so it adds cognitive load without hiding much. The disease of producing too many small shallow classes even has a name: “classitis.” The fix is to merge related shallow classes into deeper ones — a RequestParser plus RequestValidator plus RequestProcessor that always run in sequence and share state probably want to be one RequestHandler.

Prompt

Use the software-design-philosophy skill to analyze the notifications module for classitis: identify which classes are shallow (interface almost as complex as the implementation), where information is leaking across module boundaries, and which pass-through methods add indirection without value, then propose how to consolidate them into deeper modules with simpler interfaces

Software Design

The second big idea is information hiding and its inverse red flag, information leakage — when a single design decision (a date format, a wire protocol, a database assumption) is reflected in multiple modules, so changing it means editing all of them. A common cause in prototypes is temporal decomposition: organizing code by the order things happen (“first read the config, then validate it, then apply it”) instead of by knowledge, which forces every phase to share the same details. The skill’s guidance is to organize modules around knowledge — one module owns config, one owns serialization — so each hides a decision the others don’t need to know.

Finally, this skill reframes the whole prototype-to-product effort as a choice between tactical and strategic programming. Tactical programming ships features fast and accumulates complexity with every shortcut; strategic programming invests an extra 10–20% of development time to keep the design clean and treats every change as a chance to improve structure. Your prototype is the output of pure tactical programming. The point of this guide is to flip the mode, and the skill is emphatic that startups need this most, because early shortcuts compound into crippling debt exactly as the team and the codebase grow.

Phase 5 — Draw the architecture boundary around your business rules

Your modules are now deep and your functions clean, but there’s a structural question that determines whether this product is a joy or a nightmare to evolve: does your business logic depend on your framework and database, or do they depend on it? In a vibe-coded prototype the answer is almost always backwards. Business rules live inside route handlers, ORM models are the domain objects, and pricing logic is interleaved with SQL. The Clean Architecture skill fixes this with one rule.

That rule is the Dependency Rule: source code dependencies point inward, toward higher-level policy. Picture concentric circles — Entities (enterprise business rules) at the center, then Use Cases (application-specific rules), then Interface Adapters (controllers, presenters, gateways), with Frameworks and Drivers (the web framework, the ORM, the message queue) on the outside. Nothing in an inner circle may name anything in an outer one. The database is a detail. The web is a detail. They are plugins to your business rules, not the skeleton of your application. The mechanism that enforces this is Dependency Inversion: a Use Case defines a UserRepository interface, and a PostgresUserRepository in the outer adapter layer implements it — so the business logic depends on an abstraction it owns, and Postgres depends on that abstraction, never the reverse.
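The Dependency Inversion mechanism described above could be sketched like this. UserRepository is the interface named in the text; the RegisterUser use case, the existsByEmail and save methods, and the in-memory adapter are hypothetical, and the repository is synchronous only to keep the example short (a real one would return promises).

```typescript
// Entity: plain data, no framework or ORM types.
type User = { id: string; email: string };

// The use-case layer owns this interface; outer layers implement it.
interface UserRepository {
  existsByEmail(email: string): boolean;
  save(user: User): void;
}

// Use case: an application rule that depends only on the abstraction above.
class RegisterUser {
  private readonly users: UserRepository;

  constructor(users: UserRepository) {
    this.users = users;
  }

  execute(email: string): User {
    const normalized = email.trim().toLowerCase();
    if (this.users.existsByEmail(normalized)) {
      throw new Error(`Email already registered: ${normalized}`);
    }
    const user: User = { id: `user:${normalized}`, email: normalized };
    this.users.save(user);
    return user;
  }
}

// Adapter in the outer layer. A PostgresUserRepository would implement the
// same interface with SQL; this in-memory one is enough for tests.
class InMemoryUserRepository implements UserRepository {
  private readonly byEmail = new Map<string, User>();

  existsByEmail(email: string): boolean {
    return this.byEmail.has(email);
  }

  save(user: User): void {
    this.byEmail.set(user.email, user);
  }
}

// Composition root: the only place that picks a concrete implementation.
const registerUser = new RegisterUser(new InMemoryUserRepository());
const created = registerUser.execute("Ada@Example.com");
console.log(created);
```

Nothing in RegisterUser names a database or a framework, so swapping the adapter is a one-line change at the composition root.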

Prompt

Use the clean-architecture skill to map the dependency graph of our Express app and find every violation of the Dependency Rule where business logic imports the ORM or the framework, then show me how to extract our order pricing rules into framework-free Use Cases that depend on repository interfaces, with the Prisma implementation moved out to an adapter layer

Clean Architecture

The practical payoff is enormous and immediate: once the boundary exists, you can test your business rules with no database, no web server, and no framework running — which makes the tests from Phase 1 fast and trustworthy — and you can swap Postgres for DynamoDB, or Express for Fastify, as a localized change instead of a rewrite. This skill pairs the Dependency Rule with the SOLID principles as its mid-level building blocks: Single Responsibility (a module has one reason to change because it serves one actor), Open-Closed (extend by adding code, not modifying existing code), and especially Dependency Inversion, which is how the boundary gets built in practice.

Prompt

Use the clean-architecture skill to apply Dependency Inversion to our PaymentService, which talks directly to the Stripe SDK throughout the codebase: define a PaymentGateway interface owned by our business layer, wrap Stripe behind a StripeGateway adapter that implements it, and wire the concrete choice in our composition root so we could swap to Braintree without touching business logic

Clean Architecture

A warning the skill is loud about, because it’s a trap founders fall into: do not assume that splitting into microservices gives you good architecture for free. A set of services sharing one fat data model is just a distributed monolith — a monolith with network calls between its modules, which is strictly worse than a clean monolith. Apply the Dependency Rule within your service first. Services are deployment boundaries, not automatically architectural ones.

Phase 6 — Lock in the engineering habits that keep it clean

Architecture decays without discipline, and discipline is a set of habits. The Pragmatic Programmer skill supplies the meta-principles that hold every other phase together — it’s less about any single line of code and more about how to think so the codebase stays easy to change, understand, and trust.

Four of its principles matter most in the prototype-to-product transition. DRY (Don’t Repeat Yourself) — but the skill is precise where most people are sloppy: DRY is about knowledge, not text. Two identical-looking blocks that encode different business rules are not duplication, and DRY-ing them couples concepts that should stay independent. The duplication that hurts is duplicated knowledge: the same validation rule in the client and the server, the same tax logic in three handlers. Orthogonality: components are orthogonal if changing one doesn’t affect another — change the auth provider and billing shouldn’t care — which is the design value the Clean Architecture boundary buys you. The Broken Window Theory: one unrepaired hack signals that nobody’s minding the quality, and the threshold for the next hack drops to zero; fix problems immediately or “board them up” with a tracked ticket, never an untracked // TODO: fix later. And reversibility: abstract third-party vendors behind your own interfaces so no decision is permanent — the same move the architecture phase made for Stripe and Postgres.

Prompt

Use the pragmatic-programmer skill to audit our codebase for the kinds of duplicated knowledge that cause bugs (validation rules repeated on client and server, the same business calculation in multiple places) while ignoring coincidental code that merely looks similar, and flag any broken windows like untracked TODOs or quick hacks that need boarding up with a ticket

Pragmatic Programmer

This skill is also where you set up the tracer bullet habit for everything you build after the prototype. A tracer bullet is a thin but real end-to-end slice — UI to API to database to response — that you keep, as opposed to a prototype you throw away. Building the thinnest complete path first surfaces integration problems early and gives you a skeleton to flesh out, instead of building each layer in isolation and discovering at the end that they don’t fit. For your next feature, that’s the move: one real vertical slice, then expand.

Phase 7 — Make it survive production: stability patterns

Everything so far has made the code good. This phase makes it survivable, and it’s the one prototypes ignore most completely. The Release It! skill opens with a sentence worth memorizing: the software that passes QA is not the software that survives production. Production is hostile. Its governing principle is that every system will eventually be pushed past its limits, and the only question is whether it degrades gracefully or collapses.

The skill catalogs stability anti-patterns, and the number-one killer is integration points — every socket, HTTP call, and queue your prototype touches. The specific failure that destroys vibe-coded apps is that they call third-party APIs with no timeout. When that API slows down (not even fails — just slows), your request threads block waiting, the thread pool exhausts, and your entire app stops responding to everyone, because of a dependency you don’t even control. The skill is blunt: a slow response is worse than no response, because it ties up resources and propagates delay up the whole call chain.

Against each anti-pattern stands a stability pattern. Timeouts on every outbound call — connect and read — are non-negotiable. The Circuit Breaker wraps a failing dependency: after a threshold of failures it “trips” open and fails fast instead of waiting, then periodically tests recovery in a half-open state — a tripped breaker is the system working correctly, protecting itself. Bulkheads isolate resource pools so one drowning dependency can’t drain the threads the rest of the app needs (a dedicated connection pool for the payment gateway, separate from search). Retry with exponential backoff and jitter so a recovering service isn’t immediately flattened by a synchronized thundering herd. And Steady State — prototypes accumulate cruft (sessions, logs, temp files) until a disk fills at 3 a.m.; design the cleanup in from the start.
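A deliberately simplified circuit breaker, to show the closed, open, and half-open states in code. The assumptions are worth stating: it counts consecutive failures rather than failures inside a time window, it wraps synchronous calls (a real one would wrap promises and pair with timeouts), and the clock is injected so the behavior can be tested without waiting. All names are hypothetical.

```typescript
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAtMs = 0;
  private readonly failureThreshold: number;
  private readonly resetAfterMs: number;
  private readonly now: () => number;

  constructor(failureThreshold: number, resetAfterMs: number, now: () => number = Date.now) {
    this.failureThreshold = failureThreshold;
    this.resetAfterMs = resetAfterMs;
    this.now = now;
  }

  call<T>(operation: () => T): T {
    if (this.state === "open") {
      if (this.now() - this.openedAtMs < this.resetAfterMs) {
        throw new Error("Circuit open: failing fast"); // dependency is not called
      }
      this.state = "half-open"; // let one trial call through
    }
    try {
      const result = operation();
      this.state = "closed";
      this.failures = 0;
      return result;
    } catch (error) {
      this.failures += 1;
      if (this.state === "half-open" || this.failures >= this.failureThreshold) {
        this.state = "open";
        this.openedAtMs = this.now();
      }
      throw error;
    }
  }

  currentState(): BreakerState {
    return this.state;
  }
}

// Usage with a fake clock and a provider that always fails.
let fakeNowMs = 0;
const breaker = new CircuitBreaker(2, 1000, () => fakeNowMs);
let providerCalls = 0;
function flakyProvider(): string {
  providerCalls += 1;
  throw new Error("provider timed out");
}
function attempt(operation: () => string): string {
  try {
    return breaker.call(operation);
  } catch (error) {
    return (error as Error).message;
  }
}
const outcomes = [attempt(flakyProvider), attempt(flakyProvider), attempt(flakyProvider)];
fakeNowMs = 1000; // the reset interval has elapsed
const recovered = attempt(() => "ok");
console.log(outcomes, providerCalls, recovered, breaker.currentState());
```

The third attempt never reaches the provider, which is the behavior that keeps a slow dependency from tying up the rest of the app.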

Prompt

Use the release-it skill to audit every outbound call in our app for stability: find external API calls and database queries with no timeout, add connect and read timeouts to all of them, wrap our payment and email provider calls in circuit breakers that trip after five failures in sixty seconds, and isolate their connection pools with bulkheads so one slow dependency cannot freeze the whole service

Release It!

Two more areas from this skill turn a prototype into something operable. Deployment and release: decouple deploying code from releasing it to users via feature flags, so you can ship dark, enable for 10%, and roll back a release in seconds without redeploying; make database migrations backward-compatible (expand-contract) so old and new code can run simultaneously during a rolling deploy. And observability, which prototypes have essentially none of: you cannot operate what you cannot see. Add deep health checks (the /health endpoint verifies the database, cache, and queue are reachable, not just that the process is alive), instrument the RED metrics for each endpoint (Rate, Errors, Duration), and alert on symptoms users actually feel — error rate and latency — not on causes like CPU.
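To show what the RED numbers are, here is a toy in-memory recorder. It is a sketch under assumptions: the class and method names are invented, the percentile uses the simple nearest-rank method, and a production service would normally export these figures through a metrics library rather than keep raw samples in memory.

```typescript
type RequestSample = { durationMs: number; ok: boolean };
type RedSummary = { ratePerSecond: number; errorRate: number; p95Ms: number };

class RedMetrics {
  private readonly samplesByEndpoint = new Map<string, RequestSample[]>();

  record(endpoint: string, durationMs: number, ok: boolean): void {
    const samples = this.samplesByEndpoint.get(endpoint) ?? [];
    samples.push({ durationMs, ok });
    this.samplesByEndpoint.set(endpoint, samples);
  }

  // Rate, Errors, Duration for one endpoint over an observation window.
  summarize(endpoint: string, windowSeconds: number): RedSummary {
    const samples = this.samplesByEndpoint.get(endpoint) ?? [];
    if (samples.length === 0) {
      return { ratePerSecond: 0, errorRate: 0, p95Ms: 0 };
    }
    const durations = samples.map((s) => s.durationMs).sort((a, b) => a - b);
    const p95Index = Math.ceil(durations.length * 0.95) - 1; // nearest rank
    const errors = samples.filter((s) => !s.ok).length;
    return {
      ratePerSecond: samples.length / windowSeconds,
      errorRate: errors / samples.length,
      p95Ms: durations[p95Index] ?? 0,
    };
  }
}

// Ten requests in a ten-second window; the slowest one failed.
const metrics = new RedMetrics();
for (let i = 1; i <= 10; i++) {
  metrics.record("/checkout", i * 10, i !== 10);
}
const checkoutSummary = metrics.summarize("/checkout", 10);
console.log(checkoutSummary);
```

An alert on errorRate or p95Ms crossing a threshold is a symptom-based alert in the sense used above: it fires on what users feel, not on CPU.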

Prompt

Use the release-it skill to add production observability to our service: implement a deep health check endpoint that verifies the database, Redis, and queue are reachable, instrument RED metrics (request rate, error rate, p50/p95/p99 latency) per endpoint, and design alerts that fire on user-facing symptoms like error rate above one percent rather than on CPU

Release It!

Phase 8 — Design for the scale you actually have (and the scale you’ll reach)

Resilient code can still fall over under load if the system around it was never sized. The System Design skill brings structured thinking to scaling, and its first principle is a corrective to how prototypes grow: start with requirements, not solutions. Before adding a cache or a queue or a second database, write down the numbers — daily active users, queries per second, storage growth, the latency and availability you’re promising. Premature scaling machinery is as harmful as missing it.

The skill’s most immediately useful tool is back-of-the-envelope estimation: a two-minute calculation that prevents both over-provisioning and 3 a.m. outages. Average QPS is daily-active-users × actions-per-user-per-day ÷ 86,400 seconds, with peak typically 2–5× the average. Storage is records-per-day × record-size × retention period in days. These order-of-magnitude numbers tell you which component bottlenecks first and roughly when, so you scale deliberately instead of reactively. From there the skill assembles systems from a standard toolkit of building blocks introduced when the estimates justify them: a CDN for static assets, a cache (cache-aside Redis with a defined TTL and explicit invalidation on writes) in front of read-heavy queries, a message queue to decouple slow work from the request path and absorb spikes (enqueue the image resize or the email send; let workers pull at their own pace), and read replicas before you ever consider sharding.
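The two formulas are simple enough to run as a worked example. Every input below is an assumption chosen for illustration (50,000 daily active users, 20 actions per user per day, a 3× peak factor, one million 1 KB records per day kept for a year); substitute your own numbers.

```typescript
const SECONDS_PER_DAY = 86400;

type LoadAssumptions = {
  dailyActiveUsers: number;
  actionsPerUserPerDay: number;
  peakMultiplier: number; // typically somewhere in the 2-5 range
  recordsPerDay: number;
  recordSizeBytes: number;
  retentionDays: number;
};

function estimateLoad(a: LoadAssumptions): { averageQps: number; peakQps: number; storageBytes: number } {
  const averageQps = (a.dailyActiveUsers * a.actionsPerUserPerDay) / SECONDS_PER_DAY;
  return {
    averageQps,
    peakQps: averageQps * a.peakMultiplier,
    storageBytes: a.recordsPerDay * a.recordSizeBytes * a.retentionDays,
  };
}

const estimate = estimateLoad({
  dailyActiveUsers: 50000,
  actionsPerUserPerDay: 20,
  peakMultiplier: 3,
  recordsPerDay: 1000000,
  recordSizeBytes: 1000,
  retentionDays: 365,
});

console.log(
  estimate.averageQps.toFixed(1), // average queries per second
  estimate.peakQps.toFixed(1), // peak queries per second
  estimate.storageBytes / 1e9 // storage after one year, in GB
);
```

With these inputs the result is roughly 11.6 QPS on average, about 34.7 QPS at peak, and 365 GB of storage per year, which is comfortably within reach of a single well-sized database.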

Prompt

Use the system-design skill to do back-of-the-envelope estimation for our app at 50k daily active users and tell me which component bottlenecks first: calculate average and peak QPS and yearly storage growth, then recommend in priority order where to add a cache, where to move slow work behind a message queue, and at what point we will need read replicas — without over-engineering for scale we do not have yet

System Design

The skill is opinionated about the order of scaling moves, and it maps cleanly onto a maturing prototype: scale vertically first (a bigger box is simple), then add a cache for read-heavy paths, then read replicas for the database, and only shard as a last resort because of its operational complexity. It also gives you a library of common designs (rate limiter, news feed, notification system), so when you need to protect your API you reach for a known token-bucket pattern at the gateway that returns HTTP 429 with a Retry-After header, rather than inventing one.
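For reference, a token bucket fits in a few dozen lines. This is a single-process sketch with hypothetical names and an injected clock; a gateway serving several instances would need shared state (for example in a cache), which is out of scope here.

```typescript
type Decision = { allowed: true } | { allowed: false; retryAfterSeconds: number };

class TokenBucket {
  private tokens: number;
  private lastRefillMs: number;
  private readonly capacity: number;
  private readonly refillPerSecond: number;
  private readonly now: () => number;

  constructor(capacity: number, refillPerSecond: number, now: () => number = Date.now) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.now = now;
    this.tokens = capacity; // start full so short bursts are allowed
    this.lastRefillMs = now();
  }

  tryConsume(): Decision {
    // Refill in proportion to elapsed time, capped at capacity.
    const nowMs = this.now();
    const refilled = ((nowMs - this.lastRefillMs) / 1000) * this.refillPerSecond;
    this.tokens = Math.min(this.capacity, this.tokens + refilled);
    this.lastRefillMs = nowMs;

    if (this.tokens >= 1) {
      this.tokens -= 1;
      return { allowed: true };
    }
    const retryAfterSeconds = Math.ceil((1 - this.tokens) / this.refillPerSecond);
    return { allowed: false, retryAfterSeconds };
  }
}

// Maps a decision to the status and header a gateway would send.
function toHttpResponse(decision: Decision): { status: number; headers: Record<string, string> } {
  if (decision.allowed) {
    return { status: 200, headers: {} };
  }
  return { status: 429, headers: { "Retry-After": String(decision.retryAfterSeconds) } };
}

// Burst of three requests against a bucket of two, refilled at one per second.
let bucketClockMs = 0;
const bucket = new TokenBucket(2, 1, () => bucketClockMs);
const burst = [bucket.tryConsume(), bucket.tryConsume(), bucket.tryConsume()].map(toHttpResponse);
bucketClockMs = 1000;
const afterWait = toHttpResponse(bucket.tryConsume());
console.log(burst, afterWait);
```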

Phase 9 — Get the data layer right, because data outlives code

The last phase is the one with the longest shadow. The Data-Intensive Apps skill, from Kleppmann’s Designing Data-Intensive Applications, runs on a principle that should reframe every database decision you make: data outlives code. You will rewrite the application and swap frameworks several times, but the data persists for years — so the correctness, durability, and evolvability of the data layer deserve more care than anything else in the stack. A prototype that chose its database by familiarity (“it’s what the agent generated”) is carrying a decision it never actually made.

Two of the skill’s domains will bite a growing product first. Transactions and isolation: most databases do not default to serializable isolation (they default to a weaker level such as read committed or snapshot isolation), which permits anomalies your prototype’s naive code will trigger under concurrency. The classic is write skew: two transactions read the same data, each decides independently, and each writes a different row, with no single-row lock to stop them. Think two requests both reading “inventory is 1” and each inserting its own order for the last item. The fix is explicit (SELECT ... FOR UPDATE to lock the rows the decision depends on, or a genuinely serializable transaction for a balance transfer), and the skill insists you know your database’s actual default rather than assume safety. Replication lag: the moment you add the read replicas that System Design recommended, asynchronous replication means a user can write data and then read a stale replica that doesn’t have it yet, so you implement read-your-writes and monotonic-read guarantees deliberately, not by hoping.
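One common way to get read-your-writes is to send a user's reads to the primary for a short window after that user's last write. The sketch below shows only the routing decision; the ReadRouter name, the per-user map, and the fixed window are assumptions, and the window has to be chosen to exceed your actual replication lag.

```typescript
type ReadTarget = "primary" | "replica";

class ReadRouter {
  private readonly lastWriteAtMs = new Map<string, number>();
  private readonly stickyWindowMs: number;
  private readonly now: () => number;

  constructor(stickyWindowMs: number, now: () => number = Date.now) {
    this.stickyWindowMs = stickyWindowMs;
    this.now = now;
  }

  // Call this whenever the user performs a write.
  recordWrite(userId: string): void {
    this.lastWriteAtMs.set(userId, this.now());
  }

  // Recent writers read from the primary so they see their own data;
  // everyone else can safely read from a replica.
  targetForRead(userId: string): ReadTarget {
    const lastWrite = this.lastWriteAtMs.get(userId);
    if (lastWrite === undefined) {
      return "replica";
    }
    return this.now() - lastWrite < this.stickyWindowMs ? "primary" : "replica";
  }
}

// Usage with a fake clock and a hypothetical 5-second window.
let routerClockMs = 0;
const router = new ReadRouter(5000, () => routerClockMs);
const beforeWrite = router.targetForRead("user-1");
router.recordWrite("user-1");
const rightAfterWrite = router.targetForRead("user-1");
const otherUser = router.targetForRead("user-2");
routerClockMs = 5000;
const afterWindow = router.targetForRead("user-1");
console.log(beforeWrite, rightAfterWrite, otherUser, afterWindow);
```

This covers a user reading their own writes from one device; monotonic reads and cross-device consistency need additional mechanisms, such as pinning a user to one replica.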

Prompt

Use the ddia-systems skill to review our database access for concurrency bugs our prototype will hit under real load: find places vulnerable to write skew like inventory decrements and balance updates that read-then-write without a lock, tell me our database's actual default isolation level and the anomalies it permits, and fix the risky paths with SELECT FOR UPDATE or a serializable transaction

Data-Intensive Apps

The skill also gives you the vocabulary to choose datastores by requirements instead of habit, and to decouple them. Match the data model to the access pattern — relational for many-to-many and ad-hoc queries, document for self-contained aggregates, graph for relationship traversal — and recognize that polyglot persistence (different stores for different workloads) is often correct rather than a failure. Understand the storage engine trade-off: LSM-trees (Cassandra, RocksDB) for very high write throughput, B-trees (Postgres) for balanced read/write OLTP, columnar (ClickHouse) for analytics. And when you outgrow a single hot path, separate the system of record from derived data with change data capture or event sourcing, so a search index or a cache is something you can rebuild from the source of truth rather than a fragile second master.

Prompt

Use the ddia-systems skill to evaluate whether to stay single-database or go polyglot before we add full-text search and an activity feed to our Postgres-only app: assess each workload (relational profiles, write-heavy time-series feed, search across millions of documents) against data model fit and storage-engine characteristics, and if we add a search index show me how to keep it in sync with change data capture instead of dual writes

Data-Intensive Apps

Your checklist

  • Treat the prototype as legacy code: write characterization tests that pin current behavior before changing anything.
  • Break dependencies with the least-invasive seam — Parameterize Constructor with a production default — so tests can inject fakes.
  • Use Sprout/Wrap to add urgent behavior safely when full coverage isn’t possible yet, and track the untested host as debt.
  • Rename for intent; Extract Method every long function into named, single-purpose steps.
  • Replace swallowed exceptions and null returns with specific exceptions, Optionals, and contextful error messages.
  • Apply named refactorings one at a time, tests green between each step; revert (never debug) a red refactoring.
  • Never refactor and change behavior in the same commit — one hat at a time.
  • Merge shallow classes into deep modules; kill classitis and information leakage.
  • Enforce the Dependency Rule: business rules depend on interfaces they own, not on the framework or ORM.
  • Wrap every third-party vendor behind your own interface for reversibility.
  • Put a timeout on every outbound call; add circuit breakers and bulkheads on integration points.
  • Add deep health checks, RED metrics, and symptom-based alerts before launch.
  • Decouple deploy from release with feature flags; make migrations backward-compatible.
  • Do back-of-the-envelope estimation before adding scaling machinery; cache and queue only when numbers justify it.
  • Know your database’s actual isolation level; fix write skew and replication-lag reads explicitly.
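
To make one checklist item concrete, here is a minimal sketch of Parameterize Constructor with a production default. The `TrialChecker` class and its 14-day trial length are hypothetical; the point is only that production callers change nothing, while a test can inject a fake clock.

```python
import time
from typing import Callable


class TrialChecker:
    """Hypothetical class that previously called time.time() directly."""

    TRIAL_SECONDS = 14 * 24 * 60 * 60  # assumed 14-day trial

    def __init__(self, clock: Callable[[], float] = time.time):
        # Production code passes nothing and gets the real clock;
        # tests pass a fake so the result is deterministic.
        self._clock = clock

    def is_expired(self, started_at: float) -> bool:
        return self._clock() - started_at > self.TRIAL_SECONDS
```

A test can then construct `TrialChecker(clock=lambda: some_fixed_time)` and assert on `is_expired` without waiting or patching globals.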

Common mistakes

Cleaning up before writing a single test. This is the cardinal sin and the reason this guide starts with the legacy-code skill. Renaming and extracting in untested code feels productive and is actually gambling — you cannot know you preserved behavior. Pin behavior first, then clean.
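
A characterization test can be as small as the sketch below. The `legacy_shipping_fee` function is a hypothetical stand-in for untested prototype code; the expected values were obtained by running the code and recording what it returned, not by deciding what it should return.

```python
import unittest


def legacy_shipping_fee(subtotal: float, country: str) -> float:
    """Stand-in for untested prototype code (hypothetical)."""
    if country == "US":
        return 0.0 if subtotal >= 50 else 5.0
    return 15.0 if subtotal < 100 else 10.0


class ShippingFeeCharacterization(unittest.TestCase):
    """Pins what the code does today, not what it ought to do."""

    def test_pins_current_behavior(self):
        observed = {
            (49.99, "US"): 5.0,
            (50.0, "US"): 0.0,
            (99.99, "DE"): 15.0,
            (100.0, "DE"): 10.0,
            (10.0, ""): 15.0,  # empty country: surprising, but current behavior
        }
        for (subtotal, country), expected in observed.items():
            with self.subTest(subtotal=subtotal, country=country):
                self.assertEqual(legacy_shipping_fee(subtotal, country), expected)
```

The empty-country case is deliberately pinned as-is. Whether it is a bug is a separate decision, made in a separate behavior-changing commit once the net is in place.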

Letting the agent “modularize” into a swarm of tiny classes. Ask an AI agent to make code modular and it will happily create fifteen shallow classes and call it architecture. That’s classitis, and it raises complexity. The Software Design skill’s deep-module principle is the corrective: hide real machinery behind simple interfaces; merge shallow classes that always travel together.

Refactoring and adding a feature in one commit. When the commit does both, a failing test can’t tell you which change broke things, and code review can’t separate the structural diff from the behavioral one. Split every such change into a structure-only commit (tests stay green) and a behavior-only commit.

Calling external APIs with no timeout. This is one of the most common causes of total outages in prototypes. A dependency doesn’t have to fail to take you down; it only has to get slow while your threads block on it. Connect and read timeouts on every outbound call, plus a circuit breaker, are table stakes.
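
The circuit breaker idea can be sketched in a few dozen lines. This is a simplified, single-threaded illustration, not a production library: the `CircuitBreaker` and `CircuitOpenError` names, the failure threshold, and the cool-down length are all assumptions, and the wrapped callable is expected to apply its own timeout.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # cool-down in seconds
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means closed

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                # Open: fail fast instead of blocking on a sick dependency.
                raise CircuitOpenError("dependency marked unhealthy")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.max_failures:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller might wrap an outbound request as `breaker.call(lambda: urllib.request.urlopen(url, timeout=2.0))`, where `url` is whatever endpoint the app depends on. The timeout bounds each individual call, and the breaker stops the app from queuing up behind a dependency that keeps timing out.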

Scaling before you’ve sized anything. Adding Kafka, sharding the database, and standing up five services because the prototype “needs to scale” is over-engineering that buys complexity you can’t yet operate. Estimate first; the numbers usually show that a cache and a read replica buy years of runway.
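
A back-of-the-envelope estimate is just arithmetic, and it can be worth writing down so the assumptions are visible. Every input in the sketch below is an invented example figure (50,000 daily users, 40 requests and 10 writes per user per day, 2 KB per write, a 5x peak factor), and `estimate_load` is a hypothetical helper; substitute your own measurements.

```python
SECONDS_PER_DAY = 86_400


def estimate_load(daily_users: int, requests_per_user: int, peak_factor: float,
                  writes_per_user: int, bytes_per_write: int) -> dict:
    """Rough sizing: average QPS, peak QPS, and raw storage growth per year."""
    avg_qps = daily_users * requests_per_user / SECONDS_PER_DAY
    peak_qps = avg_qps * peak_factor
    storage_gb_per_year = (
        daily_users * writes_per_user * bytes_per_write * 365 / 1e9
    )
    return {
        "avg_qps": round(avg_qps, 1),
        "peak_qps": round(peak_qps, 1),
        "storage_gb_per_year": round(storage_gb_per_year, 1),
    }


# Example figures only; none of these come from a real system.
print(estimate_load(daily_users=50_000, requests_per_user=40, peak_factor=5.0,
                    writes_per_user=10, bytes_per_write=2_000))
# → {'avg_qps': 23.1, 'peak_qps': 115.7, 'storage_gb_per_year': 365.0}
```

Under these assumed inputs the app peaks at roughly a hundred requests per second and grows by a few hundred gigabytes a year, which gives you concrete figures to weigh before reaching for sharding or a message broker.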

Mistaking microservices for architecture. Splitting a tangled prototype into services that share one database gives you a distributed monolith — all the coupling, plus network latency and partial failure. Apply the Dependency Rule inside the service before you split anything.

Assuming your database is serializable. It almost certainly isn’t by default: Postgres defaults to Read Committed, and MySQL’s InnoDB to Repeatable Read. Code that reads-then-writes without a lock will corrupt data under concurrency through lost updates or write skew, and it’ll pass every single-user test you run. Check the actual default and lock the paths that need it.

Frequently asked questions

In what order should I install and apply these skills?

Follow the phase order in this guide, which is deliberately not the order in which the books were written. Start with working-with-legacy-code because untested prototype code is legacy code and you need a safety net before anything else. Then clean-code and refactoring-patterns for surface legibility, software-design-philosophy and clean-architecture for structure, pragmatic-programmer to set durable habits, and finally release-it, system-design, and ddia-systems for resilience and scale. You don’t have to do all nine before shipping: getting through Phase 3 (tested, readable, well-refactored) already puts you ahead of most prototypes. Add the Phase 7 basics (timeouts and circuit breakers) before launch, and let Phases 8–9 follow once you have real traffic. Install them all in one go with npx skills add wondelai/skills --all --global and pull each into a session as you reach its phase.

Do I really need to write tests for AI-generated code I can just regenerate?

Yes, and the regeneration argument is exactly why. The moment real users have data in your system, “just regenerate it” stops being an option — a regenerated module that behaves slightly differently can corrupt or lose customer data, and you’d have no test to catch the difference. Characterization tests are cheap insurance: they pin what the code does today so that any future change, whether you write it or an agent does, surfaces unintended behavior changes immediately. Tests are also what make the Clean Architecture boundary worth building — fast, framework-free tests on your business rules are only possible once that boundary exists.

How do I keep my AI agent from over-engineering when it “improves” the code?

Two guardrails. First, name the principle that fights over-engineering and put it in the prompt: the Software Design skill’s deep-module / anti-classitis guidance, and the Pragmatic Programmer skill’s YAGNI stance (don’t build flexibility you have no evidence you need yet). Ask the agent to reduce the number of interfaces and merge shallow classes, not multiply them. Second, demand the small-step refactoring workflow — tests green between every transformation — which structurally prevents the agent from doing a sweeping rewrite under the banner of “cleanup.” If a change can’t be expressed as a sequence of named, behavior-preserving refactorings, that’s a signal it’s a rewrite in disguise.

When is my prototype actually “production-ready”?

Use a concrete bar drawn from the skills: your business rules have tests that run without a database or framework; every outbound call has a timeout and critical dependencies have circuit breakers; you have a deep health check and RED metrics wired to symptom-based alerts; you can deploy without downtime and roll back a release in under a minute; and you know your database’s isolation level and have locked the read-then-write paths. The Release It! and Clean Architecture skills both score 0–10 against their principles — run those scorers on your codebase and treat anything below an 8 on the resilience and boundary dimensions as a launch blocker. “It works in the demo” is not on this list, because the demo was never the hard part.

Should I do all of this before launching, or can I ship and harden in parallel?

Ship earlier than you think, but in a specific order. Phases 1–3 (safety net, readability, refactoring) should be largely done before real users arrive, because they’re cheapest now and they protect the data you’re about to start collecting. The architecture boundary in Phase 5 pays off most when you do it before the codebase doubles again — retrofitting it later is real work. But the scale-oriented phases (8 and 9) are explicitly requirements-driven: don’t build for 50K users while you have 50. The System Design skill’s first principle — start with requirements, not solutions — applies to your own roadmap. Harden the resilience patterns in Phase 7 before launch (timeouts and a circuit breaker are not optional even at low traffic), and let the scaling work track your actual growth numbers.

Start with the safety net

The fastest way from “works on my machine” to a product you can operate is not a heroic rewrite — it’s a sequence of small, verifiable moves, each backed by a framework that thousands of engineers have already proven. The skills in this stack turn that literature into something your AI agent can apply directly to your code.

Install the whole stack with a single command:

npx skills add wondelai/skills --all --global

Then open your prototype and start at Phase 1: tell your agent to use the working-with-legacy-code skill to pin your current behavior with characterization tests. Once the net is in place, every other phase becomes safe.

When you’re ready to think bigger than incremental hardening, read the companion guides next: Refactor a Codebase Buried in Technical Debt goes deeper on the refactoring and legacy-code workflow, and Design the Best Architecture for a New App shows how to get the structure right from the first line so your next project never becomes a prototype you have to rescue.
