How to Refactor a Codebase Buried in Technical Debt

You inherited it, or you wrote it three years ago, or it accreted while the company chased product-market fit — and now there is a codebase that everyone is afraid of. It has hundreds of thousands of lines, a handful of files that no one will open without a deep breath, a test suite that is either absent or so flaky that the team disabled it in CI, and a folklore of “don’t touch the billing code, it works and we don’t know why.” Every estimate has a tech-debt tax baked in. Every new feature takes longer than the last. Onboarding a new engineer takes months because the only documentation is the code, and the code lies.

The instinct, voiced in every such team eventually, is to rewrite it. Stop. The big-bang rewrite is the single most reliable way to turn a struggling-but-shipping product into a struggling-and-not-shipping one. The old system keeps moving while you rebuild — bugs get fixed, edge cases get handled, customers get onboarded onto behavior you don’t even know exists — and the rewrite ships late, missing years of accumulated edge cases, having burned the runway you needed to actually fix things. This guide is the alternative: a sequenced, incremental plan to pay down technical debt in place, without ever stopping the world, where the worst-case blast radius of any single step is small and reversible.

The core idea threaded through every phase is feedback over fear. Right now you change code and pray, because you have no way to know whether an edit preserved behavior. We will replace that with a discipline where the codebase tells you, on your own machine, the moment you break something. Then we clean what we can safely reach, reduce the complexity that makes the system hard to hold in your head, and carve durable boundaries so that the next change touches one place instead of fifteen. Each phase is driven by one skill — a packaged framework from a canonical engineering book that your AI agent applies directly to your code. You install the stack once with npx skills add wondelai/skills --all --global and pull each skill into a session as you reach its phase.

You do not pay down a mountain of debt by stopping to rebuild the mountain. You pay it down one safe, tested step at a time, along the paths you actually walk.

The stack, in the order you will use it: Working with Legacy Code gives you the safety net and the comprehension tools to know where to even start. Refactoring supplies the named, mechanical transformations to restructure safely. Clean Code raises the legibility of what you touch. Software Design attacks the complexity itself. Clean Architecture draws the dependency boundaries that stop the rot. Pragmatic Programmer sets the habits that keep debt from re-accumulating. Release It! hardens the integration points that take legacy systems down in production. And Domain-Driven Design gives you the strategy to carve the big ball of mud into bounded contexts you can eventually extract.

If you are rescuing a freshly AI-generated prototype rather than an aging system, the companion guide From Vibe-Coded Prototype to Production-Ready Product covers that narrower case; this guide is about the large, old, tangled codebase where the central problem is scale and comprehension, not just polish.

Phase 1 — Build a safety net and find where to start

Before any cleanup, two questions dominate: how do I change anything without breaking it, and where in this enormous codebase do I even begin? The Working with Legacy Code skill, built on Michael Feathers’ book, answers both. Its foundational redefinition reframes the whole problem: legacy code is simply code without tests. Not old code, not ugly code — untested code, because without tests you cannot know whether a change preserves behavior, so every edit is a gamble. The craft is to get tests in place before changing anything: cover and modify, never edit and pray.

The skill’s spine is the Legacy Code Change Algorithm, a fixed sequence you run for every change: identify the change points, find the test points, break dependencies, write characterization tests, then make the change and refactor. Notice that the actual edit happens last, inside a safety net you built first. The central technique for building that net is the characterization test — a test not of what the code should do but of what it actually does right now, quirks and bugs included. The recipe is mechanical: call the code in a harness, assert something you know is wrong (expect(total).toBe(-1)), run it, read the failure message to learn the real behavior, then change the assertion to pin that real value. You are photographing existing behavior, not judging it, so that later any unintended change shows up as a red test.
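A minimal sketch of the characterization recipe in Python, assuming the standard-library unittest runner. The function legacy_total is a hypothetical stand-in for code you do not yet understand; its truncation quirk is invented for illustration.

```python
import unittest


def legacy_total(prices, coupon=None):
    """Hypothetical stand-in for untested legacy code, quirks included."""
    total = sum(prices)
    if coupon == "HALF":
        total = total / 2
    # Quirk: truncates instead of rounding. A characterization test records
    # this behavior; it does not decide whether it is a bug.
    return int(total)


class CharacterizeLegacyTotal(unittest.TestCase):
    # The first draft of each test asserted a value known to be wrong, e.g.
    # assertEqual(legacy_total([10, 5]), -1). The failure message reported
    # the real value, which is what is pinned below.
    def test_plain_sum(self):
        self.assertEqual(legacy_total([10, 5]), 15)

    def test_half_coupon_truncates(self):
        # 15 / 2 is 7.5, and the legacy code truncates it to 7.
        self.assertEqual(legacy_total([10, 5], coupon="HALF"), 7)
```

If someone later "fixes" the truncation while refactoring, the second test goes red and forces an explicit decision instead of a silent behavior change.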

But in a large codebase, before you can write that test you have to understand a region you don’t understand — and that is the actual bottleneck, not typing. This skill is unusually strong here. Effect sketches trace what a change can affect: a bubble per variable or method, an arrow per “affects,” traced forward from your change point until you have a finite list of everywhere behavior can leak out — turning “what could this break?” from free-floating anxiety into a bounded answer. Pinch points are narrowings in that sketch where a handful of tests cover a wide swath of behavior; find one and you can pin an entire cluster cheaply. Scratch refactoring is refactoring recklessly on a throwaway branch purely to learn the code, then reverting — the understanding survives the git checkout. And feature sketches reveal how methods and fields cluster inside a god class, showing you where a hidden class boundary wants to be drawn.
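An effect sketch is usually drawn by hand, but the idea can be illustrated as reachability over an "affects" graph. The sketch below is a toy under stated assumptions: the node names and edges are hypothetical, and building the real graph still means reading the code.

```python
from collections import deque

# Hypothetical "affects" edges: each key can change the behavior of its values.
AFFECTS = {
    "apply_charges": ["subtotal", "prorate"],
    "prorate": ["subtotal"],
    "subtotal": ["invoice_total"],
    "invoice_total": ["render_invoice", "ledger_entry"],
    "render_invoice": [],
    "ledger_entry": [],
}


def effect_sketch(change_point, affects):
    """Return everything reachable from change_point, in breadth-first order."""
    seen = [change_point]
    queue = deque([change_point])
    while queue:
        node = queue.popleft()
        for target in affects.get(node, []):
            if target not in seen:
                seen.append(target)
                queue.append(target)
    return seen[1:]  # drop the change point itself
```

In this toy graph every path from apply_charges to an output passes through invoice_total, which makes it the natural pinch point: tests written there cover the whole cluster upstream of it.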

Prompt

Use the working-with-legacy-code skill to help me understand our 4,000-line BillingEngine before I touch it: draw an effect sketch from the applyCharges method outward to find every place its changes can leak, identify pinch points where a few tests would cover the most behavior, and tell me the smallest set of characterization tests that would let me safely change the proration logic

Legacy Code

The obstacle you hit the instant you try to write a test is that legacy classes resist instantiation: the constructor opens a real database connection, a singleton is consulted three calls deep, a static Billing.charge() is wired in everywhere. The skill’s answer is seams — places where you can alter behavior without editing in that place — and a catalog of dependency-breaking techniques to create them. The rule is to use the least invasive technique that unblocks you, because these edits happen before tests exist. The workhorse is Parameterize Constructor with a production default (def __init__(self, conn=None): self.conn = conn or connect()), so every existing caller keeps compiling untouched while tests inject a fake. Extract Interface is the safest move in the book — introducing an interface cannot change behavior, only loosen a type. For pervasive statics and singletons, Introduce Instance Delegator hands callers an instance they can swap.
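Parameterize Constructor might look like the following sketch. PostgresPool, OrderProcessor, and FakePool are illustrative names, not a real driver API. It uses an explicit None check rather than the shorter `or` form so that a fake which happens to be falsy is not silently replaced by the production default.

```python
class PostgresPool:
    """Hypothetical stand-in for a real pool; imagine it opens sockets."""

    def fetch_order(self, order_id):
        raise RuntimeError("would hit a real database")


class OrderProcessor:
    def __init__(self, pool=None):
        # Production default: callers that pass nothing get the real pool,
        # exactly as before. Tests inject a fake through this new seam.
        self.pool = pool if pool is not None else PostgresPool()

    def total_for(self, order_id):
        order = self.pool.fetch_order(order_id)
        return sum(line["qty"] * line["unit_price"] for line in order["lines"])


class FakePool:
    """Test double that returns canned data instead of querying anything."""

    def fetch_order(self, order_id):
        return {"lines": [{"qty": 2, "unit_price": 5}, {"qty": 1, "unit_price": 3}]}
```

A test can now construct OrderProcessor(pool=FakePool()) and pin total_for without a database, while every existing OrderProcessor() call site is unchanged.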

Prompt

Use the working-with-legacy-code skill to get our OrderProcessor under test: its constructor news up a Postgres pool, a Redis client, and a static EmailService, which makes it impossible to instantiate in a test, so identify the seams and break those dependencies with the least invasive techniques (Parameterize Constructor with production defaults, Extract Interface for the email static) so I can write characterization tests without changing behavior for existing callers

Legacy Code

When a change is genuinely urgent and you cannot get the surrounding area under test in time, the skill gives you two contained escape hatches. Sprout Method / Sprout Class: write the new behavior as fresh, fully tested code and call it from a single line in the untested host. Wrap Method: rename the old function aside (rawPay) and add behavior — logging, a feature gate, an audit hook — in a new wrapper that calls it, decorator-style. Both leave existing behavior byte-for-byte intact and let you test the new code, at the honest cost of leaving the host untested. The skill is firm that this is a tactical move: track every sprout as debt and cover the host the next time a change lands there.
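Both escape hatches fit in a few lines. The following is a sketch with hypothetical names (generate_invoice, vat_for, pay), using integer cents and a made-up 20 percent rate purely for illustration.

```python
audit_log = []


def vat_for(net, rate_percent):
    """Sprout Method: new behavior, written fresh and tested on its own."""
    return net * rate_percent // 100


def generate_invoice(lines):
    """Hypothetical untested host; only the marked line is new."""
    net = sum(lines)
    vat = vat_for(net, rate_percent=20)  # the single sprouted call
    return {"net": net, "vat": vat, "gross": net + vat}


def _raw_pay(amount):
    """The original pay(), renamed aside and otherwise untouched."""
    return f"paid {amount}"


def pay(amount):
    """Wrap Method: keeps the name callers already use, adds an audit hook."""
    audit_log.append(("pay", amount))
    return _raw_pay(amount)
```

The sprouted vat_for and the wrapper's audit hook are testable in isolation; the bodies of generate_invoice and _raw_pay remain untested debt to track.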

Prompt

Use the working-with-legacy-code skill to add VAT handling to our untested 800-line invoice generator that we cannot get fully under test in time: use Sprout Class to put the VAT calculation in a fully tested VatCalculator, call it from one line in the existing generator, and leave the rest untouched, then list what we should characterize next time we touch this file

Legacy Code

By the end of this phase you have not cleaned anything. You have something more valuable: along the paths you are about to change, the current behavior is pinned, and you have a map of where the change can reach. Everything from here is verifiable.

Phase 2 — Restructure with named refactorings, one safe step at a time

With behavior pinned at your change points, you can start improving structure — and the Refactoring skill, from Martin Fowler’s catalog, is how you do it without breaking things. Its defining principle separates it cleanly from the rewrite you are avoiding: refactoring is not rewriting; it is a sequence of small, behavior-preserving transformations, each backed by tests. You never change what the code does, only how it is organized. The catalog is the point — each code smell maps to a specific named refactoring with known mechanics, so instead of vaguely “cleaning up,” you say “this is Feature Envy” or “this is a Long Method” and apply the prescribed move. That precision is exactly what makes an AI agent reliable here: it can name the smell, cite the transformation, and execute it in disciplined small steps.

In a debt-laden codebase the smells cluster predictably. Long Method is the most common and Duplicate Code the most expensive. You will see god classes with a dozen responsibilities (Large Class, Divergent Change), the same business rule copy-pasted across modules, Primitive Obsession where raw strings and ints stand in for domain concepts, and Shotgun Surgery where one logical change forces edits across many files. The single most important transformation is Extract Method, and the heuristic is elegant: if you feel the urge to write a comment explaining what a block does, extract that block into a method and use the comment as its name. A 300-line method becomes a readable sequence of named steps. From there the catalog branches: Replace Magic Number with Symbolic Constant; Decompose Conditional and Replace Nested Conditional with Guard Clauses for the pyramids of if/else; Replace Conditional with Polymorphism for switch statements on a type code; Introduce Parameter Object when the same cluster of arguments travels everywhere; Extract Class to split a god class along its axis of change.
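To make three of these moves concrete, here is a small before-and-after sketch. The shipping rules and amounts are invented for illustration; what matters is that both versions return the same result for every input, which is what the characterization tests would verify.

```python
def shipping_cost_before(order):
    # Nested pyramid, as it might look in legacy code.
    if order is not None:
        if order["items"]:
            if order["country"] == "US":
                # free shipping over 50
                if sum(order["items"]) > 50:
                    return 0
                else:
                    return 5
            else:
                return 15
        else:
            return 0
    else:
        return 0


FREE_SHIPPING_THRESHOLD = 50  # Replace Magic Number with Symbolic Constant


def qualifies_for_free_shipping(order):
    # Extract Method: the old comment became the function name.
    return sum(order["items"]) > FREE_SHIPPING_THRESHOLD


def shipping_cost_after(order):
    # Replace Nested Conditional with Guard Clauses: special cases exit first.
    if order is None or not order["items"]:
        return 0
    if order["country"] != "US":
        return 15
    return 0 if qualifies_for_free_shipping(order) else 5
```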

Prompt

Use the refactoring-patterns skill to refactor the 600-line settleAccount method in our payments module using named refactorings: name each smell first, then apply Extract Method to break it into intention-revealing steps, Replace Nested Conditional with Guard Clauses to flatten the validation, and Replace Conditional with Polymorphism for the switch on account type, running the characterization tests after every single transformation

Refactoring

The skill is strict about the workflow, and you should hold your agent to it mechanically: run tests (green), apply one small transformation, run tests (green), commit. If the tests go red, you revert; you do not debug a half-finished refactoring. Small steps make any failure obviously attributable to the last thing you did, and reverting costs seconds; debugging a tangled multi-change refactoring costs days. Three workflow patterns are especially powerful against entrenched debt. Preparatory Refactoring (“make the change easy, then make the easy change”) means you clean the insertion point before adding a feature there, which is how you stop debt from compounding with every story. The Rule of Three keeps you from abstracting prematurely: write it the first time, tolerate the duplication the second time, and extract on the third occurrence. And Branch by Abstraction lets you perform a large structural migration gradually in production: introduce an abstraction over the thing you want to replace, migrate callers behind it one at a time, swap in the new implementation, then remove the old path, with no long-lived feature branch and no big-bang merge.
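Branch by Abstraction can be sketched as follows, using date parsing as the thing being replaced. All class and function names here are hypothetical; the only real APIs used are typing.Protocol and datetime.strptime from the standard library.

```python
from datetime import date, datetime
from typing import Protocol


class DateParser(Protocol):
    """The abstraction that call sites migrate to, one at a time."""

    def parse(self, text: str) -> date: ...


class LegacyDateParser:
    """Wraps the hand-rolled logic unchanged (hypothetical example)."""

    def parse(self, text: str) -> date:
        year, month, day = (int(part) for part in text.split("-"))
        return date(year, month, day)


class StdlibDateParser:
    """Replacement implementation, swapped in behind the same interface."""

    def parse(self, text: str) -> date:
        return datetime.strptime(text, "%Y-%m-%d").date()


def invoice_due_date(raw: str, parser: DateParser) -> date:
    # A migrated call site: it no longer knows which implementation runs.
    return parser.parse(raw)
```

Running the same characterization tests against both implementations before the swap is what makes the final step, deleting LegacyDateParser, uneventful.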

Prompt

Use the refactoring-patterns skill to replace our hand-rolled date-handling utilities with a proper library across the codebase using Branch by Abstraction: introduce an interface that wraps our current date logic, migrate call sites to it incrementally with tests green at each step, swap in the new implementation behind the interface, and then remove the old utilities once nothing depends on them

Refactoring

A discipline this skill shares with the legacy-code skill and that is non-negotiable in debt work: never refactor and change behavior in the same commit. Wear one hat at a time. A structure-only commit that turns a test red can only mean you broke something mechanically, which makes the failure trivial to find. Mixing structural and behavioral change makes every failure ambiguous and every code review twice as hard.

Phase 3 — Raise legibility where you touch the code

Refactoring gives you the transformations; the Clean Code skill, from Robert C. Martin’s book, tells you what good looks like at the level of names, functions, and error handling. Its core principle is the one that justifies the whole effort: code is read far more often than it is written — the ratio is well over 10:1 — so optimize for the reader. In a codebase the team is afraid of, illegibility is the debt as much as bad structure is; every cryptic name is a tax paid by every future reader, human or agent. The skill scores code 0–10 across six disciplines and tells you the specific moves to climb the scale, which makes it ideal as a review lens for the team.

The two fronts that pay off fastest in legacy code are naming and functions. Legacy systems are dense with data, tmp, mgr, processData2, and functions that validate, transform, persist, and notify all in one breath. Clean Code’s rules are specific and enforceable: names reveal intent (elapsedTimeInDays, not d); booleans read as predicates (isActive, hasPermission, canEdit); one word per concept (don’t mix fetch, retrieve, and get for the same idea); classes are nouns, methods are verbs; longer scope earns a longer, more descriptive name. Functions should do one thing at one level of abstraction, ideally take zero to two arguments, and never use a flag argument (a function called as render(isPrint) is really two functions). Command-Query Separation matters acutely in old code: a function either changes state or returns a value, never both, so a name can never hide a side effect that bites a caller later.
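The flag-argument and Command-Query Separation rules are easiest to see side by side. This is an illustrative sketch; render, Session, and their behavior are hypothetical.

```python
# Before: the boolean flag means this is really two functions.
def render(document, is_print):
    if is_print:
        return f"[print] {document}"
    return f"[screen] {document}"


# After: each function does one thing, and the call site reads clearly.
def render_for_print(document):
    return f"[print] {document}"


def render_for_screen(document):
    return f"[screen] {document}"


class Session:
    def __init__(self):
        self._user = None

    def log_in(self, user):
        """Command: changes state and returns nothing."""
        self._user = user

    def is_logged_in(self):
        """Query: returns a value and changes nothing."""
        return self._user is not None
```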

Prompt

Use the clean-code skill to review our user-management module and score it 0-10 on Clean Code: flag every name that hides intent, every function over twenty lines or doing more than one thing, every flag argument, and every function that both mutates state and returns a value, then give me the top ten fixes in priority order so I can apply them as behavior-preserving refactorings

Clean Code

The second front is error handling, which in long-lived systems is often genuinely dangerous and is where production incidents hide. Legacy code tends to swallow exceptions in bare catch blocks that hide real bugs, or return null on failure so null-checks metastasize across every caller and one missing check crashes far from the source. Clean Code’s prescriptions: prefer exceptions to return codes so the happy path stays uncluttered; catch specific exceptions and let the rest propagate; never return or pass null — use an empty collection, an Optional, or a Null Object with safe defaults; wrap noisy third-party APIs behind a clean adapter that translates their exceptions into yours; and put context in every error (Failed to save invoice #1234 for customer Acme) so debugging starts from a fact. This pairs directly with the Boy Scout Rule the skill champions — always leave the code cleaner than you found it — which is how a large codebase improves continuously instead of waiting for a cleanup that never gets funded.
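A compact sketch of three of those prescriptions together: an adapter that translates vendor errors, an error message that carries context, and an empty collection in place of null. VendorError and FakeVendorClient are stand-ins invented for this example and do not correspond to any real SDK.

```python
class VendorError(Exception):
    """Stand-in for an exception raised by a third-party SDK."""


class PaymentDeclined(Exception):
    """Our own domain exception; callers never see VendorError."""


class FakeVendorClient:
    """Hypothetical SDK client, present only to make the sketch runnable."""

    def charge(self, customer, cents):
        raise VendorError("card_declined")


class PaymentGateway:
    """Adapter: the only place that knows the vendor's exception types."""

    def __init__(self, client):
        self._client = client

    def charge(self, invoice_id, customer, cents):
        try:
            return self._client.charge(customer, cents)
        except VendorError as err:  # specific catch; anything else propagates
            raise PaymentDeclined(
                f"Failed to charge invoice #{invoice_id} for customer {customer}: {err}"
            ) from err


def find_discounts(customer, table):
    # Empty list rather than None, so callers can iterate without a check.
    return table.get(customer, [])
```

Chaining with `from err` keeps the vendor's original traceback attached, so translating the exception does not discard diagnostic detail.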

Prompt

Use the clean-code skill to audit error handling across our checkout and payments code: find every bare catch that swallows exceptions, every function that returns null on failure, and every error thrown without context, then refactor them to throw specific exceptions carrying the operation and the relevant state, replace null returns with Optionals or empty collections, and wrap the Stripe SDK behind an adapter that translates its errors into our domain exceptions

Clean Code

Phase 4 — Reduce the complexity that makes the system unholdable

By now the regions you have touched are tested, restructured, and legible — but a debt-laden codebase has a deeper affliction that surface cleanup does not cure, and that AI agents left unsupervised will actually worsen. Asked to “modularize” or “clean up,” an agent tends to shatter code into a swarm of tiny classes and interfaces and call it architecture. The Software Design skill, from John Ousterhout’s A Philosophy of Software Design, is the corrective. Its governing principle names your real enemy precisely: the greatest limitation in building software is our ability to understand the systems we create — complexity is the enemy. Evaluate every change by asking whether it increases or decreases the overall complexity of the system.

The skill gives you vocabulary to diagnose what “this code is a nightmare” actually means. Complexity shows three symptoms: change amplification (a simple change requires edits in many places), cognitive load (you must hold too much in mind to make a change), and unknown unknowns (it isn’t even obvious what must change or what is relevant — the worst symptom, and the defining feature of a codebase buried in debt). Its central design idea is deep versus shallow modules: a module’s depth is the functionality it provides divided by the complexity of the interface it imposes. A deep module hides significant machinery behind a simple interface (Unix file I/O — open, read, write, close — conceals disk blocks, buffering, caching, encoding). A shallow module has an interface nearly as complex as its implementation, adding cognitive load without hiding much. The disease of too many small shallow classes has a name — classitis — and the cure is to merge related shallow classes into deeper ones, which is the opposite of what an unsupervised agent does.

Prompt

Use the software-design-philosophy skill to analyze our notifications and messaging subsystem for classitis: identify which classes are shallow because their interface is nearly as complex as their implementation, find the pass-through methods that add indirection without value, and propose how to consolidate the shallow classes that always run together into deeper modules with simpler interfaces, telling me whether each change raises or lowers overall complexity

Software Design

The second pillar is information hiding and its red-flag inverse, information leakage — when one design decision (a date format, a wire protocol, a database assumption) is reflected across multiple modules, so changing it means editing all of them. That is change amplification made concrete, and it is rampant in old codebases. A frequent cause is temporal decomposition: organizing code by when things happen (“read the config, then validate it, then apply it”) rather than by knowledge, which forces every phase to share the same details. The fix is to organize modules around the knowledge they own — one module owns config, one owns serialization — so each hides a decision the others don’t need.
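As an illustration of a module organized around the knowledge it owns, here is a sketch of a small config module. The JSON layout, the key names, and the fallback rule are all assumptions made for the example.

```python
import json


class Config:
    """Owns the file format and the flag-resolution rule; callers see neither."""

    def __init__(self, raw_text):
        # The JSON format is a hidden decision: switching formats would
        # change this constructor and nothing else.
        self._data = json.loads(raw_text)

    def flag(self, name, default=False):
        # Resolution rule (an assumption for this sketch): a missing flag
        # falls back to the caller's default.
        return bool(self._data.get("flags", {}).get(name, default))

    def setting(self, name, default=None):
        return self._data.get("settings", {}).get(name, default)
```

Callers ask config.flag("new_checkout") and never parse anything themselves, so a format change stays inside one class.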

Finally, this skill reframes the entire debt-paydown effort as a shift from tactical to strategic programming. Tactical programming ships features fast and accumulates complexity with every shortcut; your codebase is the cumulative output of years of it. Strategic programming invests an extra 10–20% to keep the design clean and treats every change as an opportunity to improve structure — the “tactical tornado” who ships fast and leaves wreckage is the anti-pattern. This is the mindset that makes the Boy Scout Rule sustainable: a little improvement on every change, compounding, instead of a heroic rewrite that never comes.

Prompt

Use the software-design-philosophy skill to review how configuration and feature flags are read throughout our app for information leakage: find every place that knows the config file format or flag-resolution rules, show me where temporal decomposition has spread that knowledge across modules, and recommend a single deep config module that hides those decisions so a format change touches one place instead of twenty

Software Design

Phase 5 — Draw the dependency boundary around your business rules

Your modules are deeper and your functions cleaner, but the structural question that decides whether this codebase stays a swamp is bigger than any single module: does your business logic depend on your framework and database, or do they depend on it? In a codebase buried in debt the answer is almost always backwards — business rules live inside controllers, ORM models are the domain, pricing logic is interleaved with SQL, and you cannot test a business rule without standing up the whole framework. The Clean Architecture skill, from Robert C. Martin, fixes this with one rule and a great deal of leverage.

The rule is the Dependency Rule: source code dependencies must point inward, toward higher-level policy. Picture concentric circles — Entities (enterprise business rules) at the center, then Use Cases (application-specific rules), then Interface Adapters (controllers, presenters, gateways), with Frameworks and Drivers (the web framework, the ORM, the queue) on the outside — and nothing in an inner circle may name anything in an outer one. The database is a detail. The web is a detail. They are plugins to your business rules, not the skeleton of your application. The mechanism that enforces the rule is Dependency Inversion: a Use Case defines a UserRepository interface, and a PostgresUserRepository in the outer layer implements it, so business logic depends on an abstraction it owns and the database depends on that abstraction, never the reverse.
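The inversion can be sketched in a few classes. The UserRepository name comes from the text above; DeactivateUser, UserNotFound, and the in-memory adapter are hypothetical additions for illustration, and a Postgres-backed adapter would implement the same two methods in the outer layer.

```python
from dataclasses import dataclass
from typing import Protocol


class UserNotFound(Exception):
    """Raised instead of returning None for a missing user."""


@dataclass
class User:
    user_id: int
    email: str
    active: bool = True


class UserRepository(Protocol):
    """Owned by the use-case layer; outer layers implement it."""

    def get(self, user_id: int) -> User: ...
    def save(self, user: User) -> None: ...


class DeactivateUser:
    """Use case: names only the abstraction, never a database or framework."""

    def __init__(self, users: UserRepository):
        self._users = users

    def execute(self, user_id: int) -> None:
        user = self._users.get(user_id)
        user.active = False
        self._users.save(user)


class InMemoryUserRepository:
    """Outer-layer adapter used in tests."""

    def __init__(self):
        self._rows = {}

    def get(self, user_id: int) -> User:
        try:
            return self._rows[user_id]
        except KeyError:
            raise UserNotFound(f"No user with id {user_id}") from None

    def save(self, user: User) -> None:
        self._rows[user.user_id] = user
```

Nothing in DeactivateUser imports a driver or an ORM, which is why its tests need no database.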

In a tangled monolith you do not retrofit this everywhere at once — that would be a rewrite. You do it module by module, starting where change is most frequent or most painful, and the SOLID principles are your mid-level tools. Single Responsibility (a module has one reason to change because it serves one actor) tells you where to split a god class; Open-Closed (extend by adding code, not modifying it) and Dependency Inversion are how you introduce a seam an adapter can plug into. The payoff is immediate and compounds: once the boundary exists for a module, you can test its business rules with no database and no framework — making the Phase 1 tests fast and trustworthy — and you can replace the framework or datastore behind it as a localized change.

Prompt

Use the clean-architecture skill to map the dependency graph of our Rails monolith and find every violation of the Dependency Rule where business logic reaches into ActiveRecord or the controller layer, then pick our order-pricing rules as the first module to fix: extract them into framework-free use cases that depend on a repository interface they own, and move the ActiveRecord implementation out to an adapter so we can unit-test pricing with no database

Clean Architecture

The skill is loud about a trap that debt-laden teams fall into when they finally decide to “modernize”: do not assume that splitting into microservices buys you good architecture. A set of services sharing one fat data model is a distributed monolith — all the coupling of a monolith plus network latency and partial failure, which is strictly worse. Apply the Dependency Rule within your monolith first; services are deployment boundaries, not automatically architectural ones. Component principles like Common Closure (classes that change together belong together) and Acyclic Dependencies (break every cycle in the component graph) tell you where genuine boundaries are before you ever consider distributing them.

Prompt

Use the clean-architecture skill to check whether breaking our monolith into microservices would just create a distributed monolith given that everything still shares one database: analyze the component dependencies for cycles and shared data models, apply Common Closure to group classes that change together, and tell me which boundaries are real architectural seams versus which would only add network calls to existing coupling

Clean Architecture

Phase 6 — Lock in the habits that stop debt from re-accumulating

Paying down debt is wasted if the codebase silently re-accrues it the moment your attention moves on. The Pragmatic Programmer skill, from Hunt and Thomas, supplies the meta-principles that hold every other phase together and keep the gains. Its ethos — care about your craft — is operationalized through habits that determine whether a large codebase stays easy to change. The most directly relevant to debt is the Broken Window Theory: one unrepaired hack signals that nobody is minding quality, and the threshold for the next hack drops to zero. This is the precise psychology that produced your current situation. The discipline is to fix problems immediately or “board them up” with a tracked ticket — never an untracked // TODO: fix later, which is itself a broken window — and to be intolerant of new debt in code review even as you pay down the old.

Three more principles matter acutely here. DRY (Don’t Repeat Yourself), stated precisely where most people are sloppy: DRY is about knowledge, not text. Two identical-looking blocks encoding different business rules are not duplication, and DRY-ing them couples concepts that should stay independent; the duplication that hurts is duplicated knowledge — the same validation rule on client and server, the same tax logic in three handlers. Orthogonality: components are orthogonal if changing one doesn’t affect another (change the auth provider and billing shouldn’t care), which is the design value the Clean Architecture boundary buys you and the antidote to the shotgun surgery that plagues your codebase. And reversibility: abstract third-party vendors behind your own interfaces so no decision is permanent — there are no final decisions, only ones you’ve made expensive to change.

Prompt

Use the pragmatic-programmer skill to audit our codebase for duplicated knowledge that causes bugs (the same validation rule repeated on client and server, the same business calculation in several services) while explicitly ignoring code that merely looks similar but encodes different rules, then find the broken windows like untracked TODOs and quick hacks and tell me which to fix now versus board up with a tracked ticket

Pragmatic Programmer

This skill is also where you institute the practices that make ongoing debt paydown affordable rather than a separate budget line: a debt budget, the Boy Scout Rule with teeth, and Design by Contract to make assumptions explicit so failures surface early instead of corrupting data quietly. Crash early is a particularly useful stance in a legacy system where silent corruption is a constant risk, because a dead program does far less damage than one limping along in an invalid state. Add preconditions and invariants as guard clauses at the boundaries you are hardening, so the system fails loudly at the point of the problem.
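Guard clauses in this style might look like the sketch below. The withdraw function and ContractViolation exception are hypothetical. Explicit raises are used instead of assert statements because Python removes asserts when run with the -O flag, which would silently disable the contract.

```python
class ContractViolation(Exception):
    """Raised when a precondition or invariant fails."""


def withdraw(balance_cents, amount_cents):
    # Preconditions: what the caller must guarantee.
    if amount_cents <= 0:
        raise ContractViolation(f"amount must be positive, got {amount_cents}")
    if amount_cents > balance_cents:
        raise ContractViolation(
            f"cannot withdraw {amount_cents} from balance {balance_cents}"
        )

    new_balance = balance_cents - amount_cents

    # Invariant: an account balance never goes negative. Redundant given the
    # preconditions above, and kept so a future edit cannot break it quietly.
    if new_balance < 0:
        raise ContractViolation(f"balance went negative: {new_balance}")
    return new_balance
```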

Prompt

Use the pragmatic-programmer skill to add Design by Contract guard clauses to the boundaries of our newly extracted billing use cases: define the preconditions each entry point requires, the postconditions it guarantees, and the invariants that must always hold (an account balance never goes negative), and make them fail fast and loudly so an invalid state crashes at the source instead of corrupting data downstream

Pragmatic Programmer

Phase 7 — Harden the integration points before they take you down

Legacy systems do not usually fail in the demo; they fail in production, at 3 a.m., when a dependency you forgot about gets slow. The Release It! skill, from Michael Nygard, exists for exactly this, and its opening principle is worth memorizing: the software that passes QA is not the software that survives production. Production is hostile, and every system is eventually pushed past its design limits — the only question is whether it degrades gracefully or collapses. In an old codebase the danger is acute because the integration points have multiplied over years and almost none of them were hardened.

The skill catalogs stability anti-patterns, and the number-one killer is integration points — every socket, HTTP call, and queue the system touches. The specific failure that takes down legacy apps is calling a third-party API or a sibling service with no timeout. When that dependency slows down — not even fails, just slows — request threads block waiting on it, the thread pool exhausts, and the entire app stops responding to everyone, because of something you don’t control. The skill is blunt: a slow response is worse than no response. Two more anti-patterns are endemic to aging systems: unbounded result sets (a query that was fine when the table had a thousand rows becomes an out-of-memory crash at a million — add LIMIT, paginate every list endpoint) and blocked threads (the silent killer — contention and deadlocks that show no error until everything stops).

Against each anti-pattern stands a stability pattern. Timeouts on every outbound call, both connect and read, are non-negotiable. The Circuit Breaker wraps a failing dependency: after a threshold of failures it trips open and fails fast instead of waiting, then periodically tests recovery in a half-open state. A tripped breaker is the system working correctly, protecting itself from a downstream failure. Bulkheads isolate resource pools so one drowning dependency can’t drain the threads the rest of the app needs (a dedicated connection pool for the payment gateway, separate from search). Retries use exponential backoff and jitter so a recovering service isn’t immediately flattened by a synchronized thundering herd. And Steady State means designing automatic cleanup, because legacy systems accumulate cruft (sessions, logs, temp files) until a disk fills.
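A deliberately small sketch of a circuit breaker and a jittered backoff delay, to show the state transitions rather than to replace a production library. It makes several simplifying assumptions: it counts consecutive failures rather than failures within a time window, it is not thread-safe, and the class, parameter names, and default values are all illustrative.

```python
import random
import time


class CircuitOpenError(Exception):
    """Raised instead of calling the dependency while the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the breaker is closed

    def call(self, fn, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("failing fast; dependency presumed down")
            # Otherwise half-open: let this trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                # Trip, or re-trip after a failed half-open trial.
                self._opened_at = self._clock()
            raise
        # Success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result


def backoff_delay(attempt, base=0.1, cap=5.0, rng=random.random):
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))
```

The injectable clock and rng exist so the behavior can be tested deterministically without sleeping.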

Prompt

Use the release-it skill to audit every integration point in our legacy service for stability anti-patterns: find external API calls and database queries with no timeout and add connect and read timeouts to all of them, find list endpoints and queries with no LIMIT that will fall over as our tables grow and add pagination, then wrap our payment and email provider calls in circuit breakers that trip after five failures in sixty seconds and isolate their connection pools with bulkheads

Release It!

Two more areas from this skill make a legacy system operable rather than merely alive. Deployment and release: decouple deploying code from releasing it via feature flags, so you can ship dark, enable for a small percentage, and roll back a release in seconds without redeploying; make database migrations backward-compatible with expand-contract so old and new code can run simultaneously during a rolling deploy — essential when you are changing schemas a debt-laden app depends on in undocumented ways. And observability, which old systems are usually missing: you cannot operate what you cannot see. Add deep health checks (the /health endpoint verifies the database, cache, and queue are reachable, not just that the process is up), instrument the RED metrics per endpoint (Rate, Errors, Duration), and alert on the symptoms users feel — error rate and latency — not on causes like CPU.
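A deep health check can be sketched independently of any web framework. The function below is a hypothetical helper: it assumes each probe is a zero-argument callable that raises on failure and enforces its own timeout.

```python
def deep_health(checks):
    """Run each named dependency probe and report per-dependency status.

    `checks` maps a dependency name to a zero-argument callable that raises
    on failure. Probes are assumed to apply their own timeouts.
    """
    results = {}
    for name, probe in checks.items():
        try:
            probe()
            results[name] = "ok"
        except Exception as err:
            # Broad catch is deliberate here: the endpoint must report a
            # failing dependency, not crash because of it.
            results[name] = f"failed: {err}"
    healthy = all(status == "ok" for status in results.values())
    return {"status": "ok" if healthy else "degraded", "checks": results}
```

A /health handler would serialize this dictionary and return a non-2xx status when the overall status is not "ok", so load balancers and alerts can act on it.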

Prompt

Use the release-it skill to design an expand-contract migration for the orders table schema that half our untested code depends on, keeping it backward-compatible at every step: add the new columns and write to both old and new, backfill existing rows, migrate readers behind a feature flag, then contract by dropping the old columns once nothing reads them, with a fast rollback at each stage

Release It!
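The expand-contract sequence in the prompt above can be walked through on a toy table. This sketch uses an in-memory SQLite database and a hypothetical change (replacing a float `total` column with an integer `total_cents`); the table layout, column names, and the `flags` dictionary standing in for a feature-flag service are all assumptions. The contract step is shown only as a comment because it is the one irreversible stage.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")  # legacy schema
db.execute("INSERT INTO orders (id, total) VALUES (1, 19.99)")          # pre-existing row

# Hypothetical stand-in for a feature-flag service.
flags = {"read_total_cents": False}

# 1. Expand: add the new column as nullable so old code keeps working untouched.
db.execute("ALTER TABLE orders ADD COLUMN total_cents INTEGER")


# 2. Dual write: new code fills both columns, so old and new readers both see data.
def create_order(order_id, cents):
    db.execute(
        "INSERT INTO orders (id, total, total_cents) VALUES (?, ?, ?)",
        (order_id, cents / 100, cents),
    )


# 3. Backfill: populate the new column for rows written before the expand step.
def backfill():
    db.execute(
        "UPDATE orders SET total_cents = CAST(ROUND(total * 100) AS INTEGER) "
        "WHERE total_cents IS NULL"
    )


# 4. Migrate readers behind a flag; turning the flag off is the fast rollback.
def order_total_cents(order_id):
    if flags["read_total_cents"]:
        (cents,) = db.execute(
            "SELECT total_cents FROM orders WHERE id = ?", (order_id,)
        ).fetchone()
        return cents
    (total,) = db.execute(
        "SELECT total FROM orders WHERE id = ?", (order_id,)
    ).fetchone()
    return round(total * 100)

# 5. Contract (not executed here): once no reader or writer touches `total`,
#    stop the dual write and drop the old column in a later, separate deploy.
```

Each numbered step would be its own deploy in practice, so that at every point the previous version of the code still runs correctly against the current schema.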

Phase 8 — Carve the big ball of mud into bounded contexts

You have made the codebase safe to change, legible, less complex, properly bounded at the module level, disciplined, and resilient. The final phase is strategic: how do you decompose a monolithic big ball of mud into parts the team can own and evolve — and eventually extract — without a rewrite? The Domain-Driven Design skill, from Eric Evans, provides the strategy. Its core principle reframes what “good structure” even means: the model is the code; the code is the model. The greatest risk in software is not technical failure but building a model that doesn’t reflect how the business actually works — and in a legacy codebase that mismatch is usually the deepest debt of all, because the code’s concepts and the business’s concepts drifted apart years ago.

The first and cheapest move is Ubiquitous Language: a shared, rigorous vocabulary between developers and domain experts, used consistently in conversation and in the code. Legacy codebases are littered with technical-only names — DataManager, RequestProcessor, Helper, Utils — that hide domain logic from the experts who could correct it. The skill treats naming difficulty as a design signal: if a concept is hard to name, the model is probably wrong. Renaming process() to underwrite() and RequestHandler to LoanApplication is not cosmetic; it surfaces the real domain and makes the next round of modeling possible.

Prompt

Use the domain-driven-design skill to build a ubiquitous language for our codebase, which is full of technical names like OrderManager, DataProcessor, and PaymentHelper that hide what the business actually does: go through the core billing and fulfillment modules, propose domain-meaningful names for the classes and methods based on how the business describes these operations, and flag any concept that is hard to name because it signals the model is probably wrong

Domain-Driven Design

The strategic heart of decomposing the monolith is Bounded Contexts and Context Mapping. A bounded context is an explicit boundary within which one model applies — and crucially, the same word (“Customer,” “Order”) can legitimately mean different things in different contexts. Large systems that force a single unified model collapse into inconsistency; that one bloated Customer class serving billing, shipping, and marketing is a chunk of your debt. The skill’s guidance is to start by mapping what exists (the Big Ball of Mud is itself a recognized pattern on the context map), then define target boundaries — often aligned with team boundaries, per Conway’s Law. The single most important defensive pattern is the Anti-Corruption Layer: when one part of the system talks to another (or to the legacy core you haven’t untangled yet), translate at the boundary so a foreign or outdated model never leaks into your clean one. This is what lets you build a clean new context beside the mud and protect it from contamination — the foundation of the Strangler Fig approach to incremental migration.
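An anti-corruption layer is often just a small translator that is the only code allowed to know the legacy shape. The sketch below assumes a hypothetical legacy customer row with cryptic column names (`CUST_NO`, `EMAIL_1`, `EMAIL_2`, `CR_LIM`, `STATUS_CD`) and a new Billing context that wants its own narrow model; every name and encoding rule here is invented for illustration.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class BillingCustomer:
    """The Billing context's own model: only what billing needs, in billing's terms."""
    customer_id: str
    billing_email: str
    credit_limit: Decimal
    is_delinquent: bool


class LegacyCustomerTranslator:
    """Anti-corruption layer: the single place that understands the legacy record."""

    # Assumed legacy status codes that billing treats as delinquent.
    DELINQUENT_CODES = {"D", "X"}

    def to_billing(self, legacy_row: dict) -> BillingCustomer:
        return BillingCustomer(
            customer_id=str(legacy_row["CUST_NO"]),
            # Assumed legacy rule: EMAIL_2 is the billing address when present.
            billing_email=legacy_row["EMAIL_2"] or legacy_row["EMAIL_1"],
            # Assumed legacy encoding: credit limit stored as integer cents.
            credit_limit=Decimal(legacy_row["CR_LIM"]) / 100,
            is_delinquent=legacy_row["STATUS_CD"] in self.DELINQUENT_CODES,
        )
```

Code inside the Billing context only ever sees `BillingCustomer`. When the legacy core is eventually untangled or replaced, the translator changes and the new context does not.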

Prompt

Use the domain-driven-design skill to map the bounded contexts hiding inside our monolith: identify where the same term like Customer or Order means different things in billing versus shipping versus support, propose explicit context boundaries aligned with our team structure, and design an anti-corruption layer so a new cleanly modeled Billing context can talk to the legacy core without letting the old tangled model leak into it

Domain-Driven Design

Inside each context, the building blocks give you the consistency rules that tame the tangle: Entities have identity that persists across change; Value Objects are defined entirely by their attributes and should be immutable (prefer them — most things people model as entities are really value objects); and Aggregates are small clusters with a single root that enforces invariants and is the only thing the outside references (by ID, never by object reference). Keeping aggregates small is how you fix the slow loads and concurrency conflicts a giant legacy object graph produces. Domain Events (OrderPlaced, PaymentReceived, named in past tense) let contexts react to each other asynchronously instead of through the synchronous spaghetti of direct calls — decoupling cause from effect and giving you a natural audit trail. And strategic design tells you where to spend your limited effort: identify the Core Domain where your competitive advantage lives and invest your best modeling there, while treating supporting and generic subdomains (auth, email, payments) as candidates to buy or wrap thinly rather than lovingly refactor. You do not pay down debt uniformly; you pay it down hardest where the business value is.
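The building blocks above fit together in very little code. The following is a deliberately small sketch, with hypothetical fields and invariants, of an immutable value object (`Money`), a past-tense domain event (`OrderPlaced`), and an aggregate root (`Order`) that references its customer by ID and guards its own rules.

```python
from dataclasses import dataclass, field
from decimal import Decimal


@dataclass(frozen=True)
class Money:
    """Value object: immutable and compared by its attributes, not by identity."""
    amount: Decimal
    currency: str

    def __add__(self, other):
        if other.currency != self.currency:
            raise ValueError("currency mismatch")
        return Money(self.amount + other.amount, self.currency)


@dataclass(frozen=True)
class OrderPlaced:
    """Domain event: a fact that already happened, named in the past tense."""
    order_id: str
    customer_id: str
    total: Money


@dataclass
class Order:
    """Aggregate root: the only entry point, and the enforcer of order invariants."""
    order_id: str
    customer_id: str                                # reference by ID, not a loaded Customer
    lines: list = field(default_factory=list)       # (sku, Money) pairs
    placed: bool = False
    events: list = field(default_factory=list)      # collected for a dispatcher to publish

    def add_line(self, sku, price):
        if self.placed:
            raise ValueError("cannot change a placed order")
        self.lines.append((sku, price))

    def total(self):
        if not self.lines:
            raise ValueError("an empty order has no total")
        prices = [price for _, price in self.lines]
        result = prices[0]
        for price in prices[1:]:
            result = result + price
        return result

    def place(self):
        if not self.lines:
            raise ValueError("cannot place an empty order")
        self.placed = True
        self.events.append(OrderPlaced(self.order_id, self.customer_id, self.total()))
```

Other contexts would subscribe to `OrderPlaced` rather than being called directly from `place()`, which is what removes the synchronous coupling.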

Prompt

Use the domain-driven-design skill to redesign our legacy Order object, a 2,000-line god aggregate that loads dozens of related records and causes constant concurrency conflicts, using DDD aggregate rules: identify the true consistency boundary, shrink the aggregate to a small root plus only what must be immediately consistent, reference other aggregates by ID instead of loading them, and replace the synchronous cross-module calls with domain events like OrderPlaced that other contexts react to

Domain-Driven Design

Your checklist

Common mistakes

Proposing the big-bang rewrite. The cardinal error. The old system keeps accumulating fixes and edge cases while you rebuild; the rewrite ships late, missing years of hard-won behavior, and the business runs out of patience. Every skill in this stack assumes incremental change instead, because incremental change is the only kind that keeps shipping.

Cleaning before writing a single test. Renaming and extracting in untested legacy code feels productive and is actually gambling — you cannot know you preserved behavior. Pin behavior with characterization tests at the change point first, then clean. Coverage grows along the paths you actually touch; that beats a dedicated “testing project” that never gets funded.

Trying to fix everything at once. A codebase buried in debt has thousands of problems, and trying to address them globally produces a stalled, unmergeable mega-branch. Triage instead: a spot that changes once gets a sprout or a wrap; the same spot changing again has earned its tests and its refactor. Spend your effort where change is frequent and where the Core Domain value is.

Refactoring and changing behavior in one commit. When a commit does both, a failing test can’t tell you which change broke things and review can’t separate structure from behavior. Split every such change into a structure-only commit (tests stay green) and a behavior-only commit.

Letting the agent “modularize” into a swarm of tiny classes. Ask an AI agent to make legacy code modular and it will create fifteen shallow classes and call it architecture. That is classitis, and it raises complexity rather than lowering it. Hold the agent to the deep-module principle: hide real machinery behind simple interfaces, and merge shallow classes that always travel together.

Calling external services with no timeout. One of the most common causes of production outages in legacy systems. A dependency doesn’t have to fail to take you down; it only has to get slow while your threads block on it. Connect and read timeouts on every outbound call, plus a circuit breaker, are table stakes before you trust the system in production.

Mistaking microservices for architecture. Splitting a tangled monolith into services that share one database gives you a distributed monolith — all the coupling, plus network latency and partial failure. Apply the Dependency Rule and find the real bounded contexts inside the monolith before you distribute anything.

Silently fixing bugs you find while characterizing. When a characterization test reveals the code does something wrong, you will be tempted to fix it on the spot. Don’t — callers, reports, and customers may depend on the quirk. Pin the actual (wrong) behavior, document it with a ticket, and fix it later as a deliberate, separate, verifiable change.

Frequently asked questions

In what order should I install and apply these skills?

Follow the phase order in this guide, which deliberately does not follow the order in which the books were published. Start with working-with-legacy-code because untested code is legacy code and you need both a safety net and a comprehension map before anything else. Then refactoring-patterns for the safe transformations, clean-code for legibility, software-design-philosophy to attack complexity, and clean-architecture to draw module boundaries. pragmatic-programmer sets the habits that keep debt from returning, release-it hardens production, and domain-driven-design gives you the strategy to decompose the monolith. You don’t need all eight before you see value: getting through Phase 3 on the regions you actively work in already changes the team’s relationship with the code. Install the whole stack in one go with npx skills add wondelai/skills --all --global and pull each skill into a session as you reach its phase.

Where do I even start in a codebase this large?

Not with the worst file — with the file you have to change next. The legacy-code skill’s discipline is that coverage and cleanup follow change: you invest comprehension and tests precisely where a feature or bug fix is taking you, never speculatively across the whole system. Use an effect sketch to bound what your change can affect, find the pinch point where a few tests cover the most behavior, pin that behavior, then make the change. The second heuristic is frequency: the files that show up in every other commit (your git log and churn metrics will tell you) are where debt costs the most, so they earn investment first. The Core Domain — where your competitive advantage lives — is the third axis. The intersection of “I’m changing it anyway,” “it changes constantly,” and “it’s business-critical” is your starting point.
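For the frequency heuristic, a churn count is a few lines on top of `git log`. The sketch below parses the output of `git log --since="12 months ago" --name-only --pretty=format:` (an empty format leaves just the file paths, separated by blank lines); the function name and the twelve-month window are arbitrary choices for illustration.

```python
from collections import Counter


def churn_by_file(git_log_output):
    """Count how many commits touched each path.

    Expects the text printed by:
        git log --since="12 months ago" --name-only --pretty=format:
    """
    paths = (line.strip() for line in git_log_output.splitlines())
    return Counter(path for path in paths if path)


# Typical use from the repository root (not run here):
#   import subprocess
#   out = subprocess.run(
#       ["git", "log", "--since=12 months ago", "--name-only", "--pretty=format:"],
#       capture_output=True, text=True, check=True,
#   ).stdout
#   for path, commits in churn_by_file(out).most_common(20):
#       print(f"{commits:5d}  {path}")
```

The top of that list, intersected with the files your next change touches and the modules that carry core business value, is a reasonable first shortlist.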

How do I convince my team (or my manager) not to do the rewrite?

Make the incremental path concrete and the rewrite’s risk explicit. The rewrite asks the business to fund a second system that produces zero customer value until the day it fully replaces the first — during which the first keeps changing, so you’re aiming at a moving target and will miss years of undocumented edge cases. The incremental path, by contrast, ships value continuously: every phase in this guide leaves the system better and still running, and the worst case of any single step is a fast revert. Frame it in the Software Design skill’s terms — strategic programming invests 10–20% per change and compounds — and show a small, fast win first: pick one painful, frequently-changed module, put it under characterization tests, refactor it visibly, and let the reduced fear and faster next change speak for themselves. A demonstrated improvement on real code beats any slide about a rewrite.

How do tests help when I don’t even know what the correct behavior is supposed to be?

That is exactly the situation characterization tests are built for, and the reason they’re not called “specification tests.” You are not asserting what the code should do — you’re documenting what it actually does right now, because in a legacy system the actual behavior is the de facto spec that callers and customers depend on. The recipe sidesteps your ignorance: call the code, assert something obviously wrong, run it, and let the failure message tell you the real output, then pin that value. You don’t need a spec; the running code is the spec. If, while doing this, you discover behavior that’s clearly a bug, you still pin the current (wrong) value, file a ticket, and change it later as a separate deliberate step — because something downstream may quietly depend on the bug.
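The recipe is easier to see on a concrete case. In this sketch `legacy_shipping_fee` is a made-up stand-in for code nobody understands any more; the test pins whatever it returns today, including a discount that looks like a bug.

```python
def legacy_shipping_fee(weight_kg, country):
    """Hypothetical stand-in for untested legacy code whose rules nobody remembers."""
    fee = 4.0 + 1.5 * weight_kg
    if country != "US":
        fee *= 2
    if weight_kg > 20:
        fee = fee - 5  # a bug or a forgotten promotion; unknown
    return round(fee, 2)


def test_characterize_shipping_fee():
    # Step 1 (already done): assert legacy_shipping_fee(2, "US") == -1, a value
    # chosen to be obviously wrong, so the failure message reveals the real output.
    # Step 2: pin the values the failures reported.
    assert legacy_shipping_fee(2, "US") == 7.0
    assert legacy_shipping_fee(2, "DE") == 14.0
    # The heavy-parcel discount looks wrong, but it is pinned as-is.
    # Changing it is a separate, ticketed decision.
    assert legacy_shipping_fee(25, "US") == 36.5
```

With the values pinned, any refactoring of the fee calculation that shifts one of them fails immediately, which is exactly the feedback the untested code could not give you before.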

Can I extract microservices from this monolith as part of the refactor?

Eventually, and only after the boundaries are real. The most expensive mistake teams make when modernizing a debt-laden monolith is to carve it into services that still share one database and one fat data model. The result is a distributed monolith, which keeps all the coupling and adds network latency and partial failure. The correct sequence is to find the genuine seams inside the monolith first: use the Domain-Driven Design skill to map bounded contexts, the Clean Architecture skill’s Common Closure and Acyclic Dependencies principles to verify that a candidate boundary groups things that change together and has no dependency cycles, and an anti-corruption layer to decouple the new context from the legacy core. Once a context is cleanly bounded, well-tested, and communicating through explicit interfaces and domain events, extracting it to a service is a deployment decision rather than an architectural gamble. The Strangler Fig pattern (grow the clean context beside the mud and gradually route traffic to it) is the low-risk path.
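Checking a candidate boundary for dependency cycles is mechanical once you have a module-level dependency map. The sketch below assumes that map is already available as a dictionary from module name to the modules it imports (how you extract it depends on your language and tooling) and uses a depth-first search to return one cycle if any exists. The function name and the example module names are illustrative.

```python
def find_cycle(deps):
    """Return one dependency cycle as a list of module names, or None if acyclic.

    `deps` maps each module to an iterable of the modules it depends on.
    """
    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on the current path, fully explored
    color = {}
    path = []

    def visit(node):
        color[node] = GREY
        path.append(node)
        for nxt in deps.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GREY:
                # nxt is already on the current path: that slice is the cycle.
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in deps:
        if color.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None
```

For example, `find_cycle({"billing": ["orders"], "orders": ["shipping"], "shipping": ["billing"]})` returns `["billing", "orders", "shipping", "billing"]`, which tells you those three modules cannot yet be separated without first breaking one of the edges. Running a check like this in CI keeps a boundary acyclic after you have paid to make it so.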

Start with the safety net

A codebase buried in technical debt did not get that way overnight, and it will not get fixed overnight either — but it does not need to. The way out is not a heroic rewrite that bets the company; it is a sequence of small, verifiable moves, each backed by a framework that thousands of engineers have already proven, each leaving the system better and still shipping. The skills in this stack turn that literature into something your AI agent can apply directly to your code, on your hardest files.

Install the whole stack with a single command:

npx skills add wondelai/skills --all --global

Then open the file you have to change next and start at Phase 1: tell your agent to use the working-with-legacy-code skill to draw an effect sketch and pin the current behavior with characterization tests. Once the net is in place under the code you’re about to touch, every subsequent step becomes safe — and the mountain starts coming down one verified step at a time.

For the adjacent journeys, read From Vibe-Coded Prototype to Production-Ready Product if your starting point is AI-generated rather than aged, and Improve an Existing App With AI Skills for the broader product-and-engineering view of leveling up a system you already run.

Related guides

How to Create a New App with AI Skills

How to Improve an Existing Business with AI Skills

How to Improve an Existing Website with AI Skills
