How to Improve an Existing App with AI Skills

The app already works. It has paying users, a deploy pipeline, and a year or two of git history. It also has a UserService that nobody wants to touch, a checkout flow that occasionally times out under load, three different spellings of “cancelled” in the database, and a settings page that generates a steady trickle of support tickets. None of this is an emergency. All of it is slowly making the next feature more expensive than the last. This is the most common — and most underserved — situation in software: the app that needs to get better, not get rebuilt.

The instinct to rewrite is almost always wrong. A rewrite throws away years of accumulated edge cases your current code quietly handles, ships late, and reproduces the same problems in new syntax. The disciplined alternative is to improve the running system in small, verified increments — and that is exactly the kind of work an AI coding agent is good at, if you give it the right judgment. An agent will happily reformat 4,000 lines and call it a refactor. What it lacks on its own is the seasoned engineer’s sense of which improvement matters, what order to do it in, and how to change code safely when there are no tests to catch a regression.

That judgment is what these eleven skills encode. Each one packages a canonical engineering book — Martin’s Clean Code, Fowler’s Refactoring, Feathers’ Working Effectively with Legacy Code, Ousterhout’s A Philosophy of Software Design, Nygard’s Release It!, Kleppmann’s Designing Data-Intensive Applications, and more — into a skill your agent loads on demand. Installed, they turn “clean this up” from a vague instruction into a specific, book-grounded discipline the agent can execute and defend.

This guide sequences them into one workflow that mirrors how a careful team actually improves a shipped product: understand and stabilize before you change, change safely under a net of tests, then deepen the design, harden the runtime, fix the data layer, and finish with the UX polish users actually feel. You can run the whole sequence over a quarter, or pull a single phase for a single afternoon. Either way, the moves are concrete and the order is deliberate — you are about to learn not just which skill to reach for, but why each one comes when it does.

You don’t rewrite a shipped app. You change it one verified step at a time — and the order of those steps is the whole game.

Phase 1 — Establish a quality baseline you can defend

Before you change anything, you need a shared, objective definition of what “good” means for this code — otherwise every review devolves into taste arguments and every cleanup is unfalsifiable. Start with Clean Code. Its central claim is that code is read far more often than it is written — well over a 10:1 ratio — so every naming choice and function boundary either adds clarity or adds cost. That gives you a yardstick, not a vibe.

The skill scores code 0-10 against six disciplines: meaningful names, small single-purpose functions, comments that explain why and never what, error handling kept separate from business logic, clean tests, and a catalog of code smells. Crucially, it doesn’t just grade; it tells you the specific moves to reach 10. Point it first at a module that is both feared and frequently changed, and the score becomes your before-and-after evidence.

Prompt

Use the clean-code skill to review the UserService class and score it 0-10 on naming, function size, comment discipline, and error handling, then list the top five concrete fixes in priority order with the line ranges they touch

Clean Code

Run it on the parts of the app that change most often, not the parts that scare you most — a frightening module nobody touches isn’t costing you anything. Watch especially for the smells the skill flags as expensive: duplication (the costliest), functions that need a comment to explain what they do (extract and name the block instead), magic numbers, and functions that both change state and return a value. Those are the ones that compound.
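As a minimal sketch of two of those smells, a magic number and a function that both changes state and returns a value, here is one way the fix can look in Python. The Account class and MAX_LOGIN_ATTEMPTS are hypothetical names chosen for illustration; they do not come from the skill or from any real codebase.

```python
from dataclasses import dataclass

# Named constant replaces a bare 3 scattered through the code (magic number).
MAX_LOGIN_ATTEMPTS = 3


@dataclass
class Account:
    failed_attempts: int = 0

    # Before (smell): one method mutated state AND returned a value, so callers
    # could not ask "is it locked?" without also recording a failure:
    #   def fail(self): self.failed_attempts += 1; return self.failed_attempts >= 3

    def record_failed_login(self) -> None:
        """Command: changes state, returns nothing."""
        self.failed_attempts += 1

    def is_locked(self) -> bool:
        """Query: answers a question, changes nothing."""
        return self.failed_attempts >= MAX_LOGIN_ATTEMPTS
```

Splitting the command from the query means a caller can check the lock state as often as it likes without side effects.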

Prompt

Use the clean-code skill to scan the checkout module for the specific smells in your catalog — duplication, magic numbers, flag arguments, functions doing more than one thing, and returning null — and group the findings by smell type so I can fix one class of problem at a time

Clean Code

The discipline to internalize here is the Boy Scout Rule: leave each file a little cleaner than you found it. Wire this skill into your normal review flow so it runs on every pull request, and the baseline ratchets upward on its own instead of through heroic cleanup sprints. One caution the skill itself insists on — never let it refactor and add behavior in the same step, and never let it clean code that has no tests yet. That second constraint is exactly what Phase 3 solves.

Phase 2 — Refactor safely, one named transformation at a time

Now that you can name what’s wrong, you need to fix structure without changing behavior — and prove you didn’t. Refactoring is the skill for this, and its core principle is non-negotiable: refactoring is not rewriting. It is a sequence of small, behavior-preserving transformations, each backed by tests — verify green, apply one change, verify green, commit. Big-bang rewrites fail precisely because they fuse structural and behavioral change, so when something breaks you can’t tell which edit did it.

The skill carries Fowler’s full catalog, mapping each code smell to the named transformation that fixes it. A 200-line method is a Long Method: apply Extract Method to split it into named steps. A switch on a type code calls for Replace Conditional with Polymorphism. The same (startDate, endDate) pair threaded through ten signatures calls for Introduce Parameter Object. Deeply nested if/else calls for Replace Nested Conditional with Guard Clauses. The value of named refactorings is that they’re mechanical and reversible, and your agent can apply them in exactly the disciplined loop the book prescribes.
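To make one of these concrete, the sketch below shows Replace Nested Conditional with Guard Clauses on a small function. The shipping-fee rules and function names are invented for illustration; the point is only that the two versions return the same result for every input.

```python
def shipping_fee_nested(order_cents: int, is_member: bool, is_express: bool) -> int:
    """Before: the happy path is buried three levels deep."""
    if order_cents > 0:
        if is_member:
            if is_express:
                return 500
            else:
                return 0
        else:
            if is_express:
                return 1500
            else:
                return 700
    else:
        return 0


def shipping_fee_guarded(order_cents: int, is_member: bool, is_express: bool) -> int:
    """After: each special case exits early, leaving a flat main path."""
    if order_cents <= 0:
        return 0
    if is_member:
        return 500 if is_express else 0
    return 1500 if is_express else 700
```

Because the change is purely structural, a test that compares both versions across all input combinations is enough to confirm that behavior did not move.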

Prompt

Use the refactoring-patterns skill to untangle the calculateInvoiceTotal function, which has nested conditionals and three responsibilities knotted together — propose a sequence of named refactorings, tell me which test to confirm green before each step, and apply them one at a time so behavior never changes

Refactoring

The single most important transformation is Extract Method, and the heuristic for when to use it is beautifully simple: if you’re about to write a comment explaining what a block does, extract the block and make the comment its name. Lean on the Rule of Three to avoid premature abstraction: write the code once, tolerate the duplicate the second time, and extract on the third occurrence. And use preparatory refactoring before you add a feature: clean the insertion point first so the new code drops into a tidy spot, rather than wedging it into a mess.
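A small sketch of that heuristic, assuming a hypothetical invoice function that works in integer cents and basis points: each "what" comment in the original becomes the name of an extracted function, and the arithmetic is left untouched.

```python
def invoice_total_before(items: list[tuple[int, int]], tax_basis_points: int) -> int:
    # compute the subtotal in cents
    subtotal = 0
    for price_cents, quantity in items:
        subtotal += price_cents * quantity
    # add tax, rounding down to whole cents
    tax = subtotal * tax_basis_points // 10_000
    return subtotal + tax


# After Extract Method: the comments have become function names.
def subtotal_cents(items: list[tuple[int, int]]) -> int:
    subtotal = 0
    for price_cents, quantity in items:
        subtotal += price_cents * quantity
    return subtotal


def tax_cents(subtotal: int, tax_basis_points: int) -> int:
    return subtotal * tax_basis_points // 10_000


def invoice_total(items: list[tuple[int, int]], tax_basis_points: int) -> int:
    subtotal = subtotal_cents(items)
    return subtotal + tax_cents(subtotal, tax_basis_points)
```

The extracted version deliberately keeps the same loop and the same integer division, so the refactoring changes structure only.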

Prompt

Use the refactoring-patterns skill to do a preparatory refactoring of the pricing engine before I add tiered discount logic — clean the area where the new code will plug in first so it lands cleanly, and keep the refactoring commit separate from the feature commit

Refactoring

The hard prerequisite this skill keeps stating: it only works when tests already cover the code. Refactor red and you’re flying blind. If the module you want to clean has no tests — which, for a shipped app, is the usual case — you do not skip ahead. You go to Phase 3 first.

Phase 3 — Get untested code under a net before you touch it

Most of a real app is what Michael Feathers bluntly defines as legacy code: not old code, not ugly code — code without tests. Without tests you cannot know whether a change preserves behavior, so every edit is a gamble. Legacy Code is the skill that breaks the central dilemma — to change code safely you need tests, but to get tests in place you often have to change code — with a fixed, safe algorithm.

The five steps: identify change points, find test points, break dependencies, write characterization tests, then make the change. The key reframe is the characterization test: you are not testing what the code should do, you are pinning down what it actually does right now, quirks included — because in a legacy system the real behavior is the de facto spec that callers and customers already depend on. The recipe is to assert something deliberately absurd, read the failure message to learn the true value, then pin that value.
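The recipe can be sketched as follows. The legacy_line_price_cents function is a made-up stand-in for untested production code, and its truncating bulk discount is the kind of quirk a characterization test pins rather than corrects.

```python
def legacy_line_price_cents(unit_price_cents: int, quantity: int) -> int:
    """Hypothetical legacy code: 10% bulk discount that truncates instead of rounding."""
    total = unit_price_cents * quantity
    if quantity >= 10:
        total = total * 9 // 10  # quirk: 3296.7 becomes 3296, not 3297
    return total


def test_characterize_bulk_discount() -> None:
    # Step 1 (temporary): assert something deliberately absurd, for example
    #     assert legacy_line_price_cents(333, 11) == -1
    # and read the real value out of the failure message.
    # Step 2: pin the value the code actually produces today.
    assert legacy_line_price_cents(333, 11) == 3296


def test_characterize_no_discount_below_ten() -> None:
    assert legacy_line_price_cents(333, 9) == 2997
```

A reviewer might argue the result should round to 3297, but the test records what callers already depend on; changing that is a separate, deliberate behavior change.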

Prompt

Use the working-with-legacy-code skill to write characterization tests that pin the current behavior of the OrderService pricing path — including the rounding quirks and the weird tax edge cases — by asserting absurd values first, reading the real output, and locking it in, so I have a safety net before I refactor

Legacy Code

The thing that usually blocks a test is a dependency: a constructor that opens a real database connection, a hard-wired call to a payment gateway, a static singleton. The skill’s answer is seams — places you can substitute behavior without editing in that place — and a catalog of least-invasive dependency-breaking moves. The workhorse is Parameterize Constructor with a production default: existing callers compile untouched, while tests inject a fake. Always pick the cheapest seam that unblocks you, because these edits happen before the safety net exists.
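Here is a minimal sketch of Parameterize Constructor with a production default. SmtpClient, NotificationService, and FakeMailer are illustrative stand-ins, not a real library's classes; the real client is simulated by one that fails whenever it is asked to send.

```python
class SmtpClient:
    """Stand-in for a real client that would need a live mail server."""

    def send(self, to: str, body: str) -> None:
        raise RuntimeError("no mail server available in tests")


class NotificationService:
    def __init__(self, mailer=None) -> None:
        # Production default: callers that pass nothing still get the real client,
        # so no existing call site has to change.
        self._mailer = mailer if mailer is not None else SmtpClient()

    def notify(self, to: str, body: str) -> bool:
        self._mailer.send(to, body)
        return True


class FakeMailer:
    """Test double that records messages instead of sending them."""

    def __init__(self) -> None:
        self.sent: list[tuple[str, str]] = []

    def send(self, to: str, body: str) -> None:
        self.sent.append((to, body))
```

In a test, `NotificationService(mailer=FakeMailer())` exercises the notification logic without touching the network, while production code keeps calling `NotificationService()` exactly as before.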

Prompt

Use the working-with-legacy-code skill to find the seams in the NotificationService constructor, which news up its own SMTP client and Redis connection and so cannot be instantiated in a test — break those dependencies with the least invasive technique, preserving every existing call site exactly

Legacy Code

When you genuinely cannot get an area under test in the time you have, the skill offers the honest tactical move: Sprout Method or Sprout Class. Write the new behavior as fresh, fully tested code and call it from a single line in the untested host. The old behavior stays byte-for-byte intact, the new behavior arrives verified, and the risk is contained to one line. It’s not a permanent excuse — track the sprout as debt and cover the host next time you’re in there — but it lets you ship safely today.
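A sketch of Sprout Class under stated assumptions: FraudScorer, its scoring rules, and the threshold are all invented for illustration, and process_payment stands in for a long untested host method whose existing logic is elided in comments.

```python
class FraudScorer:
    """New code, written test-first, with no dependency on the legacy host."""

    REVIEW_THRESHOLD = 50  # hypothetical cut-off

    def score(self, payment: dict) -> int:
        points = 0
        if payment["amount_cents"] > 100_000:
            points += 40
        if payment["billing_country"] != payment["card_country"]:
            points += 40
        return points

    def is_suspicious(self, payment: dict) -> bool:
        return self.score(payment) >= self.REVIEW_THRESHOLD


def process_payment(payment: dict) -> str:
    # ... imagine hundreds of lines of untested legacy logic here, unchanged ...
    if FraudScorer().is_suspicious(payment):  # the one sprouted call site
        return "held_for_review"
    # TODO: host method is still untested; track it as test debt in your tracker.
    # ... remaining legacy logic, unchanged ...
    return "charged"
```

All of the new behavior can be tested through FraudScorer alone, so the risk added to the host is limited to the single call site.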

Prompt

Use the working-with-legacy-code skill to add fraud-scoring to the 600-line processPayment method, which is too tangled to cover right now — use Sprout Class to write the fraud logic as new test-driven code, wire it in with a single call line, then leave a TODO tracking the host as test debt

Legacy Code

If your whole codebase is at this stage — buried in untested debt, every change a regression risk — this phase is the heart of a larger effort, and our deeper companion guide Refactor a codebase buried in technical debt walks the full campaign.

Phase 4 — Deepen the design, don’t just tidy it

Clean functions and a test net make code safe to change. They don’t make it simple to change. For that you need Software Design, built on Ousterhout’s thesis that the greatest limitation in writing software is our ability to understand the systems we create — complexity is the enemy. Where Clean Code works line-by-line, this skill works at the level of modules and interfaces.

Its sharpest tool is the distinction between deep and shallow modules. A deep module hides powerful functionality behind a simple interface — Unix file I/O is the canonical example. A shallow module has a complex interface relative to what it does, so it adds complexity instead of absorbing it. The skill warns against “classitis”: the disease of splitting things into many small classes, each adding interface cost without adding depth. Counter-intuitively, the cure for an over-engineered module is often to merge classes, not split further.

Prompt

Use the software-design-philosophy skill to audit the notifications module — fifteen tiny classes and eight interfaces for what should be a simple job — and tell me which classes are shallow, where information is leaking between them, then recommend how to consolidate them into one or two deep modules with a simpler interface

Software Design

The other red flag this skill hunts is information leakage — one design decision reflected in multiple modules, so a change ripples everywhere. A classic cause is temporal decomposition: splitting code by when things happen (“read config”, then “process config”, then “write config”) instead of by what knowledge it owns, which forces every phase to share the same assumptions. The fix is to organize modules around knowledge, so each one encapsulates a decision the others don’t need to know.
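As one illustration of organizing around knowledge, the sketch below puts a currency-rounding and display decision in a single place. The Money class, the half-up rounding policy, and the two caller functions are assumptions made for the example, not the skill's prescribed design.

```python
from decimal import Decimal, ROUND_HALF_UP


class Money:
    """Owns two decisions: how amounts are rounded and how they are displayed."""

    _CENT = Decimal("0.01")

    def __init__(self, amount: str) -> None:
        # Assumed policy for this sketch: round half up to whole cents.
        self._amount = Decimal(amount).quantize(self._CENT, rounding=ROUND_HALF_UP)

    def display(self) -> str:
        return f"${self._amount:,.2f}"


# Callers ask Money for the result instead of re-implementing the rules.
def api_total(total: Money) -> dict:
    return {"total": total.display()}


def email_line(total: Money) -> str:
    return f"Your total is {total.display()}."
```

If the rounding rule or the display format changes, only Money changes; the API layer and the email template never learn that a decision was made.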

Prompt

Use the software-design-philosophy skill to trace where our date-format and currency-rounding logic leaks across the API layer, the PDF generator, and the email templates, then design a single deep module that owns each decision so a change lives in exactly one place

Software Design

The mindset shift the skill is really selling is strategic over tactical programming: invest 10-20% extra effort treating every change as a chance to improve structure, instead of taking the quick-and-dirty path of the “tactical tornado,” who ships fast and leaves wreckage for everyone else to clean up. For a shipped app this is how you stop the slow accretion of complexity that makes month twelve’s features cost triple month one’s.

Phase 5 — Make it survive production, not just pass QA

A well-designed app can still fall over the first time a downstream dependency gets slow. Release It! exists for the gap Nygard names exactly: the software that passes QA is not the software that survives production — production is hostile. This is where you make the running system resilient, and it’s the phase teams skip until an outage forces it.

The skill’s most valuable content is its catalog of stability anti-patterns and the patterns that counter them. The number-one killer is integration points — every HTTP call, socket, or queue is a risk. The subtle killer is slow responses, which are worse than outright failures because they tie up threads and propagate delay up the call chain until pools exhaust. The counters are a small, well-known set: a timeout on every outbound call (connect and read), a circuit breaker that trips after a failure threshold and stops a cascade, a bulkhead that isolates connection pools per dependency so one failure can’t drain them all, and retry with exponential backoff and jitter to avoid a thundering herd on recovery.
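Two of those counters can be sketched in a few lines. This is a simplified, single-threaded circuit breaker plus a full-jitter backoff helper, with hypothetical names and default thresholds; a production system would more likely use a maintained resilience library and would still need a timeout on the wrapped call itself.

```python
import random
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is refused without being tried."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_seconds=30.0, clock=time.monotonic):
        self._threshold = failure_threshold
        self._reset_after = reset_after_seconds
        self._clock = clock  # injectable so tests can control time
        self._failures = 0
        self._opened_at = None

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                raise CircuitOpenError("dependency marked unhealthy; failing fast")
            # Cool-down elapsed: half-open, so let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self._threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        self._failures = 0
        self._opened_at = None
        return result


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 5.0, rng=random.random) -> float:
    """Full jitter: a random delay between 0 and min(cap, base * 2**attempt) seconds."""
    return rng() * min(cap, base * 2 ** attempt)
```

Once the breaker is open, callers get an immediate CircuitOpenError instead of waiting on a slow dependency, which is what keeps threads and pools from being tied up during the outage.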

Prompt

Use the release-it skill to audit every outbound call in our payment and search integrations for missing resilience — find the calls with no timeout, no circuit breaker, and a shared connection pool, then implement timeouts, a circuit breaker with sensible thresholds, and a bulkhead that isolates each dependency

Release It!

The skill also reframes operability as a design concern, not an afterthought. Deploy and release are separate operations — put code on servers safely, then expose it to users behind a feature flag with an emergency off switch. Health checks should be deep (verify the database, cache, and queue are reachable), not merely shallow (the process is alive), so your load balancer stops routing traffic to an instance whose dependencies are dead. And you alert on symptoms users feel — error rate, latency — not on causes like CPU.
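A deep health check can be sketched as a function that runs one probe per dependency and reports each result. The function name, the response shape, and the choice of 503 for an unhealthy instance are assumptions for this example; the probes themselves would be whatever cheap call proves each dependency is reachable.

```python
from typing import Callable


def deep_health(checks: dict[str, Callable[[], None]]) -> tuple[int, dict]:
    """Run every probe; a probe signals failure by raising an exception."""
    results: dict[str, str] = {}
    for name, probe in checks.items():
        try:
            probe()
            results[name] = "ok"
        except Exception as exc:
            results[name] = f"fail: {exc}"
    healthy = all(status == "ok" for status in results.values())
    body = {"status": "ok" if healthy else "unhealthy", "checks": results}
    # A non-2xx status lets the load balancer take this instance out of rotation.
    return (200 if healthy else 503), body
```

A web handler for /health would call this with probes for the database, cache, and queue, then return the status code and body as JSON. Probes should carry their own short timeouts so the health check cannot hang on a dead dependency.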

Prompt

Use the release-it skill to replace our shallow /health endpoint, which only confirms the process is running so the load balancer keeps routing traffic to instances whose database connection is dead — design a deep health check that verifies the database, Redis, and the job queue, and returns a structured status for each

Release It!

One important boundary the skill draws itself: the chaos-engineering material is for planning what to test, not for an agent to go inject failures into your production system autonomously. Use it to design the experiments; have authorized engineers run them with real tooling and rollback plans. For the broader system-level view of resilience, the Pragmatic Programmer skill in Phase 10 pairs naturally here.

Phase 6 — Fix the data layer, because data outlives code

The most expensive mistakes in a shipped app hide in the data layer, and they surface at the worst time — under load, after a region failover, or when one customer’s data grows past what your test data ever modeled. Data-Intensive Apps brings Kleppmann’s discipline to bear, anchored on a principle every team eventually learns the hard way: data outlives code. Frameworks get rewritten; the data persists for decades, so its correctness and evolvability dominate.

The single highest-leverage thing this skill will surface is your database’s actual default isolation level — because almost nobody knows it, and it’s almost never serializable. Most databases default to read committed or snapshot isolation, which permits anomalies like write skew: two transactions read the same rows, each makes a decision, and each writes a different record, with no row lock to stop them. That’s the bug behind double-booked inventory and overdrawn balances that “can’t happen” according to the code.
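The anomaly is easier to see in a toy model. The following is a pure-Python simulation, not real database code: each "transaction" decides using only the snapshot it can see and then inserts its own row, which mirrors how write skew slips past row-level conflict detection. All names are hypothetical.

```python
from typing import Optional


def try_reserve(visible: list[str], stock: int, order_id: str) -> Optional[str]:
    """Decide from the rows this transaction can see; return the row to insert, if any."""
    return order_id if len(visible) < stock else None


def run_with_shared_snapshot(stock: int, orders: list[str]) -> list[str]:
    """All transactions start before any commits, so all see the same empty snapshot."""
    committed: list[str] = []
    snapshot = list(committed)
    for order_id in orders:
        row = try_reserve(snapshot, stock, order_id)
        if row is not None:
            committed.append(row)  # different rows, so no write-write conflict
    return committed


def run_serially(stock: int, orders: list[str]) -> list[str]:
    """Each transaction sees everything committed before it."""
    committed: list[str] = []
    for order_id in orders:
        row = try_reserve(list(committed), stock, order_id)
        if row is not None:
            committed.append(row)
    return committed
```

With one unit in stock and two orders, the shared-snapshot run commits both reservations while the serial run commits only the first. Serializable isolation or an explicit lock on the rows being read are ways of forcing the second outcome.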

Prompt

Use the ddia-systems skill to explain why our inventory reservation occasionally oversells when two orders hit the same item at once — name the write-skew anomaly that allows it under our database default isolation level, and show me whether to fix it with SELECT FOR UPDATE, a serializable transaction, or a redesign that avoids the race

Data-Intensive Apps

The skill is equally strong on the slow-burn scaling problems. It catches unbounded result sets — a list endpoint with no LIMIT that works fine in test and OOMs in production once the data grows — and it reasons about replication lag (the cause of “I saved it but it’s not there” read-your-writes bugs) and partitioning hotspots (the celebrity problem, where one popular key downs a cluster). When you’re choosing or evolving a datastore, it forces the trade-off into the open instead of defaulting to whatever you already use.
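One common remedy for an unbounded list endpoint is keyset pagination with a hard cap on page size. The sketch below uses the standard-library sqlite3 module and a hypothetical items table; the cap of 100 and the cursor convention are assumptions for the example.

```python
import sqlite3

MAX_PAGE_SIZE = 100  # assumed hard cap, so no caller can request an unbounded result


def fetch_page(conn: sqlite3.Connection, after_id: int = 0, page_size: int = 50):
    """Return (rows, next_cursor); next_cursor is None when the page was not full."""
    page_size = max(1, min(page_size, MAX_PAGE_SIZE))
    rows = conn.execute(
        # Seeks on the primary key, so cost does not grow with the offset.
        "SELECT id, name FROM items WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, page_size),
    ).fetchall()
    next_cursor = rows[-1][0] if len(rows) == page_size else None
    return rows, next_cursor
```

The client passes the returned cursor back as `after_id` to get the next page. Unlike OFFSET-based paging, the query reads only the rows it returns, which is what keeps it fast as the table grows.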

Prompt

Use the ddia-systems skill to review our list and search endpoints for unbounded queries that will degrade as tables grow, then propose a pagination strategy and the indexes each query needs so they stay fast at ten times our current row count

Data-Intensive Apps

The throughline is to make every consistency, durability, and latency trade-off explicit and deliberate. An app that survived its first year on accidental defaults will hit a wall; this skill finds the wall before your users do.

Phase 7 — Polish the microinteractions users feel every day

With the engine sound, attention turns to the surface — the thousand small moments users touch without thinking. Microinteractions brings Dan Saffer’s framework, and its premise is the one most engineering teams underrate: the difference between a product you tolerate and a product you love is almost always in the microinteractions. Users feel a dead button or a silent save even when they can’t articulate it.

Every microinteraction has the same four-part structure — Trigger, Rules, Feedback, Loops & Modes — and the skill audits each. The most common failure it catches is missing or late feedback: an action with no immediate visual response, when the bar for direct manipulation is under 100ms. The fix is rarely a separate toast; it’s animating the element the user already touched — the button depresses and its label becomes “Saving…”, the checkbox fills, the field shows an inline checkmark.

Prompt

Use the microinteractions skill to audit the five most-used interactions in our app — saving a record, submitting the form, toggling a setting, deleting an item, and the loading state — using the Trigger, Rules, Feedback, Loops and Modes structure, and specify the exact feedback each one needs so nothing ever feels dead

Microinteractions

The skill is rigorous about edge cases and states, which is where shipped apps quietly break: the empty state, the zero case, the maximum, the double-tap, the interrupted action. It pushes you to map every state — empty, loading, partial, full, error, disabled — for each interaction. And it warns against the opposite failure, feedback overload, where every action fires an animation, sound, and toast until the noise drowns the signal: use the smallest feedback that communicates.
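Mapping states is easier to review when the states and transitions are written down explicitly. Below is a language-neutral model of a save button, expressed in Python for consistency with the other sketches in this guide; the state names, events, and labels are hypothetical, and a real front end would implement the same table in its own framework.

```python
from enum import Enum


class SaveState(Enum):
    IDLE = "idle"
    SAVING = "saving"
    SAVED = "saved"
    FAILED = "failed"


# Every allowed (state, event) pair; anything not listed is ignored.
TRANSITIONS = {
    (SaveState.IDLE, "press"): SaveState.SAVING,
    (SaveState.SAVING, "success"): SaveState.SAVED,
    (SaveState.SAVING, "error"): SaveState.FAILED,
    (SaveState.SAVED, "edit"): SaveState.IDLE,
    (SaveState.FAILED, "press"): SaveState.SAVING,  # retry
}

# Feedback lives on the button itself rather than in a separate toast.
LABELS = {
    SaveState.IDLE: "Save",
    SaveState.SAVING: "Saving…",
    SaveState.SAVED: "Saved",
    SaveState.FAILED: "Retry save",
}


def next_state(state: SaveState, event: str) -> SaveState:
    """Unknown events leave the state unchanged, which absorbs double-taps."""
    return TRANSITIONS.get((state, event), state)
```

Because a second "press" while saving is not in the table, a double-tap cannot start a second save, and every state has a label, so the button never looks dead.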

Prompt

Use the microinteractions skill to design the full feedback sequence for our save button, which currently gives no indication anything happened until the page silently refreshes — cover the pressed state, the in-progress state, success, and the failure case where the network call errors out, animating the button itself rather than showing a separate toast

Microinteractions

Phase 8 — Remove the friction that makes users think

Polish makes interactions feel good; usability makes the app understandable. UX Heuristics brings Krug and Nielsen, built on the title that says it all — “Don’t Make Me Think.” Every question mark that pops into a user’s head is cognitive load, and users don’t read, they scan; they don’t choose optimally, they satisfice; they don’t figure things out, they muddle through. Design for what users actually do.

The skill runs a heuristic evaluation against Nielsen’s ten heuristics with 0-4 severity ratings, so you get a prioritized list, not a wall of nitpicks. It targets the friction patterns that quietly cost completion: jargon labels (“Persist” instead of “Save”), missing system-status feedback, error messages that state a problem but not the fix, mystery-meat icons with no labels, and forms with no inline validation. For a shipped app, this is the fastest way to find why a specific flow leaks users.

Prompt

Use the ux-heuristics skill to run a heuristic evaluation of our settings page and account flow against Nielsen’s ten heuristics, rate each violation 0-4 for severity, and give me a single fix list ordered by severity times how often users hit it

UX Heuristics

Two of its tools are especially useful mid-life. The Trunk Test — drop a user on any page cold and ask whether they can tell what site this is, what page, what their options are, and where search lives — exposes navigation that made sense to the team but disorients real users. And the relentless “get rid of half the words” rule cuts the happy-talk and instructions nobody reads, making the content that matters prominent. Error messages get the specific three-part treatment: what happened, why, and how to fix it.

Prompt

Use the ux-heuristics skill to rewrite the error and empty states across our checkout flow so each one says what went wrong, why, and exactly how to fix it in plain language, and cut every page down by removing the polite filler and instructions users skip

UX Heuristics

Phase 9 — Make it fast where speed is a feature

A usable app that loads slowly still feels broken, and speed problems are almost always misdiagnosed. High Perf Browser brings Grigorik’s grounding in how browsers and networks actually work, and its core correction reframes the entire problem: latency, not bandwidth, is the bottleneck. Most web performance pain comes from too many round trips, not too little throughput — so throwing a bigger pipe at a slow page does almost nothing.

The skill diagnoses against Core Web Vitals with concrete targets — LCP under 2.5s, INP under 200ms, CLS under 0.1 — and ties each to a specific cause and fix. LCP is usually the hero image or heading block loading late: preload it and raise its fetchpriority. INP is the main thread blocked by a long task: break it up and defer non-critical JavaScript. CLS is content that shifts after render: reserve space with explicit dimensions. These aren’t guesses; they’re mapped to the rendering pipeline.
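The targets above can be turned into a simple triage check. This sketch assumes you already have field or lab measurements in a dictionary; the key names and the hint wording are made up for the example, and a value exactly at a target is treated as passing.

```python
# Targets from the text: LCP 2.5 s, INP 200 ms, CLS 0.1.
TARGETS = {"lcp_s": 2.5, "inp_ms": 200, "cls": 0.1}

# First place to look for each metric, paraphrasing the causes listed above.
HINTS = {
    "lcp_s": "preload the hero element and raise its fetch priority",
    "inp_ms": "break up long main-thread tasks and defer non-critical JavaScript",
    "cls": "reserve space with explicit dimensions",
}


def failing_vitals(measured: dict[str, float]) -> dict[str, str]:
    """Return a hint for each measured metric that exceeds its target."""
    return {
        metric: HINTS[metric]
        for metric, value in measured.items()
        if metric in TARGETS and value > TARGETS[metric]
    }
```

For a dashboard measuring an LCP of 5.0 seconds, an INP of 350 ms, and a CLS of 0.05, this flags LCP and INP and leaves CLS alone.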

Prompt

Use the high-perf-browser skill to diagnose why our dashboard scores poorly on Core Web Vitals — LCP around five seconds and a high INP — by finding the render-blocking resources and long main-thread tasks, then give me a fix list ordered by impact to get LCP under 2.5 seconds and INP under 200 milliseconds

High Perf Browser

It also fixes the architectural mistakes that quietly tax every page load: render-blocking scripts that should be deferred, static assets without content-hashed immutable caching so repeat visitors re-download everything, and HTTP/1.1-era workarounds (domain sharding, sprites, file concatenation) that actively hurt once you’re on HTTP/2. The skill knows the protocol generation matters and undoes the workarounds that no longer apply.
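Content-hashed immutable caching comes down to two small decisions: put a hash of the file's bytes in its name, and send a long-lived header only for files named that way. The helper names and the eight-character digest length below are assumptions for this sketch; most build tools do the renaming for you.

```python
import hashlib
from pathlib import PurePosixPath


def hashed_name(path: str, content: bytes) -> str:
    """Insert a short content hash before the extension, e.g. app.<hash>.js."""
    digest = hashlib.sha256(content).hexdigest()[:8]
    p = PurePosixPath(path)
    return str(p.with_name(f"{p.stem}.{digest}{p.suffix}"))


def cache_control(is_content_hashed: bool) -> str:
    """Hashed assets never change under the same URL, so they can be cached for a year."""
    if is_content_hashed:
        return "public, max-age=31536000, immutable"
    # Unhashed entry points such as HTML must be revalidated on every load.
    return "no-cache"
```

When a file's content changes, its name changes too, so repeat visitors re-download only what actually changed and serve everything else from cache.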

Prompt

Use the high-perf-browser skill to review our resource loading and caching strategy — find the render-blocking scripts that should be deferred, the static assets missing content-hashed immutable cache headers, and any leftover HTTP/1.1 workarounds like domain sharding that hurt us now that we are on HTTP/2

High Perf Browser

The sibling effort on the marketing surface — landing pages, conversion paths, SEO — lives in Improve an existing website with AI skills, which leans on this same skill from the front-of-site angle.

Phase 10 — Lock in the habits that keep debt from coming back

The improvements so far are point fixes. Pragmatic Programmer installs the meta-principles that stop the rot from returning — Hunt and Thomas’ systems-level view of craftsmanship, summed up as care about your craft, and treat every line of code as a living asset that must earn its place.

Two principles do the heavy lifting for a maturing app. The first is DRY: every piece of knowledge has a single authoritative representation. The skill is careful to note that DRY is about knowledge, not textual similarity; two identical-looking blocks serving different business rules are not duplication, and merging them couples things that should move independently. The second is orthogonality: components so independent that changing the database doesn’t break the UI, and swapping the auth provider doesn’t touch business logic. The test it offers is sharp: “if I dramatically change the requirements behind this function, how many modules are affected?” The answer should be one.
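Orthogonality usually shows up in code as a narrow interface that business logic depends on, with vendor details hidden behind it. In this sketch the PaymentGateway protocol, InMemoryGateway, and checkout function are hypothetical; a real adapter would wrap the vendor's SDK behind the same charge method, which is deliberately not shown here.

```python
from typing import Protocol


class PaymentGateway(Protocol):
    """The only payment knowledge business logic is allowed to have."""

    def charge(self, amount_cents: int, customer_id: str) -> str:
        """Charge the customer and return a charge reference."""
        ...


class InMemoryGateway:
    """Adapter used in tests; a vendor-backed adapter would expose the same method."""

    def __init__(self) -> None:
        self.charges: list[tuple[int, str]] = []

    def charge(self, amount_cents: int, customer_id: str) -> str:
        self.charges.append((amount_cents, customer_id))
        return f"charge-{len(self.charges)}"


def checkout(gateway: PaymentGateway, customer_id: str, cart_cents: list[int]) -> str:
    """Business logic: knows totals and customers, nothing about any vendor."""
    return gateway.charge(sum(cart_cents), customer_id)
```

Swapping payment providers then means writing one new adapter; checkout and its tests do not change, which is the "one module affected" answer the test above asks for.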

Prompt

Use the pragmatic-programmer skill to evaluate our codebase for orthogonality and find the worst coupling — the places where changing one thing forces edits in several unrelated modules — then recommend where to introduce a repository or adapter layer so the database and the Stripe integration can each be swapped without touching business logic

Pragmatic Programmer

The skill’s most quietly powerful idea for a shipped app is the Broken Window Theory: one unrepaired bad design or “we’ll fix it later” hack gives permission for the next, and entropy accelerates. The discipline is to fix it now or board it up — a TODO with a real ticket, never a bare one. Combined with reversibility (abstract third-party dependencies behind your own interfaces so no vendor API leaks into business logic, and keep risky changes behind feature flags), this is how you keep the gains from this whole guide from eroding the moment you stop paying attention.
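Finding bare TODOs is mechanical enough to script. The sketch below flags TODO, FIXME, and HACK markers that carry no ticket reference; the ticket pattern assumes tracker keys shaped like PAY-123, which is an illustrative convention you would replace with your own, and it will miss markers on lines that happen to contain a similar-looking token.

```python
import re

MARKER = re.compile(r"\b(TODO|FIXME|HACK)\b")
# Assumed ticket format: uppercase project key, hyphen, number (for example PAY-123).
TICKET = re.compile(r"[A-Z][A-Z0-9]+-\d+")


def bare_todos(source: str) -> list[tuple[int, str]]:
    """Return (line_number, line) for every marker with no ticket on the same line."""
    return [
        (number, line.strip())
        for number, line in enumerate(source.splitlines(), start=1)
        if MARKER.search(line) and not TICKET.search(line)
    ]
```

Run over each source file in a pre-merge check, this turns "fix it now or board it up" into something a pull request can fail on.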

Prompt

Use the pragmatic-programmer skill to find the broken windows in our codebase — the untracked TODOs, the commented-out blocks, the one-off hacks with no ticket — and for each one either propose the fix or write a properly tracked boarding-up so the neglect stops spreading

Pragmatic Programmer

Phase 11 — Submit the result to a brutal, honest review

You’ve done the work. The last question is the one teams flinch from: is it actually good, or just better? Steve Jobs Design Review closes the loop by holding the whole experience to a binary standard — insanely great, or not done — starting from the principle “you’ve got to start with the customer experience and work backwards to the technology.”

This skill is not a feel-good retrospective. It runs a structured review: experience the product cold as a customer, name the one thing it must do, audit against simplicity and focus, and deliver a verdict with a specific cut list and fix list. Its most useful pressure is the demand to subtract — every feature you added during the year is a candidate for deletion, and “focusing is about saying no.” It will tell you what to remove, which is the move no other phase in this guide makes.

Prompt

Use the steve-jobs-design-review skill to review our app end to end the way Steve Jobs would — walk it cold as a new user, name the one thing it must do, count the steps to that core value, and tell me what to cut rather than what to add, ending with a binary verdict and a ranked fix list

Steve Jobs Design Review

It’s also the skill that audits the back of the fence — the surfaces nobody demos but users still feel: the empty states, the error copy, the 404 page, the billing screen, the cancellation flow, the receipt email. Jobs’ carpenter doesn’t use plywood on the back of the cabinet; this skill holds your unseen screens to your hero-screen standard, because that’s where users subconsciously read your craft.

Prompt

Use the steve-jobs-design-review skill to audit the back-of-the-fence surfaces in our app — the empty states, error messages, the cancellation flow, and the transactional emails — and tell me which ones fall below the quality bar of our main screens, with the specific fix each one needs

Steve Jobs Design Review

It pairs deliberately with the earlier UX phases: where UX Heuristics finds usability problems and Microinteractions adds polish, this skill judges whether the whole is something to be proud of. That’s the right note to end an improvement effort on.

Common mistakes

Reaching for the rewrite. The single most expensive mistake. A rewrite discards the edge cases your current code silently handles and reproduces the same problems in new code. Every skill here is built on changing the running system incrementally — refactoring-patterns and working-with-legacy-code exist precisely so you never have to choose between “untouchable” and “rewrite.”

Refactoring code that has no tests. This is editing and praying. refactoring-patterns only preserves behavior when tests can confirm it did; without them you can’t tell a clean-up from a regression. The fix is strict ordering: working-with-legacy-code first to get a characterization-test net, then refactoring-patterns. Never the reverse.

Mixing structural and behavioral changes in one commit. When a combined commit breaks something, you cannot tell which edit did it. Both refactoring-patterns and working-with-legacy-code insist on the discipline: refactor in one commit (tests green before and after), change behavior in a separate one. Never wear both hats at the same time.

“Cleaning up” by adding more abstraction. Over-engineering masquerades as quality. software-design-philosophy is explicit that classitis — many shallow classes, each adding interface cost — increases complexity, and the cure is often to merge, not split. If an abstraction doesn’t hide significant complexity behind a simpler interface, it’s not pulling its weight.

Treating performance as a bandwidth problem. Teams add CDN tiers and bigger instances when the real cost is round trips and a blocked main thread. high-perf-browser is built on latency, not bandwidth, being the bottleneck — measure Core Web Vitals and fix the specific cause before spending on throughput.

Assuming the database default protects you. Most apps run on read-committed or snapshot isolation and assume serializable guarantees they don’t have. ddia-systems surfaces the write-skew and lost-update anomalies this permits — the bugs that “can’t happen” right up until a concurrent request makes them happen in production.

Polishing the demo path and ignoring the back of the fence. The error states, empty states, and cancellation flows are where users subconsciously judge your craft, and they’re the surfaces nobody reviews. steve-jobs-design-review audits exactly these, and microinteractions maps the empty/error/edge states the happy path skips.

Frequently asked questions

In what order should I actually run these eleven skills?

The order in this guide is deliberate: baseline the code (clean-code), refactor safely where tests already exist (refactoring-patterns), get untested code under a net (working-with-legacy-code), deepen the design (software-design-philosophy), harden the runtime (release-it), fix the data layer (ddia-systems), then the user-facing layers (microinteractions, ux-heuristics, high-perf-browser), and finally lock in habits (pragmatic-programmer) and judge the whole (steve-jobs-design-review). The one hard dependency is that working-with-legacy-code must precede refactoring-patterns for any untested module, so for that code Phase 3 comes before Phase 2: tests before structural change, always. Everything else you can reorder to match where your app hurts most. A slow checkout starts at Phase 9; a fragile integration starts at Phase 5.

Can I use just one skill instead of the whole sequence?

Yes — each is self-contained and useful alone. If your only complaint is a slow page, install high-perf-browser and stop there. If one tangled service is blocking every feature, working-with-legacy-code plus refactoring-patterns is a complete two-skill workflow. The sequence is the maximal version for a team doing a deliberate quality push over a quarter; the skills don’t require each other except for that single tests-before-refactoring rule.

How do I keep the AI agent from doing too much at once?

This is the central risk, and the skills are designed around it. The instruction that matters most is to demand small, verified steps. Ask the agent to apply one named refactoring at a time and run the tests between each — that’s the explicit refactoring-patterns loop. Have it commit structural and behavioral changes separately. And tell it to show you the characterization tests before it changes any legacy code, so you’ve confirmed the safety net exists. An agent left to “clean this up” will reformat everything; an agent told to “apply Extract Method to this one block, confirm the suite is green, and commit” stays controllable.
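One such step might look like the sketch below: a single Extract Method, with a check that behavior is unchanged before committing. The invoice functions, the discount rule, and the sample data are all hypothetical.

```python
def invoice_total_before(lines, tax_rate):
    """Original: the subtotal and discount logic is inlined."""
    subtotal = 0.0
    for qty, unit_price in lines:
        subtotal += qty * unit_price
    if subtotal > 100:
        subtotal *= 0.9  # bulk discount
    return round(subtotal * (1 + tax_rate), 2)


def discounted_subtotal(lines):
    """Extracted method: same statements, moved verbatim and given a name."""
    subtotal = 0.0
    for qty, unit_price in lines:
        subtotal += qty * unit_price
    if subtotal > 100:
        subtotal *= 0.9  # bulk discount
    return subtotal


def invoice_total_after(lines, tax_rate):
    """After the one refactoring: structure changed, behavior identical."""
    return round(discounted_subtotal(lines) * (1 + tax_rate), 2)


def test_refactoring_preserves_behavior():
    cases = [
        ([], 0.2),                          # empty invoice
        ([(1, 40.0)], 0.2),                 # below the discount threshold
        ([(2, 30.0), (1, 50.0)], 0.2),      # above the threshold
    ]
    for lines, tax_rate in cases:
        assert invoice_total_before(lines, tax_rate) == invoice_total_after(lines, tax_rate)
```

In a real codebase the old function would be edited in place rather than kept beside the new one; both are shown here only so the equivalence check fits in one block. The agent's loop is then: extract, run the suite, commit, and only afterwards consider the next named refactoring.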

We have almost no tests. Where do we even start?

Start with working-with-legacy-code, and don’t try to test everything — coverage follows change. Characterize only the branches your next change will touch. Find the seam that’s blocking instantiation (usually a constructor doing real work), break it with the least invasive move, pin the current behavior, then make your change inside the net. When even that isn’t feasible in the time you have, use Sprout Method or Sprout Class to add new tested code beside the untested host with a single call line. Coverage grows along the paths you actually touch — which beats a dedicated “testing project” that never gets funded.
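The two moves can be sketched together. In this hypothetical example, legacy_order_total stands in for an untested host, the characterization test pins what it does today for the branches the change touches, and loyalty_credit is the sprouted, separately tested function that the host reaches through a single added line.

```python
def loyalty_credit(total: float, loyalty_years: int) -> float:
    """Sprouted function: new behavior, written and tested beside the host."""
    if loyalty_years >= 3:
        return min(total * 5 / 100, 20.0)  # 5% credit, capped at 20
    return 0.0


def legacy_order_total(order: dict) -> float:
    """Stand-in for an untested legacy host (imagine far more code here)."""
    total = order["subtotal"]
    if order["country"] != "US":
        total += 15  # flat international fee, pinned as the code behaves today
    # The single sprouted call line added to the host:
    total -= loyalty_credit(total, order.get("loyalty_years", 0))
    return total


def test_characterization_of_existing_behavior():
    # Pins what the host does today, not what anyone thinks it should do.
    assert legacy_order_total({"subtotal": 100, "country": "US"}) == 100
    assert legacy_order_total({"subtotal": 100, "country": "DE"}) == 115


def test_sprouted_loyalty_credit():
    assert loyalty_credit(100, 0) == 0.0
    assert loyalty_credit(100, 3) == 5.0
    assert loyalty_credit(1000, 5) == 20.0
```

The characterization test still passes after the sprouted line is added because orders without loyalty years get a credit of zero, which is the evidence that the existing paths were left alone.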

How is this different from the refactoring guide and the website guide?

This guide is the broad improvement workflow for an app in production: code, resilience, data, and UX together. “Refactor a codebase buried in technical debt” goes deep on the code-health campaign when debt is the dominant problem, treating the legacy-code, refactoring, and design-philosophy skills as a sustained effort. “Improve an existing website with AI skills” covers the front-of-site surface: conversion, messaging, SEO, and page speed for a marketing site rather than an application. Use this guide as your map; pull the specialized guides when one dimension dominates.

Start improving today

Pick the phase where your app hurts most — the scary service, the slow page, the flaky checkout, the data race — and install the skills:

npx skills add wondelai/skills --all --global

Then open your agent and point it at a real artifact: a class, a flow, an endpoint. Start with a clean-code score on your most-changed module to get an honest baseline, or jump straight to working-with-legacy-code if the code you need to change has no tests. Improve one verified step at a time, commit after each, and let the score climb.

When the engine is solid and you want to go deeper on the code-health side, continue with “Refactor a codebase buried in technical debt.” Your shipped app doesn’t need a rewrite. It needs the next small, disciplined step, and now you know exactly which one to take.

Related guides

How to Create a New App with AI Skills

How to Improve an Existing Business with AI Skills

How to Improve an Existing Website with AI Skills
