A skill is a product. Iterate it like one.

Shipping a skill is easy. The hard part starts when you use it daily. Here’s how three weeks of real use forced me to simplify, test, and generalise my UX walkthrough skill to make it actually work.

June 2026 · 3 min read

00 Where evaluations fall short

In May, I built ux-audit to scan codebases for UX issues. It passed all my initial tests, but daily use showed three clear flaws:

Hard to read: The output read like an essay and was impossible to scan on a phone.
Too subjective: The advice was mostly about taste, which led to debates rather than fixes.
Hardcoded: It only worked on one repository.

Evals vs. reality: Tests only show that the skill matches the specification. Real use shows whether the specification was right in the first place.

01 Cite design rules to end debates

Opinions like “this form feels slow” are easy to argue with. Citing a design rule makes the issue clear.

I added Nielsen’s Heuristics and the Laws of UX to the skill. I also added a rule: always cite one design principle at the end of a finding.

Impact: On a 375px screen, 32px touch targets increase tap errors. Standard is 44px. (Fitts's Law)

This forces the model to ground its feedback in real design principles while keeping the report short.

02 Format for quick scanning

An audit report is still a user interface. If a developer or product manager is reading it on a phone, they need the core details fast.

I changed the format to start with a simple summary table:

## Summary
- Score: 8/10 (up from 7.5)
- Issues: 2 Critical, 4 Important, 3 Polish
- Focus: Show transaction fee on checkout

I also renamed it from ux-audit to ux-walkthrough. An audit implies checking boxes. A walkthrough focuses on the user flow.

03 Test changes against real examples

Judging changes by feel leads to regressions. To keep things honest, I saved the original version of the skill and ran both versions on the same tasks.

I added clear assertions: verify the summary table exists, check for heuristic citations, and make sure the file and line numbers match real files.
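The three assertions above could be sketched roughly as follows. This is a minimal illustration, not the skill's actual test harness: the function name `check_report`, the `path:line` reference format, and the short list of principle names are all assumptions.

```python
import re
from pathlib import Path

# Hypothetical subset of principle names; the real skill's list may differ.
PRINCIPLES = ("Fitts's Law", "Hick's Law", "Jakob's Law")

# Assumed reference format: a relative path with an extension, then ":<line>".
FILE_REF = re.compile(r"([\w./-]+\.\w+):(\d+)")


def check_report(report: str, repo_root: Path) -> list[str]:
    """Return a list of failed assertions for one generated report."""
    failures = []

    # 1. The report must contain a summary section.
    if "## Summary" not in report:
        failures.append("missing summary")

    # 2. At least one finding must cite a known design principle.
    if not any(f"({name})" in report for name in PRINCIPLES):
        failures.append("no heuristic citation")

    # 3. Every file:line reference must point at a real line in a real file.
    for path, line in FILE_REF.findall(report):
        target = repo_root / path
        if not target.is_file():
            failures.append(f"unknown file: {path}")
            continue
        line_count = len(target.read_text(errors="replace").splitlines())
        if not 1 <= int(line) <= line_count:
            failures.append(f"line out of range: {path}:{line}")

    return failures
```

An empty list means the report passed. Returning the failures as text, not a single boolean, makes it easier to see which rule a given version broke.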

| Metric       | Old Version | New Version |
|--------------|-------------|-------------|
| Tests Passed | 84%         | 100%        |
| Time         | 238s        | 235s        |
| Tokens       | 99k         | 106k        |

The old version failed the new assertions, while the new version passed everything. That confirmed the new rules were working without breaking existing behaviour.

Good tests: A test that passes on both the old and new code does not measure your change. Build tests that actively show the difference.
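One way to check this is to record, per assertion, whether each version passed, and then flag the assertions that cannot tell the two versions apart. The sketch below assumes results are stored as plain dictionaries mapping an assertion name to a pass/fail boolean; the assertion names are hypothetical.

```python
def classify_assertions(old: dict[str, bool], new: dict[str, bool]) -> dict[str, list[str]]:
    """Group assertions by what they tell you about the change."""
    groups = {"measures_change": [], "regression": [], "no_signal": []}
    for name in sorted(old.keys() & new.keys()):
        if new[name] and not old[name]:
            groups["measures_change"].append(name)  # only the new version passes
        elif old[name] and not new[name]:
            groups["regression"].append(name)       # the edit broke something
        else:
            groups["no_signal"].append(name)        # same result on both versions
    return groups


# Hypothetical results from running both versions on the same input.
old_run = {"has_summary": False, "cites_principle": False, "valid_file_refs": True}
new_run = {"has_summary": True, "cites_principle": True, "valid_file_refs": True}
print(classify_assertions(old_run, new_run))
```

An assertion that lands in `no_signal` can still be worth keeping as a regression guard, but it says nothing about whether the edit did its job.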

04 Generalise only after it works

Building for abstract scenarios leads to over-engineering. I kept the skill locked to one project until the format was stable.

Once it worked well, I added a fallback setup: try to read the persona from the user prompt; if that is empty, look for it in the repository; if that fails, use the default settings.
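In the skill itself this fallback is presumably written as instructions, not code, but the order of precedence is easier to see as a small function. The sketch below is illustrative only: `resolve_persona`, the `docs/persona.md` location, and the default persona text are assumptions.

```python
from pathlib import Path
from typing import Optional

# Assumed default and file location; both are placeholders for illustration.
DEFAULT_PERSONA = "first-time user on a mobile device"
PERSONA_FILE = "docs/persona.md"


def resolve_persona(prompt_persona: Optional[str], repo_root: Path) -> str:
    """Pick a persona: user prompt first, then the repository, then the default."""
    # 1. An explicit persona in the user prompt wins.
    if prompt_persona and prompt_persona.strip():
        return prompt_persona.strip()

    # 2. Otherwise look for a persona file in the repository.
    candidate = repo_root / PERSONA_FILE
    if candidate.is_file():
        text = candidate.read_text(encoding="utf-8").strip()
        if text:
            return text

    # 3. Fall back to the built-in default.
    return DEFAULT_PERSONA
```

Because each step returns as soon as it finds a value, the most specific source always wins, and the skill still works on a repository that has no persona defined.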

05 Checklist

Use the skill in real work until you find a repeatable issue.

Save a copy of the working version before making edits.

Add test assertions that only the new version can pass.

Run both versions on the same input and compare the data.

Generalise only after the skill works on a specific project.

Try it yourself: the ux-walkthrough skill is free to download. Drop it into ~/.claude/skills/ and run it on your own codebase.