A skill is a product. Iterate it like one.
Shipping a skill is easy. The hard part starts when you use it daily. Here is how three weeks of real use forced me to simplify, test, and generalise my UX walkthrough skill to make it actually work.
00 Where evaluations fall short
In May, I built ux-audit to scan codebases for UX issues. It passed all my initial tests, but daily use showed three clear flaws:
- Hard to read: The output read like an essay and was impossible to scan on a phone.
- Too subjective: The advice was mostly about taste, which led to debates rather than fixes.
- Hardcoded: It only worked on one repository.
01 Cite design rules to end debates
Opinions like “this form feels slow” are easy to argue with. Citing a design rule makes the issue clear.
I added Nielsen’s Heuristics and the Laws of UX to the skill. I also added a rule: always cite one design principle at the end of a finding.
Impact: On a 375px screen, 32px touch targets increase tap errors. Standard is 44px. (Fitts's Law)
This forces the model to ground its feedback in real design principles while keeping the report short.
02 Format for quick scanning
An audit report is still a user interface. If a developer or product manager is reading it on a phone, they need the core details fast.
I changed the format to start with a simple summary table:
## Summary
- Score: 8/10 (up from 7.5)
- Issues: 2 Critical, 4 Important, 3 Polish
- Focus: Show transaction fee on checkout
I also renamed it from ux-audit to ux-walkthrough. An audit implies checking boxes. A walkthrough focuses on the user flow.
03 Test changes against real examples
Judging changes by vibes leads to regressions. To keep things honest, I saved the original version of the skill and ran both versions on the same tasks.
I added clear assertions: verify the summary table exists, check for heuristic citations, and make sure the file and line numbers match real files.
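Assertions like these can be written as plain checks over the report text. The following is a minimal sketch, not the harness used here: the function names, the `## Summary` heading match, the `(Principle Name)` citation pattern, and the `path/to/file.ext:LINE` reference format are all assumptions for illustration.

```python
import re
from pathlib import Path

# Assumed formats (hypothetical, adjust to your own report layout):
#   summary heading: a line that reads "## Summary"
#   citation:        "(Fitts's Law)" at the end of a line
#   file reference:  "src/checkout/form.py:42"
SUMMARY_RE = re.compile(r"^## Summary\s*$", re.MULTILINE)
CITATION_RE = re.compile(r"\(([A-Z][^()]*)\)\s*$", re.MULTILINE)
FILE_REF_RE = re.compile(r"([\w./-]+\.\w+):(\d+)")


def has_summary(report: str) -> bool:
    """True if the report contains a summary heading."""
    return SUMMARY_RE.search(report) is not None


def count_citations(report: str) -> int:
    """Number of lines that end with a capitalised, parenthesised citation."""
    return len(CITATION_RE.findall(report))


def broken_file_refs(report: str, repo_root: Path) -> list[str]:
    """Return file:line references that do not point at a real line."""
    broken = []
    for rel_path, line_no in FILE_REF_RE.findall(report):
        target = repo_root / rel_path
        if not target.is_file():
            broken.append(f"{rel_path}:{line_no}")
            continue
        # Count lines so an out-of-range line number is also caught.
        with target.open(encoding="utf-8", errors="replace") as fh:
            line_count = sum(1 for _ in fh)
        if not 1 <= int(line_no) <= line_count:
            broken.append(f"{rel_path}:{line_no}")
    return broken
```

Each check returns data rather than raising, so the same functions can score an old and a new report side by side.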
| Metric       | Old Version | New Version |
|--------------|-------------|-------------|
| Tests Passed | 84%         | 100%        |
| Time         | 238s        | 235s        |
| Tokens       | 99k         | 106k        |
The old version failed some of the new assertions (84% passed versus 100%), while time and token use stayed roughly flat. That confirmed the new rules were working without breaking anything else.
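A side-by-side run can be as small as the sketch below. It assumes each skill version can be wrapped as a function from a task input to report text, and each assertion as a function from report text to a boolean; `Runner`, `Check`, `pass_rate`, and `compare` are hypothetical names, not part of any real tool.

```python
from typing import Callable

Runner = Callable[[str], str]   # task input -> report text (hypothetical wrapper)
Check = Callable[[str], bool]   # report text -> pass/fail


def pass_rate(run: Runner, tasks: list[str], checks: list[Check]) -> float:
    """Fraction of (task, check) pairs that pass for one skill version."""
    passed = total = 0
    for task in tasks:
        report = run(task)  # run the skill once per task, then apply every check
        for check in checks:
            passed += bool(check(report))
            total += 1
    return passed / total if total else 0.0


def compare(old: Runner, new: Runner, tasks: list[str],
            checks: list[Check]) -> dict[str, float]:
    """Score both versions on the same tasks with the same checks."""
    return {
        "old": pass_rate(old, tasks, checks),
        "new": pass_rate(new, tasks, checks),
    }
```

Because both versions see identical tasks and checks, any gap in the two rates comes from the skill change and not from the inputs.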
04 Generalise only after it works
Building for abstract scenarios leads to over-engineering. I kept the skill locked to one project until the format was stable.
Once it worked well, I added a fallback setup: try to read the persona from the user prompt; if that is empty, look for it in the repository; if that fails, use the default settings.
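Expressed as code instead of skill instructions, that fallback order might look like the sketch below. The file name `PERSONA.md`, the default persona text, and the function name are assumptions made for illustration; the source does not say where in the repository the persona lives.

```python
from pathlib import Path
from typing import Optional

# Hypothetical default; replace with whatever your skill falls back to.
DEFAULT_PERSONA = "first-time user on a mobile device"


def resolve_persona(prompt_persona: Optional[str], repo_root: Path,
                    persona_file: str = "PERSONA.md") -> str:
    """Pick a persona: user prompt first, then the repository, then the default."""
    # 1. Prefer a persona given directly in the user prompt.
    if prompt_persona and prompt_persona.strip():
        return prompt_persona.strip()
    # 2. Otherwise look for a persona file in the repository (assumed location).
    candidate = repo_root / persona_file
    if candidate.is_file():
        text = candidate.read_text(encoding="utf-8").strip()
        if text:
            return text
    # 3. Fall back to the default settings.
    return DEFAULT_PERSONA
```

Ordering the sources from most specific to least specific means a project never needs configuration to get a usable result, but can override the default when it has one.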
05 Checklist
- Use the skill in real work until you find a repeatable issue.
- Save a copy of the working version before making edits.
- Add test assertions that only the new version can pass.
- Run both versions on the same input and compare the data.
- Generalise only after the skill works on a specific project.