I had a mate who's still finding their feet with English ask me last week what I meant by something I'd said. Can't even remember the sentence now. Something throwaway. But watching them try to parse it, then watching myself try to explain it, was the thing that stuck with me.
Because the words weren't the problem. The words were fine. What they were missing was everything around the words. Who I was talking about. What we'd been chatting about earlier. The tone. The little half-joke buried in there that only really lands if you've lived in Australia for a bit. I had to unpack about ten minutes of context just to make one sentence make sense.
English does this constantly and we never notice. "We need to talk" from your boss isn't the same sentence as "we need to talk" from your mate at the pub. "I'm fine" can mean fine, or the exact opposite, sold entirely on the sigh. None of that is in the words. All of it is in the context.

Now swap my mate for an AI coding agent. You hand the model a sentence. Sometimes one line. It has to figure out what you meant, not what you typed. Except it doesn't know which repo you've been swearing at. Doesn't know "the auth thing" is the OAuth flow you rebuilt last Tuesday. Doesn't know your team banned that library, this repo uses tabs, that function is held together by a 2019 comment and hope.
So it guesses. And when it guesses wrong you get a beautifully formatted answer to a problem you never had.
Most of the noise right now is about the model. Bigger, smarter, faster. Fine. But that's not the broken bit for most people. The broken bit is the context. The harness around the model. What it sees, what it has to infer, how much of your actual situation makes it across the gap before it starts typing.
That's what this is about. The room, not the words in it.
What's a harness, then
I'm borrowing the word from Birgitta Böckeler's Harness Engineering, on Martin Fowler's site. Her definition is the cleanest one I've read: the harness is everything in an AI agent except the model itself. The controls around the model that increase your confidence in what comes out, and reduce how closely you have to supervise it.
Böckeler splits those controls two ways, and both splits matter for what follows.
The first split is guides versus sensors. Guides are feedforward. They steer the agent before it acts, so the first attempt is more likely to be the right one. Sensors are feedback. They observe what the agent did and flag problems after the fact, so it can self-correct. A spec is a guide. A failing test is a sensor.
The second split is computational versus inferential controls. Computational controls are deterministic. Tests, linters, type checkers. Fast, cheap, reliable. Inferential controls are AI judging AI. Slower and more expensive, but they can reason about whether the code is any good, not just whether it compiles.
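To make the two splits concrete, here is a minimal sketch that tags a few controls on both axes. The class and field names are my own illustration, not anything from Böckeler's article, and the placement of "AI code review" as a sensor is an assumption; the other two rows follow the examples given above.

```python
from dataclasses import dataclass
from enum import Enum


class Direction(Enum):
    GUIDE = "feedforward"   # steers the agent before it acts
    SENSOR = "feedback"     # observes what the agent did, after the fact


class Kind(Enum):
    COMPUTATIONAL = "deterministic"  # tests, linters, type checkers
    INFERENTIAL = "ai-judged"        # a model reasoning about a model's output


@dataclass(frozen=True)
class Control:
    name: str
    direction: Direction
    kind: Kind


# Illustrative placements only (hypothetical catalogue, not an official list).
CONTROLS = [
    Control("spec Q&A", Direction.GUIDE, Kind.INFERENTIAL),
    Control("failing test", Direction.SENSOR, Kind.COMPUTATIONAL),
    Control("AI code review", Direction.SENSOR, Kind.INFERENTIAL),  # assumption
]


def names_where(controls, direction=None, kind=None):
    """Return control names matching the given axis values (None = any)."""
    return [
        c.name
        for c in controls
        if (direction is None or c.direction is direction)
        and (kind is None or c.kind is kind)
    ]


guides = names_where(CONTROLS, direction=Direction.GUIDE)
cheap_sensors = names_where(
    CONTROLS, direction=Direction.SENSOR, kind=Kind.COMPUTATIONAL
)
```

The point of keeping the axes separate is that they answer different questions: direction tells you when a control fires, kind tells you what it costs and how much judgement it can apply.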
The other phrase from the article I keep coming back to is keep quality left. Catch the problem in the spec, not the PR. In the PR, not in prod. Earlier is cheaper. Every skill below is an attempt to push checks left.
Where I started
I've always been in on AI coding, mostly out of curiosity. The early attempts were weak. The model didn't understand best practices, solutions were overdone, first principles were out the window, context windows were tiny. Anything more than a small app or a tidy little refactor was fraught.
I remember a 1:1 with a junior developer back then. She was sheepish about how long she'd spent on a problem when the solution turned out simpler than she'd thought. We've all been there. There's a lot to be learned from that style of slog. Spending real time on something, braving through, finding the answer, and getting the win at the end. That challenge-carrot-payoff loop is one of the reasons this career still works for me.
It didn't take long for AI tools to start eating into that. I'd recently moved into an Engineering Manager role and was on the tools less anyway. I was using Cursor to build small things on the side. One of them, shakashuffle, was a quokka-themed planning poker app (the team estimation game where everyone reveals a card at once so the loudest voice doesn't anchor the room). I wanted my own team to use it because every other planning poker tool out there was clinical, joyless, and somehow expensive.
Getting the landing page to a level I was happy with using Claude 3.5 was a slog, but not the good kind. I spent more time steering the model away from regressions and bad patterns than I did making forward progress. Claude 3.7 with autonomous agents was a step up. Bigger context, better awareness of the codebase. Still not enough. I was writing rules in repo docs, scattering comments around, padding prompts with everything I thought it might need, and it would still trip over itself.
I'd played around with a few planning tools. When Amazon Kiro turned up with its style of spec-driven development, I was in. I'd already been deep in refining the flow from idea to implementation. The product vision, the UX research, the iterative narrowing, the task breakdown, the API schemas, the estimations. All the stuff a real team does before code gets written. Spec-driven development felt like a natural extension of that, just for one developer and an agent.
If you're a pure vibe coder, I'll admit, this might look like a lot of work. But if you want to stop the agent hallucinating details you didn't mention, or quietly refactoring three files it wasn't asked to touch, stick with me. This is the harness I use to stay in control of Claude, Kiro, Cursor, or whatever else, all of which (let's be honest) are gently working to abstract this stuff away from you.
The structure here is influenced by the paper Interpretable Context Methodology: Folder Structure as Agentic Architecture by Jake Van Clief and David McDermott at Eduba. The core idea: when planning, decisions, and context live in plain files, they stay readable, portable, and improvable. No vendor lock-in, no hidden state. The structure is part of the architecture.
My setup
For reference. Use whatever you want, this is just mine.
- IDE: WebStorm by JetBrains
- IDE plugin: Claude Code plugin
- LLM/agent: Claude Code
The four skills I've written to keep my development flow honest are all in open-coding-tools on GitHub:
- product-requirements-skill.md
- feature-development-skill.md
- code-review-skill.md
- testing-pyramid-skill.md
All of them write their artefacts to a .plans/ directory at the root of your repo.
How I actually use it
Same example end to end, because that's the only way the value clicks.
Say I have an idea. A web app that lists campervan-friendly camping locations with their amenities, but social, so people on the road can leave reviews and follow each other. I open Claude Code and prompt:
I have an idea for a web app that finds all the campervan-driveable camping locations and lists out their amenities. I want it to feel social. Create me a PRD for this using the skill.
Here's what happens.
product-requirements
product-requirements-skill.md on GitHub
---
name: prd-plan
description: Iteratively plan a Product Requirements Document (PRD) with the user via in-file checkbox Q&A, folding answers into the spec until signed off. Use when the user asks to "plan PRD-XXX", "create a PRD for X", "work on PRD-003", "iterate on PRD-XXX", or to turn a rough product idea into a fully specified PRD. PRDs are numbered sequentially (PRD-001, PRD-002, …).
metadata:
argument-hint: <PRD-XXX id, product name, or plan file path>
---
# PRD Plan Iteration
## Harness role
- **Control type:** guide (feedforward)
- **Regulation category:** behaviour
- **Lifecycle stage:** pre-design
- **Computational checks:** none — PRDs are upstream of code; the deterministic gate is declared later by the FDs that descend from this PRD
- **Inferential checks:** structured in-file Q&A on users, scope, data model, UX, and phasing — every question with concrete options and a recommendation
- **Frame:** [*Harness Engineering*](https://martinfowler.com/articles/harness-engineering.html) by Birgitta Böckeler
This skill runs a tight, in-file question-and-answer loop that builds and refines a Product Requirements Document (PRD)
until the user has signed it off. No code is written while this skill is running. The goal is a spec where every product
decision is explicit, every open question is closed, and every user correction has been folded back into the relevant
section(s) of the document — not just recorded as an answer.
A PRD is a complete product spec: who uses it, what it does, what the data model looks like, how the UX works, and how
the build is phased. It is written once, up front, and is the authoritative reference for the entire product build.
## When to use
Invoke this skill when:
- The user says "plan PRD-XXX", "create a PRD for X", "let's work on PRD-XXX", "iterate on PRD-XXX", "flesh out
PRD-XXX", or similar.
- The user has a rough product idea or an existing `.plans/product-requirements-document/` file they want to fully
specify.
- The user says "I answered the open questions, review and update" — that's the middle of this loop; resume it.
Do NOT invoke this skill when:
- The user is asking a quick question about an existing PRD. Just read it and answer.
- The user has asked for code changes directly. This skill never writes code.
## PRD document structure
Every PRD lives under `.plans/product-requirements-document/PRD-XXX - {slug}.md`, where `XXX` is a zero-padded
sequential number (PRD-001, PRD-002, …). Check the existing files in `.plans/product-requirements-document/` and pick
the next number. A template lives at `.plans/product-requirements-document/001-TEMPLATE.md` — copy it as the starting
point for a new PRD. A complete PRD has these sections (in order):
```
# PRD-XXX: {Product Name}
## Status
- [x] Open
- [ ] In-progress
- [ ] Partly implemented
- [ ] Done
## Overview
One paragraph: what the product is, who it's for, and the core problem it solves.
## Tech Stack
Framework, database, auth, any third-party services. Match existing monorepo patterns unless there's a reason not to.
## Users
Who uses this? What role(s) do they have? Is there public access? Authentication approach?
## Features
One H3 per major feature area. Each feature area lists what users can do (CRUD verbs, filters, actions). No implementation detail — just capabilities.
## Data Model
One table per entity showing Field, Type, Details. Include enums with their allowed values. Note FK relationships.
## UX Design
### Design Principles
2–4 one-liners that define the feel and priorities of the UI.
### Layout
ASCII wireframe(s) for the primary view(s). One diagram per major screen or state.
### Key Interactions
Named subsections for the most complex or non-obvious interactions (e.g. slide-out panel, inline confirmation, email generator drawer).
## Implementation Plan
Numbered phases, each with a short title and a bullet list of what gets built. Phases should be independently deployable or at least independently testable.
### Model routing
Each Implementation Plan phase includes a Model routing table so the cheapest viable model is picked deliberately rather than by reflex.
Template (paste verbatim into each phase, fill in per-step):
```markdown
#### Model routing
| Step | Model | Reason |
| ---- | ---------- | ------------------------------------------------------- |
| 1 | **haiku** | {one line — pure mechanical work} |
| 2 | **sonnet** | {one line — bounded judgement, well-specified} |
| 3 | **opus** | {one line — design / risky / cross-cutting / synthesis} |
If a sonnet/haiku step surfaces a non-trivial decision, escalate to the main session rather than guess.
```
**Rubric for picking the model:**
- **haiku** — pure mechanical: run a command, grep, count lines, file moves, single-line edits, install a dep, confirm
a number, paste known content. No judgement required; if the agent has to choose between options, it's not haiku-class.
- **sonnet** — bounded judgement: apply a pattern the PRD specifies, refactor following a recipe, fix lint errors with
clear rules, write boilerplate from a spec, classify hits into a fixed set of buckets. The agent makes small calls
inside well-defined rails.
- **opus** (= top-level session) — design, architecture, risky migrations, cross-cutting changes, synthesising multiple
inputs, anything where getting it wrong wastes more than the model-cost saving. Don't delegate this.
**Rules:**
- **Each step in the phase gets a row.** No row → no execution. If a step has no row, the Implementation Plan is incomplete.
- **Default towards cheaper.** If you're choosing between sonnet and opus and the work is well-specified by the PRD,
pick sonnet. Reserve opus for the parts where the PRD doesn't fully prescribe the answer.
- **The escalation line is mandatory.** Sonnet/haiku subagents must know they can return without guessing — they will
guess otherwise.
- **Re-evaluate after the design step.** Once the inventory/design step (usually opus) is done, the remaining steps are
often more mechanical than first thought; downgrade them if so.
## Open Questions
Active Q&A area (see format below). Moves to Decision Trail once all answered.
## Decision Trail
Table of resolved decisions (columns: ✅, Question, Decision, Why).
```
## Locating or creating the plan file
1. If the user passed a file path, read it.
2. If they passed an ID (e.g. `PRD-003`, `003`) or slug (e.g. `product-database`), glob for it under
`.plans/product-requirements-document/` and pick the matching file. Ignore `*TEMPLATE*` files.
3. If no file exists yet:
- Look at existing `PRD-XXX - *.md` files in `.plans/product-requirements-document/` and pick the next sequential
number (zero-padded to three digits).
- Read `.plans/product-requirements-document/001-TEMPLATE.md` and use it as the skeleton.
- Create the new file at `.plans/product-requirements-document/PRD-XXX - {slug}.md`, filling in what you know from
the user's description, and leaving sections as `_TBD_` placeholders where you need answers.
4. Confirm the file path and the assigned PRD number in your first reply so the user can correct you before you start
editing.
## The iteration loop
Each cycle has four beats:
### Beat 1 — Read current state
Read the full PRD file. Identify:
- Which sections are still vague, incomplete, or contradicting each other
- What's answered vs open in the Q&A section
- Any checkboxes the user has ticked or notes they've added since the last read
- Any new requirements the user added directly into prose sections
### Beat 2 — Process user answers
If the user has answered questions since last time:
- Fold each answer into EVERY section of the PRD it affects — not just the questions area. If an answer changes the data
model, rewrite the Data Model section. If it changes the layout, update the wireframe. If it changes what phases look
like, update the Implementation Plan.
- A decision is "processed" only when the PRD reads consistently as if the decision was always there. Leaving a resolved
answer orphaned in the Open Questions area while the main spec still says something contradictory is a failure mode —
do not do it.
- Watch for answers that contradict things written earlier. Reconcile explicitly; never silently keep the old wording.
- Watch for answers that reveal a project-wide preference (e.g. a locale/spelling rule, a tooling convention). These
belong in memory via the auto-memory system, not just in this one PRD.
### Beat 3 — Raise new questions
Surface new questions **in the PRD file** — NOT just in chat. The user works through these in their IDE and the PRD is
the durable artefact.
**In-file question format:**
For a question with discrete options, use checkboxes the user can tick directly in the file:
```markdown
#### QX. {short question title}
{One or two sentences explaining the question and why it matters for the product.}
- [ ] Option A — {short description of what this option means and when it's right}
- [ ] Option B — {short description}
- [ ] Option C — {short description}
- [ ] Other — notes:
- _{your note here}_
**Notes / reasoning:**
- _{anything you want me to know about the pick}_
```
Rules for writing questions:
- **Every question gets a "Notes" slot.** The user can always fill it with an alternative or a reason.
- **Give your current recommendation inline.** If you think Option B is right, say so in Option B's description and
explain the tradeoff in one line.
- **Never ask the user to pick between options you haven't described.** Vague questions like "how should we handle X?"
are a failure mode — always offer concrete options.
- **Don't ask questions whose answers are already derivable from the codebase.** Read existing apps first. Ask only for
judgement calls (scope, UX direction, priorities, business rules), not facts you can grep for.
- **Questions have IDs** (Q1, Q1a, Q1b, Q2, ...) so conversation can reference them precisely.
- **Cap one round at ~3 questions.** More than that is a survey, not iteration. Pick the ones that unblock the most
other decisions first.
- **PRD questions focus on product decisions**, not implementation detail. Good PRD questions: "Should this be
single-user or multi-role?", "Is archive soft-delete or hard-delete?", "Should filters persist across sessions?". Bad
PRD questions: "Which framework primitive should we use for this?", "Should this run on the server or the client?"
### Beat 4 — Report back and wait
Send the user a terse message (≤120 words) summarising:
- What you changed in the PRD based on their answers
- What new questions you raised (by ID) and where to find them in the file
- One unilateral decision you made if you had to make one — flag it so they can override
Then stop. The user will either answer the new questions (another cycle begins) or sign off.
## Managing the decision trail
- Under `## Open Questions`, only keep questions that are actually still open (unchecked or partially answered).
- Move resolved questions to `## Decision Trail` as a table: columns `✅ | Question | Decision | Why`. Keep each Why cell
to one sentence.
- Order the table chronologically — the order carries information about how the design evolved.
- When a question is resolved, fold the answer into the main PRD sections FIRST, then move the question to the decision
table. Never skip the fold-in step.
## Implementation handoff (kicking off the build)
A PRD doesn't get implemented directly — its `## Implementation Plan` phases get turned into FDs that do. When the user signals the PRD is ready to start building (phrases like "start building", "kick off Phase 1", "let's begin"):
1. **Re-read the PRD end-to-end** — wireframes, data model, and Implementation Plan must agree before any FD is spun up.
2. **Pick the active phase** by walking `## Implementation Plan` top-to-bottom. The next un-shipped phase is the scope.
3. **Mirror that phase into visible task tracking** using `TaskCreate` — one task per phase deliverable (e.g. "FD: auth flow", "FD: invite list view"). Each task description names the FD that will own the detail.
4. **Update tasks live as the FDs are created and shipped.** Mark a phase task `in_progress` when its FD is signed off, `completed` when the FD's overall status flips to `Done`. Do not batch.
5. **Mirror status back into the PRD.** Flip the phase's checkbox in the Implementation Plan, and update `## Status` at the top (`Open` → `In-progress` → `Partly implemented` → `Done`) as phases land.
6. **Don't write code from this skill.** This skill's implementation handoff is purely about the bridge between PRD phases and the FDs that descend from them. The FD's own Implementation handoff section governs the per-phase build.
7. **Done means both surfaces agree.** A PRD is fully implemented only when every phase task is `completed` AND the PRD status checkbox shows `Done`.
## Sign-off gate
You are finished ONLY when:
1. Every question in the file is answered (no unticked checkboxes, no `{your note here}` placeholders the user was
expected to fill).
2. All PRD sections are complete and self-consistent — Overview, Tech Stack, Users, Features, Data Model, UX Design, and
Implementation Plan all say consistent things. A developer could read this PRD cold and know what they're building.
3. **Every Implementation Plan phase has a populated Model routing table** — one row per step, model picked deliberately
from the rubric, escalation line present. See "Model routing" under `## Implementation Plan`. A phase without this
table isn't signed off, no matter how good the rest looks.
4. The user has explicitly confirmed they want to proceed. Common sign-off phrases: "looks good", "go", "approved",
"ship it", "done". If the user's latest message doesn't clearly sign off, ask: "Is this PRD ready, or do you want
another pass?" — one sentence, then stop.
Until all four are true, stay in the loop. Do not write any code. Do not read implementation files unless needed to
answer a planning question.
## Hard rules
- **No code edits during this skill.** Only edits to the PRD file and (where justified) the user's memory system.
- **Always fold answers into the main PRD sections.** Orphaned answers in the Open Questions area while the rest of the
PRD is stale is the most common failure mode — guard against it.
- **Never hide questions from the user by asking them only in chat.** If it's a product decision, it goes in the file
with a notes slot. Chat is for terse status updates between cycles.
- **Match the tech stack and patterns of the existing project** unless the user explicitly says otherwise. Read
`CLAUDE.md` (or equivalent project conventions doc) before raising a tech-stack question.
- **Wireframes are mandatory.** Every PRD must have at least one ASCII wireframe for the primary view before sign-off.
If none exists after the first cycle, add a skeleton one and ask the user to correct it.
- **Implementation Plan must have phases.** Each phase should be independently testable. If the user hasn't given phase
breakdown, propose one based on the feature list — it's easier to react than to generate from scratch.
- **Respect project-wide preferences** stored in memory. Default to those preferences; only raise a question if there's
genuine tension with the product's needs.
- **Keep update messages terse.** The PRD file is the durable artefact. Chat messages between cycles should be ≤120
words and never repeat content that's already in the file.
- **Ask before guessing.** If an answer is ambiguous, raise it as a new question in the next cycle rather than picking
unilaterally. If you must pick unilaterally because the decision is tiny and blocks progress, flag it explicitly in
the chat summary so the user can override.
- **Defer to smaller models for routine reads.** When you need to read a known file, grep for a specific symbol, or
fetch a single doc page during the planning loop, do it directly with `Read` / `Grep` / `WebFetch`. But anything that
fans out — exploring an unfamiliar subsystem, finding all call-sites of a symbol, summarising a long doc, comparing
multiple files — should be delegated to a subagent with `model: "haiku"` (or `"sonnet"` if the task needs reasoning).
Reserve the top-level Opus session for the synthesis work that actually needs it: reconciling user answers with the
PRD, judging tradeoffs, deciding question wording. The planning loop is mostly orchestration; don't pay Opus rates
for grep.
- **Every Implementation Plan phase has a Model routing table.** A phase without per-step model assignments isn't signed
off, even if every other section is complete. Use the template in "Model routing" under `## Implementation Plan`. The
sign-off gate enforces this.
## Example cycle
Initial state: user says "plan a PRD for {some new product}". No file exists yet.
Cycle 1:
- Create `.plans/product-requirements-document/PRD-XXX - {slug}.md` with skeleton sections filled from the user's
description.
- Identify three big unknowns: who the users are, the CRUD scope, and the boundary of one fuzzy feature area.
- Add Q1 (user roles), Q2 (CRUD scope), Q3 (the fuzzy feature's boundary).
- Send chat: "Created PRD skeleton at `.plans/product-requirements-document/PRD-XXX - {slug}.md`. Three questions at
the bottom — Q1 is the main unlock, it shapes the auth model and most of the UX."
- Stop.
Cycle 2 (user ticks answers and adds notes):
- Read the file.
- Fold answers into Overview, Users, Features, and UX sections. Update the wireframe to reflect the layout the user
described.
- Raise Q4 (a follow-on decision unblocked by the previous answers).
- Move Q1–Q3 to Decision Trail.
- Send chat: "Folded your answers into sections 3–5, updated wireframe. One new question: Q4 — take a look."
- Stop.
Cycle N (user says "looks good"):
- Verify sign-off gate.
- Confirm in chat: "PRD signed off. Ready to start building when you are."
- Exit the skill.
The PRD skill checks .plans/product-requirements-document/ to see what's already in there, picks the next number, reads the template, and creates a new file at .plans/product-requirements-document/PRD-001 - campervan-social.md. The skeleton lands with everything it can fill from my one-paragraph prompt. An Overview. A guess at the Tech Stack based on the repo conventions. A stub for Users. A Features list with the obvious bits. Data Model, UX Design, and Implementation Plan as placeholders.
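The numbering step is simple enough to sketch. In practice the agent does this by reading the directory itself; the functions below are a hypothetical stand-in that shows the rule, assuming the `PRD-XXX - {slug}.md` naming from the skill and treating anything that doesn't match (such as the template) as ignorable.

```python
import re
from pathlib import Path

# Matches "PRD-001 - some-slug.md"; the template file does not match, so it is skipped.
PRD_PATTERN = re.compile(r"^PRD-(\d+) - .+\.md$")


def next_prd_id(filenames):
    """Return the next zero-padded PRD id given the existing file names."""
    numbers = [
        int(match.group(1))
        for name in filenames
        if (match := PRD_PATTERN.match(name))
    ]
    return f"PRD-{max(numbers, default=0) + 1:03d}"


def next_prd_path(plans_dir, slug):
    """Build the path for a new PRD under the plans directory."""
    directory = Path(plans_dir) / "product-requirements-document"
    existing = [p.name for p in directory.glob("*.md")] if directory.is_dir() else []
    return directory / f"{next_prd_id(existing)} - {slug}.md"


# An empty directory (or one holding only the template) yields PRD-001.
first = next_prd_id(["001-TEMPLATE.md"])
# With two PRDs already present, the next one is PRD-003.
third = next_prd_id(["PRD-001 - campervan-social.md", "PRD-002 - trip-planner.md"])
```

`trip-planner` is a made-up slug for the example. Taking the maximum rather than counting files means a deleted PRD never causes a number to be reused.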
What it doesn't do is fire twenty questions at me in chat. It adds three checkbox questions at the bottom of the PRD file. Something like:
#### Q1. Account model
Should reviews and posts be tied to accounts, or can people contribute anonymously?
- [ ] Accounts only (recommended, gives you moderation, abuse handling, trust signals)
- [ ] Anonymous browsing, accounts to post
- [ ] Fully anonymous with a captcha gate
- [ ] Other, notes:
- _your note here_
**Notes / reasoning:**
- _anything you want me to know about the pick_
I tick a box, write a note if I want, save the file. The skill reads the file again, folds my answer into the Users section, the Features section, and the Implementation Plan. Then it raises whatever follow-on questions my answer unlocked. Q1 about accounts might open Q4 about which auth provider, and Q5 about whether usernames are unique. The loop runs until every question is answered, the spec is internally consistent, and I sign off.
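The reason the checkbox format works is that it's trivially machine-readable. The agent just reads the file, but as a rough sketch of what "reading my answers" amounts to, here is a small parser. The regexes and function names are illustrative assumptions, and it only looks at ticked boxes; it doesn't check the notes placeholders that the sign-off gate also cares about.

```python
import re

QUESTION = re.compile(r"^#### (Q\d+[a-z]?)\. (.+)$")
CHECKBOX = re.compile(r"^- \[( |x|X)\] (.+)$")


def read_answers(markdown):
    """Map each question id to the list of options the user ticked."""
    answers = {}
    current = None
    for raw in markdown.splitlines():
        line = raw.strip()
        if (question := QUESTION.match(line)):
            current = question.group(1)
            answers[current] = []
        elif current and (box := CHECKBOX.match(line)):
            if box.group(1).lower() == "x":
                answers[current].append(box.group(2))
        elif line.startswith("#"):
            current = None  # any other heading ends the current question
    return answers


def open_questions(markdown):
    """Question ids with no ticked option yet."""
    return [qid for qid, ticked in read_answers(markdown).items() if not ticked]


example = """\
#### Q1. Account model
- [x] Accounts only (recommended)
- [ ] Anonymous browsing, accounts to post
- [ ] Other, notes:
#### Q4. Auth provider
- [ ] Option A
- [ ] Option B
"""

answered = read_answers(example)
still_open = open_questions(example)
```

Here `answered` holds the one ticked option for Q1 and an empty list for Q4, so Q4 is what the next cycle still has to resolve.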
The questions I'd skip on my own are the ones the file makes me answer. Those are usually the ones that would've bitten me in week three.
That's skill one. A guide, in Böckeler's terms. Pure feedforward. It saves me from the most expensive failure mode in AI coding: building something nobody asked for, beautifully.
feature-development
feature-development-skill.md on GitHub
---
name: fd-plan
description: Iteratively plan a feature/fix design (FD) document with the user via in-file checkbox Q&A, folding answers into the spec until signed off. Use when the user asks you to "plan FD-XXX", "work on FD-XXX", "iterate on FD-XXX", or to turn a rough plan document into a fully specified one before any code is written.
metadata:
argument-hint: <FD-id or plan file path>
---
# FD Plan Iteration
## Harness role
- **Control type:** guide (feedforward)
- **Regulation category:** behaviour + maintainability
- **Lifecycle stage:** pre-implementation
- **Computational checks:** the FD declares the gate in its `## Computational gate` block — tests, type checks, lint at minimum, plus any perf budget, bundle ceiling, or accessibility budget the work touches. Keep quality left: name the cheap deterministic checks before generation starts
- **Inferential checks:** structured in-file Q&A on design, scope, and risk, with a recommendation per question and a notes slot
- **Frame:** [*Harness Engineering*](https://martinfowler.com/articles/harness-engineering.html) by Birgitta Böckeler
This skill runs a tight, in-file question-and-answer loop that refines a plan document (an "FD" — feature design / fix
design) until the user has signed it off. No implementation code is written while this skill is running. The goal is to
reach a spec where every decision is explicit, every open question is closed, and every user correction has been folded
back into the relevant section(s) of the document — not just recorded as an answer.
The FD is also a **living record**, not just a planning artefact. Once implementation begins (in a separate task), the
FD continues to evolve: each phase that ships gets an Implementation Notes block appended, capturing what actually
changed, deviations from the plan, and follow-ups. See "Phases and the implementation template" below — that template
applies during implementation, not during planning, but the FD's structure must accommodate it from day one.
## When to use
Invoke this skill when:
- The user says "plan FD-XXX", "let's work on FD-XXX", "iterate on FD-XXX", "flesh out FD-XXX", "review my answers in
FD-XXX", or similar.
- A plan document exists under `.plans/feature-development/FD-XXX - ....md` and the user wants to refine it before
implementing.
- The user says "I answered the open questions, review and update" — that's the middle of this loop, resume it.
Do NOT invoke this skill when:
- The user has asked for code changes directly. This skill never writes code.
- The user is asking a quick question about an existing FD. Just read it and answer.
## Locating the plan file
1. If the user passed an argument that looks like a file path, read that path.
2. If they passed an ID (e.g. `FD-005`, `005`), glob for it under `.plans/feature-development/` and pick the matching
file. Ignore `*TEMPLATE*` files.
3. If nothing matches or there are multiple candidates, ask the user which file they meant. Do not guess.
4. Always confirm the file path in your first reply so the user can correct you before you start editing.
## The iteration loop
Each cycle of the loop has four beats:
### Beat 1 — Read current state
Read the full plan file. Identify:
- Which sections of the spec are still vague (hand-wavy, TODO-ish, or contradicting each other)
- The current "Open questions" / "Decisions" / similar section — what's answered, what isn't
- Any checkboxes the user has ticked or notes they've added since the last read
- Any new requirements the user added directly into prose sections
### Beat 2 — Process user answers
If the user has answered questions since last time:
- Fold each answer into EVERY section of the spec it affects — not just the questions area. If an answer changes the
label rules, rewrite the Label Rules section. If it changes data fetching, rewrite the Data section. If it changes
click behaviour, rewrite the Click Handler section.
- A decision is "processed" only when the spec reads consistently as if the decision was always there. Leaving a
resolved answer orphaned in the Open Questions area while the main spec still says something contradictory is a
failure mode — do not do it.
- Watch for answers that contradict things you wrote earlier in this skill's run. Reconcile explicitly; never silently
keep the old wording.
- Watch for answers that reveal a project-wide preference (e.g. a coding-style rule, a tooling convention, a copy/locale
rule). These belong in memory via the auto-memory system, not just in this one FD.
### Beat 2b — Populate Acceptance Criteria
After processing user answers, update acceptance criteria. **Placement depends on FD shape:** flat FDs use a top-level
`## Acceptance Criteria` section; phased FDs put criteria inside each `### Phase N` block (under `#### Acceptance
criteria`) so each phase reads as a self-contained slice. Same Given-When-Then format either way. Each scenario block
uses:
```markdown
#### Scenario: {scenario name}
- **Given** {precondition}
- **When** {action}
- **Then** {outcome}
```
Rules:
- One scenario per key behaviour — not one per code path.
- Write scenarios a tester could execute against the running app, not against the implementation.
- Scenarios must be consistent with the rest of the spec. If an answer changes a behaviour, rewrite its scenario.
- These scenarios feed the `testing-pyramid` skill and the `principal-code-review` spec compliance check — keep them
precise enough to be a checklist.
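Because the scenario format is fixed, a script can check it. Below is a minimal sketch, assuming scenarios always use a `Scenario:` heading followed by bold `Given` / `When` / `Then` bullets as in the template above. The function names `parse_scenarios` and `incomplete_scenarios` are hypothetical and not part of either skill.

```python
import re

# Heading depth varies (#### in flat FDs, ##### inside phases), so accept 2-6 hashes.
SCENARIO_HEADING = re.compile(r"^#{2,6}\s+Scenario:\s*(.+)$")
STEP = re.compile(r"^-\s+\*\*(Given|When|Then)\*\*\s+(.+)$")


def parse_scenarios(markdown: str) -> dict[str, dict[str, str]]:
    """Collect the Given/When/Then steps found under each Scenario heading."""
    scenarios: dict[str, dict[str, str]] = {}
    current = None
    for raw in markdown.splitlines():
        line = raw.strip()
        heading = SCENARIO_HEADING.match(line)
        step = STEP.match(line)
        if heading:
            current = heading.group(1).strip()
            scenarios[current] = {}
        elif step and current is not None:
            scenarios[current][step.group(1)] = step.group(2).strip()
        elif line.startswith("#"):
            current = None  # any other heading closes the scenario
    return scenarios


def incomplete_scenarios(markdown: str) -> list[str]:
    """Names of scenarios missing at least one of Given, When, Then."""
    required = {"Given", "When", "Then"}
    return [name for name, steps in parse_scenarios(markdown).items()
            if not required <= set(steps)]
```

A check like this only catches structural gaps. Whether a scenario is something a tester could actually execute is still a judgement call.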
### Beat 3 — Raise new questions
Answers almost always raise new questions. Surface them **in the plan file** using the in-file format below — NOT just
in chat. The user works through these in their IDE and the plan is the durable artefact.
**In-file question format:**
For a question with discrete options, use checkboxes the user can tick directly in the file:
```markdown
#### QX. {short question title}
{One or two sentences explaining the question and why it matters.}
- [ ] Option A — {short description of what this option means and when it's right}
- [ ] Option B — {short description}
- [ ] Option C — {short description}
- [ ] Other — notes:
- _{your note here}_
**Notes / reasoning:**
- _{anything you want me to know about the pick}_
```
Rules for writing questions:
- **Every question gets a "Notes" slot.** The user can always fill it with an alternative or a reason.
- **Give your current recommendation inline.** If you think Option B is right, say so in Option B's description and
explain the tradeoff in one line. The user is busy; don't make them derive your opinion from scratch.
- **Never ask the user to pick between options you haven't described.** Vague questions like "how should we handle X?"
are a failure mode — always offer concrete options.
- **Don't ask questions whose answers are already derivable from the codebase.** Read the code first. Ask the user only
for judgement calls (UX choices, priorities, scope), not facts you could grep for.
- **Questions have IDs** (Q1, Q1a, Q1b, Q2, ...) so subsequent conversation can reference them precisely.
- **Cap one round at ~3 questions.** More than that and you're not iterating, you're running a survey. If you have more,
pick the ones that unblock the most other decisions first.
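The structural rules above (concrete options, a Notes slot, a cap per round) can be linted. The sketch below assumes questions follow the in-file format exactly, with `Qn.` headings and checkbox options. `lint_questions` and its messages are hypothetical, and the "fewer than two options" threshold is my assumption about what counts as offering concrete options.

```python
import re

QUESTION_HEADING = re.compile(r"^#{2,6}\s+(Q\d+[a-z]?)\.\s+(.+)$")
OPTION = re.compile(r"^-\s+\[( |x|X)\]\s+(.+)$")


def lint_questions(markdown: str, max_per_round: int = 3) -> list[str]:
    """Return a list of rule violations for the questions in an FD."""
    questions: dict[str, dict] = {}
    current = None
    for raw in markdown.splitlines():
        line = raw.strip()
        heading = QUESTION_HEADING.match(line)
        option = OPTION.match(line)
        if heading:
            current = heading.group(1)
            questions[current] = {"options": 0, "ticked": 0, "notes": False}
        elif current and option:
            questions[current]["options"] += 1
            if option.group(1) != " ":
                questions[current]["ticked"] += 1
        elif current and line.startswith("**Notes"):
            questions[current]["notes"] = True

    problems = []
    for qid, q in questions.items():
        if q["options"] < 2:
            problems.append(f"{qid}: fewer than two concrete options")
        if not q["notes"]:
            problems.append(f"{qid}: missing Notes slot")
    # A question with nothing ticked is treated as still open.
    open_count = sum(1 for q in questions.values() if q["ticked"] == 0)
    if open_count > max_per_round:
        problems.append(f"{open_count} open questions in one round; cap is {max_per_round}")
    return problems
```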
### Beat 4 — Report back and wait
Send the user a terse message (≤120 words) summarising:
- What you changed in the spec based on their answers
- What new questions you raised (by ID) and where to find them in the file
- Any unilateral decision you had to make, flagged so they can override it

Then stop. The user will either answer the new questions (another cycle begins) or sign off.
## Managing the decision trail
Over multiple cycles the plan accumulates answered questions. Keep them visible but compact so the document doesn't
bloat:
- Under `## Open questions`, only keep questions that are actually still open (unchecked or partially answered).
Everything resolved moves to a "Decisions" or "Decision trail" section.
- Use a **table format** for resolved decisions — the user has expressed this preference (columns: ✅, Question,
Decision, Why). Keep each Why cell to one sentence so the table scans fast.
- Order the table by when each question was raised, not alphabetically. The order itself carries information about how
the design evolved.
- When a question is resolved, fold the answer into the main spec FIRST, then move the question to the decisions table.
Never skip the fold-in step.
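As an illustration of the table shape, here is a small sketch that renders resolved decisions in the order they were raised. The `raised` field and the dictionary layout are assumptions made for the example; the skill only prescribes the four columns.

```python
def decisions_table(decisions: list[dict]) -> str:
    """Render resolved decisions as a markdown table, ordered by when each was raised."""
    lines = [
        "| ✅ | Question | Decision | Why |",
        "| --- | --- | --- | --- |",
    ]
    # Sort by raise order, not alphabetically: the order shows how the design evolved.
    for d in sorted(decisions, key=lambda d: d["raised"]):
        lines.append(f"| ✅ | {d['question']} | {d['decision']} | {d['why']} |")
    return "\n".join(lines)
```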
## Risks section — always a table, always linked to tests
Once the spec is fleshed out enough that you're naming specific files, libraries, and framework primitives, you are ALSO
responsible for surfacing risks and new issues the chosen approach would introduce. These are not questions — they are
footguns you discovered during investigation that the user should see before implementation starts.
**Placement depends on FD shape.** Flat FDs put a single `## Risks & new issues surfaced by this investigation` section
near the bottom. Phased FDs put a `#### Risks & verification` block inside each `### Phase N` so risks travel with the
phase they apply to. Same table format either way; IDs stay unique across the whole FD (Phase 2's first risk continues
the numbering from Phase 1's last).
Surface as a **table**, not a list of prose paragraphs. Columns:
| ID | Risk | Mitigation | Verification |
|----|------|------------|--------------|
Rules:
- **IDs are `R1`, `R2`, …** and are referenced elsewhere in the doc (e.g. from Files to Modify, from Decisions, from the
Verification section).
- **Each row has at least one test ID** in the Verification column (`V1`, `V2`, …). If a risk genuinely cannot be
tested (e.g. "pre-existing limitation, documented baseline"), write `V_ (informational)` and add a corresponding entry
in the Verification section that records the known baseline.
- **Verification section mirrors the table**: every `Vn` referenced in the table's Verification column must exist as a
subsection under `## Verification`, with the list of checks for that test. The Verification section header should say
"Every risk Rn below has at least one test. Test IDs are tagged with the risks they cover."
- **Mitigations are concrete actions**, not "be careful": point at a specific code comment, an e2e test, a config
flag, or a resolved question (e.g. "Resolved by Q2 option 2").
- **Order by severity then discovery order.** Highest-blast-radius risks first (compile-time blockers, data loss,
security). Informational / accepted-tradeoff risks last.
- **A risk that's resolved by an answered question cites that question** in the Mitigation cell ("Resolved by Q2 (option
2)") so the audit trail is readable.
- **Don't repeat content** between the risks table and the Verification tests — the table is the index, the Verification
section has the actual steps.
When to add risks vs when to raise questions: if the risk requires a user decision, raise it as a numbered question (Q1,
Q2, …). If the risk is a gotcha with a clear mitigation, it goes in the table. A question can graduate to a risk once
the user answers it; keep the risk row (referencing the answered question) so the audit trail stays intact.
Watch for risks in these categories:
- **Config/flag prerequisites** (framework feature requires opt-in).
- **Silent coexistence issues** (new system + old system; what's the invalidation boundary?).
- **UX regressions from the refactor itself** (removing a lift breaks a live hint).
- **Cache-staleness windows** (SWR bought you speed, but users might see stale state for N seconds).
- **Cross-boundary cancellation/error propagation** (server actions, transitions, error boundaries).
- **State preservation across rollbacks** (optimistic updates, transitions, navigation).
- **Performance cliffs under cold cache / rate limits**.
- **Backward compatibility with saved user state** (password managers, sessionStorage, URL params).
## Phases and the implementation template
Most FDs deliver a single feature in one go. Some span multiple visible-to-the-user phases (e.g. ship Phase 1 minimal,
review, then ship Phase 2 logos). When an FD has phases, surface them as the dominant structure of the document — a
top-level `## Phases` section with each phase as a `###` heading.
### When to use phases
Add a `## Phases` section if any of the following is true:
- The user asks for "phased" delivery, "ship X first, then Y", or "let me see X before we do Y".
- A decision question's answer is "do X now, defer Y" (Q1's resolution, etc.).
- The work crosses a natural review boundary (a render the user wants to eyeball before more code lands).
- Total scope is large enough that landing it in one PR is risky.
If none of these apply, keep the FD flat — no `## Phases` section, no per-phase blocks. Don't manufacture phases for
ceremony. A flat FD with `## Solution Overview` + `## Files to Modify` is fine for single-shot work.
### Per-phase structure
Each phase is a `### Phase N — {short title}` heading. The phase block is **self-contained**: everything scoped to that
phase (acceptance scenarios, risks, verification, implementation notes) lives inside it, not scattered across top-level
sections. This is a deliberate readability rule — the user expects related things grouped together. Skim a phase
top-to-bottom and you have the full picture.
```markdown
### Phase N — {short title}
**Status**
- [ ] Not started
- [ ] In progress
- [ ] Shipped YYYY-MM-DD
- [ ] Deferred
#### Plan
{Tight summary of what this phase ships, what it doesn't, and why this slice. Pipeline steps, file targets, behaviours.
Riffing-level concise — same density as a flat FD's Solution Overview.}
#### Acceptance criteria
##### Scenario: {phase-scoped scenario name}
- **Given** {precondition}
- **When** {action}
- **Then** {outcome}
(One scenario per key behaviour for THIS phase.)
#### Risks & verification
| ID | Risk | Mitigation | Verification |
| --- | --- | --- | --- |
| Rn | {phase-scoped risk} | {concrete mitigation} | Vn |
##### Vn — {short title} (Rn)
- [ ] {concrete check, ticked when run}
#### Implementation notes
_(filled in during/after implementation — leave empty until then)_
```
Cross-phase content (Problem Description, the Decisions table, Open Questions) stays at the top of the FD as global
sections. Anything phase-scoped belongs inside the phase block.
**Status uses a checkbox list, not prose.** Mirrors the top-of-FD overall status block; consistent visual language
across the document.
**Verification entries use checkboxes** (`- [ ] {check}`) so reviewers can tick them off as they run them.
**Risk and verification IDs are unique across the whole FD**, even though they live inside phase blocks (so R3 in Phase
1, R4 in Phase 2 — never two R3s). Phase-2 IDs continue from where Phase 1 left off.
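One way to keep IDs unique across phases is to derive the next ID from the whole FD rather than from the current phase. A minimal sketch, assuming IDs are always a prefix letter followed by digits (`next_id` is a hypothetical helper):

```python
import re


def next_id(fd_markdown: str, prefix: str = "R") -> str:
    """Return the next unused ID for the prefix, scanning the entire FD."""
    numbers = [int(n) for n in re.findall(rf"\b{re.escape(prefix)}(\d+)\b", fd_markdown)]
    return f"{prefix}{max(numbers, default=0) + 1}"
```

Scanning the full document, not just one phase block, is what stops a second R3 from appearing in Phase 2.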
### The implementation template
Once a phase ships, fill in its `#### Implementation notes` block using this template. It is intentionally more verbose
than the Plan block — a reviewer with no session context should be able to reconstruct the change from these notes
alone.
```markdown
#### Implementation notes
##### Files touched
- `path/to/file.ts` — {what changed, in one line. Reference the function or section if non-obvious.}
- `path/to/other.ts` (new) — {what it is and why.}
(No new dependencies / Added dep `foo@1.2`. / No env vars added. / No DB changes.)
##### Deviations from the plan
- **{Heading of the deviation}.** {One short paragraph: what the plan said, what was actually done, why the change was
made, whether it was reversible. If a deviation should have looped back to the user but didn't, flag that explicitly.}
(Use `_None._` if the implementation matched the plan exactly.)
##### How to test (manual, this session)
1. {Concrete step a reviewer can run.}
2. {Next step.}
3. {What to look for to confirm the phase is good.}
##### Verification status
- V1 — {pass / fail / pending user manual test}.
- V2 — …
##### Follow-ups deferred
- {Anything the implementation surfaced that didn't ship in this phase. If big enough to warrant its own FD, say so.}
(Use `_None._` if nothing was deferred.)
```
Verbosity rule: **Plan blocks stay terse, Implementation notes get verbose.** During planning, riff at the Plan
density. During implementation, the notes are the durable record of what actually happened.
### Cost discipline during implementation
The same defer-to-smaller-models rule from the planning loop applies once you're implementing the FD's phases — and
arguably matters more, because implementation work fans out across many files. Concretely:
- **Reading the codebase to confirm a file path or current behaviour** → `Explore` subagent (Haiku-class).
- **Generating boilerplate** (migrations, route handlers from a template, test scaffolds) → `general-purpose` subagent
with `model: "haiku"` or `"sonnet"`, given the relevant spec slice as context.
- **Drafting Implementation-notes blocks** (Files touched, How-to-test, Verification status) → `general-purpose`
subagent with `model: "haiku"`; the top-level session reviews and edits.
- **Reserve Opus for**: applying user feedback to the spec, resolving spec/code contradictions, security-sensitive
edits, and anything where getting it wrong wastes more than the model-cost saving.
Cheapest path: top-level session decides *what* to do, subagent does it, top-level session reviews the diff. Don't
skip the review — small models will sometimes drift — but do skip the manual-typing toil.
#### Model routing table (mandatory in every Plan / phase Plan)
The defer-to-smaller-models rule above is theoretical until you write it down per step. **Every Plan block — flat or
phased — includes a Model routing table** so the cheapest viable model is picked deliberately rather than by reflex.
Template (paste verbatim into the Plan block, fill in per-step):
```markdown
#### Model routing
| Step | Model | Reason |
| ---- | ---------- | ------------------------------------------------------- |
| 1 | **haiku** | {one line — pure mechanical work} |
| 2 | **sonnet** | {one line — bounded judgement, well-specified} |
| 3 | **opus** | {one line — design / risky / cross-cutting / synthesis} |
If a sonnet/haiku step surfaces a non-trivial decision, escalate to the main session rather than guess.
```
**Rubric for picking the model:**
- **haiku** — pure mechanical: run a command, grep, count lines, file moves, single-line edits, install a dep, confirm
a number, paste known content. No judgement required; if the agent has to choose between options, it's not haiku-class.
- **sonnet** — bounded judgement: apply a pattern the FD specifies, refactor following a recipe, fix lint errors with
clear rules, write boilerplate from a spec, classify hits into a fixed set of buckets. The agent makes small calls
inside well-defined rails.
- **opus** (= top-level session) — design, architecture, risky migrations, cross-cutting changes, synthesising multiple
inputs, anything where getting it wrong wastes more than the model-cost saving. Don't delegate this.
**Rules:**
- **Each step in the Plan gets a row.** No row → no execution. If a step has no row, the Plan is incomplete.
- **Default towards cheaper.** If you're choosing between sonnet and opus and the work is well-specified by the FD,
pick sonnet. Reserve opus for the parts where the FD doesn't fully prescribe the answer.
- **The escalation line is mandatory.** Sonnet/haiku subagents must know they can return without guessing — they will
guess otherwise.
- **Re-evaluate after the design step.** Once the inventory/design step (usually opus) is done, the remaining steps are
often more mechanical than first thought; downgrade them if so.
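The "no row, no execution" rule can be enforced before any subagent is dispatched. The sketch below assumes the routing table uses the three-column template above with numeric step IDs. Detecting the escalation line by searching for the word "escalate" is a crude stand-in I have chosen for illustration, and the function names are hypothetical.

```python
ALLOWED_MODELS = {"haiku", "sonnet", "opus"}


def routing_rows(markdown: str) -> dict[int, str]:
    """Map step number to model name for each row of a Model routing table."""
    rows = {}
    for line in markdown.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) == 3 and cells[0].isdigit():
            rows[int(cells[0])] = cells[1].strip("*").lower()
    return rows


def check_routing(plan_steps: list[int], markdown: str) -> list[str]:
    """Report steps without a row, unknown models, and a missing escalation line."""
    rows = routing_rows(markdown)
    problems = [f"step {s}: no routing row" for s in plan_steps if s not in rows]
    problems += [f"step {s}: unknown model '{m}'"
                 for s, m in rows.items() if m not in ALLOWED_MODELS]
    if "escalate" not in markdown.lower():
        problems.append("missing escalation line")
    return problems
```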
### When implementation notes get written
Implementation notes are NOT written by the planning loop. They're written:
- During the implementation task, immediately after each phase's code lands (or in tight increments as code lands).
- Or retroactively, when the user asks ("update the FD with what you did", "add implementation notes for Phase N").
The planning skill's job is to (a) decide whether the FD has phases, (b) lay out the per-phase Plan blocks, and (c)
leave the Implementation notes placeholder so the structure is ready when implementation starts.
### Updating the Decisions table during implementation
If a unilateral decision was made during implementation (e.g. "coalesce horizontal runs in the EPS writer"), append it
to the existing Decisions table with an `(impl)` tag in the Question column so the audit trail captures decisions made
outside the planning loop. Keep the table chronological — implementation decisions go at the bottom.
## Implementation handoff
When the user signals the FD is ready to implement or fix (phrases like "ship it", "implement", "go", "start coding", "fix", "do it"):
1. **Read the latest FD end-to-end** before touching any code. The user may have edited it since the last cycle.
2. **Pick the active scope**:
- **Phased FD:** the next phase whose `Status` is not yet `Shipped`. Implement one phase at a time; do not jump ahead.
- **Flat FD:** the whole `## Files to Modify` list, ordered by the dependency between items.
3. **Mirror the chosen scope into visible task tracking** using `TaskCreate` — one task per discrete unit (one task per file-to-modify, or one task per `Vn` verification check, whichever the FD lists more concretely). Use the same wording the FD uses so the two surfaces line up.
4. **Update tasks live as work proceeds.** Mark each task `in_progress` before starting it, `completed` the moment the change lands. Do not batch — the user reads task state to know where you are.
5. **Mirror status back into the FD.** Update `## Status`, the per-phase `Status` checkboxes, and any `Files to Modify` / `Verification` rows as their referenced task completes. The FD and the task list must agree.
6. **Block on red.** If a `Vn` verification fails, leave the corresponding task `in_progress`, append a note to the phase's `#### Implementation notes` block, and surface the failure in chat. Don't silently mark the task complete.
7. **Done means both surfaces agree.** Only declare the implementation done when every task is `completed` AND the FD's overall `Status` and per-phase `Status` reflect "shipped".
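Step 2's rule for phased FDs ("the next phase whose `Status` is not yet `Shipped`") could be sketched like this, assuming each phase uses the `### Phase N` heading and the Status checkbox list from the per-phase template. `active_phase` is a hypothetical name. The sketch follows the rule literally, so a phase ticked `Deferred` still counts as not shipped.

```python
import re


def active_phase(fd_markdown: str):
    """Return the lowest phase number whose 'Shipped' checkbox is not ticked, or None."""
    shipped: dict[int, bool] = {}
    phase = None
    for raw in fd_markdown.splitlines():
        line = raw.strip()
        heading = re.match(r"^###\s+Phase\s+(\d+)\b", line)
        if heading:
            phase = int(heading.group(1))
            shipped[phase] = False
        elif phase is not None and re.match(r"^-\s+\[[xX]\]\s+Shipped\b", line):
            shipped[phase] = True
    for number in sorted(shipped):
        if not shipped[number]:
            return number
    return None  # every phase has shipped
```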
## Sign-off gate
You are finished ONLY when:
1. Every question in the file is answered (no unticked checkboxes in open questions, no `{your note here}` placeholders
the user was expected to fill).
2. The spec sections (Problem, Data, Label rules, Click handler, Files to modify, Verification, etc.) are
self-consistent — you could hand the document to someone cold and they could implement it.
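3. Acceptance criteria contain at least one scenario per key user-visible behaviour: in the `## Acceptance Criteria`
section for a flat FD, or in each phase's `#### Acceptance criteria` block for a phased FD.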
4. **Every Plan block (flat or per-phase) has a populated Model routing table** — one row per step, model picked
deliberately from the rubric, escalation line present. See "Model routing table (mandatory in every Plan / phase
Plan)". A Plan without this table isn't signed off, no matter how good the rest looks.
5. The user has explicitly confirmed they want to proceed. Common sign-off phrases: "looks good, start coding", "go",
"implement it", "ship it", "approved". If the user's latest message doesn't clearly sign off, ask: "Is this ready to
implement, or do you want another pass?" Keep it to one sentence, then stop.
Until all five are true, stay in the loop. Do not write any code. Do not start edits to the implementation files. Do
not even read the implementation files unless you need them to answer a planning question.
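The gate is an all-or-nothing conjunction, which a few lines of code make explicit. This sketch models the five conditions as fields on a hypothetical `GateState` record. How each field gets populated (parsing the FD, reading the latest user message) is left out and would need its own checks.

```python
from dataclasses import dataclass


@dataclass
class GateState:
    open_questions: int              # condition 1: questions still unanswered
    spec_self_consistent: bool       # condition 2
    criteria_cover_behaviours: bool  # condition 3
    routing_tables_complete: bool    # condition 4
    user_signed_off: bool            # condition 5


def blocking_conditions(state: GateState) -> list[str]:
    """Return the unmet sign-off conditions; code may be written only when this is empty."""
    checks = [
        (state.open_questions == 0, "open questions remain"),
        (state.spec_self_consistent, "spec sections are not self-consistent"),
        (state.criteria_cover_behaviours, "acceptance criteria are incomplete"),
        (state.routing_tables_complete, "a Plan block lacks a Model routing table"),
        (state.user_signed_off, "user has not explicitly signed off"),
    ]
    return [reason for ok, reason in checks if not ok]
```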
## Spec graduation (post sign-off)
Once the user signs off, offer to graduate the acceptance criteria to a living spec file:
> "Want me to graduate the acceptance criteria to `/specs/{scope}/spec.md`? This creates a permanent, readable record
> of what {feature} is supposed to do — the `principal-code-review` skill will reference it automatically."
If the user agrees, write `/specs/{scope}/spec.md` with this structure (omit planning artefacts — decisions, risks, open
questions all stay in the FD):
```markdown
# {scope} — {feature short title}
> Source: FD-XXX — last updated YYYY-MM-DD
## Purpose
{one paragraph from the FD's Problem Description / Solution Overview}
## Acceptance Criteria
{paste the Acceptance Criteria section from the FD verbatim}
```
Rules for the spec file:
- Keep it short — purpose + acceptance criteria only. The FD has the reasoning.
- If a `/specs/{scope}/spec.md` already exists, merge the new scenarios in; do not overwrite the whole file.
- Update the spec when an FD changes existing behaviour, not for internal refactors.
- Link back to the FD number in the `> Source:` line so readers can find the decision trail.
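The "merge, do not overwrite" rule might look like the sketch below. It assumes scenarios are keyed by the name in their `Scenario:` heading, and that a new scenario with an existing name replaces the old one in place. That replacement policy is my assumption; the skill only says to merge. Both function names are hypothetical.

```python
import re


def split_scenarios(section: str) -> dict[str, str]:
    """Split an Acceptance Criteria section into {scenario name: scenario block}."""
    blocks: dict[str, str] = {}
    name = None
    for line in section.splitlines():
        heading = re.match(r"^#{2,6}\s+Scenario:\s*(.+)$", line.strip())
        if heading:
            name = heading.group(1).strip()
            blocks[name] = line
        elif name is not None:
            blocks[name] += "\n" + line
    return blocks


def merge_scenarios(existing_section: str, new_section: str) -> str:
    """Keep existing scenarios in order, update same-named ones, append new ones."""
    merged = {**split_scenarios(existing_section), **split_scenarios(new_section)}
    return "\n".join(merged.values())
```

Text that appears before the first scenario heading is dropped by `split_scenarios`, so this would be applied to the scenario list only, not to the whole spec file.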
## Hard rules
- **No code edits during this skill.** Only edits to the plan file and (where justified) to the user's memory system.
Implementation-notes updates to the FD happen in a separate task, not during the planning loop.
- **Always fold answers into the main spec.** Orphaned answers in an "Open questions" area while the rest of the spec is
stale is the most common failure mode of this skill — guard against it.
- **Never hide questions from the user by asking them only in chat.** If it's a decision that shapes the spec, it goes
in the file with a notes slot. Chat is for terse status updates between cycles.
- **Respect project-wide preferences** stored in memory. If a question touches one of them, default to the memory's
answer and only raise the question if there's a real tension.
- **Watch for project-wide feedback** while iterating. If the user tells you something that clearly applies beyond this
FD ("I prefer X over Y everywhere"), save it to auto-memory in the same cycle you're folding it into the spec — don't
wait for a separate invitation.
- **Keep update messages terse.** The plan file is the durable artefact. Chat messages between cycles should be ≤120
words and never repeat content that's already in the file.
- **Ask before guessing.** If an answer is ambiguous, raise it as a new question in the next cycle rather than picking
unilaterally. If you must pick unilaterally because the decision is tiny and blocks progress, flag it explicitly in
the chat summary so the user can override.
- **Risks are a table, not prose, and every row cites a test.** See "Risks section" above. Failing to tie each risk to a
`Vn` test ID in the Verification section is a failure mode — the whole point of documenting the risk is to ensure it
gets tested.
- **Defer fan-out reads to smaller models.** When you need to read a known file, grep for a specific symbol, or
fetch a single doc page during the planning loop, do it directly with `Read` / `Grep` / `WebFetch`. But anything that
fans out — exploring an unfamiliar subsystem, finding all call-sites of a symbol, summarising a long doc, comparing
multiple files — should be delegated to a subagent with `model: "haiku"` (or `"sonnet"` if the task needs reasoning).
Reserve the top-level Opus session for the synthesis work that actually needs it: reconciling user answers with the
spec, judging risk severity, deciding question wording. The planning loop is mostly orchestration; don't pay Opus
rates for grep.
- **Every Plan block has a Model routing table.** A Plan section without per-step model assignments isn't signed off,
even if every other section is complete. Use the template in "Model routing table (mandatory in every Plan / phase
Plan)" above. The sign-off gate enforces this.
## Example cycle
Initial state: `.plans/feature-development/FD-XXX - {app} - {short slug}.md` exists with a rough problem statement, no
open questions, no data section.
Cycle 1:
- Read the file.
- Spec is sparse; a key sub-system (e.g. where state is persisted) isn't specified.
- Add an "Open questions" section with Q1 (the primary unlock — pick from a small set of concrete options), Q2 (a scope
question that depends on Q1), Q3 (a follow-on edge case). Each has checkbox options + notes slots + your
recommendation inline.
- Send a chat message: "Drafted 3 questions at the bottom of FD-XXX. Q1 is the main unlock — tick one and I'll build the
rest around your choice."
- Stop.
Cycle 2 (user ticks Q1's recommended option and adds a note on Q2):
- Read the file.
- Fold the chosen option into the relevant spec section, update "Files to Modify" to mention the affected modules,
update Verification to include the new checks.
- Process Q2's note as a partial answer; raise Q2a as a follow-up sub-question.
- Move Q1 to a new "Decisions" table.
- Send a chat message: "Folded the Q1 decision into sections 3 and 5. Raised Q2a as a follow-up — take a look."
- Stop.
Cycle N (user says "looks good, ship it"):
- Verify sign-off gate conditions.
- If the FD has phases, confirm each phase has a `#### Plan` block populated and an empty `#### Implementation notes`
placeholder ready.
- Confirm in chat: "Signed off. Switching out of planning mode — want me to start the implementation now?"
- Exit the skill. Implementation is a separate task.
Post-implementation (separate task, not part of the planning loop):
- After Phase N's code lands, append the Implementation notes block to that phase using the template above.
- Files touched, deviations, test instructions, verification status, follow-ups.
- Tick the `Shipped YYYY-MM-DD` checkbox in Phase N's Status list and fill in the date.
- If a unilateral implementation decision was made, log it in the Decisions table tagged `(impl)`.
- Update the top-of-FD Status checkboxes (`In-progress`, `Partly implemented`, `Done`) to match reality.
Signed-off PRD in hand, I'm not going to give the agent the whole document and say "build it." That's too much rope. I pick one feature. "Find a campsite within 30 minutes of my current location, filter by amenities."
Then:
Plan an FD for a location search feature in campervan-social
Same loop, different output. The skill creates .plans/feature-development/FD-001 - campervan - location-search.md, pulls the relevant constraints from the PRD, identifies the modules it'll touch (a server action, a client filter component, a Supabase query), and writes a Files to Modify section.
The bit I really like is the Risks table. Once it's investigated the codebase, it surfaces footguns it found while looking around. Stale-cache windows. A coexistence issue between an old hook and a new one. A spelling drift between en_AU and en_US in existing copy. Each risk row is tied to a verification step in a tests section, so nothing gets flagged without a plan to actually catch it.
Risks as prose are an essay you skim. Risks as a table are a checklist you can't pretend you didn't see. By the time I sign the FD off, I have a document I could hand to a contractor on the other side of the world and they'd build the same thing.
Still a guide. Still feedforward. The Risks table is the bit that quietly does double duty as a sensor brief, because each risk is paired with the test that'll catch it if it shows up.
code-review
code-review-skill.md on GitHub
---
name: principal-code-review
description: Review code changes with the judgement of a principal engineer who knows this monorepo intimately. Use when the user asks for a code review, says "what do you think of this", asks for feedback on a PR/diff/branch/recent commits, says "is this any good", "before I merge", "sanity check this", or pastes a diff and asks to review uncommitted changes.
---
# Principal Code Review
## Harness role
- **Control type:** sensor (feedback)
- **Regulation category:** maintainability + behaviour
- **Lifecycle stage:** pre-merge
- **Computational checks:** typecheck, test suite, lint — these run first; this skill only starts when they're green
- **Inferential checks:** severity-grouped semantic review of the diff against the project's conventions and the spec (FD or `/specs/`)
- **Frame:** [*Harness Engineering*](https://martinfowler.com/articles/harness-engineering.html) by Birgitta Böckeler
Review the pending or specified changes with the voice of an opinionated principal engineer who built and ships this codebase. Output is a markdown file in `.plans/code-review/` plus an inline verdict summary.
## When to trigger
Activate when the user asks for a code review, asks "what do you think of this", asks for feedback on a PR, diff, branch, or recent commits, or says things like "is this any good", "before I merge", "sanity check this". Also trigger when the user pastes a diff or asks to review uncommitted changes (`git diff`, `git diff --staged`, `git diff main...HEAD`).
## Reviewer mindset
Adopt the voice of an opinionated principal who has built and shipped this codebase. Direct, specific, no hedging. Comments should sound like a senior reviewing a colleague's PR, not a checklist. If something is fine, say it's fine and move on. If something is wrong, say why and what you'd do instead. No "consider doing X" weasel-wording when you mean "do X".
## Stack the reviewer must know
Before reviewing, identify the project's stack and conventions from `CLAUDE.md` (or equivalent project docs), the
package manifests, and the source layout. Note framework versions, language strictness settings, locale/spelling rules,
and any project-specific design-token or styling conventions.
## How to gather the diff
1. If the user pasted a diff, review that.
2. Otherwise run `git status` and `git diff` / `git diff --staged` / `git diff main...HEAD` as appropriate for what the user asked about ("uncommitted" → unstaged + staged; "this branch" → `main...HEAD`; "my PR" → `main...HEAD`).
3. Read the actual files at their current state for full context — diffs miss surrounding code.
4. If the change touches a project-specific design-token or styling system, re-read the relevant token/global stylesheet to verify usage is correct.
5. **Spec lookup**: search `.plans/feature-development/` for an FD whose scope and slug match the change. Also check `/specs/` for a living spec file for the affected area. If found, read the `## Acceptance Criteria` section — it becomes the baseline for the Spec Compliance section of the review.
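Step 2's mapping from what the user asked about to the git commands to run can be written as a small lookup. A sketch, assuming the base branch is `main` as in the examples above. `diff_commands` and `run_diff` are hypothetical helpers, and the scope strings are just the three phrasings quoted in step 2.

```python
import subprocess


def diff_commands(scope: str) -> list[list[str]]:
    """Translate a review scope into the git diff invocations that cover it."""
    scope = scope.lower()
    if scope == "uncommitted":
        # Uncommitted work is unstaged plus staged changes.
        return [["git", "diff"], ["git", "diff", "--staged"]]
    if scope in {"this branch", "my pr"}:
        return [["git", "diff", "main...HEAD"]]
    raise ValueError(f"unknown review scope: {scope!r}")


def run_diff(scope: str) -> str:
    """Run the commands for a scope and return the combined diff text."""
    outputs = [
        subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
        for cmd in diff_commands(scope)
    ]
    return "".join(outputs)
```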
## Severity model
Use four levels in this exact order:
- Blocker: must fix before merge (broken behaviour, security, data loss, sensitive-data exposure, build break, hard violation of a project-wide convention)
- Major: should fix before merge (architectural smell, perf regression, accessibility fail, missing error handling on a user path)
- Minor: fix when convenient (naming, small duplication, awkward types)
- Nit: take it or leave it (style preference, micro-optimisation)
Group findings by severity. If there are no blockers, say so up front so the author knows it's safe to merge after addressing the rest. Skip a severity heading entirely if its section is empty.
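The grouping rules (fixed order, empty headings skipped, blockers called out up front) can be sketched as below. The `(severity, text)` tuple shape, the `##` heading level, and the "No blockers." wording are assumptions for the example, not something the skill prescribes.

```python
SEVERITIES = ["Blocker", "Major", "Minor", "Nit"]


def render_findings(findings: list[tuple[str, str]]) -> str:
    """Group (severity, text) findings in the fixed severity order, skipping empty levels."""
    lines = []
    if not any(severity == "Blocker" for severity, _ in findings):
        lines.append("No blockers.")  # say so up front
    for level in SEVERITIES:
        items = [text for severity, text in findings if severity == level]
        if not items:
            continue  # skip the heading entirely when the section is empty
        lines.append(f"## {level}")
        lines.extend(f"- {text}" for text in items)
    return "\n".join(lines)
```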
## What to actively look for
### Structure
**Architecture and boundaries** — where state lives, module/layer boundaries, premature abstraction, leaky abstractions, server vs client split where the framework distinguishes them, route/handler vs action choice.
**Framework specifics** — idiomatic use of the framework's primitives at the version in play (hooks, components, lifecycle, suspense/streaming, hydration, server-only vs client-only modules).
**Caching and rendering** — caching defaults, force-dynamic overuse, metadata, asset optimisation, route configuration, parallel/intercepting routing where applicable.
### Safety
**Domain-sensitive data** — handling of any sensitive or regulated data (PII, PHI, financial, secrets) — logging, analytics, error reports, URLs, third-party transports. Flag anything that could leak it.
**TypeScript** — `any`, `as` casts, non-null assertions, missing discriminated unions where state is modelled as separate booleans, untyped boundary data (form payloads, fetch responses) extracted without runtime validation.
**Workflow rules from `CLAUDE.md`** — code that violates project-wide conventions (e.g. directory-traversal anti-patterns, secret-file access, sleep/poll patterns, banned tooling shortcuts).
### UI
**Styling** — arbitrary values where a token exists, design-token discipline (semantic vs primitive layers), opacity-tint pitfalls, utility misuse, CSS-config drift. Flag hardcoded hex/colours when a token would do.
**Typography** — any drift from the project's defined font stack.
**Accessibility** — keyboard nav, focus states, aria, label associations, colour contrast, custom popovers/menus, multi-step or live-region announcements.
**Performance** — bundle bloat from over-importing UI libraries, large client components that could be server, unnecessary client-only directives, unmemoised expensive renders, image and font loading.
### UX
Check these against the rendered UI intent, not just the code structure. Only flag when the violation is clear from the diff — don't speculate about designs you can't see.
**Cognitive load** *(Cognitive Load, Miller's Law)* — single views that demand too many decisions at once; forms with more than ~7 ungrouped fields; steps collapsed into one screen that should be staged.
**Choice architecture** *(Hick's Law, Choice Overload)* — navigation or menus with too many ungrouped options; selects or radio groups that would benefit from chunking or progressive disclosure.
**Feedback & response time** *(Doherty Threshold)* — async actions (form submits, mutations, uploads) with no loading/pending state; success and error states missing or ambiguous after the operation completes.
**Target size & reachability** *(Fitts's Law)* — interactive elements (buttons, links, checkboxes) with touch targets below ~44×44px or positioned where they're hard to reach on mobile.
**Convention violations** *(Jakob's Law)* — custom interaction patterns (drag-to-dismiss, swipe-to-delete, inline edit) where a standard platform pattern exists and the custom one buys nothing.
**Visual hierarchy** *(Von Restorff Effect, Serial Position Effect)* — primary CTA not visually distinct from secondary actions; the most important item buried mid-list rather than first or last.
**Flow continuity** *(Zeigarnik Effect)* — multi-step flows that lose user progress on back-navigation or reload; incomplete states (drafts, partially filled forms) not persisted.
**Peak moments** *(Peak-End Rule)* — error states that are generic or buried; empty states and success/completion screens that are plain when they're a key user moment.
### Polish
**Spelling/locale** — any drift from the project's chosen locale (en_US vs en_GB/en_AU, etc.).
## How to deliver the review
Start with a one-line verdict ("ship it after the two minors", "blocker on the {area}, hold", etc). Then a 2-3 sentence summary of what changed and the reviewer's read on it.
If a spec was found (FD or `/specs/`), include a **Spec Compliance** section immediately after the summary, before severity findings. For each acceptance-criteria scenario, mark it:
- ✅ — implemented and verifiable in the diff
- ⚠️ — partially implemented or unclear from the diff (explain briefly)
- ❌ — not implemented (this is at minimum a Major finding; escalate to Blocker if the scenario covers a user-visible behaviour)
- — — not in scope for this change (e.g. scenario belongs to a later phase)
If no spec was found, omit the section entirely — don't write "no spec found" noise.
Then findings grouped by severity, each as: `file:line` — finding — why it matters — what to do. Quote the offending line if useful. Skip the severity heading if a section is empty. End with "Out of scope but worth noting" only if there's something genuinely worth flagging that wasn't part of the diff.
## Output location and filename
Every review is written to a file in `.plans/code-review/` at the repo root. Never dump the review into chat only — always write the file and then surface the verdict + key findings inline.
Filename format: `CR-XXX - scope - change-short-name.md`
- `CR-XXX` is a zero-padded sequential number (CR-001, CR-002, ...). Before writing, list `.plans/code-review/` and use the next number after the highest existing one. If the folder doesn't exist yet, create it and start at CR-001.
- `scope` is the area being reviewed (an app, package, or service name). If the change spans multiple scopes, use a name that captures the umbrella (e.g. `monorepo`). If it's not scope-specific (root config, CI, tooling), use `root`.
- `change-short-name` is a kebab-case 2-5 word summary of what changed (e.g. `auth-flow-refactor`, `submit-route-fix`, `picker-a11y`).
- Spaces around the dashes in the filename are intentional — match the format exactly.
Examples:
- `.plans/code-review/CR-001 - {scope} - {change-slug}.md`
- `.plans/code-review/CR-002 - {scope} - {change-slug}.md`
- `.plans/code-review/CR-003 - root - {change-slug}.md`
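The numbering rule is easy to get off by one, so here is a minimal TypeScript sketch of it. `nextReviewFilename` is a hypothetical helper, not part of the skill, and it assumes the caller has already listed `.plans/code-review/` and passed the filenames in.

```typescript
// Hypothetical helper: derive the next review filename from a directory listing.
// Assumes `existing` holds the filenames found in `.plans/code-review/`.
function nextReviewFilename(
  existing: string[],
  scope: string,
  changeSlug: string,
): string {
  // Pull the numeric part out of names shaped like "CR-007 - root - some-change.md".
  const numbers = existing
    .map((name) => /^CR-(\d+) - /.exec(name))
    .filter((m): m is RegExpExecArray => m !== null)
    .map((m) => Number(m[1]));

  // An empty folder (or one with no matching files) starts at CR-001.
  const next = numbers.length === 0 ? 1 : Math.max(...numbers) + 1;

  // Zero-pad to three digits by hand so the sketch has no runtime requirements.
  let digits = String(next);
  while (digits.length < 3) digits = "0" + digits;

  // The spaces around the dashes are part of the format.
  return `CR-${digits} - ${scope} - ${changeSlug}.md`;
}
```

Note that the template file `CR-000 - code-review TEMPLATE.md` counts as number 0 here, so a folder holding only the template still yields CR-001.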
## File contents
The markdown file uses real markdown (no bare-line frontmatter — GitHub renders that as one paragraph). Match the shape in `CR-000 - code-review TEMPLATE.md`:
```markdown
# Code review — <scope> — <change-short-name>
## Header
- **ID:** CR-XXX
- **Scope:** <scope>
- **Change:** <change-short-name>
- **Date:** YYYY-MM-DD
- **Commit/branch:** <short SHA / branch name / "uncommitted">
- **Spec:** FD-XXX / `/specs/<scope>/spec.md` / none
- **Verdict:** <one-line verdict>
## Summary
…
## Spec compliance
(omit entirely if no spec)
## Blocker / Major / Minor / Nit / Out of scope but worth noting
(skip the heading entirely when a section is empty)
```
Each finding is a list item: `` `file:line` — finding — why it matters — what to do.``
## Implementation handoff (fixing the findings)
When the user signals they want the review's findings fixed (phrases like "fix these", "address the review", "ship the fixes", "do it"):
1. **Re-read the review file end-to-end** — the user may have annotated, downgraded, or added findings.
2. **Pick the active scope** in this order: every `Blocker` first, then `Major`, then `Minor`, then `Nit`. Skip `Out of scope but worth noting` unless explicitly asked.
3. **Mirror each finding into visible task tracking** using `TaskCreate` — one task per finding, titled with the `file:line` and a short verb ("Fix tag-load fallback at `inbox/page.tsx:88`"). Severity goes in the description.
4. **Update tasks live as work proceeds.** Mark each `in_progress` before starting, `completed` when the change is made and the relevant computational checks (typecheck, test, lint) are green. Do not batch.
5. **Mirror status back into the review file.** Replace each addressed finding's bullet with a struck-through line plus a one-line note ("Fixed — debounce wired in `useDebouncedValue`"). Update the verdict line at the top if the remaining findings change the merge guidance.
6. **Block on red.** If a fix breaks tests or lint, leave the task `in_progress`, note the regression below the finding, and surface it in chat. Don't silently call it done.
7. **Done means both surfaces agree.** Implementation is complete only when every chosen-severity finding is resolved in code AND the review file reflects the resolved state.
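Step 2 is a filter and a sort. A minimal sketch, assuming findings have already been parsed out of the review file into objects; the `Finding` shape and `pickActiveScope` are hypothetical names for illustration.

```typescript
type Severity = "Blocker" | "Major" | "Minor" | "Nit" | "Out of scope";

// Hypothetical shape for one parsed finding.
interface Finding {
  location: string; // "file:line"
  severity: Severity;
  summary: string;
}

// The order the handoff works through findings.
const FIX_ORDER: Severity[] = ["Blocker", "Major", "Minor", "Nit"];

function pickActiveScope(findings: Finding[], includeOutOfScope = false): Finding[] {
  const extra: Severity[] = includeOutOfScope ? ["Out of scope"] : [];
  const order = FIX_ORDER.concat(extra);
  return findings
    .filter((f) => order.indexOf(f.severity) !== -1) // drop anything not in scope
    .sort((a, b) => order.indexOf(a.severity) - order.indexOf(b.severity));
}
```

Out-of-scope notes only enter the list when the caller asks for them, which mirrors the "unless explicitly asked" rule.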
## What the reviewer must NOT do
- Don't refactor unprompted.
- Don't suggest renaming things just because.
- Don't pad the review with praise sandwiches.
- Don't list every minor inconsistency — pick the ones that matter.
- Don't recommend tests unless the change actually warrants them (a token swap doesn't need tests; a new API route does).
- Don't suggest extracting shared packages or restructuring the project layout unless the user asked for it.
- **Defer to smaller models for routine reads.** When you need to read a known file,
grep for a specific symbol, or fetch a single doc page, do it directly with `Read` /
`Grep` / `WebFetch`. But anything that fans out — exploring an unfamiliar subsystem,
finding all call-sites of a symbol, summarising a long doc, comparing multiple files —
should be delegated to a subagent with `model: "haiku"` (or `"sonnet"` if the task
needs reasoning). Reserve the top-level Opus session for the synthesis work that
actually needs it: judging severity, weighing tradeoffs, writing the final review.
Don't pay Opus rates for grep.
## Output format
Plain markdown, no bolded headings, no excessive bullet nesting. Code references as backticks. File paths relative to repo root.
I run this on myself before I merge. The model wrote the code. I want a second pass with fresh eyes that aren't the same eyes that wrote it.
Do a CR on the uncommitted changes on this branch.
The skill reads CLAUDE.md to remember project conventions, runs git diff, reads the surrounding code (diffs miss context), and writes a file at .plans/code-review/CR-001 - campervan - location-search.md. One-line verdict at the top. Findings grouped by severity: blockers, majors, minors, nits.
This is the inferential sensor in the harness. LLM as judge. Slower and more expensive than a linter, but a linter can't tell you the abstraction is wrong, the naming is misleading, or the route shouldn't be a client component. Computational controls catch the things you can spell out as a rule. This catches the things you can only spot by reading the code.
The voice is what makes it useful. Principal engineer reviewing a colleague's PR. Direct. No "consider" weasel words. If something's fine, it says so and moves on. If something's wrong, it says why and what to do instead. Last week it caught a hardcoded hex where a token would've done, a route that should've been server-only but had 'use client' at the top, and a Promise<any> in a service file because I'd been lazy. None would've stopped the build. All would have annoyed me three weeks later when I'd forgotten the context.
testing-pyramid
testing-pyramid-skill.md on GitHub
---
name: testing-pyramid
description: Plan or audit a project's test coverage against the testing pyramid (unit / integration / e2e). Use when the user asks "what should we test", "is our coverage right", "are we over-testing", "are we missing tests", "what layer should this go in", or wants a review of an existing test suite for bloat or gaps. Outputs a coverage map, layer recommendations, and a concrete edit list — not a fresh test suite.
metadata:
argument-hint: <feature/area to test, plan ID, or "audit" to review the existing suite>
---
# Testing Pyramid
## Harness role
- **Control type:** guide (feedforward)
- **Regulation category:** behaviour
- **Lifecycle stage:** pre-implementation (Mode A — Plan) and post-implementation / on cadence (Mode B — Audit)
- **Computational checks:** the test plan itself becomes part of the deterministic gate — every behaviour in the surface-area table is a test that runs in the relevant CI tier
- **Inferential checks:** layer-fit judgement, bloat detection, gap detection, layer-pick decisions (Mode C)
- **Frame:** [*Harness Engineering*](https://martinfowler.com/articles/harness-engineering.html) by Birgitta Böckeler
This skill plans or audits a project's test coverage using the testing pyramid as the rubric. It does NOT write tests — it produces a coverage map, identifies gaps and bloat, and outputs a prioritised edit list. Implementation is a separate task.
The pyramid the skill uses:
```
        /\
      /e2e \          few — cross-page flows, real DB, real cookies
     /------\
    / inte-  \        some — route-handler logic, component renders
   / gration  \
  /------------\
 /     unit     \     many — pure functions, validators, reducers, hooks
/----------------\
```
Heuristic: **if a test would pass with a stubbed-out async result, it doesn't belong in the e2e layer.**
## When to use
Trigger this skill when the user asks any of:
- "What tests should I write for X?"
- "Is our test coverage right?" / "Are we missing tests?"
- "Is this over-tested?" / "Why is CI so slow?"
- "What layer should this assertion go in?"
- "Audit the test suite" / "Review tests in `<path>`"
- "Plan tests for {feature or plan ID}"
Do NOT trigger this skill when:
- The user wants you to actually write the test code (use a normal task; cite this skill's output if one exists).
- The user is debugging a single failing test (just fix it).
- The user wants a code review of a single PR — that's a different task.
## Modes
The skill has three modes. Pick one based on the trigger phrase or ask if ambiguous.
### Mode A — **Plan**: design tests for a new feature / FD
Input: a feature description, a plan-file path, or "the changes on this branch."
Output: a markdown file at `.plans/testing-pyramid/TP-XXX - <scope> - <short topic>.md` (numbered sequentially — find the highest existing `TP-NNN` and increment) containing:
1. **Layer rubric** (the standard table — copy verbatim).
2. **Surface area** — list every behaviour the feature introduces, one row per behaviour, columns: `Behaviour`, `Layer`, `Test file (proposed)`, `Notes`.
3. **Open questions** — only the judgement calls (e.g. "do we mock translations or use the real provider?"). Use the in-file checkbox format defined below.
4. **Files to create / modify** with a Pass column (P1/P2/P3) for ordering.
5. **Out of scope** so the plan stays bounded.
### Mode B — **Audit**: review an existing test suite
Input: a test directory (e.g. `<project>/tests/`) or "this project".
Output: a markdown file at `.plans/testing-pyramid/TP-XXX - <scope> - audit.md` (numbered sequentially — find the highest existing `TP-NNN` and increment) containing:
1. **Pyramid shape** — counts by layer (e.g. unit: 18, integration: 0, e2e: 12) and a sentence on whether the shape is healthy. An inverted or hourglass pyramid is a finding.
2. **Bloat** — tests that are at the wrong layer, duplicates across layers, or assertions that prove nothing the next-layer-up doesn't already prove. Cite specific files and line ranges.
3. **Gaps** — behaviours with no test coverage at any layer, or critical paths covered only by flaky e2e specs.
4. **Flakes** — tests that have been quarantined, marked `.skip`, or have a history of intermittent failures (grep `it.skip`, `test.skip`, `// flaky`, `xit`, `xdescribe`).
5. **Concrete edits** ordered as Pass 1 (must), Pass 2 (should), Pass 3 (nice). Each cites a file path and a one-sentence rationale.
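Items 1 and 4 are the mechanical parts of the audit and can be sketched in TypeScript. This assumes the suggested `tests/unit/`, `tests/integration/`, `tests/e2e/` layout; the function names are hypothetical, and the thresholds in `diagnose` are one reading of "inverted" and "hourglass", not a definition the skill gives.

```typescript
type Layer = "unit" | "integration" | "e2e";
type Shape = Record<Layer, number>;

// Classify a test file by directory. Returns null for anything outside the layout.
function layerOf(path: string): Layer | null {
  const m = /(?:^|\/)tests\/(unit|integration|e2e)\//.exec(path);
  return m ? (m[1] as Layer) : null;
}

// Count test files per layer.
function pyramidShape(paths: string[]): Shape {
  const shape: Shape = { unit: 0, integration: 0, e2e: 0 };
  for (const p of paths) {
    const layer = layerOf(p);
    if (layer !== null) shape[layer] += 1;
  }
  return shape;
}

// Assumed thresholds: more e2e than unit is inverted,
// a middle layer thinner than the top is an hourglass.
function diagnose(shape: Shape): "healthy" | "inverted" | "hourglass" {
  if (shape.e2e > shape.unit) return "inverted";
  if (shape.integration < shape.e2e) return "hourglass";
  return "healthy";
}

// The skip and quarantine markers the audit greps for.
const FLAKE_MARKERS = /\b(?:it|test)\.skip\b|\bxit\b|\bxdescribe\b|\/\/\s*flaky\b/;

// Return the 1-based line numbers that carry a marker.
function flaggedLines(source: string): number[] {
  const hits: number[] = [];
  source.split("\n").forEach((line, i) => {
    if (FLAKE_MARKERS.test(line)) hits.push(i + 1);
  });
  return hits;
}
```

With the counts from the example above (unit: 18, integration: 0, e2e: 12), `diagnose` returns `"hourglass"`. The counting is the cheap part; deciding whether a given shape is actually a problem stays a judgement call.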
### Mode C — **Layer pick**: which layer for this one test?
Input: a description of one assertion ("does pasting an invite code with too few chars keep the button disabled?").
Output: an inline answer (≤120 words). Layer + the cheapest test that proves it + the file path it would live in. No markdown file written.
## Layer rubric (canonical)
| Layer | What it proves | When to reach for it | Speed | Example file path |
|-------|----------------|----------------------|-------|-------------------|
| **Unit** (`tests/unit/`) | Pure functions; validators; reducers; component branching with mocked deps; small hook logic | Behaviour fits a single module, no real DOM tree, no network. | <100ms | `tests/unit/<module>.test.ts` |
| **Integration** (`tests/integration/`) | Route-handler logic with real validators + mocked clients for external systems; component renders of a single page with mocked hooks; assertion of branching, error states, focus, accessibility | Behaviour spans 2–3 modules, no real browser or live network. | 100–500ms | `tests/integration/route-handlers/<route>.test.ts`, `tests/integration/components/<Component>.test.tsx` |
| **E2E** (`tests/e2e/`) | Cross-page flows touching DOM + cookies + redirects + DB — value is in the *between-pages* movement, not leaf logic | Genuinely end-to-end behaviour. The test would lose its point if you stubbed any single layer. | seconds | `tests/e2e/<flow>.spec.ts` |
## Questions the skill asks while planning
Cap one round at ~3 in-file questions. Skip questions whose answers are derivable from the codebase (config files, existing patterns). Ask only judgement calls.
### In-file question format (mandatory)
Every open question MUST use this exact shape — discrete checkbox options, a recommendation inline on the recommended option, plus an "Other" slot and a notes block. Free-prose questions without options are a failure mode; do not write them.
```markdown
#### QX. {short question title}
{One or two sentences explaining the question and why it matters.}
- [ ] Option A — {short description of what this option means}
- [ ] Option B — {short description} **(recommended — {one-line reason})**
- [ ] Option C — {short description}
- [ ] Other — notes:
- _{your note here}_
**Notes / reasoning:**
- _{anything else worth recording about this pick}_
```
Rules:
- **Always offer concrete options.** Vague "how should we handle X?" is banned. If you can't think of two real options, the question isn't ready to ask.
- **Mark exactly one option `(recommended — …)`** with a one-line reason. Don't make the user derive your opinion.
- **Every question has an Other + Notes slot** so the user can override or annotate.
- **Questions are numbered** (Q1, Q1a, Q2, …) so chat can reference them.
- **Cap at ~3 per round.** Pick the questions that unblock the most other decisions.
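Because the format is rigid, a question block can be read back mechanically. Here is a rough TypeScript sketch of a parser for one block; `parseQuestion` and `ParsedQuestion` are hypothetical names, and the sketch assumes the block starts at its `#### QX.` heading.

```typescript
interface ParsedQuestion {
  id: string; // "Q1", "Q1a", ...
  title: string;
  ticked: string[]; // labels of ticked options
  recommended: string | null; // label of the recommended option, if marked
}

const EM_DASH = "\u2014"; // the separator the question format uses

function parseQuestion(block: string): ParsedQuestion | null {
  const lines = block.split("\n");
  const heading = /^#### (Q\d+[a-z]?)\. (.+)$/.exec(lines[0]);
  if (heading === null) return null;

  const ticked: string[] = [];
  let recommended: string | null = null;
  for (const line of lines.slice(1)) {
    // Options look like "- [ ] Option A <dash> description" or "- [x] ...".
    const option = /^- \[( |x|X)\] (.+)$/.exec(line);
    if (option === null) continue;
    const label = option[2].split(` ${EM_DASH} `)[0];
    if (option[1] !== " ") ticked.push(label);
    if (option[2].indexOf(`(recommended ${EM_DASH} `) !== -1) recommended = label;
  }
  return { id: heading[1], title: heading[2], ticked, recommended };
}

// A question counts as answered once at least one box is ticked.
function isResolved(q: ParsedQuestion): boolean {
  return q.ticked.length > 0;
}
```

This only covers the checkbox lines. Free-text notes under "Other" still need a human or a model to read them.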
### Common questions for testing plans
Use these as templates — adapt the options to the project, but keep the format above.
1. **i18n / translation strategy.** Mock the translation hook to identity, or wrap renders in the real translation provider with the real message catalogue? (Default: real provider when missing-key bugs would be silent.)
2. **Test runner config split.** One config covering unit + integration, or a sibling `*.integration.config.*`? (Default: split when CI parallelism matters or integration runs are slow.)
3. **Existing-test handling.** Leave existing tests, rename + rewrite in place, or delete and rewrite from scratch? (Default: rewrite in place unless they're entirely about removed behaviour.)
4. **E2E scope.** Smoke (one happy path per surface) or full (every UX branch)? (Default: smoke.)
5. **Outbound HTTP stubbing.** Network-interception library, recorded fixtures, or a contract tool? (Default: a network-level interceptor for consumer apps; contract tools only when you also own the provider.)
6. **DB strategy.** Module-mock the client, run a local DB container, or contract-stub the wire protocol? (Default: module-mock unless you have row-level-security / triggers worth exercising.)
## Bloat detection
Flag a test as bloat in the audit when:
- **Wrong layer.** A browser-driven e2e test that asserts a regex on rendered text from a single component — a component render would prove the same thing in 50ms. ("e2e spec asserting a single heading is visible" is the canonical example.)
- **Multi-layer duplicate.** The same assertion exists at unit + integration + e2e. Pick the cheapest layer that's still meaningful; delete the others.
- **Mock theatre.** A test that mocks every collaborator and asserts the mocks were called with the values you passed in. The behaviour under test is "the function calls its arguments" — delete it.
- **Setup-heavy / assertion-light.** Setup ≥ 5× the assertion lines. Either the test is testing setup, or the unit under test has too many seams. Flag for either deletion or a refactor question.
- **Implementation snapshot.** Snapshot tests on rendered HTML that drift on every legitimate copy change. The signal-to-noise is low; delete unless the snapshot covers a real invariant (token usage, ARIA structure).
- **Coverage-percentage tests.** Tests written purely to bump a coverage percentage with no behavioural claim. Delete.
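Of these, the setup-heavy rule is the only one that reduces to arithmetic. A crude TypeScript sketch, assuming assertions are written as `expect(...)` calls and that every other non-blank line counts as setup; `isSetupHeavy` is a hypothetical name.

```typescript
// Crude proxy: compare non-assertion lines to assertion lines in one test body.
// Assumes `expect(` marks an assertion; adjust for other assertion styles.
function isSetupHeavy(testBody: string, ratio = 5): boolean {
  const lines = testBody
    .split("\n")
    .map((l) => l.trim())
    .filter((l) => l.length > 0);
  const assertions = lines.filter((l) => l.indexOf("expect(") !== -1).length;
  const setup = lines.length - assertions;

  // A body with setup and no assertions at all is flagged outright.
  if (assertions === 0) return setup > 0;
  return setup >= ratio * assertions;
}
```

Treat a hit as a prompt to read the test, not a verdict. The other bloat categories need someone to actually read what the test claims.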
## Sufficiency detection
Flag a behaviour as under-tested when ALL of these are true:
- It's user-visible OR it crosses a system boundary (HTTP, DB, auth).
- A bug here would be discovered by a user, not by the next test that runs.
- No layer currently asserts the behaviour. (Coverage tools count line execution, not assertions — read the actual tests.)
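The three conditions combine as an AND with one OR inside the first. A small TypeScript sketch makes the precedence explicit; the `BehaviourFacts` shape is hypothetical and each field is something the auditor has to judge by reading the code and tests.

```typescript
// Hypothetical record of what the auditor found for one behaviour.
interface BehaviourFacts {
  userVisible: boolean;
  crossesBoundary: boolean; // HTTP, DB, auth
  userWouldFindItFirst: boolean; // no later test would catch the bug
  assertedAt: string[]; // layers with a real assertion, e.g. ["unit"]
}

function isUnderTested(b: BehaviourFacts): boolean {
  return (
    (b.userVisible || b.crossesBoundary) &&
    b.userWouldFindItFirst &&
    b.assertedAt.length === 0
  );
}
```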
Critical-path checklist (use as a prompt, not a checklist gate):
- Authentication flow: signin/signup happy path, error path, OAuth callback.
- Authorisation: the cheapest "user A can't see user B's data" test.
- Money path: anything that creates, modifies, or charges a billing record.
- Data write path: the API route that creates the most-queried table row.
- Empty state: every page that has one. Empty states regress silently.
- Redirect chain: any flow with two or more redirects (auth callback is the usual culprit).
## Output format
### Mode A / Mode B output
A single markdown file at `.plans/testing-pyramid/TP-XXX - <app> - <topic|audit>.md` with these sections in order:
```markdown
# Test plan / audit — <subject>
## Pyramid shape (audit only)
{counts + one-sentence diagnosis}
## Layer rubric
{copy the canonical table verbatim}
## Surface area / Findings
{table per the mode's spec}
## Open questions
{checkbox blocks using the in-file question format above — REMOVE blocks once resolved; this section only contains live, unresolved questions}
## Decisions
{table populated as questions resolve — see Decisions table format below}
## Files to create / modify
| File | Pass | Type (create/modify/delete/done) | Notes |
| ---- | ---- | -------------------------------- | ----- |
## Model routing
| Step | Model | Reason |
| ---- | ---------- | ------------------------------------------------------- |
| 1 | **haiku** | {one line — pure mechanical work} |
| 2 | **sonnet** | {one line — bounded judgement, well-specified} |
| 3 | **opus** | {one line — design / risky / cross-cutting / synthesis} |
If a sonnet/haiku step surfaces a non-trivial decision, escalate to the main session rather than guess.
## Out of scope
{bullets}
```
The file is the durable artefact. Chat updates between cycles are ≤120 words.
### Mode C output
Inline answer only. Format:
> **Layer:** <unit | integration | e2e>
> **Why:** <one sentence>
> **File:** `<path>` (existing or new)
> **Why not the next layer up:** <one sentence — what mocks would still apply>
## Model routing table (mandatory in every Mode A / Mode B plan)
The defer-to-smaller-models rule is theoretical until you write it down per step. **Every Mode A and Mode B plan
includes a Model routing table** (in the `## Model routing` section of the output) so the cheapest viable model is
picked deliberately rather than by reflex.
Template (shown in the output format above — fill in one row per pass of the "Files to create / modify" list):
```markdown
## Model routing
| Step | Model | Reason |
| ---- | ---------- | ------------------------------------------------------- |
| 1 | **haiku** | {one line — pure mechanical work} |
| 2 | **sonnet** | {one line — bounded judgement, well-specified} |
| 3 | **opus** | {one line — design / risky / cross-cutting / synthesis} |
If a sonnet/haiku step surfaces a non-trivial decision, escalate to the main session rather than guess.
```
**Rubric for picking the model:**
- **haiku** — pure mechanical: run a command, grep, count lines, file moves, single-line edits, install a dep, confirm
a number, paste known content. No judgement required; if the agent has to choose between options, it's not haiku-class.
- **sonnet** — bounded judgement: apply a pattern the plan specifies, refactor following a recipe, fix lint errors with
clear rules, write boilerplate from a spec, classify hits into a fixed set of buckets. The agent makes small calls
inside well-defined rails.
- **opus** (= top-level session) — design, architecture, risky migrations, cross-cutting changes, synthesising multiple
inputs, anything where getting it wrong wastes more than the model-cost saving. Don't delegate this.
**Rules:**
- **Each pass in "Files to create / modify" gets a row.** No row → no execution. If a pass has no row, the plan is incomplete.
- **Default towards cheaper.** If you're choosing between sonnet and opus and the work is well-specified by the plan,
pick sonnet. Reserve opus for the parts where the plan doesn't fully prescribe the answer.
- **The escalation line is mandatory.** Sonnet/haiku subagents must know they can return without guessing — they will
guess otherwise.
- **Re-evaluate after the design step.** Once the inventory/design step (usually opus) is done, the remaining steps are
often more mechanical than first thought; downgrade them if so.
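The rubric and the "no row, no execution" rule can be written down as a sketch in TypeScript. The two-question `StepProfile` is a simplification assumed here, not something the rubric defines, and all names are hypothetical.

```typescript
type Model = "haiku" | "sonnet" | "opus";

// Simplified view of a step: does the agent have to choose between options,
// and if so, does the plan already prescribe the answer?
interface StepProfile {
  needsChoice: boolean;
  prescribedByPlan: boolean;
}

function pickModel(step: StepProfile): Model {
  if (!step.needsChoice) return "haiku"; // pure mechanical work
  return step.prescribedByPlan ? "sonnet" : "opus"; // default towards cheaper
}

interface RoutingRow {
  step: number;
  model: Model;
  reason: string;
}

// Passes from "Files to create / modify" that have no routing row yet.
// A non-empty result means the plan is incomplete.
function unroutedPasses(passes: number[], rows: RoutingRow[]): number[] {
  const routed = rows.map((r) => r.step);
  return passes.filter((p) => routed.indexOf(p) === -1);
}
```

Real steps rarely split this cleanly, which is why the escalation line exists: a cheaper model that hits an unprescribed choice hands it back instead of guessing.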
## Iteration loop (Mode A and Mode B)
Once the plan/audit file exists, this skill runs an in-file Q&A loop until the user signs off. No test code is written while the loop is running — the file is the artefact, chat is for terse status only.
### Beat 1 — Read current state
Read the full plan file end to end. Identify:
- Ticked checkboxes and any inline notes the user added under questions
- Any new prose the user inserted into Surface area / Findings / Files-to-modify (a direct edit is a decision too)
- Sections that have drifted from current reality (e.g. file paths that have moved, layer choices the codebase no longer supports)
- The current `## Open questions` and `## Decisions` sections — what's answered, what isn't
### Beat 2 — Process user answers (fold FIRST, then move)
For each question the user has resolved:
1. **Fold the answer into every section of the plan it affects** — Surface area rows, Files to create / modify, layer notes, prose. The plan must read consistently as if the decision was always there. Orphaned answers sitting in `## Open questions` while the body still hedges is the most common failure mode of this skill.
2. **Append the decision to the `## Decisions` table** (format below).
3. **Remove the resolved question block from `## Open questions`** — do not leave ticked checkboxes hanging around. The section should only ever contain *unresolved* questions.
The order matters: fold first, then record, then prune. Skipping the fold-in step leaves the plan contradictory.
Watch for answers that contradict things written in earlier cycles. Reconcile explicitly — never silently keep the old wording.
### Beat 3 — Keep progress live
Update `## Pyramid shape (current)` after each implementation pass — actual file counts per layer plus a one-line note on what's done and what remains.
For the `## Files to create / modify` table, mark rows as you complete them — flip the `Type` column from `create` / `modify` to `done`, or add a `✓` prefix. Don't delete rows; the table is a checklist the user reads to see what's left.
### Beat 4 — Raise new questions
Answers almost always raise new questions. Surface them **in the plan file** using the in-file question format above (NOT only in chat). Cap one round at ~3 questions — pick the ones that unblock the most other decisions. If you have more than three, hold the rest for the next cycle.
### Beat 5 — Report back and wait
Send the user a terse message (≤120 words) covering:
- What was folded in based on their answers
- What new questions were raised (by ID) and where in the file to find them
- One unilateral decision made if any — flag it explicitly so the user can override
Then stop. The user will either answer the new questions (another cycle begins) or sign off.
## Decisions table format
`## Decisions` is a **table**, not a list — columns: `✅`, `Question`, `Decision`, `Why`. Keep each `Why` cell to one sentence so the table scans fast. Order rows by when each question was raised (not alphabetically) — the order itself carries information about how the design evolved.
Example:
```markdown
## Decisions
| ✅ | Question | Decision | Why |
|----|----------|----------|-----|
| ✅ | Q1 — Translation strategy | Wrap component renders in the real translation provider with the real message catalogue | Missing-key bugs would otherwise be silent |
| ✅ | Q2 — Test runner config split | Single config covering unit + integration | CI parallelism not yet a bottleneck |
```
## Implementation handoff (writing the tests)
When the user signals the plan/audit is ready to execute (phrases like "write the tests", "implement the plan", "ship it", "fix the bloat"):
1. **Re-read the plan/audit file end-to-end** — Surface area / Findings rows may have been edited; layer choices may have shifted.
2. **Pick the active scope** by walking the `## Files to create / modify` table in Pass order: P1 first, then P2, then P3. Stop at the pass the user named, if any.
3. **Mirror each row into visible task tracking** using `TaskCreate` — one task per file (or per behaviour for big files), titled with the test file path and the verb (`create tests/integration/inbox.test.ts`, `delete bloat in tests/e2e/heading-visible.spec.ts`).
4. **Update tasks live as work proceeds.** Mark each `in_progress` before starting, `completed` when the file is written AND the relevant runner is green at that layer (unit/integration/e2e config). Do not batch.
5. **Mirror status back into the plan.** Flip the row's `Type` column from `create` / `modify` / `delete` to `done` (don't delete the row). Update `## Pyramid shape (current)` after each pass — actual file counts per layer plus a one-line note on what's done and what remains.
6. **Block on red.** A flaky or failing test is not "done". Leave the task `in_progress`, note it in the row's Notes column, and surface in chat.
7. **Done means both surfaces agree.** Implementation is complete only when every chosen-pass row is `done` AND the task list is fully `completed`.
## Sign-off gate (Mode A and Mode B)
Don't declare the plan/audit done until ALL of the following are true:
1. Every question in the file is answered — no unticked checkboxes in `## Open questions`, no `{your note here}` placeholders the user was expected to fill.
2. The plan sections (Surface area / Findings, Files to create / modify, Pyramid shape, etc.) are self-consistent — you could hand the document to someone cold and they could execute it without re-asking the user.
3. **The `## Model routing` table is populated** — one row per pass of the "Files to create / modify" list, model picked
deliberately from the rubric, escalation line present. See "Model routing table (mandatory in every Mode A / Mode B
plan)". A plan without this table isn't signed off, no matter how good the rest looks.
4. The user has explicitly confirmed they want to proceed. Common sign-off phrases: "looks good, start writing", "go", "implement", "ship it", "approved". If the user's latest message doesn't clearly sign off, ask: "Is this ready to implement, or do you want another pass?" — one sentence, then stop.
Until all four are true, stay in the loop. Do not start writing test files. Do not even read the implementation source files unless they're needed to answer a planning question.
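Most of the gate can be checked mechanically against the plan file. A rough TypeScript sketch, assuming resolved question blocks have been pruned as Beat 2 requires, so any `#### Q` heading left under `## Open questions` is unresolved. `signOffBlockers` is a hypothetical name, and the self-consistency condition is left out because it needs a reader, not a regex.

```typescript
// Return the body of a "## <heading>" section, up to the next "## " heading.
function section(plan: string, heading: string): string {
  const lines = plan.split("\n");
  const start = lines.indexOf(`## ${heading}`);
  if (start === -1) return "";
  const rest = lines.slice(start + 1);
  let end = rest.length;
  for (let i = 0; i < rest.length; i++) {
    if (/^## /.test(rest[i])) {
      end = i;
      break;
    }
  }
  return rest.slice(0, end).join("\n");
}

// List the reasons the plan cannot be signed off yet. Empty means the
// mechanical checks pass; self-consistency still needs a human read.
function signOffBlockers(plan: string, userConfirmed: boolean): string[] {
  const blockers: string[] = [];
  if (/^#### Q\d+/m.test(section(plan, "Open questions"))) {
    blockers.push("open questions remain");
  }
  if (plan.indexOf("{your note here}") !== -1) {
    blockers.push("unfilled note placeholder");
  }
  const routing = section(plan, "Model routing");
  if (!/\*\*(haiku|sonnet|opus)\*\*/.test(routing)) {
    blockers.push("model routing table is empty");
  }
  if (routing.indexOf("escalate to the main session") === -1) {
    blockers.push("escalation line missing");
  }
  if (!userConfirmed) blockers.push("no explicit sign-off from the user");
  return blockers;
}
```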
## Hard rules
- **No code edits.** Test files are written in a separate task; this skill produces the plan only.
- **Cite real file paths.** When recommending a layer or pointing at bloat, name the file. Don't say "some e2e test" — say `tests/e2e/basic-functionality.spec.ts:42`.
- **Don't prescribe coverage percentages.** They're the wrong metric. Talk in behaviours and risk.
- **Don't recommend snapshot tests.** Unless the snapshot is over a structural invariant (ARIA tree, token list) the user can defend.
- **One layer per behaviour.** If the surface-area or findings table lists the same behaviour at two layers, justify it in the Notes column or pick one.
- **Respect project-wide preferences in memory.** If memory says "no fallbacks / no deprecation comments," the audit must flag deprecation-comment cruft in tests as bloat.
- **Always fold answers into the main plan.** Orphaned answers in `## Open questions` while the surface-area / findings table is stale is the most common failure mode of this skill — guard against it.
- **Never hide questions from the user by asking them only in chat.** If it's a decision that shapes the plan, it goes in the file with options + a notes slot. Chat is for terse status updates between cycles.
- **Watch for project-wide feedback while iterating.** If the user shares something that clearly applies beyond this plan ("I prefer X over Y everywhere"), save it to auto-memory in the same cycle it's folded into the plan — don't wait for a separate invitation.
- **Keep update messages terse.** The plan file is the durable artefact. Chat messages between cycles should be ≤120 words and never repeat content that's already in the file.
- **Ask before guessing.** If an answer is ambiguous, raise it as a new question in the next cycle rather than picking unilaterally. If the decision is tiny and blocks progress, flag the unilateral pick explicitly in the chat summary so the user can override.
- **Defer to smaller models for routine reads.** When you need to read a known file,
grep for a specific symbol, or fetch a single doc page, do it directly with `Read` /
`Grep` / `WebFetch`. But anything that fans out — exploring an unfamiliar subsystem,
finding all call-sites of a symbol, summarising a long doc, comparing multiple files —
should be delegated to a subagent with `model: "haiku"` (or `"sonnet"` if the task
needs reasoning). Reserve the top-level Opus session for the synthesis work that
actually needs it: judging test layering, weighing tradeoffs, writing the final plan.
Don't pay Opus rates for grep.
- **Every Mode A / Mode B plan has a Model routing table.** A plan whose `## Model routing` section is unfilled isn't
signed off, even if every other section is complete. Use the template in "Model routing table (mandatory in every
Mode A / Mode B plan)" above. The sign-off gate enforces this.
## Reference: suggested test layout
For a new project, mirror this shape:
```
<project>/
  tests/
    unit/                           # pure logic, validators, hooks (mocked deps)
      *.test.ts
    integration/                    # route handlers + component renders
      route-handlers/
        *.test.ts
      components/
        *.test.tsx
    e2e/                            # browser-driven, cross-page only
      *.spec.ts
      global-setup.ts
      helpers/
  <unit-runner>.config.ts           # unit
  <integration-runner>.config.ts    # integration (separate so CI can split)
  <e2e-runner>.config.ts
  test-setup.ts
  test-setup-components.ts          # shared render helpers
```
If a project deviates, that deviation is a finding in the audit unless there's a documented reason in the project's
conventions doc.
Two modes. "What should I test for this feature" before I write the tests, or "audit the existing suite" when CI starts feeling sluggish.
Plan tests for FD-001.
Output goes to .plans/testing-pyramid/TP-001 - campervan - location-search.md. It reads the FD, lists every behaviour the feature introduces in a table, and assigns each one to a layer: unit, integration, or e2e. The heuristic is brutally simple: if a test would pass with mockResolvedValue, it doesn't belong in Playwright.
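If it helps to see the layer assignment as code, here's a rough sketch. The `Behaviour` shape and its two flags are my own invention for illustration, not what the skill actually writes into the plan file. The order of the checks is the point: e2e only when a real browser is genuinely required, integration for route handlers and component renders, unit for everything else.

```typescript
type Layer = "unit" | "integration" | "e2e";

// Hypothetical shape for one row of the behaviour table.
// Field names are assumptions made for this sketch.
interface Behaviour {
  description: string;
  // True only if the behaviour can't be verified without a real browser
  // (cross-page flows, for example).
  needsRealBrowser: boolean;
  // True if verifying it means exercising a route handler or rendering a component.
  touchesHandlerOrRender: boolean;
}

// Cheapest layer that can actually verify the behaviour.
function assignLayer(b: Behaviour): Layer {
  if (b.needsRealBrowser) return "e2e";
  if (b.touchesHandlerOrRender) return "integration";
  // Anything that would pass against a mocked dependency lands here.
  return "unit";
}

// Made-up rows, only to show the mapping.
const rows: Behaviour[] = [
  { description: "validates a search query", needsRealBrowser: false, touchesHandlerOrRender: false },
  { description: "search route returns results", needsRealBrowser: false, touchesHandlerOrRender: true },
  { description: "search then open a listing page", needsRealBrowser: true, touchesHandlerOrRender: true },
];

const plan = rows.map((b) => ({ behaviour: b.description, layer: assignLayer(b) }));
console.log(plan);
```

In practice the judgement is in setting those flags honestly, which is the part the agent and I argue about in the plan file.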
Audit mode has saved my CI bill more than once. Looking at a suite and going "eighteen unit tests, zero integration tests, twelve e2e tests" is the kind of feedback I'd never give myself, because I'm the one who wrote the suite.
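The counting half of an audit is mechanical enough to sketch. This assumes the suggested `tests/unit`, `tests/integration`, `tests/e2e` layout and counts files rather than individual tests. The single finding rule at the bottom is an assumption of mine for illustration; the real skill makes that call with a lot more context.

```typescript
type Layer = "unit" | "integration" | "e2e";

// Tally test files per layer, based purely on their path.
// Files outside tests/<layer>/ are ignored.
function countByLayer(paths: string[]): Record<Layer, number> {
  const counts: Record<Layer, number> = { unit: 0, integration: 0, e2e: 0 };
  for (const p of paths) {
    const m = /(?:^|\/)tests\/(unit|integration|e2e)\//.exec(p);
    if (m) counts[m[1] as Layer] += 1;
  }
  return counts;
}

// Hypothetical rule: e2e tests with no integration layer underneath them
// suggests the browser suite is doing work a cheaper layer could do.
function shapeFindings(counts: Record<Layer, number>): string[] {
  const findings: string[] = [];
  if (counts.integration === 0 && counts.e2e > 0) {
    findings.push("e2e tests exist but the integration layer is empty");
  }
  return findings;
}

// Made-up file list, only to exercise the functions.
const counts = countByLayer([
  "tests/unit/validators.test.ts",
  "tests/unit/hooks.test.ts",
  "tests/e2e/search.spec.ts",
  "src/index.ts",
]);
console.log(counts, shapeFindings(counts));
```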
This skill is how I keep the cheap, fast, computational sensors doing the heavy lifting, and reserve the expensive end-to-end tests for behaviour you can only verify in a real browser. Quality left, again.
A note on model routing
A wrinkle that's crept into the latest versions of these skills. Each Implementation Plan phase now picks its model deliberately, with a small routing table baked into the spec. Haiku for mechanical work like grepping, file moves, pasting known content. Sonnet for bounded judgement where the plan already prescribes the answer. Opus for design, risky migrations, and synthesis that needs to hold the whole picture in its head. Subagents get spawned with the cheapest viable model, and the top-level Opus session is reserved for the bits that actually need it.
Quality left applies to the bill, not just the bug count. If the spec is tight enough that Sonnet can't get it wrong, paying Opus rates is just lighting money on fire.
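As a data structure, the routing table is tiny. A minimal sketch, assuming my own names for the work categories (`WorkKind` and `Phase` are hypothetical, and the real table lives in the plan's markdown, not in code):

```typescript
type Model = "haiku" | "sonnet" | "opus";

// Categories taken from the description above; the identifiers are assumptions.
type WorkKind =
  | "mechanical"         // grepping, file moves, pasting known content
  | "bounded-judgement"  // the plan already prescribes the answer
  | "design"
  | "risky-migration"
  | "synthesis";         // needs the whole picture held at once

const ROUTING: Record<WorkKind, Model> = {
  "mechanical": "haiku",
  "bounded-judgement": "sonnet",
  "design": "opus",
  "risky-migration": "opus",
  "synthesis": "opus",
};

// Hypothetical shape for one phase of an Implementation Plan.
interface Phase {
  name: string;
  kind: WorkKind;
}

// Cheapest viable model for a phase, per the table.
function routePhase(phase: Phase): Model {
  return ROUTING[phase.kind];
}

console.log(routePhase({ name: "move files into the new folder", kind: "mechanical" }));
```

Typing the table as `Record<WorkKind, Model>` means the compiler complains if a category is left unrouted, which is the same idea as refusing to sign off a plan with an unfilled routing section.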
Wrapping up
Four skills. One pattern. Markdown files in .plans/, in-file checkbox questions, fold-in loops, agents that stop and wait instead of running off and writing 400 lines I didn't ask for.
Two of them (product-requirements, feature-development) are guides. Two of them (code-review, testing-pyramid) are sensors, one inferential and one orchestrating the computational ones. Together they're a small, opinionated harness. The repo itself does the rest. Path aliases, a typed schema, a tight ESLint config, a shared component library. Böckeler calls that harnessability, the structural properties of a codebase that make it controllable. None of this works on a soup of any types and circular imports. The harness lives on top of a codebase that's been kept tidy on purpose.
Is this overkill for a weekend project? Probably. Do it anyway. The discipline of writing the context down, even briefly, is what makes the agent useful instead of dangerous.
The challenge-carrot-payoff loop. It didn't go anywhere. It moved. The slog isn't in the code now, it's in the thinking. Forcing yourself to answer the questions you'd otherwise hand-wave past. Reconciling the contradictions in your own head before they become contradictions in the spec. The win at the end isn't getting the function to compile. It's reading back a document so tight that the agent ships exactly what you intended, the first time. Different shape. Same payoff.
The room, not the words in it.