# Spec: Pass-rate logging > Plan 5 of 7 — Hyperguild Skill Migration. Loaded after `feature-spec` skill. ## Problem Statement Plan 6 (Mode 2 routing pod) needs a per-skill signal to decide whether to route a call to the local model or keep it on Claude. The natural signal is recent pass rate: a skill that succeeds 95% of the time on local is safe to route; a skill that succeeds 60% is not. Today there is no such signal — the `session_log` MCP exists (shipped in Plan 1) but skills don't reliably call it, and no endpoint computes pass rate from the resulting logs. Two consequences: 1. **Plan 6 cannot be trusted without baseline data.** Routing decisions made on guesses will produce regressions that erode confidence in Mode 2 entirely. 2. **The skill library has no observability.** When a skill regresses (model swap, prompt drift, environment change), there's no way to notice until a downstream task explicitly fails. **Why now:** Plans 1–4 are merged. Plan 5 instruments the discipline that Plan 6 will consume. Several weeks of usage data between Plan 5 merge and Plan 6 deploy will mean Plan 6 lands on real numbers, not synthetic. ## Success Criteria - [ ] After Plan 5 merges, every invocation of `tdd` (pilot skill) calls `session_log` at the end of each phase (red, green, refactor) with `final_status` ∈ {pass, fail, skip}. - [ ] At least 6 of the remaining "binary-outcome" skills get the same treatment: `code-review`, `debug`, `feature-spec`, `session-retrospective`, `trainer`, `spec-driven-dev`. (Skills with no clear pass/fail — `clean-code`, `cognitive-load`, `solid`, `refactoring`, `test-design`, `problem-analysis`, `user-stories`, `planning`, `atdd`, `gitea-ci` — are out of scope.) - [ ] A new HTTP REST endpoint `GET /pass-rate?skill=X&window=7d` on the brain pod returns valid JSON `{skill, window, pass, fail, skip, total, pass_rate}` for any skill name. Skills with no logged invocations return zeros (not 404, not error). Pass rate is `pass / (pass + fail)`; if `pass + fail == 0`, returns `pass_rate: null`. - [ ] The endpoint's aggregator normalizes legacy values: `pass` ≡ `ok`, `fail` ≡ `error`, `skip` ≡ `skipped`. No data loss when scanning historical logs. - [ ] An optional CLI subcommand `hyperguild brain pass-rate [--window 7d] [--json]` calls the endpoint and prints either human-readable (`tdd: 47 / 50 = 94% (window: 7d)`) or JSON. - [ ] `task check` passes (lint + test + vet + drift + govulncheck) on each task and on the merged branch. - [ ] One week post-merge, `GET /pass-rate?skill=tdd&window=7d` returns non-zero counts and a real `pass_rate`. ## Constraints - **Stdlib + existing deps only.** The endpoint adds to the existing ingestion pod's HTTP handler (Go, `net/http`). No new service, no new pod, no new persistence layer. - **No auth on `/pass-rate`.** Same model as the rest of the brain HTTP REST API: Tailscale-only network, no token. - **Schema:** the SKILL.md template uses `pass | fail | skip` for `final_status`. The aggregator treats `pass` and `ok` as equivalent, `fail` and `error` as equivalent, `skip` and `skipped` as equivalent. New writes from skills MUST use the new vocabulary; the aggregator handles both for read-back. - **Storage:** continues to use the existing JSONL files at `/brain/sessions/*.jsonl`. No format change. No materialized aggregates. If on-demand scans become slow (>500ms p99), revisit in a follow-up; not now. - **Backwards compatibility:** the existing `session_log` MCP tool's signature does not change. Its docstring should be updated to reflect the new vocabulary, but argument types stay the same. - **Pilot-before-rollout:** the first SKILL.md instrumentation (`tdd`) must dogfood successfully — at least one real `tdd` invocation post-instrumentation produces a session log entry — before the other six skills get their updates. ## Out of Scope - Plan 6 routing pod itself (the consumer of `/pass-rate`). - Materialized rolling counters (compute on-demand for now). - Auth, rate limiting, or per-user filtering on `/pass-rate`. - Dashboards or visualization (`hyperguild brain pass-rate` text/JSON is the only UI). - Real-time streaming or push notifications (`/pass-rate` is poll-only). - Skills with no clear binary outcome (the 10 skills listed in Success Criteria). - Per-model or per-mode breakdown (`session_log` already records `model_used`; the endpoint aggregates across all models for now). Plan 6 may want sharper aggregation; we'll add fields when it lands. - Migration of the one historical entry in `2026-04-17-validate-hyperguild.jsonl` from `pass` (which is the new vocabulary, by accident) — no migration needed. ## Technical Approach ### Component A — SKILL.md instrumentation pattern Each instrumented skill gets a standardized "Logging" subsection under its existing "Brain MCP Integration" section. The subsection names the required `session_log` fields with explicit copy-paste examples: ``` **At each phase end:** call `session_log` with: - `skill`: "" - `phase`: "" - `final_status`: "pass" | "fail" | "skip" - `message`: "" - `duration_ms`: - `project_root`: "" ``` The pilot SKILL.md (`~/dev/.skills/tdd/SKILL.md`) gets instrumented first. The implementation defines the contract; the rollout commits replicate the pattern across the other six SKILL.md files. Rationale: SKILL.md as the source of truth means the contract is visible to every agent that loads the skill — no hidden middleware. Mode-agnostic: the agent calls `session_log` whether it's Claude (Mode 1), Claude+routing (Mode 2), or Crush (Mode 3). The pattern is uniform; only the skill name + phase set differ. ### Component B — `/pass-rate` HTTP endpoint New handler at the existing ingestion pod, peer to `/query`, `/write`, `/ingest`, etc. ``` GET /pass-rate?skill=&window= → 200 { "skill": "tdd", "window": "7d", "pass": 47, "fail": 3, "skip": 0, "total": 50, "pass_rate": 0.94 } ``` Algorithm: 1. Parse `skill` (required) and `window` (default `7d`, accept Go-style `1h`, `12h`, `7d`, `30d`). 2. Walk `brain/sessions/*.jsonl` in the pod's volume. For each line: parse JSON, filter by `skill == query.skill` and `timestamp >= now - window`. 3. Tally `pass` (counts both `pass` and `ok`), `fail` (`fail` and `error`), `skip` (`skip` and `skipped`). 4. Compute `pass_rate = pass / (pass + fail)`; if `pass + fail == 0`, return `pass_rate: null`. 5. Return JSON. Rationale for on-demand: the JSONL files are append-only and small (one entry per skill phase, kilobytes per session at most). For the first months of Plan 5 usage, scanning all sessions for a single query is fast enough. If it ever isn't, a materialized index is a follow-up — the endpoint shape doesn't change. ### Component C — Optional CLI subcommand `hyperguild brain pass-rate [--window 7d] [--json]`. Adds a third nested verb under `brain` (sibling to `query` and `write`). Calls `GET /pass-rate?skill=<>&window=<>` via the existing `brainClient` infrastructure. Default human output: `tdd: 47 / 50 = 94% (window: 7d)`. `--json` passes through the response envelope. Rationale: shell access to pass-rate without curl + jq. Optional in the strict sense — Plan 6's routing pod will call the endpoint directly, not via the CLI — but cheap to add (one new method on `brainClient`, one new dispatch case in `runBrain`). ### Schema and normalization `session_log` JSONL line shape (unchanged today, codified by this plan): ```json { "session_id": "", "timestamp": "2026-05-03T20:30:00Z", "skill": "tdd", "phase": "red", "project_root": "/abs/path", "final_status": "pass", "duration_ms": 12345, "message": "Test written, function undefined, red confirmed." } ``` `final_status` values: - New writes (this plan onward): `pass | fail | skip` - Read aggregator accepts both new and legacy: `pass`/`ok` → pass, `fail`/`error` → fail, `skip`/`skipped` → skip - Anything else → counted as `skip` for safety (don't pollute pass/fail with malformed entries) ### Tests - Endpoint: table-driven tests with a temp `brain/sessions/` directory containing JSONL files spanning multiple skills, multiple statuses (both vocabularies), edge cases (empty file, malformed line, timestamp outside window, future timestamp). Tests run via `httptest.NewServer` against the real handler. - CLI: tests for `runBrainPassRate` against `httptest.Server` fake of `/pass-rate`. Human and `--json` output paths. - Pilot dogfood: after instrumenting `tdd/SKILL.md`, one real TDD task in this plan exercises the logging path. The corresponding session log entry verifies end-to-end. - `task check` per task. ## Risks - **Skills that don't reliably log produce missing data.** The aggregator returns zero counts for those, which Plan 6 may misread as "this skill always passes" or "this skill is broken". Mitigation: the endpoint returns `pass_rate: null` when `pass + fail == 0`, signalling "no data" distinct from "always passes". Plan 6 must check for null. - **Agents may forget to call `session_log` mid-skill.** No way to enforce in cloud Mode 1 — Claude may skip the call if instructions are unclear. Mitigation: SKILL.md template makes the call literal and copy-pasteable. After 1 week, if instrumentation rate is < 80% of expected calls, escalate; consider a wrapper at the routing-pod layer in Plan 6 as belt-and-suspenders. - **Schema drift between legacy `ok` and new `pass`.** Mitigation: the aggregator's normalization rule. Documented in the endpoint's response and in the `session_log` tool docstring update. - **`/pass-rate` walks all session files for each request.** With ~1 file per session and tens of sessions per week, this is microseconds today. At hundreds of files per day, may need a date-bounded directory layout. Mitigation: monitor; if scan time > 100ms p99, revisit. Not in this plan. - **The pilot may fail on the first dogfood.** If `tdd` instrumentation doesn't produce a log entry (e.g. agent didn't call `session_log`, JSON shape wrong, file permissions), the rollout to the other six skills is blocked until the pilot succeeds. Mitigation: explicit "pilot validates end-to-end" gate as the last step of Component A. - **Adding a third verb under `brain` slightly stretches the inline-router pattern.** Three verbs in a switch is still simple; if it grows to five, the CLI may want a per-verb registration map. Mitigation: deferred — three is fine.