From 51e01233a426cc8204bd9553df99969fb33e251e Mon Sep 17 00:00:00 2001 From: Mathias Bergqvist Date: Mon, 4 May 2026 14:29:26 +0200 Subject: [PATCH] docs(plan6): spec for Mode 2 routing pod Drafted via superpowers:feature-spec. Plan 6 of 7 in the skill migration. Surface frozen at 4 cost-routable skills (code_review, debug, retrospective, trainer); LiteLLM proxies model choice; pass- rate drives the route decision with default-to-local plus an env kill switch for the empty-data window. Plan 7 amendment baked in: internal/skills/{review,debug,retrospective,trainer} survive Plan 6. Co-Authored-By: Claude Opus 4.7 (1M context) --- .../2026-05-04-mode-2-routing-pod-design.md | 240 ++++++++++++++++++ 1 file changed, 240 insertions(+) create mode 100644 docs/superpowers/specs/2026-05-04-mode-2-routing-pod-design.md diff --git a/docs/superpowers/specs/2026-05-04-mode-2-routing-pod-design.md b/docs/superpowers/specs/2026-05-04-mode-2-routing-pod-design.md new file mode 100644 index 0000000..7fc0b02 --- /dev/null +++ b/docs/superpowers/specs/2026-05-04-mode-2-routing-pod-design.md @@ -0,0 +1,240 @@ +# Spec: Mode 2 routing pod + +> Plan 6 of 7 — Hyperguild Skill Migration. Loaded after `feature-spec` skill. + +## Problem Statement + +Mode 2 (`client-local`) is the cost-and-sovereignty mode for paid client work — keep skill calls inside Tailscale, save tokens, but stay reliable. Plans 1–5 produced everything Mode 2 needs except the consumer: the brain MCP at `:30330` is live, four skills are instrumented to log `pass | fail | skip`, and `GET /pass-rate?skill=X&window=Y` returns honest numbers (or `null` when there is no data). What is still missing is the policy layer that reads pass-rate and acts on it. + +The supervisor pod (`:30320`) historically hosted full skill workers (`tdd_red/green/refactor`, `code_review`, `debug`, `spec`, `retrospective`, `trainer`, `tier`) but with no routing — every call ran local regardless of skill quality, and Claude Code in client-local mode silently lost access to Claude-quality work even when local was wrong. That's the regression Plan 6 fixes. + +**Why now:** the supervisor pod is scheduled for retirement (Plan 7) and the data plumbing for routing decisions exists but has no consumer. Without Plan 6, Plan 7 cannot land. + +## Success Criteria + +- [ ] A new pod `routing` is deployed via Flux at NodePort `:30310`, alongside (not replacing) the supervisor and ingestion pods. Image built by gitea CI, deployment manifest under `infra/k3s/apps/routing/`. `kubectl -n routing get deployment` shows `1/1 Ready`. +- [ ] `POST http://koala:30310/mcp` responds to `tools/list` with exactly four tools: `code_review`, `debug`, `retrospective`, `trainer`. Each tool's name + JSON schema is byte-identical to the supervisor's current advertisement (verified by snapshot test). +- [ ] Bearer-token auth via env var `ROUTING_MCP_TOKEN` (same opt-in pattern as `SUPERVISOR_MCP_TOKEN` shipped in `f49850d`). Empty token = no auth; populated token = `Authorization: Bearer ` required, otherwise HTTP 401 + JSON-RPC `-32001`. +- [ ] On every tool call, the pod queries `${BRAIN_URL}/pass-rate?skill=&window=7d` and applies a configurable policy: + - `pass_rate == null` → route to local (default-to-local) + - `pass_rate ≥ HYPERGUILD_ROUTE_LOCAL_FLOOR` (default `0.90`) → route to local + - `HYPERGUILD_ROUTE_LOCAL_CEIL ≤ pass_rate < FLOOR` (CEIL default `0.70`) → 50/50 deterministic sample (hash of canonical request body) + - `pass_rate < CEIL` → route to Claude +- [ ] Both routes resolve to a LiteLLM call: local route uses `HYPERGUILD_LOCAL_MODEL` (default `qwen35`), Claude route uses `HYPERGUILD_CLAUDE_MODEL` (default `claude-sonnet-4-6`). LiteLLM at `${LITELLM_BASE_URL}` (default `http://piguard:4000`) handles provider routing. The routing pod has no direct Anthropic SDK. +- [ ] Every routing decision is logged via `session_log` to the brain pod with `{skill: "_routing", phase: "decide", final_status: "skip", message: ": ", duration_ms, project_root}`. `final_status: "skip"` keeps these entries out of any skill's pass-rate aggregation. +- [ ] LiteLLM unreachable → fail open to a Claude decision *and* log `final_status: "fail"` for `_routing`. The pod must still serve requests even if LiteLLM is down for hours. +- [ ] `cmd/hyperguild/mode.go` updated: `mode client-local` writes the routing entry with `"headers": {"X-Hyperguild-Mode": "client-local"}` and the `_routing_pending` placeholder field is removed. The pod accepts but does not branch on the header (forward-compat only). +- [ ] `task check` (lint + test + vet + drift + govulncheck) passes on each task and on the merged branch. The CI gate that bit Plan 1 must not bite Plan 6 (per `feedback_per_task_verification` memory). +- [ ] A new `task smoke:routing` target boots the binary against the live LiteLLM at `piguard:4000` and the live `/pass-rate` at `koala:30330`, calls each of the four advertised tools once, and verifies a `_routing` entry appears in the brain via `GET /pass-rate?skill=_routing&window=1h`. This is the live-contract test (per `2026-05-03-fake-tests-vs-real-contract` brain entry); fake-server unit tests verify policy logic, the smoke step verifies the contract. +- [ ] Mode 1 (`cloud`) and Mode 3 (`sovereign`) are byte-identically unchanged. Verified by `git diff` showing no changes to `mode.go`'s `modeCloud` or `modeSovereign` functions. + +## Constraints + +- **Stdlib + existing deps only.** The routing pod reuses `internal/exec/litellm.go` (`NewLiteLLM`, `Complete`), `internal/registry`, and `internal/skills/{review,debug,retrospective,trainer}/`. No new third-party dependency. Auth code may be duplicated from `internal/mcp/server.go` or extracted to a shared helper — implementer's call. +- **No new persistence.** Pass-rate data lives in the brain pod's session JSONL files (Plan 5). Routing-decision logs land in the same place via `session_log`. Routing pod has no DB, no cache, no on-disk state beyond an optional in-memory pass-rate cache (TTL = 60 seconds — protects the brain from per-call hammering during an active session). +- **MCP wire format identical to supervisor's.** Tools have the same names and JSON schemas as today. A consumer switches modes by changing only the URL in `.mcp.json` — no schema-level differences. Snapshot tests pin this. +- **Pod must start and serve degraded.** If LiteLLM is down at startup, the pod still binds to `:3210`, advertises tools, and serves requests with the fail-open-to-Claude behavior described in success criteria. +- **`internal/skills/{review,debug,retrospective,trainer}/` survives Plan 6.** Plan 7's note about deleting them is amended: those four packages are reused by the routing pod and must NOT be deleted in Plan 7. Plan 7 deletes only `internal/skills/{tdd,spec}/`, the supervisor binary, the supervisor manifests, and frees NodePort `:30320`. This spec calls out the change so Plan 7's author doesn't delete needed code (per `2026-05-03-implicit-cleanup-third-category` brain entry). +- **No retries beyond fail-open.** A LiteLLM call that errors becomes a Claude decision and a `final_status: "fail"` log. No exponential backoff, no circuit breaker — that's policy for a future plan once the failure shape is observed. +- **Determinism in sampling.** When pass-rate is in the sample band (`CEIL ≤ pr < FLOOR`), the local-vs-Claude choice for a given request is reproducible: hash a canonical JSON of the request body, low bit picks local. Same input → same decision. Avoids per-call variance confusing the operator during a debugging session. + +## Out of Scope + +- **Plan 7 (supervisor retirement).** Separate plan, executed after Plan 6 stabilizes. Plan 6 leaves the supervisor pod running; nothing about supervisor changes in this plan. +- **Routing for `tdd_red/green/refactor`, `spec`, `tier`.** Per `project_per_skill_routing.md`, these are SKILL.md or CLI, not routing-pod tools. They never appear in the routing pod's `tools/list`. If a future plan changes that decision, it adds them then. +- **Routing for `brain_ingest`.** Already routed at the brain pod (Plan 1). No change. +- **Per-mode policy branching.** The pod accepts `X-Hyperguild-Mode` for forward-compat but treats absent or unknown values as `client-local`. No code path differs on the header value yet. +- **OAuth, IP allowlisting, rate limiting, audit logging.** Bearer-token only; same risk model as the supervisor MCP after `f49850d`. +- **Decision-log read endpoints.** Routing decisions land in the brain via `session_log`. Reads happen via the existing `GET /pass-rate` endpoint and JSONL inspection. No new read API. +- **Materialized routing-decision aggregates.** Out of scope for the same reason Plan 5 deferred materialized counters: on-demand scans are fast enough at current data volumes. +- **Tunable per-skill thresholds.** `FLOOR` and `CEIL` are global. If the operator decides `debug` needs a different floor than `code_review`, that's a follow-up plan with real data behind the choice. +- **Sampling beyond a 50/50 hash split.** No epsilon-decay schedules, no Thompson sampling, no per-skill exploration policies. Add when data justifies. +- **Migration of any existing supervisor-skill `.mcp.json` registrations.** Consumers update their `.mcp.json` (via `hyperguild mode client-local`) when they want Mode 2 behavior. No silent redirect. +- **Routing-pod-side prompt customization.** The four skill packages already own their prompts; the routing pod just calls into them via the existing `Skill` interface. Prompt edits remain a SKILL.md or `internal/skills//` concern. + +## Technical Approach + +### A. Binary layout: `cmd/routing/` + +A new Go binary at `cmd/routing/main.go`. Stdlib + `internal/*`. Wires: +1. Config from env (typed struct in `internal/config/routing.go` — peer to `Config` for the supervisor; deliberately a separate type because the surfaces are different and merging would force every routing-pod field onto the supervisor and vice versa). +2. `internal/exec/litellm.NewLiteLLM(...)` — same client the supervisor uses. +3. `internal/skills/{review,debug,retrospective,trainer}.New(...)` constructors, each receiving a `CompleteFunc` that wraps the routing decision (see C below). +4. `internal/registry.New()` populated with the four skills. +5. `internal/mcp.NewServer(reg, cfg.MCPAuthToken)` — reuse the existing handler with bearer auth from `f49850d`. The handler is generic; nothing in it is supervisor-specific. + +**Rationale:** the supervisor's runtime is already 80% of what the routing pod needs. Reusing it saves the routing pod from re-implementing skill dispatch, MCP protocol handling, and bearer auth. The only new code is the routing decision itself (C below) and the deployment manifests (G). + +### B. Configuration via env + +Typed struct, parsed at startup. New env vars introduced by Plan 6: + +| Env var | Default | Purpose | +|---|---|---| +| `ROUTING_PORT` | `3210` | Pod's HTTP port (NodePort `:30310` maps to this) | +| `ROUTING_MCP_TOKEN` | — | Bearer token, opt-in (empty = no auth) | +| `LITELLM_BASE_URL` | `http://piguard:4000` | LiteLLM proxy (reused) | +| `LITELLM_API_KEY` | — | Reused, sourced from `routing-secrets` Secret | +| `BRAIN_URL` | `http://ingestion.supervisor:3300` | In-cluster brain pod for `/pass-rate` and `session_log` | +| `HYPERGUILD_LOCAL_MODEL` | `qwen35` | Model name passed to LiteLLM for the local decision | +| `HYPERGUILD_CLAUDE_MODEL` | `claude-sonnet-4-6` | Model name for the Claude decision | +| `HYPERGUILD_ROUTE_LOCAL_FLOOR` | `0.90` | At/above this pass-rate, always local | +| `HYPERGUILD_ROUTE_LOCAL_CEIL` | `0.70` | Below this, always Claude. Between CEIL and FLOOR is the sample band. | +| `HYPERGUILD_PASS_RATE_TTL_SECONDS` | `60` | Per-skill in-memory cache TTL | + +**Rationale:** every value an operator might want to tune is an env var, not a hardcoded constant. Defaults are the recommendations from the kickoff and the per-skill-routing memory; sensible cluster values flow in via the Flux-managed Secret. No config file to manage. + +### C. Decision policy (`internal/routing/policy.go`) + +Pure function, no I/O: + +```go +type Decision int +const ( + DecideLocal Decision = iota + DecideClaude +) + +type Policy struct{ Floor, Ceil float64 } + +// Decide returns the routing decision. passRate may be nil when the brain has no data. +// requestHash is a deterministic 64-bit hash of the canonical request body — used only +// when passRate is in the sample band; same hash → same decision. +func (p Policy) Decide(passRate *float64, requestHash uint64) Decision { ... } +``` + +Rules (in order): +1. `passRate == nil` → `DecideLocal` (default-to-local) +2. `*passRate >= p.Floor` → `DecideLocal` +3. `*passRate < p.Ceil` → `DecideClaude` +4. Otherwise (sample band) → `requestHash & 1` picks local on `0`, claude on `1` + +**Rationale:** no I/O in the policy means the function is trivially testable (table-driven, no fixtures, no servers). Network calls happen in a wrapping layer that calls `Decide` — same separation as `internal/skills/*/skill.go` keeps prompt strings separate from `Complete` calls. Default-to-local rule is justified in `project_per_skill_routing.md`: the four advertised skills are exactly the skills marked "MCP→local" in that target architecture. + +### D. Pass-rate fetcher (`internal/routing/passrate.go`) + +```go +type Fetcher struct { + BaseURL string + HTTPClient *http.Client // 1s timeout + Cache *ttlCache // map[string]*float64 with 60s TTL, struct-internal +} + +func (f *Fetcher) Get(ctx context.Context, skill string) (*float64, error) +``` + +Calls `GET ${BaseURL}/pass-rate?skill=&window=7d`. On success, caches the parsed `pass_rate` (which may be `null`) for `HYPERGUILD_PASS_RATE_TTL_SECONDS`. On error, returns `(nil, err)`; the dispatch wrapper treats this as `*passRate == nil` and routes to local (the default-to-local fallback also covers brain-pod-down). + +**Rationale:** GET is correct REST per `2026-05-03-rest-semantics-vs-precedent` (this is a pure read with query params; it shouldn't follow the legacy POST-everywhere precedent). Cache TTL of 60s prevents per-call hammering during a tight Claude Code loop while staying fresh enough that a flapping pass-rate visibly affects routing within a minute. No persistence — restart loses cache, that's fine. + +### E. Dispatch wrapper + +The four skills are constructed with their existing `CompleteFunc` signature (`(ctx, model, system, user) (string, int64, error)`). The routing pod wraps it: + +```go +func (r *Router) Complete(ctx context.Context, skill, model, system, user string) (string, int64, error) { + pr, _ := r.fetcher.Get(ctx, skill) + decision := r.policy.Decide(pr, hashCanonical(system, user)) + chosenModel := r.cfg.ClaudeModel + if decision == DecideLocal { + chosenModel = r.cfg.LocalModel + } + out, ms, err := r.litellm.Complete(ctx, chosenModel, system, user) + r.logDecision(skill, decision, err, ms) + if err != nil { + // fail open: try Claude once if we routed local; if Claude also fails, return error. + if decision == DecideLocal { + chosenModel = r.cfg.ClaudeModel + out, ms, err = r.litellm.Complete(ctx, chosenModel, system, user) + r.logDecision(skill, DecideClaude, err, ms) // second log entry, marked fail if still erroring + } + return out, ms, err + } + return out, ms, nil +} +``` + +The skill packages don't know about routing — they receive a `CompleteFunc` and call it. The wrapper substitutes routing logic at construction time. + +**Rationale:** keeps the skill packages oblivious to mode. Same `internal/skills/review/` works under the supervisor (no routing) and under the routing pod (routed) without any conditional logic in the skill itself. Plan 7's deletion of the supervisor leaves the skills' shape intact for the routing pod. + +### F. Decision logging (`internal/routing/log.go`) + +After every decision, POST a session log entry to `${BRAIN_URL}/write` (the brain pod's existing endpoint, which appends to `brain/sessions/.jsonl`). Entry shape: + +```json +{ + "skill": "_routing", + "phase": "decide", + "final_status": "skip", + "message": ": (pass_rate=, model=)", + "duration_ms": , + "project_root": "", + "timestamp": "", + "session_id": "" +} +``` + +`final_status: "skip"` keeps these entries out of any real skill's pass-rate aggregation (Plan 5's aggregator counts only `pass`/`fail`). Operators can still query `GET /pass-rate?skill=_routing&window=7d` for routing-failure visibility (when LiteLLM down → `final_status: "fail"` in the second log entry). + +**Rationale:** closes the observability loop without adding a new endpoint or schema. `_routing` namespaces routing entries away from skill names. `skip` is the only honest classification — routing isn't itself a pass/fail event in the skill sense. + +### G. Deployment + +New manifest directory `infra/k3s/apps/routing/` mirroring `infra/k3s/apps/supervisor/`'s shape: + +- `namespace.yaml` — namespace `routing` (peer to `supervisor`) +- `deployment.yaml` — single replica, nodeSelector koala, image from gitea registry, `envFrom: secretRef: routing-secrets` +- `service.yaml` — ClusterIP on port 3210 +- `nodeport.yaml` — NodePort 30310 → service 3210 +- `secrets.enc.yaml` — SOPS-encrypted, contains `LITELLM_API_KEY` and (optionally) `ROUTING_MCP_TOKEN` +- `kustomization.yaml` — bundles the above + +The supervisor pod's CI image build pattern (gitea Actions → `gitea.d-ma.be/mathias/supervisor:`) is replicated for `gitea.d-ma.be/mathias/routing:`. Flux's existing image-automation will bump the manifest's image tag on each push. + +**Rationale:** copying the supervisor pod's manifest shape (rather than designing from scratch) is the YAGNI move. Flux + image automation already proven on supervisor; same pattern, same operator mental model. Mode 2 setup is now a Flux change, not a one-off `kubectl` ritual. + +### H. Live smoke test + +`task smoke:routing` (in the project Taskfile) does: +1. Boot the binary locally with `LITELLM_BASE_URL=http://piguard:4000` and `BRAIN_URL=http://koala:30330`. Bind to a random localhost port (so it doesn't conflict with anything else). +2. Send `tools/list` and assert four tool names. +3. For each tool, send a minimal valid `tools/call`. Don't assert on response content — assert response shape (no error, has content). +4. After all four calls, query `GET http://koala:30330/pass-rate?skill=_routing&window=1h` and assert `total >= 4`. +5. Tear down. + +Skipped automatically when LiteLLM is unreachable or when run outside Tailscale (tier 3) — emits a `SKIP` line and exits 0. `task check` does NOT include `task smoke:routing` (CI runner doesn't have Tailscale); operator runs it manually before bumping production. + +**Rationale:** unit tests with `httptest.Server` fakes verify the policy and the dispatch wrapper logic. The smoke test is the only thing that will catch a contract drift between the routing pod's `Complete` calls and the actual LiteLLM API, or a schema drift between `/pass-rate` and what the fetcher expects (per `2026-05-03-fake-tests-vs-real-contract`). + +### I. Mode-template update (`cmd/hyperguild/mode.go`) + +`modeClientLocal` is amended: +- The `routing` entry's `url` stays at `http://koala:30310/mcp`. +- A new key `headers` is added with `{"X-Hyperguild-Mode": "client-local"}`. +- The placeholder `_routing_pending` field is **removed**, since the routing pod now exists. + +Tests in `cmd/hyperguild/mode_test.go` are updated to assert the new structure. README in `cmd/hyperguild/README.md` updated to drop the "not deployed yet" note. + +**Rationale:** Plan 4 deliberately scaffolded the placeholder for Plan 6 to fill in. This is the fill-in. Removing `_routing_pending` is the implicit cleanup the kickoff anticipates — making it explicit in the spec avoids a Plan-completeness gap (per `2026-05-03-implicit-cleanup-third-category`). + +## Risks + +- **Empty pass-rate window in the first weeks.** Plans 3–5 merged on 2026-05-03; usage data has not accumulated. With default-to-local active for all four routed skills, the first weeks of Mode 2 = "everything goes local." If local quality is rough on `code_review` or `debug`, the operator's first impression of Mode 2 is bad, and confidence in Plan 6 erodes before data lands. **Mitigation:** the FLOOR / CEIL are env-tunable. If local quality is unworkable in the first week, set `HYPERGUILD_ROUTE_LOCAL_FLOOR=2.0` (impossible threshold) and the pod becomes default-to-Claude with no code change. This is a deliberate kill switch for the early window. + +- **LiteLLM-as-single-dependency.** The routing pod has exactly one upstream LLM provider: `piguard:4000`. If LiteLLM is misconfigured (wrong model name routed to wrong provider, expired Anthropic key in LiteLLM's config), every routing-pod call returns garbage. **Mitigation:** the smoke test catches gross misconfig before deploy; once deployed, LiteLLM's own `/health` endpoint is the canary (the pod doesn't probe it — operator monitors LiteLLM separately). If a deeper failure mode emerges, add a routing-pod liveness probe in a follow-up. + +- **Skill-schema drift.** The routing pod's `tools/list` is asserted byte-identical to the supervisor's via snapshot test. If someone evolves the supervisor's schemas between Plan 6 merge and Plan 7 (a long window), the snapshot drifts. **Mitigation:** the spec documents that Plan 6 freezes the schemas; supervisor edits to skill schemas are out of scope until Plan 7 deletes the supervisor. This is a soft constraint enforced by the spec, not by code. If the supervisor genuinely needs a schema change before Plan 7, that's a separate plan. + +- **Flux drift on `kubectl rollout restart`.** Demonstrated during the bearer-auth rollout earlier today: Flux server-side-applies the deployment every 30s and strips the `kubectl.kubernetes.io/restartedAt` annotation, which deletes the new ReplicaSet's pod. **Mitigation:** the Plan 6 implementer prompt and the README note that `kubectl delete pod -l app=routing` is the correct way to force a restart on Flux-managed deployments — the existing ReplicaSet recreates without an annotation Flux can revert. (This finding is worth a brain entry; capture in retrospective.) + +- **Mode header not forwarded by Claude Code.** Plan 6 assumes Claude Code propagates `headers` from `.mcp.json`. The bearer-auth rollout proved this works for `Authorization`. The same path should work for `X-Hyperguild-Mode`. **Mitigation:** the pod treats absent header as `client-local` (the only mode that registers the pod). If forwarding silently breaks, behavior is identical — header is forward-compat only. + +- **Sample-band hash collision producing skewed routing.** Hash inputs are `(system, user)` strings. If skill prompts produce highly similar bodies (debug bug A vs debug bug B with similar wording), low-bit hash distribution might cluster on one side. **Mitigation:** at the volumes Plan 6 expects (single operator, ~10s of routed calls/hour at peak), bias is statistically invisible. If volume ever rises, swap `hash & 1` for a stronger split. Not the first failure mode worth pre-engineering. + +## Cross-references + +- Spec for Plan 5 (consumer of `/pass-rate`): `docs/superpowers/specs/2026-05-03-pass-rate-logging-design.md` +- Spec for Plan 4 (which scaffolded the `:30310` placeholder): `docs/superpowers/specs/2026-05-03-hyperguild-cli-design.md` +- Auto-memory entries `project_three_modes`, `project_skill_migration_plans`, `project_per_skill_routing`, `feedback_per_task_verification`, `feedback_sudo` +- Brain entries `2026-05-03-rest-semantics-vs-precedent`, `2026-05-03-aggregator-normalization-backwards-compat`, `2026-05-03-fake-tests-vs-real-contract`, `2026-05-03-implicit-cleanup-third-category`, `2026-05-03-code-reviewer-output-as-candidates`, `2026-05-03-done-with-concerns-vs-blocked`, `2026-05-03-verification-depth-formula`, `2026-05-03-plan-canonical-dispatch-ephemeral`