hyperguild/docs/superpowers/plans/2026-05-04-mode-2-routing-pod.md
Mathias Bergqvist b6bcc93048
docs(plan6): implementation plan for Mode 2 routing pod
14 TDD-shaped tasks across two worktrees: hyperguild for code
(internal/routing package, cmd/routing binary, Dockerfile, CD
workflow, mode template, smoke test, docs) and infra for the
k3s manifests (deployment, service, nodeport, SOPS-encrypted
secret). Plan 7 amendment baked in: internal/skills/{review,
debug,retrospective,trainer} survive Plan 6 — Plan 7 only
deletes tdd, spec, and the supervisor binary.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 14:53:03 +02:00

# Mode 2 Routing Pod Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Ship a thin policy pod at `koala:30310` that routes the four cost-routable skill calls (`code_review`, `debug`, `retrospective`, `trainer`) to a LiteLLM-proxied local or Claude model based on per-skill pass rate. Replaces the unconditional supervisor-runs-locally behavior in client-local mode.
**Architecture:** New Go binary at `cmd/routing/`, reusing `internal/skills/{review,debug,retrospective,trainer}/`, `internal/exec/litellm.go`, `internal/registry`, and `internal/mcp` (bearer-auth handler from `f49850d`). A new `internal/routing` package adds (a) a pure-function decision policy, (b) a TTL-cached pass-rate fetcher, (c) a session-log decision logger, and (d) a router that wraps a `CompleteFunc` so the existing skill packages stay routing-oblivious. Deployed via Flux at NodePort `:30310` alongside the supervisor and ingestion pods.
**Tech Stack:** Go 1.26 stdlib (`net/http`, `crypto/sha256`, `encoding/json`, `time`, `sync`); existing `testify` for tests; SOPS-encrypted Secret in the infra repo; gitea CI buildctl→skopeo; Flux Kustomize reconciliation.
---
## Plan 6 of 7 — Hyperguild Skill Migration
Plans 1 to 5 are merged. Plan 6 is the substantive routing-pod plan; Plan 7 (supervisor retirement) follows.
**Spec:** `docs/superpowers/specs/2026-05-04-mode-2-routing-pod-design.md` (committed `51e0123`).
### Two worktrees
- **Hyperguild worktree:** `~/Documents/local-dev/AI/hyperguild/.worktrees/mode-2-routing-pod/` on branch `feat/mode-2-routing-pod`. Contains the Go code, Dockerfile addition, CD workflow update, mode-template update, README, and smoke test.
- **Infra worktree:** `~/Documents/local-dev/AI/infra/.worktrees/mode-2-routing-pod/` on branch `feat/routing-pod-manifests`. Contains the k3s manifests for the new pod plus the SOPS-encrypted Secret.
Each task's "Files" header names the worktree. Implementer subagents must `cd` into the named worktree before any read/edit/git operation. Plan paths describe the post-merge canonical state (per `2026-05-03-plan-canonical-dispatch-ephemeral` brain entry); dispatch prompts add the worktree translation.
### Verification convention
Per task, the implementer runs `task check` (lint + test + vet + drift + govulncheck), not just `go test ./...`. CI's lint gate caught a Plan-1 errcheck regression that local tests missed (per `feedback_per_task_verification` memory). Append `//nolint:errcheck` to any `fmt.Fprint*` call that writes to stdout/stderr and ignores its return value. Where the error from `resp.Body.Close()` is deliberately ignored, write `defer func() { _ = resp.Body.Close() }()` instead of a bare `defer resp.Body.Close()`.
### Status taxonomy for implementer subagents
- `DONE` — task completed, all checks green, verification commands ran clean.
- `DONE_WITH_CONCERNS` — task completed, but the implementer noticed a plan bug, an environmental anomaly, or related code that looks suspicious. Controller decides: doc-patch, follow-up commit, or accept and roll on (per `2026-05-03-done-with-concerns-vs-blocked` brain entry).
- `BLOCKED` — implementer cannot complete the assigned work. Controller re-dispatches with more context.
- `NEEDS_CONTEXT` — implementer needs information not in the dispatch (rare; usually a doc bug).
### Code-reviewer expectations
The reviewer agent surfaces candidate improvements; the controller filters. Per `2026-05-03-code-reviewer-output-as-candidates`, reject reviewer suggestions that add helpers for single-use sites, abstractions for hypothetical futures, or stylistic refactors that diverge from the plan's heredocs. Apply genuine bugs and security findings; defer the rest.
### Flux operational note
The auth rollout (commit `afe9a08` in infra) demonstrated that Flux server-side-applies the `routing` Deployment every ~30s and strips any `kubectl rollout restart` annotation, deleting the new ReplicaSet's pod. To force a pod restart on a Flux-managed deployment, use `kubectl -n <ns> delete pod -l app=<name>` — the existing ReplicaSet recreates without an annotation Flux can revert.
### Plan 7 amendment baked in
`internal/skills/{review,debug,retrospective,trainer}/` are reused by the routing pod and **must not be deleted in Plan 7**. Plan 7 deletes only `internal/skills/{tdd,spec}/`, the supervisor binary, the supervisor manifests, and frees NodePort `:30320`. The implementer of Plan 7 must read this paragraph and the matching note in the spec before deleting anything.
## File Structure
### Hyperguild worktree
| Path | Action | Responsibility |
|---|---|---|
| `internal/config/routing.go` | create | `RoutingConfig` typed struct, `LoadRouting()` env parser |
| `internal/config/routing_test.go` | create | Defaults + env-override tests |
| `internal/routing/policy.go` | create | `Decision` enum, `Policy.Decide(passRate, hash) Decision` |
| `internal/routing/policy_test.go` | create | Table-driven coverage of all four rules |
| `internal/routing/hash.go` | create | `CanonicalHash(system, user) uint64` (SHA-256 prefix) |
| `internal/routing/hash_test.go` | create | Determinism + low-bit distribution sanity |
| `internal/routing/passrate.go` | create | `Fetcher` with TTL cache, calls `GET /pass-rate` |
| `internal/routing/passrate_test.go` | create | `httptest.Server`; cache hit/miss, error path |
| `internal/routing/log.go` | create | `Logger.LogDecision(...)` posts to brain MCP `session_log` |
| `internal/routing/log_test.go` | create | `httptest.Server` capture + body shape assertion |
| `internal/routing/router.go` | create | `Router.Run(...)` wraps fetcher + policy + logger + LiteLLM |
| `internal/routing/router_test.go` | create | Mocked fetcher/logger/litellm; route + fail-open paths |
| `internal/routing/snapshot_test.go` | create | Asserts routing pod's `tools/list` byte-equals captured snapshot |
| `internal/routing/testdata/tools_list.snapshot.json` | create | Snapshot from current supervisor advertisement |
| `cmd/routing/main.go` | create | Wires Config → LiteLLM → Router → Skills → Registry → MCP server |
| `cmd/routing/main_test.go` | create | Integration test with fakes for LiteLLM + brain |
| `cmd/hyperguild/mode.go:74-87` | modify | `modeClientLocal` adds `headers: X-Hyperguild-Mode`, removes `_routing_pending` |
| `cmd/hyperguild/mode_test.go` | modify | Updated assertion for the new shape |
| `cmd/hyperguild/README.md` | modify | Drop "not deployed yet" note; document the header |
| `Dockerfile.routing` | create | Builds `cmd/routing`, bakes `config/`, runs as non-root, no claude CLI |
| `.gitea/workflows/cd.yml` | modify | Build + push routing image; sed `routing/deployment.yaml` in infra |
| `Taskfile.yml` | modify | Add `smoke:routing` task |
| `scripts/smoke-routing.sh` | create | Boots binary, hits each tool, asserts brain has `_routing` entries |
| `README.md` | modify | Mode 2 + new env vars + routing pod URL |
| `.context/PROJECT.md` | modify | Document `koala:30310/mcp` + the four routed skills |
### Infra worktree
| Path | Action | Responsibility |
|---|---|---|
| `k3s/apps/routing/namespace.yaml` | create | Namespace `routing` |
| `k3s/apps/routing/deployment.yaml` | create | One-replica Deployment, koala nodeSelector, image from gitea registry |
| `k3s/apps/routing/service.yaml` | create | ClusterIP `routing` on port 3210 |
| `k3s/apps/routing/nodeport.yaml` | create | NodePort 30310 → service 3210 |
| `k3s/apps/routing/secrets.enc.yaml` | create | SOPS-encrypted `LITELLM_API_KEY` + optional `ROUTING_MCP_TOKEN` |
| `k3s/apps/routing/kustomization.yaml` | create | Bundles the above |
| `k3s/apps/kustomization.yaml` | modify | Add `routing` to the apps list |
---
## Task 1: `RoutingConfig` struct + env parser
**Worktree:** hyperguild
Typed config struct for the routing pod. New struct (not appended to `Config`) because the routing pod's surface differs from the supervisor's; merging would force every routing field onto the supervisor and vice versa.
**Files:**
- Create: `internal/config/routing.go`
- Create: `internal/config/routing_test.go`
- [ ] **Step 1: Write the failing test**
Create `internal/config/routing_test.go`:
```go
package config_test
import (
"testing"
"github.com/mathiasbq/supervisor/internal/config"
"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
)
func TestLoadRoutingDefaults(t *testing.T) {
for _, k := range []string{
"ROUTING_PORT", "ROUTING_MCP_TOKEN", "LITELLM_BASE_URL", "LITELLM_API_KEY",
"BRAIN_URL", "HYPERGUILD_LOCAL_MODEL", "HYPERGUILD_CLAUDE_MODEL",
"HYPERGUILD_ROUTE_LOCAL_FLOOR", "HYPERGUILD_ROUTE_LOCAL_CEIL",
"HYPERGUILD_PASS_RATE_TTL_SECONDS",
} {
t.Setenv(k, "")
}
cfg, err := config.LoadRouting()
require.NoError(t, err)
assert.Equal(t, "3210", cfg.Port)
assert.Equal(t, "", cfg.MCPAuthToken)
assert.Equal(t, "http://piguard:4000", cfg.LiteLLMBaseURL)
assert.Equal(t, "http://ingestion.supervisor:3300", cfg.BrainURL)
assert.Equal(t, "qwen35", cfg.LocalModel)
assert.Equal(t, "claude-sonnet-4-6", cfg.ClaudeModel)
assert.InDelta(t, 0.90, cfg.RouteLocalFloor, 1e-9)
assert.InDelta(t, 0.70, cfg.RouteLocalCeil, 1e-9)
assert.Equal(t, 60, cfg.PassRateTTLSeconds)
}
func TestLoadRoutingFromEnv(t *testing.T) {
t.Setenv("ROUTING_PORT", "3250")
t.Setenv("ROUTING_MCP_TOKEN", "tok-xyz")
t.Setenv("LITELLM_BASE_URL", "http://localhost:4000")
t.Setenv("LITELLM_API_KEY", "lk")
t.Setenv("BRAIN_URL", "http://localhost:3300")
t.Setenv("HYPERGUILD_LOCAL_MODEL", "qwen2-7b")
t.Setenv("HYPERGUILD_CLAUDE_MODEL", "claude-opus-4-7")
t.Setenv("HYPERGUILD_ROUTE_LOCAL_FLOOR", "0.85")
t.Setenv("HYPERGUILD_ROUTE_LOCAL_CEIL", "0.65")
t.Setenv("HYPERGUILD_PASS_RATE_TTL_SECONDS", "30")
cfg, err := config.LoadRouting()
require.NoError(t, err)
assert.Equal(t, "3250", cfg.Port)
assert.Equal(t, "tok-xyz", cfg.MCPAuthToken)
assert.Equal(t, "http://localhost:4000", cfg.LiteLLMBaseURL)
assert.Equal(t, "lk", cfg.LiteLLMAPIKey)
assert.Equal(t, "http://localhost:3300", cfg.BrainURL)
assert.Equal(t, "qwen2-7b", cfg.LocalModel)
assert.Equal(t, "claude-opus-4-7", cfg.ClaudeModel)
assert.InDelta(t, 0.85, cfg.RouteLocalFloor, 1e-9)
assert.InDelta(t, 0.65, cfg.RouteLocalCeil, 1e-9)
assert.Equal(t, 30, cfg.PassRateTTLSeconds)
}
func TestLoadRoutingRejectsBadFloat(t *testing.T) {
t.Setenv("HYPERGUILD_ROUTE_LOCAL_FLOOR", "not-a-number")
_, err := config.LoadRouting()
require.Error(t, err)
assert.Contains(t, err.Error(), "HYPERGUILD_ROUTE_LOCAL_FLOOR")
}
```
- [ ] **Step 2: Run the test to confirm it fails**
```bash
cd ~/Documents/local-dev/AI/hyperguild/.worktrees/mode-2-routing-pod
go test ./internal/config/... -run TestLoadRouting -v
```
Expected: FAIL — `undefined: config.LoadRouting` and `undefined: config.RoutingConfig`.
- [ ] **Step 3: Write the implementation**
Create `internal/config/routing.go`:
```go
package config

import (
	"fmt"
	"os"
	"strconv"
)

// RoutingConfig holds the runtime configuration for the routing pod.
// Separate from Config because the routing pod's surface differs from the supervisor's.
type RoutingConfig struct {
	Port               string  // ROUTING_PORT, default 3210
	MCPAuthToken       string  // ROUTING_MCP_TOKEN, optional bearer token
	LiteLLMBaseURL     string  // LITELLM_BASE_URL, default http://piguard:4000
	LiteLLMAPIKey      string  // LITELLM_API_KEY
	BrainURL           string  // BRAIN_URL, default http://ingestion.supervisor:3300
	LocalModel         string  // HYPERGUILD_LOCAL_MODEL, default qwen35
	ClaudeModel        string  // HYPERGUILD_CLAUDE_MODEL, default claude-sonnet-4-6
	RouteLocalFloor    float64 // HYPERGUILD_ROUTE_LOCAL_FLOOR, default 0.90
	RouteLocalCeil     float64 // HYPERGUILD_ROUTE_LOCAL_CEIL, default 0.70
	PassRateTTLSeconds int     // HYPERGUILD_PASS_RATE_TTL_SECONDS, default 60
}

func LoadRouting() (RoutingConfig, error) {
	cfg := RoutingConfig{
		Port:           envOr("ROUTING_PORT", "3210"),
		MCPAuthToken:   os.Getenv("ROUTING_MCP_TOKEN"),
		LiteLLMBaseURL: envOr("LITELLM_BASE_URL", "http://piguard:4000"),
		LiteLLMAPIKey:  os.Getenv("LITELLM_API_KEY"),
		BrainURL:       envOr("BRAIN_URL", "http://ingestion.supervisor:3300"),
		LocalModel:     envOr("HYPERGUILD_LOCAL_MODEL", "qwen35"),
		ClaudeModel:    envOr("HYPERGUILD_CLAUDE_MODEL", "claude-sonnet-4-6"),
	}

	floor, err := parseFloatEnv("HYPERGUILD_ROUTE_LOCAL_FLOOR", 0.90)
	if err != nil {
		return RoutingConfig{}, err
	}
	cfg.RouteLocalFloor = floor

	ceil, err := parseFloatEnv("HYPERGUILD_ROUTE_LOCAL_CEIL", 0.70)
	if err != nil {
		return RoutingConfig{}, err
	}
	cfg.RouteLocalCeil = ceil

	ttl, err := parseIntEnv("HYPERGUILD_PASS_RATE_TTL_SECONDS", 60)
	if err != nil {
		return RoutingConfig{}, err
	}
	cfg.PassRateTTLSeconds = ttl

	return cfg, nil
}

func parseFloatEnv(key string, def float64) (float64, error) {
	v := os.Getenv(key)
	if v == "" {
		return def, nil
	}
	f, err := strconv.ParseFloat(v, 64)
	if err != nil {
		return 0, fmt.Errorf("config: %s: %w", key, err)
	}
	return f, nil
}

func parseIntEnv(key string, def int) (int, error) {
	v := os.Getenv(key)
	if v == "" {
		return def, nil
	}
	n, err := strconv.Atoi(v)
	if err != nil {
		return 0, fmt.Errorf("config: %s: %w", key, err)
	}
	return n, nil
}
```
- [ ] **Step 4: Run the test to confirm it passes**
```bash
go test ./internal/config/... -run TestLoadRouting -v
```
Expected: PASS, all three tests green.
- [ ] **Step 5: Run `task check`**
```bash
task check 2>&1 | tail -20
```
Expected: lint clean, test green, vet clean, no drift, govulncheck clean.
- [ ] **Step 6: Commit**
```bash
git add internal/config/routing.go internal/config/routing_test.go
git commit -m "feat(routing): RoutingConfig + LoadRouting"
```
---
## Task 2: Decision policy
**Worktree:** hyperguild
Pure-function policy with no I/O. Decision rules in priority order: null → local; ≥floor → local; <ceil → claude; otherwise sample-band hash split.
**Files:**
- Create: `internal/routing/policy.go`
- Create: `internal/routing/policy_test.go`
- [ ] **Step 1: Write the failing test**
Create `internal/routing/policy_test.go`:
```go
package routing_test
import (
"testing"
"github.com/mathiasbq/supervisor/internal/routing"
"github.com/stretchr/testify/assert"
)
func ptr(f float64) *float64 { return &f }
func TestPolicyDecide(t *testing.T) {
p := routing.Policy{Floor: 0.9, Ceil: 0.7}
cases := []struct {
name string
passRate *float64
hash uint64
want routing.Decision
}{
{"null pass rate → local", nil, 0, routing.DecideLocal},
{"null pass rate, hash irrelevant → local", nil, 0xDEADBEEF, routing.DecideLocal},
{"at floor → local", ptr(0.9), 0, routing.DecideLocal},
{"above floor → local", ptr(0.95), 0, routing.DecideLocal},
{"below ceil → claude", ptr(0.5), 0, routing.DecideClaude},
{"at ceil → sample-band even-hash → local", ptr(0.7), 0, routing.DecideLocal},
{"sample band, even hash → local", ptr(0.8), 2, routing.DecideLocal},
{"sample band, odd hash → claude", ptr(0.8), 3, routing.DecideClaude},
}
for _, tc := range cases {
t.Run(tc.name, func(t *testing.T) {
assert.Equal(t, tc.want, p.Decide(tc.passRate, tc.hash))
})
}
}
```
- [ ] **Step 2: Run the test**
```bash
go test ./internal/routing/... -run TestPolicyDecide -v
```
Expected: FAIL — package `internal/routing` does not exist.
- [ ] **Step 3: Write the implementation**
Create `internal/routing/policy.go`:
```go
package routing

// Decision is the route picked for a single skill call.
type Decision int

const (
	DecideLocal Decision = iota
	DecideClaude
)

func (d Decision) String() string {
	if d == DecideLocal {
		return "local"
	}
	return "claude"
}

// Policy holds the floor/ceil thresholds for routing decisions.
//
// Rules (in order):
//
//  1. passRate == nil → DecideLocal (default-to-local for cost-routable skills)
//  2. *passRate >= Floor → DecideLocal (trust local)
//  3. *passRate < Ceil → DecideClaude (don't trust local)
//  4. otherwise (sample band) → requestHash low bit picks: 0=local, 1=claude
type Policy struct {
	Floor float64
	Ceil  float64
}

// Decide returns the routing decision for a single call.
// requestHash is consulted only when passRate is in the sample band [Ceil, Floor).
func (p Policy) Decide(passRate *float64, requestHash uint64) Decision {
	if passRate == nil {
		return DecideLocal
	}
	if *passRate >= p.Floor {
		return DecideLocal
	}
	if *passRate < p.Ceil {
		return DecideClaude
	}
	if requestHash&1 == 0 {
		return DecideLocal
	}
	return DecideClaude
}
```
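To see how the four rules partition the pass-rate range, the standalone sketch below restates them as a free function and walks a few values across both boundaries. `decide` is an illustrative name, not part of the plan; the thresholds are the plan's defaults (floor 0.90, ceil 0.70).

```go
package main

import "fmt"

// decide restates the four rules of Policy.Decide as a free function so the
// example compiles on its own; floor and ceil mirror Policy.Floor and Policy.Ceil.
func decide(passRate *float64, hash uint64, floor, ceil float64) string {
	switch {
	case passRate == nil:
		return "local" // rule 1: no data, default to local
	case *passRate >= floor:
		return "local" // rule 2: trust local
	case *passRate < ceil:
		return "claude" // rule 3: don't trust local
	case hash&1 == 0:
		return "local" // rule 4: sample band, even hash
	default:
		return "claude" // rule 4: sample band, odd hash
	}
}

func main() {
	for _, r := range []float64{0.95, 0.90, 0.80, 0.70, 0.69} {
		fmt.Printf("pass_rate=%.2f even_hash=%s odd_hash=%s\n",
			r, decide(&r, 2, 0.90, 0.70), decide(&r, 3, 0.90, 0.70))
	}
	fmt.Println("no data:", decide(nil, 3, 0.90, 0.70))
}
```

Only 0.80 and 0.70 depend on the hash parity. 0.90 sits on the inclusive floor and always goes local, while 0.69 is below the ceil and always goes to Claude.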
- [ ] **Step 4: Run the test to confirm it passes**
```bash
go test ./internal/routing/... -run TestPolicyDecide -v
```
Expected: PASS — eight subtests green.
- [ ] **Step 5: Run `task check`**
```bash
task check 2>&1 | tail -10
```
- [ ] **Step 6: Commit**
```bash
git add internal/routing/policy.go internal/routing/policy_test.go
git commit -m "feat(routing): decision policy"
```
---
## Task 3: Canonical request hash
**Worktree:** hyperguild
SHA-256-based hash of `(system, user)` for deterministic sample-band routing. Same prompt pair → same decision across calls.
**Files:**
- Create: `internal/routing/hash.go`
- Create: `internal/routing/hash_test.go`
- [ ] **Step 1: Write the failing test**
Create `internal/routing/hash_test.go`:
```go
package routing_test
import (
"testing"
"github.com/mathiasbq/supervisor/internal/routing"
"github.com/stretchr/testify/assert"
)
func TestCanonicalHashDeterministic(t *testing.T) {
a := routing.CanonicalHash("system one", "user one")
b := routing.CanonicalHash("system one", "user one")
assert.Equal(t, a, b, "same inputs must produce same hash")
}
func TestCanonicalHashDistinguishesInputs(t *testing.T) {
cases := [][2]string{
{"sys", "user"},
{"sys", "user2"},
{"sys2", "user"},
{"", "system\x00user"}, // separator collision attempt
{"system\x00user", ""},
}
seen := make(map[uint64]bool)
for _, c := range cases {
h := routing.CanonicalHash(c[0], c[1])
assert.False(t, seen[h], "collision on %v", c)
seen[h] = true
}
}
func TestCanonicalHashLowBitDistribution(t *testing.T) {
// Sanity check: across 1000 distinct inputs, low-bit split is roughly even.
zeros, ones := 0, 0
for i := 0; i < 1000; i++ {
h := routing.CanonicalHash("sys", string(rune('a'+(i%26)))+string(rune(i)))
if h&1 == 0 {
zeros++
} else {
ones++
}
}
// Allow ±150 (15% of the 1000 samples) around an even 500/500 split. Tighter would be flaky on real data.
assert.InDelta(t, 500, zeros, 150)
assert.InDelta(t, 500, ones, 150)
}
```
- [ ] **Step 2: Run the test**
```bash
go test ./internal/routing/... -run TestCanonicalHash -v
```
Expected: FAIL — `undefined: routing.CanonicalHash`.
- [ ] **Step 3: Write the implementation**
Create `internal/routing/hash.go`:
```go
package routing

import (
	"crypto/sha256"
	"encoding/binary"
)

// CanonicalHash returns a deterministic 64-bit hash of (system, user).
// Used to make sample-band routing decisions reproducible: identical input
// strings produce the same hash on every call, independent of process state.
//
// Inputs are joined with a 0x00 byte separator before hashing — distinguishes
// (system="ab", user="cd") from (system="abcd", user="").
func CanonicalHash(system, user string) uint64 {
	h := sha256.New()
	h.Write([]byte(system))
	h.Write([]byte{0})
	h.Write([]byte(user))
	sum := h.Sum(nil)
	return binary.BigEndian.Uint64(sum[:8])
}
```
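The separator does its job for ordinary prompt text, but it cannot separate inputs that themselves contain a 0x00 byte: `("a\x00b", "c")` and `("a", "b\x00c")` feed SHA-256 the same byte stream. The sketch below demonstrates that edge case next to a length-prefixed variant. `nulJoined` mirrors the plan's `CanonicalHash`; `lengthPrefixed` is a hypothetical alternative and not something the plan adopts.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// nulJoined mirrors CanonicalHash: system, a 0x00 byte, then user.
func nulJoined(system, user string) uint64 {
	h := sha256.New()
	h.Write([]byte(system))
	h.Write([]byte{0})
	h.Write([]byte(user))
	return binary.BigEndian.Uint64(h.Sum(nil)[:8])
}

// lengthPrefixed is a HYPOTHETICAL variant, not part of the plan: it writes
// len(system) as 8 bytes first, so no input can imitate the field boundary.
func lengthPrefixed(system, user string) uint64 {
	h := sha256.New()
	var n [8]byte
	binary.BigEndian.PutUint64(n[:], uint64(len(system)))
	h.Write(n[:])
	h.Write([]byte(system))
	h.Write([]byte(user))
	return binary.BigEndian.Uint64(h.Sum(nil)[:8])
}

func main() {
	// Ordinary text: the separator keeps the two pairs apart.
	fmt.Println("distinct:", nulJoined("ab", "cd") != nulJoined("abcd", ""))
	// Inputs containing 0x00 can collapse to one byte stream.
	fmt.Println("collide:", nulJoined("a\x00b", "c") == nulJoined("a", "b\x00c"))
	// The length prefix keeps the same pair apart.
	fmt.Println("distinct:", lengthPrefixed("a\x00b", "c") != lengthPrefixed("a", "b\x00c"))
}
```

For routing this only affects which side of the sample band a request lands on, so the simpler separator is likely good enough as long as prompts are plain text.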
- [ ] **Step 4: Run tests + `task check`**
```bash
go test ./internal/routing/... -run TestCanonicalHash -v
task check 2>&1 | tail -10
```
Expected: PASS, all checks green.
- [ ] **Step 5: Commit**
```bash
git add internal/routing/hash.go internal/routing/hash_test.go
git commit -m "feat(routing): canonical request hash"
```
---
## Task 4: Pass-rate fetcher with TTL cache
**Worktree:** hyperguild
HTTP client that calls `GET ${BrainURL}/pass-rate?skill=X&window=7d`, caches the response (`*float64`, possibly nil) for `TTL`. On error, returns `(nil, err)` so the dispatch wrapper falls through to default-to-local.
**Files:**
- Create: `internal/routing/passrate.go`
- Create: `internal/routing/passrate_test.go`
- [ ] **Step 1: Write the failing test**
Create `internal/routing/passrate_test.go`:
```go
package routing_test
import (
"context"
"encoding/json"
"net/http"
"net/http/httptest"
"sync/atomic"
"testing"
"time"
"github.com/mathiasbq/supervisor/internal/routing"
"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
)
func TestFetcherGetReturnsPassRate(t *testing.T) {
srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
assert.Equal(t, http.MethodGet, r.Method)
assert.Equal(t, "/pass-rate", r.URL.Path)
assert.Equal(t, "tdd", r.URL.Query().Get("skill"))
assert.Equal(t, "7d", r.URL.Query().Get("window"))
w.Header().Set("Content-Type", "application/json")
_ = json.NewEncoder(w).Encode(map[string]any{"skill": "tdd", "pass_rate": 0.94})
}))
defer srv.Close()
f := routing.NewFetcher(srv.URL, "7d", time.Minute)
pr, err := f.Get(context.Background(), "tdd")
require.NoError(t, err)
require.NotNil(t, pr)
assert.InDelta(t, 0.94, *pr, 1e-9)
}
func TestFetcherGetReturnsNilWhenNoData(t *testing.T) {
srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
_ = json.NewEncoder(w).Encode(map[string]any{"skill": "novel", "pass_rate": nil})
}))
defer srv.Close()
f := routing.NewFetcher(srv.URL, "7d", time.Minute)
pr, err := f.Get(context.Background(), "novel")
require.NoError(t, err)
assert.Nil(t, pr)
}
func TestFetcherCachesWithinTTL(t *testing.T) {
var calls int32
srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
atomic.AddInt32(&calls, 1)
_ = json.NewEncoder(w).Encode(map[string]any{"pass_rate": 0.5})
}))
defer srv.Close()
f := routing.NewFetcher(srv.URL, "7d", time.Minute)
for i := 0; i < 5; i++ {
_, err := f.Get(context.Background(), "tdd")
require.NoError(t, err)
}
assert.Equal(t, int32(1), atomic.LoadInt32(&calls), "should hit upstream once and serve four times from cache")
}
func TestFetcherSurfacesUpstreamError(t *testing.T) {
srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
http.Error(w, "boom", http.StatusInternalServerError)
}))
defer srv.Close()
f := routing.NewFetcher(srv.URL, "7d", time.Minute)
pr, err := f.Get(context.Background(), "tdd")
require.Error(t, err)
assert.Nil(t, pr)
}
```
- [ ] **Step 2: Run the test**
```bash
go test ./internal/routing/... -run TestFetcher -v
```
Expected: FAIL — `undefined: routing.NewFetcher`.
- [ ] **Step 3: Write the implementation**
Create `internal/routing/passrate.go`:
```go
package routing

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"sync"
	"time"
)

// Fetcher reads /pass-rate from the brain pod with a per-skill TTL cache.
type Fetcher struct {
	BaseURL string
	Window  string
	TTL     time.Duration
	HTTP    *http.Client

	mu    sync.Mutex
	cache map[string]cachedRate
}

type cachedRate struct {
	value *float64
	at    time.Time
}

type passRateResponse struct {
	PassRate *float64 `json:"pass_rate"`
}

// NewFetcher returns a Fetcher that calls baseURL + /pass-rate with the
// given window string. If ttl is zero, defaults to 60 seconds. The HTTP
// client uses a 1-second total timeout.
func NewFetcher(baseURL, window string, ttl time.Duration) *Fetcher {
	if ttl == 0 {
		ttl = 60 * time.Second
	}
	return &Fetcher{
		BaseURL: baseURL,
		Window:  window,
		TTL:     ttl,
		HTTP:    &http.Client{Timeout: time.Second},
		cache:   make(map[string]cachedRate),
	}
}

// Get returns the pass rate for the named skill, or nil if no data exists,
// or an error if the brain is unreachable. Caches successful results.
func (f *Fetcher) Get(ctx context.Context, skill string) (*float64, error) {
	f.mu.Lock()
	if c, ok := f.cache[skill]; ok && time.Since(c.at) < f.TTL {
		v := c.value
		f.mu.Unlock()
		return v, nil
	}
	f.mu.Unlock()

	u := fmt.Sprintf("%s/pass-rate?skill=%s&window=%s",
		f.BaseURL, url.QueryEscape(skill), url.QueryEscape(f.Window))
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, u, nil)
	if err != nil {
		return nil, fmt.Errorf("passrate: build request: %w", err)
	}
	resp, err := f.HTTP.Do(req)
	if err != nil {
		return nil, fmt.Errorf("passrate: request: %w", err)
	}
	defer func() { _ = resp.Body.Close() }()

	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("passrate: server returned status %d", resp.StatusCode)
	}
	var body passRateResponse
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		return nil, fmt.Errorf("passrate: decode: %w", err)
	}

	f.mu.Lock()
	f.cache[skill] = cachedRate{value: body.PassRate, at: time.Now()}
	f.mu.Unlock()
	return body.PassRate, nil
}
```
- [ ] **Step 4: Run tests + `task check`**
```bash
go test ./internal/routing/... -run TestFetcher -v
task check 2>&1 | tail -10
```
- [ ] **Step 5: Commit**
```bash
git add internal/routing/passrate.go internal/routing/passrate_test.go
git commit -m "feat(routing): pass-rate fetcher with TTL cache"
```
---
## Task 5: Decision logger
**Worktree:** hyperguild
Posts a `session_log` MCP call to the brain pod's `/mcp` endpoint after every routing decision. Best-effort: returns errors but the caller does not block real work on them.
**Files:**
- Create: `internal/routing/log.go`
- Create: `internal/routing/log_test.go`
- [ ] **Step 1: Write the failing test**
Create `internal/routing/log_test.go`:
```go
package routing_test
import (
"context"
"encoding/json"
"io"
"net/http"
"net/http/httptest"
"testing"
"github.com/mathiasbq/supervisor/internal/routing"
"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
)
func TestLoggerLogDecision(t *testing.T) {
var captured map[string]any
srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
assert.Equal(t, http.MethodPost, r.Method)
assert.Equal(t, "/mcp", r.URL.Path)
body, _ := io.ReadAll(r.Body)
require.NoError(t, json.Unmarshal(body, &captured))
_ = json.NewEncoder(w).Encode(map[string]any{"jsonrpc": "2.0", "id": 1, "result": map[string]any{"content": []map[string]any{{"type": "text", "text": "ok"}}}})
}))
defer srv.Close()
l := routing.NewLogger(srv.URL)
err := l.LogDecision(context.Background(), routing.LogEntry{
SessionID: "sess-1",
Skill: "code_review",
Decision: "local",
Message: "model=qwen35, pass_rate=0.94",
ProjectRoot: "/home/x/proj",
DurationMs: 1234,
Failed: false,
})
require.NoError(t, err)
params := captured["params"].(map[string]any)
assert.Equal(t, "tools/call", captured["method"])
assert.Equal(t, "session_log", params["name"])
args := params["arguments"].(map[string]any)
assert.Equal(t, "_routing", args["skill"])
assert.Equal(t, "decide", args["phase"])
assert.Equal(t, "skip", args["final_status"])
assert.Contains(t, args["message"].(string), "code_review: local")
assert.Equal(t, "sess-1", args["session_id"])
assert.Equal(t, "/home/x/proj", args["project_root"])
assert.Equal(t, float64(1234), args["duration_ms"])
}
func TestLoggerLogFailure(t *testing.T) {
var captured map[string]any
srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
body, _ := io.ReadAll(r.Body)
_ = json.Unmarshal(body, &captured)
_ = json.NewEncoder(w).Encode(map[string]any{"jsonrpc": "2.0", "id": 1, "result": map[string]any{}})
}))
defer srv.Close()
l := routing.NewLogger(srv.URL)
err := l.LogDecision(context.Background(), routing.LogEntry{
SessionID: "s", Skill: "debug", Decision: "local", Message: "litellm down", Failed: true,
})
require.NoError(t, err)
args := captured["params"].(map[string]any)["arguments"].(map[string]any)
assert.Equal(t, "fail", args["final_status"])
}
func TestLoggerSurfacesUpstreamError(t *testing.T) {
srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
http.Error(w, "down", http.StatusBadGateway)
}))
defer srv.Close()
l := routing.NewLogger(srv.URL)
err := l.LogDecision(context.Background(), routing.LogEntry{Skill: "x", SessionID: "y", Decision: "local"})
require.Error(t, err)
}
```
- [ ] **Step 2: Run the test**
```bash
go test ./internal/routing/... -run TestLogger -v
```
Expected: FAIL — `undefined: routing.NewLogger`.
- [ ] **Step 3: Write the implementation**
Create `internal/routing/log.go`:
```go
package routing

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// LogEntry describes a single routing decision to log via the brain MCP.
type LogEntry struct {
	SessionID   string
	Skill       string // the original skill the call routed (e.g., "code_review")
	Decision    string // "local" or "claude" or "claude_fallback"
	Message     string // free-form, e.g. "model=qwen35, pass_rate=0.94"
	ProjectRoot string
	DurationMs  int64
	Failed      bool // true → final_status: "fail"; false → "skip"
}

// Logger posts session_log entries to a brain MCP at BrainURL + /mcp.
type Logger struct {
	BrainURL string
	HTTP     *http.Client
}

// NewLogger creates a Logger with a 2-second HTTP timeout.
func NewLogger(brainURL string) *Logger {
	return &Logger{
		BrainURL: brainURL,
		HTTP:     &http.Client{Timeout: 2 * time.Second},
	}
}

// LogDecision posts a session_log MCP call. Errors are returned but the caller
// MUST NOT block real work on them — logging is best-effort.
func (l *Logger) LogDecision(ctx context.Context, e LogEntry) error {
	status := "skip"
	if e.Failed {
		status = "fail"
	}
	payload := map[string]any{
		"jsonrpc": "2.0",
		"id":      1,
		"method":  "tools/call",
		"params": map[string]any{
			"name": "session_log",
			"arguments": map[string]any{
				"session_id":   e.SessionID,
				"skill":        "_routing",
				"phase":        "decide",
				"final_status": status,
				"message":      fmt.Sprintf("%s: %s — %s", e.Skill, e.Decision, e.Message),
				"duration_ms":  e.DurationMs,
				"project_root": e.ProjectRoot,
			},
		},
	}
	body, err := json.Marshal(payload)
	if err != nil {
		return fmt.Errorf("log: marshal: %w", err)
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, l.BrainURL+"/mcp", bytes.NewReader(body))
	if err != nil {
		return fmt.Errorf("log: build request: %w", err)
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := l.HTTP.Do(req)
	if err != nil {
		return fmt.Errorf("log: request: %w", err)
	}
	defer func() { _ = resp.Body.Close() }()

	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("log: server returned status %d", resp.StatusCode)
	}
	return nil
}
```
- [ ] **Step 4: Run tests + `task check`**
```bash
go test ./internal/routing/... -run TestLogger -v
task check 2>&1 | tail -10
```
- [ ] **Step 5: Commit**
```bash
git add internal/routing/log.go internal/routing/log_test.go
git commit -m "feat(routing): decision logger via brain MCP session_log"
```
---
## Task 6: Router (dispatch wrapper)
**Worktree:** hyperguild
Composes Fetcher + Policy + Logger + a `CompleteFunc`. The wrapper is what the four skill packages receive as their `CompleteFunc`. On a local-route error, it falls open by retrying once on the Claude model.
**Files:**
- Create: `internal/routing/router.go`
- Create: `internal/routing/router_test.go`
- [ ] **Step 1: Write the failing test**
Create `internal/routing/router_test.go`:
```go
package routing_test
import (
"context"
"encoding/json"
"errors"
"net/http"
"net/http/httptest"
"sync"
"testing"
"time"
"github.com/mathiasbq/supervisor/internal/routing"
"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
)
type fakeLLM struct {
mu sync.Mutex
calls []struct{ Model, System, User string }
resp string
err error
errOn string // if non-empty, only the named model errors
}
func (f *fakeLLM) Complete(_ context.Context, model, system, user string) (string, int64, error) {
f.mu.Lock()
defer f.mu.Unlock()
f.calls = append(f.calls, struct{ Model, System, User string }{model, system, user})
if f.errOn == model {
return "", 0, f.err
}
if f.err != nil && f.errOn == "" {
return "", 0, f.err
}
return f.resp, 100, nil
}
func newRouter(t *testing.T, llm *fakeLLM, passRate float64) (*routing.Router, *httptest.Server, *httptest.Server) {
t.Helper()
brain := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
switch r.URL.Path {
case "/pass-rate":
_ = json.NewEncoder(w).Encode(map[string]any{"pass_rate": passRate})
case "/mcp":
_ = json.NewEncoder(w).Encode(map[string]any{"jsonrpc": "2.0", "id": 1, "result": map[string]any{}})
}
}))
t.Cleanup(brain.Close)
r := &routing.Router{
Fetcher: routing.NewFetcher(brain.URL, "7d", time.Minute),
Logger: routing.NewLogger(brain.URL),
Policy: routing.Policy{Floor: 0.9, Ceil: 0.7},
LocalModel: "qwen35",
ClaudeModel: "claude-sonnet-4-6",
Complete: llm.Complete,
}
return r, brain, brain
}
func TestRouterRoutesLocalAtHighPassRate(t *testing.T) {
llm := &fakeLLM{resp: "ok"}
r, _, _ := newRouter(t, llm, 0.95)
out, _, err := r.Run(context.Background(), routing.RunInput{
Skill: "code_review", System: "sys", User: "user", SessionID: "s1", ProjectRoot: "/p",
})
require.NoError(t, err)
assert.Equal(t, "ok", out)
llm.mu.Lock()
defer llm.mu.Unlock()
require.Len(t, llm.calls, 1)
assert.Equal(t, "qwen35", llm.calls[0].Model)
}
func TestRouterRoutesClaudeAtLowPassRate(t *testing.T) {
llm := &fakeLLM{resp: "ok"}
r, _, _ := newRouter(t, llm, 0.3)
_, _, err := r.Run(context.Background(), routing.RunInput{
Skill: "code_review", System: "sys", User: "user", SessionID: "s2",
})
require.NoError(t, err)
llm.mu.Lock()
defer llm.mu.Unlock()
require.Len(t, llm.calls, 1)
assert.Equal(t, "claude-sonnet-4-6", llm.calls[0].Model)
}
func TestRouterFailsOpenLocalErrorToClaude(t *testing.T) {
llm := &fakeLLM{resp: "ok-after-fallback", err: errors.New("local boom"), errOn: "qwen35"}
r, _, _ := newRouter(t, llm, 0.95) // would route local
out, _, err := r.Run(context.Background(), routing.RunInput{
Skill: "code_review", System: "sys", User: "user", SessionID: "s3",
})
require.NoError(t, err)
assert.Equal(t, "ok-after-fallback", out)
llm.mu.Lock()
defer llm.mu.Unlock()
require.Len(t, llm.calls, 2)
assert.Equal(t, "qwen35", llm.calls[0].Model)
assert.Equal(t, "claude-sonnet-4-6", llm.calls[1].Model)
}
func TestRouterDefaultsToLocalWhenBrainUnreachable(t *testing.T) {
// Brain returns 500 → fetcher errors → router treats pass rate as nil → local.
brain := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
http.Error(w, "down", http.StatusInternalServerError)
}))
defer brain.Close()
llm := &fakeLLM{resp: "ok"}
r := &routing.Router{
Fetcher: routing.NewFetcher(brain.URL, "7d", time.Minute),
Logger: routing.NewLogger(brain.URL),
Policy: routing.Policy{Floor: 0.9, Ceil: 0.7},
LocalModel: "qwen35",
ClaudeModel: "claude-sonnet-4-6",
Complete: llm.Complete,
}
_, _, err := r.Run(context.Background(), routing.RunInput{
Skill: "code_review", System: "sys", User: "user", SessionID: "s4",
})
require.NoError(t, err)
llm.mu.Lock()
defer llm.mu.Unlock()
require.Len(t, llm.calls, 1)
assert.Equal(t, "qwen35", llm.calls[0].Model)
}
```
- [ ] **Step 2: Run the test**
```bash
go test ./internal/routing/... -run TestRouter -v
```
Expected: FAIL — `undefined: routing.Router`, `undefined: routing.RunInput`.
- [ ] **Step 3: Write the implementation**
Create `internal/routing/router.go`:
```go
package routing
import (
"context"
"fmt"
"log/slog"
)
// CompleteFunc matches the signature used by every skill package's Config.
type CompleteFunc func(ctx context.Context, model, system, user string) (string, int64, error)
// RunInput captures the per-call inputs the dispatch wrapper needs.
type RunInput struct {
Skill string
System string
User string
SessionID string
ProjectRoot string
}
// Router composes a pass-rate fetcher, a decision policy, a session logger,
// and a LiteLLM client. Skill packages receive a closure that adapts
// Router.Run to the CompleteFunc signature (see cmd/routing/main.go).
type Router struct {
Fetcher *Fetcher
Logger *Logger
Policy Policy
LocalModel string
ClaudeModel string
Complete CompleteFunc
}
// Run executes one skill call: decides local vs claude, calls LiteLLM, logs the
// decision. On local-side error, falls open by retrying once on the Claude model.
func (r *Router) Run(ctx context.Context, in RunInput) (string, int64, error) {
pr, ferr := r.Fetcher.Get(ctx, in.Skill)
if ferr != nil {
slog.Warn("router: pass-rate unreachable, defaulting to local", "skill", in.Skill, "err", ferr)
pr = nil
}
hash := CanonicalHash(in.System, in.User)
decision := r.Policy.Decide(pr, hash)
model := r.ClaudeModel
if decision == DecideLocal {
model = r.LocalModel
}
out, ms, err := r.Complete(ctx, model, in.System, in.User)
_ = r.Logger.LogDecision(ctx, LogEntry{
SessionID: in.SessionID,
Skill: in.Skill,
Decision: decision.String(),
Message: fmt.Sprintf("model=%s, pass_rate=%s", model, formatPassRate(pr)),
ProjectRoot: in.ProjectRoot,
DurationMs: ms,
Failed: err != nil,
})
if err != nil && decision == DecideLocal {
slog.Warn("router: local failed, falling open to claude", "skill", in.Skill, "err", err)
out, ms, err = r.Complete(ctx, r.ClaudeModel, in.System, in.User)
_ = r.Logger.LogDecision(ctx, LogEntry{
SessionID: in.SessionID,
Skill: in.Skill,
Decision: "claude_fallback",
Message: fmt.Sprintf("model=%s, after-local-error", r.ClaudeModel),
ProjectRoot: in.ProjectRoot,
DurationMs: ms,
Failed: err != nil,
})
}
return out, ms, err
}
func formatPassRate(pr *float64) string {
if pr == nil {
return "null"
}
return fmt.Sprintf("%.2f", *pr)
}
```
- [ ] **Step 4: Run tests + `task check`**
```bash
go test ./internal/routing/... -run TestRouter -v
task check 2>&1 | tail -10
```
- [ ] **Step 5: Commit**
```bash
git add internal/routing/router.go internal/routing/router_test.go
git commit -m "feat(routing): router dispatch wrapper"
```
---
## Task 7: Snapshot test for tool-schema parity
**Worktree:** hyperguild
Capture the supervisor's current advertisement of the four routed skills (`code_review`, `debug`, `retrospective`, `trainer`) into a JSON snapshot file. Add a test that builds a registry with the same four skill packages and asserts that its tool definitions match the snapshot after JSON normalization (whitespace and key order are ignored). This pins the schema contract so a change in any skill package fails the routing pod's test loudly.
**Files:**
- Create: `internal/routing/testdata/tools_list.snapshot.json`
- Create: `internal/routing/snapshot_test.go`
- [ ] **Step 1: Capture the supervisor's current advertisement**
```bash
cd ~/Documents/local-dev/AI/hyperguild/.worktrees/mode-2-routing-pod
mkdir -p internal/routing/testdata
go run ./cmd/supervisor &
SUPERVISOR_PID=$!
sleep 2
curl -sS -X POST http://localhost:3200/mcp \
-H 'Content-Type: application/json' \
-d '{"jsonrpc":"2.0","id":1,"method":"tools/list"}' \
| jq '.result.tools | map(select(.name == "code_review" or .name == "debug" or .name == "retrospective" or .name == "trainer")) | sort_by(.name)' \
> internal/routing/testdata/tools_list.snapshot.json
kill $SUPERVISOR_PID
wait $SUPERVISOR_PID 2>/dev/null
```
If the supervisor binary requires extra env vars to start, set them inline:
```bash
SUPERVISOR_CONFIG_DIR=./config/supervisor go run ./cmd/supervisor &
```
Inspect the file:
```bash
cat internal/routing/testdata/tools_list.snapshot.json | jq 'length'
```
Expected: `4`.
- [ ] **Step 2: Write the failing test**
Create `internal/routing/snapshot_test.go`:
```go
package routing_test
import (
"context"
"encoding/json"
"os"
"sort"
"testing"
iexec "github.com/mathiasbq/supervisor/internal/exec"
"github.com/mathiasbq/supervisor/internal/registry"
"github.com/mathiasbq/supervisor/internal/skills/debug"
"github.com/mathiasbq/supervisor/internal/skills/retrospective"
"github.com/mathiasbq/supervisor/internal/skills/review"
"github.com/mathiasbq/supervisor/internal/skills/trainer"
"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
)
// TestToolsListMatchesSupervisorSnapshot pins the four routed skills' tool
// definitions to the supervisor's current advertisement. If a skill package
// changes its schema, this test fails loudly so the snapshot can be updated
// in lockstep with the consumer.
func TestToolsListMatchesSupervisorSnapshot(t *testing.T) {
complete := func(_ context.Context, _, _, _ string) (string, int64, error) {
return "", 0, nil
}
_ = iexec.NewLiteLLM // keep import for future use
reg := registry.New()
reg.Register(review.New(review.Config{
SkillPrompt: "stub",
DefaultModel: "stub",
CompleteFunc: complete,
}))
reg.Register(debug.New(debug.Config{
SkillPrompt: "stub",
DefaultModel: "stub",
CompleteFunc: complete,
}))
reg.Register(retrospective.New(retrospective.Config{
SkillPrompt: "stub",
DefaultModel: "stub",
CompleteFunc: complete,
}))
reg.Register(trainer.New(trainer.Config{
ReaderPrompt: "stub",
WriterPrompt: "stub",
DefaultModel: "stub",
CompleteFunc: complete,
}))
tools := reg.Tools()
// Filter to the four routed skills only (registry may expose additional tools).
wanted := map[string]bool{"code_review": true, "debug": true, "retrospective": true, "trainer": true}
var routed []registry.ToolDef
for _, td := range tools {
if wanted[td.Name] {
routed = append(routed, td)
}
}
sort.Slice(routed, func(i, j int) bool { return routed[i].Name < routed[j].Name })
got, err := json.MarshalIndent(routed, "", " ")
require.NoError(t, err)
want, err := os.ReadFile("testdata/tools_list.snapshot.json")
require.NoError(t, err)
// Normalize both via re-encode so whitespace differences don't dominate.
var gotV, wantV any
require.NoError(t, json.Unmarshal(got, &gotV))
require.NoError(t, json.Unmarshal(want, &wantV))
gotN, _ := json.MarshalIndent(gotV, "", " ")
wantN, _ := json.MarshalIndent(wantV, "", " ")
assert.Equal(t, string(wantN), string(gotN),
"tool advertisement drifted from supervisor snapshot — update testdata/tools_list.snapshot.json deliberately if the schema change is intentional")
}
```
If the actual skill tool name is `review` rather than `code_review` (or vice versa), discover by inspecting `internal/skills/review/skill.go`'s `Tools()` and adjust both the snapshot capture filter and the test's `wanted` map. Use the discovered name throughout the rest of the plan.
- [ ] **Step 3: Run the test**
```bash
go test ./internal/routing/... -run TestToolsListMatchesSupervisorSnapshot -v
```
Expected: PASS — the snapshot was captured from the same registry the test exercises. If FAIL, the captured names differ from the wanted map; reconcile names per the note above.
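The comparison in the snapshot test amounts to checking that two JSON documents decode to the same value. A minimal standalone sketch of that idea, using `reflect.DeepEqual` instead of re-encoding; `jsonEquivalent` is a hypothetical name and not part of the plan:

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
)

// jsonEquivalent reports whether a and b encode the same JSON value,
// ignoring whitespace and object key order. Array order still matters.
func jsonEquivalent(a, b []byte) (bool, error) {
	var av, bv any
	if err := json.Unmarshal(a, &av); err != nil {
		return false, fmt.Errorf("decode first document: %w", err)
	}
	if err := json.Unmarshal(b, &bv); err != nil {
		return false, fmt.Errorf("decode second document: %w", err)
	}
	return reflect.DeepEqual(av, bv), nil
}

func main() {
	snapshot := []byte(`[{"name":"debug","description":"d"}]`)
	current := []byte("[\n  {\"description\": \"d\", \"name\": \"debug\"}\n]")
	same, err := jsonEquivalent(snapshot, current)
	fmt.Println(same, err)
}
```

Because array order is significant, both the `jq` capture and the test sort the tools by name before comparing.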
- [ ] **Step 4: `task check`**
```bash
task check 2>&1 | tail -10
```
- [ ] **Step 5: Commit**
```bash
git add internal/routing/snapshot_test.go internal/routing/testdata/tools_list.snapshot.json
git commit -m "test(routing): pin tool-schema parity with supervisor"
```
---
## Task 8: `cmd/routing/main.go` wiring
**Worktree:** hyperguild
Compose the binary: load config, build LiteLLM client, build Fetcher/Logger/Router, register the four skills, mount on the existing `internal/mcp` server with bearer auth.
**Files:**
- Create: `cmd/routing/main.go`
- Create: `cmd/routing/main_test.go`
- [ ] **Step 1: Write the integration test first**
Create `cmd/routing/main_test.go`:
```go
package main_test
import (
	"context"
	"encoding/json"
	"net/http"
	"net/http/httptest"
	"os/exec"
	"strings"
	"sync/atomic"
	"testing"
	"time"

	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
)

// TestRoutingPodEndToEnd boots the binary against fake LiteLLM + brain servers,
// calls tools/list and one tools/call, and verifies the brain saw both a
// /pass-rate lookup and a session_log POST.
func TestRoutingPodEndToEnd(t *testing.T) {
	if testing.Short() {
		t.Skip("end-to-end binary boot")
	}
	// Handlers run on server goroutines, so the counter must be atomic.
	var brainHits atomic.Int32
	llm := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		_ = json.NewEncoder(w).Encode(map[string]any{
			"choices": []map[string]any{{"message": map[string]any{"role": "assistant", "content": "stub"}}},
		})
	}))
	defer llm.Close()
	brain := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		brainHits.Add(1) // count every brain request: /pass-rate and /mcp
		switch r.URL.Path {
		case "/pass-rate":
			_ = json.NewEncoder(w).Encode(map[string]any{"pass_rate": 0.95})
		case "/mcp":
			_ = json.NewEncoder(w).Encode(map[string]any{"jsonrpc": "2.0", "id": 1, "result": map[string]any{}})
		}
	}))
	defer brain.Close()
	bin := buildRouting(t)
	cmd := exec.Command(bin)
	cmd.Env = append(cmd.Env,
		"ROUTING_PORT=33310",
		"LITELLM_BASE_URL="+llm.URL,
		"LITELLM_API_KEY=stub",
		"BRAIN_URL="+brain.URL,
		// go test runs in the package directory (cmd/routing), so the
		// repo-root config dir is two levels up.
		"SUPERVISOR_CONFIG_DIR=../../config/supervisor",
		"PATH="+osPath(),
	)
	require.NoError(t, cmd.Start())
	t.Cleanup(func() { _ = cmd.Process.Kill() })
	require.NoError(t, waitForPort(t, "127.0.0.1:33310", 5*time.Second))
	resp := mcpCall(t, "http://127.0.0.1:33310/mcp", `{"jsonrpc":"2.0","id":1,"method":"tools/list"}`)
	assert.Contains(t, resp, "code_review")
	resp = mcpCall(t, "http://127.0.0.1:33310/mcp", `{"jsonrpc":"2.0","id":2,"method":"tools/call","params":{"name":"code_review","arguments":{"project_root":"/tmp","files":["README.md"]}}}`)
	_ = resp // shape varies by skill; we only need a 200
	// Wait briefly for the session_log POST to land.
	deadline := time.Now().Add(2 * time.Second)
	for time.Now().Before(deadline) && brainHits.Load() < 2 {
		time.Sleep(50 * time.Millisecond)
	}
	assert.GreaterOrEqual(t, brainHits.Load(), int32(2), "expected at least one /pass-rate hit and one /mcp session_log hit")
}
```
Add helpers in the same file:
```go
func buildRouting(t *testing.T) string {
	t.Helper()
	bin := t.TempDir() + "/routing"
	// go test runs in the package directory (cmd/routing), so build ".".
	out, err := exec.Command("go", "build", "-o", bin, ".").CombinedOutput()
	require.NoError(t, err, "build failed: %s", out)
	return bin
}

// waitForPort polls /healthz until the server answers or dur elapses.
// Any HTTP response, whatever its status, proves the port is bound.
func waitForPort(_ *testing.T, addr string, dur time.Duration) error {
	deadline := time.Now().Add(dur)
	for time.Now().Before(deadline) {
		resp, err := http.Get("http://" + addr + "/healthz")
		if err == nil {
			_ = resp.Body.Close()
			return nil
		}
		time.Sleep(50 * time.Millisecond)
	}
	return context.DeadlineExceeded
}

func mcpCall(t *testing.T, url, body string) string {
	t.Helper()
	r, err := http.Post(url, "application/json", strings.NewReader(body))
	require.NoError(t, err)
	defer func() { _ = r.Body.Close() }()
	// strings.Builder has no ReadFrom method, so drain the body manually.
	var b strings.Builder
	buf := make([]byte, 4096)
	for {
		n, err := r.Body.Read(buf)
		b.Write(buf[:n])
		if err != nil { // io.EOF or a transport error ends the read
			break
		}
	}
	return b.String()
}

// osPath returns the test process's PATH so the child binary can resolve tools.
func osPath() string {
	// Cmd.Env is nil on a fresh Cmd; Cmd.Environ falls back to the parent environment.
	for _, e := range exec.Command("env").Environ() {
		if strings.HasPrefix(e, "PATH=") {
			return strings.TrimPrefix(e, "PATH=")
		}
	}
	return "/usr/bin:/bin"
}
```
- [ ] **Step 2: Run the test**
```bash
go test ./cmd/routing/... -v
```
Expected: FAIL — `cmd/routing/main.go` doesn't exist.
- [ ] **Step 3: Write the binary**
Create `cmd/routing/main.go`:
```go
// cmd/routing/main.go
package main
import (
"context"
"log/slog"
"net/http"
"os"
"time"
"github.com/mathiasbq/supervisor/internal/config"
iexec "github.com/mathiasbq/supervisor/internal/exec"
"github.com/mathiasbq/supervisor/internal/mcp"
"github.com/mathiasbq/supervisor/internal/registry"
"github.com/mathiasbq/supervisor/internal/routing"
"github.com/mathiasbq/supervisor/internal/skills/debug"
"github.com/mathiasbq/supervisor/internal/skills/retrospective"
"github.com/mathiasbq/supervisor/internal/skills/review"
"github.com/mathiasbq/supervisor/internal/skills/trainer"
)
func main() {
logger := slog.New(slog.NewTextHandler(os.Stderr, nil))
slog.SetDefault(logger)
cfg, err := config.LoadRouting()
if err != nil {
logger.Error("config load failed", "err", err)
os.Exit(1)
}
// Load prompts from config dir (same files the supervisor uses).
configDir := envOr("SUPERVISOR_CONFIG_DIR", "/app/config/supervisor")
mustRead := func(path string) string {
b, err := os.ReadFile(configDir + "/" + path)
if err != nil {
logger.Error("read prompt failed", "path", path, "err", err)
os.Exit(1)
}
return string(b)
}
llm := iexec.NewLiteLLM(cfg.LiteLLMBaseURL, cfg.LiteLLMAPIKey, 0)
router := &routing.Router{
Fetcher: routing.NewFetcher(cfg.BrainURL, "7d", time.Duration(cfg.PassRateTTLSeconds)*time.Second),
Logger: routing.NewLogger(cfg.BrainURL),
Policy: routing.Policy{Floor: cfg.RouteLocalFloor, Ceil: cfg.RouteLocalCeil},
LocalModel: cfg.LocalModel,
ClaudeModel: cfg.ClaudeModel,
Complete: llm.Complete,
}
// Skill packages call CompleteFunc(ctx, model, system, user) — no session_id
// or project_root in the signature. Rather than modifying every skill's API
// (and inflating Plan 6's blast radius), the routing pod logs every decision
// under a fixed session_id "_routing". Operators query
// `GET /pass-rate?skill=_routing&window=...` to inspect routing health; per-
// session correlation is sacrificed for a much simpler implementation.
const routingSessionID = "_routing"
wrap := func(skillName string) routing.CompleteFunc {
return func(ctx context.Context, _, system, user string) (string, int64, error) {
// The model param is ignored: the router picks the model based on policy.
return router.Run(ctx, routing.RunInput{
Skill: skillName,
System: system,
User: user,
SessionID: routingSessionID,
ProjectRoot: "",
})
}
}
reg := registry.New()
reg.Register(review.New(review.Config{
SkillPrompt: mustRead("review.md"),
DefaultModel: cfg.LocalModel,
CompleteFunc: wrap("code_review"),
}))
reg.Register(debug.New(debug.Config{
SkillPrompt: mustRead("debug.md"),
DefaultModel: cfg.LocalModel,
CompleteFunc: wrap("debug"),
}))
reg.Register(retrospective.New(retrospective.Config{
SkillPrompt: mustRead("retrospective.md"),
DefaultModel: cfg.LocalModel,
CompleteFunc: wrap("retrospective"),
}))
reg.Register(trainer.New(trainer.Config{
ReaderPrompt: mustRead("trainer-reader.md"),
WriterPrompt: mustRead("trainer-writer.md"),
DefaultModel: cfg.LocalModel,
CompleteFunc: wrap("trainer"),
}))
srv := mcp.NewServer(reg, cfg.MCPAuthToken)
mux := http.NewServeMux()
mux.Handle("/mcp", srv)
mux.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request) {
w.WriteHeader(http.StatusOK)
})
addr := ":" + cfg.Port
logger.Info("routing pod starting", "addr", addr,
"local", cfg.LocalModel, "claude", cfg.ClaudeModel,
"floor", cfg.RouteLocalFloor, "ceil", cfg.RouteLocalCeil)
if err := http.ListenAndServe(addr, mux); err != nil {
logger.Error("server stopped", "err", err)
os.Exit(1)
}
}
func envOr(key, def string) string {
if v := os.Getenv(key); v != "" {
return v
}
return def
}
```
If the existing skill packages' `Config` field names differ from what's used here (e.g. `SkillPrompt` vs `Prompt`), adjust by reading each package's `skill.go`.
- [ ] **Step 4: Run integration test + `task check`**
```bash
go test ./cmd/routing/... -v
task check 2>&1 | tail -15
```
Expected: PASS for both.
- [ ] **Step 5: Commit**
```bash
git add cmd/routing/main.go cmd/routing/main_test.go
git commit -m "feat(routing): cmd/routing binary"
```
---
## Task 9: Update `mode client-local` template
**Worktree:** hyperguild
Replace the `_routing_pending` placeholder with a real `headers` block carrying `X-Hyperguild-Mode: client-local`. URL stays at `koala:30310/mcp`.
**Files:**
- Modify: `cmd/hyperguild/mode.go`
- Modify: `cmd/hyperguild/mode_test.go`
- Modify: `cmd/hyperguild/README.md`
- [ ] **Step 1: Update the failing test**
In `cmd/hyperguild/mode_test.go`, find the existing `TestModeClientLocal` (or equivalent). Add an assertion for the new shape:
```go
func TestModeClientLocalHasRoutingHeader(t *testing.T) {
tmp := t.TempDir() + "/mcp.json"
out := &bytes.Buffer{}
stderr := &bytes.Buffer{}
require.NoError(t, runMode(context.Background(), []string{"client-local", "--out", tmp}, nil, out, stderr))
body, err := os.ReadFile(tmp)
require.NoError(t, err)
var doc map[string]any
require.NoError(t, json.Unmarshal(body, &doc))
servers := doc["mcpServers"].(map[string]any)
routing := servers["routing"].(map[string]any)
assert.Equal(t, "http://koala:30310/mcp", routing["url"])
assert.NotContains(t, routing, "_routing_pending", "placeholder should be removed once Plan 6 ships")
headers, ok := routing["headers"].(map[string]any)
require.True(t, ok, "routing entry should have headers block")
assert.Equal(t, "client-local", headers["X-Hyperguild-Mode"])
}
```
- [ ] **Step 2: Run the test**
```bash
go test ./cmd/hyperguild/... -run TestModeClientLocal -v
```
Expected: FAIL — `_routing_pending` is still there OR `headers` is missing.
- [ ] **Step 3: Update `mode.go`**
Replace the `routing` entry inside `modeClientLocal`:
```go
"routing": map[string]any{
"url": "http://koala:30310/mcp",
"description": "Mode 2 routing pod — routes skill calls to LiteLLM/local",
"headers": map[string]any{
"X-Hyperguild-Mode": "client-local",
},
},
```
- [ ] **Step 4: Update `cmd/hyperguild/README.md`**
Find the section that mentions "Plan 6 — routing pod not deployed yet" and rewrite that paragraph:
```markdown
The `routing` entry points at `koala:30310/mcp` (the routing pod, deployed
in Plan 6). The `X-Hyperguild-Mode: client-local` header is forward-compat
for future modes; the pod treats absent or unknown values as `client-local`.
```
- [ ] **Step 5: Run tests + `task check`**
```bash
go test ./cmd/hyperguild/... -run TestModeClientLocal -v
task check 2>&1 | tail -10
```
- [ ] **Step 6: Commit**
```bash
git add cmd/hyperguild/mode.go cmd/hyperguild/mode_test.go cmd/hyperguild/README.md
git commit -m "feat(hyperguild): mode client-local writes routing headers"
```
---
## Task 10: `Dockerfile.routing` + CD workflow extension
**Worktree:** hyperguild
Add a Dockerfile for the routing binary and extend the CD workflow to build + push the image and update the infra repo's routing deployment manifest.
**Files:**
- Create: `Dockerfile.routing`
- Modify: `.gitea/workflows/cd.yml`
- [ ] **Step 1: Write `Dockerfile.routing`**
```dockerfile
# syntax=docker/dockerfile:1
# ── Build stage ───────────────────────────────────────────────────────────────
FROM golang:1.26-bookworm AS builder
ARG VERSION=dev
WORKDIR /src
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 GOOS=linux GOARCH=amd64 \
go build -trimpath -ldflags="-s -w -X main.version=${VERSION}" \
-o /out/routing ./cmd/routing
# ── Runtime stage ─────────────────────────────────────────────────────────────
FROM gcr.io/distroless/base-debian12
COPY --from=builder /out/routing /usr/local/bin/routing
COPY config/ /app/config/
ENV SUPERVISOR_CONFIG_DIR=/app/config/supervisor
ENV ROUTING_PORT=3210
EXPOSE 3210
USER 65532:65532
ENTRYPOINT ["/usr/local/bin/routing"]
```
- [ ] **Step 2: Extend `.gitea/workflows/cd.yml`**
Add an `env:` entry:
```yaml
env:
  SERVICE: supervisor
  IMAGE: gitea.d-ma.be/mathias/supervisor
  INGESTION_IMAGE: gitea.d-ma.be/mathias/ingestion
  ROUTING_IMAGE: gitea.d-ma.be/mathias/routing
  INFRA_REPO: git@gitea.d-ma.be:mathias/infra.git
  BUILDKIT_HOST: unix:///run/buildkit/buildkitd.sock
```
Add a new step after the ingestion build step:
```yaml
      - name: Build and push routing image
        run: |
          set -e
          trap 'rm -f /tmp/routing-image.tar' EXIT
          IMAGE_TAG="${{ github.sha }}"
          echo "Building ${ROUTING_IMAGE}:${IMAGE_TAG}"
          buildctl --addr "${BUILDKIT_HOST}" build \
            --frontend dockerfile.v0 \
            --local context=. \
            --local dockerfile=. \
            --opt filename=Dockerfile.routing \
            --opt build-arg:VERSION="${IMAGE_TAG}" \
            --output type=oci,dest=/tmp/routing-image.tar
          skopeo copy \
            oci-archive:/tmp/routing-image.tar \
            docker://${ROUTING_IMAGE}:${IMAGE_TAG} \
            --dest-creds "${{ secrets.REGISTRY_CREDS }}"
          echo "Built and pushed ${ROUTING_IMAGE}:${IMAGE_TAG}"
```
In the "Update infra repo" step, add a third sed and update the commit:
```yaml
sed -i "s|gitea.d-ma.be/mathias/routing:.*|gitea.d-ma.be/mathias/routing:${IMAGE_TAG}|" \
"k3s/apps/routing/deployment.yaml"
git config user.email "cd-bot@d-ma.be"
git config user.name "CD Bot"
git add "k3s/apps/${SERVICE}/deployment.yaml" \
"k3s/apps/${SERVICE}/ingestion-deployment.yaml" \
"k3s/apps/routing/deployment.yaml"
git commit -m "chore(deploy): supervisor+ingestion+routing → ${IMAGE_TAG}"
```
- [ ] **Step 3: Validate the YAML locally**
```bash
yq eval '.jobs.deploy.steps | length' .gitea/workflows/cd.yml
```
Expected: a number greater than the original (one new step added).
- [ ] **Step 4: Commit**
The workflow change is hot: once pushed, CD will try to build the routing image. Until the infra repo has `k3s/apps/routing/deployment.yaml`, the "Update infra repo" step fails, because both `sed -i` and `git add` exit non-zero when the file is missing. Two options:
**Option A (preferred):** Land the infra-repo manifests (Tasks 11 and 12) in the infra worktree FIRST, push them so they exist on `infra` main, then push this commit. Order: Tasks 11 → 12 → 10.
**Option B:** Land the workflow change with a guard, then drop the guard once manifests exist.
Implementer should pick Option A. After the manifests are in place:
```bash
git add Dockerfile.routing .gitea/workflows/cd.yml
git commit -m "build(routing): Dockerfile + CD workflow"
```
DO NOT push this commit until Tasks 11 and 12 have been pushed to the infra repo's `main`.
---
## Task 11: Routing pod manifests (infra worktree)
**Worktree:** infra
Create the k3s manifests for the routing pod. Mirror the supervisor's structure for operator familiarity.
**Files:**
- Create: `k3s/apps/routing/namespace.yaml`
- Create: `k3s/apps/routing/deployment.yaml`
- Create: `k3s/apps/routing/service.yaml`
- Create: `k3s/apps/routing/nodeport.yaml`
- Create: `k3s/apps/routing/kustomization.yaml`
- Modify: `k3s/apps/kustomization.yaml`
- [ ] **Step 1: `namespace.yaml`**
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: routing
```
- [ ] **Step 2: `deployment.yaml`**
The image tag will be bumped by CD; seed it with a placeholder that gets overwritten on first deploy.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: routing
  namespace: routing
spec:
  replicas: 1
  selector:
    matchLabels:
      app: routing
  template:
    metadata:
      labels:
        app: routing
    spec:
      nodeSelector:
        kubernetes.io/hostname: koala
      imagePullSecrets:
        - name: gitea-registry
      containers:
        - name: routing
          image: gitea.d-ma.be/mathias/routing:initial
          ports:
            - containerPort: 3210
          envFrom:
            - secretRef:
                name: routing-secrets
          env:
            - name: ROUTING_PORT
              value: "3210"
            - name: LITELLM_BASE_URL
              value: "http://piguard:4000"
            - name: BRAIN_URL
              value: "http://ingestion.supervisor:3300"
            - name: HYPERGUILD_LOCAL_MODEL
              value: "qwen35"
            - name: HYPERGUILD_CLAUDE_MODEL
              value: "claude-sonnet-4-6"
            - name: HYPERGUILD_ROUTE_LOCAL_FLOOR
              value: "0.90"
            - name: HYPERGUILD_ROUTE_LOCAL_CEIL
              value: "0.70"
            - name: HYPERGUILD_PASS_RATE_TTL_SECONDS
              value: "60"
          readinessProbe:
            httpGet:
              path: /healthz
              port: 3210
            initialDelaySeconds: 2
            periodSeconds: 10
```
The `gitea-registry` imagePullSecret needs to exist in the `routing` namespace. If only present in `supervisor`, copy it (Step 6 below).
- [ ] **Step 3: `service.yaml`**
```yaml
apiVersion: v1
kind: Service
metadata:
  name: routing
  namespace: routing
spec:
  selector:
    app: routing
  ports:
    - port: 3210
      targetPort: 3210
      protocol: TCP
```
- [ ] **Step 4: `nodeport.yaml`**
```yaml
apiVersion: v1
kind: Service
metadata:
  name: routing-nodeport
  namespace: routing
spec:
  type: NodePort
  selector:
    app: routing
  ports:
    - port: 3210
      targetPort: 3210
      nodePort: 30310
      protocol: TCP
```
- [ ] **Step 5: `kustomization.yaml`** (inside `k3s/apps/routing/`)
```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - namespace.yaml
  - deployment.yaml
  - service.yaml
  - nodeport.yaml
  - secrets.enc.yaml
```
`secrets.enc.yaml` is added in Task 12; reference it now so the directory is complete.
- [ ] **Step 6: Add `routing` to the apps `kustomization.yaml`**
Modify `k3s/apps/kustomization.yaml`:
```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - imagepullsecret
  - registry
  - gitea
  - infra-mcp
  - supervisor
  - cobalt-dingo
  - routing
```
If `imagepullsecret/` only seeds the secret in specific namespaces, ensure `routing` is added to that list — inspect `k3s/apps/imagepullsecret/` and follow the existing pattern.
- [ ] **Step 7: Validate manifest syntax with `kustomize build`**
```bash
cd ~/Documents/local-dev/AI/infra/.worktrees/mode-2-routing-pod
kustomize build k3s/apps/routing 2>&1 | head -20
```
Expected: valid YAML output, no errors. If `secrets.enc.yaml` is referenced but missing, suppress for now by temporarily commenting that line; uncomment in Task 12.
- [ ] **Step 8: Commit (do NOT push yet)**
```bash
git add k3s/apps/routing/ k3s/apps/kustomization.yaml
git commit -m "feat(routing): k3s manifests for the new pod"
```
Push happens after Task 12 (with the encrypted Secret) so the kustomization is consistent on first Flux apply.
---
## Task 12: Routing-secrets Secret + Flux verification
**Worktree:** infra
Encrypt and add the `routing-secrets` Secret. The Secret carries `LITELLM_API_KEY` (reused from supervisor's secret) and optionally a `ROUTING_MCP_TOKEN` for bearer auth.
**Files:**
- Create: `k3s/apps/routing/secrets.enc.yaml`
- [ ] **Step 1: Generate a token (or skip auth for first deploy)**
```bash
# generate (or omit ROUTING_MCP_TOKEN for unauthenticated first deploy):
openssl rand -hex 32
```
Record the value; it will be set in the operator's shell env when Mode 2 is exercised in any project.
- [ ] **Step 2: Decode the cluster's age key**
```bash
export SOPS_AGE_KEY="$(kubectl get secret sops-age -n flux-system -o jsonpath='{.data.age\.agekey}' | base64 -d)"
[ -n "$SOPS_AGE_KEY" ] && echo "age key loaded ($(echo -n "$SOPS_AGE_KEY" | wc -c) bytes)" || (echo "FAIL"; exit 1)
```
- [ ] **Step 3: Pull `LITELLM_API_KEY` value from the supervisor's secret**
Decrypt the supervisor's Secret to read the existing value:
```bash
LITELLM_API_KEY="$(sops -d k3s/apps/supervisor/secrets.enc.yaml | yq eval '.stringData.DMABE_LLMAPI_KEY' -)"
[ -n "$LITELLM_API_KEY" ] && echo "found litellm key" || (echo "FAIL: empty"; exit 1)
```
(`DMABE_LLMAPI_KEY` is the supervisor's name for the LiteLLM key — same value, different env-var name in the consumer.)
- [ ] **Step 4: Create the routing Secret**
```bash
cat > /tmp/routing-secrets.yaml <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: routing-secrets
  namespace: routing
type: Opaque
stringData:
  LITELLM_API_KEY: "${LITELLM_API_KEY}"
  ROUTING_MCP_TOKEN: "<paste-token-from-step-1-or-leave-empty>"
EOF
```
Edit `/tmp/routing-secrets.yaml` and paste the token (or leave the field as `""` for unauthenticated first deploy).
- [ ] **Step 5: Encrypt with SOPS**
```bash
sops --encrypt --age age15xez8pcmgg3daxpuqnye9ewawvzjtallheddcrq88ph573yle3nsr5hdq6 \
--encrypted-regex '^(stringData|data)$' \
/tmp/routing-secrets.yaml \
> k3s/apps/routing/secrets.enc.yaml
rm /tmp/routing-secrets.yaml
unset SOPS_AGE_KEY LITELLM_API_KEY
```
Verify the file:
```bash
head -10 k3s/apps/routing/secrets.enc.yaml
```
Expected: `apiVersion: v1`, `kind: Secret`, `stringData:` with `ENC[...]` values.
- [ ] **Step 6: `kustomize build` re-check**
```bash
kustomize build k3s/apps/routing | head -30
```
Expected: the namespace, the deployment, both services, and a Secret whose `stringData` values are encrypted. The build should succeed.
- [ ] **Step 7: Commit and push (this is the Flux activation)**
```bash
git add k3s/apps/routing/secrets.enc.yaml
git commit -m "feat(routing): SOPS-encrypted routing-secrets"
git pull --rebase origin main
git push origin main
```
`git pull --rebase` accommodates intervening CD-bot commits on `main` (per the auth-rollout precedent earlier today).
- [ ] **Step 8: Wait for Flux to reconcile**
```bash
NEW_SHA=$(git rev-parse HEAD)
until kubectl -n flux-system get kustomization apps -o jsonpath='{.status.lastAppliedRevision}' 2>/dev/null | grep -qE "${NEW_SHA:0:7}"; do
sleep 3
done
echo "Flux applied $NEW_SHA"
```
The pod will be in `ImagePullBackOff` because the `:initial` placeholder image doesn't exist yet — that's expected. The CD workflow (Task 10) will publish the real image and bump the tag.
- [ ] **Step 9: Verify expected partial state**
```bash
kubectl -n routing get all
```
Expected: deployment (0/1 ready), service, nodeport-service, and a pod in `ErrImagePull` or `ImagePullBackOff` until Task 10 runs end-to-end. (`kubectl get all` does not list the namespace itself.)
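To tell the expected image-pull failure apart from an unrelated problem such as a crash loop, the pod's waiting reason can be inspected directly. This is a sketch only: both function names are hypothetical, and the `app=routing` label is assumed to match the deployment's pod template, as in the troubleshooting commands later in this plan.

```shell
# Prints the waiting reason of the first routing pod's first container (needs cluster access).
routing_wait_reason() {
  kubectl -n routing get pod -l app=routing \
    -o jsonpath='{.items[0].status.containerStatuses[0].state.waiting.reason}'
}

# Succeeds when the reason is one of the two image-pull states expected at this stage.
is_expected_pull_state() {
  case "$1" in
    ErrImagePull|ImagePullBackOff) return 0 ;;
    *) return 1 ;;
  esac
}
```

A check such as `is_expected_pull_state "$(routing_wait_reason)" || echo "unexpected pod state"` would flag anything other than the two pull-related states.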
---
## Task 13: `task smoke:routing` live-contract test
**Worktree:** hyperguild
Boots the routing binary against the real `piguard:4000` LiteLLM and the real `koala:30330` brain, calls each of the four advertised tools once, and verifies that at least four `_routing` entries appear in the brain within the last hour.
**Files:**
- Create: `scripts/smoke-routing.sh`
- Modify: `Taskfile.yml`
- [ ] **Step 1: Write `scripts/smoke-routing.sh`**
```bash
#!/usr/bin/env bash
set -euo pipefail
# Boot the routing binary and exercise its four tools against live deps.
# Skipped when LITELLM_BASE_URL or BRAIN_URL is unreachable.
LITELLM_BASE_URL="${LITELLM_BASE_URL:-http://piguard:4000}"
BRAIN_URL="${BRAIN_URL:-http://koala:30330}"
if ! curl -sS --max-time 2 "${LITELLM_BASE_URL}/v1/models" >/dev/null 2>&1; then
  echo "SKIP: LITELLM at ${LITELLM_BASE_URL} unreachable"
  exit 0
fi
if ! curl -sS --max-time 2 "${BRAIN_URL}/query" -X POST -d '{"query":"x","k":1}' -H 'Content-Type: application/json' >/dev/null 2>&1; then
  echo "SKIP: BRAIN at ${BRAIN_URL} unreachable"
  exit 0
fi
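PORT=33310
BIN=$(mktemp)
# On any exit: stop the binary first (if it started), then remove the temp file.
trap 'kill "${BIN_PID:-}" 2>/dev/null || true; rm -f "$BIN"' EXIT
go build -o "$BIN" ./cmd/routing
LITELLM_BASE_URL="$LITELLM_BASE_URL" BRAIN_URL="$BRAIN_URL" \
  ROUTING_PORT="$PORT" SUPERVISOR_CONFIG_DIR="$(pwd)/config/supervisor" \
  "$BIN" &
BIN_PID=$!
# Wait up to ~5s for the binary to bind; fail loudly if it never does.
READY=0
for _ in $(seq 1 50); do
  if curl -sS "http://127.0.0.1:${PORT}/healthz" >/dev/null 2>&1; then
    READY=1
    break
  fi
  sleep 0.1
done
if [ "$READY" -ne 1 ]; then
  echo "FAIL: routing binary did not answer on :${PORT}"
  exit 1
fi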
call_tool() {
  local tool="$1"
  local args="$2"
  curl -sS -X POST "http://127.0.0.1:${PORT}/mcp" \
    -H 'Content-Type: application/json' \
    -d "{\"jsonrpc\":\"2.0\",\"id\":1,\"method\":\"tools/call\",\"params\":{\"name\":\"${tool}\",\"arguments\":${args}}}" \
    | jq -e '.result // .error' > /dev/null
}
echo "calling tools/list..."
curl -sS -X POST "http://127.0.0.1:${PORT}/mcp" \
-H 'Content-Type: application/json' \
-d '{"jsonrpc":"2.0","id":1,"method":"tools/list"}' \
| jq -r '.result.tools | map(.name) | sort | .[]'
echo "calling each tool..."
call_tool code_review '{"project_root":"/tmp","files":["README.md"],"session_id":"smoke-1"}'
call_tool debug '{"project_root":"/tmp","problem":"smoke test","session_id":"smoke-1"}'
call_tool retrospective '{"project_root":"/tmp","session_id":"smoke-1"}'
call_tool trainer '{"project_root":"/tmp","session_id":"smoke-1"}'
echo "checking brain has _routing entries..."
sleep 2
COUNT=$(curl -sS "${BRAIN_URL}/pass-rate?skill=_routing&window=1h" | jq -r '.total // 0')
if [ "${COUNT}" -lt 4 ]; then
  echo "FAIL: expected ≥4 _routing entries in last 1h, got ${COUNT}"
  exit 1
fi
echo "PASS: smoke:routing"
```
Make it executable:
```bash
chmod +x scripts/smoke-routing.sh
```
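The exact `arguments` shape per tool may need to be adjusted based on each skill's required fields. Note that `call_tool` accepts either a `result` or an `error` and discards the body, so a rejected call may only surface as a low `_routing` count; rerun the `curl` without the `jq` filter to see the response. If a smoke call returns a JSON-RPC error like "missing required argument", read the failing tool's `Tools()` definition in `internal/skills/<skill>/skill.go` and add the required field with a stub value.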
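When adjusting arguments, hand-escaped JSON inside a double-quoted string is easy to get wrong. One possible alternative is to let `jq` build the request body and to treat a JSON-RPC `error` as a failure. Both helpers below are hypothetical and not part of the script as written; the request shape mirrors the one `call_tool` already sends.

```shell
# Sketch: build a tools/call request with jq so quoting is handled for us.
# Fails if the arguments string is not valid JSON.
mcp_payload() {
  local tool="$1" args_json="$2"
  jq -cn --arg name "$tool" --argjson args "$args_json" \
    '{jsonrpc: "2.0", id: 1, method: "tools/call", params: {name: $name, arguments: $args}}'
}

# Reads a JSON-RPC response on stdin; succeeds only if it has a result and no error.
rpc_ok() {
  jq -e 'has("result") and (has("error") | not)' >/dev/null
}
```

A stricter call would then look like `curl -sS -X POST "http://127.0.0.1:${PORT}/mcp" -H 'Content-Type: application/json' -d "$(mcp_payload debug '{"project_root":"/tmp","problem":"smoke test","session_id":"smoke-1"}')" | rpc_ok`. Whether the smoke test should fail on tool-level errors is a design choice; the script in Step 1 deliberately tolerates them.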
- [ ] **Step 2: Add the Taskfile target**
In `Taskfile.yml`, append to the `tasks:` map:
```yaml
  smoke:routing:
    desc: Boot the routing pod against live LiteLLM + brain and verify _routing logs land
    cmds:
      - bash scripts/smoke-routing.sh
```
- [ ] **Step 3: Run it**
```bash
task smoke:routing
```
Expected: SKIP if offline; PASS otherwise.
- [ ] **Step 4: Commit**
```bash
git add scripts/smoke-routing.sh Taskfile.yml
git commit -m "test(routing): live-contract smoke target"
```
---
## Task 14: Documentation updates
**Worktree:** hyperguild
Update the project-level docs to describe Mode 2 + the new env vars + the routing-pod URL.
**Files:**
- Modify: `README.md`
- Modify: `.context/PROJECT.md`
- [ ] **Step 1: Update `README.md`'s "Key env vars" table**
Append:
```markdown
| `ROUTING_PORT` | `3210` | Routing pod's listen port |
| `ROUTING_MCP_TOKEN` | — | Optional bearer token for the routing MCP HTTP endpoint |
| `BRAIN_URL` | `http://ingestion.supervisor:3300` | Routing pod → brain (in-cluster) |
| `HYPERGUILD_LOCAL_MODEL` | `qwen35` | Local model for routed-to-local skill calls |
| `HYPERGUILD_CLAUDE_MODEL` | `claude-sonnet-4-6` | Claude model for routed-to-Claude skill calls |
| `HYPERGUILD_ROUTE_LOCAL_FLOOR` | `0.90` | Pass rate at or above this value routes to the local model |
| `HYPERGUILD_ROUTE_LOCAL_CEIL` | `0.70` | Pass rate below this value routes to Claude. Rates from CEIL up to (but not including) FLOOR fall in the sample band. |
| `HYPERGUILD_PASS_RATE_TTL_SECONDS` | `60` | Per-skill pass-rate cache TTL |
```
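The two thresholds in the table partition the pass rate into three bands. The sketch below illustrates that partition in shell for readers of the README; it is not the Go policy in `internal/routing`, the function name `route_for_pass_rate` is hypothetical, and what the pod actually does inside the sample band is not modeled here.

```shell
# Sketch: map a pass rate to a band using the documented defaults.
# usage: route_for_pass_rate <rate> [floor] [ceil]
route_for_pass_rate() {
  local rate="$1" floor="${2:-0.90}" ceil="${3:-0.70}"
  awk -v r="$rate" -v f="$floor" -v c="$ceil" 'BEGIN {
    if (r + 0 >= f + 0)      print "local"    # at or above FLOOR
    else if (r + 0 < c + 0)  print "claude"   # below CEIL
    else                     print "sample"   # CEIL <= rate < FLOOR
  }'
}
```

With the defaults, `route_for_pass_rate 0.95` prints `local`, `route_for_pass_rate 0.80` prints `sample`, and `route_for_pass_rate 0.50` prints `claude`.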
In the architecture diagram block at the top of the README, add the routing pod:
```
Your Claude Code session (in any project)
│ MCP over HTTP (Tailscale)
├──▶ supervisor :3200 (NodePort 30320 on koala) — skill workers: tdd, debug, spec, …
├──▶ routing :3210 (NodePort 30310 on koala) — Mode 2 only: code_review, debug, retrospective, trainer
└──▶ brain :3300 (NodePort 30330 on koala) — brain_query, brain_write, brain_ingest, session_log
```
- [ ] **Step 2: Update `.context/PROJECT.md`**
Find the "MCP endpoints" section and add a third bullet:
```markdown
- **`routing`** at `http://koala:30310/mcp` — Mode 2 routing pod. Advertises
the same four cost-routable skills as the supervisor (`code_review`,
`debug`, `retrospective`, `trainer`) but per-call decides whether to use
a local model or Claude based on the brain's `/pass-rate` response.
Bearer auth via `ROUTING_MCP_TOKEN` (opt-in). Only `mode client-local`
registers this endpoint; Mode 1 and Mode 3 do not.
```
- [ ] **Step 3: Run `task context:sync` so derived adapters update**
```bash
task context:sync
```
This regenerates `CLAUDE.md`, `AGENTS.md`, `.cursorrules`, `.aider.conventions.md`, and `.context/system-prompt.txt` from the canonical sources.
- [ ] **Step 4: `task check`**
```bash
task check 2>&1 | tail -10
```
Expected: drift check green (regenerated adapters tracked).
- [ ] **Step 5: Commit**
```bash
git add README.md .context/PROJECT.md CLAUDE.md AGENTS.md .cursorrules .aider.conventions.md .context/system-prompt.txt
git commit -m "docs(routing): document Mode 2 routing pod + env vars"
```
---
## Final verification before merge
After all 14 tasks land, on the hyperguild worktree's branch:
- [ ] **Run the full check chain**
```bash
cd ~/Documents/local-dev/AI/hyperguild/.worktrees/mode-2-routing-pod
task check 2>&1 | tail -15
```
Expected: 0 issues across lint, test, vet, drift, govulncheck.
- [ ] **Run smoke test if Tailscale available**
```bash
task smoke:routing
```
Expected: PASS or SKIP (with a clear reason).
- [ ] **Verify the snapshot test still passes**
The skill packages can drift between when the snapshot was captured and merge time. Re-run:
```bash
go test ./internal/routing/... -run TestToolsListMatchesSupervisorSnapshot -v
```
If it fails because of an intentional schema change in the merge window, re-capture the snapshot per Task 7's Step 1 and commit the update with a clear message.
- [ ] **Push the hyperguild branch and merge**
```bash
git push -u origin feat/mode-2-routing-pod
```
Open a PR (or merge to main if the workflow allows direct push). Once merged, Gitea CI builds the routing image and CD pushes the image-tag bump to the infra repo.
- [ ] **Verify Flux applies the new image and the pod becomes Ready**
```bash
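# Fetch first: after a PR merge, the local main ref may still point at the old commit.
git -C ~/Documents/local-dev/AI/hyperguild fetch origin main
NEW_SHA=$(git -C ~/Documents/local-dev/AI/hyperguild rev-parse origin/main)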
echo "Watching for image tag $NEW_SHA on routing deployment..."
until kubectl -n routing get deployment routing -o jsonpath='{.spec.template.spec.containers[0].image}' 2>/dev/null | grep -qE "${NEW_SHA:0:7}"; do
  sleep 5
done
kubectl -n routing rollout status deployment/routing --timeout=120s
```
Expected: deployment becomes `1/1 Ready` with the new image.
If the pod stays `Pending` or `ImagePullBackOff` past 2 minutes, check:
```bash
kubectl -n routing describe pod -l app=routing | tail -30
kubectl -n routing logs -l app=routing --tail=50
```
- [ ] **Final live verification**
```bash
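# tools/list should return 4 tools. If ROUTING_MCP_TOKEN is set on the pod, send it too,
# e.g. -H "Authorization: Bearer <token>" (standard bearer scheme, assumed here).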
curl -sS -X POST http://koala:30310/mcp \
-H 'Content-Type: application/json' \
-d '{"jsonrpc":"2.0","id":1,"method":"tools/list"}' \
| jq '.result.tools | length'
# expected: 4
# auth check (only meaningful if ROUTING_MCP_TOKEN is set on the pod)
curl -isS -X POST http://koala:30310/mcp \
-H 'Content-Type: application/json' \
-d '{"jsonrpc":"2.0","id":1,"method":"tools/list"}' | head -1
# expected: 401 if token set, 200 otherwise
```
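A count of 4 does not confirm that the right four tools are advertised. A slightly stricter check, sketched below, compares the sorted tool names against the list this plan expects; `check_tool_names` and `EXPECTED_TOOLS` are hypothetical names introduced for illustration.

```shell
# The four cost-routable skills this plan expects the routing pod to advertise.
EXPECTED_TOOLS="code_review debug retrospective trainer"

# Sketch: read a tools/list response on stdin and compare the sorted tool names.
check_tool_names() {
  local got
  got=$(jq -r '[.result.tools[].name] | sort | join(" ")') || return 1
  if [ "$got" != "$EXPECTED_TOOLS" ]; then
    echo "unexpected tool set: ${got}" >&2
    return 1
  fi
}
```

Piping the `tools/list` response from the first `curl` above into `check_tool_names` would fail on a missing, extra, or renamed tool, and also on an error response that has no `result`.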
- [ ] **Restart pod the Flux-friendly way if needed**
For any post-merge restart that doesn't ride a fresh image bump, use `kubectl delete pod` (not `kubectl rollout restart` — Flux strips the annotation):
```bash
kubectl -n routing delete pod -l app=routing
```
The existing ReplicaSet recreates the pod, picking up any Secret data changes on startup.