Mathias 2b7bbe38c7
docs(eval): record M4 + M4b scorer runs — phase 2 gate cleared (infra#72)
Tier-weighted retrieval against the qa-2026-05.md 20-question set:

| run                            | top-1 | top-3 |
|--------------------------------|-------|-------|
| baseline (pre-phase-1)         | 20%   | 65%   |
| post phase 1 (parser+content)  | 20%   | 70%   |
| post M4 (tier weighting)       | 30%   | 75%   |
| post M4b (entities → K tier)   | 35%   | 80%   |

Net lift over the pre-phase-1 baseline: +15pt top-1, +15pt top-3
(measured from post phase 1: +15pt top-1, +10pt top-3), which clears
the ≥10pt close-gate set in infra#72.

Three remaining misses are content-keyword issues, not structure
issues (the questions don't share enough lexical surface with the
target entries to surface via BM25 alone). Vector search would
help here but the iguana embedder is off-mesh (see infra#64).
2026-05-25 18:51:29 +02:00

hyperguild

An MCP server that acts as a disciplined AI supervisor for Claude Code sessions. Instead of letting Claude Code do whatever it wants, hyperguild enforces structured workflows (TDD red/green/refactor), logs every session, and accumulates learnings into a searchable brain.

How it works

Your Claude Code session (in any project)
    │
    │  MCP over HTTP (Tailscale)
    ├──▶ supervisor  :3200 (NodePort 30320 on koala) — skill workers: tdd, debug, spec, …
    ├──▶ routing     :3210 (NodePort 30310 on koala) — Mode 2 only: review, debug, retrospective, trainer
    └──▶ brain       :3300 (NodePort 30330 on koala) — brain_query, brain_write, brain_ingest, session_log
                       │
                       └─ also serves the legacy REST endpoints (/query, /write, /ingest, …)
    │
    ▼
brain/
├── sessions/       — JSONL log, one file per session_id
├── wiki/           — searchable knowledge (full-text)
│   ├── concepts/
│   ├── entities/
│   └── sources/
├── raw/            — retrospective output, staged for review
└── training-data/  — SFT/DPO/RL data (Phase 2)

Phase 1 tools (available now)

| Tool          | What it does                                                             |
|---------------|--------------------------------------------------------------------------|
| tdd_red       | Writes a failing test for a spec, verifies it fails                      |
| tdd_green     | Writes the minimal implementation to make tests pass                     |
| tdd_refactor  | Cleans up implementation while keeping tests green                       |
| session_log   | Appends a structured entry to the session JSONL log                      |
| retrospective | Reads the session log, identifies novel learnings, writes to brain/raw/  |
| brain_query   | Full-text search over brain/wiki/                                        |
| brain_write   | Writes a note to brain/raw/ (with optional YAML frontmatter)             |
| tier          | Returns the current connectivity tier (1=cloud, 2=LAN, 3=offline)        |

Start the servers

# Requires goreman: go install github.com/mattn/goreman@latest
task start    # starts ingestion (:3300) + supervisor (:3200) via goreman
task stop     # kills both by port

Connect a project

Create .mcp.json in your project root:

{
  "mcpServers": {
    "supervisor": {
      "type": "http",
      "url": "http://koala:30320/mcp"
    },
    "brain": {
      "type": "http",
      "url": "http://koala:30330/mcp"
    }
  }
}

Two MCP servers are exposed today, both reachable over Tailscale:

  • supervisor at koala:30320 — skill workers (tdd_red/green/refactor, review, debug, spec, retrospective, trainer, tier).
  • brain at koala:30330 — knowledge access (brain_query, brain_write, brain_ingest, brain_ingest_raw) and session_log. Hosted by the ingestion service directly, no separate pod.

No local binary or stdio shim is required — Claude Code talks to both via HTTP.

Open Claude Code in your project — run /mcp to confirm both servers are listed.

A typical TDD session

1. Call tdd_red    → spec in, failing test file out
2. Call tdd_green  → test path in, implementation out
3. Call tdd_refactor → impl + test in, cleaned code out
4. Call session_log  → log each phase result
5. Call retrospective → extracts learnings → brain/raw/
6. Review brain/raw/, move worthy notes to brain/wiki/concepts/
7. Future sessions: call brain_query to retrieve relevant context

Tier detection

The supervisor probes connectivity at call time:

| Tier | Label       | Condition                           |
|------|-------------|-------------------------------------|
| 1    | full-online | Can reach api.anthropic.com         |
| 2    | lan-only    | Can reach LiteLLM but not Anthropic |
| 3    | airplane    | No external connectivity            |

Key env vars

| Variable                         | Default                          | Purpose |
|----------------------------------|----------------------------------|---------|
| INGEST_BRAIN_DIR                 | ../brain                         | Brain directory for ingestion server |
| INGEST_PORT                      | 3300                             | Ingestion server port |
| SUPERVISOR_CONFIG_DIR            | ./config/supervisor              | Skill discipline files |
| SUPERVISOR_SESSIONS_DIR          | ./brain/sessions                 | JSONL session logs |
| INGEST_BASE_URL                  | http://localhost:3300            | Supervisor → ingestion |
| LITELLM_BASE_URL                 |                                  | LiteLLM proxy for Tier 2 model routing |
| SUPERVISOR_MCP_TOKEN             |                                  | Optional bearer token for the supervisor MCP HTTP endpoint; when empty, no auth is enforced |
| ROUTING_PORT                     | 3210                             | Routing pod's listen port |
| ROUTING_MCP_TOKEN                |                                  | Optional bearer token for the routing MCP HTTP endpoint |
| BRAIN_URL                        | http://ingestion.supervisor:3300 | Routing pod → brain (in-cluster) |
| HYPERGUILD_FAST_MODEL            | koala/qwen35-9b-fast             | Fast model for high-pass-rate skill calls |
| HYPERGUILD_THINKING_MODEL        | iguana/gemma4-26b                | Thinking model for low-pass-rate skill calls |
| HYPERGUILD_ROUTE_LOCAL_FLOOR     | 0.90                             | At or above this pass rate, route to the fast model |
| HYPERGUILD_ROUTE_LOCAL_CEIL      | 0.70                             | Below this pass rate, route to the thinking model. Pass rates from CEIL up to (but not including) FLOOR fall in the sample band. |
| HYPERGUILD_PASS_RATE_TTL_SECONDS | 60                               | Per-skill pass-rate cache TTL |

Operator note: LiteLLM at LITELLM_BASE_URL must register both HYPERGUILD_FAST_MODEL and HYPERGUILD_THINKING_MODEL for routing to do useful work. LiteLLM returns 4xx for a model it does not know. If neither model is registered, the routing pod's fast route fails, the fail-open retry on the thinking model fails too, and the only signal is final_status: "fail" on _routing entries in the brain.

Phase 2 (planned)

  • review skill — structured code review with iron law enforcement
  • debug skill — hypothesis-driven debugging sessions
  • spec skill — generates specs from conversations
  • trainer — extracts SFT/DPO pairs from session logs for fine-tuning