mathias/hyperguild

Mathias Bergqvist 76f195de2a

docs: model orchestration design spec for Phase 3

2026-04-20 07:45:32 +02:00

Model Orchestration Design

Date: 2026-04-20
Status: Approved for implementation

Problem statement

The hyperguild supervisor currently spawns a claude --print subprocess for every skill call. The model routing config (models.yaml) exists but is dead weight — the model name is injected as text into the task prompt and ignored. Every skill call costs Claude tokens regardless of task complexity or data sensitivity.

Goal

Route skill work to the most appropriate model — weighing cost, latency, and quality — with Claude acting as the real supervisor: verifying outputs and deciding when to escalate. Local models on owned hardware handle the common case; Claude escalates through a chain to frontier models only when local quality is insufficient.

Success criteria

Each skill dispatches generation to its configured local model via LiteLLM by default
Claude verifies every local output and either accepts or escalates
Escalation walks a per-skill chain (local small → local large → Sonnet → Opus) with one attempt per tier
Every attempt (model, tier, duration, warm state, verdict) is logged in the session JSONL
Cloud tiers (Sonnet/Opus) self-certify — no separate verifier call
Zero changes to skill handlers — they call ExecutorFn exactly as today
LiteLLMBaseURL is already in config; no new env vars are required beyond LLAMA_SWAP_URL

Constraints

One attempt per tier before escalating (no retry within a tier)
Anthropic T&C: Claude is called normally via Anthropic API; local models are called directly via LiteLLM HTTP — no API redirection
models.yaml remains the single routing config file

Out of scope

Auto-rerouting based on real-time warm state (logged, not acted on — Phase 4)
Multi-tenant / public service exposure
RAG/CAG model boosting
Managed Agent cloud delegation (chain stub only in Phase 3)

Architecture

MCP tool call (Claude Code)
    ↓
Skill handler — calls ExecutorFn (unchanged)
    ↓
Orchestrator.Run (implements ExecutorFn)
    ├─ Resolve chain from models.yaml
    ├─ For each model in chain:
    │   ├─ [ollama/*] → LiteLLM executor → generate
    │   │       ↓
    │   │   Claude verifier (task + output + discipline)
    │   │       ├─ accept  → return Result (log attempt)
    │   │       └─ escalate → next tier (log attempt)
    │   │
    │   └─ [claude-*] → Claude executor (current) → generate + self-certify
    │           └─ return Result (log attempt)
    │
    └─ All tiers exhausted → return error to skill handler (every attempt logged)

Claude is always the verifier for local tiers. At cloud tiers, Claude generates and self-certifies — the verifier call is skipped.

Components

1. `internal/exec/litellm.go` — LiteLLM executor

Calls POST /v1/chat/completions on the configured LiteLLM server. Implements the same ExecutorFn signature as the existing claude executor.

type LiteLLMExecutor struct {
    BaseURL    string
    APIKey     string
    HTTPClient *http.Client
    Timeout    time.Duration
}

func NewLiteLLM(baseURL, apiKey string, timeout time.Duration) *LiteLLMExecutor

func (e *LiteLLMExecutor) Run(ctx context.Context, req Request) (Result, error)

Request mapping:

req.SkillPrompt → system message
req.TaskPrompt → user message
req.Model → model field in the chat completions request

Response handling: local models are prompted (via the discipline file output contract) to return a JSON object matching the Result schema. The executor attempts json.Unmarshal into Result directly — no envelope unwrapping needed (unlike the --output-format json claude envelope). If unmarshalling fails, the executor returns an error that the orchestrator treats as an automatic escalation trigger.
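The request mapping and response handling above can be sketched as two pure functions. This is a minimal illustration, not the planned implementation: it assumes LiteLLM returns the OpenAI-compatible response shape (`choices[0].message.content`), and the `Summary` field on `Result` is a hypothetical stand-in because the real Result schema is not listed in this document.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// Request carries only the three fields the spec maps; the real struct may hold more.
type Request struct {
	SkillPrompt string
	TaskPrompt  string
	Model       string
}

// Result is a placeholder for the real Result schema. Summary is hypothetical.
type Result struct {
	Summary string `json:"summary"`
}

type chatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Model    string        `json:"model"`
	Messages []chatMessage `json:"messages"`
}

type chatResponse struct {
	Choices []struct {
		Message chatMessage `json:"message"`
	} `json:"choices"`
}

// buildChatRequest applies the mapping: SkillPrompt -> system, TaskPrompt -> user.
func buildChatRequest(req Request) chatRequest {
	return chatRequest{
		Model: req.Model,
		Messages: []chatMessage{
			{Role: "system", Content: req.SkillPrompt},
			{Role: "user", Content: req.TaskPrompt},
		},
	}
}

// parseResult decodes the response body and unmarshals the first choice's
// content straight into Result. Any failure is returned as an error, which
// the orchestrator would treat as an escalation trigger.
func parseResult(body []byte) (Result, error) {
	var resp chatResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return Result{}, fmt.Errorf("decode chat response: %w", err)
	}
	if len(resp.Choices) == 0 {
		return Result{}, errors.New("chat response has no choices")
	}
	var res Result
	if err := json.Unmarshal([]byte(resp.Choices[0].Message.Content), &res); err != nil {
		return Result{}, fmt.Errorf("model output is not valid Result JSON: %w", err)
	}
	return res, nil
}
```

Keeping the parse step separate from the HTTP call makes the "invalid JSON escalates" path easy to cover in TestLiteLLMExecutor without a server.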

2. `internal/exec/verifier.go` — Claude verifier

A focused Claude call that judges local model output. Uses the existing Executor (claude subprocess) internally.

type Verdict struct {
    Accept   bool   `json:"accept"`
    Feedback string `json:"feedback"` // reason if not accepting; empty if accept
}

type Verifier struct {
    executor *Executor // the existing claude executor
}

func NewVerifier(executor *Executor) *Verifier

func (v *Verifier) Verify(ctx context.Context, skillPrompt, taskPrompt string, output Result) (Verdict, error)

The verifier prompt gives Claude:

The skill discipline file (so it knows the iron laws and output contract)
The original task prompt (informed verification — Claude sees what was asked)
The generated output
A short instruction: "Does this output satisfy the discipline's iron laws and output contract? Reply with JSON: {\"accept\": true|false, \"feedback\": \"...\"}"

The verifier uses a lightweight JSON schema for its own output (a Verdict schema), keeping the call fast.

3. `internal/exec/orchestrator.go` — chain walker

Implements ExecutorFn. Walks the escalation chain, delegating generation and verification per tier.

type Chain []ChainEntry

type ChainEntry struct {
    Model    string // e.g. "ollama/phi4", "claude-sonnet-4-5"
    Tier     string // "local" | "subagent" | "managed"
    IsCloud  bool   // true for claude-* models; skips verifier
}

type Orchestrator struct {
    chain        Chain
    litellm      *LiteLLMExecutor
    claude       *Executor
    verifier     *Verifier
    llamaSwapURL string // for warm-state probe
}

func NewOrchestrator(chain Chain, litellm *LiteLLMExecutor, claude *Executor, verifier *Verifier, llamaSwapURL string) *Orchestrator

func (o *Orchestrator) Run(ctx context.Context, req Request) (Result, error)

Algorithm:

for each entry in chain:
    warm = probe llama-swap (local tiers only; false for cloud)
    start = now()
    if entry.IsCloud:
        result, err = claude.Run(ctx, req with entry.Model)
        duration = now() - start
        log attempt(model, tier, duration, warm, verified=(err == nil))  // cloud tiers self-certify
        if err == nil: return result
        continue  // cloud call failed: try the next tier, if any
    else:
        result, err = litellm.Run(ctx, req with entry.Model)
        duration = now() - start
        if err != nil:
            log attempt(model, tier, duration, warm, verified=false)
            continue  // automatic escalation on parse/network error
        verdict, verr = verifier.Verify(ctx, req.SkillPrompt, req.TaskPrompt, result)
        if verr != nil: verdict.Accept = false  // verifier failure is treated as escalate
        log attempt(model, tier, duration, warm, verified=verdict.Accept)
        if verdict.Accept: return result
        // inject verifier feedback into next tier's task prompt
        req.TaskPrompt = req.TaskPrompt + "\n\nPrior attempt feedback: " + verdict.Feedback

return error("all tiers exhausted")

4. `internal/config/models.go` — chain parser

Replaces the current single-model resolution with chain parsing.

Updated models.yaml format:

verifier: claude-sonnet-4-6   # fixed verifier for all local tiers

llama_swap_url: http://koala:8080   # for warm-state probing

default_chain:
  - ollama/qwen3-coder-30b-tuned
  - claude-sonnet-4-5

skills:
  tdd:
    chain:
      - ollama/qwen3-coder-30b-tuned
      - claude-sonnet-4-5
  review:
    chain:
      - ollama/devstral-tuned
      - ollama/gemma4
      - claude-sonnet-4-5
  debug:
    chain:
      - ollama/deepseek-r1-tuned
      - claude-sonnet-4-5
  spec:
    chain:
      - ollama/phi4
      - ollama/gemma4
      - claude-sonnet-4-5
      - claude-opus-4-6
  retrospective:
    chain:
      - ollama/qwen3-coder-30b-tuned
      - claude-sonnet-4-5
  trainer:
    chain:
      - ollama/qwen3-coder-30b-tuned
      - claude-sonnet-4-5

The parser exposes:

func (m *Models) ChainFor(skill string) Chain
func (m *Models) Verifier() string
func (m *Models) LlamaSwapURL() string

Caller override (model param in MCP tool call) pins the chain to a single entry — one model, no escalation. This preserves the existing override behaviour for power users.
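A small sketch of chain resolution over already-parsed YAML, covering the `default_chain` fallback and the caller override. Several details are assumptions rather than decisions from this spec: the exported field names on `Models`, the `resolveChain` helper, and in particular the mapping of `claude-*` models to the `"subagent"` tier.

```go
package main

import "strings"

type ChainEntry struct {
	Model   string
	Tier    string
	IsCloud bool
}

type Chain []ChainEntry

// Models holds the parsed models.yaml content (YAML decoding omitted here).
type Models struct {
	DefaultChain []string
	Skills       map[string][]string
}

// entryFor derives the chain entry from the model name alone.
func entryFor(model string) ChainEntry {
	cloud := strings.HasPrefix(model, "claude-")
	tier := "local"
	if cloud {
		tier = "subagent" // assumption: the spec does not say which tier claude-* maps to
	}
	return ChainEntry{Model: model, Tier: tier, IsCloud: cloud}
}

// ChainFor returns the skill's chain, or default_chain when the skill has none.
func (m *Models) ChainFor(skill string) Chain {
	names, ok := m.Skills[skill]
	if !ok || len(names) == 0 {
		names = m.DefaultChain
	}
	chain := make(Chain, 0, len(names))
	for _, n := range names {
		chain = append(chain, entryFor(n))
	}
	return chain
}

// resolveChain is a hypothetical helper: a non-empty caller override pins the
// chain to that single model, so no escalation can happen.
func resolveChain(m *Models, skill, override string) Chain {
	if override != "" {
		return Chain{entryFor(override)}
	}
	return m.ChainFor(skill)
}
```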

5. `internal/session/session.go` — updated `Attempt` struct

type Attempt struct {
    Attempt       int    `json:"attempt"`
    Model         string `json:"model"`
    Tier          string `json:"tier"`          // local | subagent | managed
    DurationMs    int64  `json:"duration_ms"`
    WarmStart     bool   `json:"warm_start"`    // model was already loaded in llama-swap
    Verified      bool   `json:"verified"`
    Verdict       string `json:"verdict,omitempty"` // accept | escalate | error
    Feedback      string `json:"feedback,omitempty"` // verifier feedback on escalation
    OutputSummary string `json:"output_summary,omitempty"`
    RunnerOutput  string `json:"runner_output,omitempty"`
}
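To make the JSON tags concrete, the snippet below encodes one attempt the way it would appear inside a session JSONL entry. The `encodeAttempt` helper is illustrative only; the struct is copied from this spec.

```go
package main

import "encoding/json"

type Attempt struct {
	Attempt       int    `json:"attempt"`
	Model         string `json:"model"`
	Tier          string `json:"tier"`
	DurationMs    int64  `json:"duration_ms"`
	WarmStart     bool   `json:"warm_start"`
	Verified      bool   `json:"verified"`
	Verdict       string `json:"verdict,omitempty"`
	Feedback      string `json:"feedback,omitempty"`
	OutputSummary string `json:"output_summary,omitempty"`
	RunnerOutput  string `json:"runner_output,omitempty"`
}

// encodeAttempt returns the single-line JSON form of an attempt.
func encodeAttempt(a Attempt) (string, error) {
	b, err := json.Marshal(a)
	if err != nil {
		return "", err
	}
	return string(b), nil
}
```

For an escalated devstral attempt (attempt 1, tier local, 4200 ms, warm, feedback "missing line references for all findings") this yields:

    {"attempt":1,"model":"ollama/devstral-tuned","tier":"local","duration_ms":4200,"warm_start":true,"verified":false,"verdict":"escalate","feedback":"missing line references for all findings"}

Since `warm_start` and `verified` have no `omitempty`, a `false` value is always written, which is what queries filtering on `.warm_start == false` rely on.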

6. `cmd/supervisor/main.go` — one wiring change

// Before:
reg.Register(review.New(review.Config{ExecutorFn: executor.Run, ...}))

// After:
chain := models.ChainFor("review")
orch := exec.NewOrchestrator(chain, litellmExec, claudeExec, verifier, models.LlamaSwapURL())
reg.Register(review.New(review.Config{ExecutorFn: orch.Run, ...}))

One orchestrator per skill, sharing the same litellmExec, claudeExec, and verifier instances.

Data flow example: `review` skill call

Claude Code calls review tool with files: ["internal/foo.go"]
Skill handler builds task prompt, calls orch.Run
Orchestrator resolves chain: [devstral, gemma4, sonnet]
Probes llama-swap: devstral is warm
LiteLLM calls devstral → returns JSON result
Verifier asks Claude: "does this review satisfy the iron laws?"
Claude: {"accept": false, "feedback": "missing line references for all findings"}
Orchestrator logs attempt #1 (devstral, local, 4200ms, warm, escalate)
Injects feedback into task prompt, calls gemma4
Verifier: {"accept": true}
Orchestrator logs attempt #2 (gemma4, local, 6100ms, cold, accept)
Returns result to skill handler → MCP response

Session JSONL records both attempts. You can see: devstral was warm but produced weak output; gemma4 was cold but passed.

Observability

Session JSONL is the primary store. Each Entry.Attempts slice records the full escalation trail. To analyse across sessions:

# Which models are escalating most?
jq -r '.attempts[] | select(.verdict == "escalate") | .model' brain/sessions/*.jsonl | sort | uniq -c

# Average latency per model
jq -r '.attempts[] | [.model, .duration_ms] | @tsv' brain/sessions/*.jsonl | awk '{sum[$1]+=$2; n[$1]++} END {for (m in sum) print m, sum[m]/n[m]}'

# Cold start frequency
jq -r '.attempts[] | select(.warm_start == false) | .model' brain/sessions/*.jsonl | sort | uniq -c

No new metrics infrastructure needed for Phase 3. Phase 4 can build a dashboard on top of this data.

Error handling

Scenario	Behaviour
LiteLLM unreachable	Log attempt as error, escalate immediately
Local model returns unparseable JSON	Log attempt as error, escalate
Verifier call fails	Log, treat as escalate (safe default)
All tiers exhausted	Return error to skill handler; skill returns MCP error to caller
Caller passes `model` override	Single-entry chain, no escalation, no verifier call

Testing approach

TestLiteLLMExecutor: mock HTTP server returning valid/invalid JSON; verify parse logic and error escalation
TestVerifier: fake claude executor returning accept/escalate verdicts; verify prompt construction
TestOrchestrator: table-driven — chains of 1/2/3 tiers, various accept/escalate/error combinations; verify attempt log contents and final result
TestModelsChainFor: YAML parsing for all skill overrides and default_chain fallback
Integration smoke test: start real LiteLLM (or mock), call review tool via MCP, verify attempt log written

Risks

Risk	Mitigation
Local models ignore output contract → bad JSON	Discipline files already specify JSON output contract; parse failure auto-escalates
Verifier Claude call adds latency to every local attempt	Verifier prompt is small and fast; acceptable tradeoff for quality gate
llama-swap warm probe adds overhead	Probe is a single lightweight HTTP GET; timeout at 200ms, treat failure as `warm_start: false`
Chain exhaustion leaves caller with no result	Return structured error via MCP; caller can retry with explicit `model` override
