12 KiB
Model Orchestration Design
Date: 2026-04-20
Status: Approved for implementation
Problem statement
The hyperguild supervisor currently spawns a claude --print subprocess for every skill call. The model routing config (models.yaml) exists but is dead weight — the model name is injected as text into the task prompt and ignored. Every skill call costs Claude tokens regardless of task complexity or data sensitivity.
Goal
Route skill work to the most appropriate model — weighing cost, latency, and quality — with Claude acting as the real supervisor: verifying outputs and deciding when to escalate. Local models on owned hardware handle the common case; Claude escalates through a chain to frontier models only when local quality is insufficient.
Success criteria
- Each skill dispatches generation to its configured local model via LiteLLM by default
- Claude verifies every local output and either accepts or escalates
- Escalation walks a per-skill chain (local small → local large → Sonnet → Opus) with one attempt per tier
- Every attempt (model, tier, duration, warm state, verdict) is logged in the session JSONL
- Cloud tiers (Sonnet/Opus) self-certify — no separate verifier call
- Zero changes to skill handlers — they call
ExecutorFnexactly as today LiteLTMBaseURLalready in config; no new env vars required beyondLLAMA_SWAP_URL
Constraints
- One attempt per tier before escalating (no retry within a tier)
- Anthropic T&C: Claude is called normally via Anthropic API; local models are called directly via LiteLLM HTTP — no API redirection
models.yamlremains the single routing config file
Out of scope
- Auto-rerouting based on real-time warm state (logged, not acted on — Phase 4)
- Multi-tenant / public service exposure
- RAG/CAG model boosting
- Managed Agent cloud delegation (chain stub only in Phase 3)
Architecture
MCP tool call (Claude Code)
↓
Skill handler — calls ExecutorFn (unchanged)
↓
Orchestrator.Run (implements ExecutorFn)
├─ Resolve chain from models.yaml
├─ For each model in chain:
│ ├─ [ollama/*] → LiteLLM executor → generate
│ │ ↓
│ │ Claude verifier (task + output + discipline)
│ │ ├─ accept → return Result (log attempt)
│ │ └─ escalate → next tier (log attempt)
│ │
│ └─ [claude-*] → Claude executor (current) → generate + self-certify
│ └─ return Result (log attempt)
│
└─ All tiers exhausted → return best attempt with escalation note
Claude is always the verifier for local tiers. At cloud tiers, Claude generates and self-certifies — the verifier call is skipped.
Components
1. internal/exec/litellm.go — LiteLLM executor
Calls POST /v1/chat/completions on the configured LiteLLM server. Implements the same ExecutorFn signature as the existing claude executor.
type LiteLLMExecutor struct {
BaseURL string
APIKey string
HTTPClient *http.Client
Timeout time.Duration
}
func NewLiteLLM(baseURL, apiKey string, timeout time.Duration) *LiteLLMExecutor
func (e *LiteLLMExecutor) Run(ctx context.Context, req Request) (Result, error)
Request mapping:
req.SkillPrompt→ system messagereq.TaskPrompt→ user messagereq.Model→modelfield in the chat completions request
Response handling: local models are prompted (via the discipline file output contract) to return a JSON object matching the Result schema. The executor attempts json.Unmarshal into Result directly — no envelope unwrapping needed (unlike the --output-format json claude envelope). If unmarshalling fails, the executor returns an error that the orchestrator treats as an automatic escalation trigger.
2. internal/exec/verifier.go — Claude verifier
A focused Claude call that judges local model output. Uses the existing Executor (claude subprocess) internally.
type Verdict struct {
Accept bool `json:"accept"`
Feedback string `json:"feedback"` // reason if not accepting; empty if accept
}
type Verifier struct {
executor *Executor // the existing claude executor
}
func NewVerifier(executor *Executor) *Verifier
func (v *Verifier) Verify(ctx context.Context, skillPrompt, taskPrompt string, output Result) (Verdict, error)
The verifier prompt gives Claude:
- The skill discipline file (so it knows the iron laws and output contract)
- The original task prompt (informed verification — Claude sees what was asked)
- The generated output
- A short instruction: "Does this output satisfy the discipline's iron laws and output contract? Reply with JSON:
{\"accept\": true|false, \"feedback\": \"...\"}"
The verifier uses a lightweight JSON schema for its own output (a Verdict schema), keeping the call fast.
3. internal/exec/orchestrator.go — chain walker
Implements ExecutorFn. Walks the escalation chain, delegating generation and verification per tier.
type Chain []ChainEntry
type ChainEntry struct {
Model string // e.g. "ollama/phi4", "claude-sonnet-4-5"
Tier string // "local" | "subagent" | "managed"
IsCloud bool // true for claude-* models; skips verifier
}
type Orchestrator struct {
chain Chain
litellm *LiteLLMExecutor
claude *Executor
verifier *Verifier
llamaSwapURL string // for warm-state probe
}
func NewOrchestrator(chain Chain, litellm *LiteLLMExecutor, claude *Executor, verifier *Verifier, llamaSwapURL string) *Orchestrator
func (o *Orchestrator) Run(ctx context.Context, req Request) (Result, error)
Algorithm:
for each entry in chain:
warm = probe llama-swap (if local tier)
start = now()
if entry.IsCloud:
result, err = claude.Run(ctx, req with entry.Model)
log attempt(model, tier, duration, warm, verified=true)
if err == nil: return result
else:
result, err = litellm.Run(ctx, req with entry.Model)
duration = now() - start
if err != nil:
log attempt(model, tier, duration, warm, verified=false)
continue // automatic escalation on parse/network error
verdict = verifier.Verify(ctx, req.SkillPrompt, req.TaskPrompt, result)
log attempt(model, tier, duration, warm, verified=verdict.Accept)
if verdict.Accept: return result
// inject verifier feedback into next tier's task prompt
req.TaskPrompt = req.TaskPrompt + "\n\nPrior attempt feedback: " + verdict.Feedback
return error("all tiers exhausted")
4. internal/config/models.go — chain parser
Replaces the current single-model resolution with chain parsing.
Updated models.yaml format:
verifier: claude-sonnet-4-6 # fixed verifier for all local tiers
llama_swap_url: http://koala:8080 # for warm-state probing
default_chain:
- ollama/qwen3-coder-30b-tuned
- claude-sonnet-4-5
skills:
tdd:
chain:
- ollama/qwen3-coder-30b-tuned
- claude-sonnet-4-5
review:
chain:
- ollama/devstral-tuned
- ollama/gemma4
- claude-sonnet-4-5
debug:
chain:
- ollama/deepseek-r1-tuned
- claude-sonnet-4-5
spec:
chain:
- ollama/phi4
- ollama/gemma4
- claude-sonnet-4-5
- claude-opus-4-6
retrospective:
chain:
- ollama/qwen3-coder-30b-tuned
- claude-sonnet-4-5
trainer:
chain:
- ollama/qwen3-coder-30b-tuned
- claude-sonnet-4-5
The parser exposes:
func (m *Models) ChainFor(skill string) Chain
func (m *Models) Verifier() string
func (m *Models) LlamaSwapURL() string
Caller override (model param in MCP tool call) pins the chain to a single entry — one model, no escalation. This preserves the existing override behaviour for power users.
5. internal/session/session.go — updated Attempt struct
type Attempt struct {
Attempt int `json:"attempt"`
Model string `json:"model"`
Tier string `json:"tier"` // local | subagent | managed
DurationMs int64 `json:"duration_ms"`
WarmStart bool `json:"warm_start"` // model was already loaded in llama-swap
Verified bool `json:"verified"`
Verdict string `json:"verdict,omitempty"` // accept | escalate | error
Feedback string `json:"feedback,omitempty"` // verifier feedback on escalation
OutputSummary string `json:"output_summary,omitempty"`
RunnerOutput string `json:"runner_output,omitempty"`
}
6. cmd/supervisor/main.go — one wiring change
// Before:
reg.Register(review.New(review.Config{ExecutorFn: executor.Run, ...}))
// After:
chain := models.ChainFor("review")
orch := exec.NewOrchestrator(chain, litellmExec, claudeExec, verifier, models.LlamaSwapURL())
reg.Register(review.New(review.Config{ExecutorFn: orch.Run, ...}))
One orchestrator per skill, sharing the same litellmExec, claudeExec, and verifier instances.
Data flow example: review skill call
- Claude Code calls
reviewtool withfiles: ["internal/foo.go"] - Skill handler builds task prompt, calls
orch.Run - Orchestrator resolves chain:
[devstral, gemma4, sonnet] - Probes llama-swap: devstral is warm
- LiteLLM calls devstral → returns JSON result
- Verifier asks Claude: "does this review satisfy the iron laws?"
- Claude:
{"accept": false, "feedback": "missing line references for all findings"} - Orchestrator logs attempt #1 (devstral, local, 4200ms, warm, escalate)
- Injects feedback into task prompt, calls gemma4
- Verifier:
{"accept": true} - Orchestrator logs attempt #2 (gemma4, local, 6100ms, cold, accept)
- Returns result to skill handler → MCP response
Session JSONL records both attempts. You can see: devstral was warm but produced weak output; gemma4 was cold but passed.
Observability
Session JSONL is the primary store. Each Entry.Attempts slice records the full escalation trail. To analyse across sessions:
# Which models are escalating most?
jq -r '.attempts[] | select(.verdict == "escalate") | .model' brain/sessions/*.jsonl | sort | uniq -c
# Average latency per model
jq -r '.attempts[] | [.model, .duration_ms] | @tsv' brain/sessions/*.jsonl | awk '{sum[$1]+=$2; n[$1]++} END {for (m in sum) print m, sum[m]/n[m]}'
# Cold start frequency
jq -r '.attempts[] | select(.warm_start == false) | .model' brain/sessions/*.jsonl | sort | uniq -c
No new metrics infrastructure needed for Phase 3. Phase 4 can build a dashboard on top of this data.
Error handling
| Scenario | Behaviour |
|---|---|
| LiteLLM unreachable | Log attempt as error, escalate immediately |
| Local model returns unparseable JSON | Log attempt as error, escalate |
| Verifier call fails | Log, treat as escalate (safe default) |
| All tiers exhausted | Return error to skill handler; skill returns MCP error to caller |
Caller passes model override |
Single-entry chain, no escalation, no verifier call |
Testing approach
TestLiteLLMExecutor: mock HTTP server returning valid/invalid JSON; verify parse logic and error escalationTestVerifier: fake claude executor returning accept/escalate verdicts; verify prompt constructionTestOrchestrator: table-driven — chains of 1/2/3 tiers, various accept/escalate/error combinations; verify attempt log contents and final resultTestModelsChainFor: YAML parsing for all skill overrides and default_chain fallback- Integration smoke test: start real LiteLLM (or mock), call
reviewtool via MCP, verify attempt log written
Risks
| Risk | Mitigation |
|---|---|
| Local models ignore output contract → bad JSON | Discipline files already specify JSON output contract; parse failure auto-escalates |
| Verifier Claude call adds latency to every local attempt | Verifier prompt is small and fast; acceptable tradeoff for quality gate |
| llama-swap warm probe adds overhead | Probe is a single lightweight HTTP GET; timeout at 200ms, treat failure as warm_start: false |
| Chain exhaustion leaves caller with no result | Return structured error via MCP; caller can retry with explicit model override |