# Model Orchestration Design

**Date:** 2026-04-20
**Status:** Approved for implementation

## Problem statement

The hyperguild supervisor currently spawns a `claude --print` subprocess for every skill call. The model routing config (`models.yaml`) exists but is dead weight: the model name is injected as text into the task prompt and ignored. Every skill call costs Claude tokens regardless of task complexity or data sensitivity.

## Goal

Route skill work to the most appropriate model, weighing cost, latency, and quality, with Claude acting as the real supervisor: verifying outputs and deciding when to escalate. Local models on owned hardware handle the common case; Claude escalates through a chain to frontier models only when local quality is insufficient.

## Success criteria

- [ ] Each skill dispatches generation to its configured local model via LiteLLM by default
- [ ] Claude verifies every local output and either accepts or escalates
- [ ] Escalation walks a per-skill chain (local small → local large → Sonnet → Opus) with one attempt per tier
- [ ] Every attempt (model, tier, duration, warm state, verdict) is logged in the session JSONL
- [ ] Cloud tiers (Sonnet/Opus) self-certify, with no separate verifier call
- [ ] Zero changes to skill handlers; they call `ExecutorFn` exactly as today
- [ ] `LiteLLMBaseURL` already in config; no new env vars required beyond `LLAMA_SWAP_URL`

## Constraints

- One attempt per tier before escalating (no retry within a tier)
- Anthropic T&C: Claude is called normally via Anthropic API; local models are called directly via LiteLLM HTTP, with no API redirection
- `models.yaml` remains the single routing config file

## Out of scope

- Auto-rerouting based on real-time warm state (logged, not acted on; Phase 4)
- Multi-tenant / public service exposure
- RAG/CAG model boosting
- Managed Agent cloud delegation (chain stub only in Phase 3)

---

## Architecture

```
MCP tool call (Claude Code)
    ↓
Skill handler: calls ExecutorFn (unchanged)
    ↓
Orchestrator.Run (implements ExecutorFn)
    ├─ Resolve chain from models.yaml
    ├─ For each model in chain:
    │   ├─ [ollama/*] → LiteLLM executor → generate
    │   │       ↓
    │   │   Claude verifier (task + output + discipline)
    │   │       ├─ accept   → return Result (log attempt)
    │   │       └─ escalate → next tier (log attempt)
    │   │
    │   └─ [claude-*] → Claude executor (current) → generate + self-certify
    │       └─ return Result (log attempt)
    │
    └─ All tiers exhausted → return error to skill handler
```

Claude is always the verifier for local tiers. At cloud tiers, Claude generates and self-certifies, so the verifier call is skipped.

---

## Components

### 1. `internal/exec/litellm.go`: LiteLLM executor

Calls `POST /v1/chat/completions` on the configured LiteLLM server. Implements the same `ExecutorFn` signature as the existing claude executor.

```go
// LiteLLMExecutor sends generation requests to a LiteLLM server over HTTP.
type LiteLLMExecutor struct {
	BaseURL    string
	APIKey     string
	HTTPClient *http.Client
	Timeout    time.Duration
}

func NewLiteLLM(baseURL, apiKey string, timeout time.Duration) *LiteLLMExecutor

// Run has the same shape as ExecutorFn, so skill handlers need no changes.
func (e *LiteLLMExecutor) Run(ctx context.Context, req Request) (Result, error)
```

Request mapping:

- `req.SkillPrompt` → system message
- `req.TaskPrompt` → user message
- `req.Model` → `model` field in the chat completions request

Response handling: local models are prompted (via the discipline file output contract) to return a JSON object matching the `Result` schema. The executor attempts `json.Unmarshal` into `Result` directly; no envelope unwrapping is needed (unlike the `--output-format json` claude envelope). If unmarshalling fails, the executor returns an error that the orchestrator treats as an automatic escalation trigger.

### 2. `internal/exec/verifier.go`: Claude verifier

A focused Claude call that judges local model output. Uses the existing `Executor` (claude subprocess) internally.

```go
type Verdict struct {
	Accept   bool   `json:"accept"`
	Feedback string `json:"feedback"` // reason if not accepting; empty if accept
}

type Verifier struct {
	executor *Executor // the existing claude executor
}

func NewVerifier(executor *Executor) *Verifier
func (v *Verifier) Verify(ctx context.Context, skillPrompt, taskPrompt string, output Result) (Verdict, error)
```

The verifier prompt gives Claude:

1. The skill discipline file (so it knows the iron laws and output contract)
2. The original task prompt (informed verification: Claude sees what was asked)
3. The generated output
4. A short instruction: "Does this output satisfy the discipline's iron laws and output contract? Reply with JSON: `{\"accept\": true|false, \"feedback\": \"...\"}`"

The verifier uses a lightweight JSON schema for its own output (a `Verdict` schema), keeping the call fast.

### 3. `internal/exec/orchestrator.go`: chain walker

Implements `ExecutorFn`. Walks the escalation chain, delegating generation and verification per tier.

```go
type Chain []ChainEntry

type ChainEntry struct {
	Model   string // e.g. "ollama/phi4", "claude-sonnet-4-5"
	Tier    string // "local" | "subagent" | "managed"
	IsCloud bool   // true for claude-* models; skips verifier
}

type Orchestrator struct {
	chain        Chain
	litellm      *LiteLLMExecutor
	claude       *Executor
	verifier     *Verifier
	llamaSwapURL string // for warm-state probe
}

func NewOrchestrator(chain Chain, litellm *LiteLLMExecutor, claude *Executor, verifier *Verifier, llamaSwapURL string) *Orchestrator
func (o *Orchestrator) Run(ctx context.Context, req Request) (Result, error)
```

Algorithm:

```
for each entry in chain:
    warm  = probe llama-swap (if local tier)
    start = now()

    if entry.IsCloud:
        result, err = claude.Run(ctx, req with entry.Model)
        duration = now() - start
        log attempt(model, tier, duration, warm, verified=(err == nil))
        if err == nil: return result
        continue  // cloud call failed; move to the next tier

    result, err = litellm.Run(ctx, req with entry.Model)
    duration = now() - start
    if err != nil:
        log attempt(model, tier, duration, warm, verified=false)
        continue  // automatic escalation on parse/network error

    verdict, err = verifier.Verify(ctx, req.SkillPrompt, req.TaskPrompt, result)
    if err != nil: verdict = {Accept: false}  // verifier failure is treated as escalate
    log attempt(model, tier, duration, warm, verified=verdict.Accept)
    if verdict.Accept: return result

    // inject verifier feedback into next tier's task prompt
    req.TaskPrompt = req.TaskPrompt + "\n\nPrior attempt feedback: " + verdict.Feedback

return error("all tiers exhausted")
```

### 4. `internal/config/models.go`: chain parser

Replaces the current single-model resolution with chain parsing.

Updated `models.yaml` format:

```yaml
verifier: claude-sonnet-4-6        # fixed verifier for all local tiers
llama_swap_url: http://koala:8080  # for warm-state probing

default_chain:
  - ollama/qwen3-coder-30b-tuned
  - claude-sonnet-4-5

skills:
  tdd:
    chain:
      - ollama/qwen3-coder-30b-tuned
      - claude-sonnet-4-5
  review:
    chain:
      - ollama/devstral-tuned
      - ollama/gemma4
      - claude-sonnet-4-5
  debug:
    chain:
      - ollama/deepseek-r1-tuned
      - claude-sonnet-4-5
  spec:
    chain:
      - ollama/phi4
      - ollama/gemma4
      - claude-sonnet-4-5
      - claude-opus-4-6
  retrospective:
    chain:
      - ollama/qwen3-coder-30b-tuned
      - claude-sonnet-4-5
  trainer:
    chain:
      - ollama/qwen3-coder-30b-tuned
      - claude-sonnet-4-5
```

The parser exposes:

```go
func (m *Models) ChainFor(skill string) Chain
func (m *Models) Verifier() string
func (m *Models) LlamaSwapURL() string
```

Caller override (`model` param in MCP tool call) pins the chain to a single entry: one model, no escalation. This preserves the existing override behaviour for power users.

### 5. `internal/session/session.go`: updated `Attempt` struct

```go
type Attempt struct {
	Attempt       int    `json:"attempt"`
	Model         string `json:"model"`
	Tier          string `json:"tier"` // local | subagent | managed
	DurationMs    int64  `json:"duration_ms"`
	WarmStart     bool   `json:"warm_start"` // model was already loaded in llama-swap
	Verified      bool   `json:"verified"`
	Verdict       string `json:"verdict,omitempty"`  // accept | escalate | error
	Feedback      string `json:"feedback,omitempty"` // verifier feedback on escalation
	OutputSummary string `json:"output_summary,omitempty"`
	RunnerOutput  string `json:"runner_output,omitempty"`
}
```

### 6. `cmd/supervisor/main.go`: one wiring change

```go
// Before:
reg.Register(review.New(review.Config{ExecutorFn: executor.Run, ...}))

// After:
chain := models.ChainFor("review")
orch := exec.NewOrchestrator(chain, litellmExec, claudeExec, verifier, models.LlamaSwapURL())
reg.Register(review.New(review.Config{ExecutorFn: orch.Run, ...}))
```

One orchestrator per skill, sharing the same `litellmExec`, `claudeExec`, and `verifier` instances.

---

## Data flow example: `review` skill call

1. Claude Code calls `review` tool with `files: ["internal/foo.go"]`
2. Skill handler builds task prompt, calls `orch.Run`
3. Orchestrator resolves chain: `[devstral, gemma4, sonnet]`
4. Probes llama-swap: devstral is warm
5. LiteLLM calls devstral → returns JSON result
6. Verifier asks Claude: "does this review satisfy the iron laws?"
7. Claude: `{"accept": false, "feedback": "missing line references for all findings"}`
8. Orchestrator logs attempt #1 (devstral, local, 4200ms, warm, escalate)
9. Injects feedback into task prompt, calls gemma4
10. Verifier: `{"accept": true}`
11. Orchestrator logs attempt #2 (gemma4, local, 6100ms, cold, accept)
12. Returns result to skill handler → MCP response

Session JSONL records both attempts. You can see: devstral was warm but produced weak output; gemma4 was cold but passed.

---

## Observability

Session JSONL is the primary store. Each `Entry.Attempts` slice records the full escalation trail. To analyse across sessions:

```bash
# Which models are escalating most?
jq -r '.attempts[] | select(.verdict == "escalate") | .model' brain/sessions/*.jsonl | sort | uniq -c

# Average latency per model
jq -r '.attempts[] | [.model, .duration_ms] | @tsv' brain/sessions/*.jsonl \
  | awk '{sum[$1]+=$2; n[$1]++} END {for (m in sum) print m, sum[m]/n[m]}'

# Cold start frequency
jq -r '.attempts[] | select(.warm_start == false) | .model' brain/sessions/*.jsonl | sort | uniq -c
```

No new metrics infrastructure needed for Phase 3. Phase 4 can build a dashboard on top of this data.

---

## Error handling

| Scenario | Behaviour |
|----------|-----------|
| LiteLLM unreachable | Log attempt as error, escalate immediately |
| Local model returns unparseable JSON | Log attempt as error, escalate |
| Verifier call fails | Log, treat as escalate (safe default) |
| All tiers exhausted | Return error to skill handler; skill returns MCP error to caller |
| Caller passes `model` override | Single-entry chain, no escalation, no verifier call |

---

## Testing approach

- `TestLiteLLMExecutor`: mock HTTP server returning valid/invalid JSON; verify parse logic and error escalation
- `TestVerifier`: fake claude executor returning accept/escalate verdicts; verify prompt construction
- `TestOrchestrator`: table-driven, covering chains of 1/2/3 tiers and various accept/escalate/error combinations; verify attempt log contents and final result
- `TestModelsChainFor`: YAML parsing for all skill overrides and default_chain fallback
- Integration smoke test: start real LiteLLM (or mock), call `review` tool via MCP, verify attempt log written

---

## Risks

| Risk | Mitigation |
|------|------------|
| Local models ignore output contract → bad JSON | Discipline files already specify JSON output contract; parse failure auto-escalates |
| Verifier Claude call adds latency to every local attempt | Verifier prompt is small and fast; acceptable tradeoff for quality gate |
| llama-swap warm probe adds overhead | Probe is a single lightweight HTTP GET; timeout at 200ms, treat failure as `warm_start: false` |
| Chain exhaustion leaves caller with no result | Return structured error via MCP; caller can retry with explicit `model` override |