docs: model orchestration design spec for Phase 3
docs/superpowers/specs/2026-04-20-model-orchestration-design.md
# Model Orchestration Design

**Date:** 2026-04-20
**Status:** Approved for implementation

## Problem statement

The hyperguild supervisor currently spawns a `claude --print` subprocess for every skill call. The model routing config (`models.yaml`) exists but has no effect: the model name is injected as text into the task prompt and otherwise ignored. As a result, every skill call costs Claude tokens regardless of task complexity or data sensitivity.

## Goal

Route skill work to the most appropriate model, weighing cost, latency, and quality, with Claude acting as the real supervisor that verifies outputs and decides when to escalate. Local models on owned hardware handle the common case; Claude escalates through a chain to frontier models only when local quality is insufficient.

## Success criteria

- [ ] Each skill dispatches generation to its configured local model via LiteLLM by default
- [ ] Claude verifies every local output and either accepts or escalates
- [ ] Escalation walks a per-skill chain (local small → local large → Sonnet → Opus) with one attempt per tier
- [ ] Every attempt (model, tier, duration, warm state, verdict) is logged in the session JSONL
- [ ] Cloud tiers (Sonnet/Opus) self-certify — no separate verifier call
- [ ] Zero changes to skill handlers — they call `ExecutorFn` exactly as today
- [ ] `LiteLLMBaseURL` is already in config; no new env vars are required beyond `LLAMA_SWAP_URL`

## Constraints

- One attempt per tier before escalating (no retry within a tier)
- Anthropic T&C: Claude is called normally via Anthropic API; local models are called directly via LiteLLM HTTP — no API redirection
- `models.yaml` remains the single routing config file

## Out of scope

- Auto-rerouting based on real-time warm state (logged, not acted on — Phase 4)
- Multi-tenant / public service exposure
- RAG/CAG model boosting
- Managed Agent cloud delegation (chain stub only in Phase 3)

---

## Architecture

```
MCP tool call (Claude Code)
        ↓
Skill handler — calls ExecutorFn (unchanged)
        ↓
Orchestrator.Run (implements ExecutorFn)
  ├─ Resolve chain from models.yaml
  ├─ For each model in chain:
  │    ├─ [ollama/*] → LiteLLM executor → generate
  │    │        ↓
  │    │   Claude verifier (task + output + discipline)
  │    │     ├─ accept   → return Result (log attempt)
  │    │     └─ escalate → next tier (log attempt)
  │    │
  │    └─ [claude-*] → Claude executor (current) → generate + self-certify
  │          └─ return Result (log attempt)
  │
  └─ All tiers exhausted → return error to skill handler
```

Claude is always the verifier for local tiers. At cloud tiers, Claude generates and self-certifies, so the separate verifier call is skipped.

---

## Components

### 1. `internal/exec/litellm.go` — LiteLLM executor

Calls `POST /v1/chat/completions` on the configured LiteLLM server. Implements the same `ExecutorFn` signature as the existing claude executor.

```go
type LiteLLMExecutor struct {
	BaseURL    string
	APIKey     string
	HTTPClient *http.Client
	Timeout    time.Duration
}

func NewLiteLLM(baseURL, apiKey string, timeout time.Duration) *LiteLLMExecutor

func (e *LiteLLMExecutor) Run(ctx context.Context, req Request) (Result, error)
```

Request mapping:

- `req.SkillPrompt` → system message
- `req.TaskPrompt` → user message
- `req.Model` → `model` field in the chat completions request

Response handling: local models are prompted (via the discipline file output contract) to return a JSON object matching the `Result` schema. The executor attempts `json.Unmarshal` into `Result` directly, with no envelope unwrapping (unlike the claude `--output-format json` envelope). If unmarshalling fails, the executor returns an error, which the orchestrator treats as an automatic escalation trigger.

### 2. `internal/exec/verifier.go` — Claude verifier

A focused Claude call that judges local model output. Uses the existing `Executor` (claude subprocess) internally.

```go
type Verdict struct {
	Accept   bool   `json:"accept"`
	Feedback string `json:"feedback"` // reason if not accepting; empty if accept
}

type Verifier struct {
	executor *Executor // the existing claude executor
}

func NewVerifier(executor *Executor) *Verifier

func (v *Verifier) Verify(ctx context.Context, skillPrompt, taskPrompt string, output Result) (Verdict, error)
```

The verifier prompt gives Claude:

1. The skill discipline file (so it knows the iron laws and output contract)
2. The original task prompt (informed verification — Claude sees what was asked)
3. The generated output
4. A short instruction: "Does this output satisfy the discipline's iron laws and output contract? Reply with JSON: `{"accept": true|false, "feedback": "..."}`"

The verifier uses a lightweight JSON schema for its own output (a `Verdict` schema), keeping the call fast.

### 3. `internal/exec/orchestrator.go` — chain walker

Implements `ExecutorFn`. Walks the escalation chain, delegating generation and verification per tier.

```go
type Chain []ChainEntry

type ChainEntry struct {
	Model   string // e.g. "ollama/phi4", "claude-sonnet-4-5"
	Tier    string // "local" | "subagent" | "managed"
	IsCloud bool   // true for claude-* models; skips verifier
}

type Orchestrator struct {
	chain        Chain
	litellm      *LiteLLMExecutor
	claude       *Executor
	verifier     *Verifier
	llamaSwapURL string // for warm-state probe
}

func NewOrchestrator(chain Chain, litellm *LiteLLMExecutor, claude *Executor, verifier *Verifier, llamaSwapURL string) *Orchestrator

func (o *Orchestrator) Run(ctx context.Context, req Request) (Result, error)
```

Algorithm:

```
for each entry in chain:
    warm  = probe llama-swap (if local tier)
    start = now()
    if entry.IsCloud:
        result, err = claude.Run(ctx, req with entry.Model)
        duration = now() - start
        log attempt(model, tier, duration, warm, verified=(err == nil))
        if err == nil: return result
    else:
        result, err = litellm.Run(ctx, req with entry.Model)
        duration = now() - start
        if err != nil:
            log attempt(model, tier, duration, warm, verified=false)
            continue  // automatic escalation on parse/network error
        verdict, verr = verifier.Verify(ctx, req.SkillPrompt, req.TaskPrompt, result)
        if verr != nil:
            log attempt(model, tier, duration, warm, verified=false)
            continue  // verifier failure is treated as escalate (safe default)
        log attempt(model, tier, duration, warm, verified=verdict.Accept)
        if verdict.Accept: return result
        // inject verifier feedback into next tier's task prompt
        req.TaskPrompt = req.TaskPrompt + "\n\nPrior attempt feedback: " + verdict.Feedback

return error("all tiers exhausted")
```

### 4. `internal/config/models.go` — chain parser

Replaces the current single-model resolution with chain parsing.

Updated `models.yaml` format:

```yaml
verifier: claude-sonnet-4-6        # fixed verifier for all local tiers

llama_swap_url: http://koala:8080  # for warm-state probing

default_chain:
  - ollama/qwen3-coder-30b-tuned
  - claude-sonnet-4-5

skills:
  tdd:
    chain:
      - ollama/qwen3-coder-30b-tuned
      - claude-sonnet-4-5
  review:
    chain:
      - ollama/devstral-tuned
      - ollama/gemma4
      - claude-sonnet-4-5
  debug:
    chain:
      - ollama/deepseek-r1-tuned
      - claude-sonnet-4-5
  spec:
    chain:
      - ollama/phi4
      - ollama/gemma4
      - claude-sonnet-4-5
      - claude-opus-4-6
  retrospective:
    chain:
      - ollama/qwen3-coder-30b-tuned
      - claude-sonnet-4-5
  trainer:
    chain:
      - ollama/qwen3-coder-30b-tuned
      - claude-sonnet-4-5
```

The parser exposes:

```go
func (m *Models) ChainFor(skill string) Chain
func (m *Models) Verifier() string
func (m *Models) LlamaSwapURL() string
```

Caller override (`model` param in MCP tool call) pins the chain to a single entry: one model, no escalation. This preserves the existing override behaviour for power users.
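
A minimal sketch of the resolution order (override, then per-skill chain, then `default_chain`) is shown below. It works on an already-parsed in-memory structure, so YAML decoding is left out. The helper names `entryFor` and `resolveChain`, the extra `override` parameter, and the mapping of every `claude-*` model to the `subagent` tier are assumptions made for illustration, not the spec's API.

```go
package main

import "strings"

type ChainEntry struct {
	Model   string
	Tier    string
	IsCloud bool
}

// Models holds the parsed routing config (field names are assumed).
type Models struct {
	DefaultChain []string
	Skills       map[string][]string
}

// entryFor derives tier metadata from the model name. Treating all
// claude-* models as the "subagent" tier is an assumption.
func entryFor(model string) ChainEntry {
	if strings.HasPrefix(model, "claude-") {
		return ChainEntry{Model: model, Tier: "subagent", IsCloud: true}
	}
	return ChainEntry{Model: model, Tier: "local"}
}

// resolveChain returns a single pinned entry when override is set,
// otherwise the skill's chain, falling back to the default chain.
func (m *Models) resolveChain(skill, override string) []ChainEntry {
	if override != "" {
		return []ChainEntry{entryFor(override)}
	}
	names, ok := m.Skills[skill]
	if !ok || len(names) == 0 {
		names = m.DefaultChain
	}
	chain := make([]ChainEntry, 0, len(names))
	for _, n := range names {
		chain = append(chain, entryFor(n))
	}
	return chain
}
```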

### 5. `internal/session/session.go` — updated `Attempt` struct

```go
type Attempt struct {
	Attempt       int    `json:"attempt"`
	Model         string `json:"model"`
	Tier          string `json:"tier"` // local | subagent | managed
	DurationMs    int64  `json:"duration_ms"`
	WarmStart     bool   `json:"warm_start"` // model was already loaded in llama-swap
	Verified      bool   `json:"verified"`
	Verdict       string `json:"verdict,omitempty"`  // accept | escalate | error
	Feedback      string `json:"feedback,omitempty"` // verifier feedback on escalation
	OutputSummary string `json:"output_summary,omitempty"`
	RunnerOutput  string `json:"runner_output,omitempty"`
}
```
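
To make the logged shape concrete, the sketch below encodes one attempt with `encoding/json`. The struct is copied from the definition above; `encodeAttempt` is a hypothetical helper, and the sample values are borrowed from the `review` walkthrough later in this document.

```go
package main

import "encoding/json"

type Attempt struct {
	Attempt       int    `json:"attempt"`
	Model         string `json:"model"`
	Tier          string `json:"tier"`
	DurationMs    int64  `json:"duration_ms"`
	WarmStart     bool   `json:"warm_start"`
	Verified      bool   `json:"verified"`
	Verdict       string `json:"verdict,omitempty"`
	Feedback      string `json:"feedback,omitempty"`
	OutputSummary string `json:"output_summary,omitempty"`
	RunnerOutput  string `json:"runner_output,omitempty"`
}

// encodeAttempt (hypothetical) renders one attempt as compact JSON.
// Empty omitempty fields are dropped; the two bools are always present.
func encodeAttempt(a Attempt) (string, error) {
	b, err := json.Marshal(a)
	return string(b), err
}
```

For an escalated devstral attempt this produces `{"attempt":1,"model":"ollama/devstral-tuned","tier":"local","duration_ms":4200,"warm_start":true,"verified":false,"verdict":"escalate","feedback":"missing line references for all findings"}`. Because `warm_start` and `verified` carry no `omitempty`, a cold or unverified attempt is still recorded with an explicit `false`.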

### 6. `cmd/supervisor/main.go` — one wiring change

```go
// Before:
reg.Register(review.New(review.Config{ExecutorFn: executor.Run, ...}))

// After:
chain := models.ChainFor("review")
orch := exec.NewOrchestrator(chain, litellmExec, claudeExec, verifier, models.LlamaSwapURL())
reg.Register(review.New(review.Config{ExecutorFn: orch.Run, ...}))
```

One orchestrator per skill, sharing the same `litellmExec`, `claudeExec`, and `verifier` instances.

---

## Data flow example: `review` skill call

1. Claude Code calls `review` tool with `files: ["internal/foo.go"]`
2. Skill handler builds task prompt, calls `orch.Run`
3. Orchestrator resolves chain: `[devstral, gemma4, sonnet]`
4. Probes llama-swap: devstral is warm
5. LiteLLM calls devstral → returns JSON result
6. Verifier asks Claude: "does this review satisfy the iron laws?"
7. Claude: `{"accept": false, "feedback": "missing line references for all findings"}`
8. Orchestrator logs attempt #1 (devstral, local, 4200ms, warm, escalate)
9. Injects feedback into task prompt, calls gemma4
10. Verifier: `{"accept": true}`
11. Orchestrator logs attempt #2 (gemma4, local, 6100ms, cold, accept)
12. Returns result to skill handler → MCP response

Session JSONL records both attempts. The trail shows that devstral was warm but produced weak output, while gemma4 was cold but passed.

---

## Observability

Session JSONL is the primary store. Each `Entry.Attempts` slice records the full escalation trail. To analyse across sessions:

```bash
# Which models are escalating most?
jq -r '.attempts[] | select(.verdict == "escalate") | .model' brain/sessions/*.jsonl | sort | uniq -c

# Average latency per model
jq -r '.attempts[] | [.model, .duration_ms] | @tsv' brain/sessions/*.jsonl | awk '{sum[$1]+=$2; n[$1]++} END {for (m in sum) print m, sum[m]/n[m]}'

# Cold start frequency
jq -r '.attempts[] | select(.warm_start == false) | .model' brain/sessions/*.jsonl | sort | uniq -c
```

No new metrics infrastructure needed for Phase 3. Phase 4 can build a dashboard on top of this data.

---

## Error handling

| Scenario | Behaviour |
|----------|-----------|
| LiteLLM unreachable | Log attempt as error, escalate immediately |
| Local model returns unparseable JSON | Log attempt as error, escalate |
| Verifier call fails | Log, treat as escalate (safe default) |
| All tiers exhausted | Return error to skill handler; skill returns MCP error to caller |
| Caller passes `model` override | Single-entry chain, no escalation, no verifier call |

---

## Testing approach

- `TestLiteLLMExecutor`: mock HTTP server returning valid/invalid JSON; verify parse logic and error escalation
- `TestVerifier`: fake claude executor returning accept/escalate verdicts; verify prompt construction
- `TestOrchestrator`: table-driven — chains of 1/2/3 tiers, various accept/escalate/error combinations; verify attempt log contents and final result
- `TestModelsChainFor`: YAML parsing for all skill overrides and default_chain fallback
- Integration smoke test: start real LiteLLM (or mock), call `review` tool via MCP, verify attempt log written

---

## Risks

| Risk | Mitigation |
|------|------------|
| Local models ignore output contract → bad JSON | Discipline files already specify JSON output contract; parse failure auto-escalates |
| Verifier Claude call adds latency to every local attempt | Verifier prompt is small and fast; acceptable tradeoff for quality gate |
| llama-swap warm probe adds overhead | Probe is a single lightweight HTTP GET; timeout at 200ms, treat failure as `warm_start: false` |
| Chain exhaustion leaves caller with no result | Return structured error via MCP; caller can retry with explicit `model` override |