# Model Orchestration Design

**Date:** 2026-04-20  
**Status:** Approved for implementation

## Problem statement

The hyperguild supervisor currently spawns a `claude --print` subprocess for every skill call. The model routing config (`models.yaml`) exists but is dead weight — the model name is injected as text into the task prompt and ignored. Every skill call costs Claude tokens regardless of task complexity or data sensitivity.

## Goal

Route skill work to the most appropriate model — weighing cost, latency, and quality — with Claude acting as the real supervisor: verifying outputs and deciding when to escalate. Local models on owned hardware handle the common case; Claude escalates through a chain to frontier models only when local quality is insufficient.

## Success criteria

- [ ] Each skill dispatches generation to its configured local model via LiteLLM by default
- [ ] Claude verifies every local output and either accepts or escalates
- [ ] Escalation walks a per-skill chain (local small → local large → Sonnet → Opus) with one attempt per tier
- [ ] Every attempt (model, tier, duration, warm state, verdict) is logged in the session JSONL
- [ ] Cloud tiers (Sonnet/Opus) self-certify — no separate verifier call
- [ ] Zero changes to skill handlers — they call `ExecutorFn` exactly as today
- [ ] `LiteLTMBaseURL` already in config; no new env vars required beyond `LLAMA_SWAP_URL`

## Constraints

- One attempt per tier before escalating (no retry within a tier)
- Anthropic T&C: Claude is called normally via Anthropic API; local models are called directly via LiteLLM HTTP — no API redirection
- `models.yaml` remains the single routing config file

## Out of scope

- Auto-rerouting based on real-time warm state (logged, not acted on — Phase 4)
- Multi-tenant / public service exposure
- RAG/CAG model boosting
- Managed Agent cloud delegation (chain stub only in Phase 3)

---

## Architecture

```
MCP tool call (Claude Code)
    ↓
Skill handler — calls ExecutorFn (unchanged)
    ↓
Orchestrator.Run (implements ExecutorFn)
    ├─ Resolve chain from models.yaml
    ├─ For each model in chain:
    │   ├─ [ollama/*] → LiteLLM executor → generate
    │   │       ↓
    │   │   Claude verifier (task + output + discipline)
    │   │       ├─ accept  → return Result (log attempt)
    │   │       └─ escalate → next tier (log attempt)
    │   │
    │   └─ [claude-*] → Claude executor (current) → generate + self-certify
    │           └─ return Result (log attempt)
    │
    └─ All tiers exhausted → return best attempt with escalation note
```

Claude is always the verifier for local tiers. At cloud tiers, Claude generates and self-certifies — the verifier call is skipped.

---

## Components

### 1. `internal/exec/litellm.go` — LiteLLM executor

Calls `POST /v1/chat/completions` on the configured LiteLLM server. Implements the same `ExecutorFn` signature as the existing claude executor.

```go
type LiteLLMExecutor struct {
    BaseURL    string
    APIKey     string
    HTTPClient *http.Client
    Timeout    time.Duration
}

func NewLiteLLM(baseURL, apiKey string, timeout time.Duration) *LiteLLMExecutor

func (e *LiteLLMExecutor) Run(ctx context.Context, req Request) (Result, error)
```

Request mapping:
- `req.SkillPrompt` → system message
- `req.TaskPrompt` → user message
- `req.Model` → `model` field in the chat completions request

Response handling: local models are prompted (via the discipline file output contract) to return a JSON object matching the `Result` schema. The executor attempts `json.Unmarshal` into `Result` directly — no envelope unwrapping needed (unlike the `--output-format json` claude envelope). If unmarshalling fails, the executor returns an error that the orchestrator treats as an automatic escalation trigger.

### 2. `internal/exec/verifier.go` — Claude verifier

A focused Claude call that judges local model output. Uses the existing `Executor` (claude subprocess) internally.

```go
type Verdict struct {
    Accept   bool   `json:"accept"`
    Feedback string `json:"feedback"` // reason if not accepting; empty if accept
}

type Verifier struct {
    executor *Executor // the existing claude executor
}

func NewVerifier(executor *Executor) *Verifier

func (v *Verifier) Verify(ctx context.Context, skillPrompt, taskPrompt string, output Result) (Verdict, error)
```

The verifier prompt gives Claude:
1. The skill discipline file (so it knows the iron laws and output contract)
2. The original task prompt (informed verification — Claude sees what was asked)
3. The generated output
4. A short instruction: "Does this output satisfy the discipline's iron laws and output contract? Reply with JSON: `{\"accept\": true|false, \"feedback\": \"...\"}`"

The verifier uses a lightweight JSON schema for its own output (a `Verdict` schema), keeping the call fast.

### 3. `internal/exec/orchestrator.go` — chain walker

Implements `ExecutorFn`. Walks the escalation chain, delegating generation and verification per tier.

```go
type Chain []ChainEntry

type ChainEntry struct {
    Model    string // e.g. "ollama/phi4", "claude-sonnet-4-5"
    Tier     string // "local" | "subagent" | "managed"
    IsCloud  bool   // true for claude-* models; skips verifier
}

type Orchestrator struct {
    chain    Chain
    litellm  *LiteLLMExecutor
    claude   *Executor
    verifier *Verifier
    llamaSwapURL string // for warm-state probe
}

func NewOrchestrator(chain Chain, litellm *LiteLLMExecutor, claude *Executor, verifier *Verifier, llamaSwapURL string) *Orchestrator

func (o *Orchestrator) Run(ctx context.Context, req Request) (Result, error)
```

Algorithm:
```
for each entry in chain:
    warm = probe llama-swap (if local tier)
    start = now()
    if entry.IsCloud:
        result, err = claude.Run(ctx, req with entry.Model)
        log attempt(model, tier, duration, warm, verified=true)
        if err == nil: return result
    else:
        result, err = litellm.Run(ctx, req with entry.Model)
        duration = now() - start
        if err != nil:
            log attempt(model, tier, duration, warm, verified=false)
            continue  // automatic escalation on parse/network error
        verdict = verifier.Verify(ctx, req.SkillPrompt, req.TaskPrompt, result)
        log attempt(model, tier, duration, warm, verified=verdict.Accept)
        if verdict.Accept: return result
        // inject verifier feedback into next tier's task prompt
        req.TaskPrompt = req.TaskPrompt + "\n\nPrior attempt feedback: " + verdict.Feedback

return error("all tiers exhausted")
```

### 4. `internal/config/models.go` — chain parser

Replaces the current single-model resolution with chain parsing.

Updated `models.yaml` format:

```yaml
verifier: claude-sonnet-4-6   # fixed verifier for all local tiers

llama_swap_url: http://koala:8080   # for warm-state probing

default_chain:
  - ollama/qwen3-coder-30b-tuned
  - claude-sonnet-4-5

skills:
  tdd:
    chain:
      - ollama/qwen3-coder-30b-tuned
      - claude-sonnet-4-5
  review:
    chain:
      - ollama/devstral-tuned
      - ollama/gemma4
      - claude-sonnet-4-5
  debug:
    chain:
      - ollama/deepseek-r1-tuned
      - claude-sonnet-4-5
  spec:
    chain:
      - ollama/phi4
      - ollama/gemma4
      - claude-sonnet-4-5
      - claude-opus-4-6
  retrospective:
    chain:
      - ollama/qwen3-coder-30b-tuned
      - claude-sonnet-4-5
  trainer:
    chain:
      - ollama/qwen3-coder-30b-tuned
      - claude-sonnet-4-5
```

The parser exposes:
```go
func (m *Models) ChainFor(skill string) Chain
func (m *Models) Verifier() string
func (m *Models) LlamaSwapURL() string
```

Caller override (`model` param in MCP tool call) pins the chain to a single entry — one model, no escalation. This preserves the existing override behaviour for power users.

### 5. `internal/session/session.go` — updated `Attempt` struct

```go
type Attempt struct {
    Attempt       int    `json:"attempt"`
    Model         string `json:"model"`
    Tier          string `json:"tier"`          // local | subagent | managed
    DurationMs    int64  `json:"duration_ms"`
    WarmStart     bool   `json:"warm_start"`    // model was already loaded in llama-swap
    Verified      bool   `json:"verified"`
    Verdict       string `json:"verdict,omitempty"` // accept | escalate | error
    Feedback      string `json:"feedback,omitempty"` // verifier feedback on escalation
    OutputSummary string `json:"output_summary,omitempty"`
    RunnerOutput  string `json:"runner_output,omitempty"`
}
```

### 6. `cmd/supervisor/main.go` — one wiring change

```go
// Before:
reg.Register(review.New(review.Config{ExecutorFn: executor.Run, ...}))

// After:
chain := models.ChainFor("review")
orch := exec.NewOrchestrator(chain, litellmExec, claudeExec, verifier, models.LlamaSwapURL())
reg.Register(review.New(review.Config{ExecutorFn: orch.Run, ...}))
```

One orchestrator per skill, sharing the same `litellmExec`, `claudeExec`, and `verifier` instances.

---

## Data flow example: `review` skill call

1. Claude Code calls `review` tool with `files: ["internal/foo.go"]`
2. Skill handler builds task prompt, calls `orch.Run`
3. Orchestrator resolves chain: `[devstral, gemma4, sonnet]`
4. Probes llama-swap: devstral is warm
5. LiteLLM calls devstral → returns JSON result
6. Verifier asks Claude: "does this review satisfy the iron laws?"
7. Claude: `{"accept": false, "feedback": "missing line references for all findings"}`
8. Orchestrator logs attempt #1 (devstral, local, 4200ms, warm, escalate)
9. Injects feedback into task prompt, calls gemma4
10. Verifier: `{"accept": true}`
11. Orchestrator logs attempt #2 (gemma4, local, 6100ms, cold, accept)
12. Returns result to skill handler → MCP response

Session JSONL records both attempts. You can see: devstral was warm but produced weak output; gemma4 was cold but passed.

---

## Observability

Session JSONL is the primary store. Each `Entry.Attempts` slice records the full escalation trail. To analyse across sessions:

```bash
# Which models are escalating most?
jq -r '.attempts[] | select(.verdict == "escalate") | .model' brain/sessions/*.jsonl | sort | uniq -c

# Average latency per model
jq -r '.attempts[] | [.model, .duration_ms] | @tsv' brain/sessions/*.jsonl | awk '{sum[$1]+=$2; n[$1]++} END {for (m in sum) print m, sum[m]/n[m]}'

# Cold start frequency
jq -r '.attempts[] | select(.warm_start == false) | .model' brain/sessions/*.jsonl | sort | uniq -c
```

No new metrics infrastructure needed for Phase 3. Phase 4 can build a dashboard on top of this data.

---

## Error handling

| Scenario | Behaviour |
|----------|-----------|
| LiteLLM unreachable | Log attempt as error, escalate immediately |
| Local model returns unparseable JSON | Log attempt as error, escalate |
| Verifier call fails | Log, treat as escalate (safe default) |
| All tiers exhausted | Return error to skill handler; skill returns MCP error to caller |
| Caller passes `model` override | Single-entry chain, no escalation, no verifier call |

---

## Testing approach

- `TestLiteLLMExecutor`: mock HTTP server returning valid/invalid JSON; verify parse logic and error escalation
- `TestVerifier`: fake claude executor returning accept/escalate verdicts; verify prompt construction
- `TestOrchestrator`: table-driven — chains of 1/2/3 tiers, various accept/escalate/error combinations; verify attempt log contents and final result
- `TestModelsChainFor`: YAML parsing for all skill overrides and default_chain fallback
- Integration smoke test: start real LiteLLM (or mock), call `review` tool via MCP, verify attempt log written

---

## Risks

| Risk | Mitigation |
|------|------------|
| Local models ignore output contract → bad JSON | Discipline files already specify JSON output contract; parse failure auto-escalates |
| Verifier Claude call adds latency to every local attempt | Verifier prompt is small and fast; acceptable tradeoff for quality gate |
| llama-swap warm probe adds overhead | Probe is a single lightweight HTTP GET; timeout at 200ms, treat failure as `warm_start: false` |
| Chain exhaustion leaves caller with no result | Return structured error via MCP; caller can retry with explicit `model` override |