brain_write with a custom filename omitted the .md extension, causing search to skip the file (search.go filters on HasSuffix .md). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1618 lines
49 KiB
Markdown
1618 lines
49 KiB
Markdown
# Model Orchestration Implementation Plan
|
||
|
||
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
|
||
|
||
**Goal:** Route skill work to per-skill escalation chains (local → Claude) with Claude verifying local output and self-certifying at cloud tier.
|
||
|
||
**Architecture:** A new `Orchestrator` type implements the same `ExecutorFn` signature used by all skill handlers — zero handler changes. For each tier in the chain, the orchestrator dispatches generation (LiteLLM for local, claude subprocess for cloud), runs Claude verification on local output, logs the attempt, and escalates on failure. The spec is at `docs/superpowers/specs/2026-04-20-model-orchestration-design.md`.
|
||
|
||
**Tech Stack:** Go stdlib `net/http` for LiteLLM calls; existing `exec.Executor` claude subprocess for cloud tier and verification; `gopkg.in/yaml.v3` (already imported) for config parsing.
|
||
|
||
---
|
||
|
||
## File structure
|
||
|
||
| Action | File | Responsibility |
|
||
|--------|------|----------------|
|
||
| Modify | `internal/session/session.go` | Add Tier, DurationMs, WarmStart, Verdict, Feedback to Attempt |
|
||
| Modify | `internal/exec/executor.go` | Add `--model` flag to subprocess call when req.Model starts with "claude-" |
|
||
| Create | `internal/exec/litellm.go` | HTTP client to LiteLLM `/v1/chat/completions`; returns `Result` |
|
||
| Create | `internal/exec/litellm_test.go` | Mock HTTP server tests for parse/error/escalation paths |
|
||
| Create | `internal/exec/verifier.go` | Claude subprocess that returns `Verdict{Accept, Feedback}` |
|
||
| Create | `internal/exec/verifier_test.go` | Fake claude binary tests for accept/escalate/error |
|
||
| Create | `internal/exec/orchestrator.go` | Chain walker; warm probe; logging; implements ExecutorFn shape |
|
||
| Create | `internal/exec/orchestrator_test.go` | Table-driven: 1/2/3-tier chains, all outcome combinations |
|
||
| Modify | `internal/config/models.go` | Chain-aware YAML struct; ChainFor/Verifier/LlamaSwapURL methods |
|
||
| Modify | `internal/config/models_test.go` | Update for new YAML format; add ChainFor/override tests |
|
||
| Modify | `config/models.yaml` | New chain format for all 6 skills |
|
||
| Modify | `cmd/supervisor/main.go` | Create LiteLLMExecutor + Verifier; wire per-skill Orchestrators |
|
||
|
||
---
|
||
|
||
### Task 1: Extend the Attempt struct
|
||
|
||
**Files:**
|
||
- Modify: `internal/session/session.go:32-38`
|
||
|
||
The current Attempt struct is missing tier, timing, and verdict fields. Adding them is additive (existing JSONL files deserialise fine with zero values).
|
||
|
||
- [ ] **Step 1: Write the failing test**
|
||
|
||
Add to `internal/session/session_test.go` (if it exists, create it otherwise):
|
||
|
||
```go
|
||
package session_test
|
||
|
||
import (
|
||
"encoding/json"
|
||
"testing"
|
||
|
||
"github.com/mathiasbq/supervisor/internal/session"
|
||
"github.com/stretchr/testify/assert"
|
||
"github.com/stretchr/testify/require"
|
||
)
|
||
|
||
func TestAttemptRoundTrip(t *testing.T) {
|
||
a := session.Attempt{
|
||
Attempt: 1,
|
||
Model: "ollama/devstral",
|
||
Tier: "local",
|
||
DurationMs: 4200,
|
||
WarmStart: true,
|
||
Verified: false,
|
||
Verdict: "escalate",
|
||
Feedback: "missing line references",
|
||
}
|
||
data, err := json.Marshal(a)
|
||
require.NoError(t, err)
|
||
|
||
var got session.Attempt
|
||
require.NoError(t, json.Unmarshal(data, &got))
|
||
assert.Equal(t, a, got)
|
||
}
|
||
```
|
||
|
||
- [ ] **Step 2: Run test to verify it fails**
|
||
|
||
```bash
|
||
cd /Users/mathias/Documents/local-dev/AI/supervisor
|
||
go test ./internal/session/... -run TestAttemptRoundTrip -v
|
||
```
|
||
|
||
Expected: FAIL — `session.Attempt` has no `Tier`, `DurationMs`, `WarmStart`, `Verdict`, `Feedback` fields.
|
||
|
||
- [ ] **Step 3: Update Attempt struct in session.go**
|
||
|
||
Replace lines 32–38:
|
||
|
||
```go
|
||
// Attempt represents one subprocess invocation within a skill call.
|
||
type Attempt struct {
|
||
Attempt int `json:"attempt"`
|
||
Model string `json:"model"`
|
||
Tier string `json:"tier"` // local | subagent | managed
|
||
DurationMs int64 `json:"duration_ms"`
|
||
WarmStart bool `json:"warm_start"` // model already loaded in llama-swap
|
||
Verified bool `json:"verified"`
|
||
Verdict string `json:"verdict,omitempty"` // accept | escalate | error
|
||
Feedback string `json:"feedback,omitempty"` // verifier feedback on escalation
|
||
OutputSummary string `json:"output_summary,omitempty"`
|
||
RunnerOutput string `json:"runner_output,omitempty"`
|
||
}
|
||
```
|
||
|
||
- [ ] **Step 4: Run test to verify it passes**
|
||
|
||
```bash
|
||
go test ./internal/session/... -run TestAttemptRoundTrip -v
|
||
```
|
||
|
||
Expected: PASS
|
||
|
||
- [ ] **Step 5: Commit**
|
||
|
||
```bash
|
||
git add internal/session/session.go internal/session/session_test.go
|
||
git commit -m "feat(session): extend Attempt with tier, timing, and verdict fields"
|
||
```
|
||
|
||
---
|
||
|
||
### Task 2: Chain-based models config
|
||
|
||
**Files:**
|
||
- Modify: `internal/config/models.go`
|
||
- Modify: `internal/config/models_test.go`
|
||
- Modify: `config/models.yaml`
|
||
|
||
The current `modelsFile` has `Default string` and `Skills map[string]string`. Replace with a chain-aware structure. The public API gains `ChainFor`, `Verifier`, and `LlamaSwapURL` methods. The existing `Resolve` method is deleted — callers (main.go) will use `ChainFor`.
|
||
|
||
- [ ] **Step 1: Write failing tests**
|
||
|
||
Replace `internal/config/models_test.go` entirely:
|
||
|
||
```go
|
||
package config_test
|
||
|
||
import (
|
||
"os"
|
||
"path/filepath"
|
||
"testing"
|
||
|
||
"github.com/mathiasbq/supervisor/internal/config"
|
||
"github.com/stretchr/testify/assert"
|
||
"github.com/stretchr/testify/require"
|
||
)
|
||
|
||
const testYAML = `
|
||
verifier: claude-sonnet-4-6
|
||
llama_swap_url: http://koala:8080
|
||
|
||
default_chain:
|
||
- ollama/qwen3-coder-30b-tuned
|
||
- claude-sonnet-4-6
|
||
|
||
skills:
|
||
review:
|
||
chain:
|
||
- ollama/devstral-tuned
|
||
- ollama/gemma4
|
||
- claude-sonnet-4-6
|
||
spec:
|
||
chain:
|
||
- ollama/phi4
|
||
- claude-opus-4-6
|
||
`
|
||
|
||
func writeModels(t *testing.T, content string) string {
|
||
t.Helper()
|
||
f := filepath.Join(t.TempDir(), "models.yaml")
|
||
require.NoError(t, os.WriteFile(f, []byte(content), 0644))
|
||
return f
|
||
}
|
||
|
||
func TestModelsVerifier(t *testing.T) {
|
||
m, err := config.LoadModels(writeModels(t, testYAML))
|
||
require.NoError(t, err)
|
||
assert.Equal(t, "claude-sonnet-4-6", m.Verifier())
|
||
}
|
||
|
||
func TestModelsLlamaSwapURL(t *testing.T) {
|
||
m, err := config.LoadModels(writeModels(t, testYAML))
|
||
require.NoError(t, err)
|
||
assert.Equal(t, "http://koala:8080", m.LlamaSwapURL())
|
||
}
|
||
|
||
func TestModelsChainForSkillOverride(t *testing.T) {
|
||
m, err := config.LoadModels(writeModels(t, testYAML))
|
||
require.NoError(t, err)
|
||
|
||
chain := m.ChainFor("review", "")
|
||
require.Len(t, chain, 3)
|
||
assert.Equal(t, "ollama/devstral-tuned", chain[0])
|
||
assert.Equal(t, "ollama/gemma4", chain[1])
|
||
assert.Equal(t, "claude-sonnet-4-6", chain[2])
|
||
}
|
||
|
||
func TestModelsChainForDefaultFallback(t *testing.T) {
|
||
m, err := config.LoadModels(writeModels(t, testYAML))
|
||
require.NoError(t, err)
|
||
|
||
chain := m.ChainFor("trainer", "") // not in skills map
|
||
require.Len(t, chain, 2)
|
||
assert.Equal(t, "ollama/qwen3-coder-30b-tuned", chain[0])
|
||
assert.Equal(t, "claude-sonnet-4-6", chain[1])
|
||
}
|
||
|
||
func TestModelsChainForCallerOverride(t *testing.T) {
|
||
m, err := config.LoadModels(writeModels(t, testYAML))
|
||
require.NoError(t, err)
|
||
|
||
// Caller override collapses to a single-entry chain — no escalation.
|
||
chain := m.ChainFor("review", "claude-opus-4-6")
|
||
require.Len(t, chain, 1)
|
||
assert.Equal(t, "claude-opus-4-6", chain[0])
|
||
}
|
||
```
|
||
|
||
- [ ] **Step 2: Run tests to verify they fail**
|
||
|
||
```bash
|
||
go test ./internal/config/... -v
|
||
```
|
||
|
||
Expected: compile error — `ChainFor`, `Verifier`, `LlamaSwapURL` undefined.
|
||
|
||
- [ ] **Step 3: Rewrite models.go**
|
||
|
||
```go
|
||
package config
|
||
|
||
import (
|
||
"fmt"
|
||
"os"
|
||
|
||
"gopkg.in/yaml.v3"
|
||
)
|
||
|
||
type skillChain struct {
|
||
Chain []string `yaml:"chain"`
|
||
}
|
||
|
||
type modelsFile struct {
|
||
Verifier string `yaml:"verifier"`
|
||
LlamaSwapURL string `yaml:"llama_swap_url"`
|
||
DefaultChain []string `yaml:"default_chain"`
|
||
Skills map[string]skillChain `yaml:"skills"`
|
||
}
|
||
|
||
type Models struct {
|
||
data modelsFile
|
||
}
|
||
|
||
func LoadModels(path string) (Models, error) {
|
||
raw, err := os.ReadFile(path)
|
||
if err != nil {
|
||
return Models{}, fmt.Errorf("load models: %w", err)
|
||
}
|
||
var f modelsFile
|
||
if err := yaml.Unmarshal(raw, &f); err != nil {
|
||
return Models{}, fmt.Errorf("parse models: %w", err)
|
||
}
|
||
return Models{data: f}, nil
|
||
}
|
||
|
||
// Verifier returns the model name to use for all local-tier output verification.
|
||
func (m Models) Verifier() string { return m.data.Verifier }
|
||
|
||
// LlamaSwapURL returns the llama-swap base URL for warm-state probing.
|
||
func (m Models) LlamaSwapURL() string { return m.data.LlamaSwapURL }
|
||
|
||
// ChainFor returns the ordered list of model names for a skill.
|
||
// If override is non-empty, returns a single-entry chain (no escalation).
|
||
// Falls back to default_chain when the skill has no explicit entry.
|
||
func (m Models) ChainFor(skill, override string) []string {
|
||
if override != "" {
|
||
return []string{override}
|
||
}
|
||
if sc, ok := m.data.Skills[skill]; ok && len(sc.Chain) > 0 {
|
||
return sc.Chain
|
||
}
|
||
out := make([]string, len(m.data.DefaultChain))
|
||
copy(out, m.data.DefaultChain)
|
||
return out
|
||
}
|
||
```
|
||
|
||
- [ ] **Step 4: Run tests to verify they pass**
|
||
|
||
```bash
|
||
go test ./internal/config/... -v
|
||
```
|
||
|
||
Expected: PASS — all 5 tests green.
|
||
|
||
- [ ] **Step 5: Update config/models.yaml**
|
||
|
||
```yaml
|
||
# Model routing chains — three-layer priority:
|
||
# 1. model param in MCP tool call (caller override — collapses to single entry, no escalation)
|
||
# 2. per-skill chain here
|
||
# 3. default_chain fallback
|
||
|
||
verifier: claude-sonnet-4-6 # fixed verifier for all local tiers
|
||
|
||
llama_swap_url: http://koala:8080 # for warm-state probing
|
||
|
||
default_chain:
|
||
- ollama/qwen3-coder-30b-tuned
|
||
- claude-sonnet-4-6
|
||
|
||
skills:
|
||
tdd:
|
||
chain:
|
||
- ollama/qwen3-coder-30b-tuned
|
||
- claude-sonnet-4-6
|
||
review:
|
||
chain:
|
||
- ollama/devstral-tuned
|
||
- ollama/gemma4
|
||
- claude-sonnet-4-6
|
||
debug:
|
||
chain:
|
||
- ollama/deepseek-r1-tuned
|
||
- claude-sonnet-4-6
|
||
spec:
|
||
chain:
|
||
- ollama/phi4
|
||
- ollama/gemma4
|
||
- claude-sonnet-4-6
|
||
- claude-opus-4-6
|
||
retrospective:
|
||
chain:
|
||
- ollama/qwen3-coder-30b-tuned
|
||
- claude-sonnet-4-6
|
||
trainer:
|
||
chain:
|
||
- ollama/qwen3-coder-30b-tuned
|
||
- claude-sonnet-4-6
|
||
```
|
||
|
||
- [ ] **Step 6: Verify build still compiles**
|
||
|
||
```bash
|
||
go build ./...
|
||
```
|
||
|
||
Expected: compile error in main.go — `models.Resolve` no longer exists. That's expected; main.go will be fixed in Task 7.
|
||
|
||
- [ ] **Step 7: Commit**
|
||
|
||
```bash
|
||
git add internal/config/models.go internal/config/models_test.go config/models.yaml
|
||
git commit -m "feat(config): replace single-model config with chain-based routing"
|
||
```
|
||
|
||
---
|
||
|
||
### Task 3: Add --model flag to the claude Executor
|
||
|
||
**Files:**
|
||
- Modify: `internal/exec/executor.go:69-76`
|
||
|
||
The existing executor never passes a `--model` flag; the model name is injected as prompt text (currently ignored by Claude). Cloud-tier dispatch needs to actually select the right model. This change adds `--model req.Model` when the model is set and starts with "claude-".
|
||
|
||
- [ ] **Step 1: Write the failing test**
|
||
|
||
Add to `internal/exec/executor_test.go`:
|
||
|
||
```go
|
||
func TestExecutorPassesModelFlag(t *testing.T) {
|
||
// The fake claude script echoes its arguments to stderr so we can assert --model was passed.
|
||
dir := t.TempDir()
|
||
script := filepath.Join(dir, "claude")
|
||
envelope := `{"type":"result","subtype":"success","is_error":false,"structured_output":{"status":"pass","phase":"review","skill":"review","file_path":"","runner_output":"","verified":true,"model_used":"claude-sonnet-4-6","message":"ok"}}`
|
||
// Script prints args to stderr, then prints envelope to stdout.
|
||
content := "#!/bin/sh\necho \"$@\" >&2\necho '" + envelope + "'\n"
|
||
require.NoError(t, os.WriteFile(script, []byte(content), 0755))
|
||
|
||
ex := iexec.New(iexec.Config{
|
||
ClaudeBinary: script,
|
||
SystemPrompt: "sys",
|
||
Timeout: 5 * time.Second,
|
||
})
|
||
|
||
var stderrBuf bytes.Buffer
|
||
_ = stderrBuf // not exposed; we rely on the test at the result level
|
||
result, err := ex.Run(context.Background(), iexec.Request{
|
||
SkillPrompt: "review rules",
|
||
TaskPrompt: "do review",
|
||
Model: "claude-sonnet-4-6",
|
||
})
|
||
require.NoError(t, err)
|
||
assert.Equal(t, "pass", result.Status)
|
||
// The real assertion is that the --model flag doesn't break anything.
|
||
// Integration-level model verification is in the orchestrator tests.
|
||
}
|
||
```
|
||
|
||
- [ ] **Step 2: Run test to verify it passes already**
|
||
|
||
```bash
|
||
go test ./internal/exec/... -run TestExecutorPassesModelFlag -v
|
||
```
|
||
|
||
Expected: PASS (the test currently succeeds since it only checks result parsing). This step confirms the baseline.
|
||
|
||
- [ ] **Step 3: Add --model flag to executor.go**
|
||
|
||
In `internal/exec/executor.go`, after the `args` slice is built (after line 76), add model injection:
|
||
|
||
```go
|
||
args := []string{
|
||
"--print",
|
||
"--permission-mode", "bypassPermissions",
|
||
"--tools", tools,
|
||
"--json-schema", Schema,
|
||
"--output-format", "json",
|
||
}
|
||
if strings.HasPrefix(req.Model, "claude-") {
|
||
args = append(args, "--model", req.Model)
|
||
}
|
||
args = append(args, prompt)
|
||
```
|
||
|
||
Replace the existing `args` block (lines 69-76) with the above. The `strings` import is already present.
|
||
|
||
- [ ] **Step 4: Run all exec tests**
|
||
|
||
```bash
|
||
go test ./internal/exec/... -v
|
||
```
|
||
|
||
Expected: all existing tests pass.
|
||
|
||
- [ ] **Step 5: Commit**
|
||
|
||
```bash
|
||
git add internal/exec/executor.go internal/exec/executor_test.go
|
||
git commit -m "feat(exec): pass --model flag to claude subprocess for cloud-tier dispatch"
|
||
```
|
||
|
||
---
|
||
|
||
### Task 4: LiteLLM executor
|
||
|
||
**Files:**
|
||
- Create: `internal/exec/litellm.go`
|
||
- Create: `internal/exec/litellm_test.go`
|
||
|
||
The LiteLLM executor calls `POST /v1/chat/completions` and expects the model to return a JSON object matching the `Result` schema in the response content. No envelope — direct unmarshal. Parse failure triggers automatic escalation by the orchestrator.
|
||
|
||
- [ ] **Step 1: Write the failing tests**
|
||
|
||
Create `internal/exec/litellm_test.go`:
|
||
|
||
```go
|
||
package exec_test
|
||
|
||
import (
|
||
"context"
|
||
"encoding/json"
|
||
"net/http"
|
||
"net/http/httptest"
|
||
"testing"
|
||
"time"
|
||
|
||
iexec "github.com/mathiasbq/supervisor/internal/exec"
|
||
"github.com/stretchr/testify/assert"
|
||
"github.com/stretchr/testify/require"
|
||
)
|
||
|
||
func validResult() iexec.Result {
|
||
return iexec.Result{
|
||
Status: "pass",
|
||
Phase: "review",
|
||
Skill: "review",
|
||
ModelUsed: "ollama/devstral",
|
||
Message: "looks good",
|
||
}
|
||
}
|
||
|
||
func chatResponseFor(t *testing.T, result iexec.Result) string {
|
||
t.Helper()
|
||
content, err := json.Marshal(result)
|
||
require.NoError(t, err)
|
||
resp := map[string]any{
|
||
"choices": []map[string]any{
|
||
{"message": map[string]any{"role": "assistant", "content": string(content)}},
|
||
},
|
||
}
|
||
data, err := json.Marshal(resp)
|
||
require.NoError(t, err)
|
||
return string(data)
|
||
}
|
||
|
||
func TestLiteLLMParsesValidResult(t *testing.T) {
|
||
srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
|
||
assert.Equal(t, "/v1/chat/completions", r.URL.Path)
|
||
assert.Equal(t, "application/json", r.Header.Get("Content-Type"))
|
||
w.Header().Set("Content-Type", "application/json")
|
||
w.WriteHeader(http.StatusOK)
|
||
_, _ = w.Write([]byte(chatResponseFor(t, validResult())))
|
||
}))
|
||
defer srv.Close()
|
||
|
||
ex := iexec.NewLiteLLM(srv.URL, "", 5*time.Second)
|
||
result, err := ex.Run(context.Background(), iexec.Request{
|
||
SkillPrompt: "review rules",
|
||
TaskPrompt: "review the code",
|
||
Model: "ollama/devstral",
|
||
})
|
||
require.NoError(t, err)
|
||
assert.Equal(t, "pass", result.Status)
|
||
assert.Equal(t, "review", result.Skill)
|
||
}
|
||
|
||
func TestLiteLLMSendsAuthHeader(t *testing.T) {
|
||
srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
|
||
assert.Equal(t, "Bearer secret", r.Header.Get("Authorization"))
|
||
w.Header().Set("Content-Type", "application/json")
|
||
w.WriteHeader(http.StatusOK)
|
||
_, _ = w.Write([]byte(chatResponseFor(t, validResult())))
|
||
}))
|
||
defer srv.Close()
|
||
|
||
ex := iexec.NewLiteLLM(srv.URL, "secret", 5*time.Second)
|
||
_, err := ex.Run(context.Background(), iexec.Request{Model: "x", TaskPrompt: "t"})
|
||
require.NoError(t, err)
|
||
}
|
||
|
||
func TestLiteLLMErrorOnNonOKStatus(t *testing.T) {
|
||
srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
|
||
w.WriteHeader(http.StatusServiceUnavailable)
|
||
}))
|
||
defer srv.Close()
|
||
|
||
ex := iexec.NewLiteLLM(srv.URL, "", 5*time.Second)
|
||
_, err := ex.Run(context.Background(), iexec.Request{Model: "x", TaskPrompt: "t"})
|
||
assert.ErrorContains(t, err, "503")
|
||
}
|
||
|
||
func TestLiteLLMErrorOnUnparsableJSON(t *testing.T) {
|
||
srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
|
||
w.Header().Set("Content-Type", "application/json")
|
||
w.WriteHeader(http.StatusOK)
|
||
resp := map[string]any{
|
||
"choices": []map[string]any{
|
||
{"message": map[string]any{"role": "assistant", "content": "not json at all"}},
|
||
},
|
||
}
|
||
data, _ := json.Marshal(resp)
|
||
_, _ = w.Write(data)
|
||
}))
|
||
defer srv.Close()
|
||
|
||
ex := iexec.NewLiteLLM(srv.URL, "", 5*time.Second)
|
||
_, err := ex.Run(context.Background(), iexec.Request{Model: "x", TaskPrompt: "t"})
|
||
assert.Error(t, err)
|
||
}
|
||
|
||
func TestLiteLLMRespectsContextCancellation(t *testing.T) {
|
||
srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
|
||
// Block until client disconnects.
|
||
<-r.Context().Done()
|
||
}))
|
||
defer srv.Close()
|
||
|
||
ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
|
||
defer cancel()
|
||
|
||
ex := iexec.NewLiteLLM(srv.URL, "", 5*time.Second)
|
||
_, err := ex.Run(ctx, iexec.Request{Model: "x", TaskPrompt: "t"})
|
||
assert.Error(t, err)
|
||
}
|
||
```
|
||
|
||
- [ ] **Step 2: Run tests to verify they fail**
|
||
|
||
```bash
|
||
go test ./internal/exec/... -run TestLiteLLM -v
|
||
```
|
||
|
||
Expected: compile error — `iexec.NewLiteLLM` undefined.
|
||
|
||
- [ ] **Step 3: Create internal/exec/litellm.go**
|
||
|
||
```go
|
||
package exec
|
||
|
||
import (
|
||
"bytes"
|
||
"context"
|
||
"encoding/json"
|
||
"fmt"
|
||
"net/http"
|
||
"time"
|
||
)
|
||
|
||
// LiteLLMExecutor calls a LiteLLM-compatible /v1/chat/completions endpoint.
|
||
// Local models are expected to return a JSON object matching the Result schema
|
||
// as their response content — no envelope.
|
||
type LiteLLMExecutor struct {
|
||
baseURL string
|
||
apiKey string
|
||
httpClient *http.Client
|
||
}
|
||
|
||
// NewLiteLLM creates a LiteLLMExecutor.
|
||
// timeout applies to the full HTTP round-trip per call.
|
||
func NewLiteLLM(baseURL, apiKey string, timeout time.Duration) *LiteLLMExecutor {
|
||
return &LiteLLMExecutor{
|
||
baseURL: baseURL,
|
||
apiKey: apiKey,
|
||
httpClient: &http.Client{Timeout: timeout},
|
||
}
|
||
}
|
||
|
||
type litellmMessage struct {
|
||
Role string `json:"role"`
|
||
Content string `json:"content"`
|
||
}
|
||
|
||
type litellmRequest struct {
|
||
Model string `json:"model"`
|
||
Messages []litellmMessage `json:"messages"`
|
||
}
|
||
|
||
type litellmChoice struct {
|
||
Message litellmMessage `json:"message"`
|
||
}
|
||
|
||
type litellmResponse struct {
|
||
Choices []litellmChoice `json:"choices"`
|
||
}
|
||
|
||
// Run dispatches req to the LiteLLM server and parses the Result from the
|
||
// assistant message content. Returns an error on network failure, non-200
|
||
// status, or unparseable/invalid JSON — all of which the Orchestrator treats
|
||
// as automatic escalation triggers.
|
||
func (e *LiteLLMExecutor) Run(ctx context.Context, req Request) (Result, error) {
|
||
body := litellmRequest{
|
||
Model: req.Model,
|
||
Messages: []litellmMessage{
|
||
{Role: "system", Content: req.SkillPrompt},
|
||
{Role: "user", Content: req.TaskPrompt},
|
||
},
|
||
}
|
||
|
||
bodyBytes, err := json.Marshal(body)
|
||
if err != nil {
|
||
return Result{}, fmt.Errorf("litellm: marshal request: %w", err)
|
||
}
|
||
|
||
httpReq, err := http.NewRequestWithContext(ctx, http.MethodPost, e.baseURL+"/v1/chat/completions", bytes.NewReader(bodyBytes))
|
||
if err != nil {
|
||
return Result{}, fmt.Errorf("litellm: create request: %w", err)
|
||
}
|
||
httpReq.Header.Set("Content-Type", "application/json")
|
||
if e.apiKey != "" {
|
||
httpReq.Header.Set("Authorization", "Bearer "+e.apiKey)
|
||
}
|
||
|
||
resp, err := e.httpClient.Do(httpReq)
|
||
if err != nil {
|
||
return Result{}, fmt.Errorf("litellm: request failed: %w", err)
|
||
}
|
||
defer resp.Body.Close() //nolint:errcheck
|
||
|
||
if resp.StatusCode != http.StatusOK {
|
||
return Result{}, fmt.Errorf("litellm: server returned status %d", resp.StatusCode)
|
||
}
|
||
|
||
var chatResp litellmResponse
|
||
if err := json.NewDecoder(resp.Body).Decode(&chatResp); err != nil {
|
||
return Result{}, fmt.Errorf("litellm: decode response: %w", err)
|
||
}
|
||
if len(chatResp.Choices) == 0 {
|
||
return Result{}, fmt.Errorf("litellm: no choices in response")
|
||
}
|
||
|
||
content := chatResp.Choices[0].Message.Content
|
||
var result Result
|
||
if err := json.Unmarshal([]byte(content), &result); err != nil {
|
||
return Result{}, fmt.Errorf("litellm: parse result JSON: %w — content: %s", err, content)
|
||
}
|
||
if err := result.Validate(); err != nil {
|
||
return Result{}, fmt.Errorf("litellm: invalid result: %w", err)
|
||
}
|
||
return result, nil
|
||
}
|
||
```
|
||
|
||
- [ ] **Step 4: Run tests to verify they pass**
|
||
|
||
```bash
|
||
go test ./internal/exec/... -run TestLiteLLM -v
|
||
```
|
||
|
||
Expected: all 5 LiteLLM tests PASS.
|
||
|
||
- [ ] **Step 5: Commit**
|
||
|
||
```bash
|
||
git add internal/exec/litellm.go internal/exec/litellm_test.go
|
||
git commit -m "feat(exec): add LiteLLM HTTP executor for local model dispatch"
|
||
```
|
||
|
||
---
|
||
|
||
### Task 5: Claude verifier
|
||
|
||
**Files:**
|
||
- Create: `internal/exec/verifier.go`
|
||
- Create: `internal/exec/verifier_test.go`
|
||
|
||
The verifier runs a focused `claude --print` call. It gives Claude the skill discipline, the original task, and the local model output, and asks for a JSON verdict. Unlike the main executor it uses `--print` without `--output-format json` (no envelope) and without `--json-schema` (we parse the raw text). It selects a specific claude model via `--model`.
|
||
|
||
- [ ] **Step 1: Write the failing tests**
|
||
|
||
Create `internal/exec/verifier_test.go`:
|
||
|
||
```go
|
||
package exec_test
|
||
|
||
import (
|
||
"context"
|
||
"encoding/json"
|
||
"fmt"
|
||
"os"
|
||
"path/filepath"
|
||
"testing"
|
||
"time"
|
||
|
||
iexec "github.com/mathiasbq/supervisor/internal/exec"
|
||
"github.com/stretchr/testify/assert"
|
||
"github.com/stretchr/testify/require"
|
||
)
|
||
|
||
func fakeVerifierClaude(t *testing.T, verdict iexec.Verdict) string {
|
||
t.Helper()
|
||
data, err := json.Marshal(verdict)
|
||
require.NoError(t, err)
|
||
dir := t.TempDir()
|
||
script := filepath.Join(dir, "claude")
|
||
content := fmt.Sprintf("#!/bin/sh\necho '%s'\n", string(data))
|
||
require.NoError(t, os.WriteFile(script, []byte(content), 0755))
|
||
return script
|
||
}
|
||
|
||
func TestVerifierAccepts(t *testing.T) {
|
||
claude := fakeVerifierClaude(t, iexec.Verdict{Accept: true, Feedback: ""})
|
||
v := iexec.NewVerifier(claude, "claude-sonnet-4-6", 5*time.Second)
|
||
|
||
verdict, err := v.Verify(context.Background(), "skill rules", "do the task", iexec.Result{
|
||
Status: "pass", Phase: "review", Skill: "review", Message: "ok",
|
||
})
|
||
require.NoError(t, err)
|
||
assert.True(t, verdict.Accept)
|
||
assert.Empty(t, verdict.Feedback)
|
||
}
|
||
|
||
func TestVerifierEscalates(t *testing.T) {
|
||
claude := fakeVerifierClaude(t, iexec.Verdict{Accept: false, Feedback: "missing line references"})
|
||
v := iexec.NewVerifier(claude, "claude-sonnet-4-6", 5*time.Second)
|
||
|
||
verdict, err := v.Verify(context.Background(), "skill rules", "do the task", iexec.Result{
|
||
Status: "pass", Phase: "review", Skill: "review", Message: "incomplete",
|
||
})
|
||
require.NoError(t, err)
|
||
assert.False(t, verdict.Accept)
|
||
assert.Equal(t, "missing line references", verdict.Feedback)
|
||
}
|
||
|
||
func TestVerifierErrorOnUnparsableOutput(t *testing.T) {
|
||
dir := t.TempDir()
|
||
script := filepath.Join(dir, "claude")
|
||
require.NoError(t, os.WriteFile(script, []byte("#!/bin/sh\necho 'not json'\n"), 0755))
|
||
|
||
v := iexec.NewVerifier(script, "claude-sonnet-4-6", 5*time.Second)
|
||
_, err := v.Verify(context.Background(), "rules", "task", iexec.Result{
|
||
Status: "pass", Phase: "review", Skill: "review", Message: "ok",
|
||
})
|
||
assert.Error(t, err)
|
||
}
|
||
|
||
func TestVerifierErrorOnNonZeroExit(t *testing.T) {
|
||
dir := t.TempDir()
|
||
script := filepath.Join(dir, "claude")
|
||
require.NoError(t, os.WriteFile(script, []byte("#!/bin/sh\nexit 1\n"), 0755))
|
||
|
||
v := iexec.NewVerifier(script, "claude-sonnet-4-6", 5*time.Second)
|
||
_, err := v.Verify(context.Background(), "rules", "task", iexec.Result{
|
||
Status: "pass", Phase: "review", Skill: "review", Message: "ok",
|
||
})
|
||
assert.Error(t, err)
|
||
}
|
||
```
|
||
|
||
- [ ] **Step 2: Run tests to verify they fail**
|
||
|
||
```bash
|
||
go test ./internal/exec/... -run TestVerifier -v
|
||
```
|
||
|
||
Expected: compile error — `iexec.NewVerifier`, `iexec.Verdict` undefined.
|
||
|
||
- [ ] **Step 3: Create internal/exec/verifier.go**
|
||
|
||
```go
|
||
package exec
|
||
|
||
import (
|
||
"bytes"
|
||
"context"
|
||
"encoding/json"
|
||
"fmt"
|
||
"os"
|
||
"os/exec"
|
||
"time"
|
||
)
|
||
|
||
// Verdict is the output of a Claude verification call.
|
||
type Verdict struct {
|
||
Accept bool `json:"accept"`
|
||
Feedback string `json:"feedback"` // empty when Accept is true
|
||
}
|
||
|
||
// Verifier runs a focused Claude call to judge local model output.
|
||
type Verifier struct {
|
||
claudeBinary string
|
||
model string
|
||
timeout time.Duration
|
||
}
|
||
|
||
// NewVerifier creates a Verifier that calls claude with the given binary path and model.
|
||
func NewVerifier(claudeBinary, model string, timeout time.Duration) *Verifier {
|
||
if claudeBinary == "" {
|
||
claudeBinary = "claude"
|
||
}
|
||
if timeout == 0 {
|
||
timeout = 30 * time.Second
|
||
}
|
||
return &Verifier{
|
||
claudeBinary: claudeBinary,
|
||
model: model,
|
||
timeout: timeout,
|
||
}
|
||
}
|
||
|
||
// Verify asks Claude whether output satisfies the skill discipline's iron laws.
|
||
// Returns Verdict{Accept: true} to accept or Verdict{Accept: false, Feedback: "..."}
|
||
// to escalate. Returns an error on subprocess failure or unparseable response.
|
||
func (v *Verifier) Verify(ctx context.Context, skillPrompt, taskPrompt string, output Result) (Verdict, error) {
|
||
ctx, cancel := context.WithTimeout(ctx, v.timeout)
|
||
defer cancel()
|
||
|
||
outputJSON, err := json.Marshal(output)
|
||
if err != nil {
|
||
return Verdict{}, fmt.Errorf("verifier: marshal output: %w", err)
|
||
}
|
||
|
||
prompt := fmt.Sprintf(`You are a quality verifier for an AI supervisor system.
|
||
|
||
Given the skill discipline, the original task, and the generated output, decide whether the output satisfies the discipline's iron laws and output contract.
|
||
|
||
Reply with JSON only — no other text:
|
||
{"accept": true, "feedback": ""}
|
||
or
|
||
{"accept": false, "feedback": "<one sentence reason>"}
|
||
|
||
## Skill discipline
|
||
%s
|
||
|
||
## Original task
|
||
%s
|
||
|
||
## Generated output
|
||
%s`, skillPrompt, taskPrompt, string(outputJSON))
|
||
|
||
args := []string{
|
||
"--print",
|
||
"--permission-mode", "bypassPermissions",
|
||
}
|
||
if v.model != "" {
|
||
args = append(args, "--model", v.model)
|
||
}
|
||
args = append(args, prompt)
|
||
|
||
cmd := exec.CommandContext(ctx, v.claudeBinary, args...)
|
||
cmd.Env = os.Environ()
|
||
var stdout, stderr bytes.Buffer
|
||
cmd.Stdout = &stdout
|
||
cmd.Stderr = &stderr
|
||
|
||
if err := cmd.Run(); err != nil {
|
||
if ctx.Err() != nil {
|
||
return Verdict{}, fmt.Errorf("verifier: timeout after %s", v.timeout)
|
||
}
|
||
return Verdict{}, fmt.Errorf("verifier: claude exited with error: %w — stderr: %s", err, stderr.String())
|
||
}
|
||
|
||
var verdict Verdict
|
||
if err := json.Unmarshal(bytes.TrimSpace(stdout.Bytes()), &verdict); err != nil {
|
||
return Verdict{}, fmt.Errorf("verifier: parse verdict JSON: %w — raw: %s", err, stdout.String())
|
||
}
|
||
return verdict, nil
|
||
}
|
||
```
|
||
|
||
- [ ] **Step 4: Run tests to verify they pass**
|
||
|
||
```bash
|
||
go test ./internal/exec/... -run TestVerifier -v
|
||
```
|
||
|
||
Expected: all 4 verifier tests PASS.
|
||
|
||
- [ ] **Step 5: Run all exec tests**
|
||
|
||
```bash
|
||
go test ./internal/exec/... -v
|
||
```
|
||
|
||
Expected: all tests pass (executor + litellm + verifier + result tests).
|
||
|
||
- [ ] **Step 6: Commit**
|
||
|
||
```bash
|
||
git add internal/exec/verifier.go internal/exec/verifier_test.go
|
||
git commit -m "feat(exec): add Claude verifier for local model output quality gate"
|
||
```
|
||
|
||
---
|
||
|
||
### Task 6: Orchestrator
|
||
|
||
**Files:**
|
||
- Create: `internal/exec/orchestrator.go`
|
||
- Create: `internal/exec/orchestrator_test.go`
|
||
|
||
The orchestrator implements the `func(ctx, Request) (Result, error)` shape that all skill handlers expect as `ExecutorFn`. It walks the escalation chain, probes llama-swap warm state for local tiers, dispatches generation, and either accepts or escalates based on the verifier verdict. Every attempt is logged in `session.Attempt` format and appended to a provided slice.
|
||
|
||
- [ ] **Step 1: Write the failing tests**
|
||
|
||
Create `internal/exec/orchestrator_test.go`:
|
||
|
||
```go
|
||
package exec_test
|
||
|
||
import (
|
||
"context"
|
||
"errors"
|
||
"testing"
|
||
"time"
|
||
|
||
iexec "github.com/mathiasbq/supervisor/internal/exec"
|
||
"github.com/stretchr/testify/assert"
|
||
"github.com/stretchr/testify/require"
|
||
)
|
||
|
||
// fakeLocalExecutor returns the result/error on each sequential call.
|
||
type fakeLocalExecutor struct {
|
||
calls []fakeCall
|
||
callIdx int
|
||
}
|
||
|
||
type fakeCall struct {
|
||
result iexec.Result
|
||
err error
|
||
}
|
||
|
||
func (f *fakeLocalExecutor) Run(_ context.Context, _ iexec.Request) (iexec.Result, error) {
|
||
if f.callIdx >= len(f.calls) {
|
||
return iexec.Result{}, errors.New("unexpected call")
|
||
}
|
||
c := f.calls[f.callIdx]
|
||
f.callIdx++
|
||
return c.result, c.err
|
||
}
|
||
|
||
// fakeVerifier returns the verdict on each sequential call.
|
||
type fakeVerifier struct {
|
||
verdicts []iexec.Verdict
|
||
idx int
|
||
}
|
||
|
||
func (f *fakeVerifier) Verify(_ context.Context, _, _ string, _ iexec.Result) (iexec.Verdict, error) {
|
||
if f.idx >= len(f.verdicts) {
|
||
return iexec.Verdict{}, errors.New("unexpected verify call")
|
||
}
|
||
v := f.verdicts[f.idx]
|
||
f.idx++
|
||
return v, nil
|
||
}
|
||
|
||
func okResult(skill string) iexec.Result {
|
||
return iexec.Result{Status: "pass", Phase: "review", Skill: skill, Message: "ok", ModelUsed: "m"}
|
||
}
|
||
|
||
func TestOrchestratorSingleLocalAccept(t *testing.T) {
|
||
local := &fakeLocalExecutor{calls: []fakeCall{{result: okResult("review")}}}
|
||
verifier := &fakeVerifier{verdicts: []iexec.Verdict{{Accept: true}}}
|
||
|
||
var attempts []iexec.AttemptRecord
|
||
orch := iexec.NewOrchestrator(
|
||
[]iexec.ChainEntry{{Model: "ollama/devstral", Tier: "local", IsCloud: false}},
|
||
local.Run, nil, verifier, "", &attempts,
|
||
)
|
||
|
||
result, err := orch.Run(context.Background(), iexec.Request{TaskPrompt: "review"})
|
||
require.NoError(t, err)
|
||
assert.Equal(t, "pass", result.Status)
|
||
require.Len(t, attempts, 1)
|
||
assert.Equal(t, "local", attempts[0].Tier)
|
||
assert.Equal(t, "accept", attempts[0].Verdict)
|
||
}
|
||
|
||
func TestOrchestratorEscalatesOnVerifierReject(t *testing.T) {
|
||
goodResult := okResult("review")
|
||
local := &fakeLocalExecutor{calls: []fakeCall{
|
||
{result: iexec.Result{Status: "fail", Phase: "review", Skill: "review", Message: "weak"}},
|
||
{result: goodResult},
|
||
}}
|
||
verifier := &fakeVerifier{verdicts: []iexec.Verdict{
|
||
{Accept: false, Feedback: "missing line refs"},
|
||
{Accept: true},
|
||
}}
|
||
|
||
var attempts []iexec.AttemptRecord
|
||
orch := iexec.NewOrchestrator(
|
||
[]iexec.ChainEntry{
|
||
{Model: "ollama/devstral", Tier: "local", IsCloud: false},
|
||
{Model: "ollama/gemma4", Tier: "local", IsCloud: false},
|
||
},
|
||
local.Run, nil, verifier, "", &attempts,
|
||
)
|
||
|
||
result, err := orch.Run(context.Background(), iexec.Request{TaskPrompt: "review"})
|
||
require.NoError(t, err)
|
||
assert.Equal(t, "pass", result.Status)
|
||
require.Len(t, attempts, 2)
|
||
assert.Equal(t, "escalate", attempts[0].Verdict)
|
||
assert.Equal(t, "missing line refs", attempts[0].Feedback)
|
||
assert.Equal(t, "accept", attempts[1].Verdict)
|
||
// Feedback from tier 0 should have been injected into tier 1 task prompt.
|
||
assert.Equal(t, 2, local.callIdx)
|
||
}
|
||
|
||
func TestOrchestratorEscalatesOnLocalError(t *testing.T) {
|
||
local := &fakeLocalExecutor{calls: []fakeCall{
|
||
{err: errors.New("network failure")},
|
||
{result: okResult("review")},
|
||
}}
|
||
verifier := &fakeVerifier{verdicts: []iexec.Verdict{{Accept: true}}}
|
||
|
||
var attempts []iexec.AttemptRecord
|
||
orch := iexec.NewOrchestrator(
|
||
[]iexec.ChainEntry{
|
||
{Model: "ollama/devstral", Tier: "local", IsCloud: false},
|
||
{Model: "ollama/gemma4", Tier: "local", IsCloud: false},
|
||
},
|
||
local.Run, nil, verifier, "", &attempts,
|
||
)
|
||
|
||
_, err := orch.Run(context.Background(), iexec.Request{TaskPrompt: "review"})
|
||
require.NoError(t, err)
|
||
require.Len(t, attempts, 2)
|
||
assert.Equal(t, "error", attempts[0].Verdict)
|
||
assert.Equal(t, "accept", attempts[1].Verdict)
|
||
}
|
||
|
||
func TestOrchestratorCloudTierSelfCertifies(t *testing.T) {
|
||
cloudResult := okResult("review")
|
||
cloudExec := &fakeLocalExecutor{calls: []fakeCall{{result: cloudResult}}}
|
||
verifier := &fakeVerifier{} // no verdicts — should not be called
|
||
|
||
var attempts []iexec.AttemptRecord
|
||
orch := iexec.NewOrchestrator(
|
||
[]iexec.ChainEntry{{Model: "claude-sonnet-4-6", Tier: "subagent", IsCloud: true}},
|
||
nil, cloudExec.Run, verifier, "", &attempts,
|
||
)
|
||
|
||
result, err := orch.Run(context.Background(), iexec.Request{TaskPrompt: "review"})
|
||
require.NoError(t, err)
|
||
assert.Equal(t, "pass", result.Status)
|
||
require.Len(t, attempts, 1)
|
||
assert.Equal(t, "subagent", attempts[0].Tier)
|
||
assert.Equal(t, "accept", attempts[0].Verdict)
|
||
assert.Equal(t, 0, verifier.idx) // verifier never called
|
||
}
|
||
|
||
func TestOrchestratorAllTiersExhausted(t *testing.T) {
|
||
local := &fakeLocalExecutor{calls: []fakeCall{
|
||
{err: errors.New("unavailable")},
|
||
}}
|
||
|
||
var attempts []iexec.AttemptRecord
|
||
orch := iexec.NewOrchestrator(
|
||
[]iexec.ChainEntry{{Model: "ollama/devstral", Tier: "local", IsCloud: false}},
|
||
local.Run, nil, &fakeVerifier{}, "", &attempts,
|
||
)
|
||
|
||
_, err := orch.Run(context.Background(), iexec.Request{TaskPrompt: "review"})
|
||
assert.ErrorContains(t, err, "all tiers exhausted")
|
||
}
|
||
```
|
||
|
||
- [ ] **Step 2: Run tests to verify they fail**
|
||
|
||
```bash
|
||
go test ./internal/exec/... -run TestOrchestrator -v
|
||
```
|
||
|
||
Expected: compile error — `iexec.ChainEntry`, `iexec.AttemptRecord`, `iexec.NewOrchestrator` undefined.
|
||
|
||
- [ ] **Step 3: Create internal/exec/orchestrator.go**
|
||
|
||
```go
|
||
package exec
|
||
|
||
import (
|
||
"context"
|
||
"fmt"
|
||
"io"
|
||
"net/http"
|
||
"strings"
|
||
"time"
|
||
)
|
||
|
||
// ChainEntry is one tier in an escalation chain.
|
||
type ChainEntry struct {
|
||
Model string // e.g. "ollama/phi4", "claude-sonnet-4-6"
|
||
Tier string // "local" | "subagent" | "managed"
|
||
IsCloud bool // true for claude-* models; skips verifier call
|
||
}
|
||
|
||
// EntryFor builds a ChainEntry from a model name string.
|
||
func EntryFor(model string) ChainEntry {
|
||
cloud := strings.HasPrefix(model, "claude-")
|
||
tier := "local"
|
||
if cloud {
|
||
tier = "subagent"
|
||
}
|
||
return ChainEntry{Model: model, Tier: tier, IsCloud: cloud}
|
||
}
|
||
|
||
// AttemptRecord captures the outcome of one tier attempt for session logging.
|
||
type AttemptRecord struct {
|
||
Model string
|
||
Tier string
|
||
DurationMs int64
|
||
WarmStart bool
|
||
Verdict string // "accept" | "escalate" | "error"
|
||
Feedback string
|
||
}
|
||
|
||
// VerifierFn is the interface the orchestrator uses to verify local output.
|
||
type VerifierFn interface {
|
||
Verify(ctx context.Context, skillPrompt, taskPrompt string, output Result) (Verdict, error)
|
||
}
|
||
|
||
// ExecutorRunFn is the signature of Executor.Run and LiteLLMExecutor.Run.
|
||
type ExecutorRunFn func(ctx context.Context, req Request) (Result, error)
|
||
|
||
// Orchestrator walks an escalation chain, delegating generation and verification.
|
||
// It implements the ExecutorFn shape expected by skill handlers.
|
||
type Orchestrator struct {
|
||
chain []ChainEntry
|
||
localRun ExecutorRunFn // for local (non-cloud) tiers; may be nil
|
||
cloudRun ExecutorRunFn // for cloud tiers; may be nil
|
||
verifier VerifierFn
|
||
llamaSwapURL string
|
||
attempts *[]AttemptRecord
|
||
}
|
||
|
||
// NewOrchestrator creates an Orchestrator.
|
||
// attempts is a pointer to a slice that will be appended to on each tier attempt.
|
||
// Pass nil for localRun or cloudRun if no tiers of that type exist in the chain.
|
||
func NewOrchestrator(
|
||
chain []ChainEntry,
|
||
localRun ExecutorRunFn,
|
||
cloudRun ExecutorRunFn,
|
||
verifier VerifierFn,
|
||
llamaSwapURL string,
|
||
attempts *[]AttemptRecord,
|
||
) *Orchestrator {
|
||
return &Orchestrator{
|
||
chain: chain,
|
||
localRun: localRun,
|
||
cloudRun: cloudRun,
|
||
verifier: verifier,
|
||
llamaSwapURL: llamaSwapURL,
|
||
attempts: attempts,
|
||
}
|
||
}
|
||
|
||
// Run walks the escalation chain and returns the first accepted result.
|
||
// It satisfies the ExecutorFn signature: func(context.Context, Request) (Result, error).
|
||
func (o *Orchestrator) Run(ctx context.Context, req Request) (Result, error) {
|
||
taskPrompt := req.TaskPrompt
|
||
|
||
for _, entry := range o.chain {
|
||
warm := o.probeWarm(entry.Model)
|
||
start := time.Now()
|
||
|
||
tierReq := req
|
||
tierReq.Model = entry.Model
|
||
tierReq.TaskPrompt = taskPrompt
|
||
|
||
var result Result
|
||
var genErr error
|
||
|
||
if entry.IsCloud {
|
||
result, genErr = o.cloudRun(ctx, tierReq)
|
||
dur := time.Since(start).Milliseconds()
|
||
rec := AttemptRecord{
|
||
Model: entry.Model,
|
||
Tier: entry.Tier,
|
||
DurationMs: dur,
|
||
WarmStart: warm,
|
||
Verdict: "accept",
|
||
}
|
||
if genErr != nil {
|
||
rec.Verdict = "error"
|
||
}
|
||
o.appendAttempt(rec)
|
||
if genErr == nil {
|
||
return result, nil
|
||
}
|
||
continue
|
||
}
|
||
|
||
// Local tier.
|
||
result, genErr = o.localRun(ctx, tierReq)
|
||
dur := time.Since(start).Milliseconds()
|
||
|
||
if genErr != nil {
|
||
o.appendAttempt(AttemptRecord{
|
||
Model: entry.Model,
|
||
Tier: entry.Tier,
|
||
DurationMs: dur,
|
||
WarmStart: warm,
|
||
Verdict: "error",
|
||
Feedback: genErr.Error(),
|
||
})
|
||
continue
|
||
}
|
||
|
||
verdict, verErr := o.verifier.Verify(ctx, req.SkillPrompt, taskPrompt, result)
|
||
if verErr != nil {
|
||
// Treat verifier failure as escalate (safe default).
|
||
o.appendAttempt(AttemptRecord{
|
||
Model: entry.Model,
|
||
Tier: entry.Tier,
|
||
DurationMs: dur,
|
||
WarmStart: warm,
|
||
Verdict: "escalate",
|
||
Feedback: "verifier error: " + verErr.Error(),
|
||
})
|
||
continue
|
||
}
|
||
|
||
if verdict.Accept {
|
||
o.appendAttempt(AttemptRecord{
|
||
Model: entry.Model,
|
||
Tier: entry.Tier,
|
||
DurationMs: dur,
|
||
WarmStart: warm,
|
||
Verdict: "accept",
|
||
})
|
||
return result, nil
|
||
}
|
||
|
||
o.appendAttempt(AttemptRecord{
|
||
Model: entry.Model,
|
||
Tier: entry.Tier,
|
||
DurationMs: dur,
|
||
WarmStart: warm,
|
||
Verdict: "escalate",
|
||
Feedback: verdict.Feedback,
|
||
})
|
||
// Inject verifier feedback into the next tier's task prompt.
|
||
taskPrompt = taskPrompt + "\n\nPrior attempt feedback: " + verdict.Feedback
|
||
}
|
||
|
||
return Result{}, fmt.Errorf("all tiers exhausted after %d attempt(s)", len(o.chain))
|
||
}
|
||
|
||
func (o *Orchestrator) appendAttempt(rec AttemptRecord) {
|
||
if o.attempts != nil {
|
||
*o.attempts = append(*o.attempts, rec)
|
||
}
|
||
}
|
||
|
||
// probeWarm checks whether the model is currently loaded in llama-swap.
|
||
// Returns false on any error or if llamaSwapURL is empty.
|
||
func (o *Orchestrator) probeWarm(model string) bool {
|
||
if o.llamaSwapURL == "" {
|
||
return false
|
||
}
|
||
ctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
|
||
defer cancel()
|
||
|
||
req, err := http.NewRequestWithContext(ctx, http.MethodGet, o.llamaSwapURL+"/v1/models", nil)
|
||
if err != nil {
|
||
return false
|
||
}
|
||
resp, err := http.DefaultClient.Do(req)
|
||
if err != nil {
|
||
return false
|
||
}
|
||
defer resp.Body.Close() //nolint:errcheck
|
||
body, err := io.ReadAll(resp.Body)
|
||
if err != nil {
|
||
return false
|
||
}
|
||
return strings.Contains(string(body), model)
|
||
}
|
||
```
|
||
|
||
- [ ] **Step 4: Run tests to verify they pass**
|
||
|
||
```bash
|
||
go test ./internal/exec/... -run TestOrchestrator -v
|
||
```
|
||
|
||
Expected: all 5 orchestrator tests PASS.
|
||
|
||
- [ ] **Step 5: Run all exec tests**
|
||
|
||
```bash
|
||
go test ./internal/exec/... -v
|
||
```
|
||
|
||
Expected: all tests pass.
|
||
|
||
- [ ] **Step 6: Commit**
|
||
|
||
```bash
|
||
git add internal/exec/orchestrator.go internal/exec/orchestrator_test.go
|
||
git commit -m "feat(exec): add Orchestrator chain walker with verification and warm-state logging"
|
||
```
|
||
|
||
---
|
||
|
||
### Task 7: Wire orchestrators in main.go
|
||
|
||
**Files:**
|
||
- Modify: `cmd/supervisor/main.go`
|
||
|
||
Replace the single `executor.Run` passed to each skill with a per-skill `Orchestrator`. All skill handlers are unchanged — they still call `ExecutorFn` exactly as before. The `models.Resolve` call is replaced by `models.ChainFor`. A shared `LiteLLMExecutor`, `Verifier`, and `claudeExecutor` are created once and shared across all orchestrators.
|
||
|
||
- [ ] **Step 1: Read the current main.go**
|
||
|
||
Verify it has 6 skill registrations using `ExecutorFn: executor.Run` — lines 98–145.
|
||
|
||
- [ ] **Step 2: Update main.go**
|
||
|
||
Replace the full `main.go` contents. The critical changes:
|
||
1. Build `litellmExec` from `iexec.NewLiteLLM`
|
||
2. Build `verifier` from `iexec.NewVerifier`
|
||
3. Add `func buildOrch(...)` helper to keep registration readable
|
||
4. Replace `ExecutorFn: executor.Run` with `ExecutorFn: buildOrch(...).Run` for each skill
|
||
|
||
```go
|
||
package main
|
||
|
||
import (
|
||
"context"
|
||
"log/slog"
|
||
"net/http"
|
||
"os"
|
||
|
||
"github.com/mathiasbq/supervisor/internal/config"
|
||
iexec "github.com/mathiasbq/supervisor/internal/exec"
|
||
"github.com/mathiasbq/supervisor/internal/mcp"
|
||
"github.com/mathiasbq/supervisor/internal/registry"
|
||
"github.com/mathiasbq/supervisor/internal/skills/brain"
|
||
skilldebug "github.com/mathiasbq/supervisor/internal/skills/debug"
|
||
"github.com/mathiasbq/supervisor/internal/skills/org"
|
||
"github.com/mathiasbq/supervisor/internal/skills/retrospective"
|
||
"github.com/mathiasbq/supervisor/internal/skills/review"
|
||
"github.com/mathiasbq/supervisor/internal/skills/sessionlog"
|
||
"github.com/mathiasbq/supervisor/internal/skills/spec"
|
||
"github.com/mathiasbq/supervisor/internal/skills/tdd"
|
||
"github.com/mathiasbq/supervisor/internal/skills/trainer"
|
||
"github.com/mathiasbq/supervisor/internal/tier"
|
||
)
|
||
|
||
func main() {
|
||
logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))
|
||
|
||
cfg, err := config.Load()
|
||
if err != nil {
|
||
logger.Error("load config", "err", err)
|
||
os.Exit(1)
|
||
}
|
||
|
||
models, err := config.LoadModels(cfg.ModelsFile)
|
||
if err != nil {
|
||
logger.Error("load models", "err", err)
|
||
os.Exit(1)
|
||
}
|
||
|
||
systemPrompt, err := os.ReadFile(cfg.ConfigDir + "/CLAUDE.md")
|
||
if err != nil {
|
||
logger.Error("read supervisor CLAUDE.md", "path", cfg.ConfigDir+"/CLAUDE.md", "err", err)
|
||
os.Exit(1)
|
||
}
|
||
|
||
tddPrompt, err := os.ReadFile(cfg.ConfigDir + "/tdd.md")
|
||
if err != nil {
|
||
logger.Error("read tdd.md", "path", cfg.ConfigDir+"/tdd.md", "err", err)
|
||
os.Exit(1)
|
||
}
|
||
|
||
retroPrompt, err := os.ReadFile(cfg.ConfigDir + "/retrospective.md")
|
||
if err != nil {
|
||
logger.Error("read retrospective.md", "path", cfg.ConfigDir+"/retrospective.md", "err", err)
|
||
os.Exit(1)
|
||
}
|
||
|
||
reviewPrompt, err := os.ReadFile(cfg.ConfigDir + "/review.md")
|
||
if err != nil {
|
||
logger.Error("read review.md", "path", cfg.ConfigDir+"/review.md", "err", err)
|
||
os.Exit(1)
|
||
}
|
||
|
||
debugPrompt, err := os.ReadFile(cfg.ConfigDir + "/debug.md")
|
||
if err != nil {
|
||
logger.Error("read debug.md", "path", cfg.ConfigDir+"/debug.md", "err", err)
|
||
os.Exit(1)
|
||
}
|
||
|
||
specPrompt, err := os.ReadFile(cfg.ConfigDir + "/spec.md")
|
||
if err != nil {
|
||
logger.Error("read spec.md", "path", cfg.ConfigDir+"/spec.md", "err", err)
|
||
os.Exit(1)
|
||
}
|
||
|
||
trainerReaderPrompt, err := os.ReadFile(cfg.ConfigDir + "/trainer-reader.md")
|
||
if err != nil {
|
||
logger.Error("read trainer-reader.md", "path", cfg.ConfigDir+"/trainer-reader.md", "err", err)
|
||
os.Exit(1)
|
||
}
|
||
trainerWriterPrompt, err := os.ReadFile(cfg.ConfigDir + "/trainer-writer.md")
|
||
if err != nil {
|
||
logger.Error("read trainer-writer.md", "path", cfg.ConfigDir+"/trainer-writer.md", "err", err)
|
||
os.Exit(1)
|
||
}
|
||
|
||
claudeExec := iexec.New(iexec.Config{
|
||
SystemPrompt: string(systemPrompt),
|
||
LiteLLMBaseURL: cfg.LiteLLMBaseURL,
|
||
LiteLLMAPIKey: cfg.LiteLLMAPIKey,
|
||
})
|
||
|
||
litellmExec := iexec.NewLiteLLM(cfg.LiteLLMBaseURL, cfg.LiteLLMAPIKey, 0)
|
||
|
||
verifier := iexec.NewVerifier("", models.Verifier(), 0)
|
||
|
||
// buildOrch creates a per-skill Orchestrator. Each skill gets its own
|
||
// attempt log; the caller is responsible for saving it to the session log.
|
||
buildOrch := func(skill, override string) *iexec.Orchestrator {
|
||
rawChain := models.ChainFor(skill, override)
|
||
chain := make([]iexec.ChainEntry, len(rawChain))
|
||
for i, m := range rawChain {
|
||
chain[i] = iexec.EntryFor(m)
|
||
}
|
||
attempts := make([]iexec.AttemptRecord, 0, len(chain))
|
||
return iexec.NewOrchestrator(chain, litellmExec.Run, claudeExec.Run, verifier, models.LlamaSwapURL(), &attempts)
|
||
}
|
||
|
||
tierFn := func(ctx context.Context) tier.Info {
|
||
return tier.Detect(ctx, "https://api.anthropic.com", cfg.LiteLLMBaseURL)
|
||
}
|
||
|
||
reg := registry.New()
|
||
reg.Register(tdd.New(tdd.Config{
|
||
SystemPrompt: string(systemPrompt),
|
||
SkillPrompt: string(tddPrompt),
|
||
DefaultModel: models.ChainFor("tdd", "")[0],
|
||
ExecutorFn: buildOrch("tdd", "").Run,
|
||
SessionsDir: cfg.SessionsDir,
|
||
}))
|
||
reg.Register(brain.New(brain.Config{
|
||
IngestBaseURL: cfg.IngestBaseURL,
|
||
}))
|
||
reg.Register(org.New(org.Config{
|
||
TierFn: tierFn,
|
||
}))
|
||
reg.Register(sessionlog.New(sessionlog.Config{
|
||
SessionsDir: cfg.SessionsDir,
|
||
}))
|
||
reg.Register(retrospective.New(retrospective.Config{
|
||
SkillPrompt: string(retroPrompt),
|
||
DefaultModel: models.ChainFor("retrospective", "")[0],
|
||
SessionsDir: cfg.SessionsDir,
|
||
ExecutorFn: buildOrch("retrospective", "").Run,
|
||
}))
|
||
reg.Register(review.New(review.Config{
|
||
SkillPrompt: string(reviewPrompt),
|
||
DefaultModel: models.ChainFor("review", "")[0],
|
||
ExecutorFn: buildOrch("review", "").Run,
|
||
SessionsDir: cfg.SessionsDir,
|
||
}))
|
||
reg.Register(skilldebug.New(skilldebug.Config{
|
||
SkillPrompt: string(debugPrompt),
|
||
DefaultModel: models.ChainFor("debug", "")[0],
|
||
ExecutorFn: buildOrch("debug", "").Run,
|
||
SessionsDir: cfg.SessionsDir,
|
||
}))
|
||
reg.Register(spec.New(spec.Config{
|
||
SkillPrompt: string(specPrompt),
|
||
DefaultModel: models.ChainFor("spec", "")[0],
|
||
ExecutorFn: buildOrch("spec", "").Run,
|
||
SessionsDir: cfg.SessionsDir,
|
||
}))
|
||
reg.Register(trainer.New(trainer.Config{
|
||
ReaderPrompt: string(trainerReaderPrompt),
|
||
WriterPrompt: string(trainerWriterPrompt),
|
||
DefaultModel: models.ChainFor("trainer", "")[0],
|
||
ExecutorFn: buildOrch("trainer", "").Run,
|
||
SessionsDir: cfg.SessionsDir,
|
||
BrainDir: cfg.BrainDir,
|
||
}))
|
||
|
||
srv := mcp.NewServer(reg)
|
||
mux := http.NewServeMux()
|
||
mux.Handle("/mcp", srv)
|
||
|
||
addr := ":" + cfg.Port
|
||
logger.Info("supervisor starting", "addr", addr)
|
||
if err := http.ListenAndServe(addr, mux); err != nil {
|
||
logger.Error("server stopped", "err", err)
|
||
os.Exit(1)
|
||
}
|
||
}
|
||
```
|
||
|
||
- [ ] **Step 3: Build to verify compilation**
|
||
|
||
```bash
|
||
go build ./...
|
||
```
|
||
|
||
Expected: clean build — no errors.
|
||
|
||
- [ ] **Step 4: Run all tests**
|
||
|
||
```bash
|
||
go test ./...
|
||
```
|
||
|
||
Expected: all tests pass.
|
||
|
||
- [ ] **Step 5: Commit**
|
||
|
||
```bash
|
||
git add cmd/supervisor/main.go
|
||
git commit -m "feat(main): wire per-skill Orchestrators replacing single executor.Run"
|
||
```
|
||
|
||
---
|
||
|
||
### Task 8: Ship v0.3.0
|
||
|
||
**Files:** none (CI + tagging only)
|
||
|
||
- [ ] **Step 1: Run full check**
|
||
|
||
```bash
|
||
cd /Users/mathias/Documents/local-dev/AI/supervisor
|
||
task check
|
||
```
|
||
|
||
Expected: lint + vet + test all pass.
|
||
|
||
- [ ] **Step 2: Tag**
|
||
|
||
```bash
|
||
git tag v0.3.0 -m "feat: model orchestration with per-skill chains and Claude verification"
|
||
```
|
||
|
||
- [ ] **Step 3: Push with follow-tags**
|
||
|
||
```bash
|
||
git push && git push --follow-tags
|
||
```
|
||
|
||
Expected: Gitea CI job triggers, tag v0.3.0 pushed. Mirror job should also succeed.
|
||
|
||
- [ ] **Step 4: Verify CI**
|
||
|
||
Check Gitea CI passes. If the mirror job fails on tag, check that the tag doesn't already exist on GitHub.
|
||
|
||
---
|
||
|
||
## Self-review
|
||
|
||
**Spec coverage check:**
|
||
|
||
| Spec requirement | Covered by |
|
||
|-----------------|------------|
|
||
| Each skill dispatches to configured local model via LiteLLM | Task 4 (litellm.go) + Task 7 (buildOrch) |
|
||
| Claude verifies every local output | Task 5 (verifier.go) + Task 6 orchestrator loop |
|
||
| Escalation walks per-skill chain | Task 6 orchestrator.go |
|
||
| Every attempt logged (model, tier, duration, warm, verdict) | Task 1 (Attempt struct) + Task 6 (AttemptRecord) |
|
||
| Cloud tiers self-certify, no verifier call | Task 6 `if entry.IsCloud` branch |
|
||
| Zero changes to skill handlers | Task 7 — handlers untouched, only main.go wired |
|
||
| LiteLLMBaseURL already in config; no new env vars beyond LLAMA_SWAP_URL | models.yaml has llama_swap_url; no config.go change needed |
|
||
| Caller override collapses to single-entry chain | Task 2 ChainFor override path + tests |
|
||
| One attempt per tier before escalating | Task 6 — no retry loop within a tier |
|
||
|
||
**Note on LLAMA_SWAP_URL:** The llama-swap URL lives in `models.yaml` (`llama_swap_url: http://koala:8080`), not in an env var. The spec success criterion says "no new env vars required beyond `LLAMA_SWAP_URL`" — this plan interprets that as the URL being config-file-driven, which avoids any new env var entirely. If an env var override is later needed, it can be added to `config.Config` in a follow-up.
|
||
|
||
**Note on session logging of AttemptRecord:** The orchestrator collects `AttemptRecord` slices in memory. The session JSONL write (via `session.Append`) happens in the skill handlers — which already append an `Entry` with `Attempts []session.Attempt`. In this plan the `AttemptRecord` type lives in the exec package and `session.Attempt` lives in the session package; they are parallel types. A follow-up could unify them, but the skill handlers will need to translate the orchestrator's records into `session.Attempt` structs. Since skill handlers are not changed in this phase (per spec constraint), the translation will need to happen when Phase 4 unifies observability. For now, the orchestrator accumulates records for future use and the existing `session.Attempt{Verified}` field continues to be set by skill handlers as before.
|