fix(ingestion): always append .md extension to written filenames

brain_write with a custom filename omitted the .md extension, causing search to skip the file (search.go filters on HasSuffix .md).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-22 19:23:07 +02:00
parent ca8a691241
commit c9310b1079
7 changed files with 6955 additions and 1 deletions
--- a/docs/multi-model-routing.md
+++ b/docs/multi-model-routing.md
@@ -0,0 +1,241 @@
+# Multi-Model Routing for supervisor
+
+Reference document for implementing multi-model access within the supervisor project.
+Researched April 2026. Constraints: Claude Max subscription (ToS must be respected).
+
+---
+
+## Goal
+
+Route tasks to specialized, cheaper, or local models during agent and skill flows — without
+violating Anthropic's terms or introducing unnecessary infrastructure risk.
+
+---
+
+## Hard Constraints
+
+- Claude Max subscription is in use. Anthropic's April 2026 terms **prohibit using the
+  subscription with third-party harnesses that spoof the Anthropic API surface**.
+- `ANTHROPIC_BASE_URL` → LiteLLM workaround is explicitly out of scope.
+- Claude must remain the reasoning engine. Other models are tools, not replacements.
+
+---
+
+## Infrastructure Available
+
+| Machine | Role | Relevant services |
+|---------|------|-------------------|
+| koala   | GPU inference | llama-swap, Ollama, Qdrant, LiteLLM proxy |
+| iguana  | Services, builds | k3s, general services |
+| flamingo | Daily driver | Claude Code runs here |
+
+LiteLLM proxy on koala exposes 100+ models (local + cloud) through a unified API.
+All machines connected via Tailscale.
+
+---
+
+## Approved Patterns
+
+### Pattern 1 — Native Claude model tiering (zero build)
+
+Claude Code subagents support per-agent model selection via frontmatter.
+Use this for cost routing within the Claude model family.
+
+```yaml
+# ~/.claude/agents/explorer.md
+---
+name: explorer
+description: File reading, code search, codebase mapping — use for all exploration tasks
+model: haiku
+---
+```
+
+- `haiku` for exploration, summarization, classification
+- `sonnet` (default) for main reasoning and implementation
+- `opus` for deep analysis, architecture decisions
+
+**When to use**: Always. Add `model: haiku` to any subagent that does read-heavy or
+classification work. Cheapest and fastest path to cost control.
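
A minimal sketch for auditing which subagents still lack a pinned model, assuming agent files are Markdown with a leading `---` frontmatter block as in the example above. The helper names `frontmatter_model` and `agents_without_model` are hypothetical, and the parser only handles flat `key: value` lines rather than full YAML.

```python
from pathlib import Path
from typing import Optional


def frontmatter_model(text: str) -> Optional[str]:
    """Return the `model:` value from a leading `---` frontmatter block, if any."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None  # file has no frontmatter at all
    for line in lines[1:]:
        if line.strip() == "---":
            break  # reached the end of the frontmatter without a model key
        key, sep, value = line.partition(":")
        if sep and key.strip() == "model":
            return value.strip() or None
    return None


def agents_without_model(agents_dir: Path) -> list[str]:
    """List agent files in agents_dir that do not pin a model."""
    return sorted(
        p.name
        for p in agents_dir.glob("*.md")
        if frontmatter_model(p.read_text(encoding="utf-8")) is None
    )
```

Calling `agents_without_model(Path.home() / ".claude" / "agents")` would list the files that fall back to the default model and are candidates for `model: haiku`.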
+
+---
+
+### Pattern 2 — MCP tools wrapping local models (primary build target)
+
+Expose local models on koala as named MCP tools. Claude remains the orchestrator and
+reasoning engine — it calls local models as tools the same way it calls any other tool.
+
+This is the intended MCP use case and carries zero ToS risk.
+
+**Semantic contract**: Claude decides *when* to delegate based on the tool description.
+Write descriptions that tell Claude what the model is good for.
+
+#### MCP server implementation
+
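+A small Python server registered with Claude Code. With the stdio registration below,
+Claude Code launches it as a local subprocess on flamingo and it reaches koala over
+Tailscale; running it on koala instead would need a network transport rather than stdio.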
+
+```python
+# supervisor/scripts/mcp_local_models.py
+import requests
+from mcp.server.fastmcp import FastMCP  # official MCP Python SDK (`pip install mcp`)
+
+server = FastMCP("local-models")
+
+LITELLM_BASE = "http://koala:4000"
+OLLAMA_BASE  = "http://koala:11434"  # currently unused; all calls go through LiteLLM
+TIMEOUT_S = 120  # local generation can be slow; fail instead of hanging the session
+
+def _litellm_chat(model: str, prompt: str) -> str:
+    """Send one user message to the LiteLLM proxy and return the reply text."""
+    r = requests.post(f"{LITELLM_BASE}/v1/chat/completions", json={
+        "model": model,
+        "messages": [{"role": "user", "content": prompt}],
+        "max_tokens": 2048,
+    }, timeout=TIMEOUT_S)
+    r.raise_for_status()
+    return r.json()["choices"][0]["message"]["content"]
+
+
+@server.tool()
+def ask_local_llama(prompt: str) -> str:
+    """Ask the local Llama model on koala.
+    Use for: bulk summarization, first-pass analysis, classification, simple Q&A,
+    anything that does not require deep reasoning or up-to-date knowledge.
+    Faster and cheaper than cloud models for routine subtasks."""
+    return _litellm_chat("llama3-local", prompt)
+
+
+@server.tool()
+def ask_coding_model(code: str, question: str) -> str:
+    """Ask a code-specialized local model.
+    Use for: syntax checking, boilerplate generation, code formatting questions,
+    simple refactors where pattern-matching is sufficient."""
+    return _litellm_chat("codellama-local", f"Code:\n{code}\n\nQuestion: {question}")
+
+
+@server.tool()
+def list_available_local_models() -> list[str]:
+    """List all models currently available on the local LiteLLM proxy."""
+    r = requests.get(f"{LITELLM_BASE}/v1/models", timeout=TIMEOUT_S)
+    r.raise_for_status()
+    return [m["id"] for m in r.json()["data"]]
+
+
+if __name__ == "__main__":
+    server.run()  # FastMCP serves over stdio by default
+```
+
+#### Register in Claude Code
+
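+Add to a project-level `.mcp.json`, or register it with `claude mcp add` (user-scoped
+servers are stored in `~/.claude.json`; `settings.json` does not hold MCP server definitions):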
+
+```json
+{
+  "mcpServers": {
+    "local-models": {
+      "command": "python3",
+      "args": ["/path/to/supervisor/scripts/mcp_local_models.py"]
+    }
+  }
+}
+```
+
+#### LiteLLM config additions needed on koala
+
+```yaml
+# litellm config.yaml — add model entries for local models
+model_list:
+  - model_name: llama3-local
+    litellm_params:
+      model: ollama/llama3.2
+      api_base: http://localhost:11434
+
+  - model_name: codellama-local
+    litellm_params:
+      model: ollama/codellama
+      api_base: http://localhost:11434
+```
+
+---
+
+### Pattern 3 — External orchestration scripts (for pipeline workflows)
+
+For multi-model pipelines that don't need to live inside a Claude Code session.
+These scripts authenticate with their own `ANTHROPIC_API_KEY`, billed per token through
+the API and separate from the Max subscription, so they can call the Claude API and
+LiteLLM freely.
+
+Claude Code invokes them via the Bash tool.
+
+```
+Claude Code → [Bash tool] → ./scripts/orchestrate.py → {Claude API, LiteLLM, local models}
+```
+
+```python
+# supervisor/scripts/orchestrate.py
+import sys
+
+import anthropic
+import requests
+
+# Reads ANTHROPIC_API_KEY (API billing, separate from the Max subscription)
+claude = anthropic.Anthropic()
+
+def analyze_document(path: str) -> str:
+    with open(path, encoding="utf-8") as f:
+        content = f.read()
+
+    # Step 1: local Llama extracts structure (fast, cheap)
+    r = requests.post("http://koala:4000/v1/chat/completions", json={
+        "model": "llama3-local",
+        "messages": [{"role": "user", "content": f"Extract key sections from:\n{content}"}],
+    }, timeout=120)
+    r.raise_for_status()  # surface proxy errors instead of a KeyError on "choices"
+    structure = r.json()["choices"][0]["message"]["content"]
+
+    # Step 2: Claude synthesizes and reasons over it
+    synthesis = claude.messages.create(
+        model="claude-sonnet-4-6",
+        max_tokens=2048,
+        messages=[{"role": "user", "content": f"Synthesize these findings:\n{structure}"}]
+    )
+    return synthesis.content[0].text
+
+
+if __name__ == "__main__":
+    # Entry point so the Bash tool can run: python3 scripts/orchestrate.py <path>
+    print(analyze_document(sys.argv[1]))
+```
+
+**When to use**: Batch processing, automated pipelines, workflows triggered by cron or
+external events. Not for interactive Claude Code sessions.
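
For the batch case, one possible driver is sketched below. It assumes a per-document function with the same `path -> str` signature as `analyze_document`; `run_batch` is a hypothetical helper, not part of the script above. A failure on one file is recorded rather than aborting the whole run, which tends to matter for cron-triggered jobs.

```python
from typing import Callable, Iterable


def run_batch(
    paths: Iterable[str],
    analyze: Callable[[str], str],
) -> tuple[dict[str, str], dict[str, str]]:
    """Apply analyze to each path; return (results, errors), both keyed by path."""
    results: dict[str, str] = {}
    errors: dict[str, str] = {}
    for path in paths:
        try:
            results[path] = analyze(path)
        except Exception as exc:
            # Record the failure and keep going so one bad file does not stop the batch.
            errors[path] = f"{type(exc).__name__}: {exc}"
    return results, errors
```

A cron wrapper could call `run_batch(sys.argv[1:], analyze_document)` and exit non-zero when `errors` is non-empty.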
+
+---
+
+## What to Skip
+
+| Approach | Why skip |
+|----------|----------|
+| `ANTHROPIC_BASE_URL` → LiteLLM | ToS violation with Max subscription (April 2026 terms) |
+| Third-party harnesses (OpenClaw etc.) | Explicitly banned for subscription users |
+| A2A in Claude Code | Not implemented by Anthropic yet — revisit late 2026 |
+| OpenAI agent handoffs | Loses execution context, not worth the complexity |
+
+---
+
+## Protocol Landscape (for awareness, not immediate action)
+
+- **MCP** — production, 97M monthly downloads, your primary tool-access protocol. LiteLLM
+  natively supports it as both MCP gateway and MCP client as of v1.60+.
+- **A2A v1.0** — Google/Linux Foundation, 150+ orgs in production, but Anthropic has not
+  shipped it in Claude Code. The intent is agent-to-agent peer delegation (vs MCP's
+  agent-to-tool). Worth watching for H2 2026.
+- **AGNTCY** — Cisco/Linux Foundation, discovery and identity layer beneath MCP+A2A. 
+  Potentially relevant for multi-machine routing across koala/iguana/flamingo once mature.
+
+---
+
+## Build Priority
+
+| Step | Effort | Value | When |
+|------|--------|-------|------|
+| Add `model: haiku` to explorer subagents | 10 min | Immediate cost saving | Now |
+| Write MCP server for local models | 2–3h | Local model access in sessions | Soon |
+| Register MCP server in Claude Code settings | 15 min | Activates pattern 2 | With above |
+| Write orchestration script template | 1–2h | Pipeline workflows | When needed |
+
+---
+
+## References
+
+- LiteLLM MCP docs: https://docs.litellm.ai/docs/mcp
+- Community MCP wrapper for LiteLLM: https://github.com/itsDarianNgo/mcp-server-litellm
+- Ollama MCP server: https://github.com/rawveg/ollama-mcp
+- A2A protocol status: https://www.linuxfoundation.org/press/a2a-protocol-surpasses-150-organizations-lands-in-major-cloud-platforms-and-sees-enterprise-production-use-in-first-year
+- AGNTCY: https://github.com/agntcy