feat: brain tunnels — cross-wing concept links and embedding-based retrieval #2

Closed
opened 2026-05-07 15:44:54 +00:00 by gitea-mcp-bot · 1 comment

Context

This is the follow-up to #1 (Hall taxonomy). Once the Wing/Hall layout is stable, two retrieval gaps remain:

  1. Cross-wing blindness — the same concept can appear in multiple Wings (e.g. pex-copper in both bathroom-plumbing and koala-plumbing, or val-vol-r2 in both jepa-fx and hyperguild). Current retrieval treats Wings as isolated silos. There is no way to ask "what do I know about X across all Wings?" and get a ranked, deduplicated result.

  2. Term-frequency scoring is brittle: search.Query scores by raw term count. Synonyms, paraphrasing, and concept drift across sessions mean semantically relevant notes score zero and irrelevant notes score high. As the brain grows this degrades faster than linearly.

This issue addresses both: Tunnels for explicit cross-wing links, and embedding-based retrieval as an opt-in replacement for term-frequency scoring.


Design

Tunnels

A Tunnel is a bidirectional wikilink between two notes in different Wings that share a concept. They are created in two ways:

Automatic — when brain_write writes a note with wing=A, it runs a lightweight concept-match pass over the new note's content against an in-memory index of existing Wing names and note titles. If a match is found in Wing B, it appends a ## See also section with a wikilink to the matching note in Wing B, and appends a reciprocal link to that note.

Manual — new MCP tool brain_tunnel that takes source (wing/hall/slug) and target (wing/hall/slug) and writes the bidirectional link explicitly.

Tunnels are plain Obsidian wikilinks ([[wing-b/hall/slug]]) — no special syntax, no database. Obsidian's graph view will show the cross-wing edges naturally.

Embedding-based retrieval

Replace (or augment) term-frequency scoring in search.Query with cosine similarity over note embeddings, using the existing LiteLLM embedding endpoint on piguard.

Storage — embeddings stored as a sidecar index at brain/.embeddings/index.json — a flat map of relative_path → []float32. This file is gitignored (it's a derived artifact). Obsidian never sees it.

Index maintenance — the watcher (already running at INGEST_WATCH_INTERVAL) detects new/modified .md files under brain/wiki/ and re-embeds them in the background. Initial full index built on first start if brain/.embeddings/index.json is absent.

Query path — when INGEST_EMBED_URL is set (pointing at piguard's embedding API), search.Query embeds the query string and returns results ranked by cosine similarity. When unset, falls back to term-frequency (current behaviour). This makes embeddings opt-in with zero breaking changes.

Hybrid scoring — when both term-frequency and embedding scores are available, combine with a configurable weight: score = α * tf_score + (1-α) * embed_score. Default α=0.3 (favour semantic). Configurable via INGEST_HYBRID_ALPHA.


Implementation

Tunnels

ingestion/internal/brain/tunnel.go

package brain

// TunnelCandidate is a cross-wing match found during auto-tunnel detection.
type TunnelCandidate struct {
    SourcePath string
    TargetPath string
    MatchedTerm string
}

// DetectTunnels scans brainDir/wiki/ for notes whose titles or tags overlap
// with terms extracted from content. Returns candidates without writing.
func DetectTunnels(brainDir, content string) ([]TunnelCandidate, error)

// WriteTunnel appends a wikilink to both source and target notes.
// Idempotent — does not duplicate if link already present.
func WriteTunnel(brainDir, sourcePath, targetPath string) error

Auto-tunnel runs after every brain_write with wing+hall. Candidates are written automatically only when confidence is high (exact title match). Fuzzy matches are written to brain/raw/tunnel-candidates-<date>.md for human review.
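
One possible reading of "exact title match" is sketched below: a title counts as matched when it occurs in the new note's content as a whole term, case-insensitively, and the matching note lives in a different Wing. `ExactTitleMatches`, the title-to-path map, and the boundary rule (hyphens count as part of a term, so `pex-copper` does not match inside `pex-copper-v2`) are all assumptions for illustration; the issue does not define how fuzzy matches are detected, so they are left out.

```go
package main

import (
	"regexp"
	"sort"
	"strings"
)

// TunnelCandidate mirrors the struct declared in tunnel.go above.
type TunnelCandidate struct {
	SourcePath  string
	TargetPath  string
	MatchedTerm string
}

// ExactTitleMatches returns one candidate per title that occurs in content
// as a whole term (case-insensitive) and whose note is in a different Wing
// than sourcePath. titles maps a note title to its wing/hall/slug path.
func ExactTitleMatches(sourcePath, content string, titles map[string]string) []TunnelCandidate {
	sourceWing, _, _ := strings.Cut(sourcePath, "/")
	var out []TunnelCandidate
	for title, target := range titles {
		if title == "" || strings.HasPrefix(target, sourceWing+"/") {
			continue // skip empty titles and same-Wing notes
		}
		// Require a non-word, non-hyphen character (or text edge) on both sides.
		re := regexp.MustCompile(`(?i)(?:^|[^\w-])` + regexp.QuoteMeta(title) + `(?:$|[^\w-])`)
		if re.MatchString(content) {
			out = append(out, TunnelCandidate{
				SourcePath:  sourcePath,
				TargetPath:  target,
				MatchedTerm: title,
			})
		}
	}
	// Map iteration order is random; sort for deterministic output.
	sort.Slice(out, func(i, j int) bool { return out[i].TargetPath < out[j].TargetPath })
	return out
}
```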

New MCP tool: brain_tunnel

{
  "name": "brain_tunnel",
  "description": "Create an explicit bidirectional link between two notes in different wings.",
  "parameters": {
    "source": "wing/hall/slug of the source note",
    "target": "wing/hall/slug of the target note"
  }
}
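
A handler for this tool would need to validate both references before touching any files. The sketch below shows one way to do that, under the assumption that a reference is exactly three non-empty path segments; `NoteRef`, `ParseNoteRef`, and `ValidateTunnel` are hypothetical names, not part of the existing codebase.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// NoteRef is a parsed wing/hall/slug reference.
type NoteRef struct {
	Wing string
	Hall string
	Slug string
}

// ParseNoteRef splits s into wing, hall, and slug. It rejects references
// with the wrong number of segments and segments that are empty or that
// could escape the wiki directory ("." and "..").
func ParseNoteRef(s string) (NoteRef, error) {
	parts := strings.Split(s, "/")
	if len(parts) != 3 {
		return NoteRef{}, fmt.Errorf("note ref %q: want wing/hall/slug", s)
	}
	for _, p := range parts {
		if p == "" || p == "." || p == ".." {
			return NoteRef{}, fmt.Errorf("note ref %q: invalid segment %q", s, p)
		}
	}
	return NoteRef{Wing: parts[0], Hall: parts[1], Slug: parts[2]}, nil
}

// ValidateTunnel parses both ends and enforces the cross-wing rule from the
// tool description: source and target must be in different Wings.
func ValidateTunnel(source, target string) (NoteRef, NoteRef, error) {
	src, err := ParseNoteRef(source)
	if err != nil {
		return NoteRef{}, NoteRef{}, err
	}
	dst, err := ParseNoteRef(target)
	if err != nil {
		return NoteRef{}, NoteRef{}, err
	}
	if src.Wing == dst.Wing {
		return NoteRef{}, NoteRef{}, errors.New("source and target must be in different wings")
	}
	return src, dst, nil
}
```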

Embedding retrieval

ingestion/internal/embed/embed.go

package embed

// Index is an in-memory embedding index loaded from brain/.embeddings/index.json.
type Index struct { ... }

// Load reads or initialises the index from disk.
func Load(brainDir string) (*Index, error)

// Upsert embeds content and stores it under path.
func (idx *Index) Upsert(ctx context.Context, path, content string, embedFn EmbedFunc) error

// Search returns the top-k paths ranked by cosine similarity to query.
func (idx *Index) Search(ctx context.Context, query string, k int, embedFn EmbedFunc) ([]Result, error)

// Save persists the index to disk atomically.
func (idx *Index) Save(brainDir string) error

Watcher integration (ingestion/internal/watcher/watcher.go)

On each tick, diff the file list against the embedding index. For new/modified files, call Index.Upsert. Call Index.Save after each batch.

search.Query extension

type QueryOptions struct {
    Query       string
    Limit       int
    Wing        string
    Hall        string
    EmbedIndex  *embed.Index  // nil = term-frequency only
    EmbedFn     embed.EmbedFunc
    HybridAlpha float64       // 0.0–1.0, default 0.3
}
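
The Wing and Hall fields imply that scoping still applies when embeddings are in use. A minimal sketch of that filter-then-rank step follows, assuming index paths are relative to brain/wiki/ and start with wing/hall/. The `Result` fields and the `TopK` helper are assumptions; the issue names the `Result` type but does not define it.

```go
package main

import (
	"sort"
	"strings"
)

// Result pairs a note path with its score. Field names are assumed.
type Result struct {
	Path  string
	Score float64
}

// TopK keeps only paths under the given wing (and hall, if set), sorts them
// by descending score, and returns at most k results. An empty wing means
// no scoping; a hall without a wing is ignored. k <= 0 means no limit.
func TopK(scores map[string]float64, wing, hall string, k int) []Result {
	prefix := ""
	if wing != "" {
		prefix = wing + "/" // trailing slash stops "jepa" matching "jepa-fx"
		if hall != "" {
			prefix += hall + "/"
		}
	}
	var out []Result
	for path, score := range scores {
		if strings.HasPrefix(path, prefix) {
			out = append(out, Result{Path: path, Score: score})
		}
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Score != out[j].Score {
			return out[i].Score > out[j].Score
		}
		return out[i].Path < out[j].Path // stable tie-break
	})
	if k > 0 && len(out) > k {
		out = out[:k]
	}
	return out
}
```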

New env vars

| Variable | Default | Purpose |
|----------|---------|---------|
| INGEST_EMBED_URL | "" | Embedding API base URL (piguard LiteLLM). Empty = disabled. |
| INGEST_EMBED_MODEL | text-embedding-3-small | Model name passed to LiteLLM |
| INGEST_EMBED_DIM | 1536 | Expected embedding dimension (for validation) |
| INGEST_HYBRID_ALPHA | 0.3 | TF weight in hybrid scoring (0=pure semantic, 1=pure TF) |
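
Reading these variables with their defaults could look like the sketch below. `EmbedConfig` and `LoadEmbedConfig` are hypothetical names, and the validation rules (positive dimension, alpha within [0, 1]) are assumptions inferred from the table rather than stated requirements. The lookup function is injected so the parser can be tested without touching the process environment; production code would pass `os.Getenv`.

```go
package main

import (
	"fmt"
	"strconv"
)

// EmbedConfig holds the embedding-related settings from the table above.
type EmbedConfig struct {
	URL         string
	Model       string
	Dim         int
	HybridAlpha float64
}

// Enabled reports whether embedding retrieval is switched on.
func (c EmbedConfig) Enabled() bool { return c.URL != "" }

// LoadEmbedConfig reads the INGEST_* variables via getenv, applying the
// documented defaults when a variable is unset or empty.
func LoadEmbedConfig(getenv func(string) string) (EmbedConfig, error) {
	cfg := EmbedConfig{
		URL:         getenv("INGEST_EMBED_URL"),
		Model:       "text-embedding-3-small",
		Dim:         1536,
		HybridAlpha: 0.3,
	}
	if v := getenv("INGEST_EMBED_MODEL"); v != "" {
		cfg.Model = v
	}
	if v := getenv("INGEST_EMBED_DIM"); v != "" {
		n, err := strconv.Atoi(v)
		if err != nil || n <= 0 {
			return EmbedConfig{}, fmt.Errorf("INGEST_EMBED_DIM=%q: want a positive integer", v)
		}
		cfg.Dim = n
	}
	if v := getenv("INGEST_HYBRID_ALPHA"); v != "" {
		a, err := strconv.ParseFloat(v, 64)
		// The negated range check also rejects NaN, which ParseFloat accepts.
		if err != nil || !(a >= 0 && a <= 1) {
			return EmbedConfig{}, fmt.Errorf("INGEST_HYBRID_ALPHA=%q: want a number in [0, 1]", v)
		}
		cfg.HybridAlpha = a
	}
	return cfg, nil
}
```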

Acceptance criteria

Tunnels

  • brain_tunnel source=jepa-fx/decisions/val-vol-r2 target=hyperguild/decisions/routing-floor writes wikilinks in both files and is idempotent on second call
  • Auto-tunnel after brain_write creates a link when an exact Wing/note title match is found in content
  • Fuzzy candidates land in brain/raw/tunnel-candidates-<date>.md, not written automatically
  • Obsidian graph view shows cross-wing edges for tunnel links

Embeddings

  • When INGEST_EMBED_URL is unset, behaviour is identical to pre-issue (term-frequency, no regressions)
  • When set, brain_query results are ranked by hybrid score
  • brain/.embeddings/index.json is populated on watcher tick for new notes
  • Index survives server restart (loaded from disk on start)
  • brain_query with wing filter still scopes embedding search to that Wing's paths only

General

  • All existing tests pass
  • New tests cover: tunnel idempotency, cosine similarity correctness, hybrid alpha boundary values (0.0 and 1.0), watcher upsert on file change

Dependencies

  • Depends on #1 (Hall taxonomy) — Wing/Hall path layout must exist before tunnels and embedding scoping are meaningful
  • No new Go module dependencies for tunnels (stdlib only)
  • Embedding requires an HTTP client to piguard — reuse existing llm.Client pattern

Branch

feat/brain-tunnels-and-embeddings from feat/brain-halls (rebase onto main after #1 merges)

Out of scope

  • Vector database (ChromaDB, Qdrant etc.) — flat JSON index is sufficient at this scale
  • Automatic tunnel creation for fuzzy matches — human review required
  • Fine-tuning or reranking models

Created via git-mcp on behalf of @mathiasbq

Owner

Restructuring this. Two reasons:

  1. Embedding section conflicts with #8. This issue specifies a flat-JSON sidecar (brain/.embeddings/index.json) as the vector store. DECISIONS.md (2026-04-08) commits to Qdrant for vectors, and #8 plans the hybrid BM25+Qdrant+nomic-embed-text path properly. Two parallel embedding designs in flight is worse than one.

  2. Tunnels are orthogonal to retrieval. Cross-wing wikilinks are a structural / navigation feature; embedding retrieval is a scoring feature. Bundling them was a packaging accident.

Plan:

  • #16 (new) — Tunnels only, depends on #1.
  • #8 — already covers the embedding work properly. Adding a comment there to confirm it absorbs the embedding portion of this issue.

Closing this one as restructured.


Reference: mathias/hyperguild#2