diff --git a/docs/superpowers/specs/2026-04-23-level3-slug-authority-design.md b/docs/superpowers/specs/2026-04-23-level3-slug-authority-design.md
new file mode 100644
index 0000000..8035e20
--- /dev/null
+++ b/docs/superpowers/specs/2026-04-23-level3-slug-authority-design.md
@@ -0,0 +1,148 @@
+# Level 3: Strip Slug Authority from LLM — Design Spec
+
+## Problem
+
+The ingestion pipeline currently asks the LLM to produce full wiki pages including the file path (e.g. `wiki/sources/finbert-huggingface.md`). This causes two classes of bug:
+
+1. **Slug proliferation** — the LLM invents different slugs for the same concept across chunks or runs, producing duplicate pages that diverge in content.
+2. **Unstable paths** — the LLM may shorten, expand, or vary titles, making deduplication via `Resolve` unreliable because the slug mismatch is upstream of the normalizer.
+
+## Solution
+
+Strip slug authority from the LLM entirely. The LLM returns a minimal structured object. The pipeline computes all slugs deterministically from titles using `wiki.Slug(title)`.
+
+---
+
+## LLM JSON Contract
+
+### Output format (per page)
+
+```json
+{
+  "title": "FinBERT",
+  "type": "concept",
+  "subtype": "framework",
+  "domain": "ai-llm",
+  "content": "## Definition\n\nA BERT-based model fine-tuned for financial sentiment...\n\n## Related\n\n- [[Sentiment Analysis]]\n- [[Hugging Face]]\n"
+}
+```
+
+**Fields:**
+
+| Field | Required | Values |
+|-------|----------|--------|
+| `title` | yes | Human-readable title, e.g. "FinBERT" |
+| `type` | yes | `"source"` \| `"concept"` \| `"entity"` |
+| `subtype` | for entity/source | entity: `person\|company\|tool\|model\|framework\|technology`; source: `article\|pdf\|book\|video\|note\|project` |
+| `domain` | no | tag string, e.g. `ai-llm`, `finance` |
+| `content` | yes | Markdown body sections only — no frontmatter, no path |
+
+**Wikilinks in content:** `[[Display Name]]` only. No slug. The pipeline canonicalizes to `[[slug|Display Name]]` in a post-processing step.
+
+**The LLM never writes slugs, paths, or frontmatter.**
+
+---
+
+## Pipeline Changes
+
+### New type: `RawPage`
+
+```go
+type RawPage struct {
+	Title   string
+	Type    string // "source" | "concept" | "entity"
+	Subtype string
+	Domain  string
+	Content string
+}
+```
+
+### New step order
+
+```
+ParseRawPages → BuildPages → Resolve → CanonicalizeLinks → injectSourceRefs → mergeAll → write
+```
+
+### Step descriptions
+
+**`ParseRawPages(output string) ([]RawPage, []string)`**
+Replaces `ParsePages`. Deserializes JSON objects with the new schema. Same truncation-recovery logic as today. Returns `(pages, warnings)`.
+
+**`BuildPages(rawPages []RawPage, sourceSlug, date string) []wiki.Page`**
+Converts `RawPage → wiki.Page`:
+- Computes slug: `wiki.Slug(page.Title)`
+- Computes path: `wiki//.md`
+- Assembles frontmatter:
+  ```
+  ---
+  title: <title>
+  type: <type>
+  subtype: <subtype>   # omitted if empty
+  domain: <domain>     # omitted if empty
+  created: <date>
+  source: <sourceSlug> # omitted for the source page itself
+  ---
+  ```
+- Concatenates frontmatter + content
+
+**`Resolve(pages []wiki.Page, inventory) []wiki.Page`**
+Unchanged. Normalizes near-duplicate titles to existing inventory slugs.
+
+**`CanonicalizeLinks(pages []wiki.Page, inventory) ([]wiki.Page, []string)`**
+New. Builds a title→slug map from inventory + current batch. Replaces `[[Display Name]]` with `[[slug|Display Name]]` in each page's content. Titles with no known slug are left as-is and returned as warnings.
+
+**`injectSourceRefs`**
+Unchanged. Reads `[[slug|...]]` links (post-canonicalization) to inject back-references.
+
+**`mergeAll → write`**
+Unchanged.
+
+### `pipeline.Run` signature change
+
+```go
+func Run(ctx context.Context, cfg Config, brainDir, content, source string, dryRun bool) (Result, error)
+```
+
+`source` is already passed (it's the display name / filename). A new internal `sourceSlug` is derived from it via `wiki.Slug(source)` before calling `BuildPages`. No API change needed.
+
+---
+
+## Files Changed
+
+| File | Change |
+|------|--------|
+| `ingestion/internal/pipeline/parse.go` | Replace `ParsePages` with `ParseRawPages` + `RawPage` type |
+| `ingestion/internal/pipeline/build.go` | New file: `BuildPages` |
+| `ingestion/internal/pipeline/links.go` | New file: `CanonicalizeLinks` |
+| `ingestion/internal/pipeline/pipeline.go` | Wire new steps; derive `sourceSlug` from `source` |
+| `ingestion/internal/pipeline/prompt.go` | New system prompt + `BuildPrompt` for new JSON format |
+| `brain/schema.md` | Update wikilink format and JSON schema docs |
+
+`resolve.go`, `refs.go`, `backfill.go`, `merge.go` — no changes.
+
+---
+
+## Wikilink Format
+
+- **LLM output**: `[[Display Name]]`
+- **Stored on disk**: `[[slug|Display Name]]`
+- **`CanonicalizeLinks`** converts between the two using the inventory
+
+This matches Obsidian's display-alias syntax that the existing codebase already uses.
+
+---
+
+## Testing Strategy
+
+- `ParseRawPages`: table-driven, cover valid JSON, truncated output, unknown type, missing title
+- `BuildPages`: table-driven, cover slug computation, frontmatter assembly, source page (no `source:` field), entity with subtype
+- `CanonicalizeLinks`: cover known title → replaced, unknown title → left as-is + warning, multiple links in one page
+- Integration test: full `Run` call with mock LLM returning new JSON format, assert no slug duplication across two chunks of the same source
+
+---
+
+## Out of Scope
+
+- Re-ingesting existing pages (user will trigger manually after deploy)
+- Changing the `BackfillRefs` endpoint (already correct, slug-based)
+- Changing the `Resolve` fuzzy-match algorithm
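
Review note on the `CanonicalizeLinks` step in the spec above: the link rewrite could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the implementation in `links.go`. The helper name `canonicalizeLinks`, the plain `map[string]string` standing in for the inventory, the exact-match title lookup, and the warning text are all hypothetical; the spec's function operates on `[]wiki.Page` and an inventory type that this diff does not show.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// bareLink matches [[Display Name]] links that carry no slug yet.
// Links already in [[slug|Display Name]] form contain '|' and are not matched.
var bareLink = regexp.MustCompile(`\[\[([^\[\]|]+)\]\]`)

// canonicalizeLinks rewrites bare wikilinks in one content string using a
// title-to-slug map. Hypothetical simplification: the real step would build
// this map from the inventory plus the current batch of pages.
func canonicalizeLinks(content string, slugByTitle map[string]string) (string, []string) {
	var warnings []string
	out := bareLink.ReplaceAllStringFunc(content, func(m string) string {
		// Strip the surrounding "[[" and "]]" to recover the display name.
		title := strings.TrimSpace(m[2 : len(m)-2])
		slug, ok := slugByTitle[title] // assumption: exact title match
		if !ok {
			// Unknown titles are left untouched and reported as warnings.
			warnings = append(warnings, "no slug for title: "+title)
			return m
		}
		return "[[" + slug + "|" + title + "]]"
	})
	return out, warnings
}

func main() {
	slugs := map[string]string{"Sentiment Analysis": "sentiment-analysis"}
	out, warns := canonicalizeLinks("- [[Sentiment Analysis]]\n- [[Hugging Face]]\n", slugs)
	fmt.Print(out)
	fmt.Println(warns)
}
```

With the sample map, the first link gains its slug while `[[Hugging Face]]` stays as written and produces one warning, which mirrors the "left as-is + warning" case listed in the testing strategy.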