# Level 3: Strip Slug Authority from LLM — Design Spec

## Problem

The ingestion pipeline currently asks the LLM to produce full wiki pages, including the file path (e.g. `wiki/sources/finbert-huggingface.md`). This causes two classes of bugs:

1. **Slug proliferation**: the LLM invents different slugs for the same concept across chunks or runs, producing duplicate pages that diverge in content.
2. **Unstable paths**: the LLM may shorten, expand, or otherwise vary titles, which makes deduplication via `Resolve` unreliable because the slug mismatch is introduced upstream of the normalizer.

## Solution

Strip slug authority from the LLM entirely. The LLM returns a minimal structured object, and the pipeline computes every slug deterministically from the page title using `wiki.Slug(title)`.

---

## LLM JSON Contract

### Output format (per page)

```json
{
  "title": "FinBERT",
  "type": "concept",
  "subtype": "framework",
  "domain": "ai-llm",
  "content": "## Definition\n\nA BERT-based model fine-tuned for financial sentiment...\n\n## Related\n\n- [[Sentiment Analysis]]\n- [[Hugging Face]]\n"
}
```

**Fields:**

| Field | Required | Values |
|-------|----------|--------|
| `title` | yes | Human-readable title, e.g. "FinBERT" |
| `type` | yes | `"source"` \| `"concept"` \| `"entity"` |
| `subtype` | for entity/source | entity: `person\|company\|tool\|model\|framework\|technology`; source: `article\|pdf\|book\|video\|note\|project` |
| `domain` | no | Tag string, e.g. `ai-llm`, `finance` |
| `content` | yes | Markdown body sections only (no frontmatter, no path) |

**Wikilinks in content:** use `[[Display Name]]` only, never a slug. The pipeline canonicalizes each link to `[[slug|Display Name]]` in a post-processing step.

**The LLM never writes slugs, paths, or frontmatter.**

---

## Pipeline Changes

### New type: `RawPage`

```go
type RawPage struct {
	Title   string
	Type    string // "source" | "concept" | "entity"
	Subtype string
	Domain  string
	Content string
}
```

### New step order

```
ParseRawPages → BuildPages → Resolve → CanonicalizeLinks → injectSourceRefs → mergeAll → write
```

### Step descriptions

**`ParseRawPages(output string) ([]RawPage, []string)`**
Replaces `ParsePages`. Deserializes JSON objects with the new schema. Same truncation-recovery logic as today. Returns `(pages, warnings)`.

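As a rough illustration of the parsing step, the sketch below decodes pages one at a time and validates them against the field table in the JSON contract. It rests on several assumptions that this spec does not confirm: that the LLM output is a whitespace-separated stream of JSON objects, that stopping at the first undecodable object is an acceptable stand-in for the existing truncation-recovery logic, and that invalid pages are skipped with a warning. The lowercase `parseRawPages` and the `validTypes` map are illustrative names, not the real implementation.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// RawPage mirrors the per-page JSON object from the contract.
// encoding/json matches keys to field names case-insensitively,
// so "title" fills Title without struct tags.
type RawPage struct {
	Title   string
	Type    string
	Subtype string
	Domain  string
	Content string
}

// validTypes lists the page types allowed by the contract.
var validTypes = map[string]bool{"source": true, "concept": true, "entity": true}

// parseRawPages is a hypothetical sketch of ParseRawPages. It assumes the
// output is a stream of JSON objects and keeps every page decoded before
// the first malformed or truncated object.
func parseRawPages(output string) ([]RawPage, []string) {
	var pages []RawPage
	var warnings []string
	dec := json.NewDecoder(strings.NewReader(output))
	for {
		var p RawPage
		err := dec.Decode(&p)
		if err == io.EOF {
			break // clean end of input
		}
		if err != nil {
			warnings = append(warnings, fmt.Sprintf("stopped at malformed or truncated object: %v", err))
			break
		}
		switch {
		case strings.TrimSpace(p.Title) == "":
			warnings = append(warnings, "skipped page with missing title")
		case !validTypes[p.Type]:
			warnings = append(warnings, fmt.Sprintf("skipped %q: unknown type %q", p.Title, p.Type))
		case p.Content == "":
			warnings = append(warnings, fmt.Sprintf("skipped %q: empty content", p.Title))
		case (p.Type == "entity" || p.Type == "source") && p.Subtype == "":
			warnings = append(warnings, fmt.Sprintf("skipped %q: subtype required for %s", p.Title, p.Type))
		default:
			pages = append(pages, p)
		}
	}
	return pages, warnings
}

func main() {
	// The second object is cut off mid-value, as a truncated LLM response would be.
	output := `{"title":"FinBERT","type":"entity","subtype":"model","content":"## Definition\n"}
{"title":"Sentiment Analysis","type":"conc`
	pages, warnings := parseRawPages(output)
	fmt.Println("pages:", len(pages))
	for _, w := range warnings {
		fmt.Println("warning:", w)
	}
}
```
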
**`BuildPages(rawPages []RawPage, sourceSlug, date string) []wiki.Page`**
Converts `RawPage → wiki.Page`:
- Computes slug: `wiki.Slug(page.Title)`
- Computes path: `wiki/<type>/<slug>.md`
- Assembles frontmatter:
  ```
  ---
  title: <Title>
  type: <type>
  subtype: <subtype>   # omitted if empty
  domain: <domain>     # omitted if empty
  created: <date>
  source: <sourceSlug> # omitted for the source page itself
  ---
  ```
- Concatenates frontmatter + content

**`Resolve(pages []wiki.Page, inventory) []wiki.Page`**
Unchanged. Normalizes near-duplicate titles to existing inventory slugs.

**`CanonicalizeLinks(pages []wiki.Page, inventory) ([]wiki.Page, []string)`**
New. Builds a title→slug map from the inventory plus the current batch, then replaces `[[Display Name]]` with `[[slug|Display Name]]` in each page's content. Links whose title has no known slug are left as-is and returned as warnings.

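A minimal sketch of the link rewrite follows, operating on a single content string with a prebuilt map rather than on `[]wiki.Page` and the inventory. The regular expression, the lowercase `canonicalizeLinks` name, and the case-insensitive title lookup (map keys stored in lowercase) are all assumptions made for illustration. Links that already contain a `|` are skipped, so running the step twice should not double-prefix a slug.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// wikilink matches [[Display Name]] but not [[slug|Display Name]],
// because the character class excludes '|' and square brackets.
var wikilink = regexp.MustCompile(`\[\[([^\[\]|]+)\]\]`)

// canonicalizeLinks rewrites bare wikilinks using titleToSlug, whose keys
// are assumed to be lowercased titles. Unknown titles are left untouched
// and reported as warnings.
func canonicalizeLinks(content string, titleToSlug map[string]string) (string, []string) {
	var warnings []string
	out := wikilink.ReplaceAllStringFunc(content, func(m string) string {
		name := strings.TrimSpace(m[2 : len(m)-2]) // drop the surrounding [[ and ]]
		slug, ok := titleToSlug[strings.ToLower(name)]
		if !ok {
			warnings = append(warnings, fmt.Sprintf("no slug for link %q", name))
			return m
		}
		return "[[" + slug + "|" + name + "]]"
	})
	return out, warnings
}

func main() {
	slugs := map[string]string{
		"sentiment analysis": "sentiment-analysis",
		"hugging face":       "hugging-face",
	}
	out, warnings := canonicalizeLinks("- [[Sentiment Analysis]]\n- [[Hugging Face]]\n- [[Unknown Thing]]\n", slugs)
	fmt.Print(out)
	fmt.Println(warnings)
}
```
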
**`injectSourceRefs`**
Unchanged. Reads `[[slug|...]]` links (post-canonicalization) to inject back-references.

**`mergeAll → write`**
Unchanged.

### `pipeline.Run` signature change

```go
func Run(ctx context.Context, cfg Config, brainDir, content, source string, dryRun bool) (Result, error)
```

`source` is already passed in (it is the display name or filename), so the signature stays as shown. A new internal `sourceSlug` is derived from it via `wiki.Slug(source)` before calling `BuildPages`. No API change is needed.

---

## Files Changed

| File | Change |
|------|--------|
| `ingestion/internal/pipeline/parse.go` | Replace `ParsePages` with `ParseRawPages` + `RawPage` type |
| `ingestion/internal/pipeline/build.go` | New file: `BuildPages` |
| `ingestion/internal/pipeline/links.go` | New file: `CanonicalizeLinks` |
| `ingestion/internal/pipeline/pipeline.go` | Wire new steps; derive `sourceSlug` from `source` |
| `ingestion/internal/pipeline/prompt.go` | New system prompt + `BuildPrompt` for new JSON format |
| `brain/schema.md` | Update wikilink format and JSON schema docs |

No changes to `resolve.go`, `refs.go`, `backfill.go`, or `merge.go`.

---

## Wikilink Format

- **LLM output**: `[[Display Name]]`
- **Stored on disk**: `[[slug|Display Name]]`
- **`CanonicalizeLinks`** converts the LLM form to the stored form using the inventory and the current batch

The stored form is Obsidian's display-alias syntax, which the existing codebase already uses.

---

## Testing Strategy

- `ParseRawPages`: table-driven, cover valid JSON, truncated output, unknown type, missing title
- `BuildPages`: table-driven, cover slug computation, frontmatter assembly, source page (no `source:` field), entity with subtype
- `CanonicalizeLinks`: cover known title → replaced, unknown title → left as-is + warning, multiple links in one page
- Integration test: full `Run` call with mock LLM returning new JSON format, assert no slug duplication across two chunks of the same source

---

## Out of Scope

- Re-ingesting existing pages (the user will trigger this manually after deploy)
- Changing the `BackfillRefs` endpoint (already correct, slug-based)
- Changing the `Resolve` fuzzy-match algorithm