18 Commits

Author SHA1 Message Date
Mathias Bergqvist
0a70d9e972 feat(pipeline): add POST /ingest-raw for direct batch ingestion without LLM
All checks were successful
CI / Lint / Test / Vet (push) Successful in 9s
CI / Mirror to GitHub (push) Has been skipped
Allows callers to provide pre-structured RawPage data directly, bypassing the
LLM extraction step. The pipeline still handles slug computation, frontmatter,
link canonicalization, source back-references, and dedup — only the extraction
is skipped. Useful when a more capable model or manual curation produces the
structured data.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-24 11:15:59 +02:00
Mathias Bergqvist
3e9a648115 fix(pipeline): repair invalid JSON escape sequences from LLM output before parsing
All checks were successful
CI / Lint / Test / Vet (push) Successful in 11s
CI / Mirror to GitHub (push) Has been skipped
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-23 22:04:27 +02:00
Mathias Bergqvist
923a665365 fix(pipeline): skip RawPages with empty title in BuildPages instead of producing broken paths
All checks were successful
CI / Lint / Test / Vet (push) Successful in 9s
CI / Mirror to GitHub (push) Has been skipped
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-23 19:55:37 +02:00
Mathias Bergqvist
537aebc302 feat(pipeline): update system prompt for new LLM JSON contract (no slugs)
- Change prompt to reflect new output format: title, type, subtype, domain, content
- Remove slug/path generation responsibility from LLM — pipeline now handles it
- Wikilinks change from [[slug|Display Name]] to [[Display Name]] only
- LLM no longer includes frontmatter or paths in output

docs(schema): update LLM output format and wikilink convention for Level 3

- Specify JSON schema: title, type, subtype, domain, content fields
- Remove frontmatter requirements from schema output (handled by pipeline)
- Simplify wikilink format to [[Display Name]] — no slug or pipe
- Pipeline now responsible for slug generation and frontmatter construction

These changes shift slug/frontmatter generation from LLM to pipeline,
reducing cognitive load on the model and improving control over output.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-23 19:45:21 +02:00
Mathias Bergqvist
de35d4dbb0 feat(pipeline): wire ParseRawPages+BuildPages+CanonicalizeLinks into Run
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-23 19:07:33 +02:00
Mathias Bergqvist
26855f69b0 feat(pipeline): add CanonicalizeLinks — convert [[Display Name]] to [[slug|Display Name]] 2026-04-23 18:59:10 +02:00
Mathias Bergqvist
a7b363d589 fix(pipeline): quote YAML scalar fields in buildFrontmatter to prevent injection 2026-04-23 18:56:39 +02:00
Mathias Bergqvist
7b57051af8 feat(pipeline): add BuildPages — compute slugs/paths/frontmatter from RawPage 2026-04-23 18:50:37 +02:00
Mathias Bergqvist
a620f6cb01 fix(pipeline): guard empty-title bridge + skip stale integration tests until task4
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-23 18:46:07 +02:00
Mathias Bergqvist
26b5636b43 feat(pipeline): replace ParsePages with ParseRawPages + RawPage type
Strips slug authority from the LLM. The new RawPage type carries only
{title, type, subtype, domain, content} — no paths or frontmatter.
Pipeline will derive slugs deterministically (Task 4).

pipeline.go gets a temporary bridge stub (TODO task4) to keep the
package compiling between tasks.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-23 18:41:33 +02:00
Mathias Bergqvist
989f375aec docs: add Level 3 implementation plan 2026-04-23 17:37:45 +02:00
Mathias Bergqvist
6403d5e444 docs: add Level 3 slug authority design spec 2026-04-23 17:23:22 +02:00
Mathias Bergqvist
ab19968ae2 feat: POST /backfill-refs — retroactive source back-reference injection
All checks were successful
CI / Lint / Test / Vet (push) Successful in 10s
CI / Mirror to GitHub (push) Successful in 3s
Walks wiki/sources/, extracts wikilinks from each source page, and injects
## Sources back-refs into all linked concept and entity pages. All refs from
all sources are accumulated in memory before writing, so multiple sources
referencing the same concept are merged in a single write. Running the
endpoint multiple times is safe — wiki.Merge deduplicates bullet items.
2026-04-23 16:50:11 +02:00
Mathias Bergqvist
1605624668 feat(pipeline): add POST /backfill-refs endpoint to retroactively inject source back-references 2026-04-23 16:50:00 +02:00
Mathias Bergqvist
55fa0b503a feat: source back-references on concept and entity pages
All checks were successful
CI / Lint / Test / Vet (push) Successful in 10s
CI / Mirror to GitHub (push) Successful in 3s
After each ingestion, every concept and entity page linked from the
source page gains a ## Sources entry pointing back to that source.
Pages already on disk (from prior ingestions) are loaded and updated,
so re-ingesting a new source accumulates references over time.
Deduplication is handled by wiki.Merge's existing bullet-section logic.
2026-04-23 16:36:40 +02:00
Mathias Bergqvist
3c2bd9268c feat(pipeline): wire source back-reference injection into Run 2026-04-23 16:36:22 +02:00
Mathias Bergqvist
29727ec2a5 feat(pipeline): inject source back-references into concept and entity pages 2026-04-23 16:35:47 +02:00
Mathias Bergqvist
0a075088b2 docs: add source back-references implementation plan 2026-04-23 16:33:41 +02:00
23 changed files with 3301 additions and 137 deletions


@@ -3,21 +3,34 @@
 This document defines the three page types in the brain wiki.
 The LLM must follow this schema exactly when generating wiki pages.
+## Output Format
+Return a JSON array. Each element:
+```json
+{
+  "title": "exact page title",
+  "type": "source | concept | entity",
+  "subtype": "see below — omit for concept",
+  "domain": "see domains — omit if none fits",
+  "content": "Markdown body only — no frontmatter, no path"
+}
+```
+- `subtype` for **source**: `article | pdf | book | video | note | project`
+- `subtype` for **entity**: `person | company | tool | model | framework | technology`
+- The pipeline computes slugs and frontmatter — never include them in output.
 ## Wikilink Format
-All cross-references use `[[slug|Display Text]]`.
+All cross-references use `[[Display Name]]` — just the display name, no slug, no pipe.
 Rules:
-- slug = lowercase filename without .md, spaces → hyphens, strip all non-alphanumeric except hyphens
-- The `|` separator is REQUIRED — never use `[[Title]]` without a slug
-- Examples: `[[domain-driven-design|Domain Driven Design]]`, `[[ryan-singer|Ryan Singer]]`
-- Slugs must resolve to an existing file in the inventory, or a file you are creating in this response
-Slug generation examples:
-- "Domain Driven Design" → `domain-driven-design`
-- "It's Complicated" → `its-complicated`
-- "gRPC" → `grpc`
-- "GPT-4o" → `gpt-4o`
+- Only link to pages in the inventory or pages you are creating in this response
+- The pipeline converts `[[Display Name]]` to `[[slug|Display Name]]` automatically
+- Section links must match their section type (Related Concepts → concept pages only, etc.)
+Examples: `[[Domain Driven Design]]`, `[[Ryan Singer]]`, `[[Shape Up]]`
 ## Domains
@@ -30,17 +43,6 @@ Use one of: `ai-llm`, `software-engineering`, `product-strategy`, `finance-marke
 One page per ingested source. Books are NEVER split across multiple source pages — update the existing one.
-Required frontmatter:
-```yaml
-title: <exact title>
-type: article | pdf | book | video | note | project
-domain: <domain>
-date_ingested: YYYY-MM-DD
-last_updated: YYYY-MM-DD
-aliases:
-- <exact title>
-```
 Body sections (in this order):
 ### Summary
@@ -50,10 +52,10 @@ Body sections (in this order):
 Bulleted list. Paraphrase — no verbatim quotes or code.
 ### Concepts Introduced or Reinforced
-Wikilinks to wiki/concepts/ ONLY. One per line.
+Wikilinks to concept pages ONLY. One per line.
 ### Entities Mentioned
-Wikilinks to wiki/entities/ ONLY. One per line.
+Wikilinks to entity pages ONLY. One per line.
 ### Open Questions Raised
 Gaps or follow-up questions from this source.
@@ -75,15 +77,6 @@ Dated entries appended on re-ingestion. NEVER rewrite — only append.
 One page per idea, framework, methodology, or pattern.
-Required frontmatter:
-```yaml
-title: <concept name>
-domain: <domain>
-last_updated: YYYY-MM-DD
-aliases:
-- <exact title>
-```
 Body sections (in this order):
 ### Definition
@@ -93,13 +86,13 @@ One-paragraph plain-language explanation.
 Practical significance. Why should anyone care?
 ### Related Concepts
-Wikilinks to wiki/concepts/ ONLY.
+Wikilinks to concept pages ONLY.
 ### Related Entities
-Wikilinks to wiki/entities/ ONLY.
+Wikilinks to entity pages ONLY.
 ### Sources
-Wikilinks to wiki/sources/ ONLY.
+Wikilinks to source pages ONLY.
 ### Evolving Notes
 Updated as new sources arrive. Append, do not rewrite.
@@ -110,16 +103,6 @@ Updated as new sources arrive. Append, do not rewrite.
 One page per person, tool, organisation, technology, or product.
-Required frontmatter:
-```yaml
-title: <name>
-type: person | company | tool | model | framework | technology
-domain: <domain>
-last_updated: YYYY-MM-DD
-aliases:
-- <exact title>
-```
 Body sections (in this order):
 ### Description
@@ -132,23 +115,23 @@ Why this entity matters to this knowledge base.
 With dates where known.
 ### Related Concepts
-Wikilinks to wiki/concepts/ ONLY.
+Wikilinks to concept pages ONLY.
 ### Related Entities
-Wikilinks to wiki/entities/ ONLY.
+Wikilinks to entity pages ONLY.
 ### Sources
-Wikilinks to wiki/sources/ ONLY.
+Wikilinks to source pages ONLY.
 ---
 ## Non-Negotiable Rules
 1. Output ONLY a valid JSON array — no markdown fences, no prose before or after
-2. Each element: `{"path": "wiki/<type>/<slug>.md", "content": "...full markdown..."}`
-3. Slugs are kebab-case: lowercase, spaces→hyphens, strip special characters
-4. Every wikilink must be `[[slug|Display Text]]` — the pipe separator is required
-5. Dates always YYYY-MM-DD
+2. Each element: `{"title": "...", "type": "...", "subtype": "...", "domain": "...", "content": "..."}`
+3. Never include slugs, paths, or frontmatter in output — the pipeline handles these
+4. Wikilinks: `[[Display Name]]` only — no pipe, no slug
+5. Dates always YYYY-MM-DD (used only in content body where contextually relevant)
 6. Never reproduce verbatim code — describe the pattern or technique
-7. Section links must match their section type (Related Concepts → concepts/ only, etc.)
+7. Section links must match their section type
 8. One source page per book — if inventory shows it exists, include it as an UPDATE

File diff suppressed because it is too large.


@@ -0,0 +1,433 @@
# Source Back-References Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** After the LLM produces wiki pages for an ingestion, automatically inject a `## Sources` back-reference on every concept and entity page that the source page links to.
**Architecture:** A new `injectSourceRefs` post-processing step is inserted between `Resolve` and `mergeAll` in `pipeline.Run`. It finds the source page in the proposed batch, extracts all `[[slug|...]]` wikilinks, then calls `wiki.Merge` with a minimal patch page to add the back-reference. `wiki.Merge` already treats `## Sources` as a bullet section with deduplication — no custom section parsing is needed. For concepts/entities that exist on disk but weren't proposed in the current batch (the common case on re-ingestion), the function loads them from disk and adds them to the pages list so they are updated.
**Tech Stack:** Go stdlib (`regexp`, `os`, `path/filepath`, `strings`), existing `wiki.Merge` and `wiki.Page` types.
---
## File Structure
**New files:**
- `ingestion/internal/pipeline/refs.go`: `injectSourceRefs`, `addSourceRef`, `extractWikilinks`, `findSourcePage`, `findInInventory`
- `ingestion/internal/pipeline/refs_test.go` — table-driven tests
**Modified files:**
- `ingestion/internal/pipeline/pipeline.go` — insert `injectSourceRefs` call between `Resolve` and `mergeAll`
---
### Task 1: `refs.go` — source back-reference injection
**Files:**
- Create: `ingestion/internal/pipeline/refs_test.go`
- Create: `ingestion/internal/pipeline/refs.go`
- [ ] **Step 1: Write the failing tests**
```go
// ingestion/internal/pipeline/refs_test.go
package pipeline
import (
"os"
"path/filepath"
"testing"
"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
"github.com/mathiasbq/hyperguild/ingestion/internal/wiki"
)
// makeInventory builds a minimal inventory for test use.
func makeInventory(concepts, entities []string) map[wiki.PageType][]wiki.Entry {
inv := map[wiki.PageType][]wiki.Entry{
wiki.PageTypeConcept: {},
wiki.PageTypeEntity: {},
wiki.PageTypeSource: {},
}
for _, slug := range concepts {
inv[wiki.PageTypeConcept] = append(inv[wiki.PageTypeConcept], wiki.Entry{Slug: slug, Title: slug})
}
for _, slug := range entities {
inv[wiki.PageTypeEntity] = append(inv[wiki.PageTypeEntity], wiki.Entry{Slug: slug, Title: slug})
}
return inv
}
func TestInjectSourceRefs_NoSourcePage(t *testing.T) {
pages := []wiki.Page{
{Path: "wiki/concepts/foo.md", Content: "---\ntitle: Foo\n---\n\n## Definition\n\nFoo.\n"},
}
got := injectSourceRefs(pages, makeInventory(nil, nil), t.TempDir())
assert.Equal(t, pages, got)
}
func TestInjectSourceRefs_InjectsIntoProposedConcept(t *testing.T) {
pages := []wiki.Page{
{
Path: "wiki/sources/my-article.md",
Content: "---\ntitle: My Article\n---\n\n## Summary\n\nSee [[domain-driven-design|Domain Driven Design]].\n",
},
{
Path: "wiki/concepts/domain-driven-design.md",
Content: "---\ntitle: Domain Driven Design\n---\n\n## Definition\n\nA methodology.\n",
},
}
got := injectSourceRefs(pages, makeInventory(nil, nil), t.TempDir())
require.Len(t, got, 2)
assert.Contains(t, got[1].Content, "## Sources")
assert.Contains(t, got[1].Content, "[[my-article|My Article]]")
}
func TestInjectSourceRefs_LoadsConceptFromDisk(t *testing.T) {
brainDir := t.TempDir()
conceptDir := filepath.Join(brainDir, "wiki", "concepts")
require.NoError(t, os.MkdirAll(conceptDir, 0o755))
require.NoError(t, os.WriteFile(
filepath.Join(conceptDir, "shape-up.md"),
[]byte("---\ntitle: Shape Up\n---\n\n## Definition\n\nA methodology.\n"),
0o644,
))
pages := []wiki.Page{
{
Path: "wiki/sources/my-article.md",
Content: "---\ntitle: My Article\n---\n\n## Summary\n\nSee [[shape-up|Shape Up]].\n",
},
}
inv := makeInventory([]string{"shape-up"}, nil)
got := injectSourceRefs(pages, inv, brainDir)
// Should have loaded shape-up.md from disk and added it with source ref.
require.Len(t, got, 2)
var conceptPage wiki.Page
for _, p := range got {
if p.Path == "wiki/concepts/shape-up.md" {
conceptPage = p
}
}
assert.Contains(t, conceptPage.Content, "## Sources")
assert.Contains(t, conceptPage.Content, "[[my-article|My Article]]")
// Original content preserved.
assert.Contains(t, conceptPage.Content, "## Definition")
}
func TestInjectSourceRefs_NoSelfReference(t *testing.T) {
pages := []wiki.Page{
{
Path: "wiki/sources/my-article.md",
Content: "---\ntitle: My Article\n---\n\n## Summary\n\nSelf-link [[my-article|My Article]].\n",
},
}
got := injectSourceRefs(pages, makeInventory(nil, nil), t.TempDir())
// Only one page — source should not reference itself.
assert.Len(t, got, 1)
}
func TestInjectSourceRefs_DeduplicatesOnReingestion(t *testing.T) {
// Concept already has source ref from a prior ingestion.
pages := []wiki.Page{
{
Path: "wiki/sources/my-article.md",
Content: "---\ntitle: My Article\n---\n\n## Summary\n\nSee [[ddd|DDD]].\n",
},
{
Path: "wiki/concepts/ddd.md",
Content: "---\ntitle: DDD\n---\n\n## Definition\n\nA thing.\n\n## Sources\n\n- [[my-article|My Article]]\n",
},
}
got := injectSourceRefs(pages, makeInventory(nil, nil), t.TempDir())
require.Len(t, got, 2)
// The source ref must appear exactly once.
count := 0
for _, line := range splitLines(got[1].Content) {
if line == "- [[my-article|My Article]]" {
count++
}
}
assert.Equal(t, 1, count, "source ref should appear exactly once")
}
func TestInjectSourceRefs_InjectsIntoEntity(t *testing.T) {
pages := []wiki.Page{
{
Path: "wiki/sources/book.md",
Content: "---\ntitle: Book\n---\n\n## Summary\n\nBy [[ryan-singer|Ryan Singer]].\n",
},
{
Path: "wiki/entities/ryan-singer.md",
Content: "---\ntitle: Ryan Singer\n---\n\n## Description\n\nA designer.\n",
},
}
got := injectSourceRefs(pages, makeInventory(nil, nil), t.TempDir())
require.Len(t, got, 2)
var entity wiki.Page
for _, p := range got {
if p.Path == "wiki/entities/ryan-singer.md" {
entity = p
}
}
assert.Contains(t, entity.Content, "[[book|Book]]")
}
func TestExtractWikilinks(t *testing.T) {
content := "See [[foo|Foo]] and [[bar|Bar]] and [[foo|Foo again]]."
got := extractWikilinks(content)
assert.True(t, got["foo"])
assert.True(t, got["bar"])
assert.Len(t, got, 2, "duplicate slugs should be deduplicated")
}
// splitLines is a test helper.
func splitLines(s string) []string {
var out []string
for _, l := range splitNewlines(s) {
if l != "" {
out = append(out, l)
}
}
return out
}
func splitNewlines(s string) []string {
var lines []string
start := 0
for i, c := range s {
if c == '\n' {
lines = append(lines, s[start:i])
start = i + 1
}
}
lines = append(lines, s[start:])
return lines
}
```
- [ ] **Step 2: Run to verify they fail**
```bash
cd /Users/mathias/Documents/local-dev/AI/hyperguild/.worktrees/feat-source-backrefs/ingestion && go test ./internal/pipeline/... -run "TestInjectSourceRefs|TestExtractWikilinks" -v
```
Expected: compile error — `injectSourceRefs` and `extractWikilinks` not defined.
- [ ] **Step 3: Implement refs.go**
```go
// ingestion/internal/pipeline/refs.go
package pipeline
import (
"os"
"path/filepath"
"regexp"
"strings"
"github.com/mathiasbq/hyperguild/ingestion/internal/wiki"
)
var wikilinkRE = regexp.MustCompile(`\[\[([^|\]]+)\|`)
// injectSourceRefs finds the source page in the proposed batch, extracts its wikilinks,
// and injects a back-reference into every linked concept or entity page.
// Pages that exist on disk but are not in the current batch are loaded and appended
// so they will be updated on write.
func injectSourceRefs(pages []wiki.Page, inventory map[wiki.PageType][]wiki.Entry, brainDir string) []wiki.Page {
sourceSlug, sourceTitle, found := findSourcePage(pages)
if !found {
return pages
}
// Locate source page content for wikilink extraction.
var sourceContent string
for _, p := range pages {
if strings.HasPrefix(p.Path, "wiki/sources/") &&
strings.TrimSuffix(filepath.Base(p.Path), ".md") == sourceSlug {
sourceContent = p.Content
break
}
}
linkedSlugs := extractWikilinks(sourceContent)
sourceRef := "- [[" + sourceSlug + "|" + sourceTitle + "]]"
// Build slug → index map for proposed pages (excluding wiki/sources/).
bySlug := make(map[string]int, len(pages))
for i, p := range pages {
if !strings.HasPrefix(p.Path, "wiki/sources/") {
bySlug[strings.TrimSuffix(filepath.Base(p.Path), ".md")] = i
}
}
for slug := range linkedSlugs {
if slug == sourceSlug {
continue // no self-reference
}
if idx, ok := bySlug[slug]; ok {
// Concept/entity is in the proposed batch — inject inline.
pages[idx] = addSourceRef(pages[idx], sourceRef)
continue
}
// Not in proposed batch — look for it in the inventory (exists on disk).
pt, ok := findInInventory(slug, inventory)
if !ok {
continue
}
diskPath := filepath.Join(brainDir, "wiki", string(pt), slug+".md")
b, err := os.ReadFile(diskPath)
if err != nil {
continue // page not found on disk; skip
}
page := wiki.Page{
Path: "wiki/" + string(pt) + "/" + slug + ".md",
Content: string(b),
}
pages = append(pages, addSourceRef(page, sourceRef))
}
return pages
}
// addSourceRef injects sourceRef into the ## Sources bullet section of page.
// Uses wiki.Merge so that existing Sources entries are deduplicated and all
// other sections are preserved unchanged.
func addSourceRef(page wiki.Page, sourceRef string) wiki.Page {
patch := wiki.Page{
Path: page.Path,
Content: "\n## Sources\n\n" + sourceRef + "\n",
}
return wiki.Merge(page, patch)
}
// extractWikilinks returns the set of slugs referenced as [[slug|...]] in content.
func extractWikilinks(content string) map[string]bool {
slugs := make(map[string]bool)
for _, m := range wikilinkRE.FindAllStringSubmatch(content, -1) {
slugs[m[1]] = true
}
return slugs
}
// findSourcePage returns the slug and title of the first wiki/sources/ page in pages.
func findSourcePage(pages []wiki.Page) (slug, title string, found bool) {
for _, p := range pages {
if strings.HasPrefix(p.Path, "wiki/sources/") {
slug = strings.TrimSuffix(filepath.Base(p.Path), ".md")
title = extractTitle(p.Content)
if title == "" {
title = slug
}
return slug, title, true
}
}
return "", "", false
}
// findInInventory returns the PageType for a slug if it appears in the inventory.
func findInInventory(slug string, inventory map[wiki.PageType][]wiki.Entry) (wiki.PageType, bool) {
for pt, entries := range inventory {
for _, e := range entries {
if e.Slug == slug {
return pt, true
}
}
}
return "", false
}
```
- [ ] **Step 4: Run all pipeline tests**
```bash
cd /Users/mathias/Documents/local-dev/AI/hyperguild/.worktrees/feat-source-backrefs/ingestion && go test ./internal/pipeline/... -v
```
Expected: all existing tests PASS + 7 new refs tests PASS.
- [ ] **Step 5: Commit**
```bash
cd /Users/mathias/Documents/local-dev/AI/hyperguild/.worktrees/feat-source-backrefs && git add ingestion/internal/pipeline/refs.go ingestion/internal/pipeline/refs_test.go && git commit -m "feat(pipeline): inject source back-references into concept and entity pages"
```
---
### Task 2: Wire injectSourceRefs into pipeline.Run
**Files:**
- Modify: `ingestion/internal/pipeline/pipeline.go`
- [ ] **Step 1: Insert the call**
In `pipeline.go`, locate:
```go
resolved := Resolve(allPages, inventory)
merged := mergeAll(resolved)
```
Replace with:
```go
resolved := Resolve(allPages, inventory)
withRefs := injectSourceRefs(resolved, inventory, brainDir)
merged := mergeAll(withRefs)
```
No import changes needed — same package.
- [ ] **Step 2: Run all pipeline tests**
```bash
cd /Users/mathias/Documents/local-dev/AI/hyperguild/.worktrees/feat-source-backrefs/ingestion && go test ./internal/pipeline/... -v
```
Expected: all tests PASS. The existing `TestRun_WritesPages` and `TestRun_DryRunDoesNotWrite` use LLM mocks that return source pages with no wikilinks to concepts — `injectSourceRefs` is a no-op for them.
- [ ] **Step 3: Run full test suite + lint**
```bash
cd /Users/mathias/Documents/local-dev/AI/hyperguild/.worktrees/feat-source-backrefs/ingestion && go test ./... && golangci-lint run ./...
```
Expected: all packages PASS, 0 lint issues.
- [ ] **Step 4: Commit**
```bash
cd /Users/mathias/Documents/local-dev/AI/hyperguild/.worktrees/feat-source-backrefs && git add ingestion/internal/pipeline/pipeline.go && git commit -m "feat(pipeline): wire source back-reference injection into Run"
```
---
## Self-Review
**Spec coverage:**
| Requirement | Task |
|---|---|
| Concepts get `## Sources` back-link to ingested source | Task 1 |
| Entities get `## Sources` back-link | Task 1 (TestInjectSourceRefs_InjectsIntoEntity) |
| Existing pages on disk get updated with new source | Task 1 (TestInjectSourceRefs_LoadsConceptFromDisk) |
| Re-ingestion of same source does not duplicate the ref | Task 1 (TestInjectSourceRefs_DeduplicatesOnReingestion) |
| Source page does not reference itself | Task 1 (TestInjectSourceRefs_NoSelfReference) |
| No-op when batch has no source page | Task 1 (TestInjectSourceRefs_NoSourcePage) |
| Wired into Run between Resolve and mergeAll | Task 2 |
| Full test suite and lint pass | Task 2 Step 3 |
**Placeholder scan:** None.
**Type consistency:** `injectSourceRefs([]wiki.Page, map[wiki.PageType][]wiki.Entry, string) []wiki.Page` — used identically in refs.go (definition) and pipeline.go (call site).


@@ -0,0 +1,148 @@
# Level 3: Strip Slug Authority from LLM — Design Spec
## Problem
The ingestion pipeline currently asks the LLM to produce full wiki pages including the file path (e.g. `wiki/sources/finbert-huggingface.md`). This causes two classes of bug:
1. **Slug proliferation** — the LLM invents different slugs for the same concept across chunks or runs, producing duplicate pages that diverge in content.
2. **Unstable paths** — the LLM may shorten, expand, or vary titles, making deduplication via `Resolve` unreliable because the slug mismatch is upstream of the normalizer.
## Solution
Strip slug authority from the LLM entirely. The LLM returns a minimal structured object. The pipeline computes all slugs deterministically from titles using `wiki.Slug(title)`.
---
## LLM JSON Contract
### Output format (per page)
```json
{
"title": "FinBERT",
"type": "concept",
"subtype": "framework",
"domain": "ai-llm",
"content": "## Definition\n\nA BERT-based model fine-tuned for financial sentiment...\n\n## Related\n\n- [[Sentiment Analysis]]\n- [[Hugging Face]]\n"
}
```
**Fields:**
| Field | Required | Values |
|-------|----------|--------|
| `title` | yes | Human-readable title, e.g. "FinBERT" |
| `type` | yes | `"source"` \| `"concept"` \| `"entity"` |
| `subtype` | for entity/source | entity: `person\|company\|tool\|model\|framework\|technology`; source: `article\|pdf\|book\|video\|note\|project` |
| `domain` | no | tag string, e.g. `ai-llm`, `finance` |
| `content` | yes | Markdown body sections only — no frontmatter, no path |
**Wikilinks in content:** `[[Display Name]]` only. No slug. The pipeline canonicalizes to `[[slug|Display Name]]` in a post-processing step.
**The LLM never writes slugs, paths, or frontmatter.**
---
## Pipeline Changes
### New type: `RawPage`
```go
type RawPage struct {
Title string
Type string // "source" | "concept" | "entity"
Subtype string
Domain string
Content string
}
```
### New step order
```
ParseRawPages → BuildPages → Resolve → CanonicalizeLinks → injectSourceRefs → mergeAll → write
```
### Step descriptions
**`ParseRawPages(output string) ([]RawPage, []string)`**
Replaces `ParsePages`. Deserializes JSON objects with the new schema. Same truncation-recovery logic as today. Returns `(pages, warnings)`.
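As a rough illustration of the contract above, the sketch below decodes the new JSON schema into `RawPage` values and collects warnings instead of failing on bad entries. The function name `parseRawPagesSketch`, the JSON struct tags, and the choice to skip pages with a missing title or unknown type are assumptions for illustration. The truncation-recovery logic mentioned above is not shown in this diff and is left out here.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// RawPage mirrors the struct in this spec; the JSON tags are an assumption.
type RawPage struct {
	Title   string `json:"title"`
	Type    string `json:"type"`
	Subtype string `json:"subtype"`
	Domain  string `json:"domain"`
	Content string `json:"content"`
}

// validTypes lists the three page types named in the schema.
var validTypes = map[string]bool{"source": true, "concept": true, "entity": true}

// parseRawPagesSketch is a hypothetical stand-in for ParseRawPages.
// It returns the usable pages plus one warning per skipped entry.
func parseRawPagesSketch(output string) ([]RawPage, []string) {
	var decoded []RawPage
	if err := json.Unmarshal([]byte(output), &decoded); err != nil {
		return nil, []string{"invalid JSON: " + err.Error()}
	}
	var pages []RawPage
	var warnings []string
	for i, p := range decoded {
		switch {
		case strings.TrimSpace(p.Title) == "":
			warnings = append(warnings, fmt.Sprintf("page %d: missing title, skipped", i))
		case !validTypes[p.Type]:
			warnings = append(warnings, fmt.Sprintf("page %d (%q): unknown type %q, skipped", i, p.Title, p.Type))
		default:
			pages = append(pages, p)
		}
	}
	return pages, warnings
}

func main() {
	out := `[{"title":"FinBERT","type":"concept","content":"## Definition\n"},{"title":"","type":"entity"}]`
	pages, warnings := parseRawPagesSketch(out)
	fmt.Println(len(pages), "page(s) kept")
	for _, w := range warnings {
		fmt.Println("warning:", w)
	}
}
```

Returning warnings rather than an error matches the `(pages, warnings)` shape described above: one malformed element does not discard the rest of the batch.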
**`BuildPages(rawPages []RawPage, sourceSlug, date string) []wiki.Page`**
Converts `RawPage → wiki.Page`:
- Computes slug: `wiki.Slug(page.Title)`
- Computes path: `wiki/<type>/<slug>.md`
- Assembles frontmatter:
```
---
title: <Title>
type: <type>
subtype: <subtype> # omitted if empty
domain: <domain> # omitted if empty
created: <date>
source: <sourceSlug> # omitted for the source page itself
---
```
- Concatenates frontmatter + content
**`Resolve(pages []wiki.Page, inventory) []wiki.Page`**
Unchanged. Normalizes near-duplicate titles to existing inventory slugs.
**`CanonicalizeLinks(pages []wiki.Page, inventory) ([]wiki.Page, []string)`**
New. Builds a title→slug map from inventory + current batch. Replaces `[[Display Name]]` with `[[slug|Display Name]]` in each page's content. Titles with no known slug are left as-is and returned as warnings.
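A minimal sketch of the rewrite described above, assuming the title→slug map has already been built and that titles are matched exactly. The real `CanonicalizeLinks` works on `[]wiki.Page` and the inventory, and its matching rules are not shown in this diff. The names `bareLinkRE` and `canonicalizeSketch` are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// bareLinkRE matches [[Display Name]] but not [[slug|Display Name]]:
// the character class excludes both ']' and '|'.
var bareLinkRE = regexp.MustCompile(`\[\[([^\]|]+)\]\]`)

// canonicalizeSketch rewrites bare wikilinks using titleToSlug.
// Links with no known slug are left untouched and reported as warnings.
func canonicalizeSketch(content string, titleToSlug map[string]string) (string, []string) {
	var warnings []string
	out := bareLinkRE.ReplaceAllStringFunc(content, func(m string) string {
		name := m[2 : len(m)-2] // strip the surrounding [[ and ]]
		slug, ok := titleToSlug[name]
		if !ok {
			warnings = append(warnings, "no slug for link "+m)
			return m
		}
		return "[[" + slug + "|" + name + "]]"
	})
	return out, warnings
}

func main() {
	titles := map[string]string{"Shape Up": "shape-up"}
	out, warnings := canonicalizeSketch("See [[Shape Up]] and [[Unknown Thing]].", titles)
	fmt.Println(out)
	fmt.Println(warnings)
}
```

Links that already contain a pipe are skipped by the pattern, so running the step twice on the same content should leave canonical links unchanged.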
**`injectSourceRefs`**
Unchanged. Reads `[[slug|...]]` links (post-canonicalization) to inject back-references.
**`mergeAll → write`**
Unchanged.
### `pipeline.Run` signature change
```go
func Run(ctx context.Context, cfg Config, brainDir, content, source string, dryRun bool) (Result, error)
```
`source` is already passed (it's the display name / filename). A new internal `sourceSlug` is derived from it via `wiki.Slug(source)` before calling `BuildPages`. No API change needed.
---
## Files Changed
| File | Change |
|------|--------|
| `ingestion/internal/pipeline/parse.go` | Replace `ParsePages` with `ParseRawPages` + `RawPage` type |
| `ingestion/internal/pipeline/build.go` | New file: `BuildPages` |
| `ingestion/internal/pipeline/links.go` | New file: `CanonicalizeLinks` |
| `ingestion/internal/pipeline/pipeline.go` | Wire new steps; derive `sourceSlug` from `source` |
| `ingestion/internal/pipeline/prompt.go` | New system prompt + `BuildPrompt` for new JSON format |
| `brain/schema.md` | Update wikilink format and JSON schema docs |
`resolve.go`, `refs.go`, `backfill.go`, `merge.go` — no changes.
---
## Wikilink Format
- **LLM output**: `[[Display Name]]`
- **Stored on disk**: `[[slug|Display Name]]`
- **`CanonicalizeLinks`** converts between the two using the inventory
This matches Obsidian's display-alias syntax that the existing codebase already uses.
---
## Testing Strategy
- `ParseRawPages`: table-driven, cover valid JSON, truncated output, unknown type, missing title
- `BuildPages`: table-driven, cover slug computation, frontmatter assembly, source page (no `source:` field), entity with subtype
- `CanonicalizeLinks`: cover known title → replaced, unknown title → left as-is + warning, multiple links in one page
- Integration test: full `Run` call with mock LLM returning new JSON format, assert no slug duplication across two chunks of the same source
---
## Out of Scope
- Re-ingesting existing pages (user will trigger manually after deploy)
- Changing the `BackfillRefs` endpoint (already correct, slug-based)
- Changing the `Resolve` fuzzy-match algorithm


@@ -68,6 +68,8 @@ func main() {
 	mux.HandleFunc("POST /write", h.Write)
 	mux.HandleFunc("POST /ingest", h.Ingest)
 	mux.HandleFunc("POST /ingest-path", h.IngestPath)
+	mux.HandleFunc("POST /ingest-raw", h.IngestRaw)
+	mux.HandleFunc("POST /backfill-refs", h.BackfillRefs)
 	addr := ":" + port
 	watchIntervalLog := "disabled"


@@ -272,6 +272,60 @@ func (h *Handler) IngestPath(w http.ResponseWriter, r *http.Request) {
writeJSON(w, ingestResponse{Pages: allPages, Warnings: allWarnings}) writeJSON(w, ingestResponse{Pages: allPages, Warnings: allWarnings})
} }
type ingestRawRequest struct {
Source string `json:"source"`
Pages []pipeline.RawPage `json:"pages"`
DryRun bool `json:"dry_run"`
}
// IngestRaw handles POST /ingest-raw — run the pipeline on pre-parsed RawPages,
// skipping the LLM extraction step. Use when the caller has already produced
// structured page data (e.g. from a more capable model or manual curation).
func (h *Handler) IngestRaw(w http.ResponseWriter, r *http.Request) {
var req ingestRawRequest
if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
writeError(w, http.StatusBadRequest, "invalid JSON")
return
}
if strings.TrimSpace(req.Source) == "" {
writeError(w, http.StatusBadRequest, "source is required")
return
}
if len(req.Pages) == 0 {
writeError(w, http.StatusBadRequest, "pages is required and must be non-empty")
return
}
result, err := pipeline.RunRaw(h.brainDir, req.Source, req.Pages, req.DryRun)
if err != nil {
h.logger.Error("ingest-raw failed", "source", req.Source, "err", err)
writeError(w, http.StatusInternalServerError, "ingest error")
return
}
pages := result.Pages
if pages == nil {
pages = []string{}
}
warnings := result.Warnings
if warnings == nil {
warnings = []string{}
}
writeJSON(w, ingestResponse{Pages: pages, Warnings: warnings})
}
// BackfillRefs handles POST /backfill-refs — injects source back-references
// into all concept and entity pages based on existing wiki/sources/ pages.
func (h *Handler) BackfillRefs(w http.ResponseWriter, r *http.Request) {
n, err := pipeline.BackfillRefs(r.Context(), h.brainDir)
if err != nil {
h.logger.Error("backfill-refs failed", "err", err)
writeError(w, http.StatusInternalServerError, "backfill error")
return
}
writeJSON(w, map[string]int{"updated": n})
}
func writeJSON(w http.ResponseWriter, v any) {
w.Header().Set("Content-Type", "application/json")
json.NewEncoder(w).Encode(v) //nolint:errcheck
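
For callers of `POST /ingest-raw`, the request shape and the two up-front checks can be sketched as follows. The lowercase `rawPage` and `ingestRawRequest` types mirror the JSON tags shown in the diff; `validate` and `exampleRequest` are hypothetical helpers, not part of the repository:

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// rawPage mirrors the JSON shape of pipeline.RawPage.
type rawPage struct {
	Title   string `json:"title"`
	Type    string `json:"type"`
	Subtype string `json:"subtype"`
	Domain  string `json:"domain"`
	Content string `json:"content"`
}

// ingestRawRequest mirrors the handler's request struct.
type ingestRawRequest struct {
	Source string    `json:"source"`
	Pages  []rawPage `json:"pages"`
	DryRun bool      `json:"dry_run"`
}

// validate reproduces the two checks IngestRaw performs before it runs the
// pipeline. It returns the error message, or "" when the request passes.
func validate(req ingestRawRequest) string {
	if strings.TrimSpace(req.Source) == "" {
		return "source is required"
	}
	if len(req.Pages) == 0 {
		return "pages is required and must be non-empty"
	}
	return ""
}

// exampleRequest builds a dry-run request with a single concept page.
func exampleRequest() ingestRawRequest {
	return ingestRawRequest{
		Source: "test-article",
		Pages: []rawPage{
			{Title: "Test Concept", Type: "concept", Domain: "Testing", Content: "A concept for testing."},
		},
		DryRun: true,
	}
}

func main() {
	req := exampleRequest()
	if msg := validate(req); msg != "" {
		fmt.Println("rejected:", msg)
		return
	}
	body, err := json.MarshalIndent(req, "", "  ")
	if err != nil {
		fmt.Println("marshal:", err)
		return
	}
	fmt.Println(string(body))
}
```

The handler shown in the diff does not validate individual pages; judging by the earlier `BuildPages` commit, a page with an empty title is presumably skipped later and reported as a warning rather than rejected with a 400.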


@@ -20,9 +20,9 @@ import (
"github.com/mathiasbq/hyperguild/ingestion/internal/pipeline"
)
// stubComplete returns a fixed JSON page so tests never call a real LLM. // stubComplete returns a fixed JSON RawPage so tests never call a real LLM.
func stubComplete(_ context.Context, _, _ string) (string, error) {
return `[{"path":"wiki/sources/test-source.md","content":"# Test Source\n\nSome content here.\n"}]`, nil return `[{"title":"Test Source","type":"source","subtype":"article","content":"## Summary\n\nSome content here.\n"}]`, nil
}
func stubPipelineCfg() pipeline.Config {
@@ -226,6 +226,85 @@ func TestIngestPath_File(t *testing.T) {
assert.NotEmpty(t, pagesSlice)
}
// ---------------------------------------------------------------------------
// POST /ingest-raw
// ---------------------------------------------------------------------------
func TestIngestRaw_Validation(t *testing.T) {
cases := []struct {
name string
body map[string]any
}{
{"missing source", map[string]any{"pages": []any{map[string]any{"title": "X", "type": "concept", "content": "x"}}}},
{"missing pages", map[string]any{"source": "test-source"}},
{"empty pages", map[string]any{"source": "test-source", "pages": []any{}}},
}
for _, tc := range cases {
t.Run(tc.name, func(t *testing.T) {
_, h := setup(t)
body, _ := json.Marshal(tc.body)
req := httptest.NewRequest(http.MethodPost, "/ingest-raw", bytes.NewReader(body))
rec := httptest.NewRecorder()
h.IngestRaw(rec, req)
assert.Equal(t, http.StatusBadRequest, rec.Code)
})
}
}
func TestIngestRaw_Success(t *testing.T) {
dir, h := setup(t)
body, _ := json.Marshal(map[string]any{
"source": "test-article",
"pages": []any{
map[string]any{"title": "Test Article", "type": "source", "subtype": "article", "domain": "Testing", "content": "## Summary\n\nThis is a test article about [[Test Concept]].\n"},
map[string]any{"title": "Test Concept", "type": "concept", "domain": "Testing", "content": "A concept for testing.\n"},
},
})
req := httptest.NewRequest(http.MethodPost, "/ingest-raw", bytes.NewReader(body))
rec := httptest.NewRecorder()
h.IngestRaw(rec, req)
require.Equal(t, http.StatusOK, rec.Code)
var resp map[string]any
require.NoError(t, json.Unmarshal(rec.Body.Bytes(), &resp))
pages, ok := resp["pages"].([]any)
require.True(t, ok, "response must contain a pages array")
assert.Len(t, pages, 2)
// Verify files were written
sourcePath := filepath.Join(dir, "wiki", "sources", "test-article.md")
assert.FileExists(t, sourcePath)
conceptPath := filepath.Join(dir, "wiki", "concepts", "test-concept.md")
assert.FileExists(t, conceptPath)
}
func TestIngestRaw_DryRun(t *testing.T) {
dir, h := setup(t)
body, _ := json.Marshal(map[string]any{
"source": "dry-run-test",
"pages": []any{
map[string]any{"title": "Dry Run Source", "type": "source", "subtype": "article", "content": "Content."},
},
"dry_run": true,
})
req := httptest.NewRequest(http.MethodPost, "/ingest-raw", bytes.NewReader(body))
rec := httptest.NewRecorder()
h.IngestRaw(rec, req)
require.Equal(t, http.StatusOK, rec.Code)
var resp map[string]any
require.NoError(t, json.Unmarshal(rec.Body.Bytes(), &resp))
pages, ok := resp["pages"].([]any)
require.True(t, ok, "response must contain a pages array")
assert.NotEmpty(t, pages)
// Verify no files were written
sourcePath := filepath.Join(dir, "wiki", "sources", "dry-run-test.md")
assert.NoFileExists(t, sourcePath)
}
func TestIngestPath_Directory(t *testing.T) {
_, h := setup(t)


@@ -0,0 +1,91 @@
// ingestion/internal/pipeline/backfill.go
package pipeline
import (
"context"
"fmt"
"os"
"path/filepath"
"strings"
"github.com/mathiasbq/hyperguild/ingestion/internal/wiki"
)
// BackfillRefs walks wiki/sources/ and injects source back-references into every
// concept and entity page that each source links to.
// Changes for all sources are accumulated in memory before writing, so multiple
// sources referencing the same concept are merged in one pass.
// Deduplication is handled by wiki.Merge — running this multiple times is safe.
// Returns the number of concept/entity pages written.
func BackfillRefs(ctx context.Context, brainDir string) (int, error) {
inventory, err := wiki.LoadInventory(brainDir)
if err != nil {
return 0, fmt.Errorf("load inventory: %w", err)
}
sourcesDir := filepath.Join(brainDir, "wiki", "sources")
entries, err := os.ReadDir(sourcesDir)
if err != nil {
if os.IsNotExist(err) {
return 0, nil
}
return 0, fmt.Errorf("read sources dir: %w", err)
}
// Accumulate all changes before writing: relPath → updated Page.
// Collecting first means two sources that both link the same concept
// get both refs merged before a single write.
pending := make(map[string]wiki.Page)
for _, e := range entries {
if ctx.Err() != nil {
return 0, ctx.Err()
}
if e.IsDir() || !strings.HasSuffix(e.Name(), ".md") {
continue
}
b, err := os.ReadFile(filepath.Join(sourcesDir, e.Name()))
if err != nil {
continue
}
sourceContent := string(b)
sourceSlug := strings.TrimSuffix(e.Name(), ".md")
sourceTitle := extractTitle(sourceContent)
if sourceTitle == "" {
sourceTitle = sourceSlug
}
sourceRef := "- [[" + sourceSlug + "|" + sourceTitle + "]]"
for slug := range extractWikilinks(sourceContent) {
if slug == sourceSlug {
continue
}
pt, ok := findInInventory(slug, inventory)
if !ok {
continue
}
relPath := "wiki/" + string(pt) + "/" + slug + ".md"
// Start from already-accumulated version if we've seen this page.
page, seen := pending[relPath]
if !seen {
raw, err := os.ReadFile(filepath.Join(brainDir, filepath.FromSlash(relPath)))
if err != nil {
continue
}
page = wiki.Page{Path: relPath, Content: string(raw)}
}
pending[relPath] = addSourceRef(page, sourceRef)
}
}
for relPath, page := range pending {
dest := filepath.Join(brainDir, filepath.FromSlash(relPath))
if err := os.WriteFile(dest, []byte(page.Content), 0o644); err != nil {
return 0, fmt.Errorf("write %s: %w", relPath, err)
}
}
return len(pending), nil
}
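
The accumulate-then-write pattern in `BackfillRefs` can be illustrated with an in-memory sketch. Everything here is a simplification under stated assumptions: `source`, `addRef`, and `backfill` are hypothetical names, a map stands in for the files on disk, and `addRef` only approximates `addSourceRef`, whose implementation is not part of this diff.

```go
package main

import (
	"fmt"
	"strings"
)

// source is a simplified stand-in for a file under wiki/sources/.
type source struct {
	slug, title string
	links       []string // slugs this source links to
}

// addRef appends ref under a trailing "## Sources" section unless that exact
// line is already present. Hypothetical stand-in for addSourceRef; it assumes
// "## Sources" is the last section of the page.
func addRef(content, ref string) string {
	for _, line := range strings.Split(content, "\n") {
		if line == ref {
			return content // already referenced: repeated runs stay idempotent
		}
	}
	if !strings.Contains(content, "\n## Sources\n") {
		content = strings.TrimRight(content, "\n") + "\n\n## Sources\n"
	}
	return strings.TrimRight(content, "\n") + "\n" + ref + "\n"
}

// backfill collects every updated page in memory and returns the result; a
// caller would then write each page once. disk maps a slug to page content.
func backfill(disk map[string]string, sources []source) map[string]string {
	pending := make(map[string]string)
	for _, s := range sources {
		ref := "- [[" + s.slug + "|" + s.title + "]]"
		for _, slug := range s.links {
			page, seen := pending[slug]
			if !seen {
				raw, ok := disk[slug]
				if !ok {
					continue // unknown slug: skipped, as in the diff
				}
				page = raw
			}
			pending[slug] = addRef(page, ref)
		}
	}
	return pending
}

func main() {
	disk := map[string]string{"shaping": "## Definition\n\nA design activity.\n"}
	sources := []source{
		{slug: "book-a", title: "Book A", links: []string{"shaping", "ghost-slug"}},
		{slug: "book-b", title: "Book B", links: []string{"shaping"}},
	}
	for slug, content := range backfill(disk, sources) {
		fmt.Printf("%s:\n%s", slug, content)
	}
}
```

Because the second source starts from the already-accumulated page rather than re-reading the original, both references end up in the same final version, which is the property `TestBackfillRefs_MultipleSources` checks.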


@@ -0,0 +1,107 @@
// ingestion/internal/pipeline/backfill_test.go
package pipeline
import (
"context"
"os"
"path/filepath"
"testing"
"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
)
func setupBrainDir(t *testing.T) string {
t.Helper()
dir := t.TempDir()
for _, sub := range []string{"wiki/sources", "wiki/concepts", "wiki/entities"} {
require.NoError(t, os.MkdirAll(filepath.Join(dir, sub), 0o755))
}
return dir
}
func writeFile(t *testing.T, path, content string) {
t.Helper()
require.NoError(t, os.MkdirAll(filepath.Dir(path), 0o755))
require.NoError(t, os.WriteFile(path, []byte(content), 0o644))
}
func TestBackfillRefs_UpdatesConcept(t *testing.T) {
dir := setupBrainDir(t)
writeFile(t, filepath.Join(dir, "wiki/sources/shape-up.md"),
"---\ntitle: Shape Up\n---\n\n## Summary\n\nSee [[betting|Betting]].\n")
writeFile(t, filepath.Join(dir, "wiki/concepts/betting.md"),
"---\ntitle: Betting\n---\n\n## Definition\n\nA resource allocation technique.\n")
n, err := BackfillRefs(context.Background(), dir)
require.NoError(t, err)
assert.Equal(t, 1, n)
got, err := os.ReadFile(filepath.Join(dir, "wiki/concepts/betting.md"))
require.NoError(t, err)
assert.Contains(t, string(got), "## Sources")
assert.Contains(t, string(got), "[[shape-up|Shape Up]]")
assert.Contains(t, string(got), "## Definition") // original content preserved
}
func TestBackfillRefs_Deduplication(t *testing.T) {
dir := setupBrainDir(t)
writeFile(t, filepath.Join(dir, "wiki/sources/shape-up.md"),
"---\ntitle: Shape Up\n---\n\n## Summary\n\nSee [[betting|Betting]].\n")
writeFile(t, filepath.Join(dir, "wiki/concepts/betting.md"),
"---\ntitle: Betting\n---\n\n## Definition\n\nA technique.\n")
// Run twice — should not duplicate the ref.
_, err := BackfillRefs(context.Background(), dir)
require.NoError(t, err)
_, err = BackfillRefs(context.Background(), dir)
require.NoError(t, err)
got, err := os.ReadFile(filepath.Join(dir, "wiki/concepts/betting.md"))
require.NoError(t, err)
count := 0
for _, line := range splitLines(string(got)) {
if line == "- [[shape-up|Shape Up]]" {
count++
}
}
assert.Equal(t, 1, count, "ref should appear exactly once after two runs")
}
func TestBackfillRefs_MultipleSources(t *testing.T) {
dir := setupBrainDir(t)
writeFile(t, filepath.Join(dir, "wiki/sources/book-a.md"),
"---\ntitle: Book A\n---\n\n## Summary\n\nSee [[shaping|Shaping]].\n")
writeFile(t, filepath.Join(dir, "wiki/sources/book-b.md"),
"---\ntitle: Book B\n---\n\n## Summary\n\nAlso [[shaping|Shaping]].\n")
writeFile(t, filepath.Join(dir, "wiki/concepts/shaping.md"),
"---\ntitle: Shaping\n---\n\n## Definition\n\nA design activity.\n")
n, err := BackfillRefs(context.Background(), dir)
require.NoError(t, err)
assert.Equal(t, 1, n) // one concept page written
got, err := os.ReadFile(filepath.Join(dir, "wiki/concepts/shaping.md"))
require.NoError(t, err)
assert.Contains(t, string(got), "[[book-a|Book A]]")
assert.Contains(t, string(got), "[[book-b|Book B]]")
}
func TestBackfillRefs_NoSourcesDir(t *testing.T) {
dir := t.TempDir() // no wiki/sources subdir
n, err := BackfillRefs(context.Background(), dir)
require.NoError(t, err)
assert.Equal(t, 0, n)
}
func TestBackfillRefs_SkipsUnknownSlugs(t *testing.T) {
dir := setupBrainDir(t)
// Source links to a slug not in inventory and not on disk.
writeFile(t, filepath.Join(dir, "wiki/sources/article.md"),
"---\ntitle: Article\n---\n\n## Summary\n\nSee [[ghost-slug|Ghost]].\n")
n, err := BackfillRefs(context.Background(), dir)
require.NoError(t, err)
assert.Equal(t, 0, n)
}


@@ -0,0 +1,106 @@
// ingestion/internal/pipeline/build.go
package pipeline
import (
"fmt"
"strings"
"github.com/mathiasbq/hyperguild/ingestion/internal/wiki"
)
// BuildPages converts RawPages from the LLM into wiki.Pages with computed slugs,
// paths, and YAML frontmatter. sourceSlug is the slug of the source being ingested
// (derived from the filename, not the LLM title). Pages whose title resolves to an
// empty slug are skipped and returned as warnings instead.
func BuildPages(rawPages []RawPage, sourceSlug, date string) ([]wiki.Page, []string) {
out := make([]wiki.Page, 0, len(rawPages))
var warnings []string
for _, rp := range rawPages {
slug := computeSlug(rp, sourceSlug)
if slug == "" {
warnings = append(warnings, fmt.Sprintf("skipped page with empty title (type: %s)", rp.Type))
continue
}
out = append(out, buildPage(rp, sourceSlug, date))
}
return out, warnings
}
func computeSlug(rp RawPage, sourceSlug string) string {
if rp.Type == "source" {
return sourceSlug
}
return wiki.Slug(rp.Title)
}
func buildPage(rp RawPage, sourceSlug, date string) wiki.Page {
var slug, dir string
switch rp.Type {
case "source":
slug = sourceSlug
dir = "wiki/sources"
case "concept":
slug = wiki.Slug(rp.Title)
dir = "wiki/concepts"
case "entity":
slug = wiki.Slug(rp.Title)
dir = "wiki/entities"
default:
slug = wiki.Slug(rp.Title)
dir = "wiki/" + rp.Type
}
path := dir + "/" + slug + ".md"
fm := buildFrontmatter(rp, date)
return wiki.Page{
Path: path,
Content: fm + "\n" + rp.Content,
}
}
func buildFrontmatter(rp RawPage, date string) string {
var sb strings.Builder
sb.WriteString("---\n")
fmt.Fprintf(&sb, "title: %s\n", yamlScalar(rp.Title))
switch rp.Type {
case "source":
subtype := rp.Subtype
if subtype == "" {
subtype = "article"
}
fmt.Fprintf(&sb, "type: %s\n", yamlScalar(subtype))
if rp.Domain != "" {
fmt.Fprintf(&sb, "domain: %s\n", yamlScalar(rp.Domain))
}
fmt.Fprintf(&sb, "date_ingested: %s\n", date)
fmt.Fprintf(&sb, "last_updated: %s\n", date)
case "concept":
if rp.Domain != "" {
fmt.Fprintf(&sb, "domain: %s\n", yamlScalar(rp.Domain))
}
fmt.Fprintf(&sb, "last_updated: %s\n", date)
case "entity":
if rp.Subtype != "" {
fmt.Fprintf(&sb, "type: %s\n", yamlScalar(rp.Subtype))
}
if rp.Domain != "" {
fmt.Fprintf(&sb, "domain: %s\n", yamlScalar(rp.Domain))
}
fmt.Fprintf(&sb, "last_updated: %s\n", date)
default:
if rp.Domain != "" {
fmt.Fprintf(&sb, "domain: %s\n", yamlScalar(rp.Domain))
}
fmt.Fprintf(&sb, "last_updated: %s\n", date)
}
fmt.Fprintf(&sb, "aliases:\n - %s\n", yamlScalar(rp.Title))
sb.WriteString("---\n")
return sb.String()
}
func yamlScalar(s string) string {
return "'" + strings.ReplaceAll(s, "'", "''") + "'"
}
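
`yamlScalar` always emits a single-quoted YAML scalar, where the only character that needs escaping is the single quote itself (written as two single quotes). A small demonstration, with the function body copied from the diff and the sample titles chosen for illustration:

```go
package main

import (
	"fmt"
	"strings"
)

// yamlScalar wraps s in single quotes and doubles any embedded single quote,
// as in the diff.
func yamlScalar(s string) string {
	return "'" + strings.ReplaceAll(s, "'", "''") + "'"
}

func main() {
	titles := []string{
		"Shape Up: The Basecamp Method", // colon would otherwise start a mapping
		"O'Reilly",                      // embedded quote gets doubled
		"# not a comment",               // leading '#' would otherwise be a comment
	}
	for _, title := range titles {
		fmt.Printf("title: %s\n", yamlScalar(title))
	}
}
```

Quoting unconditionally keeps the frontmatter builder simple: there is no need to detect which titles contain YAML-significant characters.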


@@ -0,0 +1,167 @@
// ingestion/internal/pipeline/build_test.go
package pipeline
import (
"strings"
"testing"
"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
)
func TestBuildPages_SourcePage(t *testing.T) {
raw := []RawPage{
{
Title: "Shape Up",
Type: "source",
Subtype: "book",
Domain: "product-strategy",
Content: "## Summary\n\nA book about shaping product work.\n",
},
}
pages, warnings := BuildPages(raw, "shape-up", "2026-04-23")
require.Len(t, pages, 1)
assert.Empty(t, warnings)
p := pages[0]
assert.Equal(t, "wiki/sources/shape-up.md", p.Path)
assert.Contains(t, p.Content, "title: 'Shape Up'")
assert.Contains(t, p.Content, "type: 'book'")
assert.Contains(t, p.Content, "domain: 'product-strategy'")
assert.Contains(t, p.Content, "date_ingested: 2026-04-23")
assert.Contains(t, p.Content, "last_updated: 2026-04-23")
assert.Contains(t, p.Content, "aliases:\n - 'Shape Up'")
assert.Contains(t, p.Content, "## Summary")
assert.True(t, strings.HasPrefix(p.Content, "---\n"), "content must start with frontmatter")
}
func TestBuildPages_ConceptPage(t *testing.T) {
raw := []RawPage{
{
Title: "Betting",
Type: "concept",
Domain: "product-strategy",
Content: "## Definition\n\nA resource allocation technique.\n",
},
}
pages, warnings := BuildPages(raw, "shape-up", "2026-04-23")
require.Len(t, pages, 1)
assert.Empty(t, warnings)
p := pages[0]
assert.Equal(t, "wiki/concepts/betting.md", p.Path)
assert.Contains(t, p.Content, "title: 'Betting'")
assert.Contains(t, p.Content, "domain: 'product-strategy'")
assert.Contains(t, p.Content, "last_updated: 2026-04-23")
assert.Contains(t, p.Content, "aliases:\n - 'Betting'")
assert.NotContains(t, p.Content, "date_ingested")
assert.Contains(t, p.Content, "## Definition")
}
func TestBuildPages_EntityPage(t *testing.T) {
raw := []RawPage{
{
Title: "Ryan Singer",
Type: "entity",
Subtype: "person",
Domain: "product-strategy",
Content: "## Description\n\nA product designer.\n",
},
}
pages, warnings := BuildPages(raw, "shape-up", "2026-04-23")
require.Len(t, pages, 1)
assert.Empty(t, warnings)
p := pages[0]
assert.Equal(t, "wiki/entities/ryan-singer.md", p.Path)
assert.Contains(t, p.Content, "title: 'Ryan Singer'")
assert.Contains(t, p.Content, "type: 'person'")
assert.Contains(t, p.Content, "domain: 'product-strategy'")
assert.Contains(t, p.Content, "last_updated: 2026-04-23")
assert.Contains(t, p.Content, "aliases:\n - 'Ryan Singer'")
assert.NotContains(t, p.Content, "date_ingested")
}
func TestBuildPages_SourceSlugUsedForSourcePage(t *testing.T) {
// LLM title differs from filename — pipeline uses sourceSlug for the source page path.
raw := []RawPage{
{Title: "FinBERT: A Pretrained Model", Type: "source", Subtype: "article", Content: "## Summary\n\nA model.\n"},
}
pages, _ := BuildPages(raw, "finbert-huggingface", "2026-04-23")
require.Len(t, pages, 1)
assert.Equal(t, "wiki/sources/finbert-huggingface.md", pages[0].Path)
}
func TestBuildPages_ConceptSlugDerivedFromTitle(t *testing.T) {
raw := []RawPage{
{Title: "Domain-Driven Design", Type: "concept", Content: "## Definition\n\nFoo.\n"},
}
pages, _ := BuildPages(raw, "some-source", "2026-04-23")
require.Len(t, pages, 1)
assert.Equal(t, "wiki/concepts/domain-driven-design.md", pages[0].Path)
}
func TestBuildPages_SourceDefaultSubtype(t *testing.T) {
// If subtype is omitted for a source, default to "article"
raw := []RawPage{
{Title: "Some Post", Type: "source", Content: "## Summary\n\nA post.\n"},
}
pages, _ := BuildPages(raw, "some-post", "2026-04-23")
require.Len(t, pages, 1)
assert.Contains(t, pages[0].Content, "type: 'article'")
}
func TestBuildPages_OmitsDomainWhenEmpty(t *testing.T) {
raw := []RawPage{
{Title: "Betting", Type: "concept", Content: "## Definition\n\nFoo.\n"},
}
pages, _ := BuildPages(raw, "src", "2026-04-23")
require.Len(t, pages, 1)
assert.NotContains(t, pages[0].Content, "domain:")
}
func TestBuildPages_MultiplePages(t *testing.T) {
raw := []RawPage{
{Title: "Shape Up", Type: "source", Subtype: "book", Content: "## Summary\n\nA book.\n"},
{Title: "Betting", Type: "concept", Content: "## Definition\n\nA technique.\n"},
{Title: "Ryan Singer", Type: "entity", Subtype: "person", Content: "## Description\n\nA designer.\n"},
}
pages, _ := BuildPages(raw, "shape-up", "2026-04-23")
require.Len(t, pages, 3)
assert.Equal(t, "wiki/sources/shape-up.md", pages[0].Path)
assert.Equal(t, "wiki/concepts/betting.md", pages[1].Path)
assert.Equal(t, "wiki/entities/ryan-singer.md", pages[2].Path)
}
func TestBuildPages_TitleWithColon(t *testing.T) {
raw := []RawPage{
{Title: "Shape Up: The Basecamp Method", Type: "source", Subtype: "book", Content: "## Summary\n\nA book.\n"},
}
pages, _ := BuildPages(raw, "shape-up", "2026-04-23")
require.Len(t, pages, 1)
// Title with colon must be quoted in YAML
assert.Contains(t, pages[0].Content, "title: 'Shape Up: The Basecamp Method'")
assert.Contains(t, pages[0].Content, "aliases:\n - 'Shape Up: The Basecamp Method'")
}
func TestBuildPages_EntityNoSubtype(t *testing.T) {
raw := []RawPage{
{Title: "Basecamp", Type: "entity", Content: "## Description\n\nA company.\n"},
}
pages, _ := BuildPages(raw, "src", "2026-04-23")
require.Len(t, pages, 1)
assert.NotContains(t, pages[0].Content, "type:")
assert.Contains(t, pages[0].Content, "title: 'Basecamp'")
}
func TestBuildPages_EmptyTitleSkippedWithWarning(t *testing.T) {
raw := []RawPage{
{Title: "", Type: "concept", Content: "## Definition\n\nFoo.\n"},
{Title: "Betting", Type: "concept", Content: "## Definition\n\nA technique.\n"},
}
pages, warnings := BuildPages(raw, "src", "2026-04-23")
require.Len(t, pages, 1, "empty-title page should be skipped")
assert.Equal(t, "wiki/concepts/betting.md", pages[0].Path)
assert.Len(t, warnings, 1)
assert.Contains(t, warnings[0], "empty title")
}
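
The path rules these tests exercise can be condensed into one function. This is only a sketch: `wiki.Slug` is not shown in the diff, so `slugify` below is a hypothetical approximation (lowercase, runs of non-alphanumerics collapsed to a hyphen), and `pagePath` is an illustrative name rather than a function from the repository.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var nonAlnum = regexp.MustCompile(`[^a-z0-9]+`)

// slugify is a hypothetical approximation of wiki.Slug.
func slugify(title string) string {
	return strings.Trim(nonAlnum.ReplaceAllString(strings.ToLower(title), "-"), "-")
}

// pagePath returns the wiki-relative path for a page, or false when the slug
// would be empty. Source pages take their slug from the source name, all
// other types from the title.
func pagePath(pageType, title, sourceSlug string) (string, bool) {
	slug := slugify(title)
	dir := "wiki/" + pageType // fallback for unrecognised types, as in the diff
	switch pageType {
	case "source":
		slug, dir = sourceSlug, "wiki/sources"
	case "concept":
		dir = "wiki/concepts"
	case "entity":
		dir = "wiki/entities"
	}
	if slug == "" {
		return "", false
	}
	return dir + "/" + slug + ".md", true
}

func main() {
	fmt.Println(pagePath("source", "FinBERT: A Pretrained Model", "finbert-huggingface"))
	fmt.Println(pagePath("concept", "Domain-Driven Design", "some-source"))
	fmt.Println(pagePath("concept", "", "src"))
}
```

Deriving the source slug from the source name rather than the title is what keeps two chunks of the same source from producing differently named source pages.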


@@ -0,0 +1,70 @@
// ingestion/internal/pipeline/links.go
package pipeline
import (
"fmt"
"path/filepath"
"regexp"
"strings"
"github.com/mathiasbq/hyperguild/ingestion/internal/wiki"
)
// plainLinkRE matches [[Display Name]] — wikilinks without a slug pipe.
// It does NOT match [[slug|Display]] (those already have a pipe).
var plainLinkRE = regexp.MustCompile(`\[\[([^\]|]+)\]\]`)
// CanonicalizeLinks converts [[Display Name]] wikilinks to [[slug|Display Name]]
// using a title→slug map built from the inventory and current batch.
// Unknown titles are left as-is and returned as warnings.
func CanonicalizeLinks(pages []wiki.Page, inventory map[wiki.PageType][]wiki.Entry) ([]wiki.Page, []string) {
titleToSlug := buildTitleMap(pages, inventory)
var allWarnings []string
out := make([]wiki.Page, len(pages))
for i, p := range pages {
newContent, warnings := canonicalizeContent(p.Content, titleToSlug)
p.Content = newContent
out[i] = p
allWarnings = append(allWarnings, warnings...)
}
return out, allWarnings
}
// buildTitleMap builds a lowercase-title → slug map from inventory and current batch.
// Current batch entries take precedence over inventory (they may be updates).
func buildTitleMap(pages []wiki.Page, inventory map[wiki.PageType][]wiki.Entry) map[string]string {
m := make(map[string]string)
for _, entries := range inventory {
for _, e := range entries {
m[strings.ToLower(e.Title)] = e.Slug
}
}
// Current batch overrides inventory
for _, p := range pages {
title := extractTitle(p.Content)
slug := strings.TrimSuffix(filepath.Base(p.Path), ".md")
if title != "" && slug != "" {
m[strings.ToLower(title)] = slug
}
}
return m
}
func canonicalizeContent(content string, titleToSlug map[string]string) (string, []string) {
var warnings []string
result := plainLinkRE.ReplaceAllStringFunc(content, func(match string) string {
sub := plainLinkRE.FindStringSubmatch(match)
if len(sub) < 2 {
return match
}
displayName := sub[1]
slug, ok := titleToSlug[strings.ToLower(displayName)]
if !ok {
warnings = append(warnings, fmt.Sprintf("unknown wikilink: [[%s]]", displayName))
return match
}
return "[[" + slug + "|" + displayName + "]]"
})
return result, warnings
}
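
The core of the link rewrite is small enough to try in isolation. The sketch below reuses the regular expression from the diff; `canonicalize` is a simplified, hypothetical variant of `canonicalizeContent` that returns the unknown display names instead of formatted warnings:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// plainLinkRE matches [[Display Name]] but not [[slug|Display Name]],
// because the character class excludes the pipe.
var plainLinkRE = regexp.MustCompile(`\[\[([^\]|]+)\]\]`)

// canonicalize rewrites known plain links to [[slug|Display Name]] and
// collects the display names it could not resolve.
func canonicalize(content string, titleToSlug map[string]string) (string, []string) {
	var unknown []string
	out := plainLinkRE.ReplaceAllStringFunc(content, func(match string) string {
		name := match[2 : len(match)-2] // strip the surrounding [[ and ]]
		slug, ok := titleToSlug[strings.ToLower(name)]
		if !ok {
			unknown = append(unknown, name)
			return match // leave unknown links untouched
		}
		return "[[" + slug + "|" + name + "]]"
	})
	return out, unknown
}

func main() {
	titles := map[string]string{"betting": "betting"}
	out, unknown := canonicalize("See [[Betting]], [[betting|Betting]] and [[Ghost]].", titles)
	fmt.Println(out)
	fmt.Println("unknown:", unknown)
}
```

Already-canonical links never match the pattern, so running the pass twice over the same content changes nothing the second time.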


@@ -0,0 +1,125 @@
// ingestion/internal/pipeline/links_test.go
package pipeline
import (
"testing"
"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
"github.com/mathiasbq/hyperguild/ingestion/internal/wiki"
)
func TestCanonicalizeLinks_KnownTitle(t *testing.T) {
pages := []wiki.Page{
{
Path: "wiki/sources/shape-up.md",
Content: "---\ntitle: 'Shape Up'\n---\n\n## Summary\n\nSee [[Betting]].\n",
},
}
inventory := map[wiki.PageType][]wiki.Entry{
wiki.PageTypeConcept: {
{Slug: "betting", Title: "Betting"},
},
}
got, warnings := CanonicalizeLinks(pages, inventory)
require.Len(t, got, 1)
assert.Empty(t, warnings)
assert.Contains(t, got[0].Content, "[[betting|Betting]]")
assert.NotContains(t, got[0].Content, "[[Betting]]")
}
func TestCanonicalizeLinks_UnknownTitleLeftAsIs(t *testing.T) {
pages := []wiki.Page{
{
Path: "wiki/sources/shape-up.md",
Content: "---\ntitle: 'Shape Up'\n---\n\n## Summary\n\nSee [[Ghost Concept]].\n",
},
}
inventory := map[wiki.PageType][]wiki.Entry{}
got, warnings := CanonicalizeLinks(pages, inventory)
require.Len(t, got, 1)
assert.NotEmpty(t, warnings)
assert.Contains(t, got[0].Content, "[[Ghost Concept]]")
}
func TestCanonicalizeLinks_AlreadyCanonicalLinkUntouched(t *testing.T) {
// Links already in [[slug|Display]] format must not be double-converted
pages := []wiki.Page{
{
Path: "wiki/sources/shape-up.md",
Content: "---\ntitle: 'Shape Up'\n---\n\n## Summary\n\nSee [[betting|Betting]].\n",
},
}
inventory := map[wiki.PageType][]wiki.Entry{
wiki.PageTypeConcept: {
{Slug: "betting", Title: "Betting"},
},
}
got, warnings := CanonicalizeLinks(pages, inventory)
require.Len(t, got, 1)
assert.Empty(t, warnings)
// Should remain exactly as-is — not double-wrapped
assert.Contains(t, got[0].Content, "[[betting|Betting]]")
assert.NotContains(t, got[0].Content, "[[betting|[[betting|Betting]]]]")
}
func TestCanonicalizeLinks_CaseInsensitiveMatch(t *testing.T) {
pages := []wiki.Page{
{
Path: "wiki/sources/foo.md",
Content: "---\ntitle: 'Foo'\n---\n\n## Summary\n\nSee [[domain driven design]].\n",
},
}
inventory := map[wiki.PageType][]wiki.Entry{
wiki.PageTypeConcept: {
{Slug: "domain-driven-design", Title: "Domain Driven Design"},
},
}
got, warnings := CanonicalizeLinks(pages, inventory)
require.Len(t, got, 1)
assert.Empty(t, warnings)
assert.Contains(t, got[0].Content, "[[domain-driven-design|domain driven design]]")
}
func TestCanonicalizeLinks_CurrentBatchPagesResolved(t *testing.T) {
// A concept created in the same batch should be canonicalizable
pages := []wiki.Page{
{
Path: "wiki/sources/shape-up.md",
Content: "---\ntitle: 'Shape Up'\n---\n\n## Summary\n\nSee [[Betting]].\n",
},
{
Path: "wiki/concepts/betting.md",
Content: "---\ntitle: 'Betting'\n---\n\n## Definition\n\nA technique.\n",
},
}
inventory := map[wiki.PageType][]wiki.Entry{} // empty — Betting is in the batch, not inventory
got, warnings := CanonicalizeLinks(pages, inventory)
require.Len(t, got, 2)
assert.Empty(t, warnings)
assert.Contains(t, got[0].Content, "[[betting|Betting]]")
}
func TestCanonicalizeLinks_MultipleLinksInOnePage(t *testing.T) {
pages := []wiki.Page{
{
Path: "wiki/sources/foo.md",
Content: "---\ntitle: 'Foo'\n---\n\n## Summary\n\nSee [[Betting]] and [[Shape Up]].\n",
},
}
inventory := map[wiki.PageType][]wiki.Entry{
wiki.PageTypeConcept: {
{Slug: "betting", Title: "Betting"},
},
wiki.PageTypeSource: {
{Slug: "shape-up", Title: "Shape Up"},
},
}
got, warnings := CanonicalizeLinks(pages, inventory)
require.Len(t, got, 1)
assert.Empty(t, warnings)
assert.Contains(t, got[0].Content, "[[betting|Betting]]")
assert.Contains(t, got[0].Content, "[[shape-up|Shape Up]]")
}
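
Two properties of the title map matter for these tests: lookups are case-insensitive, and pages in the current batch override inventory entries with the same title. A reduced sketch, assuming simplified `entry` and `page` types (the real code extracts the title from frontmatter with `extractTitle` and uses `filepath.Base`):

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// entry and page are simplified stand-ins for wiki.Entry and wiki.Page.
type entry struct{ Slug, Title string }
type page struct{ Path, Title string }

// buildTitleMap maps lowercase titles to slugs. Inventory entries go in
// first so that pages from the current batch can overwrite them.
func buildTitleMap(batch []page, inventory []entry) map[string]string {
	m := make(map[string]string)
	for _, e := range inventory {
		m[strings.ToLower(e.Title)] = e.Slug
	}
	for _, p := range batch {
		slug := strings.TrimSuffix(path.Base(p.Path), ".md")
		if p.Title != "" && slug != "" {
			m[strings.ToLower(p.Title)] = slug
		}
	}
	return m
}

func main() {
	inventory := []entry{{Slug: "ddd", Title: "Domain Driven Design"}}
	batch := []page{{Path: "wiki/concepts/domain-driven-design.md", Title: "Domain Driven Design"}}
	m := buildTitleMap(batch, inventory)
	fmt.Println(m["domain driven design"])
}
```

One caveat worth keeping in mind: the map is keyed by title alone, so two inventory pages of different types with the same title would collide, and since the real inventory is itself a Go map, which one wins would depend on iteration order.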


@@ -5,13 +5,22 @@ import (
"encoding/json"
"fmt"
"strings"
"github.com/mathiasbq/hyperguild/ingestion/internal/wiki"
)
// ParsePages parses LLM output as a JSON array of {path, content} objects. // RawPage is the LLM's output format — minimal structured data with no path or frontmatter.
// If the array is truncated mid-object (token limit), it salvages all complete objects. // The pipeline derives slugs, paths, and frontmatter from these fields.
func ParsePages(output string) ([]wiki.Page, []string) { type RawPage struct {
Title string `json:"title"`
Type string `json:"type"` // "source" | "concept" | "entity"
Subtype string `json:"subtype"` // entity: person|company|tool|model|framework|technology; source: article|pdf|book|video|note|project
Domain string `json:"domain"`
Content string `json:"content"` // Markdown body only — no frontmatter
}
// ParseRawPages parses LLM output as a JSON array of RawPage objects.
// If the output contains invalid JSON escape sequences (e.g. \. from Markdown),
// it attempts repair before falling back to truncation recovery.
func ParseRawPages(output string) ([]RawPage, []string) {
output = strings.TrimSpace(output)
if output == "" {
return nil, []string{"LLM returned empty output"}
@@ -19,23 +28,30 @@ func ParsePages(output string) ([]wiki.Page, []string) {
output = stripFences(output)
var pages []wiki.Page // Fast path: valid JSON.
var pages []RawPage
if err := json.Unmarshal([]byte(output), &pages); err == nil {
return pages, nil
}
// Repair pass: fix invalid escape sequences (e.g. \. \d from Markdown content).
repaired := repairJSON(output)
if err := json.Unmarshal([]byte(repaired), &pages); err == nil {
return pages, []string{"repaired invalid JSON escape sequences in LLM output"}
}
// Truncation recovery: find last `}` that closes a complete object. // Truncation recovery: find last `}` that closes a complete object.
idx := strings.LastIndex(output, "}") idx := strings.LastIndex(repaired, "}")
if idx < 0 {
return nil, []string{"LLM output contained no complete JSON objects"}
}
start := strings.Index(output, "[") start := strings.Index(repaired, "[")
if start < 0 {
return nil, []string{"LLM output contained no JSON array opening bracket"}
}
candidate := output[start:idx+1] + "]" candidate := repaired[start:idx+1] + "]"
if err := json.Unmarshal([]byte(candidate), &pages); err != nil {
return nil, []string{fmt.Sprintf("truncation recovery failed: %v", err)}
}
@@ -43,6 +59,45 @@ func ParsePages(output string) ([]wiki.Page, []string) {
return pages, []string{fmt.Sprintf("LLM output was truncated; recovered %d page(s)", len(pages))}
}
// repairJSON replaces invalid JSON escape sequences (e.g. \. \d \p) with
// a properly escaped backslash followed by the same character.
// It iterates byte-by-byte to correctly skip already-valid escape sequences
// (including \\) without requiring lookbehind support.
func repairJSON(s string) string {
var b strings.Builder
b.Grow(len(s))
i := 0
for i < len(s) {
if s[i] != '\\' {
b.WriteByte(s[i])
i++
continue
}
// We have a backslash. Peek at the next character.
if i+1 >= len(s) {
// Trailing backslash — emit as-is.
b.WriteByte(s[i])
i++
continue
}
next := s[i+1]
switch next {
case '"', '\\', '/', 'b', 'f', 'n', 'r', 't', 'u':
// Valid JSON escape sequence — emit both characters as-is.
b.WriteByte(s[i])
b.WriteByte(next)
i += 2
default:
// Invalid escape — double the backslash.
b.WriteByte('\\')
b.WriteByte('\\')
b.WriteByte(next)
i += 2
}
}
return b.String()
}
func stripFences(s string) string {
for _, prefix := range []string{"```json\n", "```json\r\n", "```\n", "```\r\n"} {
if strings.HasPrefix(s, prefix) {
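
To see the repair pass from this file in action, the sketch below pairs a condensed `repairJSON` (intended to behave like the version in the diff, written more compactly) with a hypothetical `parseContent` helper that tries strict parsing first and falls back to repair:

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// repairJSON doubles the backslash of any escape that JSON does not define.
// Condensed rewrite of the function in the diff.
func repairJSON(s string) string {
	var b strings.Builder
	for i := 0; i < len(s); i++ {
		if s[i] != '\\' || i+1 >= len(s) {
			b.WriteByte(s[i])
			continue
		}
		next := s[i+1]
		if !strings.ContainsRune(`"\/bfnrtu`, rune(next)) {
			b.WriteByte('\\') // extra backslash turns \x into \\x
		}
		b.WriteByte('\\')
		b.WriteByte(next)
		i++ // skip the escaped character
	}
	return b.String()
}

// parseContent returns the first page's content and whether repair was needed.
// Hypothetical helper for illustration only.
func parseContent(raw string) (content string, repaired bool, err error) {
	var pages []struct {
		Content string `json:"content"`
	}
	if err = json.Unmarshal([]byte(raw), &pages); err != nil {
		repaired = true
		if err = json.Unmarshal([]byte(repairJSON(raw)), &pages); err != nil {
			return "", true, err
		}
	}
	if len(pages) == 0 {
		return "", repaired, fmt.Errorf("no pages in output")
	}
	return pages[0].Content, repaired, nil
}

func main() {
	content, repaired, err := parseContent(`[{"content":"Step 4\. Do it."}]`)
	fmt.Println(content, repaired, err)
}
```

A limitation shared with the original: `\u` is always treated as valid, so a sequence such as `\user` (no four hex digits after the `u`) passes through unrepaired and still fails to parse.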


@@ -8,39 +8,80 @@ import (
"github.com/stretchr/testify/require"
)
func TestParsePages_ValidJSON(t *testing.T) { func TestParseRawPages_ValidJSON(t *testing.T) {
input := `[{"path":"wiki/sources/foo.md","content":"# Foo"},{"path":"wiki/concepts/bar.md","content":"# Bar"}]` input := `[{"title":"Shape Up","type":"source","subtype":"book","domain":"product-strategy","content":"## Summary\n\nFoo."},{"title":"Betting","type":"concept","content":"## Definition\n\nA technique."}]`
pages, warnings := ParsePages(input) pages, warnings := ParseRawPages(input)
require.Len(t, pages, 2)
assert.Empty(t, warnings)
assert.Equal(t, "wiki/sources/foo.md", pages[0].Path) assert.Equal(t, "Shape Up", pages[0].Title)
assert.Equal(t, "wiki/concepts/bar.md", pages[1].Path) assert.Equal(t, "source", pages[0].Type)
assert.Equal(t, "book", pages[0].Subtype)
assert.Equal(t, "product-strategy", pages[0].Domain)
assert.Equal(t, "Betting", pages[1].Title)
assert.Equal(t, "concept", pages[1].Type)
assert.Empty(t, pages[1].Subtype)
}
func TestParsePages_StripsFences(t *testing.T) { func TestParseRawPages_StripsFences(t *testing.T) {
input := "```json\n[{\"path\":\"wiki/sources/foo.md\",\"content\":\"# Foo\"}]\n```" input := "```json\n[{\"title\":\"Foo\",\"type\":\"concept\",\"content\":\"## Definition\\n\\nFoo.\"}]\n```"
pages, warnings := ParsePages(input) pages, warnings := ParseRawPages(input)
assert.Len(t, pages, 1)
assert.Empty(t, warnings)
}
func TestParsePages_TruncationRecovery(t *testing.T) {
input := `[{"path":"wiki/sources/foo.md","content":"# Foo"},{"path":"wiki/concepts/bar.md","content":"trunc`
pages, warnings := ParsePages(input)
require.Len(t, pages, 1)
assert.Equal(t, "wiki/sources/foo.md", pages[0].Path) assert.Empty(t, warnings)
assert.Equal(t, "Foo", pages[0].Title)
}
func TestParseRawPages_TruncationRecovery(t *testing.T) {
input := `[{"title":"Foo","type":"concept","content":"## Definition\n\nFoo."},{"title":"Bar","type":"concept","content":"trunc`
pages, warnings := ParseRawPages(input)
require.Len(t, pages, 1)
assert.Equal(t, "Foo", pages[0].Title)
assert.NotEmpty(t, warnings) assert.NotEmpty(t, warnings)
} }
func TestParsePages_EmptyInput(t *testing.T) { func TestParseRawPages_EmptyInput(t *testing.T) {
pages, warnings := ParsePages("") pages, warnings := ParseRawPages("")
assert.Empty(t, pages) assert.Empty(t, pages)
assert.NotEmpty(t, warnings) assert.NotEmpty(t, warnings)
} }
func TestParsePages_PlainFence(t *testing.T) { func TestParseRawPages_PlainFence(t *testing.T) {
input := "```\n[{\"path\":\"wiki/sources/foo.md\",\"content\":\"ok\"}]\n```" input := "```\n[{\"title\":\"Foo\",\"type\":\"concept\",\"content\":\"ok\"}]\n```"
pages, warnings := ParsePages(input) pages, warnings := ParseRawPages(input)
assert.Len(t, pages, 1) require.Len(t, pages, 1)
assert.Empty(t, warnings) assert.Empty(t, warnings)
} }
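The fence tests above only pin down behaviour: a leading ```json or bare ``` line and the trailing fence are removed before parsing. The parser's actual code is not part of this hunk, so the following is a minimal sketch under that assumption; `stripFences` is a hypothetical name, not necessarily the helper `ParseRawPages` uses.

```go
package main

import (
	"fmt"
	"strings"
)

// stripFences is a hypothetical helper: it removes one surrounding Markdown
// code fence (with or without a language tag) from an LLM response.
func stripFences(s string) string {
	s = strings.TrimSpace(s)
	if !strings.HasPrefix(s, "```") {
		return s // no fence, return the trimmed input unchanged
	}
	// Drop the opening fence line, including any language tag such as "json".
	nl := strings.IndexByte(s, '\n')
	if nl < 0 {
		return "" // the input was only a fence marker
	}
	s = strings.TrimSpace(s[nl+1:])
	// Drop the closing fence if it is present.
	s = strings.TrimSuffix(s, "```")
	return strings.TrimSpace(s)
}

func main() {
	fmt.Println(stripFences("```json\n[{\"title\":\"Foo\"}]\n```"))
}
```

Unfenced input passes through with only surrounding whitespace trimmed, so the same call can sit in front of every parse.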
func TestParseRawPages_MissingTitle(t *testing.T) {
// Missing title — still parsed, Title is empty string
input := `[{"type":"concept","content":"## Definition\n\nFoo."}]`
pages, warnings := ParseRawPages(input)
require.Len(t, pages, 1)
assert.Empty(t, warnings)
assert.Empty(t, pages[0].Title)
}
func TestParseRawPages_InvalidEscapeRepaired(t *testing.T) {
// LLM copied markdown escaped list numbers (\.) into JSON — invalid escape
raw := "[{\"title\":\"Foo\",\"type\":\"concept\",\"content\":\"Step 4\\. Do it.\"}]"
pages, warnings := ParseRawPages(raw)
require.Len(t, pages, 1)
assert.Equal(t, "Foo", pages[0].Title)
assert.Contains(t, pages[0].Content, `4\.`)
assert.NotEmpty(t, warnings) // repair warning
}
func TestRepairJSON_FixesInvalidEscapes(t *testing.T) {
cases := []struct {
in string
want string
}{
{`{"a":"foo\.bar"}`, `{"a":"foo\\.bar"}`},
{`{"a":"\\n is fine"}`, `{"a":"\\n is fine"}`}, // escaped backslash followed by n: untouched
{`{"a":"\d+ items"}`, `{"a":"\\d+ items"}`},
{`{"a":"already \\ escaped"}`, `{"a":"already \\ escaped"}`}, // valid \\ untouched
}
for _, tc := range cases {
got := repairJSON(tc.in)
assert.Equal(t, tc.want, got, "input: %s", tc.in)
}
}
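The body of `repairJSON` is not shown in this diff; the table above only fixes its expected behaviour. One way to satisfy those cases, sketched under the assumption that a single left-to-right scan is enough, is to double any backslash that does not start a valid JSON escape. `repairInvalidEscapes` is a hypothetical name used for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// repairInvalidEscapes doubles every backslash that is not followed by one of
// the characters JSON allows after a backslash. Valid pairs such as \\ or \n
// are copied through as a unit, so an escaped backslash is never split.
// Limitation of this sketch: \u is accepted without checking the hex digits.
func repairInvalidEscapes(s string) string {
	var sb strings.Builder
	sb.Grow(len(s))
	for i := 0; i < len(s); i++ {
		c := s[i]
		if c != '\\' {
			sb.WriteByte(c)
			continue
		}
		if i+1 < len(s) && strings.IndexByte(`"\/bfnrtu`, s[i+1]) >= 0 {
			// Valid escape: keep both bytes and skip past the pair.
			sb.WriteByte(c)
			sb.WriteByte(s[i+1])
			i++
			continue
		}
		// Invalid escape: turn the lone backslash into a literal one.
		sb.WriteString(`\\`)
	}
	return sb.String()
}

func main() {
	raw := `[{"title":"Foo","content":"Step 4\. Do it."}]`
	var pages []map[string]string
	if err := json.Unmarshal([]byte(repairInvalidEscapes(raw)), &pages); err != nil {
		fmt.Println("still invalid:", err)
		return
	}
	fmt.Println(pages[0]["content"])
}
```

After the repair the decoded content still contains the literal backslash, which matches the `assert.Contains(t, pages[0].Content, `4\.`)` check in the test above.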


@@ -41,9 +41,11 @@ func Run(ctx context.Context, cfg Config, brainDir, content, source string, dryR
schema = loadSchema(brainDir) schema = loadSchema(brainDir)
} }
sourceSlug := wiki.Slug(source)
date := time.Now().UTC().Format("2006-01-02")
chunks := Chunk(content, cfg.ChunkSize) chunks := Chunk(content, cfg.ChunkSize)
var allPages []wiki.Page var allRaw []RawPage
var allWarnings []string var allWarnings []string
for _, chunk := range chunks { for _, chunk := range chunks {
@@ -52,17 +54,40 @@ func Run(ctx context.Context, cfg Config, brainDir, content, source string, dryR
if err != nil { if err != nil {
return Result{}, fmt.Errorf("LLM call: %w", err) return Result{}, fmt.Errorf("LLM call: %w", err)
} }
pages, warnings := ParsePages(output) raw, warnings := ParseRawPages(output)
allPages = append(allPages, pages...) allRaw = append(allRaw, raw...)
allWarnings = append(allWarnings, warnings...) allWarnings = append(allWarnings, warnings...)
} }
resolved := Resolve(allPages, inventory) return buildAndWrite(allRaw, sourceSlug, date, brainDir, source, inventory, allWarnings, dryRun)
merged := mergeAll(resolved) }
// RunRaw runs the pipeline on pre-parsed RawPages, skipping the LLM extraction
// step. Use this when the caller has already produced the structured RawPage data
// (e.g. from a more capable model or manual curation).
func RunRaw(brainDir, source string, rawPages []RawPage, dryRun bool) (Result, error) {
inventory, err := wiki.LoadInventory(brainDir)
if err != nil {
return Result{}, fmt.Errorf("load inventory: %w", err)
}
sourceSlug := wiki.Slug(source)
date := time.Now().UTC().Format("2006-01-02") date := time.Now().UTC().Format("2006-01-02")
var written []string
return buildAndWrite(rawPages, sourceSlug, date, brainDir, source, inventory, nil, dryRun)
}
// buildAndWrite runs BuildPages through write for both Run and RunRaw.
func buildAndWrite(rawPages []RawPage, sourceSlug, date, brainDir, source string, inventory map[wiki.PageType][]wiki.Entry, warnings []string, dryRun bool) (Result, error) {
pages, buildWarnings := BuildPages(rawPages, sourceSlug, date)
warnings = append(warnings, buildWarnings...)
resolved := Resolve(pages, inventory)
canonicalized, linkWarnings := CanonicalizeLinks(resolved, inventory)
warnings = append(warnings, linkWarnings...)
withRefs := injectSourceRefs(canonicalized, inventory, brainDir)
merged := mergeAll(withRefs)
var written []string
for _, page := range merged { for _, page := range merged {
if !dryRun { if !dryRun {
dest := filepath.Join(brainDir, filepath.FromSlash(page.Path)) dest := filepath.Join(brainDir, filepath.FromSlash(page.Path))
@@ -78,14 +103,14 @@ func Run(ctx context.Context, cfg Config, brainDir, content, source string, dryR
if !dryRun { if !dryRun {
if err := wiki.RebuildIndex(brainDir, date); err != nil { if err := wiki.RebuildIndex(brainDir, date); err != nil {
allWarnings = append(allWarnings, fmt.Sprintf("rebuild index: %v", err)) warnings = append(warnings, fmt.Sprintf("rebuild index: %v", err))
} }
if err := wiki.AppendLog(brainDir, source, written, allWarnings, date); err != nil { if err := wiki.AppendLog(brainDir, source, written, warnings, date); err != nil {
allWarnings = append(allWarnings, fmt.Sprintf("append log: %v", err)) warnings = append(warnings, fmt.Sprintf("append log: %v", err))
} }
} }
return Result{Pages: written, Warnings: allWarnings}, nil return Result{Pages: written, Warnings: warnings}, nil
} }
// mergeAll deduplicates pages by path, merging content from later occurrences. // mergeAll deduplicates pages by path, merging content from later occurrences.


@@ -15,7 +15,6 @@ import (
"github.com/stretchr/testify/require" "github.com/stretchr/testify/require"
"github.com/mathiasbq/hyperguild/ingestion/internal/llm" "github.com/mathiasbq/hyperguild/ingestion/internal/llm"
"github.com/mathiasbq/hyperguild/ingestion/internal/wiki"
) )
func TestRun_WritesPages(t *testing.T) { func TestRun_WritesPages(t *testing.T) {
@@ -24,14 +23,19 @@ func TestRun_WritesPages(t *testing.T) {
require.NoError(t, os.MkdirAll(filepath.Join(brainDir, sub), 0o755)) require.NoError(t, os.MkdirAll(filepath.Join(brainDir, sub), 0o755))
} }
llmResponse := mustJSON([]wiki.Page{ llmResponse := mustJSON([]RawPage{
{ {
Path: "wiki/sources/test-article.md", Title: "Test Article",
Content: "---\ntitle: Test Article\ntype: article\ndomain: software-engineering\ndate_ingested: 2026-04-22\nlast_updated: 2026-04-22\naliases:\n - Test Article\n---\n\n## Summary\n\nA test article.\n\n## Key Claims\n\n- It tests things.\n\n## Concepts Introduced or Reinforced\n\n## Entities Mentioned\n\n## Open Questions Raised\n", Type: "source",
Subtype: "article",
Domain: "software-engineering",
Content: "## Summary\n\nA test article.\n\n## Key Claims\n\n- It tests things.\n\n## Concepts Introduced or Reinforced\n\n[[Testing]]\n\n## Entities Mentioned\n\n## Open Questions Raised\n",
}, },
{ {
Path: "wiki/concepts/testing.md", Title: "Testing",
Content: "---\ntitle: Testing\ndomain: software-engineering\nlast_updated: 2026-04-22\naliases:\n - Testing\n---\n\n## Definition\n\nThe practice of verifying software.\n\n## Why It Matters\n\nCatches bugs.\n\n## Related Concepts\n\n## Related Entities\n\n## Sources\n\n## Evolving Notes\n", Type: "concept",
Domain: "software-engineering",
Content: "## Definition\n\nThe practice of verifying software.\n\n## Why It Matters\n\nCatches bugs.\n\n## Related Concepts\n\n## Related Entities\n\n## Sources\n\n## Evolving Notes\n",
}, },
}) })
@@ -53,7 +57,6 @@ func TestRun_WritesPages(t *testing.T) {
result, err := Run(context.Background(), cfg, brainDir, "An article about testing.", "test-article", false) result, err := Run(context.Background(), cfg, brainDir, "An article about testing.", "test-article", false)
require.NoError(t, err) require.NoError(t, err)
assert.Len(t, result.Pages, 2) assert.Len(t, result.Pages, 2)
assert.Empty(t, result.Warnings)
_, err = os.Stat(filepath.Join(brainDir, "wiki", "sources", "test-article.md")) _, err = os.Stat(filepath.Join(brainDir, "wiki", "sources", "test-article.md"))
require.NoError(t, err) require.NoError(t, err)
@@ -71,9 +74,11 @@ func TestRun_DryRunDoesNotWrite(t *testing.T) {
require.NoError(t, os.MkdirAll(filepath.Join(brainDir, sub), 0o755)) require.NoError(t, os.MkdirAll(filepath.Join(brainDir, sub), 0o755))
} }
llmResponse := mustJSON([]wiki.Page{{ llmResponse := mustJSON([]RawPage{{
Path: "wiki/sources/foo.md", Title: "Foo",
Content: "---\ntitle: Foo\n---\n\n## Summary\n\nFoo.\n", Type: "source",
Subtype: "article",
Content: "## Summary\n\nFoo.\n",
}}) }})
srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
@@ -98,10 +103,10 @@ func TestRun_MergesDuplicatePaths(t *testing.T) {
require.NoError(t, os.MkdirAll(filepath.Join(brainDir, sub), 0o755)) require.NoError(t, os.MkdirAll(filepath.Join(brainDir, sub), 0o755))
} }
// LLM returns same path twice (simulates multi-chunk merge) // LLM returns same title twice (simulates multi-chunk duplicate)
llmResponse := mustJSON([]wiki.Page{ llmResponse := mustJSON([]RawPage{
{Path: "wiki/concepts/foo.md", Content: "---\ntitle: Foo\n---\n\n## Definition\n\nFirst.\n\n## Related Concepts\n\n- [[bar|Bar]]\n"}, {Title: "Foo", Type: "concept", Content: "## Definition\n\nFirst.\n\n## Related Concepts\n\n[[Bar]]\n"},
{Path: "wiki/concepts/foo.md", Content: "---\ntitle: Foo\n---\n\n## Definition\n\nSecond.\n\n## Related Concepts\n\n- [[baz|Baz]]\n"}, {Title: "Foo", Type: "concept", Content: "## Definition\n\nSecond.\n\n## Related Concepts\n\n[[Baz]]\n"},
}) })
srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
@@ -120,8 +125,9 @@ func TestRun_MergesDuplicatePaths(t *testing.T) {
require.NoError(t, err) require.NoError(t, err)
// keep-first for Definition, union for Related Concepts // keep-first for Definition, union for Related Concepts
assert.Contains(t, string(content), "First.") assert.Contains(t, string(content), "First.")
assert.Contains(t, string(content), "[[bar|Bar]]") // Bar and Baz unknown in empty inventory → left as plain [[links]]
assert.Contains(t, string(content), "[[baz|Baz]]") assert.Contains(t, string(content), "[[Bar]]")
assert.Contains(t, string(content), "[[Baz]]")
} }
func mustJSON(v any) string { func mustJSON(v any) string {


@@ -12,12 +12,15 @@ import (
const systemPrompt = `You are a wiki agent. Read the source material and produce structured wiki pages following the schema provided. const systemPrompt = `You are a wiki agent. Read the source material and produce structured wiki pages following the schema provided.
Output ONLY a valid JSON array — no markdown fences, no other text before or after. Output ONLY a valid JSON array — no markdown fences, no other text before or after.
Each element must have: Each element must have exactly these fields:
"path" — relative path within the wiki, e.g. "wiki/sources/foo.md" "title" — exact page title (e.g. "FinBERT", "Ryan Singer", "Shape Up")
"content" — full markdown content of the page including YAML frontmatter "type" — exactly one of: "source", "concept", "entity"
"subtype" — for source: article|pdf|book|video|note|project; for entity: person|company|tool|model|framework|technology; omit for concept
"domain" — one of the domains in the schema (omit if none fits)
"content" — Markdown body only — NO frontmatter, NO path, NO slug
Follow the schema strictly: correct frontmatter fields, wikilinks as [[slug|Display Text]], Wikilinks in content: [[Display Name]] — just the display name, no slug, no pipe separator.
dates in YYYY-MM-DD format, and paraphrase rather than quoting verbatim.` Only link to pages listed in the inventory or pages you are creating in this response.`
// BuildPrompt constructs the user prompt for a single chunk. // BuildPrompt constructs the user prompt for a single chunk.
func BuildPrompt(schema, source, content string, inventory map[wiki.PageType][]wiki.Entry) string { func BuildPrompt(schema, source, content string, inventory map[wiki.PageType][]wiki.Entry) string {
@@ -30,7 +33,7 @@ func BuildPrompt(schema, source, content string, inventory map[wiki.PageType][]w
sb.WriteString("\n\n") sb.WriteString("\n\n")
sb.WriteString("## Existing wiki pages\n\n") sb.WriteString("## Existing wiki pages\n\n")
sb.WriteString("Link ONLY to pages in this inventory or pages you are creating in this response.\n\n") sb.WriteString("Reference these pages by display name only — [[Display Name]] — in your content.\n\n")
for _, pt := range []wiki.PageType{wiki.PageTypeConcept, wiki.PageTypeEntity, wiki.PageTypeSource} { for _, pt := range []wiki.PageType{wiki.PageTypeConcept, wiki.PageTypeEntity, wiki.PageTypeSource} {
entries := inventory[pt] entries := inventory[pt]
@@ -39,19 +42,19 @@ func BuildPrompt(schema, source, content string, inventory map[wiki.PageType][]w
fmt.Fprintf(&sb, "%s — (none yet)\n\n", label) fmt.Fprintf(&sb, "%s — (none yet)\n\n", label)
continue continue
} }
fmt.Fprintf(&sb, "%s — link ONLY under the matching section:\n", label) fmt.Fprintf(&sb, "%s:\n", label)
for _, e := range entries { for _, e := range entries {
fmt.Fprintf(&sb, " - [[%s|%s]]\n", e.Slug, e.Title) fmt.Fprintf(&sb, " - %s\n", e.Title)
} }
sb.WriteString("\n") sb.WriteString("\n")
} }
sb.WriteString("## Non-negotiable rules\n\n") sb.WriteString("## Non-negotiable rules\n\n")
sb.WriteString("1. Output ONLY a valid JSON array — no prose, no fences.\n") sb.WriteString("1. Output ONLY a valid JSON array — no prose, no fences.\n")
sb.WriteString("2. Slugs are kebab-case: lowercase, spaces→hyphens, no special chars.\n") sb.WriteString("2. Fields: title, type, subtype (if applicable), domain (if applicable), content.\n")
sb.WriteString("3. Wikilinks: [[slug|Display Text]] — the pipe is required.\n") sb.WriteString("3. Wikilinks: [[Display Name]] — no slug, no pipe. The pipeline handles slugs.\n")
sb.WriteString("4. Section links must match their section type.\n") sb.WriteString("4. Section links must match their section type (Related Concepts → concepts only, etc.).\n")
sb.WriteString("5. One source page per book — update it if inventory shows it exists.\n\n") sb.WriteString("5. One source page per book — if inventory shows it exists, return it as an UPDATE.\n\n")
fmt.Fprintf(&sb, "## Source: %s\n\n", source) fmt.Fprintf(&sb, "## Source: %s\n\n", source)
sb.WriteString(content) sb.WriteString(content)
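With this change the prompt no longer asks the model for slugs; the pipeline derives them from titles via `wiki.Slug`, whose body is not part of this diff. The removed rule ("kebab-case: lowercase, spaces→hyphens, no special chars") and the fixtures ("Shape Up Book" written to `shape-up-book.md`) suggest roughly the following behaviour. This is a sketch under those assumptions; `slugify` is a hypothetical stand-in and handles ASCII letters and digits only.

```go
package main

import (
	"fmt"
	"strings"
)

// slugify lowercases the title, keeps ASCII letters and digits, and collapses
// every run of other characters into a single hyphen. Leading and trailing
// separators are dropped.
func slugify(title string) string {
	var sb strings.Builder
	pendingHyphen := false
	for _, r := range strings.ToLower(title) {
		if (r >= 'a' && r <= 'z') || (r >= '0' && r <= '9') {
			// Emit at most one hyphen between two kept characters.
			if pendingHyphen && sb.Len() > 0 {
				sb.WriteByte('-')
			}
			pendingHyphen = false
			sb.WriteRune(r)
			continue
		}
		pendingHyphen = true
	}
	return sb.String()
}

func main() {
	for _, title := range []string{"Shape Up Book", "Ryan Singer", "FinBERT"} {
		fmt.Printf("%q -> %q\n", title, slugify(title))
	}
}
```

Computing the slug in one place means two chunks that both return the title "Foo" resolve to the same path, which is what the duplicate-merge test relies on.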


@@ -0,0 +1,115 @@
// ingestion/internal/pipeline/refs.go
package pipeline
import (
"os"
"path/filepath"
"regexp"
"strings"
"github.com/mathiasbq/hyperguild/ingestion/internal/wiki"
)
var wikilinkRE = regexp.MustCompile(`\[\[([^|\]]+)\|`)
// injectSourceRefs finds the source page in the proposed batch, extracts its
// wikilinks, and injects a back-reference into every linked concept or entity page.
// Pages that exist on disk but are not in the current batch are loaded and
// appended so they will be updated on write.
func injectSourceRefs(pages []wiki.Page, inventory map[wiki.PageType][]wiki.Entry, brainDir string) []wiki.Page {
sourceSlug, sourceTitle, found := findSourcePage(pages)
if !found {
return pages
}
var sourceContent string
for _, p := range pages {
if strings.HasPrefix(p.Path, "wiki/sources/") &&
strings.TrimSuffix(filepath.Base(p.Path), ".md") == sourceSlug {
sourceContent = p.Content
break
}
}
linkedSlugs := extractWikilinks(sourceContent)
sourceRef := "- [[" + sourceSlug + "|" + sourceTitle + "]]"
bySlug := make(map[string]int, len(pages))
for i, p := range pages {
if !strings.HasPrefix(p.Path, "wiki/sources/") {
bySlug[strings.TrimSuffix(filepath.Base(p.Path), ".md")] = i
}
}
for slug := range linkedSlugs {
if slug == sourceSlug {
continue
}
if idx, ok := bySlug[slug]; ok {
pages[idx] = addSourceRef(pages[idx], sourceRef)
continue
}
pt, ok := findInInventory(slug, inventory)
if !ok {
continue
}
diskPath := filepath.Join(brainDir, "wiki", string(pt), slug+".md")
b, err := os.ReadFile(diskPath)
if err != nil {
continue
}
page := wiki.Page{
Path: "wiki/" + string(pt) + "/" + slug + ".md",
Content: string(b),
}
pages = append(pages, addSourceRef(page, sourceRef))
}
return pages
}
// addSourceRef injects sourceRef into the ## Sources bullet section of page
// using wiki.Merge, which deduplicates bullets automatically.
func addSourceRef(page wiki.Page, sourceRef string) wiki.Page {
patch := wiki.Page{
Path: page.Path,
Content: "\n## Sources\n\n" + sourceRef + "\n",
}
return wiki.Merge(page, patch)
}
// extractWikilinks returns the set of slugs referenced as [[slug|...]] in content.
func extractWikilinks(content string) map[string]bool {
slugs := make(map[string]bool)
for _, m := range wikilinkRE.FindAllStringSubmatch(content, -1) {
slugs[m[1]] = true
}
return slugs
}
// findSourcePage returns the slug and title of the first wiki/sources/ page in pages.
func findSourcePage(pages []wiki.Page) (slug, title string, found bool) {
for _, p := range pages {
if strings.HasPrefix(p.Path, "wiki/sources/") {
slug = strings.TrimSuffix(filepath.Base(p.Path), ".md")
title = extractTitle(p.Content)
if title == "" {
title = slug
}
return slug, title, true
}
}
return "", "", false
}
// findInInventory returns the PageType for a slug if it appears in the inventory.
func findInInventory(slug string, inventory map[wiki.PageType][]wiki.Entry) (wiki.PageType, bool) {
for pt, entries := range inventory {
for _, e := range entries {
if e.Slug == slug {
return pt, true
}
}
}
return "", false
}
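Note that `wikilinkRE` only matches piped links. Because `buildAndWrite` calls `CanonicalizeLinks` before `injectSourceRefs`, links to known pages have presumably been rewritten to `[[slug|Display]]` by this point, while unresolved `[[Display Name]]` links are skipped. The sketch below reuses the same pattern to show that difference; `pipedSlugs` is a hypothetical helper, not a function in the package.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

// pipedLinkRE copies the pattern from refs.go: it captures the slug of a
// piped link [[slug|Display]] and requires the pipe to be present.
var pipedLinkRE = regexp.MustCompile(`\[\[([^|\]]+)\|`)

// pipedSlugs returns the unique captured slugs in sorted order.
func pipedSlugs(content string) []string {
	seen := make(map[string]bool)
	var out []string
	for _, m := range pipedLinkRE.FindAllStringSubmatch(content, -1) {
		if !seen[m[1]] {
			seen[m[1]] = true
			out = append(out, m[1])
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// The plain [[Betting]] link has no pipe, so only the piped slug is found.
	fmt.Println(pipedSlugs("See [[shape-up|Shape Up]] and [[Betting]].")) // prints [shape-up]
}
```

A source page whose links could not be canonicalized therefore produces no back-references, which is consistent with the `TestRun_MergesDuplicatePaths` expectation that unknown links stay plain.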


@@ -0,0 +1,172 @@
// ingestion/internal/pipeline/refs_test.go
package pipeline
import (
"os"
"path/filepath"
"testing"
"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
"github.com/mathiasbq/hyperguild/ingestion/internal/wiki"
)
func makeInventory(concepts, entities []string) map[wiki.PageType][]wiki.Entry {
inv := map[wiki.PageType][]wiki.Entry{
wiki.PageTypeConcept: {},
wiki.PageTypeEntity: {},
wiki.PageTypeSource: {},
}
for _, slug := range concepts {
inv[wiki.PageTypeConcept] = append(inv[wiki.PageTypeConcept], wiki.Entry{Slug: slug, Title: slug})
}
for _, slug := range entities {
inv[wiki.PageTypeEntity] = append(inv[wiki.PageTypeEntity], wiki.Entry{Slug: slug, Title: slug})
}
return inv
}
func TestInjectSourceRefs_NoSourcePage(t *testing.T) {
pages := []wiki.Page{
{Path: "wiki/concepts/foo.md", Content: "---\ntitle: Foo\n---\n\n## Definition\n\nFoo.\n"},
}
got := injectSourceRefs(pages, makeInventory(nil, nil), t.TempDir())
assert.Equal(t, pages, got)
}
func TestInjectSourceRefs_InjectsIntoProposedConcept(t *testing.T) {
pages := []wiki.Page{
{
Path: "wiki/sources/my-article.md",
Content: "---\ntitle: My Article\n---\n\n## Summary\n\nSee [[domain-driven-design|Domain Driven Design]].\n",
},
{
Path: "wiki/concepts/domain-driven-design.md",
Content: "---\ntitle: Domain Driven Design\n---\n\n## Definition\n\nA methodology.\n",
},
}
got := injectSourceRefs(pages, makeInventory(nil, nil), t.TempDir())
require.Len(t, got, 2)
assert.Contains(t, got[1].Content, "## Sources")
assert.Contains(t, got[1].Content, "[[my-article|My Article]]")
}
func TestInjectSourceRefs_LoadsConceptFromDisk(t *testing.T) {
brainDir := t.TempDir()
conceptDir := filepath.Join(brainDir, "wiki", "concepts")
require.NoError(t, os.MkdirAll(conceptDir, 0o755))
require.NoError(t, os.WriteFile(
filepath.Join(conceptDir, "shape-up.md"),
[]byte("---\ntitle: Shape Up\n---\n\n## Definition\n\nA methodology.\n"),
0o644,
))
pages := []wiki.Page{
{
Path: "wiki/sources/my-article.md",
Content: "---\ntitle: My Article\n---\n\n## Summary\n\nSee [[shape-up|Shape Up]].\n",
},
}
inv := makeInventory([]string{"shape-up"}, nil)
got := injectSourceRefs(pages, inv, brainDir)
require.Len(t, got, 2)
var conceptPage wiki.Page
for _, p := range got {
if p.Path == "wiki/concepts/shape-up.md" {
conceptPage = p
}
}
assert.Contains(t, conceptPage.Content, "## Sources")
assert.Contains(t, conceptPage.Content, "[[my-article|My Article]]")
assert.Contains(t, conceptPage.Content, "## Definition")
}
func TestInjectSourceRefs_NoSelfReference(t *testing.T) {
pages := []wiki.Page{
{
Path: "wiki/sources/my-article.md",
Content: "---\ntitle: My Article\n---\n\n## Summary\n\nSelf-link [[my-article|My Article]].\n",
},
}
got := injectSourceRefs(pages, makeInventory(nil, nil), t.TempDir())
assert.Len(t, got, 1)
}
func TestInjectSourceRefs_DeduplicatesOnReingestion(t *testing.T) {
pages := []wiki.Page{
{
Path: "wiki/sources/my-article.md",
Content: "---\ntitle: My Article\n---\n\n## Summary\n\nSee [[ddd|DDD]].\n",
},
{
Path: "wiki/concepts/ddd.md",
Content: "---\ntitle: DDD\n---\n\n## Definition\n\nA thing.\n\n## Sources\n\n- [[my-article|My Article]]\n",
},
}
got := injectSourceRefs(pages, makeInventory(nil, nil), t.TempDir())
require.Len(t, got, 2)
count := 0
for _, line := range splitLines(got[1].Content) {
if line == "- [[my-article|My Article]]" {
count++
}
}
assert.Equal(t, 1, count, "source ref should appear exactly once")
}
func TestInjectSourceRefs_InjectsIntoEntity(t *testing.T) {
pages := []wiki.Page{
{
Path: "wiki/sources/book.md",
Content: "---\ntitle: Book\n---\n\n## Summary\n\nBy [[ryan-singer|Ryan Singer]].\n",
},
{
Path: "wiki/entities/ryan-singer.md",
Content: "---\ntitle: Ryan Singer\n---\n\n## Description\n\nA designer.\n",
},
}
got := injectSourceRefs(pages, makeInventory(nil, nil), t.TempDir())
require.Len(t, got, 2)
var entity wiki.Page
for _, p := range got {
if p.Path == "wiki/entities/ryan-singer.md" {
entity = p
}
}
assert.Contains(t, entity.Content, "[[book|Book]]")
}
func TestExtractWikilinks(t *testing.T) {
content := "See [[foo|Foo]] and [[bar|Bar]] and [[foo|Foo again]]."
got := extractWikilinks(content)
assert.True(t, got["foo"])
assert.True(t, got["bar"])
assert.Len(t, got, 2, "duplicate slugs should be deduplicated")
}
func splitLines(s string) []string {
var out []string
start := 0
for i := 0; i < len(s); i++ {
if s[i] == '\n' {
if line := s[start:i]; line != "" {
out = append(out, line)
}
start = i + 1
}
}
if last := s[start:]; last != "" {
out = append(out, last)
}
return out
}


@@ -14,13 +14,12 @@ import (
"github.com/stretchr/testify/require" "github.com/stretchr/testify/require"
"github.com/mathiasbq/hyperguild/ingestion/internal/pipeline" "github.com/mathiasbq/hyperguild/ingestion/internal/pipeline"
"github.com/mathiasbq/hyperguild/ingestion/internal/wiki"
) )
// successComplete returns a valid JSON-encoded page array for any call. // successComplete returns a valid JSON-encoded RawPage array for any call.
func successComplete(page wiki.Page) pipeline.CompleteFunc { func successComplete(raw pipeline.RawPage) pipeline.CompleteFunc {
return func(ctx context.Context, system, user string) (string, error) { return func(ctx context.Context, system, user string) (string, error) {
b, err := json.Marshal([]wiki.Page{page}) b, err := json.Marshal([]pipeline.RawPage{raw})
if err != nil { if err != nil {
return "", err return "", err
} }
@@ -50,16 +49,19 @@ func TestStart_ProcessesFile(t *testing.T) {
require.NoError(t, os.WriteFile(rawFile, []byte("Content about Shape Up."), 0o644)) require.NoError(t, os.WriteFile(rawFile, []byte("Content about Shape Up."), 0o644))
date := time.Now().UTC().Format("2006-01-02") date := time.Now().UTC().Format("2006-01-02")
wikiPage := wiki.Page{ rawPage := pipeline.RawPage{
Path: "wiki/sources/shape-up-book.md", Title: "Shape Up Book",
Content: "---\ntitle: Shape Up Book\ntype: article\ndomain: product-management\ndate_ingested: " + date + "\nlast_updated: " + date + "\naliases:\n - Shape Up Book\n---\n\n## Summary\n\nA book about Shape Up.\n", Type: "source",
Subtype: "article",
Domain: "product-management",
Content: "## Summary\n\nA book about Shape Up.\n",
} }
cfg := Config{ cfg := Config{
BrainDir: brainDir, BrainDir: brainDir,
Interval: 50 * time.Millisecond, Interval: 50 * time.Millisecond,
Pipeline: pipeline.Config{ Pipeline: pipeline.Config{
Complete: successComplete(wikiPage), Complete: successComplete(rawPage),
ChunkSize: 0, ChunkSize: 0,
Schema: "# Schema\nThree page types.", Schema: "# Schema\nThree page types.",
}, },
@@ -193,12 +195,14 @@ func TestProcessDir_SkipsSubdirs(t *testing.T) {
// Track which sources were passed to Complete. // Track which sources were passed to Complete.
var processedSources []string var processedSources []string
completeFn := func(ctx context.Context, system, user string) (string, error) { completeFn := func(ctx context.Context, system, user string) (string, error) {
// Record that this was called; return a minimal valid page. // Record that this was called; return a minimal valid RawPage.
page := wiki.Page{ raw := pipeline.RawPage{
Path: "wiki/sources/valid.md", Title: "Valid",
Content: "---\ntitle: Valid\n---\n\n## Summary\n\nValid.\n", Type: "source",
Subtype: "article",
Content: "## Summary\n\nValid.\n",
} }
b, _ := json.Marshal([]wiki.Page{page}) b, _ := json.Marshal([]pipeline.RawPage{raw})
processedSources = append(processedSources, "called") processedSources = append(processedSources, "called")
return string(b), nil return string(b), nil
} }


@@ -17,6 +17,8 @@ func (s *Skill) Handle(ctx context.Context, tool string, args json.RawMessage) (
return s.query(ctx, args) return s.query(ctx, args)
case "brain_write": case "brain_write":
return s.write(ctx, args) return s.write(ctx, args)
case "brain_ingest_raw":
return s.ingestRaw(ctx, args)
case "brain_ingest": case "brain_ingest":
return s.ingest(ctx, args) return s.ingest(ctx, args)
case "brain_search": case "brain_search":
@@ -98,6 +100,33 @@ func (s *Skill) ingest(ctx context.Context, args json.RawMessage) (json.RawMessa
return nil, fmt.Errorf("either content+source or path is required") return nil, fmt.Errorf("either content+source or path is required")
} }
type ingestRawArgs struct {
Source string `json:"source"`
Pages []any `json:"pages"`
DryRun bool `json:"dry_run,omitempty"`
}
func (s *Skill) ingestRaw(ctx context.Context, args json.RawMessage) (json.RawMessage, error) {
var a ingestRawArgs
if err := json.Unmarshal(args, &a); err != nil {
return nil, fmt.Errorf("parse args: %w", err)
}
if s.cfg.IngestSvcURL == "" {
return nil, fmt.Errorf("brain_ingest_raw: INGEST_SVC_URL not configured")
}
if a.Source == "" {
return nil, fmt.Errorf("source is required")
}
if len(a.Pages) == 0 {
return nil, fmt.Errorf("pages is required and must be non-empty")
}
return s.postTo(ctx, s.cfg.IngestSvcURL+"/ingest-raw", map[string]any{
"source": a.Source,
"pages": a.Pages,
"dry_run": a.DryRun,
})
}
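For callers of `POST /ingest-raw`, the handler above forwards three keys: `source`, `pages`, and `dry_run`. The sketch below builds such a body with typed pages and repeats the same two validations. The struct names and the `omitempty` tags are assumptions for illustration; the page field names follow the JSON keys used in the `ParseRawPages` tests.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// rawPage mirrors the fields the tool schema lists for each page.
// The omitempty tags are an assumption of this sketch.
type rawPage struct {
	Title   string `json:"title"`
	Type    string `json:"type"`
	Subtype string `json:"subtype,omitempty"`
	Domain  string `json:"domain,omitempty"`
	Content string `json:"content"`
}

// ingestRawRequest is a hypothetical typed form of the forwarded payload.
type ingestRawRequest struct {
	Source string    `json:"source"`
	Pages  []rawPage `json:"pages"`
	DryRun bool      `json:"dry_run"`
}

// buildIngestRawBody validates the arguments the same way ingestRaw does
// (source required, pages non-empty) and returns the JSON request body.
func buildIngestRawBody(source string, pages []rawPage, dryRun bool) ([]byte, error) {
	if source == "" {
		return nil, errors.New("source is required")
	}
	if len(pages) == 0 {
		return nil, errors.New("pages is required and must be non-empty")
	}
	return json.Marshal(ingestRawRequest{Source: source, Pages: pages, DryRun: dryRun})
}

func main() {
	body, err := buildIngestRawBody("shape-up-book", []rawPage{
		{Title: "Betting", Type: "concept", Content: "## Definition\n\nA technique."},
	}, true)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(body))
}
```

Setting `dry_run` to true is a cheap way to inspect which pages would be written before committing a curated batch.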
type searchArgs struct { type searchArgs struct {
Query string `json:"query"` Query string `json:"query"`
Collection string `json:"collection,omitempty"` Collection string `json:"collection,omitempty"`


@@ -55,6 +55,32 @@ func (s *Skill) Tools() []registry.ToolDef {
}, },
} }
if s.cfg.IngestSvcURL != "" { if s.cfg.IngestSvcURL != "" {
tools = append(tools, registry.ToolDef{
Name: "brain_ingest_raw",
Description: "Ingest pre-structured pages into the brain wiki, bypassing the LLM extraction step. " +
"Use when you (the calling agent) have already extracted entities, concepts, and content from a source. " +
"Provide source (human-readable name) and pages (array of {title, type, subtype, domain, content} objects). " +
"The pipeline computes slugs, paths, frontmatter, wikilink canonicalization, and source back-references. " +
"Returns the list of wiki pages written.",
InputSchema: schema([]string{"source", "pages"}, map[string]any{
"source": map[string]any{"type": "string", "description": "human-readable name for the source, e.g. 'shape-up-book'"},
"pages": map[string]any{
"type": "array",
"items": map[string]any{
"type": "object",
"required": []string{"title", "type", "content"},
"properties": map[string]any{
"title": map[string]any{"type": "string", "description": "page title, e.g. 'Hash Encoding'"},
"type": map[string]any{"type": "string", "enum": []string{"source", "concept", "entity"}, "description": "page type"},
"subtype": map[string]any{"type": "string", "description": "entity: person|company|tool|model|framework|technology; source: article|pdf|book|video|note|project"},
"domain": map[string]any{"type": "string", "description": "knowledge domain, e.g. 'Machine Learning'"},
"content": map[string]any{"type": "string", "description": "markdown body — no frontmatter, use [[Display Name]] for wikilinks"},
},
},
},
"dry_run": map[string]any{"type": "boolean"},
}),
})
tools = append(tools, registry.ToolDef{ tools = append(tools, registry.ToolDef{
Name: "brain_ingest", Name: "brain_ingest",
Description: "Ingest content into the brain wiki (brain/wiki/). Calls an LLM to produce structured wiki pages. " + Description: "Ingest content into the brain wiki (brain/wiki/). Calls an LLM to produce structured wiki pages. " +