Files
hyperguild/docs/superpowers/plans/2026-04-23-level3-slug-authority.md
2026-04-23 17:37:45 +02:00

41 KiB
Raw Permalink Blame History

Level 3: Strip Slug Authority from LLM — Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Remove slug/path/frontmatter generation from the LLM; pipeline derives all slugs deterministically from titles via wiki.Slug().

Architecture: Add RawPage type + ParseRawPages (LLM returns minimal JSON), BuildPages (computes slugs/paths/frontmatter), and CanonicalizeLinks (converts [[Display Name]][[slug|Display Name]]). Wire into pipeline.Run. Update system prompt and schema doc.

Tech Stack: Go 1.23, encoding/json, regexp, strings, testify

Working directory for all commands: ingestion/ (the Go module root — always cd ingestion before running go commands)


File Map

File Action Responsibility
ingestion/internal/pipeline/parse.go Modify Replace ParsePages+wiki.Page deserialization with ParseRawPages+RawPage type
ingestion/internal/pipeline/parse_test.go Modify Replace old ParsePages tests with ParseRawPages tests
ingestion/internal/pipeline/build.go Create BuildPages — computes slug, path, frontmatter from RawPage
ingestion/internal/pipeline/build_test.go Create Tests for BuildPages
ingestion/internal/pipeline/links.go Create CanonicalizeLinks — converts [[Display Name]] → `[[slug
ingestion/internal/pipeline/links_test.go Create Tests for CanonicalizeLinks
ingestion/internal/pipeline/pipeline.go Modify Wire ParseRawPages → BuildPages → CanonicalizeLinks into Run; update tests
ingestion/internal/pipeline/pipeline_test.go Modify Update mock LLM responses to new JSON format
ingestion/internal/pipeline/prompt.go Modify New system prompt + updated BuildPrompt for new JSON contract
brain/schema.md Modify Update wikilink format and JSON output format

Unchanged: resolve.go, refs.go, backfill.go, merge.go, chunk.go, resolve_test.go, refs_test.go, backfill_test.go


Task 1: RawPage type + ParseRawPages

Files:

  • Modify: ingestion/internal/pipeline/parse.go

  • Modify: ingestion/internal/pipeline/parse_test.go

  • Step 1: Write failing tests

Replace the entire content of ingestion/internal/pipeline/parse_test.go with:

// ingestion/internal/pipeline/parse_test.go
package pipeline

import (
	"testing"

	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
)

func TestParseRawPages_ValidJSON(t *testing.T) {
	input := `[{"title":"Shape Up","type":"source","subtype":"book","domain":"product-strategy","content":"## Summary\n\nFoo."},{"title":"Betting","type":"concept","content":"## Definition\n\nA technique."}]`
	pages, warnings := ParseRawPages(input)
	require.Len(t, pages, 2)
	assert.Empty(t, warnings)
	assert.Equal(t, "Shape Up", pages[0].Title)
	assert.Equal(t, "source", pages[0].Type)
	assert.Equal(t, "book", pages[0].Subtype)
	assert.Equal(t, "product-strategy", pages[0].Domain)
	assert.Equal(t, "Betting", pages[1].Title)
	assert.Equal(t, "concept", pages[1].Type)
	assert.Empty(t, pages[1].Subtype)
}

func TestParseRawPages_StripsFences(t *testing.T) {
	input := "```json\n[{\"title\":\"Foo\",\"type\":\"concept\",\"content\":\"## Definition\\n\\nFoo.\"}]\n```"
	pages, warnings := ParseRawPages(input)
	require.Len(t, pages, 1)
	assert.Empty(t, warnings)
	assert.Equal(t, "Foo", pages[0].Title)
}

func TestParseRawPages_TruncationRecovery(t *testing.T) {
	input := `[{"title":"Foo","type":"concept","content":"## Definition\n\nFoo."},{"title":"Bar","type":"concept","content":"trunc`
	pages, warnings := ParseRawPages(input)
	require.Len(t, pages, 1)
	assert.Equal(t, "Foo", pages[0].Title)
	assert.NotEmpty(t, warnings)
}

func TestParseRawPages_EmptyInput(t *testing.T) {
	pages, warnings := ParseRawPages("")
	assert.Empty(t, pages)
	assert.NotEmpty(t, warnings)
}

func TestParseRawPages_PlainFence(t *testing.T) {
	input := "```\n[{\"title\":\"Foo\",\"type\":\"concept\",\"content\":\"ok\"}]\n```"
	pages, warnings := ParseRawPages(input)
	require.Len(t, pages, 1)
	assert.Empty(t, warnings)
}

func TestParseRawPages_MissingTitle(t *testing.T) {
	// Missing title — still parsed, Title is empty string
	input := `[{"type":"concept","content":"## Definition\n\nFoo."}]`
	pages, warnings := ParseRawPages(input)
	require.Len(t, pages, 1)
	assert.Empty(t, warnings)
	assert.Empty(t, pages[0].Title)
}
  • Step 2: Run tests to verify they fail
cd ingestion && go test ./internal/pipeline/ -run TestParseRawPages -v

Expected: FAIL — ParseRawPages undefined, RawPage undefined.

  • Step 3: Replace parse.go with new implementation

Replace the entire content of ingestion/internal/pipeline/parse.go with:

// ingestion/internal/pipeline/parse.go
package pipeline

import (
	"encoding/json"
	"fmt"
	"strings"
)

// RawPage is the LLM's output format — minimal structured data with no path or frontmatter.
// The pipeline derives slugs, paths, and frontmatter from these fields.
type RawPage struct {
	Title   string `json:"title"`
	Type    string `json:"type"`    // "source" | "concept" | "entity"
	Subtype string `json:"subtype"` // entity: person|company|tool|model|framework|technology; source: article|pdf|book|video|note|project
	Domain  string `json:"domain"`
	Content string `json:"content"` // Markdown body only — no frontmatter
}

// ParseRawPages parses LLM output as a JSON array of RawPage objects.
// If the array is truncated mid-object (token limit), it salvages all complete objects.
func ParseRawPages(output string) ([]RawPage, []string) {
	output = strings.TrimSpace(output)
	if output == "" {
		return nil, []string{"LLM returned empty output"}
	}

	output = stripFences(output)

	var pages []RawPage
	if err := json.Unmarshal([]byte(output), &pages); err == nil {
		return pages, nil
	}

	// Truncation recovery: find last `}` that closes a complete object.
	idx := strings.LastIndex(output, "}")
	if idx < 0 {
		return nil, []string{"LLM output contained no complete JSON objects"}
	}

	start := strings.Index(output, "[")
	if start < 0 {
		return nil, []string{"LLM output contained no JSON array opening bracket"}
	}

	candidate := output[start:idx+1] + "]"
	if err := json.Unmarshal([]byte(candidate), &pages); err != nil {
		return nil, []string{fmt.Sprintf("truncation recovery failed: %v", err)}
	}

	return pages, []string{fmt.Sprintf("LLM output was truncated; recovered %d page(s)", len(pages))}
}

func stripFences(s string) string {
	for _, prefix := range []string{"```json\n", "```json\r\n", "```\n", "```\r\n"} {
		if strings.HasPrefix(s, prefix) {
			s = strings.TrimPrefix(s, prefix)
			s = strings.TrimSuffix(strings.TrimSpace(s), "```")
			return strings.TrimSpace(s)
		}
	}
	return s
}
  • Step 4: Run tests to verify they pass
cd ingestion && go test ./internal/pipeline/ -run TestParseRawPages -v

Expected: all 6 tests PASS.

  • Step 5: Verify package still compiles (pipeline.go still references old ParsePages — that's expected for now)
cd ingestion && go build ./...

Expected: compile error mentioning ParsePages undefined — that's fine, we'll fix it in Task 4. If the error is something else, investigate.

Actually the old ParsePages no longer exists so pipeline.go will fail. That's OK — this task is intentionally breaking the pipeline step to set up for Task 4. If CI blocks on this, group Tasks 14 into a single commit.

  • Step 6: Commit
cd ingestion && git add internal/pipeline/parse.go internal/pipeline/parse_test.go
git commit -m "feat(pipeline): replace ParsePages with ParseRawPages + RawPage type"

Task 2: BuildPages

Files:

  • Create: ingestion/internal/pipeline/build.go

  • Create: ingestion/internal/pipeline/build_test.go

  • Step 1: Write failing tests

Create ingestion/internal/pipeline/build_test.go:

// ingestion/internal/pipeline/build_test.go
package pipeline

import (
	"strings"
	"testing"

	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
)

func TestBuildPages_SourcePage(t *testing.T) {
	raw := []RawPage{
		{
			Title:   "Shape Up",
			Type:    "source",
			Subtype: "book",
			Domain:  "product-strategy",
			Content: "## Summary\n\nA book about shaping product work.\n",
		},
	}
	pages := BuildPages(raw, "shape-up", "2026-04-23")
	require.Len(t, pages, 1)

	p := pages[0]
	assert.Equal(t, "wiki/sources/shape-up.md", p.Path)
	assert.Contains(t, p.Content, "title: Shape Up")
	assert.Contains(t, p.Content, "type: book")
	assert.Contains(t, p.Content, "domain: product-strategy")
	assert.Contains(t, p.Content, "date_ingested: 2026-04-23")
	assert.Contains(t, p.Content, "last_updated: 2026-04-23")
	assert.Contains(t, p.Content, "aliases:\n  - Shape Up")
	assert.Contains(t, p.Content, "## Summary")
	assert.True(t, strings.HasPrefix(p.Content, "---\n"), "content must start with frontmatter")
}

func TestBuildPages_ConceptPage(t *testing.T) {
	raw := []RawPage{
		{
			Title:   "Betting",
			Type:    "concept",
			Domain:  "product-strategy",
			Content: "## Definition\n\nA resource allocation technique.\n",
		},
	}
	pages := BuildPages(raw, "shape-up", "2026-04-23")
	require.Len(t, pages, 1)

	p := pages[0]
	assert.Equal(t, "wiki/concepts/betting.md", p.Path)
	assert.Contains(t, p.Content, "title: Betting")
	assert.Contains(t, p.Content, "domain: product-strategy")
	assert.Contains(t, p.Content, "last_updated: 2026-04-23")
	assert.Contains(t, p.Content, "aliases:\n  - Betting")
	assert.NotContains(t, p.Content, "date_ingested")
	assert.Contains(t, p.Content, "## Definition")
}

func TestBuildPages_EntityPage(t *testing.T) {
	raw := []RawPage{
		{
			Title:   "Ryan Singer",
			Type:    "entity",
			Subtype: "person",
			Domain:  "product-strategy",
			Content: "## Description\n\nA product designer.\n",
		},
	}
	pages := BuildPages(raw, "shape-up", "2026-04-23")
	require.Len(t, pages, 1)

	p := pages[0]
	assert.Equal(t, "wiki/entities/ryan-singer.md", p.Path)
	assert.Contains(t, p.Content, "title: Ryan Singer")
	assert.Contains(t, p.Content, "type: person")
	assert.Contains(t, p.Content, "domain: product-strategy")
	assert.Contains(t, p.Content, "last_updated: 2026-04-23")
	assert.Contains(t, p.Content, "aliases:\n  - Ryan Singer")
	assert.NotContains(t, p.Content, "date_ingested")
}

func TestBuildPages_SourceSlugUsedForSourcePage(t *testing.T) {
	// LLM title differs from filename — pipeline uses sourceSlug for the source page path.
	raw := []RawPage{
		{Title: "FinBERT: A Pretrained Model", Type: "source", Subtype: "article", Content: "## Summary\n\nA model.\n"},
	}
	pages := BuildPages(raw, "finbert-huggingface", "2026-04-23")
	require.Len(t, pages, 1)
	assert.Equal(t, "wiki/sources/finbert-huggingface.md", pages[0].Path)
	// Concept/entity slugs derived from title, not sourceSlug
}

func TestBuildPages_ConceptSlugDerivedFromTitle(t *testing.T) {
	raw := []RawPage{
		{Title: "Domain-Driven Design", Type: "concept", Content: "## Definition\n\nFoo.\n"},
	}
	pages := BuildPages(raw, "some-source", "2026-04-23")
	require.Len(t, pages, 1)
	assert.Equal(t, "wiki/concepts/domain-driven-design.md", pages[0].Path)
}

func TestBuildPages_SourceDefaultSubtype(t *testing.T) {
	// If subtype is omitted for a source, default to "article"
	raw := []RawPage{
		{Title: "Some Post", Type: "source", Content: "## Summary\n\nA post.\n"},
	}
	pages := BuildPages(raw, "some-post", "2026-04-23")
	require.Len(t, pages, 1)
	assert.Contains(t, pages[0].Content, "type: article")
}

func TestBuildPages_OmitsDomainWhenEmpty(t *testing.T) {
	raw := []RawPage{
		{Title: "Betting", Type: "concept", Content: "## Definition\n\nFoo.\n"},
	}
	pages := BuildPages(raw, "src", "2026-04-23")
	require.Len(t, pages, 1)
	assert.NotContains(t, pages[0].Content, "domain:")
}

func TestBuildPages_MultiplePages(t *testing.T) {
	raw := []RawPage{
		{Title: "Shape Up", Type: "source", Subtype: "book", Content: "## Summary\n\nA book.\n"},
		{Title: "Betting", Type: "concept", Content: "## Definition\n\nA technique.\n"},
		{Title: "Ryan Singer", Type: "entity", Subtype: "person", Content: "## Description\n\nA designer.\n"},
	}
	pages := BuildPages(raw, "shape-up", "2026-04-23")
	require.Len(t, pages, 3)
	assert.Equal(t, "wiki/sources/shape-up.md", pages[0].Path)
	assert.Equal(t, "wiki/concepts/betting.md", pages[1].Path)
	assert.Equal(t, "wiki/entities/ryan-singer.md", pages[2].Path)
}
  • Step 2: Run tests to verify they fail
cd ingestion && go test ./internal/pipeline/ -run TestBuildPages -v

Expected: FAIL — BuildPages undefined.

  • Step 3: Create build.go

Create ingestion/internal/pipeline/build.go:

// ingestion/internal/pipeline/build.go
package pipeline

import (
	"fmt"
	"strings"

	"github.com/mathiasbq/hyperguild/ingestion/internal/wiki"
)

// BuildPages converts RawPages from the LLM into wiki.Pages with computed slugs,
// paths, and YAML frontmatter. sourceSlug is the slug of the source being ingested
// (derived from the filename, not the LLM title).
func BuildPages(rawPages []RawPage, sourceSlug, date string) []wiki.Page {
	out := make([]wiki.Page, 0, len(rawPages))
	for _, rp := range rawPages {
		out = append(out, buildPage(rp, sourceSlug, date))
	}
	return out
}

func buildPage(rp RawPage, sourceSlug, date string) wiki.Page {
	var slug, dir string
	switch rp.Type {
	case "source":
		slug = sourceSlug
		dir = "wiki/sources"
	case "concept":
		slug = wiki.Slug(rp.Title)
		dir = "wiki/concepts"
	case "entity":
		slug = wiki.Slug(rp.Title)
		dir = "wiki/entities"
	default:
		slug = wiki.Slug(rp.Title)
		dir = "wiki/" + rp.Type
	}

	path := dir + "/" + slug + ".md"
	fm := buildFrontmatter(rp, date)

	return wiki.Page{
		Path:    path,
		Content: fm + "\n" + rp.Content,
	}
}

func buildFrontmatter(rp RawPage, date string) string {
	var sb strings.Builder
	sb.WriteString("---\n")
	fmt.Fprintf(&sb, "title: %s\n", rp.Title)

	switch rp.Type {
	case "source":
		subtype := rp.Subtype
		if subtype == "" {
			subtype = "article"
		}
		fmt.Fprintf(&sb, "type: %s\n", subtype)
		if rp.Domain != "" {
			fmt.Fprintf(&sb, "domain: %s\n", rp.Domain)
		}
		fmt.Fprintf(&sb, "date_ingested: %s\n", date)
		fmt.Fprintf(&sb, "last_updated: %s\n", date)
	case "concept":
		if rp.Domain != "" {
			fmt.Fprintf(&sb, "domain: %s\n", rp.Domain)
		}
		fmt.Fprintf(&sb, "last_updated: %s\n", date)
	case "entity":
		if rp.Subtype != "" {
			fmt.Fprintf(&sb, "type: %s\n", rp.Subtype)
		}
		if rp.Domain != "" {
			fmt.Fprintf(&sb, "domain: %s\n", rp.Domain)
		}
		fmt.Fprintf(&sb, "last_updated: %s\n", date)
	default:
		if rp.Domain != "" {
			fmt.Fprintf(&sb, "domain: %s\n", rp.Domain)
		}
		fmt.Fprintf(&sb, "last_updated: %s\n", date)
	}

	fmt.Fprintf(&sb, "aliases:\n  - %s\n", rp.Title)
	sb.WriteString("---\n")
	return sb.String()
}
  • Step 4: Run tests to verify they pass
cd ingestion && go test ./internal/pipeline/ -run TestBuildPages -v

Expected: all 8 tests PASS.

  • Step 5: Commit
git add ingestion/internal/pipeline/build.go ingestion/internal/pipeline/build_test.go
git commit -m "feat(pipeline): add BuildPages — compute slugs/paths/frontmatter from RawPage"

Files:

  • Create: ingestion/internal/pipeline/links.go

  • Create: ingestion/internal/pipeline/links_test.go

  • Step 1: Write failing tests

Create ingestion/internal/pipeline/links_test.go:

// ingestion/internal/pipeline/links_test.go
package pipeline

import (
	"testing"

	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"

	"github.com/mathiasbq/hyperguild/ingestion/internal/wiki"
)

func TestCanonicalizeLinks_KnownTitle(t *testing.T) {
	pages := []wiki.Page{
		{
			Path:    "wiki/sources/shape-up.md",
			Content: "---\ntitle: Shape Up\n---\n\n## Summary\n\nSee [[Betting]].\n",
		},
	}
	inventory := map[wiki.PageType][]wiki.Entry{
		wiki.PageTypeConcept: {
			{Slug: "betting", Title: "Betting"},
		},
	}
	got, warnings := CanonicalizeLinks(pages, inventory)
	require.Len(t, got, 1)
	assert.Empty(t, warnings)
	assert.Contains(t, got[0].Content, "[[betting|Betting]]")
	assert.NotContains(t, got[0].Content, "[[Betting]]")
}

func TestCanonicalizeLinks_UnknownTitleLeftAsIs(t *testing.T) {
	pages := []wiki.Page{
		{
			Path:    "wiki/sources/shape-up.md",
			Content: "---\ntitle: Shape Up\n---\n\n## Summary\n\nSee [[Ghost Concept]].\n",
		},
	}
	inventory := map[wiki.PageType][]wiki.Entry{}
	got, warnings := CanonicalizeLinks(pages, inventory)
	require.Len(t, got, 1)
	assert.NotEmpty(t, warnings)
	assert.Contains(t, got[0].Content, "[[Ghost Concept]]")
}

func TestCanonicalizeLinks_AlreadyCanonicalLinkUntouched(t *testing.T) {
	// Links already in [[slug|Display]] format must not be double-converted
	pages := []wiki.Page{
		{
			Path:    "wiki/sources/shape-up.md",
			Content: "---\ntitle: Shape Up\n---\n\n## Summary\n\nSee [[betting|Betting]].\n",
		},
	}
	inventory := map[wiki.PageType][]wiki.Entry{
		wiki.PageTypeConcept: {
			{Slug: "betting", Title: "Betting"},
		},
	}
	got, warnings := CanonicalizeLinks(pages, inventory)
	require.Len(t, got, 1)
	assert.Empty(t, warnings)
	// Should remain exactly as-is — not double-wrapped
	assert.Contains(t, got[0].Content, "[[betting|Betting]]")
	assert.NotContains(t, got[0].Content, "[[betting|[[betting|Betting]]]]")
}

func TestCanonicalizeLinks_CaseInsensitiveMatch(t *testing.T) {
	pages := []wiki.Page{
		{
			Path:    "wiki/sources/foo.md",
			Content: "---\ntitle: Foo\n---\n\n## Summary\n\nSee [[domain driven design]].\n",
		},
	}
	inventory := map[wiki.PageType][]wiki.Entry{
		wiki.PageTypeConcept: {
			{Slug: "domain-driven-design", Title: "Domain Driven Design"},
		},
	}
	got, warnings := CanonicalizeLinks(pages, inventory)
	require.Len(t, got, 1)
	assert.Empty(t, warnings)
	assert.Contains(t, got[0].Content, "[[domain-driven-design|domain driven design]]")
}

func TestCanonicalizeLinks_CurrentBatchPagesResolved(t *testing.T) {
	// A concept created in the same batch should be canonicalizable
	pages := []wiki.Page{
		{
			Path:    "wiki/sources/shape-up.md",
			Content: "---\ntitle: Shape Up\n---\n\n## Summary\n\nSee [[Betting]].\n",
		},
		{
			Path:    "wiki/concepts/betting.md",
			Content: "---\ntitle: Betting\n---\n\n## Definition\n\nA technique.\n",
		},
	}
	inventory := map[wiki.PageType][]wiki.Entry{} // empty — Betting is in the batch, not inventory

	got, warnings := CanonicalizeLinks(pages, inventory)
	require.Len(t, got, 2)
	assert.Empty(t, warnings)
	assert.Contains(t, got[0].Content, "[[betting|Betting]]")
}

func TestCanonicalizeLinks_MultipleLinksInOnePage(t *testing.T) {
	pages := []wiki.Page{
		{
			Path:    "wiki/sources/foo.md",
			Content: "---\ntitle: Foo\n---\n\n## Summary\n\nSee [[Betting]] and [[Shape Up]].\n",
		},
	}
	inventory := map[wiki.PageType][]wiki.Entry{
		wiki.PageTypeConcept: {
			{Slug: "betting", Title: "Betting"},
		},
		wiki.PageTypeSource: {
			{Slug: "shape-up", Title: "Shape Up"},
		},
	}
	got, warnings := CanonicalizeLinks(pages, inventory)
	require.Len(t, got, 1)
	assert.Empty(t, warnings)
	assert.Contains(t, got[0].Content, "[[betting|Betting]]")
	assert.Contains(t, got[0].Content, "[[shape-up|Shape Up]]")
}
  • Step 2: Run tests to verify they fail
cd ingestion && go test ./internal/pipeline/ -run TestCanonicalizeLinks -v

Expected: FAIL — CanonicalizeLinks undefined.

  • Step 3: Create links.go

Create ingestion/internal/pipeline/links.go:

// ingestion/internal/pipeline/links.go
package pipeline

import (
	"fmt"
	"path/filepath"
	"regexp"
	"strings"

	"github.com/mathiasbq/hyperguild/ingestion/internal/wiki"
)

// plainLinkRE matches [[Display Name]] — wikilinks without a slug pipe.
// It does NOT match [[slug|Display]] (those already have a pipe).
var plainLinkRE = regexp.MustCompile(`\[\[([^\]|]+)\]\]`)

// CanonicalizeLinks converts [[Display Name]] wikilinks to [[slug|Display Name]]
// using a title→slug map built from the inventory and current batch.
// Unknown titles are left as-is and returned as warnings.
func CanonicalizeLinks(pages []wiki.Page, inventory map[wiki.PageType][]wiki.Entry) ([]wiki.Page, []string) {
	titleToSlug := buildTitleMap(pages, inventory)

	var allWarnings []string
	out := make([]wiki.Page, len(pages))
	for i, p := range pages {
		newContent, warnings := canonicalizeContent(p.Content, titleToSlug)
		p.Content = newContent
		out[i] = p
		allWarnings = append(allWarnings, warnings...)
	}
	return out, allWarnings
}

// buildTitleMap builds a lowercase-title → slug map from inventory and current batch.
// Current batch entries take precedence over inventory (they may be updates).
func buildTitleMap(pages []wiki.Page, inventory map[wiki.PageType][]wiki.Entry) map[string]string {
	m := make(map[string]string)
	for _, entries := range inventory {
		for _, e := range entries {
			m[strings.ToLower(e.Title)] = e.Slug
		}
	}
	// Current batch overrides inventory
	for _, p := range pages {
		title := extractTitle(p.Content)
		slug := strings.TrimSuffix(filepath.Base(p.Path), ".md")
		if title != "" && slug != "" {
			m[strings.ToLower(title)] = slug
		}
	}
	return m
}

func canonicalizeContent(content string, titleToSlug map[string]string) (string, []string) {
	var warnings []string
	result := plainLinkRE.ReplaceAllStringFunc(content, func(match string) string {
		sub := plainLinkRE.FindStringSubmatch(match)
		if len(sub) < 2 {
			return match
		}
		displayName := sub[1]
		slug, ok := titleToSlug[strings.ToLower(displayName)]
		if !ok {
			warnings = append(warnings, fmt.Sprintf("unknown wikilink: [[%s]]", displayName))
			return match
		}
		return "[[" + slug + "|" + displayName + "]]"
	})
	return result, warnings
}
  • Step 4: Run tests to verify they pass
cd ingestion && go test ./internal/pipeline/ -run TestCanonicalizeLinks -v

Expected: all 6 tests PASS.

  • Step 5: Commit
git add ingestion/internal/pipeline/links.go ingestion/internal/pipeline/links_test.go
git commit -m "feat(pipeline): add CanonicalizeLinks — convert [[Display Name]] to [[slug|Display Name]]"

Task 4: Wire pipeline.go + update pipeline_test.go

Files:

  • Modify: ingestion/internal/pipeline/pipeline.go

  • Modify: ingestion/internal/pipeline/pipeline_test.go

  • Step 1: Update pipeline_test.go to use new LLM response format

Replace the entire content of ingestion/internal/pipeline/pipeline_test.go with:

// ingestion/internal/pipeline/pipeline_test.go
package pipeline

import (
	"context"
	"encoding/json"
	"net/http"
	"net/http/httptest"
	"os"
	"path/filepath"
	"testing"
	"time"

	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"

	"github.com/mathiasbq/hyperguild/ingestion/internal/llm"
)

func TestRun_WritesPages(t *testing.T) {
	brainDir := t.TempDir()
	for _, sub := range []string{"wiki/concepts", "wiki/entities", "wiki/sources"} {
		require.NoError(t, os.MkdirAll(filepath.Join(brainDir, sub), 0o755))
	}

	llmResponse := mustJSON([]RawPage{
		{
			Title:   "Test Article",
			Type:    "source",
			Subtype: "article",
			Domain:  "software-engineering",
			Content: "## Summary\n\nA test article.\n\n## Key Claims\n\n- It tests things.\n\n## Concepts Introduced or Reinforced\n\n[[Testing]]\n\n## Entities Mentioned\n\n## Open Questions Raised\n",
		},
		{
			Title:   "Testing",
			Type:    "concept",
			Domain:  "software-engineering",
			Content: "## Definition\n\nThe practice of verifying software.\n\n## Why It Matters\n\nCatches bugs.\n\n## Related Concepts\n\n## Related Entities\n\n## Sources\n\n## Evolving Notes\n",
		},
	})

	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		_ = json.NewEncoder(w).Encode(map[string]any{
			"choices": []map[string]any{
				{"message": map[string]any{"role": "assistant", "content": llmResponse}},
			},
		})
	}))
	defer srv.Close()

	cfg := Config{
		Complete:  llm.New(srv.URL, "", "test-model", 30*time.Second).Complete,
		ChunkSize: 0,
	}

	result, err := Run(context.Background(), cfg, brainDir, "An article about testing.", "test-article", false)
	require.NoError(t, err)
	assert.Len(t, result.Pages, 2)

	_, err = os.Stat(filepath.Join(brainDir, "wiki", "sources", "test-article.md"))
	require.NoError(t, err)
	_, err = os.Stat(filepath.Join(brainDir, "wiki", "concepts", "testing.md"))
	require.NoError(t, err)
	_, err = os.Stat(filepath.Join(brainDir, "wiki", "index.md"))
	require.NoError(t, err)
	_, err = os.Stat(filepath.Join(brainDir, "log.md"))
	require.NoError(t, err)
}

func TestRun_DryRunDoesNotWrite(t *testing.T) {
	brainDir := t.TempDir()
	for _, sub := range []string{"wiki/concepts", "wiki/entities", "wiki/sources"} {
		require.NoError(t, os.MkdirAll(filepath.Join(brainDir, sub), 0o755))
	}

	llmResponse := mustJSON([]RawPage{{
		Title:   "Foo",
		Type:    "source",
		Subtype: "article",
		Content: "## Summary\n\nFoo.\n",
	}})

	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		_ = json.NewEncoder(w).Encode(map[string]any{
			"choices": []map[string]any{{"message": map[string]any{"content": llmResponse}}},
		})
	}))
	defer srv.Close()

	cfg := Config{Complete: llm.New(srv.URL, "", "m", 30*time.Second).Complete}
	result, err := Run(context.Background(), cfg, brainDir, "foo content", "foo", true)
	require.NoError(t, err)
	assert.Len(t, result.Pages, 1)

	_, err = os.Stat(filepath.Join(brainDir, "wiki", "sources", "foo.md"))
	assert.True(t, os.IsNotExist(err))
}

func TestRun_MergesDuplicatePaths(t *testing.T) {
	brainDir := t.TempDir()
	for _, sub := range []string{"wiki/concepts", "wiki/entities", "wiki/sources"} {
		require.NoError(t, os.MkdirAll(filepath.Join(brainDir, sub), 0o755))
	}

	// LLM returns same title twice (simulates multi-chunk duplicate)
	llmResponse := mustJSON([]RawPage{
		{Title: "Foo", Type: "concept", Content: "## Definition\n\nFirst.\n\n## Related Concepts\n\n[[Bar]]\n"},
		{Title: "Foo", Type: "concept", Content: "## Definition\n\nSecond.\n\n## Related Concepts\n\n[[Baz]]\n"},
	})

	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		_ = json.NewEncoder(w).Encode(map[string]any{
			"choices": []map[string]any{{"message": map[string]any{"content": llmResponse}}},
		})
	}))
	defer srv.Close()

	cfg := Config{Complete: llm.New(srv.URL, "", "m", 30*time.Second).Complete}
	result, err := Run(context.Background(), cfg, brainDir, "content", "foo", false)
	require.NoError(t, err)
	assert.Len(t, result.Pages, 1) // deduplicated

	content, err := os.ReadFile(filepath.Join(brainDir, "wiki", "concepts", "foo.md"))
	require.NoError(t, err)
	// keep-first for Definition, union for Related Concepts
	assert.Contains(t, string(content), "First.")
	// Bar and Baz unknown → left as plain [[links]]
	assert.Contains(t, string(content), "[[Bar]]")
	assert.Contains(t, string(content), "[[Baz]]")
}

func mustJSON(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return string(b)
}
  • Step 2: Run tests to verify they fail (expected — pipeline.go still uses old ParsePages)
cd ingestion && go test ./internal/pipeline/ -run TestRun -v

Expected: compile error or test failures — that's fine.

  • Step 3: Update pipeline.go to wire new steps

Replace the entire content of ingestion/internal/pipeline/pipeline.go with:

// ingestion/internal/pipeline/pipeline.go
package pipeline

import (
	"context"
	"fmt"
	"os"
	"path/filepath"
	"strings"
	"time"

	"github.com/mathiasbq/hyperguild/ingestion/internal/wiki"
)

// CompleteFunc is the function signature for LLM calls.
type CompleteFunc func(ctx context.Context, system, user string) (string, error)

// Config holds pipeline configuration.
type Config struct {
	Complete  CompleteFunc
	ChunkSize int    // 0 = no chunking
	Schema    string // overrides brain/schema.md when set (useful in tests)
}

// Result is the outcome of a pipeline run.
type Result struct {
	Pages    []string // relative paths written (or would-be written in dry-run)
	Warnings []string
}

// Run ingests content and writes structured wiki pages to brainDir/wiki/.
// In dry-run mode, pages are returned but not written to disk.
func Run(ctx context.Context, cfg Config, brainDir, content, source string, dryRun bool) (Result, error) {
	inventory, err := wiki.LoadInventory(brainDir)
	if err != nil {
		return Result{}, fmt.Errorf("load inventory: %w", err)
	}

	schema := cfg.Schema
	if schema == "" {
		schema = loadSchema(brainDir)
	}

	sourceSlug := wiki.Slug(source)
	date := time.Now().UTC().Format("2006-01-02")
	chunks := Chunk(content, cfg.ChunkSize)

	var allRaw []RawPage
	var allWarnings []string

	for _, chunk := range chunks {
		userPrompt := BuildPrompt(schema, source, chunk, inventory)
		output, err := cfg.Complete(ctx, systemPrompt, userPrompt)
		if err != nil {
			return Result{}, fmt.Errorf("LLM call: %w", err)
		}
		raw, warnings := ParseRawPages(output)
		allRaw = append(allRaw, raw...)
		allWarnings = append(allWarnings, warnings...)
	}

	pages := BuildPages(allRaw, sourceSlug, date)
	resolved := Resolve(pages, inventory)
	canonicalized, linkWarnings := CanonicalizeLinks(resolved, inventory)
	allWarnings = append(allWarnings, linkWarnings...)
	withRefs := injectSourceRefs(canonicalized, inventory, brainDir)
	merged := mergeAll(withRefs)

	var written []string
	for _, page := range merged {
		if !dryRun {
			dest := filepath.Join(brainDir, filepath.FromSlash(page.Path))
			if err := os.MkdirAll(filepath.Dir(dest), 0o755); err != nil {
				return Result{}, fmt.Errorf("mkdir for %s: %w", page.Path, err)
			}
			if err := os.WriteFile(dest, []byte(page.Content), 0o644); err != nil {
				return Result{}, fmt.Errorf("write %s: %w", page.Path, err)
			}
		}
		written = append(written, page.Path)
	}

	if !dryRun {
		if err := wiki.RebuildIndex(brainDir, date); err != nil {
			allWarnings = append(allWarnings, fmt.Sprintf("rebuild index: %v", err))
		}
		if err := wiki.AppendLog(brainDir, source, written, allWarnings, date); err != nil {
			allWarnings = append(allWarnings, fmt.Sprintf("append log: %v", err))
		}
	}

	return Result{Pages: written, Warnings: allWarnings}, nil
}

// mergeAll deduplicates pages by path, merging content from later occurrences.
func mergeAll(pages []wiki.Page) []wiki.Page {
	order := make([]string, 0, len(pages))
	byPath := make(map[string]wiki.Page, len(pages))
	for _, p := range pages {
		if _, seen := byPath[p.Path]; !seen {
			order = append(order, p.Path)
			byPath[p.Path] = p
		} else {
			byPath[p.Path] = wiki.Merge(byPath[p.Path], p)
		}
	}
	result := make([]wiki.Page, 0, len(order))
	for _, path := range order {
		result = append(result, byPath[path])
	}
	return result
}

const defaultSchema = `# Brain Wiki Schema
Three page types: wiki/sources/, wiki/concepts/, wiki/entities/.
See brain/schema.md for the full schema.
`

func loadSchema(brainDir string) string {
	b, err := os.ReadFile(filepath.Join(brainDir, "schema.md"))
	if err != nil {
		return defaultSchema
	}
	return strings.TrimSpace(string(b))
}
  • Step 4: Run all pipeline tests
cd ingestion && go test ./internal/pipeline/ -v

Expected: all tests PASS. If TestRun_WritesPages has a warning about unknown wikilink [[Testing]] in warnings (since the Testing concept is in the same batch and should be resolved via CanonicalizeLinks), check that CanonicalizeLinks builds the batch map correctly before checking result.Warnings.

Note: TestRun_WritesPages now calls assert.Len(t, result.Pages, 2) without assert.Empty(t, result.Warnings) — unknown-link warnings from CanonicalizeLinks are acceptable since "Testing" is in the batch and should canonicalize cleanly. If the test fails with unexpected warnings, investigate buildTitleMap — it should find "Testing" from the current batch.

  • Step 5: Run full test suite
cd ingestion && go test ./... 2>&1

Expected: all packages pass.

  • Step 6: Commit
git add ingestion/internal/pipeline/pipeline.go ingestion/internal/pipeline/pipeline_test.go
git commit -m "feat(pipeline): wire ParseRawPages+BuildPages+CanonicalizeLinks into Run"

Task 5: Update prompt.go

Files:

  • Modify: ingestion/internal/pipeline/prompt.go

  • Step 1: Replace prompt.go

Replace the entire content of ingestion/internal/pipeline/prompt.go with:

// ingestion/internal/pipeline/prompt.go
package pipeline

import (
	"fmt"
	"strings"
	"time"

	"github.com/mathiasbq/hyperguild/ingestion/internal/wiki"
)

const systemPrompt = `You are a wiki agent. Read the source material and produce structured wiki pages following the schema provided.

Output ONLY a valid JSON array — no markdown fences, no other text before or after.
Each element must have exactly these fields:
  "title"   — exact page title (e.g. "FinBERT", "Ryan Singer", "Shape Up")
  "type"    — exactly one of: "source", "concept", "entity"
  "subtype" — for source: article|pdf|book|video|note|project; for entity: person|company|tool|model|framework|technology; omit for concept
  "domain"  — one of the domains in the schema (omit if none fits)
  "content" — Markdown body only — NO frontmatter, NO path, NO slug

Wikilinks in content: [[Display Name]] — just the display name, no slug, no pipe separator.
Only link to pages listed in the inventory or pages you are creating in this response.`

// BuildPrompt constructs the user prompt for a single chunk.
func BuildPrompt(schema, source, content string, inventory map[wiki.PageType][]wiki.Entry) string {
	var sb strings.Builder

	fmt.Fprintf(&sb, "Today's date is %s.\n\n", time.Now().UTC().Format("2006-01-02"))

	sb.WriteString("## Schema\n\n")
	sb.WriteString(schema)
	sb.WriteString("\n\n")

	sb.WriteString("## Existing wiki pages\n\n")
	sb.WriteString("Reference these pages by display name only — [[Display Name]] — in your content.\n\n")

	for _, pt := range []wiki.PageType{wiki.PageTypeConcept, wiki.PageTypeEntity, wiki.PageTypeSource} {
		entries := inventory[pt]
		label := strings.ToUpper(string(pt)[:1]) + string(pt)[1:]
		if len(entries) == 0 {
			fmt.Fprintf(&sb, "%s — (none yet)\n\n", label)
			continue
		}
		fmt.Fprintf(&sb, "%s:\n", label)
		for _, e := range entries {
			fmt.Fprintf(&sb, "  - %s\n", e.Title)
		}
		sb.WriteString("\n")
	}

	sb.WriteString("## Non-negotiable rules\n\n")
	sb.WriteString("1. Output ONLY a valid JSON array — no prose, no fences.\n")
	sb.WriteString("2. Fields: title, type, subtype (if applicable), domain (if applicable), content.\n")
	sb.WriteString("3. Wikilinks: [[Display Name]] — no slug, no pipe. The pipeline handles slugs.\n")
	sb.WriteString("4. Section links must match their section type (Related Concepts → concepts only, etc.).\n")
	sb.WriteString("5. One source page per book — if inventory shows it exists, return it as an UPDATE.\n\n")

	fmt.Fprintf(&sb, "## Source: %s\n\n", source)
	sb.WriteString(content)

	return sb.String()
}
  • Step 2: Verify compile and tests still pass
cd ingestion && go test ./... 2>&1

Expected: all tests PASS (prompt.go has no direct unit tests — covered via integration tests in pipeline_test.go).

  • Step 3: Commit
git add ingestion/internal/pipeline/prompt.go
git commit -m "feat(pipeline): update system prompt for new LLM JSON contract (no slugs)"

Task 6: Update brain/schema.md

Files:

  • Modify: brain/schema.md

  • Step 1: Update the schema doc

Replace the entire content of brain/schema.md with:

# Brain Wiki Schema

This document defines the three page types in the brain wiki.
The LLM must follow this schema exactly when generating wiki pages.

## Output Format

Return a JSON array. Each element:

```json
{
  "title":   "exact page title",
  "type":    "source | concept | entity",
  "subtype": "see below — omit for concept",
  "domain":  "see domains — omit if none fits",
  "content": "Markdown body only — no frontmatter, no path"
}
  • subtype for source: article | pdf | book | video | note | project
  • subtype for entity: person | company | tool | model | framework | technology
  • The pipeline computes slugs and frontmatter — never include them in output.

All cross-references use [[Display Name]] — just the display name, no slug, no pipe.

Rules:

  • Only link to pages in the inventory or pages you are creating in this response
  • The pipeline converts [[Display Name]] to [[slug|Display Name]] automatically
  • Section links must match their section type (Related Concepts → concept pages only, etc.)

Examples: [[Domain Driven Design]], [[Ryan Singer]], [[Shape Up]]

Domains

Use one of: ai-llm, software-engineering, product-strategy, finance-markets, personal, consulting, climate, infrastructure, security


Source Pages — wiki/sources/.md

One page per ingested source. Books are NEVER split across multiple source pages — update the existing one.

Body sections (in this order):

Summary

23 sentences. Core argument or finding.

Key Claims

Bulleted list. Paraphrase — no verbatim quotes or code.

Concepts Introduced or Reinforced

Wikilinks to concept pages ONLY. One per line.

Entities Mentioned

Wikilinks to entity pages ONLY. One per line.

Open Questions Raised

Gaps or follow-up questions from this source.

For books only, also add:

Chapters

One bullet per chapter with 12 sentence summary.

Argument Arc

Overall narrative as it becomes clear across chapters.

Updates

Dated entries appended on re-ingestion. NEVER rewrite — only append.


Concept Pages — wiki/concepts/.md

One page per idea, framework, methodology, or pattern.

Body sections (in this order):

Definition

One-paragraph plain-language explanation.

Why It Matters

Practical significance. Why should anyone care?

Wikilinks to concept pages ONLY.

Wikilinks to entity pages ONLY.

Sources

Wikilinks to source pages ONLY.

Evolving Notes

Updated as new sources arrive. Append, do not rewrite.


Entity Pages — wiki/entities/.md

One page per person, tool, organisation, technology, or product.

Body sections (in this order):

Description

One-line description.

Relevance

Why this entity matters to this knowledge base.

Key Positions, Products, or Claims

With dates where known.

Wikilinks to concept pages ONLY.

Wikilinks to entity pages ONLY.

Sources

Wikilinks to source pages ONLY.


Non-Negotiable Rules

  1. Output ONLY a valid JSON array — no markdown fences, no prose before or after
  2. Each element: {"title": "...", "type": "...", "subtype": "...", "domain": "...", "content": "..."}
  3. Never include slugs, paths, or frontmatter in output — the pipeline handles these
  4. Wikilinks: [[Display Name]] only — no pipe, no slug
  5. Dates always YYYY-MM-DD (used only in content body where contextually relevant)
  6. Never reproduce verbatim code — describe the pattern or technique
  7. Section links must match their section type
  8. One source page per book — if inventory shows it exists, include it as an UPDATE

- [ ] **Step 2: Verify Go tests still pass (schema.md is loaded at runtime, not compile time)**

```bash
cd ingestion && go test ./... 2>&1

Expected: all tests PASS.

  • Step 3: Commit
git add brain/schema.md
git commit -m "docs(schema): update LLM output format and wikilink convention for Level 3"

Self-Review

Spec coverage check:

Spec requirement Task
LLM returns {title, type, subtype, domain, content} Task 1 (RawPage), Task 5 (prompt)
Pipeline computes all slugs via wiki.Slug(title) Task 2 (BuildPages)
Source page uses sourceSlug from filename Task 2 (buildPage: case "source": slug = sourceSlug)
Frontmatter assembled by pipeline Task 2 (buildFrontmatter)
[[Display Name]] → `[[slug Display Name]]` canonicalization
CanonicalizeLinks runs after BuildPages, before injectSourceRefs Task 4 (pipeline.go step order)
Unknown titles left as-is + warnings Task 3 (links_test: UnknownTitleLeftAsIs)
Current-batch pages resolvable Task 3 (buildTitleMap includes batch)
injectSourceRefs unchanged — (no task needed)
Resolve unchanged — (no task needed)
brain/schema.md updated Task 6
Integration test for no slug duplication across chunks Task 4 (TestRun_MergesDuplicatePaths uses same title twice)

All spec requirements covered.

Placeholder scan: None found.

Type consistency:

  • RawPage defined in Task 1, used in Tasks 2, 4 ✓
  • BuildPages([]RawPage, string, string) []wiki.Page defined in Task 2, called in Task 4 ✓
  • CanonicalizeLinks([]wiki.Page, map[wiki.PageType][]wiki.Entry) ([]wiki.Page, []string) defined in Task 3, called in Task 4 ✓
  • ParseRawPages(string) ([]RawPage, []string) defined in Task 1, called in Task 4 ✓
  • extractTitle used in Task 3 (links.go) — defined in resolve.go (same package) ✓