Level 3: Strip Slug Authority from LLM — Implementation Plan
For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.
Goal: Remove slug/path/frontmatter generation from the LLM; pipeline derives all slugs deterministically from titles via wiki.Slug().
Architecture: Add RawPage type + ParseRawPages (LLM returns minimal JSON), BuildPages (computes slugs/paths/frontmatter), and CanonicalizeLinks (converts [[Display Name]] → [[slug|Display Name]]). Wire into pipeline.Run. Update system prompt and schema doc.
Tech Stack: Go 1.23, encoding/json, regexp, strings, testify
Working directory for all commands: ingestion/ (the Go module root — always cd ingestion before running go commands)
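Every slug in this plan comes from wiki.Slug(), which the plan calls but never shows. As orientation only, the sketch below is a hypothetical stand-in: slugSketch is an invented name, and the real wiki.Slug may treat punctuation or Unicode differently. It assumes the rule is "lowercase, keep letters and digits, collapse every other run of characters into one hyphen", which is consistent with the paths the Task 2 tests expect.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// slugSketch is a hypothetical stand-in for wiki.Slug (an assumption, not the real code).
// It lowercases the title, keeps letters and digits, and turns each run of
// other characters into a single hyphen, with no leading or trailing hyphen.
func slugSketch(title string) string {
	var sb strings.Builder
	pendingHyphen := false
	for _, r := range strings.ToLower(title) {
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			// Emit the deferred hyphen only between two kept characters.
			if pendingHyphen && sb.Len() > 0 {
				sb.WriteByte('-')
			}
			pendingHyphen = false
			sb.WriteRune(r)
			continue
		}
		pendingHyphen = true
	}
	return sb.String()
}

func main() {
	for _, title := range []string{"Shape Up", "Ryan Singer", "Domain-Driven Design"} {
		fmt.Printf("%q -> %q\n", title, slugSketch(title))
	}
}
```

Because the function is pure, the same title always yields the same path, which is the property the plan relies on when it removes slug generation from the LLM.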
File Map
| File | Action | Responsibility |
|---|---|---|
| ingestion/internal/pipeline/parse.go | Modify | Replace ParsePages+wiki.Page deserialization with ParseRawPages+RawPage type |
| ingestion/internal/pipeline/parse_test.go | Modify | Replace old ParsePages tests with ParseRawPages tests |
| ingestion/internal/pipeline/build.go | Create | BuildPages: computes slug, path, frontmatter from RawPage |
| ingestion/internal/pipeline/build_test.go | Create | Tests for BuildPages |
| ingestion/internal/pipeline/links.go | Create | CanonicalizeLinks: converts [[Display Name]] → `[[slug\|Display Name]]` |
| ingestion/internal/pipeline/links_test.go | Create | Tests for CanonicalizeLinks |
| ingestion/internal/pipeline/pipeline.go | Modify | Wire ParseRawPages → BuildPages → CanonicalizeLinks into Run; update tests |
| ingestion/internal/pipeline/pipeline_test.go | Modify | Update mock LLM responses to new JSON format |
| ingestion/internal/pipeline/prompt.go | Modify | New system prompt + updated BuildPrompt for new JSON contract |
| brain/schema.md | Modify | Update wikilink format and JSON output format |
Unchanged: resolve.go, refs.go, backfill.go, merge.go, chunk.go, resolve_test.go, refs_test.go, backfill_test.go
Task 1: RawPage type + ParseRawPages
Files:
- Modify: ingestion/internal/pipeline/parse.go
- Modify: ingestion/internal/pipeline/parse_test.go
- Step 1: Write failing tests
Replace the entire content of ingestion/internal/pipeline/parse_test.go with:
// ingestion/internal/pipeline/parse_test.go
package pipeline
import (
"testing"
"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
)
func TestParseRawPages_ValidJSON(t *testing.T) {
input := `[{"title":"Shape Up","type":"source","subtype":"book","domain":"product-strategy","content":"## Summary\n\nFoo."},{"title":"Betting","type":"concept","content":"## Definition\n\nA technique."}]`
pages, warnings := ParseRawPages(input)
require.Len(t, pages, 2)
assert.Empty(t, warnings)
assert.Equal(t, "Shape Up", pages[0].Title)
assert.Equal(t, "source", pages[0].Type)
assert.Equal(t, "book", pages[0].Subtype)
assert.Equal(t, "product-strategy", pages[0].Domain)
assert.Equal(t, "Betting", pages[1].Title)
assert.Equal(t, "concept", pages[1].Type)
assert.Empty(t, pages[1].Subtype)
}
func TestParseRawPages_StripsFences(t *testing.T) {
input := "```json\n[{\"title\":\"Foo\",\"type\":\"concept\",\"content\":\"## Definition\\n\\nFoo.\"}]\n```"
pages, warnings := ParseRawPages(input)
require.Len(t, pages, 1)
assert.Empty(t, warnings)
assert.Equal(t, "Foo", pages[0].Title)
}
func TestParseRawPages_TruncationRecovery(t *testing.T) {
input := `[{"title":"Foo","type":"concept","content":"## Definition\n\nFoo."},{"title":"Bar","type":"concept","content":"trunc`
pages, warnings := ParseRawPages(input)
require.Len(t, pages, 1)
assert.Equal(t, "Foo", pages[0].Title)
assert.NotEmpty(t, warnings)
}
func TestParseRawPages_EmptyInput(t *testing.T) {
pages, warnings := ParseRawPages("")
assert.Empty(t, pages)
assert.NotEmpty(t, warnings)
}
func TestParseRawPages_PlainFence(t *testing.T) {
input := "```\n[{\"title\":\"Foo\",\"type\":\"concept\",\"content\":\"ok\"}]\n```"
pages, warnings := ParseRawPages(input)
require.Len(t, pages, 1)
assert.Empty(t, warnings)
}
func TestParseRawPages_MissingTitle(t *testing.T) {
// Missing title — still parsed, Title is empty string
input := `[{"type":"concept","content":"## Definition\n\nFoo."}]`
pages, warnings := ParseRawPages(input)
require.Len(t, pages, 1)
assert.Empty(t, warnings)
assert.Empty(t, pages[0].Title)
}
- Step 2: Run tests to verify they fail
cd ingestion && go test ./internal/pipeline/ -run TestParseRawPages -v
Expected: FAIL — ParseRawPages undefined, RawPage undefined.
- Step 3: Replace parse.go with new implementation
Replace the entire content of ingestion/internal/pipeline/parse.go with:
// ingestion/internal/pipeline/parse.go
package pipeline
import (
"encoding/json"
"fmt"
"strings"
)
// RawPage is the LLM's output format — minimal structured data with no path or frontmatter.
// The pipeline derives slugs, paths, and frontmatter from these fields.
type RawPage struct {
Title string `json:"title"`
Type string `json:"type"` // "source" | "concept" | "entity"
Subtype string `json:"subtype"` // entity: person|company|tool|model|framework|technology; source: article|pdf|book|video|note|project
Domain string `json:"domain"`
Content string `json:"content"` // Markdown body only — no frontmatter
}
// ParseRawPages parses LLM output as a JSON array of RawPage objects.
// If the array is truncated mid-object (token limit), it salvages all complete objects.
func ParseRawPages(output string) ([]RawPage, []string) {
output = strings.TrimSpace(output)
if output == "" {
return nil, []string{"LLM returned empty output"}
}
output = stripFences(output)
var pages []RawPage
if err := json.Unmarshal([]byte(output), &pages); err == nil {
return pages, nil
}
// Truncation recovery: find last `}` that closes a complete object.
idx := strings.LastIndex(output, "}")
if idx < 0 {
return nil, []string{"LLM output contained no complete JSON objects"}
}
start := strings.Index(output, "[")
if start < 0 {
return nil, []string{"LLM output contained no JSON array opening bracket"}
}
candidate := output[start:idx+1] + "]"
if err := json.Unmarshal([]byte(candidate), &pages); err != nil {
return nil, []string{fmt.Sprintf("truncation recovery failed: %v", err)}
}
return pages, []string{fmt.Sprintf("LLM output was truncated; recovered %d page(s)", len(pages))}
}
func stripFences(s string) string {
for _, prefix := range []string{"```json\n", "```json\r\n", "```\n", "```\r\n"} {
if strings.HasPrefix(s, prefix) {
s = strings.TrimPrefix(s, prefix)
s = strings.TrimSuffix(strings.TrimSpace(s), "```")
return strings.TrimSpace(s)
}
}
return s
}
- Step 4: Run tests to verify they pass
cd ingestion && go test ./internal/pipeline/ -run TestParseRawPages -v
Expected: all 6 tests PASS once the package builds. Go compiles every non-test file in the package before running any test, so while pipeline.go still calls the removed ParsePages, this command stops with a build error instead (see Step 5).
- Step 5: Check the build (pipeline.go still references the old ParsePages, so a failure is expected for now)
cd ingestion && go build ./...
Expected: a compile error reporting ParsePages as undefined in pipeline.go. This task intentionally breaks the pipeline to set up Task 4, which fixes it. If the error is anything else, investigate. If CI blocks on the broken build, group Tasks 1–4 into a single commit.
- Step 6: Commit
cd ingestion && git add internal/pipeline/parse.go internal/pipeline/parse_test.go
git commit -m "feat(pipeline): replace ParsePages with ParseRawPages + RawPage type"
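The recovery path in ParseRawPages cuts at the last } in the output, which is not necessarily the end of a complete object: if the truncated page's content already contains a closing brace (quoted code, for example), the candidate fails to parse and every page is lost. A possible alternative, sketched here and not part of the plan, streams the array with json.Decoder and keeps each element that decodes fully. salvagePages is a hypothetical name, and RawPage is trimmed to three fields for brevity.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// RawPage is a trimmed copy of the plan's struct, for illustration only.
type RawPage struct {
	Title   string `json:"title"`
	Type    string `json:"type"`
	Content string `json:"content"`
}

// salvagePages (hypothetical) decodes a JSON array one element at a time.
// It returns every page decoded before the first failing element, together
// with that element's error, so callers can turn the error into a warning.
func salvagePages(output string) ([]RawPage, error) {
	dec := json.NewDecoder(strings.NewReader(output))
	tok, err := dec.Token() // consume the opening '['
	if err != nil {
		return nil, err
	}
	if d, ok := tok.(json.Delim); !ok || d != '[' {
		return nil, fmt.Errorf("expected JSON array, got %v", tok)
	}
	var pages []RawPage
	for dec.More() {
		var p RawPage
		if err := dec.Decode(&p); err != nil {
			return pages, err // truncated or malformed element: keep what we have
		}
		pages = append(pages, p)
	}
	return pages, nil
}

func main() {
	// The second object is cut off after a '}' that sits inside its content string.
	truncated := `[{"title":"Foo","type":"concept","content":"ok"},{"title":"Bar","type":"concept","content":"func f() {}`
	pages, err := salvagePages(truncated)
	fmt.Println(len(pages), err != nil)
}
```

If adopted, this would replace only the fallback branch; the fast path with a single json.Unmarshal can stay as it is.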
Task 2: BuildPages
Files:
- Create: ingestion/internal/pipeline/build.go
- Create: ingestion/internal/pipeline/build_test.go
- Step 1: Write failing tests
Create ingestion/internal/pipeline/build_test.go:
// ingestion/internal/pipeline/build_test.go
package pipeline
import (
"strings"
"testing"
"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
)
func TestBuildPages_SourcePage(t *testing.T) {
raw := []RawPage{
{
Title: "Shape Up",
Type: "source",
Subtype: "book",
Domain: "product-strategy",
Content: "## Summary\n\nA book about shaping product work.\n",
},
}
pages := BuildPages(raw, "shape-up", "2026-04-23")
require.Len(t, pages, 1)
p := pages[0]
assert.Equal(t, "wiki/sources/shape-up.md", p.Path)
assert.Contains(t, p.Content, "title: Shape Up")
assert.Contains(t, p.Content, "type: book")
assert.Contains(t, p.Content, "domain: product-strategy")
assert.Contains(t, p.Content, "date_ingested: 2026-04-23")
assert.Contains(t, p.Content, "last_updated: 2026-04-23")
assert.Contains(t, p.Content, "aliases:\n - Shape Up")
assert.Contains(t, p.Content, "## Summary")
assert.True(t, strings.HasPrefix(p.Content, "---\n"), "content must start with frontmatter")
}
func TestBuildPages_ConceptPage(t *testing.T) {
raw := []RawPage{
{
Title: "Betting",
Type: "concept",
Domain: "product-strategy",
Content: "## Definition\n\nA resource allocation technique.\n",
},
}
pages := BuildPages(raw, "shape-up", "2026-04-23")
require.Len(t, pages, 1)
p := pages[0]
assert.Equal(t, "wiki/concepts/betting.md", p.Path)
assert.Contains(t, p.Content, "title: Betting")
assert.Contains(t, p.Content, "domain: product-strategy")
assert.Contains(t, p.Content, "last_updated: 2026-04-23")
assert.Contains(t, p.Content, "aliases:\n - Betting")
assert.NotContains(t, p.Content, "date_ingested")
assert.Contains(t, p.Content, "## Definition")
}
func TestBuildPages_EntityPage(t *testing.T) {
raw := []RawPage{
{
Title: "Ryan Singer",
Type: "entity",
Subtype: "person",
Domain: "product-strategy",
Content: "## Description\n\nA product designer.\n",
},
}
pages := BuildPages(raw, "shape-up", "2026-04-23")
require.Len(t, pages, 1)
p := pages[0]
assert.Equal(t, "wiki/entities/ryan-singer.md", p.Path)
assert.Contains(t, p.Content, "title: Ryan Singer")
assert.Contains(t, p.Content, "type: person")
assert.Contains(t, p.Content, "domain: product-strategy")
assert.Contains(t, p.Content, "last_updated: 2026-04-23")
assert.Contains(t, p.Content, "aliases:\n - Ryan Singer")
assert.NotContains(t, p.Content, "date_ingested")
}
func TestBuildPages_SourceSlugUsedForSourcePage(t *testing.T) {
// LLM title differs from filename — pipeline uses sourceSlug for the source page path.
raw := []RawPage{
{Title: "FinBERT: A Pretrained Model", Type: "source", Subtype: "article", Content: "## Summary\n\nA model.\n"},
}
pages := BuildPages(raw, "finbert-huggingface", "2026-04-23")
require.Len(t, pages, 1)
assert.Equal(t, "wiki/sources/finbert-huggingface.md", pages[0].Path)
// Concept/entity slugs derived from title, not sourceSlug
}
func TestBuildPages_ConceptSlugDerivedFromTitle(t *testing.T) {
raw := []RawPage{
{Title: "Domain-Driven Design", Type: "concept", Content: "## Definition\n\nFoo.\n"},
}
pages := BuildPages(raw, "some-source", "2026-04-23")
require.Len(t, pages, 1)
assert.Equal(t, "wiki/concepts/domain-driven-design.md", pages[0].Path)
}
func TestBuildPages_SourceDefaultSubtype(t *testing.T) {
// If subtype is omitted for a source, default to "article"
raw := []RawPage{
{Title: "Some Post", Type: "source", Content: "## Summary\n\nA post.\n"},
}
pages := BuildPages(raw, "some-post", "2026-04-23")
require.Len(t, pages, 1)
assert.Contains(t, pages[0].Content, "type: article")
}
func TestBuildPages_OmitsDomainWhenEmpty(t *testing.T) {
raw := []RawPage{
{Title: "Betting", Type: "concept", Content: "## Definition\n\nFoo.\n"},
}
pages := BuildPages(raw, "src", "2026-04-23")
require.Len(t, pages, 1)
assert.NotContains(t, pages[0].Content, "domain:")
}
func TestBuildPages_MultiplePages(t *testing.T) {
raw := []RawPage{
{Title: "Shape Up", Type: "source", Subtype: "book", Content: "## Summary\n\nA book.\n"},
{Title: "Betting", Type: "concept", Content: "## Definition\n\nA technique.\n"},
{Title: "Ryan Singer", Type: "entity", Subtype: "person", Content: "## Description\n\nA designer.\n"},
}
pages := BuildPages(raw, "shape-up", "2026-04-23")
require.Len(t, pages, 3)
assert.Equal(t, "wiki/sources/shape-up.md", pages[0].Path)
assert.Equal(t, "wiki/concepts/betting.md", pages[1].Path)
assert.Equal(t, "wiki/entities/ryan-singer.md", pages[2].Path)
}
- Step 2: Run tests to verify they fail
cd ingestion && go test ./internal/pipeline/ -run TestBuildPages -v
Expected: FAIL — BuildPages undefined.
- Step 3: Create build.go
Create ingestion/internal/pipeline/build.go:
// ingestion/internal/pipeline/build.go
package pipeline
import (
"fmt"
"strings"
"github.com/mathiasbq/hyperguild/ingestion/internal/wiki"
)
// BuildPages converts RawPages from the LLM into wiki.Pages with computed slugs,
// paths, and YAML frontmatter. sourceSlug is the slug of the source being ingested
// (derived from the filename, not the LLM title).
func BuildPages(rawPages []RawPage, sourceSlug, date string) []wiki.Page {
out := make([]wiki.Page, 0, len(rawPages))
for _, rp := range rawPages {
out = append(out, buildPage(rp, sourceSlug, date))
}
return out
}
func buildPage(rp RawPage, sourceSlug, date string) wiki.Page {
var slug, dir string
switch rp.Type {
case "source":
slug = sourceSlug
dir = "wiki/sources"
case "concept":
slug = wiki.Slug(rp.Title)
dir = "wiki/concepts"
case "entity":
slug = wiki.Slug(rp.Title)
dir = "wiki/entities"
default:
slug = wiki.Slug(rp.Title)
dir = "wiki/" + rp.Type
}
path := dir + "/" + slug + ".md"
fm := buildFrontmatter(rp, date)
return wiki.Page{
Path: path,
Content: fm + "\n" + rp.Content,
}
}
func buildFrontmatter(rp RawPage, date string) string {
var sb strings.Builder
sb.WriteString("---\n")
fmt.Fprintf(&sb, "title: %s\n", rp.Title)
switch rp.Type {
case "source":
subtype := rp.Subtype
if subtype == "" {
subtype = "article"
}
fmt.Fprintf(&sb, "type: %s\n", subtype)
if rp.Domain != "" {
fmt.Fprintf(&sb, "domain: %s\n", rp.Domain)
}
fmt.Fprintf(&sb, "date_ingested: %s\n", date)
fmt.Fprintf(&sb, "last_updated: %s\n", date)
case "concept":
if rp.Domain != "" {
fmt.Fprintf(&sb, "domain: %s\n", rp.Domain)
}
fmt.Fprintf(&sb, "last_updated: %s\n", date)
case "entity":
if rp.Subtype != "" {
fmt.Fprintf(&sb, "type: %s\n", rp.Subtype)
}
if rp.Domain != "" {
fmt.Fprintf(&sb, "domain: %s\n", rp.Domain)
}
fmt.Fprintf(&sb, "last_updated: %s\n", date)
default:
if rp.Domain != "" {
fmt.Fprintf(&sb, "domain: %s\n", rp.Domain)
}
fmt.Fprintf(&sb, "last_updated: %s\n", date)
}
fmt.Fprintf(&sb, "aliases:\n - %s\n", rp.Title)
sb.WriteString("---\n")
return sb.String()
}
- Step 4: Run tests to verify they pass
cd ingestion && go test ./internal/pipeline/ -run TestBuildPages -v
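Expected: all 8 tests PASS. This requires the package to build: if Task 4 has not landed yet, pipeline.go's reference to the removed ParsePages stops the run with a compile error first.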
- Step 5: Commit
git add ingestion/internal/pipeline/build.go ingestion/internal/pipeline/build_test.go
git commit -m "feat(pipeline): add BuildPages — compute slugs/paths/frontmatter from RawPage"
Task 3: CanonicalizeLinks
Files:
- Create: ingestion/internal/pipeline/links.go
- Create: ingestion/internal/pipeline/links_test.go
- Step 1: Write failing tests
Create ingestion/internal/pipeline/links_test.go:
// ingestion/internal/pipeline/links_test.go
package pipeline
import (
"testing"
"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
"github.com/mathiasbq/hyperguild/ingestion/internal/wiki"
)
func TestCanonicalizeLinks_KnownTitle(t *testing.T) {
pages := []wiki.Page{
{
Path: "wiki/sources/shape-up.md",
Content: "---\ntitle: Shape Up\n---\n\n## Summary\n\nSee [[Betting]].\n",
},
}
inventory := map[wiki.PageType][]wiki.Entry{
wiki.PageTypeConcept: {
{Slug: "betting", Title: "Betting"},
},
}
got, warnings := CanonicalizeLinks(pages, inventory)
require.Len(t, got, 1)
assert.Empty(t, warnings)
assert.Contains(t, got[0].Content, "[[betting|Betting]]")
assert.NotContains(t, got[0].Content, "[[Betting]]")
}
func TestCanonicalizeLinks_UnknownTitleLeftAsIs(t *testing.T) {
pages := []wiki.Page{
{
Path: "wiki/sources/shape-up.md",
Content: "---\ntitle: Shape Up\n---\n\n## Summary\n\nSee [[Ghost Concept]].\n",
},
}
inventory := map[wiki.PageType][]wiki.Entry{}
got, warnings := CanonicalizeLinks(pages, inventory)
require.Len(t, got, 1)
assert.NotEmpty(t, warnings)
assert.Contains(t, got[0].Content, "[[Ghost Concept]]")
}
func TestCanonicalizeLinks_AlreadyCanonicalLinkUntouched(t *testing.T) {
// Links already in [[slug|Display]] format must not be double-converted
pages := []wiki.Page{
{
Path: "wiki/sources/shape-up.md",
Content: "---\ntitle: Shape Up\n---\n\n## Summary\n\nSee [[betting|Betting]].\n",
},
}
inventory := map[wiki.PageType][]wiki.Entry{
wiki.PageTypeConcept: {
{Slug: "betting", Title: "Betting"},
},
}
got, warnings := CanonicalizeLinks(pages, inventory)
require.Len(t, got, 1)
assert.Empty(t, warnings)
// Should remain exactly as-is — not double-wrapped
assert.Contains(t, got[0].Content, "[[betting|Betting]]")
assert.NotContains(t, got[0].Content, "[[betting|[[betting|Betting]]]]")
}
func TestCanonicalizeLinks_CaseInsensitiveMatch(t *testing.T) {
pages := []wiki.Page{
{
Path: "wiki/sources/foo.md",
Content: "---\ntitle: Foo\n---\n\n## Summary\n\nSee [[domain driven design]].\n",
},
}
inventory := map[wiki.PageType][]wiki.Entry{
wiki.PageTypeConcept: {
{Slug: "domain-driven-design", Title: "Domain Driven Design"},
},
}
got, warnings := CanonicalizeLinks(pages, inventory)
require.Len(t, got, 1)
assert.Empty(t, warnings)
assert.Contains(t, got[0].Content, "[[domain-driven-design|domain driven design]]")
}
func TestCanonicalizeLinks_CurrentBatchPagesResolved(t *testing.T) {
// A concept created in the same batch should be canonicalizable
pages := []wiki.Page{
{
Path: "wiki/sources/shape-up.md",
Content: "---\ntitle: Shape Up\n---\n\n## Summary\n\nSee [[Betting]].\n",
},
{
Path: "wiki/concepts/betting.md",
Content: "---\ntitle: Betting\n---\n\n## Definition\n\nA technique.\n",
},
}
inventory := map[wiki.PageType][]wiki.Entry{} // empty — Betting is in the batch, not inventory
got, warnings := CanonicalizeLinks(pages, inventory)
require.Len(t, got, 2)
assert.Empty(t, warnings)
assert.Contains(t, got[0].Content, "[[betting|Betting]]")
}
func TestCanonicalizeLinks_MultipleLinksInOnePage(t *testing.T) {
pages := []wiki.Page{
{
Path: "wiki/sources/foo.md",
Content: "---\ntitle: Foo\n---\n\n## Summary\n\nSee [[Betting]] and [[Shape Up]].\n",
},
}
inventory := map[wiki.PageType][]wiki.Entry{
wiki.PageTypeConcept: {
{Slug: "betting", Title: "Betting"},
},
wiki.PageTypeSource: {
{Slug: "shape-up", Title: "Shape Up"},
},
}
got, warnings := CanonicalizeLinks(pages, inventory)
require.Len(t, got, 1)
assert.Empty(t, warnings)
assert.Contains(t, got[0].Content, "[[betting|Betting]]")
assert.Contains(t, got[0].Content, "[[shape-up|Shape Up]]")
}
- Step 2: Run tests to verify they fail
cd ingestion && go test ./internal/pipeline/ -run TestCanonicalizeLinks -v
Expected: FAIL — CanonicalizeLinks undefined.
- Step 3: Create links.go
Create ingestion/internal/pipeline/links.go:
// ingestion/internal/pipeline/links.go
package pipeline
import (
"fmt"
"path/filepath"
"regexp"
"strings"
"github.com/mathiasbq/hyperguild/ingestion/internal/wiki"
)
// plainLinkRE matches [[Display Name]] — wikilinks without a slug pipe.
// It does NOT match [[slug|Display]] (those already have a pipe).
var plainLinkRE = regexp.MustCompile(`\[\[([^\]|]+)\]\]`)
// CanonicalizeLinks converts [[Display Name]] wikilinks to [[slug|Display Name]]
// using a title→slug map built from the inventory and current batch.
// Unknown titles are left as-is and returned as warnings.
func CanonicalizeLinks(pages []wiki.Page, inventory map[wiki.PageType][]wiki.Entry) ([]wiki.Page, []string) {
titleToSlug := buildTitleMap(pages, inventory)
var allWarnings []string
out := make([]wiki.Page, len(pages))
for i, p := range pages {
newContent, warnings := canonicalizeContent(p.Content, titleToSlug)
p.Content = newContent
out[i] = p
allWarnings = append(allWarnings, warnings...)
}
return out, allWarnings
}
// buildTitleMap builds a lowercase-title → slug map from inventory and current batch.
// Current batch entries take precedence over inventory (they may be updates).
func buildTitleMap(pages []wiki.Page, inventory map[wiki.PageType][]wiki.Entry) map[string]string {
m := make(map[string]string)
for _, entries := range inventory {
for _, e := range entries {
m[strings.ToLower(e.Title)] = e.Slug
}
}
// Current batch overrides inventory
for _, p := range pages {
title := extractTitle(p.Content)
slug := strings.TrimSuffix(filepath.Base(p.Path), ".md")
if title != "" && slug != "" {
m[strings.ToLower(title)] = slug
}
}
return m
}
func canonicalizeContent(content string, titleToSlug map[string]string) (string, []string) {
var warnings []string
result := plainLinkRE.ReplaceAllStringFunc(content, func(match string) string {
sub := plainLinkRE.FindStringSubmatch(match)
if len(sub) < 2 {
return match
}
displayName := sub[1]
slug, ok := titleToSlug[strings.ToLower(displayName)]
if !ok {
warnings = append(warnings, fmt.Sprintf("unknown wikilink: [[%s]]", displayName))
return match
}
return "[[" + slug + "|" + displayName + "]]"
})
return result, warnings
}
- Step 4: Run tests to verify they pass
cd ingestion && go test ./internal/pipeline/ -run TestCanonicalizeLinks -v
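Expected: all 6 tests PASS. This requires the package to build: if Task 4 has not landed yet, pipeline.go's reference to the removed ParsePages stops the run with a compile error first.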
- Step 5: Commit
git add ingestion/internal/pipeline/links.go ingestion/internal/pipeline/links_test.go
git commit -m "feat(pipeline): add CanonicalizeLinks — convert [[Display Name]] to [[slug|Display Name]]"
Task 4: Wire pipeline.go + update pipeline_test.go
Files:
- Modify: ingestion/internal/pipeline/pipeline.go
- Modify: ingestion/internal/pipeline/pipeline_test.go
- Step 1: Update pipeline_test.go to use new LLM response format
Replace the entire content of ingestion/internal/pipeline/pipeline_test.go with:
// ingestion/internal/pipeline/pipeline_test.go
package pipeline
import (
"context"
"encoding/json"
"net/http"
"net/http/httptest"
"os"
"path/filepath"
"testing"
"time"
"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
"github.com/mathiasbq/hyperguild/ingestion/internal/llm"
)
func TestRun_WritesPages(t *testing.T) {
brainDir := t.TempDir()
for _, sub := range []string{"wiki/concepts", "wiki/entities", "wiki/sources"} {
require.NoError(t, os.MkdirAll(filepath.Join(brainDir, sub), 0o755))
}
llmResponse := mustJSON([]RawPage{
{
Title: "Test Article",
Type: "source",
Subtype: "article",
Domain: "software-engineering",
Content: "## Summary\n\nA test article.\n\n## Key Claims\n\n- It tests things.\n\n## Concepts Introduced or Reinforced\n\n[[Testing]]\n\n## Entities Mentioned\n\n## Open Questions Raised\n",
},
{
Title: "Testing",
Type: "concept",
Domain: "software-engineering",
Content: "## Definition\n\nThe practice of verifying software.\n\n## Why It Matters\n\nCatches bugs.\n\n## Related Concepts\n\n## Related Entities\n\n## Sources\n\n## Evolving Notes\n",
},
})
srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
w.Header().Set("Content-Type", "application/json")
_ = json.NewEncoder(w).Encode(map[string]any{
"choices": []map[string]any{
{"message": map[string]any{"role": "assistant", "content": llmResponse}},
},
})
}))
defer srv.Close()
cfg := Config{
Complete: llm.New(srv.URL, "", "test-model", 30*time.Second).Complete,
ChunkSize: 0,
}
result, err := Run(context.Background(), cfg, brainDir, "An article about testing.", "test-article", false)
require.NoError(t, err)
assert.Len(t, result.Pages, 2)
_, err = os.Stat(filepath.Join(brainDir, "wiki", "sources", "test-article.md"))
require.NoError(t, err)
_, err = os.Stat(filepath.Join(brainDir, "wiki", "concepts", "testing.md"))
require.NoError(t, err)
_, err = os.Stat(filepath.Join(brainDir, "wiki", "index.md"))
require.NoError(t, err)
_, err = os.Stat(filepath.Join(brainDir, "log.md"))
require.NoError(t, err)
}
func TestRun_DryRunDoesNotWrite(t *testing.T) {
brainDir := t.TempDir()
for _, sub := range []string{"wiki/concepts", "wiki/entities", "wiki/sources"} {
require.NoError(t, os.MkdirAll(filepath.Join(brainDir, sub), 0o755))
}
llmResponse := mustJSON([]RawPage{{
Title: "Foo",
Type: "source",
Subtype: "article",
Content: "## Summary\n\nFoo.\n",
}})
srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
_ = json.NewEncoder(w).Encode(map[string]any{
"choices": []map[string]any{{"message": map[string]any{"content": llmResponse}}},
})
}))
defer srv.Close()
cfg := Config{Complete: llm.New(srv.URL, "", "m", 30*time.Second).Complete}
result, err := Run(context.Background(), cfg, brainDir, "foo content", "foo", true)
require.NoError(t, err)
assert.Len(t, result.Pages, 1)
_, err = os.Stat(filepath.Join(brainDir, "wiki", "sources", "foo.md"))
assert.True(t, os.IsNotExist(err))
}
func TestRun_MergesDuplicatePaths(t *testing.T) {
brainDir := t.TempDir()
for _, sub := range []string{"wiki/concepts", "wiki/entities", "wiki/sources"} {
require.NoError(t, os.MkdirAll(filepath.Join(brainDir, sub), 0o755))
}
// LLM returns same title twice (simulates multi-chunk duplicate)
llmResponse := mustJSON([]RawPage{
{Title: "Foo", Type: "concept", Content: "## Definition\n\nFirst.\n\n## Related Concepts\n\n[[Bar]]\n"},
{Title: "Foo", Type: "concept", Content: "## Definition\n\nSecond.\n\n## Related Concepts\n\n[[Baz]]\n"},
})
srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
_ = json.NewEncoder(w).Encode(map[string]any{
"choices": []map[string]any{{"message": map[string]any{"content": llmResponse}}},
})
}))
defer srv.Close()
cfg := Config{Complete: llm.New(srv.URL, "", "m", 30*time.Second).Complete}
result, err := Run(context.Background(), cfg, brainDir, "content", "foo", false)
require.NoError(t, err)
assert.Len(t, result.Pages, 1) // deduplicated
content, err := os.ReadFile(filepath.Join(brainDir, "wiki", "concepts", "foo.md"))
require.NoError(t, err)
// keep-first for Definition, union for Related Concepts
assert.Contains(t, string(content), "First.")
// Bar and Baz unknown → left as plain [[links]]
assert.Contains(t, string(content), "[[Bar]]")
assert.Contains(t, string(content), "[[Baz]]")
}
func mustJSON(v any) string {
b, err := json.Marshal(v)
if err != nil {
panic(err)
}
return string(b)
}
- Step 2: Run tests to verify they fail (expected — pipeline.go still uses old ParsePages)
cd ingestion && go test ./internal/pipeline/ -run TestRun -v
Expected: compile error or test failures — that's fine.
- Step 3: Update pipeline.go to wire new steps
Replace the entire content of ingestion/internal/pipeline/pipeline.go with:
// ingestion/internal/pipeline/pipeline.go
package pipeline
import (
"context"
"fmt"
"os"
"path/filepath"
"strings"
"time"
"github.com/mathiasbq/hyperguild/ingestion/internal/wiki"
)
// CompleteFunc is the function signature for LLM calls.
type CompleteFunc func(ctx context.Context, system, user string) (string, error)
// Config holds pipeline configuration.
type Config struct {
Complete CompleteFunc
ChunkSize int // 0 = no chunking
Schema string // overrides brain/schema.md when set (useful in tests)
}
// Result is the outcome of a pipeline run.
type Result struct {
Pages []string // relative paths written (or would-be written in dry-run)
Warnings []string
}
// Run ingests content and writes structured wiki pages to brainDir/wiki/.
// In dry-run mode, pages are returned but not written to disk.
func Run(ctx context.Context, cfg Config, brainDir, content, source string, dryRun bool) (Result, error) {
inventory, err := wiki.LoadInventory(brainDir)
if err != nil {
return Result{}, fmt.Errorf("load inventory: %w", err)
}
schema := cfg.Schema
if schema == "" {
schema = loadSchema(brainDir)
}
sourceSlug := wiki.Slug(source)
date := time.Now().UTC().Format("2006-01-02")
chunks := Chunk(content, cfg.ChunkSize)
var allRaw []RawPage
var allWarnings []string
for _, chunk := range chunks {
userPrompt := BuildPrompt(schema, source, chunk, inventory)
output, err := cfg.Complete(ctx, systemPrompt, userPrompt)
if err != nil {
return Result{}, fmt.Errorf("LLM call: %w", err)
}
raw, warnings := ParseRawPages(output)
allRaw = append(allRaw, raw...)
allWarnings = append(allWarnings, warnings...)
}
pages := BuildPages(allRaw, sourceSlug, date)
resolved := Resolve(pages, inventory)
canonicalized, linkWarnings := CanonicalizeLinks(resolved, inventory)
allWarnings = append(allWarnings, linkWarnings...)
withRefs := injectSourceRefs(canonicalized, inventory, brainDir)
merged := mergeAll(withRefs)
var written []string
for _, page := range merged {
if !dryRun {
dest := filepath.Join(brainDir, filepath.FromSlash(page.Path))
if err := os.MkdirAll(filepath.Dir(dest), 0o755); err != nil {
return Result{}, fmt.Errorf("mkdir for %s: %w", page.Path, err)
}
if err := os.WriteFile(dest, []byte(page.Content), 0o644); err != nil {
return Result{}, fmt.Errorf("write %s: %w", page.Path, err)
}
}
written = append(written, page.Path)
}
if !dryRun {
if err := wiki.RebuildIndex(brainDir, date); err != nil {
allWarnings = append(allWarnings, fmt.Sprintf("rebuild index: %v", err))
}
if err := wiki.AppendLog(brainDir, source, written, allWarnings, date); err != nil {
allWarnings = append(allWarnings, fmt.Sprintf("append log: %v", err))
}
}
return Result{Pages: written, Warnings: allWarnings}, nil
}
// mergeAll deduplicates pages by path, merging content from later occurrences.
func mergeAll(pages []wiki.Page) []wiki.Page {
order := make([]string, 0, len(pages))
byPath := make(map[string]wiki.Page, len(pages))
for _, p := range pages {
if _, seen := byPath[p.Path]; !seen {
order = append(order, p.Path)
byPath[p.Path] = p
} else {
byPath[p.Path] = wiki.Merge(byPath[p.Path], p)
}
}
result := make([]wiki.Page, 0, len(order))
for _, path := range order {
result = append(result, byPath[path])
}
return result
}
const defaultSchema = `# Brain Wiki Schema
Three page types: wiki/sources/, wiki/concepts/, wiki/entities/.
See brain/schema.md for the full schema.
`
func loadSchema(brainDir string) string {
b, err := os.ReadFile(filepath.Join(brainDir, "schema.md"))
if err != nil {
return defaultSchema
}
return strings.TrimSpace(string(b))
}
- Step 4: Run all pipeline tests
cd ingestion && go test ./internal/pipeline/ -v
Expected: all tests PASS. TestRun_WritesPages asserts assert.Len(t, result.Pages, 2) but does not assert that result.Warnings is empty. Even so, [[Testing]] should canonicalize cleanly, because the Testing concept is created in the same batch and buildTitleMap adds current-batch titles to the lookup. If an unknown wikilink warning for [[Testing]] does appear, investigate buildTitleMap first: it should find "Testing" in the current batch.
- Step 5: Run full test suite
cd ingestion && go test ./... 2>&1
Expected: all packages pass.
- Step 6: Commit
git add ingestion/internal/pipeline/pipeline.go ingestion/internal/pipeline/pipeline_test.go
git commit -m "feat(pipeline): wire ParseRawPages+BuildPages+CanonicalizeLinks into Run"
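Because Config.Complete is just a CompleteFunc, tests that do not need to exercise the HTTP client could inject a closure instead of starting an httptest server. The sketch below shows the idea in isolation. stubComplete and runOnce are hypothetical helpers, and CompleteFunc is copied from pipeline.go; whether the existing tests should switch is a judgment call, since the httptest version also covers the llm client's response decoding.

```go
package main

import (
	"context"
	"fmt"
)

// CompleteFunc mirrors the type declared in pipeline.go.
type CompleteFunc func(ctx context.Context, system, user string) (string, error)

// stubComplete (hypothetical) returns a CompleteFunc that ignores its prompts
// and always answers with the canned response.
func stubComplete(response string) CompleteFunc {
	return func(ctx context.Context, system, user string) (string, error) {
		return response, nil
	}
}

// runOnce (hypothetical) stands in for the single LLM call that Run makes per chunk.
func runOnce(complete CompleteFunc) (string, error) {
	return complete(context.Background(), "system prompt", "user prompt")
}

func main() {
	canned := `[{"title":"Foo","type":"concept","content":"## Definition\n\nFoo.\n"}]`
	out, err := runOnce(stubComplete(canned))
	fmt.Println(out == canned, err)
}
```

In a real test this would read Config{Complete: stubComplete(llmResponse)}, with the rest of the Run call unchanged.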
Task 5: Update prompt.go
Files:
-
Modify:
ingestion/internal/pipeline/prompt.go -
Step 1: Replace
prompt.go
Replace the entire content of ingestion/internal/pipeline/prompt.go with:
// ingestion/internal/pipeline/prompt.go
package pipeline
import (
"fmt"
"strings"
"time"
"github.com/mathiasbq/hyperguild/ingestion/internal/wiki"
)
const systemPrompt = `You are a wiki agent. Read the source material and produce structured wiki pages following the schema provided.
Output ONLY a valid JSON array — no markdown fences, no other text before or after.
Each element must have exactly these fields:
"title" — exact page title (e.g. "FinBERT", "Ryan Singer", "Shape Up")
"type" — exactly one of: "source", "concept", "entity"
"subtype" — for source: article|pdf|book|video|note|project; for entity: person|company|tool|model|framework|technology; omit for concept
"domain" — one of the domains in the schema (omit if none fits)
"content" — Markdown body only — NO frontmatter, NO path, NO slug
Wikilinks in content: [[Display Name]] — just the display name, no slug, no pipe separator.
Only link to pages listed in the inventory or pages you are creating in this response.`

// BuildPrompt constructs the user prompt for a single chunk.
func BuildPrompt(schema, source, content string, inventory map[wiki.PageType][]wiki.Entry) string {
	var sb strings.Builder

	// Include today's date (UTC) in the prompt.
	fmt.Fprintf(&sb, "Today's date is %s.\n\n", time.Now().UTC().Format("2006-01-02"))

	sb.WriteString("## Schema\n\n")
	sb.WriteString(schema)
	sb.WriteString("\n\n")

	// List existing pages by title only; slugs are never shown to the model.
	sb.WriteString("## Existing wiki pages\n\n")
	sb.WriteString("Reference these pages by display name only — [[Display Name]] — in your content.\n\n")
	for _, pt := range []wiki.PageType{wiki.PageTypeConcept, wiki.PageTypeEntity, wiki.PageTypeSource} {
		entries := inventory[pt]
		// Capitalize the first letter of the page type for the section label.
		label := strings.ToUpper(string(pt)[:1]) + string(pt)[1:]
		if len(entries) == 0 {
			fmt.Fprintf(&sb, "%s — (none yet)\n\n", label)
			continue
		}
		fmt.Fprintf(&sb, "%s:\n", label)
		for _, e := range entries {
			fmt.Fprintf(&sb, " - %s\n", e.Title)
		}
		sb.WriteString("\n")
	}

	// Repeat the output contract next to the source so it is not lost in long prompts.
	sb.WriteString("## Non-negotiable rules\n\n")
	sb.WriteString("1. Output ONLY a valid JSON array — no prose, no fences.\n")
	sb.WriteString("2. Fields: title, type, subtype (if applicable), domain (if applicable), content.\n")
	sb.WriteString("3. Wikilinks: [[Display Name]] — no slug, no pipe. The pipeline handles slugs.\n")
	sb.WriteString("4. Section links must match their section type (Related Concepts → concepts only, etc.).\n")
	sb.WriteString("5. One source page per book — if inventory shows it exists, return it as an UPDATE.\n\n")

	fmt.Fprintf(&sb, "## Source: %s\n\n", source)
	sb.WriteString(content)
	return sb.String()
}
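To sanity-check a model response against the five-field contract in systemPrompt, something like the following could be used. It is a sketch only: `rawPage` and `decodeRawPages` are hypothetical stand-ins, and the plan's own RawPage and ParseRawPages remain authoritative, including their choice of what counts as a warning.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// rawPage mirrors the five fields named in the system prompt. It is a
// stand-in for illustration, not the plan's RawPage type.
type rawPage struct {
	Title   string `json:"title"`
	Type    string `json:"type"`
	Subtype string `json:"subtype,omitempty"`
	Domain  string `json:"domain,omitempty"`
	Content string `json:"content"`
}

// decodeRawPages unmarshals the model's JSON array and drops elements whose
// title is empty or whose type is not one of the three allowed values,
// recording one warning per dropped element.
func decodeRawPages(out string) ([]rawPage, []string, error) {
	var all []rawPage
	if err := json.Unmarshal([]byte(out), &all); err != nil {
		return nil, nil, fmt.Errorf("decode pages: %w", err)
	}
	var pages []rawPage
	var warnings []string
	for i, p := range all {
		switch {
		case p.Title == "":
			warnings = append(warnings, fmt.Sprintf("element %d: empty title", i))
		case p.Type != "source" && p.Type != "concept" && p.Type != "entity":
			warnings = append(warnings, fmt.Sprintf("element %d: unknown type %q", i, p.Type))
		default:
			pages = append(pages, p)
		}
	}
	return pages, warnings, nil
}

func main() {
	out := `[{"title":"Shape Up","type":"source","subtype":"book","content":"## Summary"},
	         {"title":"Appetite","type":"idea","content":"..."}]`
	pages, warnings, err := decodeRawPages(out)
	fmt.Println(len(pages), len(warnings), err)
}
```

Note that the struct has no slug, path, or frontmatter fields, so any such keys in the response are silently ignored by encoding/json rather than trusted.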
- Step 2: Verify compile and tests still pass
cd ingestion && go test ./... 2>&1
Expected: all tests PASS (prompt.go has no direct unit tests — covered via integration tests in pipeline_test.go).
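If a direct check on the prompt layout is ever wanted, one low-cost option is to assert that the section headings appear in the expected order. The helper below is a hypothetical sketch (`headingsInOrder` is not part of the plan), shown against a hand-written sample rather than real BuildPrompt output:

```go
package main

import (
	"fmt"
	"strings"
)

// headingsInOrder reports whether every heading occurs in s, each one
// starting after the previous match ends.
func headingsInOrder(s string, headings ...string) bool {
	pos := 0
	for _, h := range headings {
		i := strings.Index(s[pos:], h)
		if i < 0 {
			return false
		}
		pos += i + len(h)
	}
	return true
}

func main() {
	// Hand-written sample in the shape BuildPrompt emits; a real test would
	// call BuildPrompt with a small schema, source name, and inventory.
	prompt := "## Schema\n\n...\n\n## Existing wiki pages\n\nConcept:\n - Testing\n\n" +
		"## Non-negotiable rules\n\n...\n\n## Source: notes.md\n\nbody"

	fmt.Println(headingsInOrder(prompt,
		"## Schema", "## Existing wiki pages", "## Non-negotiable rules", "## Source: notes.md"))
}
```

Checking order rather than exact text keeps such a test stable when the wording of the rules changes.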
- Step 3: Commit
git add ingestion/internal/pipeline/prompt.go
git commit -m "feat(pipeline): update system prompt for new LLM JSON contract (no slugs)"
Task 6: Update brain/schema.md
Files:
- Modify: brain/schema.md

- Step 1: Update the schema doc
Replace the entire content of brain/schema.md with:
# Brain Wiki Schema
This document defines the three page types in the brain wiki.
The LLM must follow this schema exactly when generating wiki pages.
## Output Format
Return a JSON array. Each element:
```json
{
  "title": "exact page title",
  "type": "source | concept | entity",
  "subtype": "see below — omit for concept",
  "domain": "see domains — omit if none fits",
  "content": "Markdown body only — no frontmatter, no path"
}
```
- `subtype` for source: `article | pdf | book | video | note | project`
- `subtype` for entity: `person | company | tool | model | framework | technology`
- The pipeline computes slugs and frontmatter — never include them in output.
## Wikilink Format
All cross-references use [[Display Name]] — just the display name, no slug, no pipe.
Rules:
- Only link to pages in the inventory or pages you are creating in this response
- The pipeline converts `[[Display Name]]` to `[[slug|Display Name]]` automatically
- Section links must match their section type (Related Concepts → concept pages only, etc.)

Examples: [[Domain Driven Design]], [[Ryan Singer]], [[Shape Up]]
## Domains
Use one of: ai-llm, software-engineering, product-strategy, finance-markets,
personal, consulting, climate, infrastructure, security
## Source Pages — wiki/sources/.md
One page per ingested source. Books are NEVER split across multiple source pages — update the existing one.
Body sections (in this order):
### Summary
2–3 sentences. Core argument or finding.
### Key Claims
Bulleted list. Paraphrase — no verbatim quotes or code.
### Concepts Introduced or Reinforced
Wikilinks to concept pages ONLY. One per line.
### Entities Mentioned
Wikilinks to entity pages ONLY. One per line.
### Open Questions Raised
Gaps or follow-up questions from this source.

For books only, also add:
### Chapters
One bullet per chapter with 1–2 sentence summary.
### Argument Arc
Overall narrative as it becomes clear across chapters.
### Updates
Dated entries appended on re-ingestion. NEVER rewrite — only append.
## Concept Pages — wiki/concepts/.md
One page per idea, framework, methodology, or pattern.
Body sections (in this order):
### Definition
One-paragraph plain-language explanation.
### Why It Matters
Practical significance. Why should anyone care?
### Related Concepts
Wikilinks to concept pages ONLY.
### Related Entities
Wikilinks to entity pages ONLY.
### Sources
Wikilinks to source pages ONLY.
### Evolving Notes
Updated as new sources arrive. Append, do not rewrite.
## Entity Pages — wiki/entities/.md
One page per person, tool, organisation, technology, or product.
Body sections (in this order):
### Description
One-line description.
### Relevance
Why this entity matters to this knowledge base.
### Key Positions, Products, or Claims
With dates where known.
### Related Concepts
Wikilinks to concept pages ONLY.
### Related Entities
Wikilinks to entity pages ONLY.
### Sources
Wikilinks to source pages ONLY.
## Non-Negotiable Rules
- Output ONLY a valid JSON array — no markdown fences, no prose before or after
- Each element: `{"title": "...", "type": "...", "subtype": "...", "domain": "...", "content": "..."}`
- Never include slugs, paths, or frontmatter in output — the pipeline handles these
- Wikilinks: `[[Display Name]]` only — no pipe, no slug
- Dates always YYYY-MM-DD (used only in content body where contextually relevant)
- Never reproduce verbatim code — describe the pattern or technique
- Section links must match their section type
- One source page per book — if inventory shows it exists, include it as an UPDATE
- Step 2: Verify Go tests still pass (schema.md is loaded at runtime, not compile time)
cd ingestion && go test ./... 2>&1
Expected: all tests PASS.
- Step 3: Commit
git add brain/schema.md
git commit -m "docs(schema): update LLM output format and wikilink convention for Level 3"
Self-Review
Spec coverage check:
| Spec requirement | Task |
|---|---|
| LLM returns `{title, type, subtype, domain, content}` | Task 1 (RawPage), Task 5 (prompt) |
| Pipeline computes all slugs via `wiki.Slug(title)` | Task 2 (BuildPages) |
| Source page uses `sourceSlug` from filename | Task 2 (buildPage: `case "source": slug = sourceSlug`) |
| Frontmatter assembled by pipeline | Task 2 (buildFrontmatter) |
| `[[Display Name]]` → `[[slug\|Display Name]]` canonicalization | Task 3 (CanonicalizeLinks) |
| CanonicalizeLinks runs after BuildPages, before injectSourceRefs | Task 4 (pipeline.go step order) |
| Unknown titles left as-is + warnings | Task 3 (links_test: UnknownTitleLeftAsIs) |
| Current-batch pages resolvable | Task 3 (buildTitleMap includes batch) |
| `injectSourceRefs` unchanged | (no task needed) |
| `Resolve` unchanged | (no task needed) |
| `brain/schema.md` updated | Task 6 |
| Integration test for no slug duplication across chunks | Task 4 (TestRun_MergesDuplicatePaths uses same title twice) |
All spec requirements covered.
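The link-handling rows in the table (canonicalization, unknown titles left as-is with a warning) can be illustrated with a small regexp-based sketch. This is not the plan's CanonicalizeLinks: the function name, the exact-match title lookup, and the warning text are assumptions made for the example.

```go
package main

import (
	"fmt"
	"regexp"
)

// wikilink matches [[Display Name]] and captures the name. Links that already
// contain a pipe, such as [[slug|Name]], do not match and are left untouched.
var wikilink = regexp.MustCompile(`\[\[([^\[\]|]+)\]\]`)

// canonicalize rewrites [[Display Name]] to [[slug|Display Name]] for every
// title present in slugs. Unknown titles are left as-is and reported.
func canonicalize(content string, slugs map[string]string) (string, []string) {
	var warnings []string
	out := wikilink.ReplaceAllStringFunc(content, func(m string) string {
		name := m[2 : len(m)-2] // strip the surrounding [[ and ]]
		slug, ok := slugs[name]
		if !ok {
			warnings = append(warnings, "unknown wikilink: "+m)
			return m
		}
		return "[[" + slug + "|" + name + "]]"
	})
	return out, warnings
}

func main() {
	slugs := map[string]string{"Shape Up": "shape-up", "Testing": "testing"}
	out, warnings := canonicalize("See [[Shape Up]] and [[Ghost Page]].", slugs)
	fmt.Println(out)
	fmt.Println(warnings)
}
```

Because already-piped links never match the pattern, running the function twice on the same content produces the same result as running it once.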
Placeholder scan: None found.
Type consistency:
- `RawPage` defined in Task 1, used in Tasks 2, 4 ✓
- `BuildPages([]RawPage, string, string) []wiki.Page` defined in Task 2, called in Task 4 ✓
- `CanonicalizeLinks([]wiki.Page, map[wiki.PageType][]wiki.Entry) ([]wiki.Page, []string)` defined in Task 3, called in Task 4 ✓
- `ParseRawPages(string) ([]RawPage, []string)` defined in Task 1, called in Task 4 ✓
- `extractTitle` used in Task 3 (links.go), defined in `resolve.go` (same package) ✓
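As a quick check that these signatures compose in the order Run wires them, the stubs below copy the three signatures with local stand-ins for the wiki package types. The function bodies are placeholders, and the names given to the two string parameters of BuildPages (`sourceSlug`, `date`) are guesses, since this section does not state what they mean.

```go
package main

import "fmt"

// Stand-ins for RawPage and the wiki package types; only the shapes matter.
type RawPage struct{ Title, Type, Content string }
type PageType string
type Entry struct{ Title string }
type Page struct{ Path, Content string }

// ParseRawPages stub: returns one fixed page and no warnings.
func ParseRawPages(out string) ([]RawPage, []string) {
	return []RawPage{{Title: "Testing", Type: "concept", Content: out}}, nil
}

// BuildPages stub: the path layout here is a placeholder, not the real one.
func BuildPages(raw []RawPage, sourceSlug, date string) []Page {
	pages := make([]Page, 0, len(raw))
	for _, r := range raw {
		pages = append(pages, Page{Path: r.Type + "/" + r.Title, Content: r.Content})
	}
	return pages
}

// CanonicalizeLinks stub: passes pages through unchanged.
func CanonicalizeLinks(pages []Page, inv map[PageType][]Entry) ([]Page, []string) {
	return pages, nil
}

// run chains the stages in the order ParseRawPages, BuildPages,
// CanonicalizeLinks and collects warnings from the stages that report them.
func run(llmOutput string, inv map[PageType][]Entry) ([]Page, []string) {
	raw, warnings := ParseRawPages(llmOutput)
	pages := BuildPages(raw, "source-slug", "2024-01-01")
	pages, linkWarnings := CanonicalizeLinks(pages, inv)
	return pages, append(warnings, linkWarnings...)
}

func main() {
	pages, warnings := run("body", nil)
	fmt.Println(len(pages), len(warnings))
}
```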