10 Commits

Author SHA1 Message Date
Mathias Bergqvist
923a665365 fix(pipeline): skip RawPages with empty title in BuildPages instead of producing broken paths
All checks were successful
CI / Lint / Test / Vet (push) Successful in 9s
CI / Mirror to GitHub (push) Has been skipped
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-23 19:55:37 +02:00
Mathias Bergqvist
537aebc302 feat(pipeline): update system prompt for new LLM JSON contract (no slugs)
- Change prompt to reflect new output format: title, type, subtype, domain, content
- Remove slug/path generation responsibility from LLM — pipeline now handles it
- Wikilinks change from [[slug|Display Name]] to [[Display Name]] only
- LLM no longer includes frontmatter or paths in output

docs(schema): update LLM output format and wikilink convention for Level 3

- Specify JSON schema: title, type, subtype, domain, content fields
- Remove frontmatter requirements from schema output (handled by pipeline)
- Simplify wikilink format to [[Display Name]] — no slug or pipe
- Pipeline now responsible for slug generation and frontmatter construction

These changes shift slug/frontmatter generation from LLM to pipeline,
reducing cognitive load on the model and improving control over output.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-23 19:45:21 +02:00
Mathias Bergqvist
de35d4dbb0 feat(pipeline): wire ParseRawPages+BuildPages+CanonicalizeLinks into Run
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-23 19:07:33 +02:00
Mathias Bergqvist
26855f69b0 feat(pipeline): add CanonicalizeLinks — convert [[Display Name]] to [[slug|Display Name]] 2026-04-23 18:59:10 +02:00
Mathias Bergqvist
a7b363d589 fix(pipeline): quote YAML scalar fields in buildFrontmatter to prevent injection 2026-04-23 18:56:39 +02:00
Mathias Bergqvist
7b57051af8 feat(pipeline): add BuildPages — compute slugs/paths/frontmatter from RawPage 2026-04-23 18:50:37 +02:00
Mathias Bergqvist
a620f6cb01 fix(pipeline): guard empty-title bridge + skip stale integration tests until task4
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-23 18:46:07 +02:00
Mathias Bergqvist
26b5636b43 feat(pipeline): replace ParsePages with ParseRawPages + RawPage type
Strips slug authority from the LLM. The new RawPage type carries only
{title, type, subtype, domain, content} — no paths or frontmatter.
Pipeline will derive slugs deterministically (Task 4).

pipeline.go gets a temporary bridge stub (TODO task4) to keep the
package compiling between tasks.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-23 18:41:33 +02:00
Mathias Bergqvist
989f375aec docs: add Level 3 implementation plan 2026-04-23 17:37:45 +02:00
Mathias Bergqvist
6403d5e444 docs: add Level 3 slug authority design spec 2026-04-23 17:23:22 +02:00
14 changed files with 2092 additions and 130 deletions

View File

@@ -3,21 +3,34 @@
This document defines the three page types in the brain wiki. This document defines the three page types in the brain wiki.
The LLM must follow this schema exactly when generating wiki pages. The LLM must follow this schema exactly when generating wiki pages.
## Output Format
Return a JSON array. Each element:
```json
{
"title": "exact page title",
"type": "source | concept | entity",
"subtype": "see below — omit for concept",
"domain": "see domains — omit if none fits",
"content": "Markdown body only — no frontmatter, no path"
}
```
- `subtype` for **source**: `article | pdf | book | video | note | project`
- `subtype` for **entity**: `person | company | tool | model | framework | technology`
- The pipeline computes slugs and frontmatter — never include them in output.
## Wikilink Format ## Wikilink Format
All cross-references use `[[slug|Display Text]]`. All cross-references use `[[Display Name]]` — just the display name, no slug, no pipe.
Rules: Rules:
- slug = lowercase filename without .md, spaces → hyphens, strip all non-alphanumeric except hyphens - Only link to pages in the inventory or pages you are creating in this response
- The `|` separator is REQUIRED — never use `[[Title]]` without a slug - The pipeline converts `[[Display Name]]` to `[[slug|Display Name]]` automatically
- Examples: `[[domain-driven-design|Domain Driven Design]]`, `[[ryan-singer|Ryan Singer]]` - Section links must match their section type (Related Concepts → concept pages only, etc.)
- Slugs must resolve to an existing file in the inventory, or a file you are creating in this response
Slug generation examples: Examples: `[[Domain Driven Design]]`, `[[Ryan Singer]]`, `[[Shape Up]]`
- "Domain Driven Design" → `domain-driven-design`
- "It's Complicated" → `its-complicated`
- "gRPC" → `grpc`
- "GPT-4o" → `gpt-4o`
## Domains ## Domains
@@ -30,17 +43,6 @@ Use one of: `ai-llm`, `software-engineering`, `product-strategy`, `finance-marke
One page per ingested source. Books are NEVER split across multiple source pages — update the existing one. One page per ingested source. Books are NEVER split across multiple source pages — update the existing one.
Required frontmatter:
```yaml
title: <exact title>
type: article | pdf | book | video | note | project
domain: <domain>
date_ingested: YYYY-MM-DD
last_updated: YYYY-MM-DD
aliases:
- <exact title>
```
Body sections (in this order): Body sections (in this order):
### Summary ### Summary
@@ -50,10 +52,10 @@ Body sections (in this order):
Bulleted list. Paraphrase — no verbatim quotes or code. Bulleted list. Paraphrase — no verbatim quotes or code.
### Concepts Introduced or Reinforced ### Concepts Introduced or Reinforced
Wikilinks to wiki/concepts/ ONLY. One per line. Wikilinks to concept pages ONLY. One per line.
### Entities Mentioned ### Entities Mentioned
Wikilinks to wiki/entities/ ONLY. One per line. Wikilinks to entity pages ONLY. One per line.
### Open Questions Raised ### Open Questions Raised
Gaps or follow-up questions from this source. Gaps or follow-up questions from this source.
@@ -75,15 +77,6 @@ Dated entries appended on re-ingestion. NEVER rewrite — only append.
One page per idea, framework, methodology, or pattern. One page per idea, framework, methodology, or pattern.
Required frontmatter:
```yaml
title: <concept name>
domain: <domain>
last_updated: YYYY-MM-DD
aliases:
- <exact title>
```
Body sections (in this order): Body sections (in this order):
### Definition ### Definition
@@ -93,13 +86,13 @@ One-paragraph plain-language explanation.
Practical significance. Why should anyone care? Practical significance. Why should anyone care?
### Related Concepts ### Related Concepts
Wikilinks to wiki/concepts/ ONLY. Wikilinks to concept pages ONLY.
### Related Entities ### Related Entities
Wikilinks to wiki/entities/ ONLY. Wikilinks to entity pages ONLY.
### Sources ### Sources
Wikilinks to wiki/sources/ ONLY. Wikilinks to source pages ONLY.
### Evolving Notes ### Evolving Notes
Updated as new sources arrive. Append, do not rewrite. Updated as new sources arrive. Append, do not rewrite.
@@ -110,16 +103,6 @@ Updated as new sources arrive. Append, do not rewrite.
One page per person, tool, organisation, technology, or product. One page per person, tool, organisation, technology, or product.
Required frontmatter:
```yaml
title: <name>
type: person | company | tool | model | framework | technology
domain: <domain>
last_updated: YYYY-MM-DD
aliases:
- <exact title>
```
Body sections (in this order): Body sections (in this order):
### Description ### Description
@@ -132,23 +115,23 @@ Why this entity matters to this knowledge base.
With dates where known. With dates where known.
### Related Concepts ### Related Concepts
Wikilinks to wiki/concepts/ ONLY. Wikilinks to concept pages ONLY.
### Related Entities ### Related Entities
Wikilinks to wiki/entities/ ONLY. Wikilinks to entity pages ONLY.
### Sources ### Sources
Wikilinks to wiki/sources/ ONLY. Wikilinks to source pages ONLY.
--- ---
## Non-Negotiable Rules ## Non-Negotiable Rules
1. Output ONLY a valid JSON array — no markdown fences, no prose before or after 1. Output ONLY a valid JSON array — no markdown fences, no prose before or after
2. Each element: `{"path": "wiki/<type>/<slug>.md", "content": "...full markdown..."}` 2. Each element: `{"title": "...", "type": "...", "subtype": "...", "domain": "...", "content": "..."}`
3. Slugs are kebab-case: lowercase, spaces→hyphens, strip special characters 3. Never include slugs, paths, or frontmatter in output — the pipeline handles these
4. Every wikilink must be `[[slug|Display Text]]` — the pipe separator is required 4. Wikilinks: `[[Display Name]]` only — no pipe, no slug
5. Dates always YYYY-MM-DD 5. Dates always YYYY-MM-DD (used only in content body where contextually relevant)
6. Never reproduce verbatim code — describe the pattern or technique 6. Never reproduce verbatim code — describe the pattern or technique
7. Section links must match their section type (Related Concepts → concepts/ only, etc.) 7. Section links must match their section type
8. One source page per book — if inventory shows it exists, include it as an UPDATE 8. One source page per book — if inventory shows it exists, include it as an UPDATE

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,148 @@
# Level 3: Strip Slug Authority from LLM — Design Spec
## Problem
The ingestion pipeline currently asks the LLM to produce full wiki pages including the file path (e.g. `wiki/sources/finbert-huggingface.md`). This causes two classes of bug:
1. **Slug proliferation** — the LLM invents different slugs for the same concept across chunks or runs, producing duplicate pages that diverge in content.
2. **Unstable paths** — the LLM may shorten, expand, or vary titles, making deduplication via `Resolve` unreliable because the slug mismatch is upstream of the normalizer.
## Solution
Strip slug authority from the LLM entirely. The LLM returns a minimal structured object. The pipeline computes all slugs deterministically from titles using `wiki.Slug(title)`.
---
## LLM JSON Contract
### Output format (per page)
```json
{
"title": "FinBERT",
"type": "concept",
"subtype": "framework",
"domain": "ai-llm",
"content": "## Definition\n\nA BERT-based model fine-tuned for financial sentiment...\n\n## Related\n\n- [[Sentiment Analysis]]\n- [[Hugging Face]]\n"
}
```
**Fields:**
| Field | Required | Values |
|-------|----------|--------|
| `title` | yes | Human-readable title, e.g. "FinBERT" |
| `type` | yes | `"source"` \| `"concept"` \| `"entity"` |
| `subtype` | for entity/source | entity: `person\|company\|tool\|model\|framework\|technology`; source: `article\|pdf\|book\|video\|note\|project` |
| `domain` | no | tag string, e.g. `ai-llm`, `finance` |
| `content` | yes | Markdown body sections only — no frontmatter, no path |
**Wikilinks in content:** `[[Display Name]]` only. No slug. The pipeline canonicalizes to `[[slug|Display Name]]` in a post-processing step.
**The LLM never writes slugs, paths, or frontmatter.**
---
## Pipeline Changes
### New type: `RawPage`
```go
type RawPage struct {
Title string
Type string // "source" | "concept" | "entity"
Subtype string
Domain string
Content string
}
```
### New step order
```
ParseRawPages → BuildPages → Resolve → CanonicalizeLinks → injectSourceRefs → mergeAll → write
```
### Step descriptions
**`ParseRawPages(output string) ([]RawPage, []string)`**
Replaces `ParsePages`. Deserializes JSON objects with the new schema. Same truncation-recovery logic as today. Returns `(pages, warnings)`.
**`BuildPages(rawPages []RawPage, sourceSlug, date string) []wiki.Page`**
Converts `RawPage → wiki.Page`:
- Computes slug: `wiki.Slug(page.Title)`
- Computes path: `wiki/<type>/<slug>.md`
- Assembles frontmatter:
```
---
title: <Title>
type: <type>
subtype: <subtype> # omitted if empty
domain: <domain> # omitted if empty
created: <date>
source: <sourceSlug> # omitted for the source page itself
---
```
- Concatenates frontmatter + content
**`Resolve(pages []wiki.Page, inventory) []wiki.Page`**
Unchanged. Normalizes near-duplicate titles to existing inventory slugs.
**`CanonicalizeLinks(pages []wiki.Page, inventory) ([]wiki.Page, []string)`**
New. Builds a title→slug map from inventory + current batch. Replaces `[[Display Name]]` with `[[slug|Display Name]]` in each page's content. Titles with no known slug are left as-is and returned as warnings.
**`injectSourceRefs`**
Unchanged. Reads `[[slug|...]]` links (post-canonicalization) to inject back-references.
**`mergeAll → write`**
Unchanged.
### `pipeline.Run` signature change
```go
func Run(ctx context.Context, cfg Config, brainDir, content, source string, dryRun bool) (Result, error)
```
`source` is already passed (it's the display name / filename). A new internal `sourceSlug` is derived from it via `wiki.Slug(source)` before calling `BuildPages`. No API change needed.
---
## Files Changed
| File | Change |
|------|--------|
| `ingestion/internal/pipeline/parse.go` | Replace `ParsePages` with `ParseRawPages` + `RawPage` type |
| `ingestion/internal/pipeline/build.go` | New file: `BuildPages` |
| `ingestion/internal/pipeline/links.go` | New file: `CanonicalizeLinks` |
| `ingestion/internal/pipeline/pipeline.go` | Wire new steps; derive `sourceSlug` from `source` |
| `ingestion/internal/pipeline/prompt.go` | New system prompt + `BuildPrompt` for new JSON format |
| `brain/schema.md` | Update wikilink format and JSON schema docs |
`resolve.go`, `refs.go`, `backfill.go`, `merge.go` — no changes.
---
## Wikilink Format
- **LLM output**: `[[Display Name]]`
- **Stored on disk**: `[[slug|Display Name]]`
- **`CanonicalizeLinks`** converts between the two using the inventory
This matches Obsidian's display-alias syntax that the existing codebase already uses.
---
## Testing Strategy
- `ParseRawPages`: table-driven, cover valid JSON, truncated output, unknown type, missing title
- `BuildPages`: table-driven, cover slug computation, frontmatter assembly, source page (no `source:` field), entity with subtype
- `CanonicalizeLinks`: cover known title → replaced, unknown title → left as-is + warning, multiple links in one page
- Integration test: full `Run` call with mock LLM returning new JSON format, assert no slug duplication across two chunks of the same source
---
## Out of Scope
- Re-ingesting existing pages (user will trigger manually after deploy)
- Changing the `BackfillRefs` endpoint (already correct, slug-based)
- Changing the `Resolve` fuzzy-match algorithm

View File

@@ -20,9 +20,9 @@ import (
"github.com/mathiasbq/hyperguild/ingestion/internal/pipeline" "github.com/mathiasbq/hyperguild/ingestion/internal/pipeline"
) )
// stubComplete returns a fixed JSON page so tests never call a real LLM. // stubComplete returns a fixed JSON RawPage so tests never call a real LLM.
func stubComplete(_ context.Context, _, _ string) (string, error) { func stubComplete(_ context.Context, _, _ string) (string, error) {
return `[{"path":"wiki/sources/test-source.md","content":"# Test Source\n\nSome content here.\n"}]`, nil return `[{"title":"Test Source","type":"source","subtype":"article","content":"## Summary\n\nSome content here.\n"}]`, nil
} }
func stubPipelineCfg() pipeline.Config { func stubPipelineCfg() pipeline.Config {

View File

@@ -0,0 +1,106 @@
// ingestion/internal/pipeline/build.go
package pipeline
import (
"fmt"
"strings"
"github.com/mathiasbq/hyperguild/ingestion/internal/wiki"
)
// BuildPages converts RawPages from the LLM into wiki.Pages with computed slugs,
// paths, and YAML frontmatter. sourceSlug is the slug of the source being ingested
// (derived from the filename, not the LLM title). Pages whose title resolves to an
// empty slug are skipped and returned as warnings instead.
func BuildPages(rawPages []RawPage, sourceSlug, date string) ([]wiki.Page, []string) {
out := make([]wiki.Page, 0, len(rawPages))
var warnings []string
for _, rp := range rawPages {
slug := computeSlug(rp, sourceSlug)
if slug == "" {
warnings = append(warnings, fmt.Sprintf("skipped page with empty title (type: %s)", rp.Type))
continue
}
out = append(out, buildPage(rp, sourceSlug, date))
}
return out, warnings
}
func computeSlug(rp RawPage, sourceSlug string) string {
if rp.Type == "source" {
return sourceSlug
}
return wiki.Slug(rp.Title)
}
func buildPage(rp RawPage, sourceSlug, date string) wiki.Page {
var slug, dir string
switch rp.Type {
case "source":
slug = sourceSlug
dir = "wiki/sources"
case "concept":
slug = wiki.Slug(rp.Title)
dir = "wiki/concepts"
case "entity":
slug = wiki.Slug(rp.Title)
dir = "wiki/entities"
default:
slug = wiki.Slug(rp.Title)
dir = "wiki/" + rp.Type
}
path := dir + "/" + slug + ".md"
fm := buildFrontmatter(rp, date)
return wiki.Page{
Path: path,
Content: fm + "\n" + rp.Content,
}
}
func buildFrontmatter(rp RawPage, date string) string {
var sb strings.Builder
sb.WriteString("---\n")
fmt.Fprintf(&sb, "title: %s\n", yamlScalar(rp.Title))
switch rp.Type {
case "source":
subtype := rp.Subtype
if subtype == "" {
subtype = "article"
}
fmt.Fprintf(&sb, "type: %s\n", yamlScalar(subtype))
if rp.Domain != "" {
fmt.Fprintf(&sb, "domain: %s\n", yamlScalar(rp.Domain))
}
fmt.Fprintf(&sb, "date_ingested: %s\n", date)
fmt.Fprintf(&sb, "last_updated: %s\n", date)
case "concept":
if rp.Domain != "" {
fmt.Fprintf(&sb, "domain: %s\n", yamlScalar(rp.Domain))
}
fmt.Fprintf(&sb, "last_updated: %s\n", date)
case "entity":
if rp.Subtype != "" {
fmt.Fprintf(&sb, "type: %s\n", yamlScalar(rp.Subtype))
}
if rp.Domain != "" {
fmt.Fprintf(&sb, "domain: %s\n", yamlScalar(rp.Domain))
}
fmt.Fprintf(&sb, "last_updated: %s\n", date)
default:
if rp.Domain != "" {
fmt.Fprintf(&sb, "domain: %s\n", yamlScalar(rp.Domain))
}
fmt.Fprintf(&sb, "last_updated: %s\n", date)
}
fmt.Fprintf(&sb, "aliases:\n - %s\n", yamlScalar(rp.Title))
sb.WriteString("---\n")
return sb.String()
}
func yamlScalar(s string) string {
return "'" + strings.ReplaceAll(s, "'", "''") + "'"
}

View File

@@ -0,0 +1,167 @@
// ingestion/internal/pipeline/build_test.go
package pipeline
import (
"strings"
"testing"
"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
)
func TestBuildPages_SourcePage(t *testing.T) {
raw := []RawPage{
{
Title: "Shape Up",
Type: "source",
Subtype: "book",
Domain: "product-strategy",
Content: "## Summary\n\nA book about shaping product work.\n",
},
}
pages, warnings := BuildPages(raw, "shape-up", "2026-04-23")
require.Len(t, pages, 1)
assert.Empty(t, warnings)
p := pages[0]
assert.Equal(t, "wiki/sources/shape-up.md", p.Path)
assert.Contains(t, p.Content, "title: 'Shape Up'")
assert.Contains(t, p.Content, "type: 'book'")
assert.Contains(t, p.Content, "domain: 'product-strategy'")
assert.Contains(t, p.Content, "date_ingested: 2026-04-23")
assert.Contains(t, p.Content, "last_updated: 2026-04-23")
assert.Contains(t, p.Content, "aliases:\n - 'Shape Up'")
assert.Contains(t, p.Content, "## Summary")
assert.True(t, strings.HasPrefix(p.Content, "---\n"), "content must start with frontmatter")
}
func TestBuildPages_ConceptPage(t *testing.T) {
raw := []RawPage{
{
Title: "Betting",
Type: "concept",
Domain: "product-strategy",
Content: "## Definition\n\nA resource allocation technique.\n",
},
}
pages, warnings := BuildPages(raw, "shape-up", "2026-04-23")
require.Len(t, pages, 1)
assert.Empty(t, warnings)
p := pages[0]
assert.Equal(t, "wiki/concepts/betting.md", p.Path)
assert.Contains(t, p.Content, "title: 'Betting'")
assert.Contains(t, p.Content, "domain: 'product-strategy'")
assert.Contains(t, p.Content, "last_updated: 2026-04-23")
assert.Contains(t, p.Content, "aliases:\n - 'Betting'")
assert.NotContains(t, p.Content, "date_ingested")
assert.Contains(t, p.Content, "## Definition")
}
func TestBuildPages_EntityPage(t *testing.T) {
raw := []RawPage{
{
Title: "Ryan Singer",
Type: "entity",
Subtype: "person",
Domain: "product-strategy",
Content: "## Description\n\nA product designer.\n",
},
}
pages, warnings := BuildPages(raw, "shape-up", "2026-04-23")
require.Len(t, pages, 1)
assert.Empty(t, warnings)
p := pages[0]
assert.Equal(t, "wiki/entities/ryan-singer.md", p.Path)
assert.Contains(t, p.Content, "title: 'Ryan Singer'")
assert.Contains(t, p.Content, "type: 'person'")
assert.Contains(t, p.Content, "domain: 'product-strategy'")
assert.Contains(t, p.Content, "last_updated: 2026-04-23")
assert.Contains(t, p.Content, "aliases:\n - 'Ryan Singer'")
assert.NotContains(t, p.Content, "date_ingested")
}
func TestBuildPages_SourceSlugUsedForSourcePage(t *testing.T) {
// LLM title differs from filename — pipeline uses sourceSlug for the source page path.
raw := []RawPage{
{Title: "FinBERT: A Pretrained Model", Type: "source", Subtype: "article", Content: "## Summary\n\nA model.\n"},
}
pages, _ := BuildPages(raw, "finbert-huggingface", "2026-04-23")
require.Len(t, pages, 1)
assert.Equal(t, "wiki/sources/finbert-huggingface.md", pages[0].Path)
}
func TestBuildPages_ConceptSlugDerivedFromTitle(t *testing.T) {
raw := []RawPage{
{Title: "Domain-Driven Design", Type: "concept", Content: "## Definition\n\nFoo.\n"},
}
pages, _ := BuildPages(raw, "some-source", "2026-04-23")
require.Len(t, pages, 1)
assert.Equal(t, "wiki/concepts/domain-driven-design.md", pages[0].Path)
}
func TestBuildPages_SourceDefaultSubtype(t *testing.T) {
// If subtype is omitted for a source, default to "article"
raw := []RawPage{
{Title: "Some Post", Type: "source", Content: "## Summary\n\nA post.\n"},
}
pages, _ := BuildPages(raw, "some-post", "2026-04-23")
require.Len(t, pages, 1)
assert.Contains(t, pages[0].Content, "type: 'article'")
}
func TestBuildPages_OmitsDomainWhenEmpty(t *testing.T) {
raw := []RawPage{
{Title: "Betting", Type: "concept", Content: "## Definition\n\nFoo.\n"},
}
pages, _ := BuildPages(raw, "src", "2026-04-23")
require.Len(t, pages, 1)
assert.NotContains(t, pages[0].Content, "domain:")
}
func TestBuildPages_MultiplePages(t *testing.T) {
raw := []RawPage{
{Title: "Shape Up", Type: "source", Subtype: "book", Content: "## Summary\n\nA book.\n"},
{Title: "Betting", Type: "concept", Content: "## Definition\n\nA technique.\n"},
{Title: "Ryan Singer", Type: "entity", Subtype: "person", Content: "## Description\n\nA designer.\n"},
}
pages, _ := BuildPages(raw, "shape-up", "2026-04-23")
require.Len(t, pages, 3)
assert.Equal(t, "wiki/sources/shape-up.md", pages[0].Path)
assert.Equal(t, "wiki/concepts/betting.md", pages[1].Path)
assert.Equal(t, "wiki/entities/ryan-singer.md", pages[2].Path)
}
func TestBuildPages_TitleWithColon(t *testing.T) {
raw := []RawPage{
{Title: "Shape Up: The Basecamp Method", Type: "source", Subtype: "book", Content: "## Summary\n\nA book.\n"},
}
pages, _ := BuildPages(raw, "shape-up", "2026-04-23")
require.Len(t, pages, 1)
// Title with colon must be quoted in YAML
assert.Contains(t, pages[0].Content, "title: 'Shape Up: The Basecamp Method'")
assert.Contains(t, pages[0].Content, "aliases:\n - 'Shape Up: The Basecamp Method'")
}
func TestBuildPages_EntityNoSubtype(t *testing.T) {
raw := []RawPage{
{Title: "Basecamp", Type: "entity", Content: "## Description\n\nA company.\n"},
}
pages, _ := BuildPages(raw, "src", "2026-04-23")
require.Len(t, pages, 1)
assert.NotContains(t, pages[0].Content, "type:")
assert.Contains(t, pages[0].Content, "title: 'Basecamp'")
}
func TestBuildPages_EmptyTitleSkippedWithWarning(t *testing.T) {
raw := []RawPage{
{Title: "", Type: "concept", Content: "## Definition\n\nFoo.\n"},
{Title: "Betting", Type: "concept", Content: "## Definition\n\nA technique.\n"},
}
pages, warnings := BuildPages(raw, "src", "2026-04-23")
require.Len(t, pages, 1, "empty-title page should be skipped")
assert.Equal(t, "wiki/concepts/betting.md", pages[0].Path)
assert.Len(t, warnings, 1)
assert.Contains(t, warnings[0], "empty title")
}

View File

@@ -0,0 +1,70 @@
// ingestion/internal/pipeline/links.go
package pipeline
import (
"fmt"
"path/filepath"
"regexp"
"strings"
"github.com/mathiasbq/hyperguild/ingestion/internal/wiki"
)
// plainLinkRE matches [[Display Name]] — wikilinks without a slug pipe.
// It does NOT match [[slug|Display]] (those already have a pipe).
var plainLinkRE = regexp.MustCompile(`\[\[([^\]|]+)\]\]`)
// CanonicalizeLinks converts [[Display Name]] wikilinks to [[slug|Display Name]]
// using a title→slug map built from the inventory and current batch.
// Unknown titles are left as-is and returned as warnings.
func CanonicalizeLinks(pages []wiki.Page, inventory map[wiki.PageType][]wiki.Entry) ([]wiki.Page, []string) {
titleToSlug := buildTitleMap(pages, inventory)
var allWarnings []string
out := make([]wiki.Page, len(pages))
for i, p := range pages {
newContent, warnings := canonicalizeContent(p.Content, titleToSlug)
p.Content = newContent
out[i] = p
allWarnings = append(allWarnings, warnings...)
}
return out, allWarnings
}
// buildTitleMap builds a lowercase-title → slug map from inventory and current batch.
// Current batch entries take precedence over inventory (they may be updates).
func buildTitleMap(pages []wiki.Page, inventory map[wiki.PageType][]wiki.Entry) map[string]string {
m := make(map[string]string)
for _, entries := range inventory {
for _, e := range entries {
m[strings.ToLower(e.Title)] = e.Slug
}
}
// Current batch overrides inventory
for _, p := range pages {
title := extractTitle(p.Content)
slug := strings.TrimSuffix(filepath.Base(p.Path), ".md")
if title != "" && slug != "" {
m[strings.ToLower(title)] = slug
}
}
return m
}
func canonicalizeContent(content string, titleToSlug map[string]string) (string, []string) {
var warnings []string
result := plainLinkRE.ReplaceAllStringFunc(content, func(match string) string {
sub := plainLinkRE.FindStringSubmatch(match)
if len(sub) < 2 {
return match
}
displayName := sub[1]
slug, ok := titleToSlug[strings.ToLower(displayName)]
if !ok {
warnings = append(warnings, fmt.Sprintf("unknown wikilink: [[%s]]", displayName))
return match
}
return "[[" + slug + "|" + displayName + "]]"
})
return result, warnings
}

View File

@@ -0,0 +1,125 @@
// ingestion/internal/pipeline/links_test.go
package pipeline
import (
"testing"
"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
"github.com/mathiasbq/hyperguild/ingestion/internal/wiki"
)
func TestCanonicalizeLinks_KnownTitle(t *testing.T) {
pages := []wiki.Page{
{
Path: "wiki/sources/shape-up.md",
Content: "---\ntitle: 'Shape Up'\n---\n\n## Summary\n\nSee [[Betting]].\n",
},
}
inventory := map[wiki.PageType][]wiki.Entry{
wiki.PageTypeConcept: {
{Slug: "betting", Title: "Betting"},
},
}
got, warnings := CanonicalizeLinks(pages, inventory)
require.Len(t, got, 1)
assert.Empty(t, warnings)
assert.Contains(t, got[0].Content, "[[betting|Betting]]")
assert.NotContains(t, got[0].Content, "[[Betting]]")
}
func TestCanonicalizeLinks_UnknownTitleLeftAsIs(t *testing.T) {
pages := []wiki.Page{
{
Path: "wiki/sources/shape-up.md",
Content: "---\ntitle: 'Shape Up'\n---\n\n## Summary\n\nSee [[Ghost Concept]].\n",
},
}
inventory := map[wiki.PageType][]wiki.Entry{}
got, warnings := CanonicalizeLinks(pages, inventory)
require.Len(t, got, 1)
assert.NotEmpty(t, warnings)
assert.Contains(t, got[0].Content, "[[Ghost Concept]]")
}
func TestCanonicalizeLinks_AlreadyCanonicalLinkUntouched(t *testing.T) {
// Links already in [[slug|Display]] format must not be double-converted
pages := []wiki.Page{
{
Path: "wiki/sources/shape-up.md",
Content: "---\ntitle: 'Shape Up'\n---\n\n## Summary\n\nSee [[betting|Betting]].\n",
},
}
inventory := map[wiki.PageType][]wiki.Entry{
wiki.PageTypeConcept: {
{Slug: "betting", Title: "Betting"},
},
}
got, warnings := CanonicalizeLinks(pages, inventory)
require.Len(t, got, 1)
assert.Empty(t, warnings)
// Should remain exactly as-is — not double-wrapped
assert.Contains(t, got[0].Content, "[[betting|Betting]]")
assert.NotContains(t, got[0].Content, "[[betting|[[betting|Betting]]]]")
}
func TestCanonicalizeLinks_CaseInsensitiveMatch(t *testing.T) {
pages := []wiki.Page{
{
Path: "wiki/sources/foo.md",
Content: "---\ntitle: 'Foo'\n---\n\n## Summary\n\nSee [[domain driven design]].\n",
},
}
inventory := map[wiki.PageType][]wiki.Entry{
wiki.PageTypeConcept: {
{Slug: "domain-driven-design", Title: "Domain Driven Design"},
},
}
got, warnings := CanonicalizeLinks(pages, inventory)
require.Len(t, got, 1)
assert.Empty(t, warnings)
assert.Contains(t, got[0].Content, "[[domain-driven-design|domain driven design]]")
}
func TestCanonicalizeLinks_CurrentBatchPagesResolved(t *testing.T) {
// A concept created in the same batch should be canonicalizable
pages := []wiki.Page{
{
Path: "wiki/sources/shape-up.md",
Content: "---\ntitle: 'Shape Up'\n---\n\n## Summary\n\nSee [[Betting]].\n",
},
{
Path: "wiki/concepts/betting.md",
Content: "---\ntitle: 'Betting'\n---\n\n## Definition\n\nA technique.\n",
},
}
inventory := map[wiki.PageType][]wiki.Entry{} // empty — Betting is in the batch, not inventory
got, warnings := CanonicalizeLinks(pages, inventory)
require.Len(t, got, 2)
assert.Empty(t, warnings)
assert.Contains(t, got[0].Content, "[[betting|Betting]]")
}
func TestCanonicalizeLinks_MultipleLinksInOnePage(t *testing.T) {
pages := []wiki.Page{
{
Path: "wiki/sources/foo.md",
Content: "---\ntitle: 'Foo'\n---\n\n## Summary\n\nSee [[Betting]] and [[Shape Up]].\n",
},
}
inventory := map[wiki.PageType][]wiki.Entry{
wiki.PageTypeConcept: {
{Slug: "betting", Title: "Betting"},
},
wiki.PageTypeSource: {
{Slug: "shape-up", Title: "Shape Up"},
},
}
got, warnings := CanonicalizeLinks(pages, inventory)
require.Len(t, got, 1)
assert.Empty(t, warnings)
assert.Contains(t, got[0].Content, "[[betting|Betting]]")
assert.Contains(t, got[0].Content, "[[shape-up|Shape Up]]")
}

View File

@@ -5,13 +5,21 @@ import (
"encoding/json" "encoding/json"
"fmt" "fmt"
"strings" "strings"
"github.com/mathiasbq/hyperguild/ingestion/internal/wiki"
) )
// ParsePages parses LLM output as a JSON array of {path, content} objects. // RawPage is the LLM's output format — minimal structured data with no path or frontmatter.
// The pipeline derives slugs, paths, and frontmatter from these fields.
type RawPage struct {
Title string `json:"title"`
Type string `json:"type"` // "source" | "concept" | "entity"
Subtype string `json:"subtype"` // entity: person|company|tool|model|framework|technology; source: article|pdf|book|video|note|project
Domain string `json:"domain"`
Content string `json:"content"` // Markdown body only — no frontmatter
}
// ParseRawPages parses LLM output as a JSON array of RawPage objects.
// If the array is truncated mid-object (token limit), it salvages all complete objects. // If the array is truncated mid-object (token limit), it salvages all complete objects.
func ParsePages(output string) ([]wiki.Page, []string) { func ParseRawPages(output string) ([]RawPage, []string) {
output = strings.TrimSpace(output) output = strings.TrimSpace(output)
if output == "" { if output == "" {
return nil, []string{"LLM returned empty output"} return nil, []string{"LLM returned empty output"}
@@ -19,7 +27,7 @@ func ParsePages(output string) ([]wiki.Page, []string) {
output = stripFences(output) output = stripFences(output)
var pages []wiki.Page var pages []RawPage
if err := json.Unmarshal([]byte(output), &pages); err == nil { if err := json.Unmarshal([]byte(output), &pages); err == nil {
return pages, nil return pages, nil
} }

View File

@@ -8,39 +8,54 @@ import (
"github.com/stretchr/testify/require" "github.com/stretchr/testify/require"
) )
func TestParsePages_ValidJSON(t *testing.T) { func TestParseRawPages_ValidJSON(t *testing.T) {
input := `[{"path":"wiki/sources/foo.md","content":"# Foo"},{"path":"wiki/concepts/bar.md","content":"# Bar"}]` input := `[{"title":"Shape Up","type":"source","subtype":"book","domain":"product-strategy","content":"## Summary\n\nFoo."},{"title":"Betting","type":"concept","content":"## Definition\n\nA technique."}]`
pages, warnings := ParsePages(input) pages, warnings := ParseRawPages(input)
require.Len(t, pages, 2) require.Len(t, pages, 2)
assert.Empty(t, warnings) assert.Empty(t, warnings)
assert.Equal(t, "wiki/sources/foo.md", pages[0].Path) assert.Equal(t, "Shape Up", pages[0].Title)
assert.Equal(t, "wiki/concepts/bar.md", pages[1].Path) assert.Equal(t, "source", pages[0].Type)
assert.Equal(t, "book", pages[0].Subtype)
assert.Equal(t, "product-strategy", pages[0].Domain)
assert.Equal(t, "Betting", pages[1].Title)
assert.Equal(t, "concept", pages[1].Type)
assert.Empty(t, pages[1].Subtype)
} }
func TestParsePages_StripsFences(t *testing.T) { func TestParseRawPages_StripsFences(t *testing.T) {
input := "```json\n[{\"path\":\"wiki/sources/foo.md\",\"content\":\"# Foo\"}]\n```" input := "```json\n[{\"title\":\"Foo\",\"type\":\"concept\",\"content\":\"## Definition\\n\\nFoo.\"}]\n```"
pages, warnings := ParsePages(input) pages, warnings := ParseRawPages(input)
assert.Len(t, pages, 1)
assert.Empty(t, warnings)
}
func TestParsePages_TruncationRecovery(t *testing.T) {
input := `[{"path":"wiki/sources/foo.md","content":"# Foo"},{"path":"wiki/concepts/bar.md","content":"trunc`
pages, warnings := ParsePages(input)
require.Len(t, pages, 1) require.Len(t, pages, 1)
assert.Equal(t, "wiki/sources/foo.md", pages[0].Path) assert.Empty(t, warnings)
assert.Equal(t, "Foo", pages[0].Title)
}
func TestParseRawPages_TruncationRecovery(t *testing.T) {
input := `[{"title":"Foo","type":"concept","content":"## Definition\n\nFoo."},{"title":"Bar","type":"concept","content":"trunc`
pages, warnings := ParseRawPages(input)
require.Len(t, pages, 1)
assert.Equal(t, "Foo", pages[0].Title)
assert.NotEmpty(t, warnings) assert.NotEmpty(t, warnings)
} }
func TestParsePages_EmptyInput(t *testing.T) { func TestParseRawPages_EmptyInput(t *testing.T) {
pages, warnings := ParsePages("") pages, warnings := ParseRawPages("")
assert.Empty(t, pages) assert.Empty(t, pages)
assert.NotEmpty(t, warnings) assert.NotEmpty(t, warnings)
} }
func TestParsePages_PlainFence(t *testing.T) { func TestParseRawPages_PlainFence(t *testing.T) {
input := "```\n[{\"path\":\"wiki/sources/foo.md\",\"content\":\"ok\"}]\n```" input := "```\n[{\"title\":\"Foo\",\"type\":\"concept\",\"content\":\"ok\"}]\n```"
pages, warnings := ParsePages(input) pages, warnings := ParseRawPages(input)
assert.Len(t, pages, 1) require.Len(t, pages, 1)
assert.Empty(t, warnings) assert.Empty(t, warnings)
} }
func TestParseRawPages_MissingTitle(t *testing.T) {
// Missing title — still parsed, Title is empty string
input := `[{"type":"concept","content":"## Definition\n\nFoo."}]`
pages, warnings := ParseRawPages(input)
require.Len(t, pages, 1)
assert.Empty(t, warnings)
assert.Empty(t, pages[0].Title)
}

View File

@@ -41,9 +41,11 @@ func Run(ctx context.Context, cfg Config, brainDir, content, source string, dryR
schema = loadSchema(brainDir) schema = loadSchema(brainDir)
} }
sourceSlug := wiki.Slug(source)
date := time.Now().UTC().Format("2006-01-02")
chunks := Chunk(content, cfg.ChunkSize) chunks := Chunk(content, cfg.ChunkSize)
var allPages []wiki.Page var allRaw []RawPage
var allWarnings []string var allWarnings []string
for _, chunk := range chunks { for _, chunk := range chunks {
@@ -52,18 +54,20 @@ func Run(ctx context.Context, cfg Config, brainDir, content, source string, dryR
if err != nil { if err != nil {
return Result{}, fmt.Errorf("LLM call: %w", err) return Result{}, fmt.Errorf("LLM call: %w", err)
} }
pages, warnings := ParsePages(output) raw, warnings := ParseRawPages(output)
allPages = append(allPages, pages...) allRaw = append(allRaw, raw...)
allWarnings = append(allWarnings, warnings...) allWarnings = append(allWarnings, warnings...)
} }
resolved := Resolve(allPages, inventory) pages, buildWarnings := BuildPages(allRaw, sourceSlug, date)
withRefs := injectSourceRefs(resolved, inventory, brainDir) allWarnings = append(allWarnings, buildWarnings...)
resolved := Resolve(pages, inventory)
canonicalized, linkWarnings := CanonicalizeLinks(resolved, inventory)
allWarnings = append(allWarnings, linkWarnings...)
withRefs := injectSourceRefs(canonicalized, inventory, brainDir)
merged := mergeAll(withRefs) merged := mergeAll(withRefs)
date := time.Now().UTC().Format("2006-01-02")
var written []string var written []string
for _, page := range merged { for _, page := range merged {
if !dryRun { if !dryRun {
dest := filepath.Join(brainDir, filepath.FromSlash(page.Path)) dest := filepath.Join(brainDir, filepath.FromSlash(page.Path))

View File

@@ -15,7 +15,6 @@ import (
"github.com/stretchr/testify/require" "github.com/stretchr/testify/require"
"github.com/mathiasbq/hyperguild/ingestion/internal/llm" "github.com/mathiasbq/hyperguild/ingestion/internal/llm"
"github.com/mathiasbq/hyperguild/ingestion/internal/wiki"
) )
func TestRun_WritesPages(t *testing.T) { func TestRun_WritesPages(t *testing.T) {
@@ -24,14 +23,19 @@ func TestRun_WritesPages(t *testing.T) {
require.NoError(t, os.MkdirAll(filepath.Join(brainDir, sub), 0o755)) require.NoError(t, os.MkdirAll(filepath.Join(brainDir, sub), 0o755))
} }
llmResponse := mustJSON([]wiki.Page{ llmResponse := mustJSON([]RawPage{
{ {
Path: "wiki/sources/test-article.md", Title: "Test Article",
Content: "---\ntitle: Test Article\ntype: article\ndomain: software-engineering\ndate_ingested: 2026-04-22\nlast_updated: 2026-04-22\naliases:\n - Test Article\n---\n\n## Summary\n\nA test article.\n\n## Key Claims\n\n- It tests things.\n\n## Concepts Introduced or Reinforced\n\n## Entities Mentioned\n\n## Open Questions Raised\n", Type: "source",
Subtype: "article",
Domain: "software-engineering",
Content: "## Summary\n\nA test article.\n\n## Key Claims\n\n- It tests things.\n\n## Concepts Introduced or Reinforced\n\n[[Testing]]\n\n## Entities Mentioned\n\n## Open Questions Raised\n",
}, },
{ {
Path: "wiki/concepts/testing.md", Title: "Testing",
Content: "---\ntitle: Testing\ndomain: software-engineering\nlast_updated: 2026-04-22\naliases:\n - Testing\n---\n\n## Definition\n\nThe practice of verifying software.\n\n## Why It Matters\n\nCatches bugs.\n\n## Related Concepts\n\n## Related Entities\n\n## Sources\n\n## Evolving Notes\n", Type: "concept",
Domain: "software-engineering",
Content: "## Definition\n\nThe practice of verifying software.\n\n## Why It Matters\n\nCatches bugs.\n\n## Related Concepts\n\n## Related Entities\n\n## Sources\n\n## Evolving Notes\n",
}, },
}) })
@@ -53,7 +57,6 @@ func TestRun_WritesPages(t *testing.T) {
result, err := Run(context.Background(), cfg, brainDir, "An article about testing.", "test-article", false) result, err := Run(context.Background(), cfg, brainDir, "An article about testing.", "test-article", false)
require.NoError(t, err) require.NoError(t, err)
assert.Len(t, result.Pages, 2) assert.Len(t, result.Pages, 2)
assert.Empty(t, result.Warnings)
_, err = os.Stat(filepath.Join(brainDir, "wiki", "sources", "test-article.md")) _, err = os.Stat(filepath.Join(brainDir, "wiki", "sources", "test-article.md"))
require.NoError(t, err) require.NoError(t, err)
@@ -71,9 +74,11 @@ func TestRun_DryRunDoesNotWrite(t *testing.T) {
require.NoError(t, os.MkdirAll(filepath.Join(brainDir, sub), 0o755)) require.NoError(t, os.MkdirAll(filepath.Join(brainDir, sub), 0o755))
} }
llmResponse := mustJSON([]wiki.Page{{ llmResponse := mustJSON([]RawPage{{
Path: "wiki/sources/foo.md", Title: "Foo",
Content: "---\ntitle: Foo\n---\n\n## Summary\n\nFoo.\n", Type: "source",
Subtype: "article",
Content: "## Summary\n\nFoo.\n",
}}) }})
srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
@@ -98,10 +103,10 @@ func TestRun_MergesDuplicatePaths(t *testing.T) {
require.NoError(t, os.MkdirAll(filepath.Join(brainDir, sub), 0o755)) require.NoError(t, os.MkdirAll(filepath.Join(brainDir, sub), 0o755))
} }
// LLM returns same path twice (simulates multi-chunk merge) // LLM returns same title twice (simulates multi-chunk duplicate)
llmResponse := mustJSON([]wiki.Page{ llmResponse := mustJSON([]RawPage{
{Path: "wiki/concepts/foo.md", Content: "---\ntitle: Foo\n---\n\n## Definition\n\nFirst.\n\n## Related Concepts\n\n- [[bar|Bar]]\n"}, {Title: "Foo", Type: "concept", Content: "## Definition\n\nFirst.\n\n## Related Concepts\n\n[[Bar]]\n"},
{Path: "wiki/concepts/foo.md", Content: "---\ntitle: Foo\n---\n\n## Definition\n\nSecond.\n\n## Related Concepts\n\n- [[baz|Baz]]\n"}, {Title: "Foo", Type: "concept", Content: "## Definition\n\nSecond.\n\n## Related Concepts\n\n[[Baz]]\n"},
}) })
srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
@@ -120,8 +125,9 @@ func TestRun_MergesDuplicatePaths(t *testing.T) {
require.NoError(t, err) require.NoError(t, err)
// keep-first for Definition, union for Related Concepts // keep-first for Definition, union for Related Concepts
assert.Contains(t, string(content), "First.") assert.Contains(t, string(content), "First.")
assert.Contains(t, string(content), "[[bar|Bar]]") // Bar and Baz unknown in empty inventory → left as plain [[links]]
assert.Contains(t, string(content), "[[baz|Baz]]") assert.Contains(t, string(content), "[[Bar]]")
assert.Contains(t, string(content), "[[Baz]]")
} }
func mustJSON(v any) string { func mustJSON(v any) string {

View File

@@ -12,12 +12,15 @@ import (
const systemPrompt = `You are a wiki agent. Read the source material and produce structured wiki pages following the schema provided. const systemPrompt = `You are a wiki agent. Read the source material and produce structured wiki pages following the schema provided.
Output ONLY a valid JSON array — no markdown fences, no other text before or after. Output ONLY a valid JSON array — no markdown fences, no other text before or after.
Each element must have: Each element must have exactly these fields:
"path" — relative path within the wiki, e.g. "wiki/sources/foo.md" "title" — exact page title (e.g. "FinBERT", "Ryan Singer", "Shape Up")
"content" — full markdown content of the page including YAML frontmatter "type" — exactly one of: "source", "concept", "entity"
"subtype" — for source: article|pdf|book|video|note|project; for entity: person|company|tool|model|framework|technology; omit for concept
"domain" — one of the domains in the schema (omit if none fits)
"content" — Markdown body only — NO frontmatter, NO path, NO slug
Follow the schema strictly: correct frontmatter fields, wikilinks as [[slug|Display Text]], Wikilinks in content: [[Display Name]] — just the display name, no slug, no pipe separator.
dates in YYYY-MM-DD format, and paraphrase rather than quoting verbatim.` Only link to pages listed in the inventory or pages you are creating in this response.`
// BuildPrompt constructs the user prompt for a single chunk. // BuildPrompt constructs the user prompt for a single chunk.
func BuildPrompt(schema, source, content string, inventory map[wiki.PageType][]wiki.Entry) string { func BuildPrompt(schema, source, content string, inventory map[wiki.PageType][]wiki.Entry) string {
@@ -30,7 +33,7 @@ func BuildPrompt(schema, source, content string, inventory map[wiki.PageType][]w
sb.WriteString("\n\n") sb.WriteString("\n\n")
sb.WriteString("## Existing wiki pages\n\n") sb.WriteString("## Existing wiki pages\n\n")
sb.WriteString("Link ONLY to pages in this inventory or pages you are creating in this response.\n\n") sb.WriteString("Reference these pages by display name only — [[Display Name]] — in your content.\n\n")
for _, pt := range []wiki.PageType{wiki.PageTypeConcept, wiki.PageTypeEntity, wiki.PageTypeSource} { for _, pt := range []wiki.PageType{wiki.PageTypeConcept, wiki.PageTypeEntity, wiki.PageTypeSource} {
entries := inventory[pt] entries := inventory[pt]
@@ -39,19 +42,19 @@ func BuildPrompt(schema, source, content string, inventory map[wiki.PageType][]w
fmt.Fprintf(&sb, "%s — (none yet)\n\n", label) fmt.Fprintf(&sb, "%s — (none yet)\n\n", label)
continue continue
} }
fmt.Fprintf(&sb, "%s — link ONLY under the matching section:\n", label) fmt.Fprintf(&sb, "%s:\n", label)
for _, e := range entries { for _, e := range entries {
fmt.Fprintf(&sb, " - [[%s|%s]]\n", e.Slug, e.Title) fmt.Fprintf(&sb, " - %s\n", e.Title)
} }
sb.WriteString("\n") sb.WriteString("\n")
} }
sb.WriteString("## Non-negotiable rules\n\n") sb.WriteString("## Non-negotiable rules\n\n")
sb.WriteString("1. Output ONLY a valid JSON array — no prose, no fences.\n") sb.WriteString("1. Output ONLY a valid JSON array — no prose, no fences.\n")
sb.WriteString("2. Slugs are kebab-case: lowercase, spaces→hyphens, no special chars.\n") sb.WriteString("2. Fields: title, type, subtype (if applicable), domain (if applicable), content.\n")
sb.WriteString("3. Wikilinks: [[slug|Display Text]] — the pipe is required.\n") sb.WriteString("3. Wikilinks: [[Display Name]] — no slug, no pipe. The pipeline handles slugs.\n")
sb.WriteString("4. Section links must match their section type.\n") sb.WriteString("4. Section links must match their section type (Related Concepts → concepts only, etc.).\n")
sb.WriteString("5. One source page per book — update it if inventory shows it exists.\n\n") sb.WriteString("5. One source page per book — if inventory shows it exists, return it as an UPDATE.\n\n")
fmt.Fprintf(&sb, "## Source: %s\n\n", source) fmt.Fprintf(&sb, "## Source: %s\n\n", source)
sb.WriteString(content) sb.WriteString(content)

View File

@@ -14,13 +14,12 @@ import (
"github.com/stretchr/testify/require" "github.com/stretchr/testify/require"
"github.com/mathiasbq/hyperguild/ingestion/internal/pipeline" "github.com/mathiasbq/hyperguild/ingestion/internal/pipeline"
"github.com/mathiasbq/hyperguild/ingestion/internal/wiki"
) )
// successComplete returns a valid JSON-encoded page array for any call. // successComplete returns a valid JSON-encoded RawPage array for any call.
func successComplete(page wiki.Page) pipeline.CompleteFunc { func successComplete(raw pipeline.RawPage) pipeline.CompleteFunc {
return func(ctx context.Context, system, user string) (string, error) { return func(ctx context.Context, system, user string) (string, error) {
b, err := json.Marshal([]wiki.Page{page}) b, err := json.Marshal([]pipeline.RawPage{raw})
if err != nil { if err != nil {
return "", err return "", err
} }
@@ -50,16 +49,19 @@ func TestStart_ProcessesFile(t *testing.T) {
require.NoError(t, os.WriteFile(rawFile, []byte("Content about Shape Up."), 0o644)) require.NoError(t, os.WriteFile(rawFile, []byte("Content about Shape Up."), 0o644))
date := time.Now().UTC().Format("2006-01-02") date := time.Now().UTC().Format("2006-01-02")
wikiPage := wiki.Page{ rawPage := pipeline.RawPage{
Path: "wiki/sources/shape-up-book.md", Title: "Shape Up Book",
Content: "---\ntitle: Shape Up Book\ntype: article\ndomain: product-management\ndate_ingested: " + date + "\nlast_updated: " + date + "\naliases:\n - Shape Up Book\n---\n\n## Summary\n\nA book about Shape Up.\n", Type: "source",
Subtype: "article",
Domain: "product-management",
Content: "## Summary\n\nA book about Shape Up.\n",
} }
cfg := Config{ cfg := Config{
BrainDir: brainDir, BrainDir: brainDir,
Interval: 50 * time.Millisecond, Interval: 50 * time.Millisecond,
Pipeline: pipeline.Config{ Pipeline: pipeline.Config{
Complete: successComplete(wikiPage), Complete: successComplete(rawPage),
ChunkSize: 0, ChunkSize: 0,
Schema: "# Schema\nThree page types.", Schema: "# Schema\nThree page types.",
}, },
@@ -193,12 +195,14 @@ func TestProcessDir_SkipsSubdirs(t *testing.T) {
// Track which sources were passed to Complete. // Track which sources were passed to Complete.
var processedSources []string var processedSources []string
completeFn := func(ctx context.Context, system, user string) (string, error) { completeFn := func(ctx context.Context, system, user string) (string, error) {
// Record that this was called; return a minimal valid page. // Record that this was called; return a minimal valid RawPage.
page := wiki.Page{ raw := pipeline.RawPage{
Path: "wiki/sources/valid.md", Title: "Valid",
Content: "---\ntitle: Valid\n---\n\n## Summary\n\nValid.\n", Type: "source",
Subtype: "article",
Content: "## Summary\n\nValid.\n",
} }
b, _ := json.Marshal([]wiki.Page{page}) b, _ := json.Marshal([]pipeline.RawPage{raw})
processedSources = append(processedSources, "called") processedSources = append(processedSources, "called")
return string(b), nil return string(b), nil
} }