Brain Ingestion Pipeline — Design Spec
Date: 2026-04-22 Status: approved Author: Mathias + Claude
Overview
Add a structured ingestion pipeline to the hyperguild brain. The pipeline accepts raw content (directly or from files) and uses an LLM to produce structured wiki pages in brain/wiki/ — the declarative layer of the Two-Layer Brain. Three fixed knowledge classes: concepts, entities, sources.
This spec covers:
- Four new packages in the ingestion Go module (llm, wiki, pipeline, watcher)
- Two new HTTP endpoints on the ingestion server (/ingest, /ingest-path)
- A background file watcher for brain/raw/
- Config additions to both the ingestion server and the supervisor
It does not cover Layer 2 (training data, brain/training-data/) — that is the trainer worker's concern.
Information Model
Three fixed wiki page classes, matching the Two-Layer Brain design spec and the existing ingestion-svc model:
wiki/sources/<slug>.md
One page per ingested source (project, book, article, note). Updated (not replaced) on re-ingestion.
Required frontmatter: title, type (article|pdf|book|video|note|project), domain, source_url, date_ingested, last_updated, aliases.
Body sections: Summary · Key Claims · Concepts Introduced or Reinforced · Entities Mentioned · Open Questions Raised. Books add: Chapters · Argument Arc · Updates (dated, append-only).
wiki/concepts/<slug>.md
One page per idea, framework, methodology, or pattern (e.g. Domain Driven Design, TDD, event sourcing).
Required frontmatter: title, domain, last_updated, aliases.
Body sections: Definition · Why It Matters · Related Concepts · Related Entities · Sources · Evolving Notes.
wiki/entities/<slug>.md
One page per person, tool, organisation, technology, or product.
Required frontmatter: title, type (person|company|tool|model|framework|technology), domain, last_updated, aliases.
Body sections: Description · Relevance · Key Positions/Products/Claims · Related Concepts · Related Entities · Sources.
Wikilink format
All cross-references use [[slug|Display Text]]. Slug = lowercase title, spaces→hyphens, non-alphanumeric stripped. Slugs must resolve to an existing file in the wiki.
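The slug rule above can be sketched in Go as follows. This is a minimal illustration, not the wiki package's actual code: the names Slugify and Wikilink are hypothetical, and collapsing repeated hyphens, keeping existing hyphens, and trimming hyphens at the edges are assumptions the spec does not state.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// Slugify lowercases the title, turns runs of whitespace (and existing
// hyphens) into a single hyphen, and drops every other character that is
// not a letter or digit. Collapsing and trimming hyphens is an assumption.
func Slugify(title string) string {
	var b strings.Builder
	prevHyphen := true // starts true so leading separators are dropped
	for _, r := range strings.ToLower(title) {
		switch {
		case unicode.IsLetter(r) || unicode.IsDigit(r):
			b.WriteRune(r)
			prevHyphen = false
		case unicode.IsSpace(r) || r == '-':
			if !prevHyphen {
				b.WriteByte('-')
				prevHyphen = true
			}
		}
	}
	return strings.TrimSuffix(b.String(), "-")
}

// Wikilink formats a cross-reference as [[slug|Display Text]].
func Wikilink(display string) string {
	return "[[" + Slugify(display) + "|" + display + "]]"
}

func main() {
	fmt.Println(Slugify("Domain Driven Design"))
	fmt.Println(Wikilink("Event Sourcing"))
}
```

Note that this sketch only builds the link text; checking that the slug resolves to an existing file would need the wiki inventory.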
Supporting files
- brain/wiki/index.md: auto-rebuilt on every ingest, with a one-sentence summary per page, grouped by type
- brain/log.md: append-only audit trail with date, source, pages written, warnings
Architecture
New packages (ingestion module)
ingestion/internal/
llm/ — OpenAI-compatible HTTP client (chat completions, retry on 429,
configurable timeout and temperature)
wiki/ — Page types, slug utilities, merge logic, inventory loader,
index rebuilder, log appender
pipeline/ — Orchestrates one ingest run end-to-end (content or extracted file text)
watcher/ — Polls brain/raw/ and triggers pipeline on new files
The existing api/ and search/ packages are updated; no other existing packages change.
Brain directory layout
brain/
  wiki/
    concepts/ ← LLM-structured concept pages
    entities/ ← LLM-structured entity pages
    sources/ ← LLM-structured source pages
    index.md ← auto-rebuilt on each ingest
  knowledge/ ← quick raw notes via brain_write (BM25-searchable, unchanged)
  raw/ ← drop zone; watcher picks up files here
    processed/ ← moved here on success (organised by date: processed/YYYY-MM-DD/)
    failed/ ← moved here on failure
  sessions/ ← session logs (retrospective/trainer concern, not touched here)
  training-data/ ← Layer 2 (trainer worker concern, not touched here)
  log.md ← append-only audit trail
  CLAUDE.md ← schema document injected into every ingest prompt
If brain/CLAUDE.md is absent, the pipeline falls back to an embedded default schema compiled into the binary.
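The fallback could look roughly like the sketch below. LoadSchema and defaultSchema are hypothetical names; the constant stands in for the embedded default, which the real binary would presumably compile in (for example with go:embed). The sketch assumes brainDir points at the brain/ directory.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// defaultSchema is a placeholder for the schema compiled into the binary.
const defaultSchema = "# Wiki schema (embedded default)\n"

// LoadSchema returns the contents of <brainDir>/CLAUDE.md, or the embedded
// default when the file does not exist. Other read errors are returned.
func LoadSchema(brainDir string) (string, error) {
	data, err := os.ReadFile(filepath.Join(brainDir, "CLAUDE.md"))
	if errors.Is(err, fs.ErrNotExist) {
		return defaultSchema, nil
	}
	if err != nil {
		return "", err
	}
	return string(data), nil
}

func main() {
	schema, err := LoadSchema("brain")
	if err != nil {
		fmt.Fprintln(os.Stderr, "load schema:", err)
		os.Exit(1)
	}
	fmt.Print(schema)
}
```

Only a missing file triggers the fallback here; a permission error is surfaced so that a misconfigured brain directory does not silently run with the default schema.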
API
POST /ingest
Ingest content provided directly by the caller.
Request:
{
  "content": "...",
  "source": "shape-up-book",
  "dry_run": false
}
Response:
{
  "pages": ["wiki/sources/shape-up.md", "wiki/concepts/betting-table.md"],
  "warnings": []
}
source is the human-readable name used when writing/updating wiki/sources/<slug>.md. dry_run: true returns the page contents without writing.
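For illustration, the request and response shapes above map onto Go types like the ones below. The type and function names are hypothetical, and treating both content and source as required for /ingest is an assumption based on the "400 on missing fields" test note.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// IngestRequest mirrors the POST /ingest request body.
type IngestRequest struct {
	Content string `json:"content"`
	Source  string `json:"source"`
	DryRun  bool   `json:"dry_run"`
}

// IngestResponse mirrors the response body shared by both endpoints.
type IngestResponse struct {
	Pages    []string `json:"pages"`
	Warnings []string `json:"warnings"`
}

// DecodeIngestRequest parses and validates a request body. A handler would
// map a returned error to HTTP 400 (assumed behaviour).
func DecodeIngestRequest(body string) (IngestRequest, error) {
	var req IngestRequest
	if err := json.Unmarshal([]byte(body), &req); err != nil {
		return req, err
	}
	if strings.TrimSpace(req.Content) == "" {
		return req, errors.New("content is required")
	}
	if strings.TrimSpace(req.Source) == "" {
		return req, errors.New("source is required")
	}
	return req, nil
}

func main() {
	req, err := DecodeIngestRequest(`{"content": "...", "source": "shape-up-book", "dry_run": false}`)
	fmt.Println(req.Source, req.DryRun, err)

	out, _ := json.Marshal(IngestResponse{
		Pages:    []string{"wiki/sources/shape-up.md"},
		Warnings: []string{}, // non-nil so it encodes as [] rather than null
	})
	fmt.Println(string(out))
}
```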
POST /ingest-path
Ingest a file or walk a directory recursively. Supports .md, .txt, .pdf.
Request:
{
  "path": "/Users/mathias/brain/raw/shape-up.pdf",
  "source": "shape-up-book",
  "dry_run": false
}
If path is a directory, all supported files within it are ingested in sequence. source is optional for directory ingestion — if omitted, the LLM derives it from each file's name and content.
Response: same shape as /ingest, with pages and warnings aggregated across all files.
Supervisor skill update
brain_ingest in internal/skills/brain/handlers.go gains an optional path field. If path is set, it calls /ingest-path; otherwise /ingest.
Pipeline
pipeline.Run(ctx, cfg, brainDir, content, source, dryRun) — called by both HTTP handlers after any file reading is done.
Steps:
- Load inventory: walk brain/wiki/{concepts,entities,sources}/ and build a slug index grouped by type. The index is injected into the prompt so the LLM knows what to update vs create.
- Load schema: read brain/CLAUDE.md; fall back to the embedded default if absent.
- Chunk: split content at INGEST_CHUNK_SIZE chars (default 6000; split on paragraph boundary). If INGEST_CHUNK_SIZE=0, no chunking.
- LLM call per chunk: returns a JSON array of {"path": "wiki/concepts/foo.md", "content": "..."}. Prompt structure: system instruction → date → schema → inventory → non-negotiable slug/wikilink rules → source content.
- Parse + truncation recovery: strip markdown fences if present. If the JSON array is truncated mid-object (token limit), salvage all complete objects before the break and log a warning.
- Merge: combine pages with the same path across chunks:
  - Bullet sections (Related Concepts, Related Entities, Sources, Key Claims): union unique lines
  - Append sections (Evolving Notes, Updates, Open Questions): append new content
  - All other sections: keep first occurrence
  - Frontmatter: keep first occurrence
- Write: create subdirs as needed, write files atomically. In dry-run mode, return the page map without writing.
- Rebuild index.md: one-sentence summary per page (derived from first body paragraph), grouped by type, with page count header.
- Append to log.md: date, source, list of pages written, warning count.
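The parse and truncation-recovery step can be sketched with the standard library's streaming JSON decoder, which yields each complete array element before it reaches the break. This is one possible approach under stated assumptions, not necessarily the pipeline's: Page, StripFences, and ParsePages are hypothetical names, and any decode error after the opening bracket is treated as truncation.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Page is one element of the LLM's JSON array.
type Page struct {
	Path    string `json:"path"`
	Content string `json:"content"`
}

// StripFences removes a leading ```lang line and a trailing ``` if present.
func StripFences(s string) string {
	s = strings.TrimSpace(s)
	if strings.HasPrefix(s, "```") {
		if i := strings.Index(s, "\n"); i >= 0 {
			s = s[i+1:]
		} else {
			s = ""
		}
	}
	s = strings.TrimSuffix(strings.TrimSpace(s), "```")
	return strings.TrimSpace(s)
}

// ParsePages decodes the array element by element. If an element fails to
// decode or the closing bracket is missing, it returns the pages decoded so
// far with truncated=true so the caller can log a warning.
func ParsePages(raw string) ([]Page, bool, error) {
	dec := json.NewDecoder(strings.NewReader(StripFences(raw)))
	tok, err := dec.Token()
	if err != nil {
		return nil, false, err
	}
	if d, ok := tok.(json.Delim); !ok || d != '[' {
		return nil, false, fmt.Errorf("expected JSON array, got %v", tok)
	}
	var pages []Page
	for dec.More() {
		var p Page
		if err := dec.Decode(&p); err != nil {
			return pages, true, nil // salvage everything before the break
		}
		pages = append(pages, p)
	}
	if _, err := dec.Token(); err != nil {
		return pages, true, nil // closing ']' never arrived
	}
	return pages, false, nil
}

func main() {
	raw := `[{"path":"wiki/concepts/tdd.md","content":"..."},{"path":"wiki/ent`
	pages, truncated, err := ParsePages(raw)
	fmt.Println(len(pages), truncated, err)
}
```

Streaming avoids hand-tracking string and escape state to find the last complete object, at the cost of not distinguishing truncation from other malformed output.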
File Watcher
Background goroutine started at server startup (when INGEST_WATCH_INTERVAL > 0).
Poll loop:
- Walk brain/raw/ for files with supported extensions (.md, .txt, .pdf), excluding processed/ and failed/ subdirs.
- For each file found: derive source from filename (strip extension, kebab-to-title), call pipeline.Run with the file content.
- On success: move file to brain/raw/processed/YYYY-MM-DD/<filename>.
- On failure: move file to brain/raw/failed/<filename>, append error to brain/log.md.
- Sleep INGEST_WATCH_INTERVAL seconds, repeat.
Files are processed one at a time (no concurrency within the watcher) to avoid LLM rate-limit collisions.
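One pass of the poll loop might be sketched as below. PollOnce, IsSupported, and SourceFromFilename are hypothetical names, the ingest callback stands in for reading the file and calling pipeline.Run, and appending the error to brain/log.md is left out for brevity. The title-casing assumes ASCII filenames.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"strings"
	"time"
)

var supported = map[string]bool{".md": true, ".txt": true, ".pdf": true}

// IsSupported reports whether the file extension is one the watcher ingests.
func IsSupported(name string) bool {
	return supported[strings.ToLower(filepath.Ext(name))]
}

// SourceFromFilename strips the extension and turns kebab-case into Title Case.
func SourceFromFilename(name string) string {
	base := strings.TrimSuffix(filepath.Base(name), filepath.Ext(name))
	words := strings.FieldsFunc(base, func(r rune) bool { return r == '-' })
	for i, w := range words {
		words[i] = strings.ToUpper(w[:1]) + w[1:]
	}
	return strings.Join(words, " ")
}

// PollOnce ingests every supported file under rawDir sequentially, then moves
// it to processed/YYYY-MM-DD/ on success or failed/ on error.
func PollOnce(rawDir string, now time.Time, ingest func(path, source string) error) error {
	skip := map[string]bool{
		filepath.Join(rawDir, "processed"): true,
		filepath.Join(rawDir, "failed"):    true,
	}
	var files []string // collected first so moves do not disturb the walk
	err := filepath.WalkDir(rawDir, func(p string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() {
			if skip[p] {
				return filepath.SkipDir
			}
			return nil
		}
		if IsSupported(d.Name()) {
			files = append(files, p)
		}
		return nil
	})
	if err != nil {
		return err
	}
	for _, p := range files {
		dest := filepath.Join(rawDir, "processed", now.Format("2006-01-02"))
		if err := ingest(p, SourceFromFilename(p)); err != nil {
			dest = filepath.Join(rawDir, "failed")
		}
		if err := os.MkdirAll(dest, 0o755); err != nil {
			return err
		}
		if err := os.Rename(p, filepath.Join(dest, filepath.Base(p))); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	raw, err := os.MkdirTemp("", "raw")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(raw)
	os.WriteFile(filepath.Join(raw, "shape-up-book.md"), []byte("notes"), 0o644)
	err = PollOnce(raw, time.Now(), func(path, source string) error {
		fmt.Println("ingesting", source)
		return nil
	})
	fmt.Println("poll error:", err)
}
```

A caller would wrap PollOnce in a loop that sleeps INGEST_WATCH_INTERVAL seconds between passes and exits when the server's context is cancelled.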
LLM Prompt
System:
You are a wiki agent. Read the source material and produce structured wiki pages following the schema provided. Output ONLY a valid JSON array, with no markdown fences and no other text. Each element must have: "path" (relative path within wiki, e.g. "wiki/sources/foo.md") and "content" (full markdown including YAML frontmatter). Follow the schema strictly: correct frontmatter fields, wikilinks as [[slug|Display Text]], dates in YYYY-MM-DD format, paraphrase rather than quoting verbatim.
User (built dynamically):
- Today's date
- Full schema (brain/CLAUDE.md content)
- Existing wiki inventory grouped by type (for update-vs-create decisions)
- Non-negotiable rules: slug format, wikilink format, one-source-per-book, section type enforcement
- Source content (the chunk)
Temperature: 0.2 for reproducibility.
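As a rough sketch, the prompt parts above could be assembled into an OpenAI-compatible chat completions request like this. The function names and the section headings inside the user message are illustrative assumptions; only the ordering, the two roles, and the 0.2 temperature come from this spec.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// ChatMessage and ChatRequest carry the chat completions fields this sketch
// needs: model, messages (role + content), and temperature.
type ChatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type ChatRequest struct {
	Model       string        `json:"model"`
	Messages    []ChatMessage `json:"messages"`
	Temperature float64       `json:"temperature"`
}

// BuildUserPrompt joins the dynamic parts in the order listed above:
// date, schema, inventory, rules, then the source chunk.
func BuildUserPrompt(date, schema, inventory, rules, chunk string) string {
	return strings.Join([]string{
		"Today's date: " + date,
		"## Schema\n" + schema,
		"## Existing wiki inventory\n" + inventory,
		"## Non-negotiable rules\n" + rules,
		"## Source content\n" + chunk,
	}, "\n\n")
}

// BuildChatRequest pairs the fixed system instruction with the per-chunk
// user prompt at temperature 0.2.
func BuildChatRequest(model, system, user string) ChatRequest {
	return ChatRequest{
		Model: model,
		Messages: []ChatMessage{
			{Role: "system", Content: system},
			{Role: "user", Content: user},
		},
		Temperature: 0.2,
	}
}

func main() {
	user := BuildUserPrompt("2026-04-22", "(schema)", "(inventory)", "(rules)", "(chunk)")
	req := BuildChatRequest("koala/qwen35-9b-fast", "You are a wiki agent.", user)
	body, err := json.MarshalIndent(req, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```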
Configuration
Ingestion server (new env vars)
| Variable | Default | Description |
|---|---|---|
| INGEST_LLM_URL | http://iguana:4000/v1 | OpenAI-compatible endpoint |
| INGEST_LLM_KEY | (empty) | API key |
| INGEST_LLM_MODEL | koala/qwen35-9b-fast | Model name |
| INGEST_LLM_TIMEOUT | 15 | LLM call timeout (minutes) |
| INGEST_CHUNK_SIZE | 6000 | Max chars per LLM call (0 = no chunking) |
| INGEST_WATCH_INTERVAL | 30 | Watcher poll interval in seconds (0 = disabled) |
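A minimal sketch of loading these variables with their defaults follows. The Config struct, its field names, and the helper functions are hypothetical; the lookup function is injected so the loader can be exercised without touching the process environment. An empty value is treated as unset, so INGEST_CHUNK_SIZE=0 and INGEST_WATCH_INTERVAL=0 still disable chunking and the watcher.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// Config holds the ingestion server settings from the table above.
type Config struct {
	LLMURL        string
	LLMKey        string
	LLMModel      string
	LLMTimeout    time.Duration
	ChunkSize     int           // 0 = no chunking
	WatchInterval time.Duration // 0 = watcher disabled
}

func strEnv(getenv func(string) string, key, def string) string {
	if v := getenv(key); v != "" {
		return v
	}
	return def
}

func intEnv(getenv func(string) string, key string, def int) (int, error) {
	v := getenv(key)
	if v == "" {
		return def, nil
	}
	n, err := strconv.Atoi(v)
	if err != nil || n < 0 {
		return 0, fmt.Errorf("%s: want a non-negative integer, got %q", key, v)
	}
	return n, nil
}

// LoadConfig reads each variable through getenv and applies the defaults.
func LoadConfig(getenv func(string) string) (Config, error) {
	timeoutMin, err := intEnv(getenv, "INGEST_LLM_TIMEOUT", 15)
	if err != nil {
		return Config{}, err
	}
	chunk, err := intEnv(getenv, "INGEST_CHUNK_SIZE", 6000)
	if err != nil {
		return Config{}, err
	}
	watchSec, err := intEnv(getenv, "INGEST_WATCH_INTERVAL", 30)
	if err != nil {
		return Config{}, err
	}
	return Config{
		LLMURL:        strEnv(getenv, "INGEST_LLM_URL", "http://iguana:4000/v1"),
		LLMKey:        getenv("INGEST_LLM_KEY"),
		LLMModel:      strEnv(getenv, "INGEST_LLM_MODEL", "koala/qwen35-9b-fast"),
		LLMTimeout:    time.Duration(timeoutMin) * time.Minute,
		ChunkSize:     chunk,
		WatchInterval: time.Duration(watchSec) * time.Second,
	}, nil
}

func main() {
	cfg, err := LoadConfig(os.Getenv)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("%+v\n", cfg)
}
```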
Supervisor (new env vars + wiring)
| Variable | Default | Description |
|---|---|---|
| INGEST_SVC_URL | (empty) | URL of ingestion server for brain_ingest |
| KB_RETRIEVAL_URL | (empty) | URL of KB retrieval server for brain_search |
config.go gets two new fields. main.go passes them to brain.New(). Both tools are only registered as MCP tools when the respective URL is configured (already implemented in skill.go).
Testing
| Package | What is tested |
|---|---|
| wiki/ | Slug generation (edge cases: apostrophes, colons, version strings), merge logic (bullets union, append, keep-first), inventory loading from temp dir, truncation recovery (valid partial JSON), index rebuild output |
| pipeline/ | Integration test: temp brain dir + mock LLM HTTP server returning fixture JSON; verify files written to correct paths, index rebuilt, log appended |
| api/ | Handler tests for /ingest and /ingest-path using mock pipeline; 400 on missing fields, 200 with expected response shape |
| watcher/ | File placed in brain/raw/ is moved to processed/ on mock-pipeline success; moved to failed/ on error |
All tests are table-driven. No real LLM calls in tests.
Out of Scope
- Python validation/correction loop (can be added later; the LLM prompt enforces schema rules as non-negotiable instructions)
- brain/training-data/: trainer worker concern
- brain/sessions/: retrospective/sessionlog concern
- Upload endpoint (multipart HTTP): scp/rsync to brain/raw/ + watcher covers this
- Qdrant vector indexing: brain_search calls a separate KB retrieval service; ingestion does not write to Qdrant