# Brain Ingestion Pipeline — Design Spec

**Date:** 2026-04-22
**Status:** approved
**Author:** Mathias + Claude

---

## Overview

Add a structured ingestion pipeline to the hyperguild brain. The pipeline accepts raw content (directly or from files) and uses an LLM to produce structured wiki pages in `brain/wiki/` — the declarative layer of the Two-Layer Brain. Pages fall into three fixed knowledge classes: **concepts**, **entities**, and **sources**.

This spec covers:
- Four new packages in the `ingestion` Go module (`llm`, `wiki`, `pipeline`, `watcher`)
- Two new HTTP endpoints on the ingestion server (`/ingest`, `/ingest-path`)
- A background file watcher for `brain/raw/`
- Config additions to both the ingestion server and the supervisor

It does **not** cover Layer 2 (training data, `brain/training-data/`) — that is the trainer worker's concern.

---

## Information Model

Three fixed wiki page classes, matching the Two-Layer Brain design spec and the existing `ingestion-svc` model:

### `wiki/sources/<slug>.md`
One page per ingested source (project, book, article, note). Updated (not replaced) on re-ingestion.

Required frontmatter: `title`, `type` (article|pdf|book|video|note|project), `domain`, `source_url`, `date_ingested`, `last_updated`, `aliases`.

Body sections: Summary · Key Claims · Concepts Introduced or Reinforced · Entities Mentioned · Open Questions Raised. Books add: Chapters · Argument Arc · Updates (dated, append-only).

### `wiki/concepts/<slug>.md`
One page per idea, framework, methodology, or pattern (e.g. Domain Driven Design, TDD, event sourcing).

Required frontmatter: `title`, `domain`, `last_updated`, `aliases`.

Body sections: Definition · Why It Matters · Related Concepts · Related Entities · Sources · Evolving Notes.

### `wiki/entities/<slug>.md`
One page per person, tool, organisation, technology, or product.

Required frontmatter: `title`, `type` (person|company|tool|model|framework|technology), `domain`, `last_updated`, `aliases`.

Body sections: Description · Relevance · Key Positions/Products/Claims · Related Concepts · Related Entities · Sources.

### Wikilink format
All cross-references use `[[slug|Display Text]]`. Slug = lowercase title, spaces→hyphens, non-alphanumeric stripped. Slugs must resolve to an existing file in the wiki.
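
The slug rule can be sketched in a few lines of Go. This is a minimal illustration, not the `wiki/` package itself: the names `Slugify` and `Wikilink` are hypothetical, and the handling of existing hyphens (kept) and of repeated separators (collapsed into one hyphen) is an assumption the rule above does not spell out.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// Slugify lowercases the title, turns whitespace into hyphens, and drops
// every other character that is not a letter or digit.
// Assumption: existing hyphens are kept and runs of separators collapse
// into a single hyphen.
func Slugify(title string) string {
	var b strings.Builder
	lastHyphen := true // true at the start so no leading hyphen is written
	for _, r := range strings.ToLower(title) {
		switch {
		case unicode.IsLetter(r) || unicode.IsDigit(r):
			b.WriteRune(r)
			lastHyphen = false
		case unicode.IsSpace(r) || r == '-':
			if !lastHyphen {
				b.WriteByte('-')
				lastHyphen = true
			}
		}
	}
	// Runs are collapsed, so at most one trailing hyphen can remain.
	return strings.TrimSuffix(b.String(), "-")
}

// Wikilink renders a cross-reference in the [[slug|Display Text]] form.
func Wikilink(title string) string {
	return fmt.Sprintf("[[%s|%s]]", Slugify(title), title)
}

func main() {
	fmt.Println(Slugify("Domain Driven Design"))
	fmt.Println(Slugify("Conway's Law"))
	fmt.Println(Wikilink("Event Sourcing"))
}
```

Checking that a slug resolves to an existing file is a separate step and is not shown here.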

### Supporting files
- `brain/wiki/index.md` — auto-rebuilt on every ingest: one-sentence summary per page, grouped by type
- `brain/log.md` — append-only audit trail: date, source, pages written, warnings

---

## Architecture

### New packages (`ingestion` module)

```
ingestion/internal/
  llm/       — OpenAI-compatible HTTP client (chat completions, retry on 429,
               configurable timeout and temperature)
  wiki/      — Page types, slug utilities, merge logic, inventory loader,
               index rebuilder, log appender
  pipeline/  — Orchestrates one ingest run end-to-end (content or extracted file text)
  watcher/   — Polls brain/raw/ and triggers pipeline on new files
```

The existing `api/` and `search/` packages are updated; no other existing packages change.

### Brain directory layout

```
brain/
  wiki/
    concepts/      ← LLM-structured concept pages
    entities/      ← LLM-structured entity pages
    sources/       ← LLM-structured source pages
    index.md       ← auto-rebuilt on each ingest
  knowledge/       ← quick raw notes via brain_write (BM25-searchable, unchanged)
  raw/             ← drop zone; watcher picks up files here
    processed/     ← moved here on success (organised by date: processed/YYYY-MM-DD/)
    failed/        ← moved here on failure
  sessions/        ← session logs (retrospective/trainer concern, not touched here)
  training-data/   ← Layer 2 (trainer worker concern, not touched here)
  log.md           ← append-only audit trail
  CLAUDE.md        ← schema document injected into every ingest prompt
```

If `brain/CLAUDE.md` is absent, the pipeline falls back to an embedded default schema compiled into the binary.

---

## API

### `POST /ingest`

Ingest content provided directly by the caller.

**Request:**
```json
{
  "content": "...",
  "source": "shape-up-book",
  "dry_run": false
}
```

**Response:**
```json
{
  "pages": ["wiki/sources/shape-up.md", "wiki/concepts/betting-table.md"],
  "warnings": []
}
```

`source` is the human-readable name used when writing/updating `wiki/sources/<slug>.md`. `dry_run: true` returns the page contents without writing.

### `POST /ingest-path`

Ingest a file or walk a directory recursively. Supports `.md`, `.txt`, `.pdf`.

**Request:**
```json
{
  "path": "/Users/mathias/brain/raw/shape-up.pdf",
  "source": "shape-up-book",
  "dry_run": false
}
```

If `path` is a directory, all supported files within it are ingested in sequence. `source` is optional for directory ingestion — if omitted, the LLM derives it from each file's name and content.

**Response:** same shape as `/ingest`, with pages and warnings aggregated across all files.

### Supervisor skill update

`brain_ingest` in `internal/skills/brain/handlers.go` gains an optional `path` field. If `path` is set, it calls `/ingest-path`; otherwise `/ingest`.

---

## Pipeline

`pipeline.Run(ctx, cfg, brainDir, content, source, dryRun)` — called by both HTTP handlers (and by the file watcher) after any file reading is done.

Steps:

1. **Load inventory** — walk `brain/wiki/{concepts,entities,sources}/`, build slug index grouped by type. Injected into the prompt so the LLM knows what to update vs create.
2. **Load schema** — read `brain/CLAUDE.md`; fall back to embedded default if absent.
3. **Chunk** — split content into chunks of at most `INGEST_CHUNK_SIZE` chars (default 6000), breaking on paragraph boundaries. If `INGEST_CHUNK_SIZE=0`, no chunking.
4. **LLM call per chunk** — returns JSON array of `{"path": "wiki/concepts/foo.md", "content": "..."}`. Prompt structure: system instruction → date → schema → inventory → non-negotiable slug/wikilink rules → source content.
5. **Parse + truncation recovery** — strip markdown fences if present. If the JSON array is truncated mid-object (token limit), salvage all complete objects before the break and log a warning.
6. **Merge** — combine pages with the same path across chunks:
   - Bullet sections (Related Concepts, Related Entities, Sources, Key Claims): union unique lines
   - Append sections (Evolving Notes, Updates, Open Questions Raised): append new content
   - All other sections: keep first occurrence
   - Frontmatter: keep first occurrence
7. **Write** — create subdirs as needed, write files atomically. In dry-run mode, return page map without writing.
8. **Rebuild `index.md`** — one-sentence summary per page (derived from first body paragraph), grouped by type, with page count header.
9. **Append to `log.md`** — date, source, list of pages written, warning count.

---

## File Watcher

Background goroutine started at server startup (when `INGEST_WATCH_INTERVAL > 0`).

**Poll loop:**
1. Walk `brain/raw/` for files with supported extensions (`.md`, `.txt`, `.pdf`), excluding `processed/` and `failed/` subdirs.
2. For each file found: derive source from filename (strip extension, kebab-to-title), call `pipeline.Run` with the file content.
3. On success: move file to `brain/raw/processed/YYYY-MM-DD/<filename>`.
4. On failure: move file to `brain/raw/failed/<filename>`, append error to `brain/log.md`.
5. Sleep `INGEST_WATCH_INTERVAL` seconds, repeat.

Files are processed one at a time (no concurrency within the watcher) to avoid LLM rate-limit collisions.

---

## LLM Prompt

**System:**
> You are a wiki agent. Read the source material and produce structured wiki pages following the schema provided. Output ONLY a valid JSON array — no markdown fences, no other text. Each element must have: `"path"` (relative path within wiki, e.g. `"wiki/sources/foo.md"`) and `"content"` (full markdown including YAML frontmatter). Follow the schema strictly: correct frontmatter fields, wikilinks as `[[slug|Display Text]]`, dates in YYYY-MM-DD format, paraphrase rather than quoting verbatim.

**User (built dynamically):**
1. Today's date
2. Full schema (`brain/CLAUDE.md` content)
3. Existing wiki inventory grouped by type (for update-vs-create decisions)
4. Non-negotiable rules: slug format, wikilink format, one-source-per-book, section type enforcement
5. Source content (the chunk)

Temperature: 0.2, to keep output close to deterministic across runs.

---

## Configuration

### Ingestion server (new env vars)

| Variable | Default | Description |
|---|---|---|
| `INGEST_LLM_URL` | `http://iguana:4000/v1` | OpenAI-compatible endpoint |
| `INGEST_LLM_KEY` | (empty) | API key |
| `INGEST_LLM_MODEL` | `koala/qwen35-9b-fast` | Model name |
| `INGEST_LLM_TIMEOUT` | `15` | LLM call timeout (minutes) |
| `INGEST_CHUNK_SIZE` | `6000` | Max chars per LLM call (0 = no chunking) |
| `INGEST_WATCH_INTERVAL` | `30` | Watcher poll interval in seconds (0 = disabled) |

### Supervisor (new env vars + wiring)

| Variable | Default | Description |
|---|---|---|
| `INGEST_SVC_URL` | (empty) | URL of ingestion server for `brain_ingest` |
| `KB_RETRIEVAL_URL` | (empty) | URL of KB retrieval server for `brain_search` |

`config.go` gets two new fields. `main.go` passes them to `brain.New()`. Each tool (`brain_ingest`, `brain_search`) is registered as an MCP tool only when its URL is configured (already implemented in `skill.go`).

---

## Testing

| Package | What is tested |
|---|---|
| `wiki/` | Slug generation (edge cases: apostrophes, colons, version strings), merge logic (bullets union, append, keep-first), inventory loading from temp dir, truncation recovery (valid partial JSON), index rebuild output |
| `pipeline/` | Integration test: temp brain dir + mock LLM HTTP server returning fixture JSON; verify files written to correct paths, index rebuilt, log appended |
| `api/` | Handler tests for `/ingest` and `/ingest-path` using mock pipeline; 400 on missing fields, 200 with expected response shape |
| `watcher/` | File placed in `brain/raw/` is moved to `processed/` on mock-pipeline success; moved to `failed/` on error |

All tests are table-driven. No real LLM calls in tests.

---

## Out of Scope

- Python validation/correction loop (can be added later; the LLM prompt enforces schema rules as non-negotiable instructions)
- `brain/training-data/` — trainer worker concern
- `brain/sessions/` — retrospective/sessionlog concern
- Upload endpoint (multipart HTTP) — `scp`/rsync to `brain/raw/` + watcher covers this
- Qdrant vector indexing — `brain_search` calls a separate KB retrieval service; ingestion does not write to Qdrant