Commit Graph

40 Commits

Author SHA1 Message Date
Mathias Bergqvist
a7b363d589 fix(pipeline): quote YAML scalar fields in buildFrontmatter to prevent injection 2026-04-23 18:56:39 +02:00
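A minimal sketch of the kind of quoting this fix describes, assuming the fields are emitted as double-quoted YAML scalars. The helper names `quoteScalar` and `frontmatter` are hypothetical and are not taken from the repository; the actual `buildFrontmatter` may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// yamlEscaper escapes the characters that could end or break out of a
// double-quoted YAML scalar.
var yamlEscaper = strings.NewReplacer(
	`\`, `\\`,
	`"`, `\"`,
	"\n", `\n`,
	"\r", `\r`,
	"\t", `\t`,
)

// quoteScalar (hypothetical helper) wraps s in double quotes after escaping.
func quoteScalar(s string) string {
	return `"` + yamlEscaper.Replace(s) + `"`
}

// frontmatter (hypothetical) renders two fields as quoted scalars.
func frontmatter(title, pageType string) string {
	var b strings.Builder
	b.WriteString("---\n")
	b.WriteString("title: " + quoteScalar(title) + "\n")
	b.WriteString("type: " + quoteScalar(pageType) + "\n")
	b.WriteString("---\n")
	return b.String()
}

func main() {
	// A title carrying a newline can no longer smuggle in a second key.
	fmt.Print(frontmatter("Notes\ntype: admin", "concept"))
}
```

With quoting in place, the embedded newline stays inside the title value as an escaped `\n` instead of starting a new frontmatter line.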
Mathias Bergqvist
7b57051af8 feat(pipeline): add BuildPages — compute slugs/paths/frontmatter from RawPage 2026-04-23 18:50:37 +02:00
Mathias Bergqvist
a620f6cb01 fix(pipeline): guard empty-title bridge + skip stale integration tests until task4
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-23 18:46:07 +02:00
Mathias Bergqvist
26b5636b43 feat(pipeline): replace ParsePages with ParseRawPages + RawPage type
Strips slug authority from the LLM. The new RawPage type carries only
{title, type, subtype, domain, content} — no paths or frontmatter.
Pipeline will derive slugs deterministically (Task 4).

pipeline.go gets a temporary bridge stub (TODO task4) to keep the
package compiling between tasks.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-23 18:41:33 +02:00
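As an illustration of the idea in this commit, the sketch below models a page that carries only the five listed fields and derives the slug in code. The field names come from the commit message; the Go types, JSON tags, and the `slugify` rule are assumptions, not the repository's implementation.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// RawPage mirrors the fields named in the commit message. Types and
// JSON tags are assumed for illustration.
type RawPage struct {
	Title   string `json:"title"`
	Type    string `json:"type"`
	Subtype string `json:"subtype"`
	Domain  string `json:"domain"`
	Content string `json:"content"`
}

// slugify (hypothetical) lowercases the title and joins runs of letters
// and digits with single hyphens, so the same title always yields the
// same slug.
func slugify(title string) string {
	var b strings.Builder
	pendingHyphen := false
	for _, r := range strings.ToLower(title) {
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			if pendingHyphen && b.Len() > 0 {
				b.WriteByte('-')
			}
			pendingHyphen = false
			b.WriteRune(r)
		} else {
			pendingHyphen = true
		}
	}
	return b.String()
}

func main() {
	p := RawPage{Title: "Hello, World!", Type: "concept"}
	fmt.Println(slugify(p.Title))
}
```

Because the slug is a pure function of the title, the LLM output never decides where a page is written.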
Mathias Bergqvist
1605624668 feat(pipeline): add POST /backfill-refs endpoint to retroactively inject source back-references 2026-04-23 16:50:00 +02:00
Mathias Bergqvist
3c2bd9268c feat(pipeline): wire source back-reference injection into Run 2026-04-23 16:36:22 +02:00
Mathias Bergqvist
29727ec2a5 feat(pipeline): inject source back-references into concept and entity pages 2026-04-23 16:35:47 +02:00
Mathias Bergqvist
3607920601 fix(lint): resolve all errcheck violations in ingestion module
2026-04-23 16:20:59 +02:00
Mathias Bergqvist
a6c39e8691 feat: PDF extraction and fuzzy entity resolution
- New extract package: Text() dispatcher for .md/.txt passthrough and
  PDF extraction via pdftotext subprocess
- wiki.Entry gains Aliases []string, loaded from YAML frontmatter
- Fuzzy entity resolution in pipeline: normalizes titles (lowercase,
  strip articles, collapse hyphens) and matches proposed pages against
  existing inventory slugs and aliases to prevent proliferation
- Watcher and API handler now use extract.Text() instead of os.ReadFile
- Dockerfile: apk add poppler-utils in Alpine runtime stage
2026-04-23 16:03:02 +02:00
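The normalization and matching steps listed in this commit could look roughly like the following. This is a sketch under assumptions: hyphens are treated as word separators, the stripped articles are the English "a", "an", and "the", and `Entry`, `normalize`, and `resolve` are illustrative names (only the `Aliases []string` field is mentioned in the commit).

```go
package main

import (
	"fmt"
	"strings"
)

// Entry is a stand-in for an inventory entry with a slug and aliases.
type Entry struct {
	Slug    string
	Aliases []string
}

// normalize lowercases s, treats hyphens as spaces, drops articles, and
// collapses whitespace.
func normalize(s string) string {
	s = strings.ToLower(s)
	s = strings.ReplaceAll(s, "-", " ")
	var kept []string
	for _, w := range strings.Fields(s) {
		switch w {
		case "a", "an", "the":
			continue // skip articles
		}
		kept = append(kept, w)
	}
	return strings.Join(kept, " ")
}

// resolve returns the slug of the first entry whose slug or alias
// normalizes to the same key as title.
func resolve(title string, inventory []Entry) (string, bool) {
	key := normalize(title)
	for _, e := range inventory {
		if normalize(e.Slug) == key {
			return e.Slug, true
		}
		for _, a := range e.Aliases {
			if normalize(a) == key {
				return e.Slug, true
			}
		}
	}
	return "", false
}

func main() {
	inv := []Entry{{Slug: "large-language-model", Aliases: []string{"LLM"}}}
	fmt.Println(resolve("The Large Language Model", inv))
	fmt.Println(resolve("the LLM", inv))
}
```

Both lookups in `main` resolve to the existing `large-language-model` slug, so no second page is proposed for the same entity.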
Mathias Bergqvist
a37d18bf7a chore(docker): add poppler-utils for PDF text extraction 2026-04-23 16:02:12 +02:00
Mathias Bergqvist
2975eadc87 feat(watcher,api): use extract.Text() for file reading — fixes PDF ingestion 2026-04-23 16:01:36 +02:00
Mathias Bergqvist
53e46781b1 feat(pipeline): resolve proposed pages against inventory before writing 2026-04-23 16:00:31 +02:00
Mathias Bergqvist
e9b5cc401c feat(pipeline): add fuzzy entity resolution to prevent slug proliferation 2026-04-23 15:59:36 +02:00
Mathias Bergqvist
bf6f497d9d feat(wiki): add Aliases to Entry and read from YAML frontmatter 2026-04-23 15:57:16 +02:00
Mathias Bergqvist
9cc6c2d053 feat(extract): implement PDF extraction via pdftotext 2026-04-23 15:53:46 +02:00
Mathias Bergqvist
43a46d07e5 feat(extract): add Text() dispatcher with md/txt passthrough 2026-04-23 15:45:20 +02:00
Mathias Bergqvist
6928907d79 fix(watcher): copy files instead of moving them, leave originals for Obsidian
Files dropped into brain/raw/ are now copied to processed/ or failed/ rather
than moved. A .processed or .failed marker is written next to the original so
the watcher skips it on subsequent polls without deleting it. This keeps
Syncthing-synced Obsidian vaults intact after ingestion.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-23 14:47:50 +02:00
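A rough sketch of the copy-and-mark behaviour described above, covering only the success path. It assumes the marker is named `<file>.processed` and sits beside the original; `copyFile`, `processOnce`, and `demo` are hypothetical names, and the real watcher also handles the failed/ case and dated subdirectories.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
	"strings"
)

// copyFile copies src to dst and leaves src untouched.
func copyFile(src, dst string) error {
	in, err := os.Open(src)
	if err != nil {
		return err
	}
	defer in.Close()
	out, err := os.Create(dst)
	if err != nil {
		return err
	}
	if _, err := io.Copy(out, in); err != nil {
		out.Close()
		return err
	}
	return out.Close()
}

// processOnce copies every unmarked file in rawDir to processedDir and
// writes a marker next to the original. It returns how many files were
// handled in this pass.
func processOnce(rawDir, processedDir string) (int, error) {
	entries, err := os.ReadDir(rawDir)
	if err != nil {
		return 0, err
	}
	handled := 0
	for _, e := range entries {
		name := e.Name()
		if e.IsDir() || strings.HasSuffix(name, ".processed") || strings.HasSuffix(name, ".failed") {
			continue // directories and marker files are not inputs
		}
		src := filepath.Join(rawDir, name)
		if _, err := os.Stat(src + ".processed"); err == nil {
			continue // marker present: handled on an earlier poll
		}
		if err := copyFile(src, filepath.Join(processedDir, name)); err != nil {
			return handled, err
		}
		if err := os.WriteFile(src+".processed", nil, 0o644); err != nil {
			return handled, err
		}
		handled++
	}
	return handled, nil
}

// demo runs two polls over a temporary directory holding one note.
func demo() (first, second int, err error) {
	root, err := os.MkdirTemp("", "watcher-sketch")
	if err != nil {
		return 0, 0, err
	}
	defer os.RemoveAll(root)
	raw := filepath.Join(root, "raw")
	processed := filepath.Join(root, "processed")
	for _, d := range []string{raw, processed} {
		if err := os.MkdirAll(d, 0o755); err != nil {
			return 0, 0, err
		}
	}
	if err := os.WriteFile(filepath.Join(raw, "note.md"), []byte("# note\n"), 0o644); err != nil {
		return 0, 0, err
	}
	if first, err = processOnce(raw, processed); err != nil {
		return first, 0, err
	}
	second, err = processOnce(raw, processed)
	return first, second, err
}

func main() {
	fmt.Println(demo())
}
```

The second poll handles nothing because the marker is found, while the original note stays in place for the synced vault.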
Mathias Bergqvist
e74320a8e8 feat(ingestion): wire watcher into server startup + fix Procfile env vars
- Start background watcher on startup when INGEST_WATCH_INTERVAL > 0
- Procfile: add INGEST_WATCH_INTERVAL=30 and INGEST_SVC_URL for supervisor

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-22 23:09:00 +02:00
Mathias Bergqvist
1b0706f270 chore(brain): rename CLAUDE.md to schema.md for clarity
CLAUDE.md has a specific meaning in the Claude Code ecosystem (agent
instructions). The wiki schema for the ingestion pipeline should live
in schema.md to avoid confusion.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-22 23:06:32 +02:00
Mathias Bergqvist
2f4b577131 fix(ingestion): address code review issues in api and watcher packages
- Strip internal error detail from 500 responses (leak prevention)
- Add path containment assertion in /write handler
- Use Go 1.22 method-prefixed mux routes for automatic 405 responses
- Clarify watch_interval log when watcher not yet wired
- Consolidate validation tests into table-driven TestIngest_Validation
- Watcher: return nil after successful quarantine to avoid double-logging
- Watcher: append timestamp suffix to processed dest if file already exists

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-22 22:59:39 +02:00
Mathias Bergqvist
a25bb18c54 feat(ingestion): add /ingest and /ingest-path HTTP handlers
Wires pipeline.Run into the HTTP layer so callers can ingest raw text
or files/directories without touching the filesystem directly. Rewrites
main.go to parse LLM and watcher env vars and build pipeline.Config.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-22 22:54:28 +02:00
Mathias Bergqvist
78531bb238 feat(ingestion): add background file watcher for brain/raw/
Polls brain/raw/ on a configurable ticker, derives human-readable source
names from filenames, runs the pipeline, and moves files to
processed/YYYY-MM-DD/ on success or failed/ on error with a log.md entry.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-22 22:54:03 +02:00
Mathias Bergqvist
04fefe8e9c fix(ingestion): wrap naked error returns and harden mustJSON helper
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-22 22:51:19 +02:00
Mathias Bergqvist
103f4d90bf feat(ingestion): add pipeline orchestrator with prompt builder
Adds prompt.go (BuildPrompt + systemPrompt) and pipeline.go (Run, Config,
Result, mergeAll) that wire chunking, LLM calls, parse, merge, index rebuild,
and log append into a single ingestion pipeline. Includes integration tests
covering write, dry-run, and duplicate-path merge scenarios.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-22 22:45:19 +02:00
Mathias Bergqvist
9b11719481 feat(ingestion): add content chunking and LLM JSON output parser 2026-04-22 22:37:14 +02:00
Mathias Bergqvist
d405346f07 feat(ingestion): add wiki index rebuilder and audit log 2026-04-22 22:36:55 +02:00
Mathias Bergqvist
bf8a3fc11c feat(ingestion): add OpenAI-compatible LLM HTTP client with 429 retry 2026-04-22 22:29:24 +02:00
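One way a 429 retry loop can be structured is shown below, as a sketch only. The fixed exponential backoff, the attempt limit, the `postWithRetry` name, and the placeholder URL are all assumptions; the repository's client may, for example, honour a Retry-After header instead. A stub transport stands in for the network so the example runs offline.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"time"
)

// postWithRetry (hypothetical) POSTs body and retries while the server
// answers 429 Too Many Requests, up to attempts tries in total.
func postWithRetry(c *http.Client, url string, body []byte, attempts int, backoff time.Duration) (*http.Response, error) {
	for i := 0; ; i++ {
		// A fresh reader per attempt: the previous one has been consumed.
		resp, err := c.Post(url, "application/json", bytes.NewReader(body))
		if err != nil {
			return nil, err
		}
		if resp.StatusCode != http.StatusTooManyRequests || i >= attempts-1 {
			return resp, nil // success, another status, or retries exhausted
		}
		resp.Body.Close()
		time.Sleep(backoff << i) // exponential backoff: 1x, 2x, 4x, ...
	}
}

// stubTransport lets the demo run without a network.
type stubTransport func(*http.Request) (*http.Response, error)

func (f stubTransport) RoundTrip(r *http.Request) (*http.Response, error) { return f(r) }

// demo rejects the first two calls with 429 and accepts the third.
func demo() (status, calls int, err error) {
	client := &http.Client{Transport: stubTransport(func(r *http.Request) (*http.Response, error) {
		calls++
		code := http.StatusOK
		if calls <= 2 {
			code = http.StatusTooManyRequests
		}
		return &http.Response{StatusCode: code, Body: http.NoBody, Header: http.Header{}, Request: r}, nil
	})}
	// Placeholder endpoint; never contacted because of the stub transport.
	resp, err := postWithRetry(client, "http://llm.invalid/v1/chat/completions", []byte(`{}`), 5, time.Millisecond)
	if err != nil {
		return 0, calls, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, calls, nil
}

func main() {
	fmt.Println(demo())
}
```

The request body is rebuilt on every attempt, which is the detail most often missed when retrying POST requests.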
Mathias Bergqvist
ae5a4d04f0 feat(ingestion): add wiki page merge logic 2026-04-22 22:28:55 +02:00
Mathias Bergqvist
3a0424a6b4 feat(ingestion): add wiki inventory loader 2026-04-22 22:28:53 +02:00
Mathias Bergqvist
91e02b930c feat(ingestion): add wiki package with Page types and slug generation 2026-04-22 22:25:45 +02:00
Mathias Bergqvist
b5a0085c0a feat(brain): add brain_ingest, brain_search tools and extend search to wiki/ 2026-04-22 22:16:02 +02:00
Mathias Bergqvist
c9310b1079 fix(ingestion): always append .md extension to written filenames
brain_write with a custom filename omitted the .md extension, causing
search to skip the file (search.go filters on HasSuffix .md).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-22 19:23:07 +02:00
Mathias Bergqvist
ca1a16873c feat(ingestion): add Dockerfile and extend CD to build+push ingestion image
The ingestion server is a pure-Go HTTP binary: Alpine runtime, no Node.js.
CD now builds both the supervisor and ingestion images on every push and
updates both deployment.yaml files in the infra repo.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-22 16:37:11 +02:00
Mathias Bergqvist
3625e1268d feat(ingestion): simplify brain to knowledge/ — write and search use same dir 2026-04-22 15:36:10 +02:00
Mathias Bergqvist
24d9216474 fix(ingestion): preserve type and domain metadata as frontmatter in written notes
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 21:22:14 +02:00
Mathias Bergqvist
d18fa0dd59 fix(ingestion): validate required query field in Query handler
Empty or whitespace-only queries would silently pass through to search,
returning meaningless results. Also removed the Domain field from
queryRequest — it was accepted but silently ignored since search.Query
has no domain parameter, which would confuse callers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 20:27:02 +02:00
Mathias Bergqvist
e20edd6ca9 feat(ingestion): add query and write HTTP handlers
Implements POST /query (BM25 search via internal/search) and POST /write
(raw file persistence to brain/raw/) as an api.Handler struct. Filename
is auto-generated when absent.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 20:24:51 +02:00
Mathias Bergqvist
caf18c9acb fix(ingestion): consistent error handling in search walk
Both walk-level errors and ReadFile failures now use best-effort
semantics (warn via slog, continue) instead of mixed abort/silent-skip.
filepath.Rel error is now propagated from the callback instead of
discarded.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 20:23:03 +02:00
Mathias Bergqvist
3c1f6edf3e feat(ingestion): add full-text wiki search package
Implements search.Query which walks brainDir/wiki/**/*.md, scores files
by term-frequency across query tokens, and returns results sorted by
score descending. Uses only stdlib — no external search deps.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 20:18:57 +02:00
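The scoring described in this commit can be illustrated over in-memory documents. This is a stdlib-only sketch, not the package's code: `Result`, `tokenize`, and `rank` are assumed names, the directory walk is replaced by a map from path to text, and ties are broken by path purely to make the output deterministic.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"unicode"
)

// Result pairs a document path with its score.
type Result struct {
	Path  string
	Score int
}

// tokenize lowercases s and splits it on anything that is not a letter
// or digit.
func tokenize(s string) []string {
	return strings.FieldsFunc(strings.ToLower(s), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

// rank scores each document by how often the query tokens occur in it
// and returns the matching documents, highest score first.
func rank(docs map[string]string, query string) []Result {
	terms := tokenize(query)
	var out []Result
	for path, text := range docs {
		counts := map[string]int{}
		for _, tok := range tokenize(text) {
			counts[tok]++
		}
		score := 0
		for _, t := range terms {
			score += counts[t]
		}
		if score > 0 {
			out = append(out, Result{Path: path, Score: score})
		}
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Score != out[j].Score {
			return out[i].Score > out[j].Score
		}
		return out[i].Path < out[j].Path
	})
	return out
}

func main() {
	docs := map[string]string{
		"wiki/a.md": "go go rust",
		"wiki/b.md": "Go",
		"wiki/c.md": "python",
	}
	for _, r := range rank(docs, "go") {
		fmt.Println(r.Path, r.Score)
	}
}
```

Documents with no matching token are left out of the results entirely rather than returned with a zero score.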
Mathias Bergqvist
6c485489bf chore: scaffold ingestion Go module 2026-04-17 20:16:59 +02:00