From e34cd6c12bcc96dd7141d1e423ef64838710e47a Mon Sep 17 00:00:00 2001
From: Mathias
Date: Sun, 24 May 2026 22:48:48 +0200
Subject: [PATCH] =?UTF-8?q?docs(eval):=20record=20post-fix=20scorer=20run?=
 =?UTF-8?q?=20=E2=80=94=20phase=201=20lift=20insufficient?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Top-1 stayed at 20% (4/20), top-3 +5pt (65→70%) after:

- extract.go wing/topic parser fix (commit 3084c41)
- qwen35-9b-fast entity pad (was 239-byte stub → full entity)
- grafana entry: add "pod restart" synonym to lesson body
- dangling refs stripped from index.md + entities/k3s.md

The only retrieval move: qwen35-9b-fast climbed from rank 0 (off top-5)
to rank 2 — the entity pad worked. Other 5 misses are ranker behaviour on
already-keyword-overlapping entries; BM25 doesn't weight the right slugs
to the top.

Per the proposal's gate (≥10pt lift = stop, <10pt = Phase 2 justified),
the DIKW tier redesign earns its cost.

Next session: tier column + file moves + tier-weighted retrieval, then
re-measure against this same eval set.
---
 brain/eval/post-fix.txt | 167 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 167 insertions(+)
 create mode 100644 brain/eval/post-fix.txt

diff --git a/brain/eval/post-fix.txt b/brain/eval/post-fix.txt
new file mode 100644
index 0000000..d8550d6
--- /dev/null
+++ b/brain/eval/post-fix.txt
@@ -0,0 +1,167 @@
+# post-fix — 20 questions, k=5
+
+top-1 hit rate: 4/20 = 20%
+top-3 hit rate: 14/20 = 70%
+
+## per-question detail
+
+· rank=3 expected=dex-in-memory-storage-wipes-oauth-tokens-on-every-pod-restart
+  q: how do I stop dex from logging users out on every pod restart?
+  1. homelab-network-perimeter-model
+  2. 2026-05-12-koala-machine-state
+  3. dex-in-memory-storage-wipes-oauth-tokens-on-every-pod-restart <-- expected
+  4. infra-litellm-absorption-2026-05-16
+  5. Financial Sentiment Analysis on Stock Market Headlines With FinBERT & HuggingFace
+
+★ rank=1 expected=postgres-least-privilege-migration-tenant-grant-bypass-2026-05
+  q: my postgres-exporter broke after revoking PUBLIC CONNECT — why?
+  1. postgres-least-privilege-migration-tenant-grant-bypass-2026-05 <-- expected
+  2. infra-litellm-absorption-2026-05-16
+  3. brain-mcp-activation-runbook
+  4. extension-version-lags-platform-major-upgrade
+  5. ntfy-deny-all-rollout-ordering-keep-alert-pipeline-live-during-auth-flip
+
+★ rank=1 expected=homelab-network-perimeter-model
+  q: when is a NodePort acceptable vs needing a public ingress with bearer gate?
+  1. homelab-network-perimeter-model <-- expected
+  2. qwen3-thinking-model-empty-content-trap
+  3. mcpclient-empty-token-silent-401-envfrom-missing-key
+  4. 2026-05-12-koala-machine-state
+  5. koala-llama-swap-native-tool-calls-survey-2026-05
+
+· rank=3 expected=exit-255-unknown-reason-not-oom
+  q: what does container exit code 255 with reason Unknown mean?
+  1. qwen3-thinking-model-empty-content-trap
+  2. infra-litellm-absorption-2026-05-16
+  3. exit-255-unknown-reason-not-oom <-- expected
+  4. mcpclient-empty-token-silent-401-envfrom-missing-key
+  5. koala-llama-swap-native-tool-calls-survey-2026-05
+
+· rank=3 expected=gitea-push-mirror-cannot-create-remote-repo-needs-pre-existing-github-repo
+  q: can gitea push-mirror create the github repo automatically?
+  1. infra-litellm-absorption-2026-05-16
+  2. Autoresearch
+  3. gitea-push-mirror-cannot-create-remote-repo-needs-pre-existing-github-repo <-- expected
+  4. adr-new-project-gitea-first-github-mirror
+  5. adr-github-as-primary-remote
+
+✗ rank=0 expected=flux-healthcheck-stale-on-resource-removal
+  q: a flux kustomization is stuck after I removed a resource — why?
+  1. qwen3-thinking-model-empty-content-trap
+  2. 2026-05-12-koala-machine-state
+  3. homelab-architecture-principles-2026-05
+  4. gitea-mcp: full stack shipped end-to-end (2026-05-05)
+  5. k8s-configmap-mount-no-reload-needs-pod-restart
+
+· rank=2 expected=go-bytes-buffer-bytes-reset-aliasing-trap
+  q: the bytes buffer aliasing trap with Reset in a loop — what's the bug?
+  1. Financial Sentiment Analysis on Stock Market Headlines With FinBERT & HuggingFace
+  2. go-bytes-buffer-bytes-reset-aliasing-trap <-- expected
+  3. homelab-security-chains-not-bugs
+  4. training-on-rtx-5070-pretraining-vs-finetuning
+  5. Hash Encoding
+
+★ rank=1 expected=homelab-architecture-principles-2026-05
+  q: what are the homelab architecture principles from may 2026?
+  1. homelab-architecture-principles-2026-05 <-- expected
+  2. homelab-network-perimeter-model
+  3. Claude Managed Agents — architecture notes relevant to homelab agent platform
+  4. homelab-core-glossary
+  5. 2026-05-12-koala-machine-state
+
+✗ rank=0 expected=2026-05-04-sops-age-key-from-flux-cluster
+  q: where does the sops age private key live in the cluster?
+  1. 2026-05-12-koala-machine-state
+  2. homelab-network-perimeter-model
+  3. postgres-least-privilege-migration-tenant-grant-bypass-2026-05
+  4. brain-mcp-activation-runbook
+  5. dex-in-memory-storage-wipes-oauth-tokens-on-every-pod-restart
+
+✗ rank=0 expected=grafana-dashboards-as-code-not-ui-state
+  q: why do my grafana dashboards disappear after a pod restart?
+  1. infra-litellm-absorption-2026-05-16
+  2. 2026-05-12-koala-machine-state
+  3. Financial Sentiment Analysis on Stock Market Headlines With FinBERT & HuggingFace
+  4. brain-mcp-activation-runbook
+  5. dex-in-memory-storage-wipes-oauth-tokens-on-every-pod-restart
+
+· rank=2 expected=double-diamond-methodology
+  q: what is the double diamond methodology?
+  1. Harnessing the Power of Hash Encoding for Categorical Data in Data Science
+  2. double-diamond-methodology <-- expected
+  3. unified-methodology-diamond-futures-autoresearch
+  4. futures-thinking-extended-double-diamond
+  5. insight-exploration-as-diamond-1
+
+· rank=3 expected=2026-05-04-mcp-transport-version-claude-ai-strict
+  q: my MCP server works from claude code but fails on claude.ai — what's different?
+  1. qwen3-thinking-model-empty-content-trap
+  2. mcp-resource-url-empty-breaks-claude-ai-discovery-silently
+  3. 2026-05-04-mcp-transport-version-claude-ai-strict <-- expected
+  4. 2026-05-04-claude-ai-custom-mcp-connectors
+  5. finding-github-mcp-claudeai-vs-claudecode
+
+· rank=2 expected=homelab-security-chains-not-bugs
+  q: how should I rate security findings — isolated bugs or exploit chains?
+  1. homelab-network-perimeter-model
+  2. homelab-security-chains-not-bugs <-- expected
+  3. Financial Sentiment Analysis on Stock Market Headlines With FinBERT & HuggingFace
+  4. policy-audit-mode-blocks-nothing
+  5. homelab-document-accepted-risk-to-break-audit-cycle
+
+· rank=2 expected=2026-05-03-canonical-vs-derived-context-flow
+  q: how should canonical context files relate to derived adapter files?
+  1. qwen3-thinking-model-empty-content-trap
+  2. 2026-05-03-canonical-vs-derived-context-flow <-- expected
+  3. 2026-05-12-koala-machine-state
+  4. 2026-05-04-claude-ai-custom-mcp-connectors
+  5. koala-llama-swap-native-tool-calls-survey-2026-05
+
+· rank=2 expected=homelab-core-glossary
+  q: what is the homelab core vocabulary glossary?
+  1. homelab-architecture-principles-2026-05
+  2. homelab-core-glossary <-- expected
+  3. Claude Managed Agents — architecture notes relevant to homelab agent platform
+  4. 2026-05-12-koala-machine-state
+  5. Autoresearch
+
+★ rank=1 expected=koala-llama-swap-native-tool-calls-survey-2026-05
+  q: which models on koala llama-swap actually emit native tool_calls correctly?
+  1. koala-llama-swap-native-tool-calls-survey-2026-05 <-- expected
+  2. 2026-05-12-koala-machine-state
+  3. infra-litellm-absorption-2026-05-16
+  4. training-on-rtx-5070-pretraining-vs-finetuning
+  5. qwen3-thinking-model-empty-content-trap
+
+· rank=2 expected=qwen35-9b-fast
+  q: what is qwen35-9b-fast and what's it used for?
+  1. koala-llama-swap-native-tool-calls-survey-2026-05
+  2. qwen35-9b-fast <-- expected
+  3. qwen3-thinking-model-empty-content-trap
+  4. infra-litellm-absorption-2026-05-16
+  5. 2026-05-12-koala-machine-state
+
+✗ rank=0 expected=go-defer-errcheck-body-close
+  q: in go, how do I prevent defer body close from silently dropping errors?
+  1. infra-litellm-absorption-2026-05-16
+  2. homelab-network-perimeter-model
+  3. go-bytes-buffer-bytes-reset-aliasing-trap
+  4. mcpclient-empty-token-silent-401-envfrom-missing-key
+  5. brain-mcp-activation-runbook
+
+✗ rank=0 expected=hyperguild-level3-pipeline-rewrite
+  q: what was the level 3 rewrite of hyperguild's ingestion pipeline?
+  1. 2026-05-12-koala-machine-state
+  2. homelab-core-glossary
+  3. brain-mcp-activation-runbook
+  4. koala-llama-swap-native-tool-calls-survey-2026-05
+  5. infra-litellm-absorption-2026-05-16
+
+? rank=4 expected=adr-new-project-gitea-first-github-mirror
+  q: what's the new-project ADR — is it gitea-first or github-first?
+  1. gitea-push-mirror-cannot-create-remote-repo-needs-pre-existing-github-repo
+  2. gitea-mcp: full stack shipped end-to-end (2026-05-05)
+  3. mcp-tool-design-get-needs-list-partner
+  4. adr-new-project-gitea-first-github-mirror <-- expected
+  5. 2026-05-04-gitea-mcp-build-session
+
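The headline rates and the gate verdict can be recomputed from the rank= values in the per-question detail. Below is a minimal Go sketch that rests on three assumptions:

- rank=0 means the expected slug fell outside the top k=5.
- The 20% top-1 and 65% top-3 baselines are the ones quoted in the commit message.
- The gate is checked against top-3 lift. Top-1 lift is 0pt, so applying the gate to top-1 gives the same verdict.

`hitRate` and `phase2Justified` are hypothetical helpers, not the scorer's actual code.

```go
package main

import "fmt"

// ranks holds the rank= value of each question in post-fix.txt, in file
// order. 0 means the expected slug was not in the top k=5 results.
var ranks = []int{3, 1, 1, 3, 3, 0, 2, 1, 0, 0, 2, 3, 2, 2, 2, 1, 2, 0, 0, 4}

// hitRate counts questions whose expected slug landed at rank 1..k and
// returns that count plus the share of all questions in percent.
// Hypothetical helper for illustration.
func hitRate(ranks []int, k int) (hits int, pct float64) {
	for _, r := range ranks {
		if r >= 1 && r <= k {
			hits++
		}
	}
	return hits, float64(hits*100) / float64(len(ranks))
}

// gateThresholdPt is the proposal's stop threshold, in percentage points.
const gateThresholdPt = 10.0

// phase2Justified applies the gate: a lift of at least 10pt means stop,
// anything smaller means the Phase 2 redesign is justified.
func phase2Justified(baselinePct, postPct float64) bool {
	return postPct-baselinePct < gateThresholdPt
}

func main() {
	top1, top1Pct := hitRate(ranks, 1)
	top3, top3Pct := hitRate(ranks, 3)
	fmt.Printf("top-1 hit rate: %d/%d = %.0f%%\n", top1, len(ranks), top1Pct)
	fmt.Printf("top-3 hit rate: %d/%d = %.0f%%\n", top3, len(ranks), top3Pct)

	// Assumed baselines, taken from the commit message (top-1 20%, top-3 65%).
	const baselineTop1, baselineTop3 = 20.0, 65.0
	fmt.Printf("top-1 lift: %+.0fpt, top-3 lift: %+.0fpt\n",
		top1Pct-baselineTop1, top3Pct-baselineTop3)
	fmt.Println("phase 2 justified:", phase2Justified(baselineTop3, top3Pct))
}
```

Run as-is, it reproduces the file header's 4/20 and 14/20. It then reports a +0pt top-1 lift, a +5pt top-3 lift, and `phase 2 justified: true`. Integer percentages are computed as `hits*100/n` before converting to float, so the 20-question totals divide exactly and the 10pt comparison is not affected by rounding.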