From 2b7bbe38c74bfce49aa0ccf0b21e27419b9122c9 Mon Sep 17 00:00:00 2001
From: Mathias
Date: Mon, 25 May 2026 18:51:29 +0200
Subject: [PATCH] =?UTF-8?q?docs(eval):=20record=20M4=20+=20M4b=20scorer=20?=
 =?UTF-8?q?runs=20=E2=80=94=20phase=202=20gate=20cleared=20(infra#72)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Tier-weighted retrieval against the qa-2026-05.md 20-question set:

| run                            | top-1 | top-3 |
|--------------------------------|-------|-------|
| baseline (pre-phase-1)         | 20%   | 65%   |
| post phase 1 (parser+content)  | 20%   | 70%   |
| post M4 (tier weighting)       | 30%   | 75%   |
| post M4b (entities → K tier)   | 35%   | 80%   |

Net lift over baseline: +15pt top-1, +15pt top-3. Measured from the
post-phase-1 run, Phase 2 alone (M4 + M4b) adds +15pt top-1 and +10pt
top-3, which still meets the ≥10pt close-gate set in infra#72.

The three remaining rank=0 misses are content-keyword issues, not
structure issues: the questions don't share enough lexical surface with
the target entries to surface via BM25 alone. Vector search would help
here, but the iguana embedder is off-mesh (see infra#64).
---
 brain/eval/post-m4.txt  | 167 ++++++++++++++++++++++++++++++++++++++++
 brain/eval/post-m4b.txt | 167 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 334 insertions(+)
 create mode 100644 brain/eval/post-m4.txt
 create mode 100644 brain/eval/post-m4b.txt

diff --git a/brain/eval/post-m4.txt b/brain/eval/post-m4.txt
new file mode 100644
index 0000000..698cc40
--- /dev/null
+++ b/brain/eval/post-m4.txt
@@ -0,0 +1,167 @@
+# post-m4-tier-weighting — 20 questions, k=5
+
+top-1 hit rate: 6/20 = 30%
+top-3 hit rate: 15/20 = 75%
+
+## per-question detail
+
+· rank=3 expected=dex-in-memory-storage-wipes-oauth-tokens-on-every-pod-restart
+  q: how do I stop dex from logging users out on every pod restart?
+    1. homelab-network-perimeter-model
+    2. 2026-05-12-koala-machine-state
+    3. dex-in-memory-storage-wipes-oauth-tokens-on-every-pod-restart <-- expected
+    4. infra-litellm-absorption-2026-05-16
+    5. k8s-configmap-mount-no-reload-needs-pod-restart
+
+· rank=2 expected=postgres-least-privilege-migration-tenant-grant-bypass-2026-05
+  q: my postgres-exporter broke after revoking PUBLIC CONNECT — why?
+    1. infra-litellm-absorption-2026-05-16
+    2. postgres-least-privilege-migration-tenant-grant-bypass-2026-05 <-- expected
+    3. extension-version-lags-platform-major-upgrade
+    4. ntfy-deny-all-rollout-ordering-keep-alert-pipeline-live-during-auth-flip
+    5. gitea-push-mirror-cannot-create-remote-repo-needs-pre-existing-github-repo
+
+★ rank=1 expected=homelab-network-perimeter-model
+  q: when is a NodePort acceptable vs needing a public ingress with bearer gate?
+    1. homelab-network-perimeter-model <-- expected
+    2. qwen3-thinking-model-empty-content-trap
+    3. mcpclient-empty-token-silent-401-envfrom-missing-key
+    4. 2026-05-12-koala-machine-state
+    5. koala-llama-swap-native-tool-calls-survey-2026-05
+
+· rank=3 expected=exit-255-unknown-reason-not-oom
+  q: what does container exit code 255 with reason Unknown mean?
+    1. qwen3-thinking-model-empty-content-trap
+    2. infra-litellm-absorption-2026-05-16
+    3. exit-255-unknown-reason-not-oom <-- expected
+    4. mcpclient-empty-token-silent-401-envfrom-missing-key
+    5. koala-llama-swap-native-tool-calls-survey-2026-05
+
+· rank=2 expected=gitea-push-mirror-cannot-create-remote-repo-needs-pre-existing-github-repo
+  q: can gitea push-mirror create the github repo automatically?
+    1. infra-litellm-absorption-2026-05-16
+    2. gitea-push-mirror-cannot-create-remote-repo-needs-pre-existing-github-repo <-- expected
+    3. adr-new-project-gitea-first-github-mirror
+    4. adr-github-as-primary-remote
+    5. 2026-05-12-koala-machine-state
+
+✗ rank=0 expected=flux-healthcheck-stale-on-resource-removal
+  q: a flux kustomization is stuck after I removed a resource — why?
+    1. qwen3-thinking-model-empty-content-trap
+    2. 2026-05-12-koala-machine-state
+    3. homelab-architecture-principles-2026-05
+    4. k8s-configmap-mount-no-reload-needs-pod-restart
+    5. training-on-rtx-5070-pretraining-vs-finetuning
+
+★ rank=1 expected=go-bytes-buffer-bytes-reset-aliasing-trap
+  q: the bytes buffer aliasing trap with Reset in a loop — what's the bug?
+    1. go-bytes-buffer-bytes-reset-aliasing-trap <-- expected
+    2. homelab-security-chains-not-bugs
+    3. Financial Sentiment Analysis on Stock Market Headlines With FinBERT & HuggingFace
+    4. training-on-rtx-5070-pretraining-vs-finetuning
+    5. flux-healthcheck-stale-on-resource-removal
+
+★ rank=1 expected=homelab-architecture-principles-2026-05
+  q: what are the homelab architecture principles from may 2026?
+    1. homelab-architecture-principles-2026-05 <-- expected
+    2. homelab-network-perimeter-model
+    3. homelab-core-glossary
+    4. 2026-05-12-koala-machine-state
+    5. pattern-reddit-tmux-multiagent-conductor
+
+? rank=4 expected=2026-05-04-sops-age-key-from-flux-cluster
+  q: where does the sops age private key live in the cluster?
+    1. 2026-05-12-koala-machine-state
+    2. homelab-network-perimeter-model
+    3. dex-in-memory-storage-wipes-oauth-tokens-on-every-pod-restart
+    4. 2026-05-04-sops-age-key-from-flux-cluster <-- expected
+    5. homelab-security-chains-not-bugs
+
+★ rank=1 expected=grafana-dashboards-as-code-not-ui-state
+  q: why do my grafana dashboards disappear after a pod restart?
+    1. grafana-dashboards-as-code-not-ui-state <-- expected
+    2. infra-litellm-absorption-2026-05-16
+    3. 2026-05-12-koala-machine-state
+    4. dex-in-memory-storage-wipes-oauth-tokens-on-every-pod-restart
+    5. k8s-configmap-mount-no-reload-needs-pod-restart
+
+★ rank=1 expected=double-diamond-methodology
+  q: what is the double diamond methodology?
+    1. double-diamond-methodology <-- expected
+    2. unified-methodology-diamond-futures-autoresearch
+    3. futures-thinking-extended-double-diamond
+    4. insight-exploration-as-diamond-1
+    5. workflow-idea-to-running-service
+
+· rank=3 expected=2026-05-04-mcp-transport-version-claude-ai-strict
+  q: my MCP server works from claude code but fails on claude.ai — what's different?
+    1. qwen3-thinking-model-empty-content-trap
+    2. mcp-resource-url-empty-breaks-claude-ai-discovery-silently
+    3. 2026-05-04-mcp-transport-version-claude-ai-strict <-- expected
+    4. 2026-05-04-claude-ai-custom-mcp-connectors
+    5. finding-github-mcp-claudeai-vs-claudecode
+
+· rank=2 expected=homelab-security-chains-not-bugs
+  q: how should I rate security findings — isolated bugs or exploit chains?
+    1. homelab-network-perimeter-model
+    2. homelab-security-chains-not-bugs <-- expected
+    3. policy-audit-mode-blocks-nothing
+    4. homelab-document-accepted-risk-to-break-audit-cycle
+    5. audit-shortcut-tls-blocks-zero-equals-edge-only
+
+· rank=2 expected=2026-05-03-canonical-vs-derived-context-flow
+  q: how should canonical context files relate to derived adapter files?
+    1. qwen3-thinking-model-empty-content-trap
+    2. 2026-05-03-canonical-vs-derived-context-flow <-- expected
+    3. 2026-05-12-koala-machine-state
+    4. 2026-05-04-claude-ai-custom-mcp-connectors
+    5. koala-llama-swap-native-tool-calls-survey-2026-05
+
+· rank=2 expected=homelab-core-glossary
+  q: what is the homelab core vocabulary glossary?
+    1. homelab-architecture-principles-2026-05
+    2. homelab-core-glossary <-- expected
+    3. 2026-05-12-koala-machine-state
+    4. flux-kustomization-depends-on-bootstrap-ordering
+    5. brain-ingest-ntfy-service
+
+★ rank=1 expected=koala-llama-swap-native-tool-calls-survey-2026-05
+  q: which models on koala llama-swap actually emit native tool_calls correctly?
+    1. koala-llama-swap-native-tool-calls-survey-2026-05 <-- expected
+    2. 2026-05-12-koala-machine-state
+    3. infra-litellm-absorption-2026-05-16
+    4. training-on-rtx-5070-pretraining-vs-finetuning
+    5. qwen3-thinking-model-empty-content-trap
+
+✗ rank=0 expected=qwen35-9b-fast
+  q: what is qwen35-9b-fast and what's it used for?
+    1. koala-llama-swap-native-tool-calls-survey-2026-05
+    2. qwen3-thinking-model-empty-content-trap
+    3. infra-litellm-absorption-2026-05-16
+    4. 2026-05-12-koala-machine-state
+    5. index
+
+✗ rank=0 expected=go-defer-errcheck-body-close
+  q: in go, how do I prevent defer body close from silently dropping errors?
+    1. homelab-network-perimeter-model
+    2. infra-litellm-absorption-2026-05-16
+    3. go-bytes-buffer-bytes-reset-aliasing-trap
+    4. mcpclient-empty-token-silent-401-envfrom-missing-key
+    5. koala-llama-swap-native-tool-calls-survey-2026-05
+
+✗ rank=0 expected=hyperguild-level3-pipeline-rewrite
+  q: what was the level 3 rewrite of hyperguild's ingestion pipeline?
+    1. 2026-05-12-koala-machine-state
+    2. homelab-core-glossary
+    3. koala-llama-swap-native-tool-calls-survey-2026-05
+    4. infra-litellm-absorption-2026-05-16
+    5. homelab-architecture-principles-2026-05
+
+· rank=3 expected=adr-new-project-gitea-first-github-mirror
+  q: what's the new-project ADR — is it gitea-first or github-first?
+    1. gitea-push-mirror-cannot-create-remote-repo-needs-pre-existing-github-repo
+    2. mcp-tool-design-get-needs-list-partner
+    3. adr-new-project-gitea-first-github-mirror <-- expected
+    4. 2026-05-04-gitea-mcp-build-session
+    5. adr-local-dev-vs-hyperguild-new-project
+
diff --git a/brain/eval/post-m4b.txt b/brain/eval/post-m4b.txt
new file mode 100644
index 0000000..625c737
--- /dev/null
+++ b/brain/eval/post-m4b.txt
@@ -0,0 +1,167 @@
+# post-m4b-entities-promoted — 20 questions, k=5
+
+top-1 hit rate: 7/20 = 35%
+top-3 hit rate: 16/20 = 80%
+
+## per-question detail
+
+· rank=3 expected=dex-in-memory-storage-wipes-oauth-tokens-on-every-pod-restart
+  q: how do I stop dex from logging users out on every pod restart?
+    1. homelab-network-perimeter-model
+    2. 2026-05-12-koala-machine-state
+    3. dex-in-memory-storage-wipes-oauth-tokens-on-every-pod-restart <-- expected
+    4. infra-litellm-absorption-2026-05-16
+    5. k8s-configmap-mount-no-reload-needs-pod-restart
+
+· rank=2 expected=postgres-least-privilege-migration-tenant-grant-bypass-2026-05
+  q: my postgres-exporter broke after revoking PUBLIC CONNECT — why?
+    1. infra-litellm-absorption-2026-05-16
+    2. postgres-least-privilege-migration-tenant-grant-bypass-2026-05 <-- expected
+    3. extension-version-lags-platform-major-upgrade
+    4. ntfy-deny-all-rollout-ordering-keep-alert-pipeline-live-during-auth-flip
+    5. gitea-push-mirror-cannot-create-remote-repo-needs-pre-existing-github-repo
+
+★ rank=1 expected=homelab-network-perimeter-model
+  q: when is a NodePort acceptable vs needing a public ingress with bearer gate?
+    1. homelab-network-perimeter-model <-- expected
+    2. qwen3-thinking-model-empty-content-trap
+    3. mcpclient-empty-token-silent-401-envfrom-missing-key
+    4. 2026-05-12-koala-machine-state
+    5. koala-llama-swap-native-tool-calls-survey-2026-05
+
+· rank=3 expected=exit-255-unknown-reason-not-oom
+  q: what does container exit code 255 with reason Unknown mean?
+    1. qwen3-thinking-model-empty-content-trap
+    2. infra-litellm-absorption-2026-05-16
+    3. exit-255-unknown-reason-not-oom <-- expected
+    4. mcpclient-empty-token-silent-401-envfrom-missing-key
+    5. koala-llama-swap-native-tool-calls-survey-2026-05
+
+· rank=2 expected=gitea-push-mirror-cannot-create-remote-repo-needs-pre-existing-github-repo
+  q: can gitea push-mirror create the github repo automatically?
+    1. infra-litellm-absorption-2026-05-16
+    2. gitea-push-mirror-cannot-create-remote-repo-needs-pre-existing-github-repo <-- expected
+    3. adr-new-project-gitea-first-github-mirror
+    4. adr-github-as-primary-remote
+    5. 2026-05-12-koala-machine-state
+
+✗ rank=0 expected=flux-healthcheck-stale-on-resource-removal
+  q: a flux kustomization is stuck after I removed a resource — why?
+    1. qwen3-thinking-model-empty-content-trap
+    2. 2026-05-12-koala-machine-state
+    3. homelab-architecture-principles-2026-05
+    4. k8s-configmap-mount-no-reload-needs-pod-restart
+    5. training-on-rtx-5070-pretraining-vs-finetuning
+
+★ rank=1 expected=go-bytes-buffer-bytes-reset-aliasing-trap
+  q: the bytes buffer aliasing trap with Reset in a loop — what's the bug?
+    1. go-bytes-buffer-bytes-reset-aliasing-trap <-- expected
+    2. homelab-security-chains-not-bugs
+    3. Financial Sentiment Analysis on Stock Market Headlines With FinBERT & HuggingFace
+    4. training-on-rtx-5070-pretraining-vs-finetuning
+    5. flux-healthcheck-stale-on-resource-removal
+
+★ rank=1 expected=homelab-architecture-principles-2026-05
+  q: what are the homelab architecture principles from may 2026?
+    1. homelab-architecture-principles-2026-05 <-- expected
+    2. homelab-network-perimeter-model
+    3. homelab-core-glossary
+    4. 2026-05-12-koala-machine-state
+    5. pattern-reddit-tmux-multiagent-conductor
+
+? rank=4 expected=2026-05-04-sops-age-key-from-flux-cluster
+  q: where does the sops age private key live in the cluster?
+    1. 2026-05-12-koala-machine-state
+    2. homelab-network-perimeter-model
+    3. dex-in-memory-storage-wipes-oauth-tokens-on-every-pod-restart
+    4. 2026-05-04-sops-age-key-from-flux-cluster <-- expected
+    5. homelab-security-chains-not-bugs
+
+★ rank=1 expected=grafana-dashboards-as-code-not-ui-state
+  q: why do my grafana dashboards disappear after a pod restart?
+    1. grafana-dashboards-as-code-not-ui-state <-- expected
+    2. infra-litellm-absorption-2026-05-16
+    3. 2026-05-12-koala-machine-state
+    4. dex-in-memory-storage-wipes-oauth-tokens-on-every-pod-restart
+    5. k8s-configmap-mount-no-reload-needs-pod-restart
+
+★ rank=1 expected=double-diamond-methodology
+  q: what is the double diamond methodology?
+    1. double-diamond-methodology <-- expected
+    2. unified-methodology-diamond-futures-autoresearch
+    3. futures-thinking-extended-double-diamond
+    4. insight-exploration-as-diamond-1
+    5. workflow-idea-to-running-service
+
+· rank=3 expected=2026-05-04-mcp-transport-version-claude-ai-strict
+  q: my MCP server works from claude code but fails on claude.ai — what's different?
+    1. qwen3-thinking-model-empty-content-trap
+    2. mcp-resource-url-empty-breaks-claude-ai-discovery-silently
+    3. 2026-05-04-mcp-transport-version-claude-ai-strict <-- expected
+    4. 2026-05-04-claude-ai-custom-mcp-connectors
+    5. finding-github-mcp-claudeai-vs-claudecode
+
+· rank=2 expected=homelab-security-chains-not-bugs
+  q: how should I rate security findings — isolated bugs or exploit chains?
+    1. homelab-network-perimeter-model
+    2. homelab-security-chains-not-bugs <-- expected
+    3. policy-audit-mode-blocks-nothing
+    4. homelab-document-accepted-risk-to-break-audit-cycle
+    5. audit-shortcut-tls-blocks-zero-equals-edge-only
+
+· rank=2 expected=2026-05-03-canonical-vs-derived-context-flow
+  q: how should canonical context files relate to derived adapter files?
+    1. qwen3-thinking-model-empty-content-trap
+    2. 2026-05-03-canonical-vs-derived-context-flow <-- expected
+    3. 2026-05-12-koala-machine-state
+    4. 2026-05-04-claude-ai-custom-mcp-connectors
+    5. koala-llama-swap-native-tool-calls-survey-2026-05
+
+· rank=2 expected=homelab-core-glossary
+  q: what is the homelab core vocabulary glossary?
+    1. homelab-architecture-principles-2026-05
+    2. homelab-core-glossary <-- expected
+    3. 2026-05-12-koala-machine-state
+    4. qwen35-9b-fast
+    5. flux-kustomization-depends-on-bootstrap-ordering
+
+★ rank=1 expected=koala-llama-swap-native-tool-calls-survey-2026-05
+  q: which models on koala llama-swap actually emit native tool_calls correctly?
+    1. koala-llama-swap-native-tool-calls-survey-2026-05 <-- expected
+    2. 2026-05-12-koala-machine-state
+    3. infra-litellm-absorption-2026-05-16
+    4. training-on-rtx-5070-pretraining-vs-finetuning
+    5. qwen3-thinking-model-empty-content-trap
+
+★ rank=1 expected=qwen35-9b-fast
+  q: what is qwen35-9b-fast and what's it used for?
+    1. qwen35-9b-fast <-- expected
+    2. koala-llama-swap-native-tool-calls-survey-2026-05
+    3. qwen3-thinking-model-empty-content-trap
+    4. infra-litellm-absorption-2026-05-16
+    5. 2026-05-12-koala-machine-state
+
+✗ rank=0 expected=go-defer-errcheck-body-close
+  q: in go, how do I prevent defer body close from silently dropping errors?
+    1. homelab-network-perimeter-model
+    2. infra-litellm-absorption-2026-05-16
+    3. go-bytes-buffer-bytes-reset-aliasing-trap
+    4. mcpclient-empty-token-silent-401-envfrom-missing-key
+    5. koala-llama-swap-native-tool-calls-survey-2026-05
+
+✗ rank=0 expected=hyperguild-level3-pipeline-rewrite
+  q: what was the level 3 rewrite of hyperguild's ingestion pipeline?
+    1. 2026-05-12-koala-machine-state
+    2. homelab-core-glossary
+    3. koala-llama-swap-native-tool-calls-survey-2026-05
+    4. infra-litellm-absorption-2026-05-16
+    5. homelab-architecture-principles-2026-05
+
+· rank=3 expected=adr-new-project-gitea-first-github-mirror
+  q: what's the new-project ADR — is it gitea-first or github-first?
+    1. gitea-push-mirror-cannot-create-remote-repo-needs-pre-existing-github-repo
+    2. mcp-tool-design-get-needs-list-partner
+    3. adr-new-project-gitea-first-github-mirror <-- expected
+    4. 2026-05-04-gitea-mcp-build-session
+    5. adr-local-dev-vs-hyperguild-new-project
+