diff --git a/brain/eval/post-m4.txt b/brain/eval/post-m4.txt
new file mode 100644
index 0000000..698cc40
--- /dev/null
+++ b/brain/eval/post-m4.txt
@@ -0,0 +1,167 @@
+# post-m4-tier-weighting — 20 questions, k=5
+
+top-1 hit rate: 6/20 = 30%
+top-3 hit rate: 15/20 = 75%
+
+## per-question detail
+
+· rank=3 expected=dex-in-memory-storage-wipes-oauth-tokens-on-every-pod-restart
+  q: how do I stop dex from logging users out on every pod restart?
+    1. homelab-network-perimeter-model
+    2. 2026-05-12-koala-machine-state
+    3. dex-in-memory-storage-wipes-oauth-tokens-on-every-pod-restart <-- expected
+    4. infra-litellm-absorption-2026-05-16
+    5. k8s-configmap-mount-no-reload-needs-pod-restart
+
+· rank=2 expected=postgres-least-privilege-migration-tenant-grant-bypass-2026-05
+  q: my postgres-exporter broke after revoking PUBLIC CONNECT — why?
+    1. infra-litellm-absorption-2026-05-16
+    2. postgres-least-privilege-migration-tenant-grant-bypass-2026-05 <-- expected
+    3. extension-version-lags-platform-major-upgrade
+    4. ntfy-deny-all-rollout-ordering-keep-alert-pipeline-live-during-auth-flip
+    5. gitea-push-mirror-cannot-create-remote-repo-needs-pre-existing-github-repo
+
+★ rank=1 expected=homelab-network-perimeter-model
+  q: when is a NodePort acceptable vs needing a public ingress with bearer gate?
+    1. homelab-network-perimeter-model <-- expected
+    2. qwen3-thinking-model-empty-content-trap
+    3. mcpclient-empty-token-silent-401-envfrom-missing-key
+    4. 2026-05-12-koala-machine-state
+    5. koala-llama-swap-native-tool-calls-survey-2026-05
+
+· rank=3 expected=exit-255-unknown-reason-not-oom
+  q: what does container exit code 255 with reason Unknown mean?
+    1. qwen3-thinking-model-empty-content-trap
+    2. infra-litellm-absorption-2026-05-16
+    3. exit-255-unknown-reason-not-oom <-- expected
+    4. mcpclient-empty-token-silent-401-envfrom-missing-key
+    5. koala-llama-swap-native-tool-calls-survey-2026-05
+
+· rank=2 expected=gitea-push-mirror-cannot-create-remote-repo-needs-pre-existing-github-repo
+  q: can gitea push-mirror create the github repo automatically?
+    1. infra-litellm-absorption-2026-05-16
+    2. gitea-push-mirror-cannot-create-remote-repo-needs-pre-existing-github-repo <-- expected
+    3. adr-new-project-gitea-first-github-mirror
+    4. adr-github-as-primary-remote
+    5. 2026-05-12-koala-machine-state
+
+✗ rank=0 expected=flux-healthcheck-stale-on-resource-removal
+  q: a flux kustomization is stuck after I removed a resource — why?
+    1. qwen3-thinking-model-empty-content-trap
+    2. 2026-05-12-koala-machine-state
+    3. homelab-architecture-principles-2026-05
+    4. k8s-configmap-mount-no-reload-needs-pod-restart
+    5. training-on-rtx-5070-pretraining-vs-finetuning
+
+★ rank=1 expected=go-bytes-buffer-bytes-reset-aliasing-trap
+  q: the bytes buffer aliasing trap with Reset in a loop — what's the bug?
+    1. go-bytes-buffer-bytes-reset-aliasing-trap <-- expected
+    2. homelab-security-chains-not-bugs
+    3. Financial Sentiment Analysis on Stock Market Headlines With FinBERT & HuggingFace
+    4. training-on-rtx-5070-pretraining-vs-finetuning
+    5. flux-healthcheck-stale-on-resource-removal
+
+★ rank=1 expected=homelab-architecture-principles-2026-05
+  q: what are the homelab architecture principles from may 2026?
+    1. homelab-architecture-principles-2026-05 <-- expected
+    2. homelab-network-perimeter-model
+    3. homelab-core-glossary
+    4. 2026-05-12-koala-machine-state
+    5. pattern-reddit-tmux-multiagent-conductor
+
+? rank=4 expected=2026-05-04-sops-age-key-from-flux-cluster
+  q: where does the sops age private key live in the cluster?
+    1. 2026-05-12-koala-machine-state
+    2. homelab-network-perimeter-model
+    3. dex-in-memory-storage-wipes-oauth-tokens-on-every-pod-restart
+    4. 2026-05-04-sops-age-key-from-flux-cluster <-- expected
+    5. homelab-security-chains-not-bugs
+
+★ rank=1 expected=grafana-dashboards-as-code-not-ui-state
+  q: why do my grafana dashboards disappear after a pod restart?
+    1. grafana-dashboards-as-code-not-ui-state <-- expected
+    2. infra-litellm-absorption-2026-05-16
+    3. 2026-05-12-koala-machine-state
+    4. dex-in-memory-storage-wipes-oauth-tokens-on-every-pod-restart
+    5. k8s-configmap-mount-no-reload-needs-pod-restart
+
+★ rank=1 expected=double-diamond-methodology
+  q: what is the double diamond methodology?
+    1. double-diamond-methodology <-- expected
+    2. unified-methodology-diamond-futures-autoresearch
+    3. futures-thinking-extended-double-diamond
+    4. insight-exploration-as-diamond-1
+    5. workflow-idea-to-running-service
+
+· rank=3 expected=2026-05-04-mcp-transport-version-claude-ai-strict
+  q: my MCP server works from claude code but fails on claude.ai — what's different?
+    1. qwen3-thinking-model-empty-content-trap
+    2. mcp-resource-url-empty-breaks-claude-ai-discovery-silently
+    3. 2026-05-04-mcp-transport-version-claude-ai-strict <-- expected
+    4. 2026-05-04-claude-ai-custom-mcp-connectors
+    5. finding-github-mcp-claudeai-vs-claudecode
+
+· rank=2 expected=homelab-security-chains-not-bugs
+  q: how should I rate security findings — isolated bugs or exploit chains?
+    1. homelab-network-perimeter-model
+    2. homelab-security-chains-not-bugs <-- expected
+    3. policy-audit-mode-blocks-nothing
+    4. homelab-document-accepted-risk-to-break-audit-cycle
+    5. audit-shortcut-tls-blocks-zero-equals-edge-only
+
+· rank=2 expected=2026-05-03-canonical-vs-derived-context-flow
+  q: how should canonical context files relate to derived adapter files?
+    1. qwen3-thinking-model-empty-content-trap
+    2. 2026-05-03-canonical-vs-derived-context-flow <-- expected
+    3. 2026-05-12-koala-machine-state
+    4. 2026-05-04-claude-ai-custom-mcp-connectors
+    5. koala-llama-swap-native-tool-calls-survey-2026-05
+
+· rank=2 expected=homelab-core-glossary
+  q: what is the homelab core vocabulary glossary?
+    1. homelab-architecture-principles-2026-05
+    2. homelab-core-glossary <-- expected
+    3. 2026-05-12-koala-machine-state
+    4. flux-kustomization-depends-on-bootstrap-ordering
+    5. brain-ingest-ntfy-service
+
+★ rank=1 expected=koala-llama-swap-native-tool-calls-survey-2026-05
+  q: which models on koala llama-swap actually emit native tool_calls correctly?
+    1. koala-llama-swap-native-tool-calls-survey-2026-05 <-- expected
+    2. 2026-05-12-koala-machine-state
+    3. infra-litellm-absorption-2026-05-16
+    4. training-on-rtx-5070-pretraining-vs-finetuning
+    5. qwen3-thinking-model-empty-content-trap
+
+✗ rank=0 expected=qwen35-9b-fast
+  q: what is qwen35-9b-fast and what's it used for?
+    1. koala-llama-swap-native-tool-calls-survey-2026-05
+    2. qwen3-thinking-model-empty-content-trap
+    3. infra-litellm-absorption-2026-05-16
+    4. 2026-05-12-koala-machine-state
+    5. index
+
+✗ rank=0 expected=go-defer-errcheck-body-close
+  q: in go, how do I prevent defer body close from silently dropping errors?
+    1. homelab-network-perimeter-model
+    2. infra-litellm-absorption-2026-05-16
+    3. go-bytes-buffer-bytes-reset-aliasing-trap
+    4. mcpclient-empty-token-silent-401-envfrom-missing-key
+    5. koala-llama-swap-native-tool-calls-survey-2026-05
+
+✗ rank=0 expected=hyperguild-level3-pipeline-rewrite
+  q: what was the level 3 rewrite of hyperguild's ingestion pipeline?
+    1. 2026-05-12-koala-machine-state
+    2. homelab-core-glossary
+    3. koala-llama-swap-native-tool-calls-survey-2026-05
+    4. infra-litellm-absorption-2026-05-16
+    5. homelab-architecture-principles-2026-05
+
+· rank=3 expected=adr-new-project-gitea-first-github-mirror
+  q: what's the new-project ADR — is it gitea-first or github-first?
+    1. gitea-push-mirror-cannot-create-remote-repo-needs-pre-existing-github-repo
+    2. mcp-tool-design-get-needs-list-partner
+    3. adr-new-project-gitea-first-github-mirror <-- expected
+    4. 2026-05-04-gitea-mcp-build-session
+    5. adr-local-dev-vs-hyperguild-new-project
+
diff --git a/brain/eval/post-m4b.txt b/brain/eval/post-m4b.txt
new file mode 100644
index 0000000..625c737
--- /dev/null
+++ b/brain/eval/post-m4b.txt
@@ -0,0 +1,167 @@
+# post-m4b-entities-promoted — 20 questions, k=5
+
+top-1 hit rate: 7/20 = 35%
+top-3 hit rate: 16/20 = 80%
+
+## per-question detail
+
+· rank=3 expected=dex-in-memory-storage-wipes-oauth-tokens-on-every-pod-restart
+  q: how do I stop dex from logging users out on every pod restart?
+    1. homelab-network-perimeter-model
+    2. 2026-05-12-koala-machine-state
+    3. dex-in-memory-storage-wipes-oauth-tokens-on-every-pod-restart <-- expected
+    4. infra-litellm-absorption-2026-05-16
+    5. k8s-configmap-mount-no-reload-needs-pod-restart
+
+· rank=2 expected=postgres-least-privilege-migration-tenant-grant-bypass-2026-05
+  q: my postgres-exporter broke after revoking PUBLIC CONNECT — why?
+    1. infra-litellm-absorption-2026-05-16
+    2. postgres-least-privilege-migration-tenant-grant-bypass-2026-05 <-- expected
+    3. extension-version-lags-platform-major-upgrade
+    4. ntfy-deny-all-rollout-ordering-keep-alert-pipeline-live-during-auth-flip
+    5. gitea-push-mirror-cannot-create-remote-repo-needs-pre-existing-github-repo
+
+★ rank=1 expected=homelab-network-perimeter-model
+  q: when is a NodePort acceptable vs needing a public ingress with bearer gate?
+    1. homelab-network-perimeter-model <-- expected
+    2. qwen3-thinking-model-empty-content-trap
+    3. mcpclient-empty-token-silent-401-envfrom-missing-key
+    4. 2026-05-12-koala-machine-state
+    5. koala-llama-swap-native-tool-calls-survey-2026-05
+
+· rank=3 expected=exit-255-unknown-reason-not-oom
+  q: what does container exit code 255 with reason Unknown mean?
+    1. qwen3-thinking-model-empty-content-trap
+    2. infra-litellm-absorption-2026-05-16
+    3. exit-255-unknown-reason-not-oom <-- expected
+    4. mcpclient-empty-token-silent-401-envfrom-missing-key
+    5. koala-llama-swap-native-tool-calls-survey-2026-05
+
+· rank=2 expected=gitea-push-mirror-cannot-create-remote-repo-needs-pre-existing-github-repo
+  q: can gitea push-mirror create the github repo automatically?
+    1. infra-litellm-absorption-2026-05-16
+    2. gitea-push-mirror-cannot-create-remote-repo-needs-pre-existing-github-repo <-- expected
+    3. adr-new-project-gitea-first-github-mirror
+    4. adr-github-as-primary-remote
+    5. 2026-05-12-koala-machine-state
+
+✗ rank=0 expected=flux-healthcheck-stale-on-resource-removal
+  q: a flux kustomization is stuck after I removed a resource — why?
+    1. qwen3-thinking-model-empty-content-trap
+    2. 2026-05-12-koala-machine-state
+    3. homelab-architecture-principles-2026-05
+    4. k8s-configmap-mount-no-reload-needs-pod-restart
+    5. training-on-rtx-5070-pretraining-vs-finetuning
+
+★ rank=1 expected=go-bytes-buffer-bytes-reset-aliasing-trap
+  q: the bytes buffer aliasing trap with Reset in a loop — what's the bug?
+    1. go-bytes-buffer-bytes-reset-aliasing-trap <-- expected
+    2. homelab-security-chains-not-bugs
+    3. Financial Sentiment Analysis on Stock Market Headlines With FinBERT & HuggingFace
+    4. training-on-rtx-5070-pretraining-vs-finetuning
+    5. flux-healthcheck-stale-on-resource-removal
+
+★ rank=1 expected=homelab-architecture-principles-2026-05
+  q: what are the homelab architecture principles from may 2026?
+    1. homelab-architecture-principles-2026-05 <-- expected
+    2. homelab-network-perimeter-model
+    3. homelab-core-glossary
+    4. 2026-05-12-koala-machine-state
+    5. pattern-reddit-tmux-multiagent-conductor
+
+? rank=4 expected=2026-05-04-sops-age-key-from-flux-cluster
+  q: where does the sops age private key live in the cluster?
+    1. 2026-05-12-koala-machine-state
+    2. homelab-network-perimeter-model
+    3. dex-in-memory-storage-wipes-oauth-tokens-on-every-pod-restart
+    4. 2026-05-04-sops-age-key-from-flux-cluster <-- expected
+    5. homelab-security-chains-not-bugs
+
+★ rank=1 expected=grafana-dashboards-as-code-not-ui-state
+  q: why do my grafana dashboards disappear after a pod restart?
+    1. grafana-dashboards-as-code-not-ui-state <-- expected
+    2. infra-litellm-absorption-2026-05-16
+    3. 2026-05-12-koala-machine-state
+    4. dex-in-memory-storage-wipes-oauth-tokens-on-every-pod-restart
+    5. k8s-configmap-mount-no-reload-needs-pod-restart
+
+★ rank=1 expected=double-diamond-methodology
+  q: what is the double diamond methodology?
+    1. double-diamond-methodology <-- expected
+    2. unified-methodology-diamond-futures-autoresearch
+    3. futures-thinking-extended-double-diamond
+    4. insight-exploration-as-diamond-1
+    5. workflow-idea-to-running-service
+
+· rank=3 expected=2026-05-04-mcp-transport-version-claude-ai-strict
+  q: my MCP server works from claude code but fails on claude.ai — what's different?
+    1. qwen3-thinking-model-empty-content-trap
+    2. mcp-resource-url-empty-breaks-claude-ai-discovery-silently
+    3. 2026-05-04-mcp-transport-version-claude-ai-strict <-- expected
+    4. 2026-05-04-claude-ai-custom-mcp-connectors
+    5. finding-github-mcp-claudeai-vs-claudecode
+
+· rank=2 expected=homelab-security-chains-not-bugs
+  q: how should I rate security findings — isolated bugs or exploit chains?
+    1. homelab-network-perimeter-model
+    2. homelab-security-chains-not-bugs <-- expected
+    3. policy-audit-mode-blocks-nothing
+    4. homelab-document-accepted-risk-to-break-audit-cycle
+    5. audit-shortcut-tls-blocks-zero-equals-edge-only
+
+· rank=2 expected=2026-05-03-canonical-vs-derived-context-flow
+  q: how should canonical context files relate to derived adapter files?
+    1. qwen3-thinking-model-empty-content-trap
+    2. 2026-05-03-canonical-vs-derived-context-flow <-- expected
+    3. 2026-05-12-koala-machine-state
+    4. 2026-05-04-claude-ai-custom-mcp-connectors
+    5. koala-llama-swap-native-tool-calls-survey-2026-05
+
+· rank=2 expected=homelab-core-glossary
+  q: what is the homelab core vocabulary glossary?
+    1. homelab-architecture-principles-2026-05
+    2. homelab-core-glossary <-- expected
+    3. 2026-05-12-koala-machine-state
+    4. qwen35-9b-fast
+    5. flux-kustomization-depends-on-bootstrap-ordering
+
+★ rank=1 expected=koala-llama-swap-native-tool-calls-survey-2026-05
+  q: which models on koala llama-swap actually emit native tool_calls correctly?
+    1. koala-llama-swap-native-tool-calls-survey-2026-05 <-- expected
+    2. 2026-05-12-koala-machine-state
+    3. infra-litellm-absorption-2026-05-16
+    4. training-on-rtx-5070-pretraining-vs-finetuning
+    5. qwen3-thinking-model-empty-content-trap
+
+★ rank=1 expected=qwen35-9b-fast
+  q: what is qwen35-9b-fast and what's it used for?
+    1. qwen35-9b-fast <-- expected
+    2. koala-llama-swap-native-tool-calls-survey-2026-05
+    3. qwen3-thinking-model-empty-content-trap
+    4. infra-litellm-absorption-2026-05-16
+    5. 2026-05-12-koala-machine-state
+
+✗ rank=0 expected=go-defer-errcheck-body-close
+  q: in go, how do I prevent defer body close from silently dropping errors?
+    1. homelab-network-perimeter-model
+    2. infra-litellm-absorption-2026-05-16
+    3. go-bytes-buffer-bytes-reset-aliasing-trap
+    4. mcpclient-empty-token-silent-401-envfrom-missing-key
+    5. koala-llama-swap-native-tool-calls-survey-2026-05
+
+✗ rank=0 expected=hyperguild-level3-pipeline-rewrite
+  q: what was the level 3 rewrite of hyperguild's ingestion pipeline?
+    1. 2026-05-12-koala-machine-state
+    2. homelab-core-glossary
+    3. koala-llama-swap-native-tool-calls-survey-2026-05
+    4. infra-litellm-absorption-2026-05-16
+    5. homelab-architecture-principles-2026-05
+
+· rank=3 expected=adr-new-project-gitea-first-github-mirror
+  q: what's the new-project ADR — is it gitea-first or github-first?
+    1. gitea-push-mirror-cannot-create-remote-repo-needs-pre-existing-github-repo
+    2. mcp-tool-design-get-needs-list-partner
+    3. adr-new-project-gitea-first-github-mirror <-- expected
+    4. 2026-05-04-gitea-mcp-build-session
+    5. adr-local-dev-vs-hyperguild-new-project
+