flake: TestRoutingPodEndToEnd port-wait deadline exceeded under load #15
Symptom
TestRoutingPodEndToEnd in cmd/routing/main_test.go intermittently fails at line 58 when the port wait exceeds its deadline.
The 5s deadline comes from waitForPort(t, "127.0.0.1:33310", 5*time.Second) (main_test.go:58, 85–104), which waits for the freshly built routing binary to bind its HTTP listener.
Observed during #14 e2e work (2026-05-18)
Failed twice in a row:
- task check (full project test suite with -race -count=1)
- go test -race -count=1 ./cmd/routing/... immediately after
Both failed at ~5.23s. Then it stopped reproducing: 5/5 passes in isolation with -race -count=1, including with a cold test cache (go clean -testcache).
Pattern: triggered under load, not by code.
Likely causes (untriaged)
1. go build cost inside the test. buildRouting(t) compiles ./cmd/routing into a temp dir. Cold build + -race instrumentation can take >5s on a loaded machine. Running the produced binary directly outside the test starts in <2s, so the bottleneck is build, not bind.
2. Hardcoded port 33310. ss -tlnp | grep 33310 was empty when checked, but a socket lingering in TIME_WAIT or another test could still collide.
3. osPath() returns empty PATH. Lines 116–122 iterate exec.Command("env").Env, which is the child Cmd's env (nil unless set), not the parent's. As a result the test passes PATH= (empty) to the routing subprocess. The binary doesn't need PATH at runtime (statically linked Go), so this is probably benign, but it is incorrect and worth fixing while we're here.
Proposed fix
Cheap, low risk:
1. Bump the waitForPort deadline to 30s (cost only paid on failure paths).
2. Replace the hardcoded 33310 with a net.Listen("tcp", ":0") random port allocation passed via the ROUTING_PORT env.
3. Fix osPath() to read the parent process env via os.Getenv("PATH") (or just drop the explicit PATH= and rely on cmd.Env inheriting).
Acceptance criteria
- go test -race -count=10 ./cmd/routing/... is green
- task check is green across 3 consecutive runs
- osPath() either correct or removed
References
- cmd/routing/main_test.go:30–80: test body
- cmd/routing/main_test.go:85–104: waitForPort
- cmd/routing/main_test.go:116–122: osPath
- 937355c).
Fixed in fe18e4e.
Root cause (confirmed)
All three suspects from the issue body played a role:
1. Hardcoded port 33310: under task check parallel package execution, another test or stray process could hold the port long enough that the routing binary bound it late or failed Listen. Now freePort(t) grabs :0, releases it, and passes the OS-assigned port via ROUTING_PORT.
2. waitForPort deadline: too tight when go build -race runs cold under load. Bumped to 30s; only paid in failure mode.
3. osPath() always returned empty: iterating exec.Command("env").Env reads the child Cmd's env (nil), not the parent's. The binary worked anyway (static Go), but the helper was a no-op masquerading as PATH inheritance. Replaced with explicit PATH= + HOME= via os.Getenv.
Why not os.Environ() for full inheritance
Tried that first. It leaked ROUTING_MCP_TOKEN from my shell into the test subprocess, flipping the routing pod into bearer-auth-required mode, so all calls returned 401 unauthorized. Explicit minimal env keeps the test hermetic.
Verification
Diff: +28/-16, single file cmd/routing/main_test.go.
Closing.