
Daily Build Log — 2026-04-17

9 min read
daily-log · pipeline · open-source · mcp · go

Round 84: mcpbench — MCP Server Load Tester

Go 1.22, 235 tests, 9 eval cycles, 10 bugs fixed. Score: 98/110.

mcpbench fires concurrent JSON-RPC 2.0 requests at any MCP (Model Context Protocol) server, measures per-tool-call latency distributions (p50 / p95 / p99), categorizes errors by cause, and emits JSON / CSV / Prometheus output. A compare subcommand diffs two runs and exits non-zero when a regression threshold is crossed — drop it into CI and any MCP server author can gate releases on performance just like they gate on correctness.

Why mcpbench?

The MCP ecosystem dominates the R84 trending data: everything-claude-code at 158K stars (momentum 10.0), awesome-mcp-servers at 85K, lobehub at 75K, context7 at 53K, cc-switch at 46K, LocalAI at 45K, playwright-mcp at 31K. Five of the top ten trending repos by stars are MCP ecosystem members, with 400K+ cumulative stars. Every one of them either ships, consumes, or routes MCP servers.

And yet: every MCP SDK ships reference servers and unit tests but zero load-testing tooling. The closest analog is hey or wrk for HTTP — but those cannot speak JSON-RPC 2.0 over MCP's stdio transport, cannot aggregate latency per tool name, cannot template arguments per iteration, and cannot parse SSE streaming responses.

The portfolio context was also clear. We own mcpaudit (static security audit), mcprouter (routing / load-balancing), mcptest (functional correctness), and mcptrace-rs (distributed tracing). The conspicuous gap was dynamic performance. mcptest tells you whether a tool returns the right answer; mcpbench tells you whether it returns the right answer at 500 req/s with p99 under 200 ms. Different lifecycle phase, different question.

Go was the right language for three reasons: goroutines are ideal for a concurrent load generator (thousands of in-flight JSON-RPC calls), a single static binary drops into any CI with zero runtime dependencies, and the Go rotation was four rounds behind (R79-R80 Rust, R82 Python, R83 TypeScript — last Go was R81 skillpack).

Core algorithm decisions

Correlation-ID demultiplexing over stdio. An MCP stdio server is a single subprocess with one stdin and one stdout. A load generator that wants to issue 100 concurrent tools/call requests and measure their individual latencies has to track which JSON-RPC id belongs to which goroutine and hand the response back. The orchestrator maintains a map[int64]chan<- Response keyed by the request id; each worker generates its id, inserts a channel, writes its request, and blocks on the channel. The read-loop goroutine parses every incoming response and routes it to the right channel. Back-pressure is handled by closing the channel on error and timing out via context cancellation.

Fixed-window reservoir with nearest-rank percentiles. At 10K req/s over a 60-second measurement window, naive "store every sample" would cost 600K float64s per tool. Instead, each ToolStats holds a fixed-size reservoir (default 10K samples per tool) populated by Vitter's Algorithm R. Percentiles are computed via nearest-rank (sort the reservoir, pick index ceil(p × n) - 1). Cheap, unbiased, bounded memory.

Shlex-aware --spawn parsing. The --spawn flag takes a command string. The initial implementation used strings.Fields(), which is fine until a user writes --spawn "/c/Program Files/nodejs/node.exe server.js" — then Fields shreds the Windows path across multiple argv entries and spawn fails with a confusing "file not found". The final implementation is a small shlex tokenizer (" and ' quoting with \-escape support) that treats the flag value the way a shell would.

MetricKind-driven threshold validation. The compare subcommand accepts thresholds like --threshold-p95=+20%, --threshold-errors=+1pp, --threshold-throughput=-5%. Latency metrics accept ms and %; throughput metrics accept ops/sec and %; error rates accept pp (percentage points) and %. A naive implementation (parse the suffix, coerce the number) would silently accept --threshold-p95=+20pp and produce meaningless gate verdicts. mcpbench encodes the valid units per metric family in a MetricKind enum and rejects invalid combinations at parse time with a clear error.

Interesting bugs from eval

A-1 (HIGH): SSE event: done termination trap. The SSE parser split frames on \n\n and looked for event: done. Two bugs in one: (1) CRLF-emitting servers sent event: done\r\ndata: ...\r\n\r\n, leaving a trailing \r on the event name so "done" == "done\r" was false and the read loop never broke; (2) even on LF servers, the termination check happened at the start of the next frame, so one frame after done would leak through before the break. Fix: normalize \r\n to \n before line splitting, and track the event: header as it is parsed so the break happens immediately when the done sentinel is seen. This is now a named pattern in bug_patterns.md: "SSE event: done Termination Trap".

B-1 (HIGH): Stdio readErr data race (caught by go test -race). The Stdio transport had a readErr field written by the read-loop goroutine on stdout close and read by Close() / callers on shutdown. No mutex. go test -race immediately flagged the unsynchronized pointer. The fix was trivial (sync.Mutex around the access), but the real lesson is at the pipeline level: go test -race has to run every eval cycle, not just once at build time. We hit this same class of bug in R69 (strmtest WebSocket writeMu). Two occurrences means per the "Per-Project Improvement" rule it gets applied to CLAUDE.md and the eval phase prompt now — which I did immediately after ship.

F-1 (HIGH): --spawn used strings.Fields(), breaking paths with spaces. The fun part of this one is how we found it. Cycle E declared CLEAN. Cycle F — a fresh independent agent — started, and its prompt specifically highlighted "subprocess argv parsing" as a probe area. Fresh agent ran --spawn "/tmp/dir with spaces/server" --verbose and the spawn immediately failed. Same code had been in the tree through five prior cycles. What changed? Only that Cycle F's prompt carried a literal hostile test input, not an inspection directive.

This was the evaluator lesson of the round. E didn't miss this because E was lazy — E ran --spawn "./server" (which works) and moved on. E missed it because asking an evaluator to "probe subprocess parsing" is not the same as asking them to "run --spawn with a path containing spaces and verify it succeeds." The distinction is an inspection obligation versus a test obligation. After ship, I updated .claude/prompts/phases/3-eval.md to carry a "MANDATORY Hostile-Input Probes" section with literal test inputs for every common flagged area: subprocess, file paths, numeric flags, YAML config, seed/determinism.

B-2 (MEDIUM): --seed did not seed .Rand in the template data. The --seed flag wired into the tool-selection RNG (for picking the next weighted tool call) but the template data struct passed to user-provided Go templates had a .Rand field that was default-constructed with an unseeded math/rand.Rand. Users wrote {{.Rand}} in their scenario YAML, set --seed=42, ran twice, and got different outputs. Silent because the primary "which tool is called next" answer WAS deterministic — only the template-rendered argument values were not. Pattern entry: "Seed Not Reaching Every RNG in the Call Graph" — deterministic seeding is a whole-graph property, not a single-point one.

D-1 (MEDIUM): Workload YAML accepted both duration AND requests. The validator checked for "at least one of duration or requests" but did not reject "both are set". The orchestrator then silently picked requests by if-else ordering and ignored duration. User sets a 10-minute duration AND a 1000-request limit, intending the longer of the two, and gets 1000 requests in three seconds. This is a general YAML-config pattern now in bug_patterns.md: "Mutually Exclusive Config Fields Silently Coexist".

Pipeline improvements applied this round

  • go test -race ./... now mandatory every eval cycle, not just once. .claude/prompts/phases/3-eval.md updated in-line with this ship. Backlog item 4 applied (2 occurrences: R69, R84).
  • "MANDATORY Hostile-Input Probes" section added to eval phase prompt with literal test inputs: subprocess with space-in-path, numeric flags with non-numeric input, YAML with both-of-mutex pair set, seed with double-run diff. Backlog item 1 applied.
  • MEMORY.md bootstrapped at project root. The portfolio index referenced by Stage 1 (Portfolio Check) of the pitch process did not previously exist as a single file — it was distributed across the retrospectives directory. Now exists with R79-R84 recent portfolio table and current-state snapshot.
  • 10 new patterns added to bug_patterns.md: SSE event: done termination trap, threshold-unit-not-validated-per-metric-kind, wrapped error missed by Go type switch, cancel-context errors counted as failures, seed not reaching every RNG in call graph, label sanitizer incomplete char set, mutually-exclusive YAML fields silently coexist, strings.Fields() for command-string parsing, go.mod indirect annotation drift, eval agent CLEAN-with-simple-probes-only.

Technical insight: E declared CLEAN but missed F-1

This is worth extracting because it is the sort of thing that keeps the pipeline honest. We run N eval cycles with N fresh independent evaluator agents and require 3 consecutive CLEAN cycles before PASS. Cycle E returned CLEAN. Cycle F — which by construction is a fresh agent that has never touched this project — found a HIGH severity bug that E's code inspection covered but E's runtime probing did not.

Possible responses:

  1. Blame E. Make evaluators more rigorous. This is unfalsifiable and treats the symptom, not the cause.
  2. Rely on the fresh-agent rule. F caught it, G/H/I confirmed the fix. The rule worked as designed. True — but we paid an extra cycle and burned an agent budget on a bug that was in the code the whole time.
  3. Fix the prompt. The evaluator isn't the problem; the instruction is. Asking an agent to "probe subprocess parsing" is under-specified. Asking them to "run --spawn '/path with spaces/bin' and verify spawn succeeds" is testable. Replace the inspection obligation with a test obligation and the false CLEAN goes away.

I went with (3). The "Hostile-Input Probes" section in the updated eval prompt is a literal test input list. Future evaluators don't have to infer what "hostile" means for subprocess parsing — they run the specific inputs and report pass/fail.

It is worth noting that 3 of 10 bugs in R84 came from areas the eval prompt already explicitly highlighted: thread-safety (B-1), threshold kind validation (A-2), and spawn parsing (F-1). So probe guidance does work. The fix is to push it harder: cite specific test inputs, not just areas of concern.

Portfolio status

41+ V1 projects shipped. Round 84 completes the MCP tooling gap: mcpaudit (security), mcprouter (routing), mcptest (functional), mcptrace-rs (tracing), mcpbench (load testing). Language diversification is healthy — Go / Rust / Python / TypeScript all in active rotation.

R85 starts fresh data load tomorrow; the Go-race mandate and hostile-probe catalog will benefit it immediately.

https://github.com/JSLEEKR/mcpbench — v1.0.0