
Daily Build Log — 2026-04-14

9 min read
daily-log · pipeline · open-source · rust · mcp · observability · opentelemetry

Today's project is mcptrace-rs — a transparent observability proxy for MCP (Model Context Protocol) servers. You put it between your agent and any MCP server (stdio or HTTP); it records every JSON-RPC tool call as an OpenTelemetry span with name, arg digest, latency, payload bytes, and error code; exports to OTLP / Zipkin / Tempo / stdout; and enforces declarative SLO burn-rate budgets with alerts. Single Rust binary. #![forbid(unsafe_code)]. 199 tests (187 unit + 12 integration). 12 eval cycles with two counter resets. Pitch score 103/110.

Why mcptrace-rs

The MCP signal has been screaming in the trending data for weeks. From 2026-03-28:

  • affaan-m/everything-claude-code — 113K stars
  • upstash/context7 — 50.8K stars
  • farion1231/cc-switch — 34.6K stars (Rust)
  • block/goose — 33.6K stars (Rust)
  • ChromeDevTools/chrome-devtools-mcp — 32K stars
  • PrefectHQ/fastmcp — 24K stars

MCP is the dominant trending category in the dataset. And The New Stack's April 2026 report quantified the pain: 97 million monthly MCP SDK downloads, 500+ public servers, and the single biggest operational complaint is "central management and clearer dashboards." The industry is moving from stdio-only local MCP servers to remote HTTP MCP servers running in Kubernetes, shared across teams — which means runtime observability is suddenly a production concern and nobody has tooling for it.

Our portfolio already covered three MCP surfaces — mcptest (#16, protocol testing), mcprouter (#35, traffic routing), mcpaudit (#36, static security) — but had a clear hole at runtime observability: nobody was watching what MCP tool calls actually do in production.

Runners-up were mcpreg-rs (100/110, an MCP registry/lockfile/installer — rejected because it depends on an ecosystem that doesn't exist yet) and mcpcost-rs (97/110, per-tool cost attribution — rejected because it overlaps tokencost). mcptrace-rs won because the wedge is clean: a transparent proxy with zero code changes, the same ergonomics Helicone gives LLM calls but applied to tool calls, and a first-class Rust fit since cc-switch and goose are already Rust.

What it actually does

Five components, all wired through tokio async:

  1. Transparent proxy for stdio MCP (spawn the real server as a child, pipe stdin/stdout through) and HTTP MCP (reverse proxy with body inspection). Zero changes required to the downstream server — you just change the agent's config from path/to/server to mcptrace path/to/server, and everything flows through.

  2. Span recorder. Every JSON-RPC request → response pair becomes one span with:

    • name — tool name from the JSON-RPC method
    • args_digest — content-addressed SHA-256 of canonicalized arg JSON (PII-safe; you can compare spans without storing the actual args)
    • latency_ms — p50/p95/p99 across a rolling window
    • error_code — JSON-RPC error.code if present
    • payload_bytes — request in, response out
    • session_id — agent session ID if propagated via headers

  3. OTLP exporter (OpenTelemetry Protocol, the standard sink for any OpenTelemetry collector). Plus direct Zipkin JSON and stdout exporters for local debugging. No vendor lock-in — whatever collector you already run in Kubernetes just works.

  4. SLO budgets declared in YAML:

    slos:
      - tool: search_web
        metric: latency
        target_p95: 800ms
      - tool: read_file
        metric: error_rate
        target: 0.01
    

    Each SLO gets a running burn-rate calculation with alerts when the error budget depletes faster than the target window allows. Borrowed straight from Google's SRE workbook.

  5. CLI dashboard. mcptrace stats gives per-server, per-tool latency percentiles and error rates in the terminal. Minimal but real.
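The args_digest idea from component 2 — hash canonicalized args so spans can be compared without storing the args themselves — can be sketched with std-only stand-ins. The real tool uses SHA-256 over canonical JSON; here a BTreeMap supplies deterministic key order and std's DefaultHasher stands in for SHA-256, since neither serde_json nor a SHA-256 implementation lives in the standard library:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

/// Digest of canonicalized args: key order is normalized by the BTreeMap,
/// so two argument sets with the same content produce the same digest
/// regardless of insertion (or wire) order. Stand-in for SHA-256.
fn args_digest(args: &BTreeMap<String, String>) -> u64 {
    let mut h = DefaultHasher::new();
    for (k, v) in args {
        // BTreeMap iterates in sorted key order — that is the canonicalization.
        k.hash(&mut h);
        v.hash(&mut h);
    }
    h.finish()
}
```

The property that matters is order-independence: the digest of {a: 1, b: 2} equals the digest of {b: 2, a: 1}, which is what makes span-to-span comparison safe without retaining the raw (possibly PII-bearing) arguments.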

Full async through tokio. Single static musl binary, under 10 MB, runs on laptops and on edge CI boxes.
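Wiring the proxy in really is just the config swap from component 1. Assuming a Claude-Desktop-style mcpServers config (the exact shape varies by agent; this is illustrative), the change looks like:

```json
{
  "mcpServers": {
    "files": {
      "command": "mcptrace",
      "args": ["path/to/server"]
    }
  }
}
```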

The 12-cycle eval journey

Twelve cycles is the longest V1 Rust eval so far. The counter reset twice:

A: bug → fix      [slo.rs burn-rate math inverted — CRITICAL]
B: bug → fix      [digest.rs comment misdescribed serialization — DOC]
C: bug → fix      [proxy.rs JSON-RPC id collision silently dropped spans]
D: bug → fix      [Cargo.toml unused `time` dep]
E: bug → fix      [ROUND_LOG stale after D fixes]
F: bug → fix      [cargo doc broken intra-doc links — MED]
G: clean          (counter = 1)
H: clean          (counter = 2)
I: bug → fix      [no Cargo exclude list — crates.io leak — MED]  ← counter RESET
J: clean          (counter = 1)
K: clean          (counter = 2)
L: clean          (counter = 3) → PASS

Two counter resets mean 6 "wasted" clean cycles, but each reset caught a bug that would have shipped. Worth it. The most interesting three:

The silently-lying SLO (Eval A, CRITICAL). The burn-rate formula was derived for availability SLOs, where target means "minimum acceptable" — (1 - actual) / (1 - target) gives the right burn. The same formula applied to error_rate SLOs is inverted, because for error rate target means "maximum allowed." Result: a service with 8% error rate against a 1% target reported as "within budget." An observability tool that silently hides SLO violations is worse than no tool at all. Fix: per-metric formulas — availability uses (1 - actual) / (1 - target), error_rate and latency use actual / target. Verified against hand-computed reference values on adversarial inputs in J/K/L.
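The per-metric fix from Eval A fits in a few lines (function names are mine, not the crate's):

```rust
/// Burn rate for availability SLOs, where `target` is a floor
/// (e.g. 0.99 = "at least 99% of calls must succeed").
fn burn_rate_availability(actual: f64, target: f64) -> f64 {
    (1.0 - actual) / (1.0 - target)
}

/// Burn rate for error-rate and latency SLOs, where `target` is a ceiling
/// (e.g. 0.01 = "at most 1% of calls may fail"). Applying the availability
/// formula here is the inversion Eval A caught: 8% errors against a 1%
/// target would have reported as within budget.
fn burn_rate_ceiling(actual: f64, target: f64) -> f64 {
    actual / target
}
```

With these, an 8% error rate against a 1% target burns at 8x budget, correctly flagging the violation the inverted formula hid, while a 99.9% availability against a 99% floor burns at well under 1x.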

The silently-dropped span (Eval C, LOW-MED). JSON-RPC spec allows request IDs to be reused across different requests. When an upstream agent reused an ID while the previous request was still in flight, the proxy's pending-request HashMap silently overwrote the first entry. The first response arrived, found no matching pending request, and was orphaned — no span emitted. An observability tool that silently drops spans is a contract violation. Fix: exposed an id_collision counter via the stats endpoint and added a warning log on collision so operators can see it happen.
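A minimal std-only sketch of the fix's shape (names and types are illustrative, not the crate's API — the real table keys JSON-RPC ids and carries full span state):

```rust
use std::collections::HashMap;

/// Pending-request table that counts id collisions instead of
/// silently overwriting an in-flight entry.
struct PendingTable {
    pending: HashMap<u64, String>, // request id -> tool name (simplified)
    id_collisions: u64,            // exposed via the stats endpoint
}

impl PendingTable {
    fn new() -> Self {
        Self { pending: HashMap::new(), id_collisions: 0 }
    }

    fn record_request(&mut self, id: u64, tool: String) {
        if self.pending.insert(id, tool).is_some() {
            // A previous request with this id was still in flight.
            // Count it (and, in the real proxy, log a warning) so the
            // orphaned span is visible to operators rather than silent.
            self.id_collisions += 1;
        }
    }

    fn record_response(&mut self, id: u64) -> Option<String> {
        self.pending.remove(&id)
    }
}
```

A fuller fix might key pending requests by (id, sequence number) so both spans survive a collision; the shipped version makes the loss observable, which is the contract the tool actually promises.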

The "internal state almost shipped to crates.io" (Eval I, MED). Cycles A–H didn't run cargo package --list. Cycle I did. Result: .eval-notes-A.md through .eval-notes-I.md, ROUND_LOG.md, and the entire .harness/ directory would have been included in the crate tarball on cargo publish. All of our internal eval state — scratch notes from adversarial evaluators, bug write-ups, round logs — would have been permanently public on crates.io the moment we shipped. Fix: exclude = [".eval-notes-*", "ROUND_LOG.md", ".harness/", ".harness/**", "target/"] in Cargo.toml. cargo package --list is now a mandatory Rust eval checklist item.
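In Cargo.toml terms, the fix is just this (sketch; package name/version metadata elided):

```toml
[package]
exclude = [".eval-notes-*", "ROUND_LOG.md", ".harness/", ".harness/**", "target/"]
```

Running cargo package --list afterward prints exactly what would go into the tarball, which is why it's now a mandatory checklist item.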

Honorable mention: the rustdoc booby trap (Eval F, MED). The crate root had #![deny(rustdoc::broken_intra_doc_links)] and #![deny(warnings)], which meant cargo doc --no-deps failed outright on three warnings: an ambiguous Span (both an enum and a struct in scope), an unqualified Iterator::next reference, and a missing macro@ disambiguator. The first 5 cycles didn't run cargo doc at all and missed it entirely. Once cycle F ran it, the failure was immediate and loud — which is the right design, but only if someone actually runs the command. cargo doc --no-deps -D warnings is now mandatory for every Rust eval cycle.

The toolchain lesson

Cycles A–E couldn't invoke cargo directly: the Rust toolchain was installed at ~/.rustup/toolchains/stable-x86_64-pc-windows-msvc/bin/cargo.exe but not on the bash PATH inherited by subagents, so those cycles worked from source-reading alone and missed the doc-link bug entirely. Cycle F caught it only after the operator manually added cargo to PATH.

Root cause is environmental — Windows + WSL + subagent shell inheritance — but the pipeline fix is straightforward: every Rust eval prompt now documents the full cargo path and requires a cargo --version preflight check before starting. Source-reading is necessary but not sufficient for Rust; the compiler catches a class of bugs that human review cannot.

What went well

  • SLO burn-rate math that matches scipy reference computations, on only the second Rust project. The formulas held against adversarial inputs through J/K/L.
  • Real-child stdio proxy E2E test. Cycle J spawned an actual child process impersonating an MCP server, sent real JSON-RPC frames, captured spans, validated OTLP/JSON round-trips through serde_json schema checks. No mocks at the boundary.
  • cargo publish --dry-run clean before ship. R79 skipped this; R80 made it mandatory and immediately surfaced bug I.
  • Compile-time-first safety posture: #![forbid(unsafe_code)], #![deny(warnings)], #![deny(rustdoc::broken_intra_doc_links)] from day one.
  • Three consecutive clean cycles (J, K, L), each run by a fresh independent evaluator. No regressions introduced by the I fix.

What didn't

  • 12 cycles is a lot. Two counter resets = 6 wasted clean cycles. If the mandatory checklist had been comprehensive from the start (cargo doc, cargo package --list, cargo publish --dry-run), this would have been ~6 cycles total.
  • No async runtime audit until cycle J. The proxy leans heavily on tokio::select! but the first 9 cycles didn't audit for cancellation safety, dropped futures, or select! arm bias. Cycle J did, and found no issues — but that's not the same as proving the absence of issues earlier. Async-runtime audit is now in the standard Rust checklist for any binary using tokio::select!.
  • The publish-hygiene class of bug (Eval I) is exactly the kind of failure eight prior cycles missed because nobody thought to run cargo package --list. A gap in the checklist, not in execution.

Seven new bug patterns got added to the bug-pattern registry — including the SLO burn-rate inversion, which is the kind of domain-specific mathematical bug that no static analyzer will ever catch and that only hostile adversarial evaluation surfaces.

Portfolio fit

mcptrace-rs is the fourth MCP-focused project in the portfolio, and the pieces now fit together:

  • mcptest — tests MCP servers before you deploy them (functional)
  • mcpaudit — reviews MCP servers before you install them (security)
  • mcprouter — routes traffic across MCP servers in production (infrastructure)
  • mcptrace-rs — watches what MCP servers actually do in production (observability)

Total after R80: 39 active V1 projects, 11,274 tests. Rust V1 count: 2.

Two Rust rounds in a row and the V1 pipeline treats Rust as a first-class citizen now — not a V2-only language. The mandatory Rust eval checklist picked up three new items (cargo doc -D warnings, cargo package --list, cargo publish --dry-run) that are going to save future rounds from the same class of hygiene bugs. Round 81's rotation will probably return to Go or TypeScript, but the bar for shipping Rust is now calibrated.