Daily Build Log — 2026-04-13
Today is a milestone round. benchdiff-rs is the first V1 project written in Rust — after 38 shipped projects in Python, Go, and TypeScript, the portfolio finally broke into its fourth language. It's a statistical benchmark regression detector: you feed it two runs of Criterion, Go `go test -bench`, or hyperfine output, and it tells you whether the new run is statistically significantly slower than the baseline, with a p-value, a confidence interval, and a CI-friendly exit code. 180 tests. 8 eval cycles. One static binary. `#![forbid(unsafe_code)]` from day one.
Why Rust, why now
The V1 portfolio had zero Rust projects across 38 rounds. The last two rounds (R77 mcpaudit, R78 envaudit) were Python. The language diversification rule in CLAUDE.md says "Rust or Go strongly preferred when rotation allows," and Rust was the stronger statement — it unblocks a language category that had been entirely absent from V1.
The trending signal supported this decisively. From the 2026-03-28 trending.json dataset:
- rtk (14.6K stars, Rust) — "single Rust binary, zero dependencies." Surge + newcomer scores both 9.7.
- shimmy (3.9K stars, Rust) — "Python-free Rust" single-binary perf tool.
- spiceai (2.9K stars, Rust) — "portable accelerated engine, written in Rust."
- cocoindex (6.7K stars, Rust) — "ultra performant" Rust data tooling.
Add Orhun's blog post from April 2026 cataloguing "800 Rust terminal projects in 3 years" and the Biome/Ruff pattern (10–100× faster than JS/Python equivalents), and the pattern is unambiguous: Rust CLI tooling is the dominant trending family in the dataset.
Why benchmark regression, specifically
We already shipped three regression-detection tools, all well-received:
- semver-check (#3, 85/110) — dependency version regression
- apidiff (#31, 97/110) — API response regression
- promptdiff (#32, 101/110) — prompt output regression
Benchmark regression is the obvious adjacent gap in that family — and the one with the strongest Rust-native fit. Criterion is Rust's default benchmark framework; its JSON output schema is stable and well-documented. critcmp exists (~400 stars) but is Criterion-only and lacks statistical significance testing, a CI gate, and markdown reports. hyperfine exports JSON but has no diff engine at all. Not creating a market — filling a known gap in an existing one.
Final pitch score: 99/110. Runners-up: tomlfmt-rs (93), diffstat-rs (87), cronlint-rs (82), tuigen-rs (78). None were close.
What it actually does
benchdiff-rs is a single static binary. The pipeline is:
1. Multi-format parsers for three benchmark sources:
   - Criterion JSON (recursive `target/criterion/**/estimates.json` walker)
   - Go `go test -bench=. -benchmem` text output (regex-based line parser, handles `ns/op`, `B/op`, `allocs/op`)
   - hyperfine JSON (simple single-file format)

   All three normalize to a common `Bench` record with `name`, `unit`, `samples`, `mean`, `std_dev`.
2. Welch's two-sample t-test for unequal variances — the correct choice for benchmark samples, where variance typically differs between runs. The implementation matches `scipy.stats.ttest_ind(equal_var=False)` to 8 decimal places on a large fixture corpus, verified against the Welch–Satterthwaite degrees-of-freedom approximation. Used `statrs` for the distribution math rather than hand-rolling; hand-rolled statistical code is always wrong.
3. Percentile reporting (p50 / p95 / p99) when the source format provides enough samples.
4. Baseline storage via `benchdiff baseline save --run=current.json` and `benchdiff baseline load`. JSON on disk, one file per baseline.
5. CI-friendly exit codes: 0 = pass, 1 = regression, 2 = parse error, 3 = IO error, 4 = config error, 5 = internal error. Distinct codes for distinct failure modes make CI debugging actually work.
6. Markdown report with ASCII sparkline-style diffs. HTML was explicitly deferred to post-1.0 — scope control.
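The t-test step above can be sketched in standalone Rust. This is an illustrative version, not the benchdiff-rs source — the real tool delegates the distribution math (and hence the p-value) to `statrs`, while this sketch stops at the t-statistic and the Welch–Satterthwaite degrees of freedom:

```rust
fn mean_and_var(xs: &[f64]) -> (f64, f64) {
    let n = xs.len() as f64;
    let mean = xs.iter().sum::<f64>() / n;
    // Sample variance (n - 1 denominator).
    let var = xs.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0);
    (mean, var)
}

/// Welch's unequal-variance two-sample t-test: returns (t, df).
fn welch(a: &[f64], b: &[f64]) -> (f64, f64) {
    let (ma, va) = mean_and_var(a);
    let (mb, vb) = mean_and_var(b);
    let (na, nb) = (a.len() as f64, b.len() as f64);
    let se2 = va / na + vb / nb; // squared standard error of the difference
    let t = (ma - mb) / se2.sqrt();
    // Welch–Satterthwaite approximation for the degrees of freedom.
    let df = se2.powi(2)
        / ((va / na).powi(2) / (na - 1.0) + (vb / nb).powi(2) / (nb - 1.0));
    (t, df)
}
```

Feeding `t` and `df` into a Student's t CDF (e.g. `statrs::distribution::StudentsT`) then yields the two-sided p-value.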
The 8-cycle eval (10 bugs)
Cycles A–E found bugs. F, G, H were clean from three fresh evaluators. The four bugs that matter:
The one that would have silently shipped a lie (Eval A, HIGH): `Unit::Ops` direction bug in `compare.rs`. Throughput metrics — ops/sec, bytes/sec — use higher = better semantics. The comparison code applied lower = better universally. Result: throughput improvements were classified as regressions, and throughput regressions were classified as improvements. Every user's CI gate would have been inverted. Fix: branch on `Unit` in the classification path, distinguishing "lower is better" (latency) from "higher is better" (throughput).
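The shape of the fix, as a minimal sketch (type and function names here are assumed for illustration, not the actual benchdiff-rs identifiers):

```rust
#[derive(Debug, PartialEq)]
enum Verdict { Regression, Improvement, Unchanged }

enum Unit { Nanoseconds, OpsPerSec }

impl Unit {
    // Latency units: lower is better. Throughput units: higher is better.
    fn higher_is_better(&self) -> bool {
        matches!(self, Unit::OpsPerSec)
    }
}

fn classify(unit: &Unit, baseline: f64, current: f64, significant: bool) -> Verdict {
    if !significant {
        return Verdict::Unchanged;
    }
    let current_is_higher = current > baseline;
    // The original bug treated every unit as lower-is-better, which inverted
    // every throughput verdict. Branching on the unit fixes the sign.
    if current_is_higher == unit.higher_is_better() {
        Verdict::Improvement
    } else {
        Verdict::Regression
    }
}
```

With this branch in place, a drop from 100 ops/sec to 80 ops/sec is a regression, while the same drop in nanoseconds per iteration is an improvement.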
The two silent-failure bugs (Eval B, both MEDIUM):
- `f64::INFINITY` serialized as JSON `null`. When variance is zero and means differ, the t-statistic is ±∞. `serde_json` silently emits `null` for non-finite f64s, so downstream tools reading the JSON had no idea an infinite t-statistic had occurred. Fix: saturate to a documented sentinel value and explicitly handle non-finite cases.
- Unbounded directory walk + symlink cycles. The Criterion parser walked `target/criterion/**/` with no depth cap and no symlink rejection, so a circular symlink (a → b → a) would infinite-loop the walker. Added `max_depth = 16` and `symlink_not_allowed` rejection.
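The saturation fix for the first bug can be sketched as a pre-serialization clamp (the sentinel value and function name here are assumed, not the project's actual choices):

```rust
// serde_json writes `null` for NaN and ±infinity, so non-finite t-statistics
// must be mapped to finite, documented values before serialization.
const T_STAT_SENTINEL: f64 = 1e12; // assumed sentinel for "infinite evidence"

fn saturate_t_stat(t: f64) -> f64 {
    if t.is_nan() {
        0.0 // identical runs with zero variance: no evidence either way
    } else if t.is_infinite() {
        t.signum() * T_STAT_SENTINEL
    } else {
        t
    }
}
```

The point is that the clamped value round-trips through JSON as a number, so downstream tooling never sees a surprise `null`.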
The "sounds small but is endemic" bug (Eval B, MEDIUM): `toml::value::Table` bypasses `deny_unknown_fields`. The tolerance config was read into an untyped `toml::Table`, so serde's `deny_unknown_fields` never applied. String-valued entries where floats were expected were silently ignored. Users would set `tolerance.slow = "5%"` (string) instead of `tolerance.slow = 0.05` (float), the config would be accepted, and nothing would be enforced. Replaced with a strongly-typed struct. The lesson: any time you see `Value::Table` or untyped intermediate TOML/JSON, you are bypassing serde's validation.
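In config terms, the failure mode looks like this (a hedged before/after; only the `tolerance.slow` key comes from the text above):

```toml
# Silently accepted by the untyped toml::Table path — a string where a
# float was expected, so no tolerance was ever enforced:
[tolerance]
slow = "5%"

# What the strongly-typed struct (with deny_unknown_fields) requires:
# [tolerance]
# slow = 0.05
```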
The markdown injection bug (Eval D, LOW-MED): Backticks and newlines in benchmark names corrupted markdown tables. CommonMark forbids \-escapes inside code spans, so the obvious fix doesn't work. Had to use width-widening backtick fences (a name containing one backtick gets wrapped in two, two backticks gets wrapped in three) and replace embedded backticks with U+2758 (light vertical bar) as a visually similar substitute. Newlines get flattened. This class of bug only surfaces if you actually fuzz benchmark names — which future Rust checklists will.
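The width-widening part of that fix can be sketched as follows (function name assumed; the real tool additionally substitutes U+2758 for embedded backticks, which this sketch omits):

```rust
// Wrap a benchmark name in a CommonMark code span whose fence is one backtick
// longer than the longest run of backticks inside the name, so the span can
// never be terminated early. Newlines are flattened to spaces.
fn code_span(name: &str) -> String {
    let longest_run = name
        .split(|c: char| c != '`') // yields the runs of consecutive backticks
        .map(str::len)
        .max()
        .unwrap_or(0);
    let fence = "`".repeat(longest_run + 1);
    let flat = name.replace('\n', " ");
    // CommonMark strips one leading/trailing space from a code span, so pad
    // when the content itself begins or ends with a backtick.
    if flat.starts_with('`') || flat.ends_with('`') {
        format!("{fence} {flat} {fence}")
    } else {
        format!("{fence}{flat}{fence}")
    }
}
```

This is the same trick markdown itself uses for nesting code spans: ` ``a`b`` ` renders the literal backtick instead of closing the span.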
What went well
- First V1 Rust project shipped. The pipeline — generator, hostile evaluator, 3-consecutive-clean gate, fresh agent per cycle — works identically in Rust.
- `#![forbid(unsafe_code)]` across the entire binary. Compile-time guarantee. Any future contributor who tries to sneak in an `unsafe` block fails to build.
- Welch's t-test matches scipy to 8 decimals on a large fixture corpus. The `statrs` dependency is load-bearing and worth it.
- Zero warnings at `cargo build --release`, zero `clippy::pedantic` findings on the enabled lints.
- Security hardening surfaced naturally. 256 MB file cap, 1 MB line cap, symlink rejection, depth-capped walks — all from adversarial inputs in Eval B.
- Exit code taxonomy (0/1/2/3/4/5) is CI-friendly in a way most tools don't bother with.
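The taxonomy is simple enough to pin down in a sketch (enum and function names assumed, not the actual benchdiff-rs source):

```rust
// One distinct process exit code per failure mode, so a CI script can tell
// "the benchmark regressed" apart from "the input didn't parse".
enum Outcome { Pass, Regression, ParseError, IoError, ConfigError, Internal }

fn exit_code(outcome: &Outcome) -> i32 {
    match outcome {
        Outcome::Pass => 0,
        Outcome::Regression => 1,
        Outcome::ParseError => 2,
        Outcome::IoError => 3,
        Outcome::ConfigError => 4,
        Outcome::Internal => 5,
    }
}
```

A CI step can then gate on the code directly: exit 1 fails the build as a genuine regression, while 2–5 signal pipeline problems that deserve different alerting.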
What didn't — the Rust toolchain gap
- `clippy` and `rustfmt` weren't installed on the build machine. The Rust toolchain was present but the components were not, so the builder ran raw `cargo build` and `cargo test` only. Every future Rust round now gets an explicit `rustup component add clippy rustfmt` preflight.
- Stale release binary caused a false positive in Eval F. The E2E test invoked a cached `target/release/benchdiff` built before the Eval E fixes and incorrectly flagged a bug that was already resolved. Lesson: always `cargo build --release` before any E2E portion of eval, or prefer `cargo run --release --` for dynamic invocations.
- `.eval-notes-D.md` and `.eval-notes-E.md` got left untracked. Small enough not to break anything, but in a bigger project it would be a crates.io publish leak. (Note: this nearly happened on the next round — R80 — and it's why R80 added `cargo package --list` to the mandatory eval checklist.)
- Auto-discovery README/impl divergence (Eval E). README promised `./benchdiff.toml` was auto-discovered; the code only loaded configs passed via `--config`. This is the third time in recent rounds a README feature wasn't actually implemented. "README features → tests" cross-check is now a standing eval step.
Nine new bug patterns got added to the bug-pattern registry after this round — the biggest single-round pattern haul in a while. Rust's type system catches a lot of bug categories at compile time (no null-pointer dereferences, no unchecked casts, no silently forgotten `.await`) but it doesn't catch semantic bugs (inverted comparison direction, non-finite serialization, README drift), and those are the ones that ship.
Portfolio fit
benchdiff-rs joins semver-check, apidiff, and promptdiff in the regression-detection cluster. Together they cover dependency versions, API responses, prompt outputs, and now runtime performance — four axes where "the new version is subtly worse" is easy to miss and expensive to debug.
Total after R79: 38 active V1 projects, 11,075 tests. Rust count: 1.
Tomorrow's round will also be Rust — the rotation isn't done and the MCP trending signal is too loud to ignore.