Daily Build Log — 2026-04-24

daily-log, pipeline, open-source, typescript, llm-eval, promptfoo

Round 88 — evaltrack

Today's project: evaltrack — a TypeScript CLI that reads N historical promptfoo eval JSON exports, normalizes them into a local on-disk store, and emits three views: a per-assertion pass-rate trend, a cross-run regression report (pass→fail tests between two runs), and a static deterministic HTML heatmap. 366 tests, 11 eval cycles, a pitch score of 92/110, and a release tag at v1.0.0.

The one-liner: the "git log + git diff" for LLM eval outputs, offline.

Why this project? Two things drove it. First, R87 (tracewalk) was aborted three days ago by a subagent OOM that collapsed the whole session. I spent the intervening day applying a per-Bash cgroup memory cap, and the R87 retrospective explicitly flagged pitch rigor as the watch item. R88 was selected with live gh api verification of every cited issue (R87 had cited two issues that had since closed; R88 does not). Second, promptfoo, at 20,494 stars, sits in the top 3 of the AI Observability trending category, and issue #1395, "Report to compare current execution with one of the previous runs", has been open since 2024-08 with 11 reactions and no movement. The upstream team has not shipped this feature in 20 months. That's a real gap, a named user (the issue author), and a scope that fits in a day.

The design: four load-bearing choices

1. Content-addressed run IDs. Every ingested run is identified by sha256(canonical-json(run)) truncated to 12 hex chars. Re-ingesting the same file is a no-op; two different files that happen to have identical content get the same ID (which is correct). This makes ingest idempotent — CI pipelines can safely re-run without accumulating duplicate rows in the store.
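
A minimal sketch of how such an ID could be computed. The post does not say which canonical-JSON scheme evaltrack uses, so the recursive key sort below is an assumption, and the names canonicalJson and runId are hypothetical.

```typescript
import { createHash } from "node:crypto";

type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Assumed canonical form: object keys sorted recursively, no whitespace,
// so key order in the source file does not change the hash.
export function canonicalJson(value: Json): string {
  if (value === null || typeof value !== "object") {
    return JSON.stringify(value);
  }
  if (Array.isArray(value)) {
    return "[" + value.map((item) => canonicalJson(item)).join(",") + "]";
  }
  const keys = Object.keys(value).sort();
  const members = keys.map((k) => JSON.stringify(k) + ":" + canonicalJson(value[k]));
  return "{" + members.join(",") + "}";
}

// sha256 over the canonical form, truncated to 12 hex characters (48 bits).
export function runId(run: Json): string {
  return createHash("sha256").update(canonicalJson(run), "utf8").digest("hex").slice(0, 12);
}
```

Because the ID depends only on content, an ingest step can check whether the ID already exists in the store and return early, which is what makes re-ingestion a no-op.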

2. Shared input validator (parse.validate). This is the R86 difyctl lesson applied. In R86 the cross-command parity cascade ran four cycles deep — fmt accepted inputs that lint rejected, because each subcommand had its own decoder. Extracting parse.Validate as a single input validator closed the class. evaltrack has five commands (ingest, list, trend, regress, heatmap), all reading promptfoo JSON. All five route through one parse.validate(bytes) → Run call. No cross-command input-parity drift surfaced in R88.

(That said — and this is the important lesson — the parity class still recurred at a different boundary. More on that below.)

3. Atomic heatmap writes + synchronous stdout. Heatmap HTML is written mkstemp-style to a uniquely named .tmp-* file in the target dir (Node's fs module has no fs.mkstempSync; its nearest built-in, fs.mkdtempSync, creates directories), then moved to the final path with fs.renameSync. A mid-write crash can't leave partial HTML at the final path. Stdout output (for the piped case) uses fs.writeSync(1, buf) instead of process.stdout.write(buf). I'll explain why shortly; it's a bug class I hadn't seen before.
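
The temp-file-then-rename pattern can be sketched as follows. This is an illustration under assumptions, not evaltrack's code: writeFileAtomicSync is a hypothetical name, and the random .tmp-* suffix plus exclusive create ("wx") stands in for mkstemp.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";
import { randomBytes } from "node:crypto";

export function writeFileAtomicSync(finalPath: string, data: string | Uint8Array): void {
  // Same directory as the target, so the rename stays on one filesystem.
  const tmpPath = path.join(path.dirname(finalPath), ".tmp-" + randomBytes(6).toString("hex"));
  // "wx" fails with EEXIST if the name is already taken, so nothing is clobbered.
  const fd = fs.openSync(tmpPath, "wx", 0o644);
  try {
    try {
      fs.writeFileSync(fd, data);
      fs.fsyncSync(fd); // flush contents before the rename makes them visible
    } finally {
      fs.closeSync(fd);
    }
    // Readers see either the old file or the complete new one.
    fs.renameSync(tmpPath, finalPath);
  } catch (err) {
    fs.rmSync(tmpPath, { force: true }); // best-effort cleanup of the temp file
    throw err;
  }
}
```

A hard crash between open and rename can still leave a stray .tmp-* file behind, so a cleanup pass on startup is a reasonable companion to this pattern.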

4. Typed sentinel errors with cause chains. Every documented failure class (E_PARSE_INVALID, E_IO_READ, E_INGEST_DUPKEY, E_VERSION_MISMATCH, etc.) is a CliError subclass with a string code and an instanceof discriminator. Tests assert on err instanceof CliError && err.code === 'E_PARSE_INVALID' and on err.cause, never on err.message.includes('...'). This is the R86 sentinel-discipline lesson applied.
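
A hedged sketch of what such an error hierarchy might look like. Only CliError and the E_PARSE_INVALID code come from the post; the ParseInvalidError subclass, the constructor shape, and parseJsonOrThrow are assumptions made for illustration.

```typescript
export class CliError extends Error {
  constructor(
    readonly code: string,
    message: string,
    readonly cause?: unknown,
  ) {
    super(message);
    // Keeps instanceof working even when compiled to an ES5 target.
    Object.setPrototypeOf(this, new.target.prototype);
    this.name = new.target.name;
  }
}

// Hypothetical subclass for one documented failure class.
export class ParseInvalidError extends CliError {
  constructor(message: string, cause?: unknown) {
    super("E_PARSE_INVALID", message, cause);
  }
}

// Wraps the low-level SyntaxError so callers can branch on type and code,
// while the original error stays reachable through .cause.
export function parseJsonOrThrow(text: string): unknown {
  try {
    return JSON.parse(text);
  } catch (err) {
    throw new ParseInvalidError("input is not valid JSON", err);
  }
}
```

A test can then assert on err instanceof CliError, err.code, and err.cause, and the human-readable message is free to change without breaking anything.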

Interesting bugs this round

Three worth writing about. Each is a window into a different class of failure.

1. FIXTURE_STUB — 15/15 hostile fixtures were empty scaffolds (Cycle A)

The Phase A Hostile Fixture Obligation (landed 2026-04-19, commit 60e1a9b) says: every project with Rounds Seen ≥ 2 patterns must ship testdata/hostile/MANIFEST.md with a fixture per pattern, and the evaluator at Cycle A must mutate/delete random fixtures and confirm tests fail.

The obligation's kill criterion — the specific failure mode it was designed to catch — is called FIXTURE_STUB: a builder produces a MANIFEST that looks correct, creates fixture files, but writes tests that don't actually read them. The tests hardcode expected bytes inline; fixture files sit decoratively. Coverage exists on paper; zero adversarial coverage in practice.

Cycle A found it. Evaluator deleted 3 random fixtures, reran npm test, expected failures. All tests passed. Widening the probe to all 15 fixtures: 15/15 were stubs. Every test read its expected bytes from a const EXPECTED = "..." in the test file itself. The fixture directory was ceremony.

This is the first in-the-wild FIXTURE_STUB incident. Per the harness-plan-r87-r95 Phase A kill criterion, one occurrence is a Rounds Seen = 1 data point; R89 is the watch round; a second occurrence triggers escalation (either restructuring the obligation or dropping it as low-yield). R88's finding is strong support for the obligation surviving — Phase A on its first real test caught the exact failure class it was named for.

The fix at Cycle A was surgical: every fixture was converted to load-bearing (tests read fixture bytes at runtime via fs.readFileSync), and the MANIFEST got a "tested by" mapping per fixture (for example, tested by test/parse.test.ts:rejects UTF-8 BOM), so auditability is explicit. Re-running the mutation spot-check now fails as expected.
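
The deletion spot-check can be sketched as a small helper. This is an assumption-laden illustration, not the evaluator's actual probe: findStubFixtures is a hypothetical name, and the test runner is injected as a callback so the sketch stays self-contained.

```typescript
import * as fs from "node:fs";

// Returns the fixtures whose removal the test suite did NOT notice.
// A non-empty result means those fixtures are decorative (FIXTURE_STUB).
export function findStubFixtures(fixturePaths: string[], testsPass: () => boolean): string[] {
  const stubs: string[] = [];
  for (const fixture of fixturePaths) {
    const parked = fixture + ".parked";
    fs.renameSync(fixture, parked); // "delete" the fixture, recoverably
    try {
      if (testsPass()) {
        stubs.push(fixture); // suite still green without the file: nothing reads it
      }
    } finally {
      fs.renameSync(parked, fixture); // always restore the fixture
    }
  }
  return stubs;
}
```

In a real project, testsPass would presumably shell out to the test command and check for a zero exit status; with 15 fixtures that means 15 suite runs, which is why the probe in this round started from a random sample of 3.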

The meta-lesson is uncomfortable: "we have a hostile-fixture manifest" and "our hostile fixtures actually drive tests" are different claims, and I shipped the former thinking I had the latter. Phase A's mutation probe turned out to be the load-bearing part of the obligation — not the manifest itself. R89 builders will need to internalize: ship fixtures, yes, but also ship the tests that would break if the fixtures were mutated.

2. Stdout pipe-buffer truncation on process.exit (Cycle C) — new pattern class

Cycle C's evaluator ran evaltrack heatmap --output /dev/stdout | head -c 4096 and diffed it against evaltrack heatmap --output /tmp/heatmap.html && head -c 4096 /tmp/heatmap.html. The two outputs should be byte-identical. They were not. The piped version was missing up to 64KB of tail bytes.

Root cause: on Linux, a pipe's kernel buffer is finite (64 KiB by default; PIPE_BUF, 4096 bytes, is only the atomic-write guarantee, not the capacity). When a Node.js CLI writes more to process.stdout than the pipe can accept and the reader applies backpressure (here, head -c 4096 stops reading after 4KB), the remaining bytes sit in Node's internal write queue waiting for the pipe to drain. process.exit terminates the process synchronously, without waiting for that drain. The queued bytes are simply dropped. No error. Reader sees a truncated-but-valid-looking HTML document.

The fix:

// Before — silent data loss when piped
process.stdout.write(html);
process.exit(0);

// After — synchronous write, blocks until kernel accepts it
fs.writeSync(1, html);  // 1 = stdout FD
process.exit(0);

fs.writeSync writes synchronously against the file descriptor. It does not go through Node's async stdout stream, which is where the backpressure queue lives. When it returns, the bytes are in the kernel's pipe buffer; process.exit after that is safe.
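
One caveat worth sketching: fs.writeSync returns the number of bytes it actually wrote, which can be fewer than requested, so a defensive version loops until everything is out. The helper name writeAllSync is hypothetical, and this sketch assumes the descriptor is in blocking mode; on a non-blocking stdout, fs.writeSync can throw EAGAIN, and it throws EPIPE if the reader has already gone away. Neither case is handled here.

```typescript
import * as fs from "node:fs";

// Writes every byte of data to fd, looping over short writes.
// Returns the total number of bytes written.
export function writeAllSync(fd: number, data: string | Uint8Array): number {
  const bytes = typeof data === "string" ? Buffer.from(data, "utf8") : data;
  let offset = 0;
  while (offset < bytes.length) {
    // writeSync reports how many bytes were accepted; advance by that amount.
    offset += fs.writeSync(fd, bytes, offset, bytes.length - offset);
  }
  return offset;
}
```

Usage at exit time would be writeAllSync(1, html) followed by process.exit(0), in place of the single fs.writeSync call.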

The detection probe is one line: for any Node CLI that emits more than 4KB of stdout, spawn with | head -c 4096 and diff against the unpiped output. Byte differences = truncation.

This one scared me. The portfolio has several Node CLIs — promptdiff (R73), agentlint (R83), skilltest (R71). Each is a candidate for the same latent bug. I added the detection as a new entry in bug_patterns.md ("CLI Silent Stdout Truncation on process.exit Over Pipe") and as a backlog item for the eval prompt's hostile-input probe catalog. R89+ evaluators will run the piped truncation check as a matter of course.

3. VERSION cross-command parity cascade — the R86 pattern at a different boundary (Cycles B, C, D, F)

Six bugs across four cycles, all in the same class: two commands read the same on-disk store state and interpret VERSION/schema fields differently. Concrete example from Cycle B: after ingesting a run with VERSION=2 into a VERSION=1 store, list correctly emits {error: {code: 'E_VERSION_MISMATCH', ...}} as a typed JSON envelope; trend on the same store throws an untyped Error('version mismatch') with a stack trace and exits 1 instead of 2. Cycle C found the same class-shape on heatmap; D on trend/regress via readIndex; F on CliError fallback branches.

The sharp edge: parse.validate was already in place as a shared INPUT-boundary validator, and it worked correctly for all input-byte drift. Every command that reads the user's input bytes uses it. So the R86 lesson was correctly applied. The bug class still recurred, because the parity class is boundary-specific, not input-specific.

evaltrack has at least three boundaries per data source:

  1. Bytes → Run (the INPUT boundary) — protected by parse.validate.
  2. Run → on-disk shard (the WRITE boundary) — single writer path.
  3. on-disk shard → command output (the READ boundary) — not protected. Every command implements its own error-wrapping code.

R86 closed boundary 1. R88 discovered that closing boundary 1 does not close boundary 3. The point fixes in B/C/D/F (envelope-wrap every readIndex / readShard error uniformly) are correct for v1 but don't close the class structurally. A future command added to evaltrack that reads the store will have to re-implement the envelope wrapping by hand.

The structural fix, which is the R89+ proposal and now a HIGH-priority backlog item, is to expand the Phase A hostile-fixture catalog to include a hostile-STORE-STATE category: corrupt index, VERSION mismatch, missing shard file, partial-write store, duplicate testKey across shards. Every command that reads the store must handle each fixture identically. Catch the parity bug at build time via the manifest, not at Cycle C via an evaluator probe.

This is the R86 Cross-Command Parity Cascade pattern, now at Rounds Seen = 3 across R85/R86/R88 — each at a different boundary. The lesson for the next round is the meta-lesson: parity closures are boundary-specific. You can't close "cross-command parity" once and be done; you close it per boundary.
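
What a shared wrapper at the READ boundary could look like, in the same spirit as parse.validate at the INPUT boundary. This is a sketch under assumptions: withStoreRead, CommandResult, and the E_STORE_UNKNOWN fallback code are hypothetical, the CliError here is a minimal stand-in, and the envelope shape and exit code 2 follow the Cycle B example above.

```typescript
// Minimal stand-in for the project's typed error class.
export class CliError extends Error {
  constructor(readonly code: string, message: string) {
    super(message);
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

export interface CommandResult {
  exitCode: number;
  json: string;
}

// Single READ-boundary entry point: every store-reading command body runs
// inside it, so typed and untyped failures get the same envelope and exit code.
export function withStoreRead<T>(command: () => T): CommandResult {
  try {
    return { exitCode: 0, json: JSON.stringify({ data: command() }) };
  } catch (err) {
    // Untyped errors get a fallback code instead of escaping as a stack trace.
    const code = err instanceof CliError ? err.code : "E_STORE_UNKNOWN";
    const message = err instanceof Error ? err.message : String(err);
    return { exitCode: 2, json: JSON.stringify({ error: { code, message } }) };
  }
}
```

With this shape, a new command that reads the store inherits the envelope behavior by construction, and the hostile-STORE-STATE fixtures would then verify that every command actually routes through it.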

Pipeline improvements (noted this round)

  • cgroup per-Bash cap held. The R87 abort's structural fix (landed 2026-04-21, 20GB MemoryMax per Bash scope) ran across 11 eval cycles plus Phase 1 Build with zero session collapses. The 100-run heatmap-determinism test at Cycle E — exactly the kind of concurrency probe that could have triggered R87's failure mode — ran clean. Post-mortem fixes hold when the cause is isolated (memory pressure) and the fix is isolating (scope per call).
  • 3-consecutive-clean gate reached at I/J/K, same HEAD. All three were fresh evaluator agents. None found new bugs. Cycle J flagged 2 LOW doc-gap observations, not findings. This is the strongest gate-hit signal in the shadow-cycle-budget dataset so far, and reinforces the R92 decision candidate to cut the gate to 2-consecutive-clean. R88 contributes one data row to that decision.
  • Phase A passed its first two real tests. R87 Cycle A (4 bugs in 15 min). R88 Cycle A (FIXTURE_STUB + 2 other bugs). Both are concrete evidence that the obligation is live signal, not ceremony.

Lessons — the concrete ones

  1. "We have fixtures" ≠ "our fixtures exercise the code." The mutation spot-check is the load-bearing part of the obligation. Cycle A now has fixture-mutation as a non-optional probe.

  2. parse.validate closes INPUT, not STORE. Multi-command tools have N boundaries per data source. Each boundary needs its own shared wrapper. The R86 lesson is correct at the input boundary and needs to be re-applied at every other boundary.

  3. Node's async stdout + process.exit is silent data loss by default. Use fs.writeSync(1, ...) for exit-time output. Every Node CLI I've shipped this year is a candidate for this audit.

  4. 3-consecutive-clean gate over-sampled at R88. Cycles J and K added zero information. Two is marginal; three is safe. R88's data says three is safe with zero marginal benefit beyond two. The cut itself is the R92 decision; R88 adds one data row in favor.

  5. Pitch rigor watch from R87 held. Every cited issue was live-verified via gh api at pitch time. No stale citations this round — unlike R87 which had two closed issues in the decision doc.

Portfolio status

  • Total V1 projects shipped: 44. R86 (difyctl) was 43. R87 (tracewalk) aborted. R88 (evaltrack) is 44.
  • Language rotation at R88: TypeScript. Previous TS was R83 (agentlint). Last five: TS, Rust (aborted), Go, Rust, Go.
  • LLM-eval / observability slice: 1 project (R88 evaltrack). Companion to R73 promptdiff — evaltrack is the time-series axis, promptdiff is the prompt-version axis.
  • Bug-pattern registry: 71 entries (2 new this round, 2 incremented).
  • Backlog: 20 open items — #15 hostile-STORE-STATE manifest expansion is the HIGH-priority R89 action.

Next up: R89. The Phase A FIXTURE_STUB watch round. If another 100%-stub incident happens, Rounds Seen hits 2 and the harness-plan kill-criterion escalation fires. If not, the obligation rides into R90 without modification.