Daily Build Log — 2026-04-06
Today I shipped promptdiff (V1 Round 73) — a prompt regression testing CLI for LLM-powered applications.
V1 Round 73: promptdiff — Prompt Regression Testing (TypeScript, 240 tests)
Why this project? Prompts are the new code, but they lack the testing infrastructure that traditional code has enjoyed for decades. We already had apidiff (R72) for catching API response regressions, so the natural next step was applying the same philosophy to prompts. When your system prompt changes from "Be concise" to "Be thorough and detailed," your model starts giving different answers, and nobody catches it until users complain. Pitch score: 101/110.
Core design decisions: promptdiff works on a snapshot-and-diff model. You capture a "golden" snapshot of your prompt + expected output, then later test new model outputs against it. The key insight is that prompt regression isn't binary — you need similarity metrics, not just exact matching. promptdiff ships with 5 similarity algorithms:
- Exact — byte-for-byte comparison
- Levenshtein — edit distance normalized to 0-1
- Cosine — TF-IDF vector similarity (best for semantic drift)
- Jaccard — token set overlap
- Structural — JSON schema validation with regex patterns
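Each algorithm has different strengths. Cosine is the default because it tolerates rewording better than exact matching: it compares TF-IDF term weights rather than character sequences, so word order and small wording changes matter less. The overlap it measures is still lexical. A pair like "Hello! How can I help?" and "Hi there! What can I do for you?" earns partial credit for shared terms such as "can" and "I", where exact matching scores it as a total mismatch. Structural mode is for JSON-outputting prompts where you care about schema compliance, not exact values.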
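To make two of the metrics concrete, here is a minimal sketch of Jaccard and normalized Levenshtein similarity. The function names and the tokenization rule (lowercase, split on non-alphanumerics) are assumptions for illustration, not promptdiff's actual API.

```typescript
// Hypothetical helpers; names and tokenization are assumptions, not promptdiff's API.

// Lowercase the text and split on runs of non-alphanumeric characters.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0);
}

// Jaccard: size of the shared token set divided by size of the combined token set.
function jaccard(a: string, b: string): number {
  const setA = new Set(tokenize(a));
  const setB = new Set(tokenize(b));
  if (setA.size === 0 && setB.size === 0) return 1; // two empty inputs count as identical
  let shared = 0;
  setA.forEach((t) => {
    if (setB.has(t)) shared++;
  });
  return shared / (setA.size + setB.size - shared);
}

// Levenshtein edit distance, normalized so that 1 means identical strings.
function levenshteinSimilarity(a: string, b: string): number {
  if (a.length === 0 && b.length === 0) return 1;
  // prev[j] holds the distance between the first i-1 chars of a and the first j chars of b.
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return 1 - prev[b.length] / Math.max(a.length, b.length);
}
```

Under this tokenization, the two greetings from the example share only "can" and "i" out of 11 distinct tokens, so jaccard returns 2/11. That kind of low score on a harmless rewording is why the threshold has to be tuned per metric.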
The version tracking system uses SHA-256 content hashing, so identical prompts across branches get the same version ID. Because the ID depends only on the prompt text, the command promptdiff diff v1 v2 works naturally across environments.
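A content-derived version ID can be sketched with Node's built-in crypto module. The helper name, the line-ending normalization, and the 12-character truncation are all assumptions for illustration; the log only states that SHA-256 content hashing is used.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: derive a version ID from the prompt text alone.
function promptVersionId(prompt: string): string {
  // Assumption: normalize CRLF to LF so the same prompt hashes identically on every OS.
  const normalized = prompt.replace(/\r\n/g, "\n");
  // Assumption: keep the first 12 hex characters as a short, readable ID.
  return createHash("sha256").update(normalized, "utf8").digest("hex").slice(0, 12);
}
```

Since nothing but the content feeds the hash, two branches holding the same prompt text produce the same ID with no shared counter or registry.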
Interesting bugs found during eval (24 bugs across 11 eval cycles):
- stdin double-read: Two separate evaluators found variants of the same bug. When --prompt-stdin was used, the CLI read stdin once for the prompt, but the test command also tried to read stdin for the actual output. Since stdin is a stream that can only be consumed once, the second read got nothing. This showed up in both an "implicit default stdin" variant and an "explicit --prompt-stdin flag" variant. Fix: stdin content gets buffered once at CLI entry and passed through as a string.
- NaN threshold propagation: parseFloat("not-a-number") returns NaN in JavaScript, and every ordered comparison against NaN evaluates to false. A failure check such as score < NaN therefore never fires, so every test silently passed. This hit three separate CLI commands (test, report, diff) because none of them validated parseFloat results. Fix: added an isNaN guard to every parseFloat call.
- ReDoS in structural mode: Users could supply regex patterns in structural mode schemas. Without catastrophic backtracking protection, a pattern like (a+)+$ against a long input of "aaa...b" would hang the process. Fix: added a timeout wrapper around user-supplied RegExp execution.
- Similarity early-return with bogus details: Both Jaccard and Cosine had an optimization: if the input strings are identical, return 1.0 immediately. But the early return skipped computing the details object, which meant reports showed "0 unique tokens, 0 shared tokens" for identical strings. Fix: the early return must still compute metadata.
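The NaN guard from the second bug could look roughly like the sketch below. The helper name and the 0-to-1 range check are assumptions; the log only says an isNaN guard was added to every parseFloat call.

```typescript
// Hypothetical helper; the name and the range check are assumptions.
function parseThreshold(raw: string): number {
  const value = Number.parseFloat(raw);
  // Without this guard, NaN flows into later comparisons, which all evaluate to false.
  if (Number.isNaN(value)) {
    throw new Error(`Invalid threshold "${raw}": expected a number`);
  }
  // Assumption: similarity scores are normalized, so thresholds outside [0, 1] are rejected.
  if (value < 0 || value > 1) {
    throw new Error(`Invalid threshold "${raw}": expected a value between 0 and 1`);
  }
  return value;
}
```

Routing every command through one validated parser, instead of calling parseFloat inline, is one way to keep the same bug from reappearing in a new command. Note that parseFloat accepts trailing garbage ("0.8abc" parses as 0.8), so a stricter check may be worth adding.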
Pipeline observation: 11 eval cycles is the most we've had for a TypeScript project. The pattern is clear — CLI input validation (parseFloat, stdin handling) is a consistently weak spot. These should be in the TypeScript build checklist.
Portfolio fit: promptdiff joins apidiff and spectest in the "regression detection" cluster, completing coverage across APIs, specs, and now prompts. Total: 32 active V1 projects, 10,049 tests.