Daily Build Log — 2026-04-26
R89 dagulint shipped — 44-cycle deep evaluation marathon, 8 architectural refactors, 72 bugs, first user-override-ship
Build daily, learn daily, write it down.
Round 88: evaltrack (TypeScript), a regression and trend CLI for promptfoo eval histories. 366 tests, 11 eval cycles, 17 bugs. The hostile-fixture obligation caught 100% stub-faking on its first real test, and a new bug class surfaced: pipe-buffer truncation on process.exit, to which every Node CLI I've shipped this year is latently vulnerable.
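The pipe-buffer truncation class mentioned for Round 88 can be illustrated with a small sketch. Node's documentation notes that writes to process.stdout are sometimes asynchronous, so calling process.exit() right after a write can cut the output short. The helper below is hypothetical (the names `WritableLike` and `writeThenExit` are mine, not evaltrack's) and shows one common mitigation: exit only from the write callback.

```typescript
// Minimal shape of a writable stream. Node's process.stdout is assumed
// to satisfy it; it is kept abstract here so the sketch can be tested.
interface WritableLike {
  write(chunk: string, callback: () => void): boolean;
}

// Hypothetical helper: hand the final output to the stream, then exit
// only once the stream reports that the chunk has been processed.
function writeThenExit(
  out: WritableLike,
  text: string,
  code: number,
  exit: (code: number) => void,
): void {
  out.write(text, () => exit(code));
}

// Intended use in a CLI (an assumption, not evaltrack's actual code):
//   writeThenExit(process.stdout, report, 1, process.exit);
// A simpler alternative is to set process.exitCode and let the process
// end on its own once pending writes are done.
```

The stream and the exit function are passed in as parameters so the ordering can be checked with a fake stream instead of a real pipe.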
Round 86: difyctl (Go) — Dify workflow DSL linter/differ. 274 tests, 16 eval cycles, 31 bugs. The cross-command parity cascade: 4 cycles of 'fmt accepts what lint rejects' before parse.Validate became the single source of truth.
Round 85: skilldigest (Rust, 343 tests, 33 eval cycles, 47 bugs + 1 architectural refactor). The bug-reporting story of the year — when a class of bugs gets large enough, you refactor instead of patch.
Round 84: mcpbench (Go, 235 tests, 9 eval cycles, 10 bugs fixed). Protocol-aware MCP load tester with per-tool p50/p95/p99, stdio + HTTP/SSE, compare-for-CI.
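For readers unfamiliar with p50/p95/p99, here is a minimal sketch of computing them from raw latencies. It assumes the nearest-rank definition of a percentile; mcpbench may use a different method (interpolation, histograms), and the function names are hypothetical.

```typescript
// Nearest-rank percentile over an ascending-sorted sample.
// p is a whole-number percentage in the range 1..100.
function percentile(sortedMs: number[], p: number): number {
  if (sortedMs.length === 0) throw new Error("no samples");
  // Multiply before dividing so whole-number inputs stay exact.
  const rank = Math.ceil((p * sortedMs.length) / 100);
  return sortedMs[Math.max(rank, 1) - 1];
}

// Summarize one tool's latencies (milliseconds) into the three
// percentiles usually reported by load testers.
function summarize(latenciesMs: number[]): { p50: number; p95: number; p99: number } {
  const sorted = [...latenciesMs].sort((a, b) => a - b);
  return {
    p50: percentile(sorted, 50),
    p95: percentile(sorted, 95),
    p99: percentile(sorted, 99),
  };
}
```

Keeping one latency array per tool name and calling `summarize` on each would give the per-tool breakdown.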
Round 82: ragcheck (Python, 434 tests). Round 83: agentlint (TypeScript, 307 tests). Two projects shipped: a RAG evaluation harness and an AI agent config linter.
Round 81: skillpack — a package manager, lockfile, and bundler for agent skills (SKILL.md / .cursorrules / AGENT.md / skill.yaml). Resolves semver dep
Round 80: mcptrace-rs — a Rust observability proxy for MCP servers. Drop-in between your agent and any MCP server, records every tool-call as a span, exports to OTLP / Zipkin / stdout, enforces SLO burn-rate budgets. 199 tests, 12 eval cycles, two counter resets.
Round 79: benchdiff-rs — the first V1 Rust project. A statistical benchmark regression detector for Criterion, Go bench, and hyperfine output. Welch's t-test matching scipy to 8 decimals, 180 tests, 8 eval cycles, single static binary.
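As a reference for the statistic named in Round 79, here is a minimal sketch of Welch's t-test in TypeScript (not benchdiff-rs's Rust code). It computes only the t statistic and the Welch-Satterthwaite degrees of freedom; turning those into a p-value needs the Student's t CDF, which is left out. Function names are hypothetical.

```typescript
function mean(xs: number[]): number {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Unbiased sample variance (divides by n - 1).
function sampleVariance(xs: number[]): number {
  const m = mean(xs);
  return xs.reduce((sum, x) => sum + (x - m) ** 2, 0) / (xs.length - 1);
}

// Welch's unequal-variance t-test: returns the t statistic and the
// Welch-Satterthwaite approximation of the degrees of freedom.
function welchT(a: number[], b: number[]): { t: number; df: number } {
  if (a.length < 2 || b.length < 2) {
    throw new Error("each sample needs at least two values");
  }
  const sa = sampleVariance(a) / a.length; // squared standard error of mean(a)
  const sb = sampleVariance(b) / b.length; // squared standard error of mean(b)
  if (sa + sb === 0) throw new Error("both samples have zero variance");
  const t = (mean(a) - mean(b)) / Math.sqrt(sa + sb);
  const df = (sa + sb) ** 2 / (sa ** 2 / (a.length - 1) + sb ** 2 / (b.length - 1));
  return { t, df };
}
```

For a benchmark comparison, `a` would be the baseline timings and `b` the candidate timings; a large |t| relative to the df suggests the means really differ.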
Round 78: envaudit — a Python CLI that audits environment variable configurations across .env, CI/CD, Docker Compose, and Kubernetes for leaked secrets, cross-environment drift, and missing required vars. 25 rules, 274 tests, 9 eval cycles.
Round 77: mcpaudit — a static security auditor for MCP server definitions. 20 rules, 327 tests, 92% coverage, hardened across 23 adversarial eval cycles. Plus six new bug patterns about regex false positives, schema inference, and arithmetic overflow.
Round 76: mcprouter — a routing gateway for MCP servers. 10 eval cycles, 17 bugs, and a masterclass in everything that can go wrong with circuit breakers, half-open states, and stdio child processes.
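For context on the half-open state mentioned in Round 76, here is a minimal circuit breaker sketch. It is not mcprouter's implementation; the class, its thresholds, and the single-probe rule are assumptions chosen to show the three states. The clock is injected so the cooldown can be exercised without waiting.

```typescript
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;
  private probeInFlight = false;
  private failureThreshold: number;
  private cooldownMs: number;
  private now: () => number;

  constructor(failureThreshold: number, cooldownMs: number, now: () => number = () => Date.now()) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  // Decide whether a request may go through to the backend.
  allowRequest(): boolean {
    // After the cooldown, an open breaker moves to half-open.
    if (this.state === "open" && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = "half-open";
      this.probeInFlight = false;
    }
    if (this.state === "closed") return true;
    // Half-open admits exactly one probe request at a time.
    if (this.state === "half-open" && !this.probeInFlight) {
      this.probeInFlight = true;
      return true;
    }
    return false;
  }

  // A success closes the breaker and resets the failure count.
  recordSuccess(): void {
    this.state = "closed";
    this.failures = 0;
    this.probeInFlight = false;
  }

  // A failed probe, or too many consecutive failures, opens the breaker.
  recordFailure(): void {
    this.failures += 1;
    if (this.state === "half-open" || this.failures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = this.now();
      this.probeInFlight = false;
    }
  }

  current(): BreakerState {
    return this.state;
  }
}
```

The single-probe rule is where many bugs hide: without the `probeInFlight` flag, every request arriving during half-open would hit a backend that may still be down.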
Round 75: sessaudit — CLI Agent Session Auditing (TypeScript, 205 tests). A security compliance tool for CLI agent sessions with policy enforcement and risk scoring.
Today I shipped agentmem (V1 Round 74) — a testing framework for AI agent memory systems.
Today I shipped promptdiff (V1 Round 73) — a prompt regression testing CLI for LLM-powered applications.
Today I shipped two projects: apidiff (V1 Round 72) and ouroboros-rs (V2 Project 20).
graft compile now outputs to orchestration.md instead of CLAUDE.md. Existing project support, repo cleanup, marketing push.
Graft handles NL sub-steps (review, analysis, planning) within coding workflows. Import existing .claude/ structures with graft import. Positioning refined to natural-language I/O pipelines.
New graft import command reads existing .claude/ structures and generates .gft files. Round-trip validated, 1,722 tests.
Two projects shipped: apidiff (API regression detection, Go, 247 tests) and ouroboros-rs (Socratic execution engine in Rust, 199 tests)
Natural language pipeline generation, Claude Code native integration, runtime quality validation with automatic .gft fix suggestions. 1,712 tests.
Two projects shipped: spectest (TypeScript, OpenAPI spec compliance testing) and edict-rs (Rust reimplementation of edict multi-agent orchestration).
npm publish, 9 bug fixes from deep debugging, e2e Claude Code verification. 1,574 tests. M1 milestone complete.
Conditional edge codegen, rustc-style errors with did-you-mean, graft watch, graft visualize. Honest execution model docs. 1,614 tests.
Two projects shipped: strmtest (Go, streaming protocol testing) and understand-rs (Rust reimplementation of Understand-Anything).
Graft v3.0 adds pluggable codegen backends, field-level memory writes, multi-field partial reads, and failure strategies. 477 tests, 168 ratchet decisions, 8 rounds — half without any debate.
Graft v3.1 adds LSP autocomplete, programmatic API exports, parallel failure fix, and fallback cycle detection. 537 tests, 172 ratchets, 5 rounds — only 1 needed debate.
Graft v3.2 adds panic-mode parser error recovery, keyword hover docs, LRU cache eviction, and import completion wiring. 582 tests, 180 ratchets, 4 rounds — under 10 agent calls.
Graft v3.3 adds LSP code actions for auto-importing undefined references, runtime conditional edge routing, document symbols, and condition type validation. 636 tests, 190 ratchets, 4 rounds.
Graft v3.4 adds LSP rename support, conditional edge token estimation, hierarchical document symbols, and splits the LSP features module. 690 tests, 200 ratchets, 4 rounds.
Graft v3.5 hardens the rename feature with comment/string filtering, CRLF normalization, cross-file conflict detection, and adds FlowNode locations to the parser. 739 tests, ~210 ratchets, 4 rounds.
Graft v3.6 adds LSP find-all-references, unifies the keyword system, fixes document symbol ranges, and catches the first NEEDS_CHANGES in 24 rounds. 790 tests, ~220 ratchets, 4 rounds.
Foreach failure handling and multi-hop conditional routing. 832 tests, 3 bugs caught by dual-agent debate, first runtime-focused version since v3.1.
Closing the longest-running tech debt, extracting the densest code block, and the first clean sweep in 7 rounds. 864 tests, ~10 agent calls, 0 NEEDS_CHANGES.
Final v3.x release — 890 tests, conditional edge transforms, tech debt zero. 3 rounds, ~9 agent calls.
First major version bump — let bindings, expression evaluation, graph parameters, graph calls. 980 tests, 6 rounds, 33 new ratchets.
Quality-only release — multi-segment condition fix, output isolation, scope extraction, exhaustive switches. 1,001 tests, 4 rounds, 0 new features.
Built-in functions (len, max, min, str), graph call return values, equality unification. 1,048 tests, 4 rounds, 2 new error codes.
Multiplication, modulo, abs/round/keys builtins, division precedence fix. 1,077 tests, 3 rounds, 17 consecutive PASS.
Template expressions, evaluateExpr extraction, BUILTIN_FUNCTIONS enrichment. 1,123 tests, 4 rounds, 21 consecutive PASS.
Adding parallel and foreach flow control to Graft with an adaptive adversarial debate process. The debate even found a bug in the benchmark file itself. 135 tests, 58 ratchet-locked decisions.
Adding graft run — compile and execute AI pipelines by spawning Claude Code subagents. A4-Specialist proposed reusing generateAgent(), then self-retracted as forced dissenter. A3-Skeptic caught two showstoppers. 171 tests, 70 ratchet-locked decisions.
Multi-file imports and persistent memory across pipeline runs. 5 debate rounds across all 7 pipeline stages, foreach memory staleness bug caught by A3-Skeptic alone, and a 3:1 vote for field-matching merge over full overwrite. 249 tests, 92 ratchet-locked decisions.
Quality release with runtime token tracking, 4 correctness fixes, and shared module extraction. 288 tests, 21 agent calls (42% under budget), zero review failures.
Developer tooling release: LSP server with diagnostics/hover/go-to-definition, VS Code extension, npm distribution as @graft-lang/graft. 376 tests, ~33 agent calls, 132 ratchet-locked decisions.
Today was a milestone day for the daily-challenge project — we shipped 7 projects total, including our first-ever Rust reimplementation in the V2 pipe
From idea to working compiler in one sprint. 4 AI agents argued over every design decision, caught 3 critical bugs, and produced a graph-native language that compiles .gft to Claude Code harness structures. 110 tests, 46 ratchet-locked decisions.
I reimplemented rtk (14.6K stars, Rust) in Go with zero dependencies. Here's the performance comparison and what I discovered.
Rewrote the pipeline infrastructure: Node.js safety hooks, bug pattern auto-learning, 11 Claude Code skills, 6 plugins published. Then shipped 7 more projects to reach 57 total.
Built 14 open-source tools in a single session, reaching the 50-project milestone. Redesigned the scoring engine, added safety gates, and taught the pipeline to evolve itself.
How I built a pipeline that produces one open-source developer tool every day — 35 projects and 10,777 tests so far — by combining AI agents, adversarial evaluation, and trend data.