Daily Build Log — 2026-04-16
Round 82: ragcheck -- Offline RAG Retrieval Quality Harness
Python, 434 tests, 9 eval cycles, 34 bugs fixed.
ragcheck is an offline, LLM-free RAG retrieval quality harness. It computes recall@k, precision@k, MRR, and nDCG against labeled ground truth datasets -- no LLM-as-judge required, making it deterministic, fast, and free to run in CI.
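The core metrics are pure functions of a retrieved list and a labeled relevant set, which is what makes the harness deterministic. A minimal sketch in Python (illustrative names and signatures, not ragcheck's actual API):

```python
# Illustrative sketches of the retrieval metrics ragcheck reports; the
# function names and signatures here are assumptions, not ragcheck's API.

def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant docs that appear in the top-k results."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant) if relevant else 0.0

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k results that are relevant."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / k if k else 0.0

def mrr(retrieved, relevant):
    """Reciprocal rank of the first relevant document (0 if none found)."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0
```

Because nothing here calls a model or the network, the same labeled dataset always produces the same scores -- which is exactly what a CI gate needs.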
Why ragcheck?
The RAG evaluation space is dominated by LLM-based judges (RAGAS, DeepEval, etc.) that are non-deterministic, slow, and expensive. If you just need to know "did my retriever find the right documents?", you don't need an LLM to answer that. ragcheck focuses on retrieval quality metrics using standard IR evaluation methodology.
Interesting bugs
The 34 bugs were the highest count in recent rounds. Some highlights:
- nDCG exceeding 1.0 when the retrieved list contained duplicates. The fix was to maintain a seen-set and assign gain=0 to duplicate retrievals.
- Symlink/junction filesystem escape in the directory walker -- a classic security issue where symbolic links could escape the project root.
- Atomic cache writes for concurrent safety -- the cache file was being written non-atomically, risking corruption under concurrent test runs.
- CRLF determinism in JSON output -- open("w") without newline="\n" produced platform-dependent output.
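The nDCG duplicate bug is worth a sketch. A hypothetical reconstruction of the fix (not ragcheck's actual code): duplicates contribute zero gain, so a repeated relevant document can no longer push the score past the ideal DCG.

```python
import math

def ndcg_at_k(retrieved, relevance, k):
    """nDCG@k with a seen-set: duplicate retrievals get gain=0, so a
    repeated relevant document cannot push the score above 1.0.
    Hypothetical sketch, not ragcheck's actual implementation."""
    seen, dcg = set(), 0.0
    for i, doc in enumerate(retrieved[:k]):
        if doc in seen:
            continue  # duplicate retrieval: gain stays 0
        seen.add(doc)
        dcg += relevance.get(doc, 0.0) / math.log2(i + 2)
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg else 0.0
```

Without the seen-set, retrieving the same fully-relevant document twice yields DCG = 1 + 1/log2(3) ≈ 1.63 against an ideal DCG of 1.0 -- hence scores above 1.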
Round 83: agentlint -- AI Agent Config Validation CLI
TypeScript, 307 tests, 8 eval cycles, 10 bugs fixed. Score: 101/110.
agentlint is ESLint for the agent harness layer. It validates CLAUDE.md, AGENTS.md, SOUL.md, .cursorrules, MCP configs, and hook scripts for structural correctness, dead references, permission conflicts, security anti-patterns, and cross-file consistency.
Why agentlint?
The agent harness ecosystem has exploded in 2026. everything-claude-code has 113K+ stars. The AGENTS.md standard (Linux Foundation) is in 60K+ open-source projects. awesome-claude-code catalogs hundreds of skills and hooks. Every serious project ships agent config files -- but there was no open-source CLI to validate them in CI.
Three tools existed: cclint (Claude-only, no SARIF), AgentLinter.com (SaaS, not self-hosted), and agnix (VS Code extension, no CI). agentlint fills the gap as a standalone CLI that validates all major formats in one pass with JSON and SARIF output.
Core design decisions
- 20 rules across 5 categories: structural (valid JSON/YAML, no duplicate agents), reference (file paths exist, models are valid, hook scripts exist), security (no secrets, no dangerous shell patterns, no shell injection), consistency (CLAUDE.md tools match MCP config, settings don't contradict hooks), and style (heading hierarchy, trailing whitespace, naming conventions, line length).
- SARIF 2.1.0 output for GitHub Code Scanning integration. Run agentlint lint . --format sarif > results.sarif and upload with github/codeql-action/upload-sarif.
- Cross-file consistency is the differentiator. agentlint checks that a tool referenced in CLAUDE.md actually exists in .mcp.json, that a hook script referenced in settings.json exists on disk, and that .cursorrules and CLAUDE.md don't contain contradictory instructions.
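The CLAUDE.md-vs-.mcp.json check is simple to illustrate. A minimal sketch in Python (agentlint itself is TypeScript; the mcp__&lt;server&gt;__&lt;tool&gt; naming convention and the top-level "mcpServers" key are assumptions about the config shapes, not confirmed agentlint internals):

```python
import json
import re

def undeclared_mcp_servers(claude_md: str, mcp_json: str) -> list:
    """Report MCP servers referenced in CLAUDE.md but missing from
    .mcp.json. Assumes tools are referenced as mcp__<server>__<tool>
    and servers are declared under a top-level "mcpServers" key."""
    declared = set(json.loads(mcp_json).get("mcpServers", {}))
    referenced = set(re.findall(r"mcp__([A-Za-z0-9-]+)__", claude_md))
    return sorted(referenced - declared)
```

The same shape of check -- extract references from one file, resolve them against another -- covers the hook-script and settings cases as well.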
Interesting bugs from eval
B-1 (HIGH): CRLF line endings break ALL markdown parsing. The markdown parser used content.split("\n") which produces lines with trailing \r on Windows. Every regex using $ (heading detection, code fence detection, link parsing) silently failed. A CLAUDE.md saved on Windows would produce zero structural warnings -- a false "clean" result. Fixed by stripping \r in the scanner after splitting.
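The failure mode is easy to reproduce (Python used here for illustration; agentlint is TypeScript):

```python
# Reproducing B-1: splitting CRLF content on "\n" leaves a trailing "\r"
# on every line, so exact $-anchored matches like code-fence detection
# silently fail -- $ matches before a final "\n", never before "\r".
import re

content = "```\r\nconst x = 1;\r\n```\r\n"   # file saved with CRLF endings
fence = re.compile(r"^```$")

raw_lines = content.split("\n")               # each line keeps its "\r"
fixed_lines = [line.rstrip("\r") for line in raw_lines]  # the applied fix
```

content.splitlines() would also sidestep the bug, at the cost of splitting on less common line terminators too; stripping \r after the split keeps the scanner's existing line-numbering behavior.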
D-1 (MEDIUM): Defense-in-depth ordering error. The isExampleLine() check ran at the line level BEFORE pattern matching, but the isStructuralToken bypass (which should always detect real AWS keys, GitHub PATs, etc.) was at the per-match level. Lines containing both a shell variable (${HOME}) and a real AWS access key (AKIAIOSFODNN7EXAMPLE) would have the line skipped as "example" before the structural token check could fire. Fixed by moving isExampleLine to per-match level behind the isStructuralToken guard.
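A sketch of the corrected ordering (hypothetical names and a deliberately tiny rule set; the real checker has many more patterns):

```python
import re

# Structural tokens: formats that are unambiguously real credentials.
STRUCTURAL = [re.compile(r"\bAKIA[0-9A-Z]{16}\b"),      # AWS access key
              re.compile(r"\bghp_[A-Za-z0-9]{36}\b")]   # GitHub PAT
# Generic pattern: high false-positive rate, needs the example heuristic.
GENERIC = re.compile(r"(?i)\b(?:secret|token)\s*=\s*(\S+)")

def is_example_line(line):
    return "${" in line  # shell-variable placeholder heuristic

def find_secrets(line):
    # Structural findings are collected first and are never suppressed
    # by the example heuristic -- the D-1 fix.
    findings = [m.group(0) for p in STRUCTURAL for m in p.finditer(line)]
    for m in GENERIC.finditer(line):
        # The example heuristic now runs per match, not per line.
        if not is_example_line(line):
            findings.append(m.group(1))
    return findings
```

With the original ordering, the ${HOME} placeholder caused the whole line to be skipped before the AKIA pattern ever ran; here the structural pass is unconditional.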
E-1 (MEDIUM): Doc-drift badge test was a no-op. The test used execSync("npx vitest list") to count tests, but running vitest inside vitest returns 0. The guard if (actualCount > 0) silently skipped the assertion, meaning the README badge could show any number. Fixed by using \bit\(/g regex counting across test files.
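The replacement counting approach is trivial to sketch (illustrative; the real test reads the project's test files from disk):

```python
# Sketch of the E-1 fix: count test cases with a regex over test-file
# source instead of shelling out to vitest, which reports 0 tests when
# invoked from inside an already-running vitest process.
import re

def count_it_blocks(source):
    return len(re.findall(r"\bit\(", source))
```

A static count can never return the spurious 0 that made the original guard a no-op, so the badge assertion always actually runs.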
Pipeline improvements applied
- Added "prepublishOnly: npm run build" to TypeScript checklist (2 occurrences: B-2 and D-2)
- Added "filter applied to data but not metadata" to cross-language bug patterns (2+ occurrences: R75 sessaudit + R83)
- Added CRLF regex anchor pattern to cross-language bug patterns
- Gap Analyst withdrawn for 4th consecutive round (gaps.json empty R80-R83)
Portfolio status
- V1: 42 active projects, 62 total (including 20 V2)
- Total tests: 12,636
- Languages: TS 14, Go 15, Python 10, Rust 2, JS 1
- Latest: agentlint (TS), ragcheck (Python)