Daily Build Log — 2026-04-18

Round 85: skilldigest — and why 47 bugs in 30 cycles meant we were patching the wrong thing

Rust, edition 2021, 343 tests, 33 eval cycles, 47 bugs + 1 refactor. Score: 96/110.

This is going to be a bug-reporting post. Round 85 is the longest eval run we've ever done — 33 cycles vs the previous record of 14 — and along the way it surfaced more bugs than the last three rounds combined. Most of them are interesting in the way that any individual parser bug is interesting, which is to say: not very. But the shape of how they arrived is the whole story, because what we eventually learned is that when you keep finding bugs in the same file in the same class, the bug isn't the bug — the architecture is the bug, and no amount of patching will close it. You have to refactor.

I'll get to that. First, let me show you what skilldigest actually is, then walk you through the bug-finding journey by family.

What we shipped

skilldigest is a static analyzer for AI coding-assistant skill libraries — SKILL.md, AGENTS.md, .cursorrules, CLAUDE.md, agent plugin trees, etc. It walks a directory of skills, builds a reference graph, measures per-skill token cost with a tiktoken-compatible BPE, and reports dead (declared but never referenced), bloated (over budget), conflicting (two skills with contradictory rules), stale (linked file missing), and cyclic (skills referencing each other in a cycle) skills, plus a recommended loadout for a given task tag. Single static Rust binary. SARIF output drops straight into GitHub code-scanning.
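To make "dead" and "cyclic" concrete, here is a minimal std-only sketch of both checks. It assumes the reference graph is a map from skill id to the ids it links to. RefGraph, dead_skills, has_cycle, and the explicit roots argument are names invented for this post, not skilldigest's real internals.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical graph shape: skill id -> ids that skill references.
pub type RefGraph = BTreeMap<String, Vec<String>>;

/// Declared skills that no other skill references and that are not roots.
/// Assumption: a self-reference does not keep a skill alive.
pub fn dead_skills(graph: &RefGraph, roots: &[&str]) -> Vec<String> {
    let mut referenced: BTreeSet<&str> = roots.iter().copied().collect();
    for (from, targets) in graph {
        for to in targets {
            if to != from {
                referenced.insert(to.as_str());
            }
        }
    }
    graph
        .keys()
        .filter(|id| !referenced.contains(id.as_str()))
        .cloned()
        .collect()
}

/// Depth-first search with three states per node.
/// A self-loop counts as a cycle of length 1.
pub fn has_cycle(graph: &RefGraph) -> bool {
    // 1 = on the current DFS path, 2 = fully explored, absent = unvisited.
    fn visit<'a>(
        node: &'a str,
        graph: &'a RefGraph,
        state: &mut BTreeMap<&'a str, u8>,
    ) -> bool {
        match state.get(node).copied() {
            Some(1) => return true,  // back edge: we returned to the path
            Some(_) => return false, // already explored, no cycle through here
            None => {}
        }
        state.insert(node, 1);
        if let Some(targets) = graph.get(node) {
            for target in targets {
                if visit(target, graph, state) {
                    return true;
                }
            }
        }
        state.insert(node, 2);
        false
    }

    let mut state = BTreeMap::new();
    graph.keys().any(|id| visit(id, graph, &mut state))
}
```

Links to ids that are not in the map (the "stale" case) are simply skipped by has_cycle; a real implementation would report them separately.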

Twelve SKILL rules: SKILL001 dead, SKILL002 bloated, SKILL003 conflict, SKILL004 stale, SKILL005 cycle, SKILL006 oversize, SKILL007 non-utf8, SKILL008 bad-frontmatter, SKILL009 symlink, SKILL010 duplicate, SKILL011 path-escape, SKILL012 total-bloated. Three tokenizers: cl100k, o200k, llama3. Five output formats: text, JSON, SARIF 2.1.0, Markdown, GraphViz dot. Four subcommands: scan, tokens, loadout, graph.

Why this exists: AI skill libraries exploded in 2026. antigravity-awesome-skills ships 1,400+ skills. Vibe-Skills ships 340+. claude-skills ships 232+. None of them have any way to tell you which ones are dead, which ones blow your context budget, or which two skills disagree about how to use Bash(jq). skilldigest answers all of that.

33 cycles, 47 bugs — the cycle map

Here's the journey:

A: 3 bugs   E: clean        I: 1 bug    M: 1 bug    Q: 2 bugs   U: 1 bug    Y: 1 bug
B: 3 bugs   F: 1 bug        J: 2 bugs   N: 1 bug    R: 1 bug    V: 1 bug    Z: 3 bugs
C: 3 bugs   G: 1 bug        K: 1 bug    O: 1 bug    S: 2 bugs   W: 1 bug    AA: 1 bug
D: 3 bugs   H: 3 bugs       L: 2 bugs   P: 1 bug    T: 1 bug    X: 1 bug    BB: 1 bug
                                                                            CC: 2 bugs
                          DD: 0 bugs — REFACTOR (consolidate parse.rs onto pulldown-cmark events)
                          EE: clean    FF: clean    GG: clean → PASS

Now look at U through CC. That's nine consecutive cycles, every one finding a bug in the same file (src/parse.rs), every one in the same class (CommonMark §4.5 fence rules or its lockstep partner). If you'd asked me at cycle Y "is there an architectural problem here?" I would have said no, these are just edge cases. By cycle CC the answer was unmistakably yes. We waited too long. Let me show you why.

Bug families

The 47 bugs fall into five families, plus a handful of one-offs. The families overlap: a few cycles show up in more than one. Let me walk through each family with the most interesting bug from it explained in detail.

Family 1: Silent fallbacks (D / F / K / S — 7 bugs)

The pattern: tool encounters something it doesn't understand, silently ignores it, returns success. User thinks the configuration took effect. It didn't.

The most embarrassing was D, where the config loader accepted any TOML key without complaint. So [budget]\nper_skil = 2000 (note the typo: "per_skil", not "per_skill") would parse fine, set no budget, and run with the default. The user gets the default budget and never sees a warning because their typo'd key got silently dropped on the floor by toml::value::Table deserialization. We've seen this exact pattern before: it's pattern "toml::value::Table Bypasses deny_unknown_fields" from R79 (benchdiff-rs). Replacing the untyped TOML map with a strongly-typed struct + serde(deny_unknown_fields) made unknown keys a hard error.

The most subtle was K: if the user passed --config /nonexistent, skilldigest fell back to the default config and ran anyway. No warning. The user thinks their custom config was used. The convention we want to follow is that an explicitly supplied config path that doesn't exist is a hard error; falling back to defaults is only acceptable for implicit discovery. The fix is one line: if explicit { fail } instead of if path.exists() && let Ok(c) = read(path) { use c }.

S was a lockstep variant of the same pattern: the tokens subcommand would silently accept malformed tokenizer config and run with the default tokenizer. Same fix shape: validate at parse time, fail loudly, never fall back from explicit user input.
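The "never fall back from explicit user input" rule can be sketched as follows. This is a minimal std-only illustration, not skilldigest's actual loader: ConfigSource, resolve_config, and the idea of passing discovery candidates as a slice are all assumptions made for the example.

```rust
use std::path::{Path, PathBuf};

/// Where the effective configuration came from (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum ConfigSource {
    Explicit(PathBuf),
    Discovered(PathBuf),
    Defaults,
}

/// An explicit path must exist or the run fails.
/// Implicit discovery candidates may be missing; only then do defaults apply.
pub fn resolve_config(
    explicit: Option<&Path>,
    discovered: &[PathBuf],
) -> Result<ConfigSource, String> {
    if let Some(path) = explicit {
        if !path.is_file() {
            // The user asked for this file by name: do not guess on their behalf.
            return Err(format!("config file not found: {}", path.display()));
        }
        return Ok(ConfigSource::Explicit(path.to_path_buf()));
    }
    for candidate in discovered {
        if candidate.is_file() {
            return Ok(ConfigSource::Discovered(candidate.clone()));
        }
    }
    Ok(ConfigSource::Defaults)
}
```

The point of returning a ConfigSource rather than just the parsed config is that the caller can print which source won, which makes a silent fallback visible even when it is legitimate.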

Family 2: Lockstep contracts (Q / O / S / AA / BB / CC — 6 bugs)

This is where things start getting interesting. A lockstep contract is "two functions must agree on a shape that no type or test enforces." Six distinct ones in this round, from three different sub-shapes:

Id-derivation lockstep (O, Q): derive_skill_id is the canonical function for converting a file path into a skill id (e.g., ~/.claude/skills/refactor-tests/SKILL.md → refactor-tests). But resolve_link_target and the AGENTS.md walker had their own derivation logic. They disagreed on edge cases — what to do with AGENT.md (singular) vs AGENTS.md, what to do with link fragments and query strings, what to do with index files. Each disagreement was a bug.
Serialization-roundtrip lockstep (S): there's a Skill struct and a SkillSummary struct (the latter is what gets serialized to JSON output). Reconstructing a Skill from a SkillSummary lost the description-scoring metadata, so anything downstream that depended on description score (loadout recommendation, conflict ranking) silently got zero scores and produced subtly wrong output.
Two-walker lockstep (AA, BB, CC): extract_refs_and_rules and scan_wiki_links_raw both walk markdown line-by-line implementing CommonMark §4.5 fence detection independently. Every fix to one surfaced a new bug in the other. More on this below — it's the big one.

The fix that would have prevented all of these is type enforcement. SkillId should be a newtype, not a String, and only one function should be allowed to construct it. Skill::round_trip(s) should be a property test that runs every change through s → SkillSummary → Skill and asserts equality. The two CommonMark walkers should not exist as parallel implementations at all.
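Here is roughly what the newtype half of that looks like. It is a sketch under assumptions: the module name, from_path, and the exact derivation rules (strip fragment and query, use the directory name for SKILL.md, otherwise the file stem) are illustrative guesses, not the rules derive_skill_id actually implements.

```rust
mod skill_id {
    /// Opaque skill identifier. The inner String is private to this module,
    /// so code elsewhere cannot build a SkillId from an arbitrary string.
    #[derive(Debug, Clone, PartialEq, Eq, Hash)]
    pub struct SkillId(String);

    impl SkillId {
        /// The single place where a path or link target becomes an id.
        pub fn from_path(raw: &str) -> Option<SkillId> {
            // Drop "#fragment" and "?query" before looking at the path.
            let path = raw
                .split(|c: char| c == '#' || c == '?')
                .next()
                .unwrap_or("");
            // Walk path components from the end, ignoring empty ones.
            let mut parts = path.rsplit('/').filter(|p| !p.is_empty());
            let file = parts.next()?;
            let stem = file.strip_suffix(".md").unwrap_or(file);
            // Assumed rule: SKILL.md is identified by its parent directory.
            let id = if stem == "SKILL" { parts.next()? } else { stem };
            Some(SkillId(id.to_string()))
        }

        pub fn as_str(&self) -> &str {
            &self.0
        }
    }
}
```

With this shape, resolve_link_target and the AGENTS.md walker cannot carry their own derivation logic: the only way for them to get a SkillId is to call skill_id::SkillId::from_path, so a disagreement becomes a compile error rather than an eval finding.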

Family 3: Parser drops (G / H / I — 4 bugs)

pulldown-cmark is the de facto Rust CommonMark parser, and it's mostly correct, but it has a few peculiarities. The biggest is that when it sees an unknown link syntax — like [[wiki-link]] (Obsidian / Roam / Foam style, common in skill libraries that index each other) — it emits the bracket characters as literal text and silently drops the wiki-link from the parse tree. That makes pulldown-cmark's event stream useless for extracting wiki-link references, which is why we ended up with a second walker (scan_wiki_links_raw) that scans the raw markdown source for [[...]] patterns.

G caught this initially: the extract_refs_and_rules walker only consumed the pulldown-cmark event stream, so wiki-link references were dropped from the reference graph entirely. H found that [[wiki|alias]] (pipe-aliased wiki links) wasn't being aliased correctly. I found that [[wiki#anchor]] (with heading anchors) wasn't being resolved to the right skill id. Each of these was a plausible patch (add the wiki-link walker, handle the pipe, strip the anchor), but adding a second walker is what set up the entire X-through-CC cascade. We didn't see that yet.
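For reference, the three wiki-link shapes from G, H, and I can be handled by one small parser. The sketch below is std-only and assumes the Obsidian-style layout [[target#anchor|alias]]; WikiLink, parse_wiki_link, and wiki_links are hypothetical names, and the scanner deliberately has no notion of code blocks.

```rust
/// The parts of one wiki link, borrowed from the source text.
#[derive(Debug, PartialEq)]
pub struct WikiLink<'a> {
    pub target: &'a str,
    pub anchor: Option<&'a str>,
    pub alias: Option<&'a str>,
}

/// Parses the text found between "[[" and "]]".
pub fn parse_wiki_link(inner: &str) -> Option<WikiLink<'_>> {
    // Everything after the first '|' is the display alias.
    let (link, alias) = match inner.split_once('|') {
        Some((link, alias)) => (link, Some(alias.trim())),
        None => (inner, None),
    };
    // Everything after the first '#' is a heading anchor, not part of the id.
    let (target, anchor) = match link.split_once('#') {
        Some((target, anchor)) => (target.trim(), Some(anchor.trim())),
        None => (link.trim(), None),
    };
    if target.is_empty() {
        return None;
    }
    Some(WikiLink { target, anchor, alias })
}

/// Collects every well-formed [[...]] in a piece of text.
/// No fence awareness here: the caller decides which text is eligible.
pub fn wiki_links(text: &str) -> Vec<WikiLink<'_>> {
    let mut out = Vec::new();
    let mut rest = text;
    while let Some(start) = rest.find("[[") {
        let after = &rest[start + 2..];
        match after.find("]]") {
            Some(end) => {
                if let Some(link) = parse_wiki_link(&after[..end]) {
                    out.push(link);
                }
                rest = &after[end + 2..];
            }
            None => break, // unterminated link: stop scanning
        }
    }
    out
}
```

Keeping the "is this text inside a code block?" decision out of this function is the design choice that matters: the parser stays trivial, and only one place has to know about fences.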

Family 4: Doc drift (B / F / Q / R / U / Z — 6 bugs)

README and CHANGELOG drifted from reality across cycles. F is the funniest — README claimed SKILLDIGEST_EMIT_TIMESTAMP=1 would include timestamps in the output, except that env var was never implemented. The text had been copy-pasted from another project's README during initial build and nobody noticed. The user follows the docs, sets the env var, sees no timestamps, files an issue. We fixed by removing the false claim from README — nobody actually needed timestamps in the output, the env var existed only as a sentence in markdown.

R was another doc bug: README's global-flags table claimed --offline did something, when in fact it's a no-op (skilldigest never performs network I/O at scan time because tokenizer data is bundled in the binary). The flag is retained for forward compatibility; it doesn't do anything wrong, but the docs claimed it did something it doesn't. The fix was rewriting the README table cell to explicitly say "no-op, retained for forward compatibility." Documentation that lies is worse than missing documentation.

Family 5: The CommonMark §4.5 fence-rule cascade (U / V / W / X / Y / Z / AA / BB / CC; 12 bugs across 9 cycles)

This is the headline. Let me explain it slowly because it's the most important bug-finding lesson of the round.

CommonMark §4.5 is the spec for fenced code blocks. It has a lot of edge cases:

The opening triple-backtick may have 0–3 leading spaces. 4+ spaces makes the line content of an indented code block instead of opening a fence.
A tab prefix on the opener is never a fence.
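A closing fence cannot have an info string (the text after the fence chars); only trailing spaces or tabs may follow it. This holds for both backtick and ~~~ closers. The closer must also be at least as long as the opener.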
An opening triple-backtick whose info string contains a backtick is not a fence.
A backtick fence opener must be closed by a backtick fence — ~~~ doesn't close it. Same for tilde.
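An indented code block cannot interrupt a paragraph: a 4+-space-indented line opens one only when the preceding paragraph has already ended (typically after a blank line); otherwise it is a paragraph continuation line. Either way, it is not a fence.
Headings, horizontal rules, and blank lines end the current paragraph, which is what decides whether a following indented line is code. They do not close an open fence: inside a fence they are literal content, and the fence runs until a matching closer or the end of its container or the document.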
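To make the opener and closer rules concrete, here is the kind of line-level check both walkers were carrying by hand. It is a std-only sketch with hypothetical names (Fence, parse_opener, is_closer), it follows the CommonMark rules that a closer of either character takes no info string and must be at least as long as the opener, and it assumes lines arrive without their line terminator. It covers only the opener and closer rules, not paragraph or container context.

```rust
/// A recognized fence opener (hypothetical type for this sketch).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Fence {
    pub ch: char,   // '`' or '~'
    pub len: usize, // run length, always >= 3
}

/// Strips 0 to 3 leading spaces. Returns None for 4+ spaces or for a tab
/// in the indent, since both push the line into indented-code territory.
fn strip_fence_indent(line: &str) -> Option<&str> {
    let rest = line.trim_start_matches(' ');
    let indent = line.len() - rest.len();
    if indent > 3 || rest.starts_with('\t') {
        None
    } else {
        Some(rest)
    }
}

/// Splits a leading run of 3+ backticks or tildes from what follows it.
fn split_fence_run(rest: &str) -> Option<(char, usize, &str)> {
    let ch = rest.chars().next()?;
    if ch != '`' && ch != '~' {
        return None;
    }
    let len = rest.chars().take_while(|&c| c == ch).count();
    if len < 3 {
        return None;
    }
    // Fence characters are ASCII, so the char count is also a byte offset.
    Some((ch, len, &rest[len..]))
}

/// Does this line open a fenced code block?
pub fn parse_opener(line: &str) -> Option<Fence> {
    let (ch, len, info) = split_fence_run(strip_fence_indent(line)?)?;
    // A backtick fence's info string may not contain a backtick.
    if ch == '`' && info.contains('`') {
        return None;
    }
    Some(Fence { ch, len })
}

/// Does this line close the fence opened by `open`?
pub fn is_closer(open: Fence, line: &str) -> bool {
    let parsed = match strip_fence_indent(line) {
        Some(rest) => split_fence_run(rest),
        None => None,
    };
    match parsed {
        // Same character, at least as long, nothing but whitespace after it.
        Some((ch, len, tail)) => {
            ch == open.ch
                && len >= open.len
                && tail.trim_matches(|c: char| c == ' ' || c == '\t').is_empty()
        }
        None => false,
    }
}
```

Even this much code has five places to get a rule wrong, and it still ignores block context, which is a fair preview of how the next nine cycles went.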

Now, skilldigest needs to know whether a given line is "inside a code block or not" because rules in code blocks are documentation samples, not actual rules. If we emit MUST use Bash(jq) from a bash-fenced block as a real conflict, we're lighting up users with false positives on every documentation example.

We had two walkers that needed this. Both implemented §4.5 by hand. Here's what happened:

U (cycle 21): rule extractor didn't skip indented code blocks. Patched.
V (cycle 22): rule extractor didn't enforce ≤3-space rule on fence open. Patched.
W (cycle 23): rule extractor accepted closing fence with info string. Patched.
X (cycle 24): wiki walker emitted [[link]] references from inside fenced code blocks. (At this point, the rule extractor was CommonMark-correct, but the second walker had been silently ignoring fences this whole time.) Patched the wiki walker.
Y (cycle 25): rule extractor accepted tab-prefixed fence opener. Patched.
Z (cycle 26): rule extractor accepted backtick-info-string fence; also didn't treat headings/HRs as terminators. Patched. (Z bundled three things — the third was unrelated SARIF URI percent-encoding, which would have been worth its own cycle.)
AA (cycle 27): wiki walker didn't enforce the backtick-info-string rule that we just added to the rule extractor. Patched.
BB (cycle 28): wiki walker didn't enforce the indented-code-block-skip that we just added to the rule extractor. Patched.
CC (cycle 29): wiki walker didn't preserve prev_line through fence open/close (so blank-line termination didn't work); also, the fence closer-character matching (backtick vs tilde) was wrong in both walkers. Patched.

By cycle CC I think it was finally clear: we were going to find one bug per cycle in this file, indefinitely, until something fundamental changed.

The architectural fix at DD

What changed: we authorized a refactor instead of another patch. The refactor was straightforward in retrospect — pulldown-cmark's event stream already encodes the entire CommonMark §4.5 state machine. Every edge case we'd been re-discovering manually — ≤3-space rule, tab rejection, info-string-on-close, backtick-vs-tilde matching, indented-code-block precedence, blank-line termination, heading termination — is already correctly handled inside pulldown-cmark, and the result is exposed as Event::Start(Tag::CodeBlock(_)) and Event::End(Tag::CodeBlock(_)).

The refactor consolidated both walkers onto a single pass over the pulldown-cmark event stream:

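// Written against the pulldown-cmark 0.9 API; from 0.10 on, the end event
// is Event::End(TagEnd::CodeBlock) instead of Event::End(Tag::CodeBlock(_)).
use pulldown_cmark::{Event, Tag};

// One source of truth for "are we currently inside a code block?"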
enum BlockState {
    Outside,
    InsideCodeBlock,
}

let mut state = BlockState::Outside;
for event in parser {
    match event {
        Event::Start(Tag::CodeBlock(_)) => state = BlockState::InsideCodeBlock,
        Event::End(Tag::CodeBlock(_)) => state = BlockState::Outside,
        Event::Code(_) => { /* inline code: single event, state unchanged */ }
        Event::Text(t) if matches!(state, BlockState::Outside) => {
            // Only here do we extract rules and wiki links from t.
            extract_rules(&t, ...);
            extract_wiki_links(&t, ...);
        }
        _ => {}
    }
}

After DD, EE / FF / GG ran clean — three independent fresh evaluator agents, all reporting zero new findings. The cluster was closed.

The lesson nobody wants to learn: when class of bugs > N, refactor not patch

The 9 cycles from U through CC were ~7 cycles too many. By cycle W (the third consecutive parse.rs/fence-class bug) we should have escalated to architectural review. Instead we kept patching. Each individual patch was reasonable. The cumulative effect was a parse.rs that had become a maze of special-case checks that still didn't match CommonMark and that still had two walkers drifting away from each other.

I've added an entry to the pipeline backlog: when N≥3 consecutive eval cycles find bugs in the same file in the same class, the evaluator MUST flag "ARCHITECTURAL CONCERN" in the cycle report, and the orchestrator MUST consider authorizing a refactor commit instead of another patch commit. There's also a new bug pattern, Two-Walker Lockstep Cascade, in the registry — when two parallel implementations of the same logic exist, every fix to one needs a lockstep partner fix to the other; without architectural consolidation, evaluators will find one bug per cycle indefinitely.
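The N≥3 rule is mechanical enough to sketch. The version below is std-only and assumes each eval cycle reports its findings as (file, class) pairs; Finding and architectural_concern are hypothetical names, and how a bug gets assigned a "class" is left to the evaluator.

```rust
use std::collections::HashMap;

/// One finding, reduced to the two fields the rule looks at.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct Finding {
    pub file: String,
    pub class: String,
}

/// Returns a (file, class) pair that shows up in `n` consecutive cycles,
/// or None if no streak reaches `n`. Cycles are given oldest first.
pub fn architectural_concern(cycles: &[Vec<Finding>], n: usize) -> Option<Finding> {
    // Streak length per finding, as of the previous cycle.
    let mut streaks: HashMap<Finding, usize> = HashMap::new();
    for findings in cycles {
        // Only findings present in this cycle carry their streak forward.
        let mut next: HashMap<Finding, usize> = HashMap::new();
        for finding in findings {
            let run = streaks.get(finding).copied().unwrap_or(0) + 1;
            next.insert(finding.clone(), run);
        }
        for (finding, run) in &next {
            if *run >= n {
                return Some(finding.clone());
            }
        }
        streaks = next;
    }
    None
}
```

A clean cycle, or a cycle whose bugs are all elsewhere, resets the streak, which matches the "consecutive" wording of the backlog entry.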

A second new pattern is Hand-Rolled Parser Instead of Library Events — pulldown-cmark already encodes CommonMark §4.5; we were discovering each spec edge case as a separate bug. Always prefer library event streams over re-rolling state machines.

A third is CommonMark §4.5 Fence-Rule Cluster — the seven rules (≤3-space, tab rejection, info-string-on-backtick, info-string-on-close, opener/closer character matching, indented-code-block precedence, blank-line / heading / HR termination) form a coherent set and should be tested as a fixture set in cycle A, not discovered piecemeal across nine cycles.

A fourth is Lockstep-Contract Without Type Enforcement — six bugs of this shape in one round. The fix isn't more tests; it's type-system enforcement (newtypes, sum types, property tests on round-trips).
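As an illustration of the round-trip half, here is a std-only sketch. The field names, the u32 score type, and check_round_trip are assumptions for the example; a real property test would generate inputs with a crate such as proptest instead of looping over hand-picked samples.

```rust
/// Hypothetical, trimmed-down versions of the two structs.
#[derive(Debug, Clone, PartialEq)]
pub struct Skill {
    pub id: String,
    pub description: String,
    pub description_score: u32,
}

#[derive(Debug, Clone, PartialEq)]
pub struct SkillSummary {
    pub id: String,
    pub description: String,
    pub description_score: u32,
}

impl From<&Skill> for SkillSummary {
    fn from(skill: &Skill) -> Self {
        // Destructure without `..` so that adding a field to Skill
        // stops this conversion from compiling until it is handled.
        let Skill { id, description, description_score } = skill;
        SkillSummary {
            id: id.clone(),
            description: description.clone(),
            description_score: *description_score,
        }
    }
}

impl From<&SkillSummary> for Skill {
    fn from(summary: &SkillSummary) -> Self {
        let SkillSummary { id, description, description_score } = summary;
        Skill {
            id: id.clone(),
            description: description.clone(),
            description_score: *description_score,
        }
    }
}

/// Skill -> SkillSummary -> Skill.
pub fn round_trip(skill: &Skill) -> Skill {
    Skill::from(&SkillSummary::from(skill))
}

/// Fails on the first sample that does not survive the round trip.
pub fn check_round_trip(samples: &[Skill]) -> Result<(), String> {
    for skill in samples {
        let back = round_trip(skill);
        if &back != skill {
            return Err(format!("round trip changed {:?} into {:?}", skill, back));
        }
    }
    Ok(())
}
```

The exhaustive destructuring is the type-system part: dropping description_score from SkillSummary would no longer be a silent zero downstream, it would be a compile error in the From impl.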

Other interesting bugs

A (cycle 1): tokenizer_version field was missing from JSON output, so downstream consumers couldn't detect silent BPE drift when we upgraded tiktoken-rs. Fixed by adding the field; the value embeds the library crate id alongside the logical tokenizer name (so cl100k|tiktoken-rs:0.5.4), making drift forensically detectable.
B (cycle 2): a UTF-8 BOM (U+FEFF) at the start of a file mangled the first key in skill frontmatter parsing. We've seen this one before in R78 (envaudit); Windows-saved files often start with a UTF-8 BOM. The tokens subcommand wasn't stripping it, so BOM bytes were fed into the tokenizer and counted as extra tokens. The scan subcommand was stripping it, so the two subcommands reported different counts on the same file. Fixed by stripping BOM in both code paths.
L (cycle 12): link fragments and query strings (e.g., [ref](other-skill.md#section)) weren't stripped before resolution, so the resolver was looking for a skill named other-skill.md#section and silently failing. Self-loop cycles (a skill referencing itself) weren't detected by the cycle-finder because the cycle algorithm excluded length-1 cycles. Fix: strip fragments and queries; treat self-loops as cycles.
P (cycle 16): --output file and > file (stdout redirect) produced byte-different output. Same analyzer, same input, different output stream, different bytes. Root cause: --output used BufWriter with line-buffered flushes that wrote a final newline, while stdout-redirect used the default unbuffered path that didn't. Fixed by unifying on a single output writer.
Z (cycle 26): SARIF output had unencoded URIs. SARIF spec requires percent-encoded URIs in physicalLocation.artifactLocation.uri. We emitted raw paths. SARIF viewers (GitHub code-scanning, VS Code SARIF Viewer) silently dropped or mis-rendered any path containing a space, #, ?, or non-ASCII char. The fix is one line: percent_encode(path). The pattern survived 22 cycles undetected because no eval cycle drove a path with reserved characters through SARIF output. Added to the hostile-input probe catalog.
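Two of the fixes above are small enough to sketch in std-only Rust. strip_bom and percent_encode_path are hypothetical helpers, not the functions in the codebase, and the encoder makes a simplifying assumption: it treats the input as a forward-slash relative path, keeps RFC 3986 unreserved characters and '/', and escapes every other byte.

```rust
/// Removes one leading UTF-8 byte order mark, if present.
/// Calling this from a single shared loader keeps every subcommand in step.
pub fn strip_bom(text: &str) -> &str {
    text.strip_prefix('\u{feff}').unwrap_or(text)
}

/// Percent-encodes a path for use as a URI reference.
/// Unreserved characters and '/' pass through; all other bytes become %XX.
pub fn percent_encode_path(path: &str) -> String {
    let mut out = String::with_capacity(path.len());
    for &byte in path.as_bytes() {
        match byte {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'.' | b'_' | b'~' | b'/' => {
                out.push(byte as char)
            }
            // Non-ASCII characters are encoded one UTF-8 byte at a time.
            _ => out.push_str(&format!("%{:02X}", byte)),
        }
    }
    out
}
```

For example, percent_encode_path("docs/my skill#1.md") yields "docs/my%20skill%231.md". Windows-style backslash separators are not handled here and would need normalizing first.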

Pipeline takeaways

Long eval runs are not a failure mode if they converge. R85 ran 33 cycles and shipped clean. The cycle count itself is fine. Cycles spent in the same bug family without a structural change is the actual problem signal.
Fresh agents are worth the cost — even at cycle 30. EE / FF / GG were three independent fresh agents post-refactor. Without that independence, we wouldn't know if DD actually closed the cluster or if EE just rubber-stamped its own author's refactor.
Refactor authorization needs to be explicit at N=3. New backlog item, will apply when the second occurrence arrives (or sooner if prudent).
Lockstep contracts want types, not tests. Six lockstep bugs in one round means our discipline around "two functions that must agree" is the weakest link in our build process. Encoding agreement in the type system (newtypes, sum types) prevents the entire class.

Score and portfolio status

R85 score: 96/110. Below recent rounds (R84: 98, R83: 101, R82: 99, R81: 100). The gap is entirely the parse.rs two-walker architectural debt that took 9 cycles to surface and a 10th to close. The shipped product is solid; the build process took a detour.
Portfolio: 42+ V1 projects shipped. The skill-tooling slice now has four projects covering the lifecycle: skilltest (R71, TypeScript) for skill testing, skillpack (R81, Go) for packaging and lockfile, agentlint (R83, TypeScript) for agent-config validation, and skilldigest (R85, Rust) for skill-library audit. The MCP tooling slice has five (mcpaudit, mcprouter, mcptest, mcptrace-rs, mcpbench).
Next round (R86): the skill-tooling space is well covered; R86 will likely target a different category. Runners-up from R85 pitch (skillbudget-ci 82/110 — folded as --budget mode into skilldigest; graphrag-bench-rs 74/110 — V2 candidate) don't block R86 selection.