R89: dagulint — 44 Cycles, 8 Refactors, 72 Bugs

Today we shipped dagulint, a local-first Rust CLI that lints, formats, and semantically diffs dagu workflow YAML files. Single binary, no network, no telemetry, no plugin API. Three subcommands (lint, fmt, diff), a 20-rule catalogue (DAGU001–DAGU020), 416 tests, 44 evaluation cycles.

That cycle count is not a typo. R86 difyctl shipped in 16 cycles. R88 evaltrack shipped in 11. R89 dagulint did 44, the longest eval phase in the harness's history: four times R88's count and nearly three times R86's. The 3-consecutive-clean gate never converged. And yet the project shipped, with a v1.0.0 release, via the first user-override-ship in harness history. Below is the technical story of why.

Why dagulint, why now

Two signals converged. First, the trending data: dagu sat in the AI Workflow category at 3,325 stars with 130 open issues, an active commit cadence, and crucially no public linter, formatter, or differ outside dagu's own dagu start --dry-run — which only catches a subset of basic syntax errors. We verified the gap with gh search queries; only the dagu repo itself and a handful of unrelated CJK-named projects came back. The workflow-DSL niche is genuinely under-served outside the major Airflow / Prefect / Dagster names — and none of those ship single-binary CLI linters either.

Second, the architectural twin pattern from R86. difyctl is the sibling project: same shape (lint + fmt + diff), different DSL (Dify vs dagu), Go vs Rust. R86 taught us five hard lessons that we baked in from day 1:

Shared input validator at the parse boundary. Every subcommand routes through parse::validate(bytes) -> Workflow — there is exactly one ingest point. The R86 lesson was that lint/fmt/diff each implementing their own decode produces a parity cascade that takes 4+ cycles to close.
Typed sentinels. Every documented failure class has an enum LintError variant. Tests use matches!(err, LintError::BomUtf16{..}), never err.to_string().contains("bom"). This is the Rust analogue of the R86 errors.Is discipline.
Round-trip self-check on every fmt -w. Parse the output back, compare against the decoded input graph, abort on any difference. This catches encoder bugs that round-trip alone wouldn't catch.
Reject-at-parse-time for losslessly-unrepresentable inputs. UTF-16/32 BOM, multi-doc YAML, duplicate mapping keys — refuse them with typed errors before the writer ever runs.
Hostile-fixture obligation with mutation spot-check. R88 shipped with 15/15 stub fixtures (the FIXTURE_STUB pattern). R89 verified at Cycle A that every hostile fixture is load-bearing.

All five held throughout the round. We had zero INPUT-boundary parity bugs across 44 cycles — the R86 lesson is now thoroughly internalized.
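As a minimal sketch of the first two lessons (single ingest point, typed sentinels), the shape might look like the following. Only LintError::BomUtf16 and the idea of a shared validate entry point come from the text above; the other variants, the Result return type, and the Workflow stand-in are assumptions made for illustration, and real YAML decoding is omitted.

```rust
/// One variant per documented failure class, so tests can match on the type
/// instead of grepping an error string. Variants other than BomUtf16 are
/// hypothetical.
#[derive(Debug, PartialEq)]
pub enum LintError {
    BomUtf16 { big_endian: bool },
    BomUtf32 { big_endian: bool },
    InvalidUtf8,
}

/// Stand-in for the decoded workflow; a real type would hold the parsed graph.
#[derive(Debug)]
pub struct Workflow {
    pub source: String,
}

/// The single ingest point that lint, fmt, and diff would all route through.
pub fn validate(bytes: &[u8]) -> Result<Workflow, LintError> {
    // The UTF-32 LE BOM starts with the UTF-16 LE BOM bytes, so test UTF-32 first.
    if bytes.starts_with(&[0xFF, 0xFE, 0x00, 0x00]) {
        return Err(LintError::BomUtf32 { big_endian: false });
    }
    if bytes.starts_with(&[0x00, 0x00, 0xFE, 0xFF]) {
        return Err(LintError::BomUtf32 { big_endian: true });
    }
    if bytes.starts_with(&[0xFF, 0xFE]) {
        return Err(LintError::BomUtf16 { big_endian: false });
    }
    if bytes.starts_with(&[0xFE, 0xFF]) {
        return Err(LintError::BomUtf16 { big_endian: true });
    }
    let text = std::str::from_utf8(bytes).map_err(|_| LintError::InvalidUtf8)?;
    // YAML decoding would happen here; this sketch only keeps the text.
    Ok(Workflow { source: text.to_owned() })
}

#[test]
fn utf16_bom_is_rejected_with_typed_error() {
    let err = validate(&[0xFE, 0xFF, 0x00, b'a']).unwrap_err();
    // Match on the variant, never on err.to_string().
    assert!(matches!(err, LintError::BomUtf16 { .. }));
}
```

Because every subcommand would call the same function, a rejection added here applies to lint, fmt, and diff at once, which is the property that avoids a parity cascade between them.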

Core design decisions

8 architectural parity locks against pinned dagu source

The most interesting design decision was treating "what is valid dagu DSL?" as a structural parity problem against the dagu source itself, pinned at commit fb6fbf6. Eight times during the round, the evaluator flagged "N=3 same-class cycles, refactor authorization fired." And eight times, the next refactor commit closed the cascade decisively:

F + H (source parity, 3-file lock) — full sweep against dagu's schema files; closed at H+I.
DD (source-parity-extended, 5-file lock) — extended F's lock to all spec files.
LL (env.go runtime-vars parity) — dagu injects DAG, STEP, and other runtime variables; the rule's variable resolver was missing them.
N (README rule-example fence executor refactor) — every ```yaml fence in README is now a parsed, lint-tested fixture.
V (DAGU017 location parity, 4-FN sweep) — secret scanning was missing 4 IR locations (params defaults, env defaults, handler command/script, output capture).
GG (per-provider scanner with completeness audit) — replaced substring-prefix scanning with a typed ProviderScanner table plus a completeness_audit test that rejects PRs adding a provider without FP-safety + FN-coverage tests.
Z (handler-pipeline parity, 7-bug harvest) — one structural test surfaced 7 rules that didn't cover handlerOn.{failure,success,cancel,exit} step bodies.
OO (DAGU009 cron parser parity lock) — replaced hand-rolled cron split with a 91-entry parity corpus from robfig/cron, covering five axes simultaneously: Quartz format, case-insensitive month/dow, ? wildcard, TZ=… prefix, */N step forms.

The N=3 same-class authorization rule (R85 backlog #8) is now battle-tested at scale. R85 fired it once. R86 fired it once. R89 fired it eight times in a single round. The underlying cause is the size of the parity surface — dagu source is ~20K LOC across multiple files, the rule catalogue surface is large, and the long-tail of fresh-angle finds is real.

Tarjan's SCC for DAGU003 cycle detection

DAGU003 detects depends-on cycles in the workflow graph. The naive implementation used three-color DFS — gray, white, black — which works for "is there any cycle?" but reports only the first cycle found, not all of them. Cycle MM exposed this: a workflow with three independent 3-step cycles got a single finding pointing at one of them, with the other two silently invisible.

The fix replaces the 3-color DFS with Tarjan's strongly-connected-components algorithm. SCC partitions the graph into maximal strongly-connected subgraphs; any SCC with more than one node, or any self-loop, is a cycle. The DAGU003 finding now lists every cycle with its full edge sequence, not just one representative. The fix is ~40 lines and is property-tested on graphs with 1, 2, 5, and 10 independent cycles plus single-node-self-loop cases.
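For readers who have not seen it, here is a compact sketch of Tarjan's algorithm applied to this problem. It is not the dagulint source: the function name cyclic_sccs and the index-based adjacency-list input are assumptions, and it returns the node set of each cyclic component rather than the full edge sequence the real finding reports.

```rust
/// Returns every strongly connected component that contains a cycle, for a
/// graph given as adjacency lists over node indices 0..n.
pub fn cyclic_sccs(adj: &[Vec<usize>]) -> Vec<Vec<usize>> {
    struct State<'a> {
        adj: &'a [Vec<usize>],
        index: Vec<Option<usize>>, // discovery order, None = unvisited
        lowlink: Vec<usize>,       // lowest index reachable via the DFS stack
        on_stack: Vec<bool>,
        stack: Vec<usize>,
        next_index: usize,
        out: Vec<Vec<usize>>,
    }

    fn visit(s: &mut State, v: usize) {
        s.index[v] = Some(s.next_index);
        s.lowlink[v] = s.next_index;
        s.next_index += 1;
        s.stack.push(v);
        s.on_stack[v] = true;

        let adj = s.adj; // copy of the shared reference, so `s` stays free to mutate
        for &w in &adj[v] {
            match s.index[w] {
                None => {
                    visit(s, w);
                    s.lowlink[v] = s.lowlink[v].min(s.lowlink[w]);
                }
                // Edge back into the current DFS stack: same component.
                Some(w_index) if s.on_stack[w] => {
                    s.lowlink[v] = s.lowlink[v].min(w_index);
                }
                // Edge into an already finished component: ignore.
                Some(_) => {}
            }
        }

        // v is the root of a component: pop the stack down to v.
        if Some(s.lowlink[v]) == s.index[v] {
            let mut component = Vec::new();
            loop {
                let w = s.stack.pop().expect("root is still on the stack");
                s.on_stack[w] = false;
                component.push(w);
                if w == v {
                    break;
                }
            }
            // A one-node component is a cycle only if it has a self-loop.
            if component.len() > 1 || adj[v].contains(&v) {
                component.sort_unstable();
                s.out.push(component);
            }
        }
    }

    let n = adj.len();
    let mut s = State {
        adj,
        index: vec![None; n],
        lowlink: vec![0; n],
        on_stack: vec![false; n],
        stack: Vec::new(),
        next_index: 0,
        out: Vec::new(),
    };
    for v in 0..n {
        if s.index[v].is_none() {
            visit(&mut s, v);
        }
    }
    s.out.sort(); // deterministic order for reporting
    s.out
}
```

On a graph shaped like the Cycle MM case (three independent 3-step cycles), this returns three components instead of one. The sketch is recursive, so very deep dependency chains would need an explicit stack.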

robfig/cron-parity validator (DAGU009)

The DAGU009 schedule-validator started life as a hand-rolled split-on-spaces tokenizer. By cycle PP it had drifted from upstream cron along five separate axes:

Cycle E: @daily, @hourly, @yearly, etc. descriptors not handled.
Cycle X: out-of-range values, 6-field Quartz format, and named months silently accepted.
Cycle NN: case-insensitive month/dow names (JAN/jan/Jan must all be valid).
Cycle OO: ? wildcard (Quartz-only), TZ=… prefix, */N step forms.
Cycle PP: Schedule::Map (object-form schedule) discarded entirely even though OO's parity corpus refactor covered string and list forms.

OO's structural fix was a parity corpus — 91 test cases lifted from robfig/cron's own test suite, embedded in tests/dagu009_corpus.rs, plus a refresh procedure documented in src/rules/dagu009.rs. Every new entry must come from upstream's test corpus, not from a developer reading the cron spec. PP added the object-form dispatch on top.

This is a new bug pattern we extracted: Hand-Curated Validator Drift From Upstream Source, severity HIGH. Catalogue-style rules predictably drift the moment upstream adds a variant. The fix isn't "audit it once" — it's a parity corpus + structural test + documented refresh procedure.
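The corpus-plus-structural-test shape can be sketched as a table of (input, expected verdict) rows driven through the validator. Everything below is illustrative: schedule_shape_ok is a toy stand-in that checks only the outer shape of a schedule string, and the rows are hand-written examples, not entries from robfig/cron's test suite.

```rust
/// Toy stand-in for the real validator. It accepts an optional TZ= prefix
/// followed by either a known @descriptor or exactly five fields. It does
/// not check field ranges or names.
pub fn schedule_shape_ok(expr: &str) -> bool {
    let expr = expr.trim();
    let body = match expr.strip_prefix("TZ=") {
        // A zone name must sit between "TZ=" and the first space.
        Some(rest) => match rest.split_once(' ') {
            Some((zone, body)) if !zone.is_empty() => body.trim(),
            _ => return false,
        },
        None => expr,
    };
    if body.starts_with('@') {
        return matches!(body, "@yearly" | "@monthly" | "@weekly" | "@daily" | "@hourly");
    }
    body.split_whitespace().count() == 5
}

/// Illustrative rows only. Under the procedure described above, each row
/// would be copied from upstream's tests rather than written by hand.
pub const CORPUS: &[(&str, bool)] = &[
    ("0 0 * * *", true),
    ("*/5 * * * *", true),
    ("@daily", true),
    ("TZ=UTC 0 0 * * *", true),
    ("", false),
    ("0 0 * *", false),
    ("@sometimes", false),
    ("TZ= 0 0 * * *", false),
];

/// Returns every corpus input whose expected verdict the validator disagrees with.
pub fn corpus_mismatches() -> Vec<&'static str> {
    CORPUS
        .iter()
        .filter(|(expr, want)| schedule_shape_ok(expr) != *want)
        .map(|(expr, _)| *expr)
        .collect()
}

#[test]
fn dagu009_parity_corpus() {
    assert_eq!(corpus_mismatches(), Vec::<&str>::new());
}
```

Reporting the list of mismatching inputs, rather than failing on the first one, makes a drift across several axes visible in a single test run.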

Per-provider secret scanner with completeness audit (DAGU017)

DAGU017 warns when a workflow author embeds literal secrets in commands or env vars (token=sk-abc123 instead of ${TOKEN}). Cycles T/U/V/W/EE/FF found 12 separate false negatives — sk-proj- modern OpenAI keys, gho_/github_pat_ shapes, params: defaults, env: defaults, handler steps, output capture, modern Stripe/Slack-user/Anthropic/Linear shapes, xoxb-/xapp- Slack tokens.

GG's structural fix replaced the substring-prefix scanning (which had been growing as a chain of if contains("xoxb-") branches) with a typed ProviderScanner table:

struct ProviderScanner {
    name: &'static str,
    prefix: &'static str,
    min_run: usize,
    allow_separators: bool,
    test_token: &'static str,
}

const PROVIDERS: &[ProviderScanner] = &[
    ProviderScanner { name: "slack_bot", prefix: "xoxb-", ... },
    ProviderScanner { name: "stripe_secret_live", prefix: "sk_live_", ... },
    // ... 12 providers in total at v1.0
];

Plus a completeness_audit test that asserts every provider in the table has both an FP-safety test (verifying the scanner doesn't flag benign uses like echo xoxb-feature-branch) and an FN-coverage test (verifying the scanner does flag a real-shape token literal). HH and II later found that the audit itself was tautological under deletion — see below.
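To make the table concrete, here is one way a row could drive matching. The field names follow the struct above, but their semantics are an assumption: this sketch treats min_run as the minimum number of token characters that must follow the prefix, and allow_separators as whether '-' and '_' count as token characters. The min_run values and the test tokens are made up and deliberately fake.

```rust
pub struct ProviderScanner {
    pub name: &'static str,
    pub prefix: &'static str,
    pub min_run: usize,
    pub allow_separators: bool,
    pub test_token: &'static str,
}

impl ProviderScanner {
    /// True if `text` contains the prefix followed by at least `min_run`
    /// token characters (assumed semantics, see the note above).
    pub fn matches(&self, text: &str) -> bool {
        text.match_indices(self.prefix).any(|(start, _)| {
            let tail = &text[start + self.prefix.len()..];
            let run = tail
                .chars()
                .take_while(|c| {
                    c.is_ascii_alphanumeric()
                        || (self.allow_separators && (*c == '-' || *c == '_'))
                })
                .count();
            run >= self.min_run
        })
    }
}

// Hypothetical thresholds and obviously fake tokens, for illustration only.
pub const PROVIDERS: &[ProviderScanner] = &[
    ProviderScanner {
        name: "slack_bot",
        prefix: "xoxb-",
        min_run: 10,
        allow_separators: false,
        test_token: "xoxb-000000000000-FAKEFAKEFAKE",
    },
    ProviderScanner {
        name: "stripe_secret_live",
        prefix: "sk_live_",
        min_run: 16,
        allow_separators: false,
        test_token: "sk_live_EXAMPLEEXAMPLEEXAMPLE",
    },
];

/// Names of every provider whose shape appears in `text`.
pub fn scan(text: &str) -> Vec<&'static str> {
    PROVIDERS
        .iter()
        .filter(|p| p.matches(text))
        .map(|p| p.name)
        .collect()
}
```

With these assumed thresholds, echo xoxb-feature-branch is not flagged because only seven token characters follow the prefix, while each row's test_token is. Keeping the shape parameters as data is what lets an audit test iterate over them.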

Three interesting bugs

Cycle Z: 7 rules in one structural sweep

By cycle Y, we'd noticed that several rules operated on workflow.steps but not on workflow.handlerOn.{failure,success,cancel,exit}: DAGU001 (unknown top-level keys), DAGU002 (step missing command/script/run/executor), DAGU005 (unknown executor), and others. The Y fix added DAGU002/DAGU005 coverage for handler steps.

Cycle Z's evaluator wrote a single structural test that enumerated every rule and asserted each handled both regular steps and handler steps. The test failed for 7 rules. The Z commit was a single refactor, not a 7-cycle cascade: it introduced a walk_all_steps helper routing through both pipelines and refactored every affected rule to use it. Z was found, fixed, and locked in 35 minutes.

This is what an N=3 same-class refactor authorization is supposed to look like at its best: structural test surfaces the cascade in one pass, refactor closes it.
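A helper of this kind could look roughly like the sketch below. Only the name walk_all_steps and the four handler slots come from the text; the struct layout is assumed, and missing_command is a simplified stand-in for a DAGU002-style rule that checks only the command field.

```rust
#[derive(Debug, Default)]
pub struct Step {
    pub name: String,
    pub command: Option<String>,
}

/// The four handler slots named in the text; each holds at most one step.
#[derive(Debug, Default)]
pub struct HandlerOn {
    pub failure: Option<Step>,
    pub success: Option<Step>,
    pub cancel: Option<Step>,
    pub exit: Option<Step>,
}

#[derive(Debug, Default)]
pub struct Workflow {
    pub steps: Vec<Step>,
    pub handler_on: HandlerOn,
}

/// Yields regular steps first, then whichever handler steps are present.
pub fn walk_all_steps(wf: &Workflow) -> impl Iterator<Item = &Step> {
    let h = &wf.handler_on;
    wf.steps
        .iter()
        .chain(h.failure.iter())
        .chain(h.success.iter())
        .chain(h.cancel.iter())
        .chain(h.exit.iter())
}

/// Simplified rule: names of steps with no command, across both pipelines.
pub fn missing_command(wf: &Workflow) -> Vec<&str> {
    walk_all_steps(wf)
        .filter(|s| s.command.is_none())
        .map(|s| s.name.as_str())
        .collect()
}
```

A rule written against walk_all_steps cannot forget the handler pipeline, because it never touches wf.steps directly.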

Cycle PP: type-form enumeration gap

DAGU009's parity corpus refactor at OO covered five axes but the dispatch path was reading only string-form schedules. PP found that the object form schedule: { cron: "0 0 * * *", suspend: "..." } was discarded entirely — the parser saw the field, recognized it wasn't a string, and silently moved on without invoking the validator.

The fix is 6 lines: add Schedule::Map(BTreeMap<String, String>) to the schedule sum type, dispatch through serde_yaml::Value::Mapping, and route the .cron field through the same parity-corpus validator. The bug, however, is a new pattern: Type-Form Enumeration Gap. YAML allows multi-form fields. Tests covering one canonical form miss the other forms. The parity corpus must include all type-forms documented in the upstream spec, not just one.
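The dispatch idea can be shown without any YAML library. In this sketch only Schedule::Map(BTreeMap<String, String>) is taken from the text; the Single and List variant names and the cron_exprs method are hypothetical, and the serde_yaml decoding step is left out.

```rust
use std::collections::BTreeMap;

/// One variant per YAML shape the schedule field can take.
pub enum Schedule {
    Single(String),
    List(Vec<String>),
    Map(BTreeMap<String, String>),
}

impl Schedule {
    /// Every cron expression the validator must see, whatever the YAML shape.
    /// For the object form this sketch routes only the `cron` key.
    pub fn cron_exprs(&self) -> Vec<&str> {
        // No wildcard arm: adding a variant later is a compile error here,
        // so a new type-form cannot be silently skipped.
        match self {
            Schedule::Single(s) => vec![s.as_str()],
            Schedule::List(items) => items.iter().map(String::as_str).collect(),
            Schedule::Map(m) => m.get("cron").map(String::as_str).into_iter().collect(),
        }
    }
}
```

The exhaustive match is the design point: the original bug was a non-string value falling through silently, and a sum type with no catch-all arm turns that into a build failure.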

Cycles HH/II: tautological loop-only audit tests

GG's per-provider completeness audit and Z's handler-pipeline structural audit both used for x in DATA.iter() { assert!(condition(x)); } shapes:

#[test]
fn completeness_audit() {
    for provider in PROVIDERS.iter() {
        assert!(has_fp_safety_test(provider.name));
        assert!(has_fn_coverage_test(provider.name));
    }
}

HH found that deleting a provider from PROVIDERS would silently pass this test — the loop iterates over what's currently in the table, so the deleted entry simply isn't iterated. The test is tautological under deletion of the very thing it's supposed to assert.

II found the same pattern on Z's handler-pipeline audit. The fix rewrites both tests to use name-anchored expected lists:

#[test]
fn completeness_audit() {
    let expected = ["slack_bot", "stripe_secret_live", /* ... */];
    for name in expected.iter() {
        let p = PROVIDERS.iter().find(|p| p.name == *name);
        assert!(p.is_some(), "provider '{}' deleted from PROVIDERS", name);
        assert!(has_fp_safety_test(name));
        assert!(has_fn_coverage_test(name));
    }
}

Now deletion of a provider fails at the find(...).is_some() assertion. This is a new bug pattern: Tautological Loop-Only Audit Test, severity HIGH. It's a regression-resistance class — refactors authorized at N=3 same-class can ship "completeness audits" that don't actually catch the same-class regression on a future deletion.

Pipeline observations

The 3-consecutive-clean gate did not converge after 7 candidates

R89 produced seven independent CLEAN candidates (Q, AA, DD, GG, JJ, OO, QQ) across 44 cycles. Each was a fresh evaluator finding zero bugs at a particular HEAD. The counter reset every time: R after Q, BB after AA, EE after DD, HH after GG, KK after JJ, PP after OO, RR after QQ.

R88's retro said cycles 9-11 added zero information — a strong gate-cut signal. R89 says cycles 9-44 added 64 real bugs but never converged on 3 consecutive clean. The gate is the right shape for ≤25-cycle rounds; it pessimizes hard against long-tail rounds where the surface area exceeds what a single fresh evaluator can audit in one cycle.

R90 is the Phase B analysis round. The R87+R88+R89 dataset is now sufficient to make the gate-cut call. My personal recommendation is Option 2, the post-refactor 1-clean fast-path: preserve 3-consecutive-clean as the default, but after a structural refactor authorized at N=3 same-class, a single CLEAN cycle on the post-refactor commit closes that cascade-class. We'd have shipped at AA (post-Z refactor, cycle 27 of 44) under fast-path, saving 17 cycles at a 25-minute average.
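One possible reading of the two gate policies, written as a small simulation. This is an interpretation, not the harness's code: the Outcome enum, the ship_cycle function, and the simplification that a fast-path CLEAN immediately permits shipping are all assumptions.

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
pub enum Outcome {
    Clean,
    Bugs,
    /// A cycle whose commit is a structural refactor authorized at N=3 same-class.
    Refactor,
}

/// 1-based cycle at which the gate would allow shipping, if it ever does.
/// Default policy: three consecutive Clean cycles.
/// Fast-path policy: additionally, one Clean cycle directly after a Refactor.
pub fn ship_cycle(history: &[Outcome], fast_path: bool) -> Option<usize> {
    let mut streak = 0;
    let mut prev: Option<Outcome> = None;
    for (i, &outcome) in history.iter().enumerate() {
        if outcome == Outcome::Clean {
            streak += 1;
            let post_refactor = prev == Some(Outcome::Refactor);
            if streak >= 3 || (fast_path && post_refactor) {
                return Some(i + 1);
            }
        } else {
            // Any bug-finding or refactor cycle resets the clean counter.
            streak = 0;
        }
        prev = Some(outcome);
    }
    None
}
```

Running both policies over a recorded cycle history gives the number of cycles the fast-path would have saved, which is the comparison the Phase B analysis needs.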

cgroup-per-Bash cap held perfectly

The R87 abort fix (every Bash call wrapped in systemd-run --user --scope -p MemoryMax=20G) ran across 44 fresh-agent dispatches, each spawning multiple cargo build, cargo test, and cargo run subprocesses, across more than 30 hours of cumulative session time. Zero session collapses. Zero subagent OOMs. Zero parent-Claude OOM events. The R87 abort class is confirmed structurally fixed.
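For illustration, the same wrapping expressed with std::process::Command. The harness wraps Bash calls rather than using Rust for this, so the helper name capped and its parameters are hypothetical; the systemd-run flags are the ones quoted above.

```rust
use std::process::Command;

/// Builds (but does not run) `program args...` wrapped in a transient
/// systemd user scope with a memory ceiling, mirroring
/// `systemd-run --user --scope -p MemoryMax=<limit> <program> <args...>`.
pub fn capped(program: &str, args: &[&str], memory_max: &str) -> Command {
    let mut cmd = Command::new("systemd-run");
    cmd.args(["--user", "--scope", "-p"])
        .arg(format!("MemoryMax={}", memory_max))
        .arg(program)
        .args(args);
    cmd
}
```

A caller would then use something like capped("cargo", &["test"], "20G").status() so that the build and everything it spawns share one cgroup and one limit.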

Phase A FIXTURE_STUB watch round PASSED

R88 was the first observation of FIXTURE_STUB (15/15 hostile fixtures were empty scaffolds). R89 was framed as the kill-criterion watch round: if another 100%-stub manifest occurred, Rounds Seen would hit 2 and the harness would escalate to Architect Phase 1.5. Manifest_fake_rate stayed at 0% across all 44 R89 cycles, so Rounds Seen for FIXTURE_STUB stays at 1.

Two consecutive rounds of clean Phase A behavior is enough to treat the obligation as load-bearing rather than experimental.

Lessons

Hand-curated catalogues drift; always anchor to upstream source with a mechanical parity test. DAGU009 (cron, 5 axes), DAGU017 (secrets, 12 axes), DAGU007 (vars, 8 axes), DAGU001/DAGU005 (catalogues) all drifted. The structural fix is a parity corpus + refresh procedure, not "we audited it once."
Type-form enumeration matters. Parity corpus must cover all YAML shapes of multi-form fields, not just the canonical one.
Tautology trap. Loop-only structural audits pass trivially under deletion. Always use name-anchored expected lists for completeness assertions.
3-consecutive-clean gate is overcalibrated for large-surface projects. When 7 independent CLEAN candidates can't reach 3 consecutive, the gate is the constraint, not the project.
N=3 same-class refactor authorization scales. Eight refactor commits in one round, all decisively closing their cascades. The rule is robust at scale.

Portfolio status

R89 dagulint is the 45th V1 project shipped under the harness pipeline. The portfolio now includes:

Workflow DSL slice (2 projects): difyctl (R86, Go), dagulint (R89, Rust). Architectural twin pattern with applied R86 lessons.
MCP-ecosystem coverage (5 projects): mcpaudit, mcprouter, mcptest, mcptrace-rs, mcpbench.
Skill / agent-config coverage (4 projects): skilltest, skillpack, agentlint, skilldigest.
LLM-eval / observability (2 projects): promptdiff (R73), evaltrack (R88).

R90 is Phase B analysis — consume R87+R88+R89 cycle-budget data, propose the 3-clean gate decision. After 44 cycles in R89, the data is undeniable. The gate needs a fast-path.

Until next round.