
Graft v2.1: Token Tracking & Correctness

5 min read
graft · compiler · adversarial-debate · token-tracking · claude-code

Graft is a graph-native DSL that compiles .gft source files into Claude Code harness structures (.claude/ directories with agent definitions, hook scripts, and orchestration docs). It is built for LLM-to-LLM communication — structured pipelines with typed schemas and compile-time token budget analysis. Every line of Graft is developed through a multi-agent adversarial debate process where 2-4 AI agents independently analyze, cross-critique, and converge on each design decision.

What v2.1 Adds

v2.1 is a quality release. No new syntax, no new keywords. Three categories of improvement:

Runtime Token Tracking

After v1.2 added graft run and v2.0 added memory, the missing piece was visibility into actual token consumption. v2.1 adds a TokenTracker that monitors usage per node during pipeline execution:

[token] Classifier  est: 3000  actual: 2847  cumulative: 2847/10000 (28.5%)
[token] Assigner    est: 1500  actual: 1203  cumulative: 4050/10000 (40.5%)

The tracker logs to .graft/token_log.txt with ISO timestamps, and RunResult.tokenUsage provides programmatic access with a budget/consumed/fraction/perNode breakdown. Budget enforcement is advisory: warnings at 80% and 90% thresholds, no hard abort.
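
Reading that data programmatically might look like this. A minimal sketch: the budget/consumed/fraction/perNode shape comes from the release notes above, but the runPipeline entry point and its options are assumptions, not Graft's documented API.

// Hypothetical usage of RunResult.tokenUsage. Only the
// budget/consumed/fraction/perNode shape is documented above;
// the runPipeline import is an assumed entry point.
import { runPipeline } from "graft";

const result = await runPipeline("examples/chatbot.gft", {
  input: { message: "hello" },
});

const usage = result.tokenUsage;
console.log(`consumed ${usage.consumed}/${usage.budget} tokens`); // e.g. 4050/10000
console.log(`fraction used: ${(usage.fraction * 100).toFixed(1)}%`); // e.g. 40.5%
for (const [node, tokens] of Object.entries(usage.perNode)) {
  console.log(`${node}: ${tokens}`); // per-node breakdown, as in the log above
}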

Under the hood, parseCLIOutput uses heuristic envelope detection to separate Claude CLI metadata from node output. The executor switched from --print to --output-format json to get structured usage data.

Correctness Fixes

Four new validations catch real mistakes:

  • Writes schema overlap: warns when produces fields don't match memory fields (illustrated below)
  • max_tokens > 0: catches max_tokens: 0 declarations
  • Parallel memory writes: warns when both branches of a parallel { A B } block write to the same memory (see the sketch after this list)
  • Warning routing fix: warnings no longer block compilation (a bug present since v1.0)

The schema overlap check, for example, flags this mismatch:

memory Log(max_tokens: 2k, storage: file) {
  turns: List<String>
  summary: String
}

node Writer(model: sonnet, budget: 2k/1k) {
  reads: [Input]
  writes: [Log]
  produces Result { reply: String }
}
# Warning: Writer produces no matching fields for memory Log
# (Result has 'reply', Log expects 'turns' and 'summary')
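
The parallel-write check fires on shapes like the following. This is an illustrative sketch assembled from the syntax shown above; the node bodies and the exact warning text are assumptions, not output from the actual compiler.

memory Scratch(max_tokens: 1k, storage: file) {
  notes: String
}

node A(model: sonnet, budget: 1k/500) {
  reads: [Input]
  writes: [Scratch]
  produces OutA { notes: String }
}

node B(model: sonnet, budget: 1k/500) {
  reads: [Input]
  writes: [Scratch]
  produces OutB { notes: String }
}

parallel { A B }
# Warning (hypothetical wording): parallel nodes A and B both write to memory Scratch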

Shared Module Extraction

MODEL_MAP had been duplicated in two files since T6 — a tracked tech debt item across 4 versions. v2.1 extracted it to src/constants.ts alongside three other constants. Utility functions moved to src/utils.ts, and memory I/O became standalone functions in src/runtime/memory.ts. Two ratchet-locked decisions were unlocked for the first time in the project to enable this.

The Bugs That Mattered

The warning routing bug was the standout finding. Since v1.0, compiler.ts had pushed all analyzer diagnostics into errors[] without checking severity. This meant any warning would block compilation — a bug that was invisible because v1.0 through v2.0 never emitted warnings. Both A3-Skeptic and A4-Specialist found it independently in the 2-agent analysis. Without fixing it first, the new warning validations would have broken valid programs.

Mock spawner backward compatibility was subtler. The new parseCLIOutput function needed to distinguish CLI envelope JSON ({ result: "...", usage: {...} }) from regular node output that happens to contain a result field. A2-Pragmatist caught that existing test mocks returned { result: "..." } and would be incorrectly unwrapped. The fix: require both result AND at least one metadata field (usage, model, or cost_usd) before treating output as a CLI envelope.
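
In code, that acceptance rule is roughly the following. A sketch of the check as described, assuming a hypothetical isCLIEnvelope helper name; the real parseCLIOutput also has to unwrap the result and extract the usage data.

// Envelope detection as described above: JSON counts as a Claude CLI
// envelope only if it carries `result` plus at least one metadata
// field, so mock output like { result: "..." } stays unwrapped.
interface CLIEnvelope {
  result: string;
  usage?: unknown;
  model?: string;
  cost_usd?: number;
}

function isCLIEnvelope(value: unknown): value is CLIEnvelope {
  if (typeof value !== "object" || value === null) return false;
  const obj = value as Record<string, unknown>;
  const hasResult = "result" in obj;
  const hasMetadata = "usage" in obj || "model" in obj || "cost_usd" in obj;
  return hasResult && hasMetadata;
}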

Process Evolution: 42% Under Budget

v2.1 used 21 agent calls against a 36-call budget: 42% under budget, and 70% fewer calls than v2.0's ~70. This came from four adaptations:

| Adaptation | Savings | Rationale |
|------------|---------|-----------|
| MEDIUM tier (2-agent analysis) | -16 calls | R1, R2, R4 had high consensus; 4 agents would not have added insight |
| Skip Step 0 (research) | -8 calls | All rounds extended existing patterns; no novel domains |
| Skip cross-critique | -8+ calls | Even R3 (HIGH complexity) had clearly stated disagreements in Step 1 |
| Test-only round collapse | -6 calls | R4 needed no design decisions: 2 calls instead of 8 |

The forced dissenter mechanism was not exercised (it requires cross-critique), but remains available for high-complexity rounds with low consensus.

v2.1 recorded zero NEEDS_CHANGES: every round passed review on the first attempt. Combined with v2.0's zero NEEDS_CHANGES, the last 9 consecutive rounds (5 in v2.0, 4 in v2.1) have passed first-attempt review.

Stats

| Metric | v2.0 | v2.1 |
|--------|------|------|
| Tests | 249 | 288 (+39) |
| Ratchet decisions | 92 | 107 (+17 new, 2 unlocked) |
| Agent calls | ~70 | 21 (42% under budget) |
| Rounds | 5 | 4 |
| NEEDS_CHANGES | 0 | 0 |
| Commits | -- | 12 |
| Critical bugs found | 2 | 1 |
| Cross-critique used | 4 of 5 | 0 of 4 |

Try It

git clone https://github.com/JSLEEKR/Graft.git
cd Graft
npm install
npm run build

# Compile a .gft file
npx graft compile examples/chatbot.gft

# Run a pipeline with token tracking
npx graft run examples/chatbot.gft --input '{"message": "hello"}'

# Check without compiling
npx graft check examples/chatbot.gft

# Run tests
npm test  # 288 tests

Built with Claude Opus 4.6 via the adversarial debate harness. 21 agent calls, 107 ratchet-locked decisions, zero review failures.