
Graft v1.0: Building a Programming Language with Adversarial AI Debate


What is Graft?

Graft is a graph-native language for AI agent harness engineering. A .gft source file declares contexts, nodes, edges with transforms, and a graph — and the compiler produces a complete .claude/ directory with agent definitions, hook scripts, orchestration docs, and settings.

The core insight: multi-agent AI systems waste tokens by passing full natural language context between agents. When Agent A produces a 2,000-token analysis and Agent B only needs three fields, current frameworks still forward the entire blob. Graft fixes this at the language level with typed schemas and edge transforms.

context BugReport(max_tokens: 800) {
  title: String
  description: String
  stack_trace: Optional<String>
}

node Classifier(model: sonnet, budget: 2k/1k) {
  reads: [BugReport]
  produces Classification {
    category: enum(crash, regression, feature_request, performance, security)
    priority: enum(p0, p1, p2, p3)
    confidence: Float(0..1)
  }
}

node Assigner(model: haiku, budget: 1k/500) {
  reads: [Classification]
  produces Assignment {
    team: String
    suggested_owner: String
    reason: String
  }
}

edge Classifier -> Assigner
  | select(category)
  | select(priority)
  | compact

graph BugTriage(input: BugReport, output: Assignment, budget: 5k) {
  Classifier -> Assigner -> done
}

That select(category) | select(priority) | compact pipeline extracts only what the downstream node needs — the compiler estimates ~79% token savings on that hop.
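A minimal TypeScript sketch of what that hop amounts to at runtime (field names come from the example above; the helpers are hypothetical, not the compiler's actual emitted code):

// Hypothetical runtime shape of the Classifier -> Assigner hop.
interface Classification {
  category: 'crash' | 'regression' | 'feature_request' | 'performance' | 'security';
  priority: 'p0' | 'p1' | 'p2' | 'p3';
  confidence: number;
}

// select(category) | select(priority): keep only the named fields.
function select<T, K extends keyof T>(payload: T, ...keys: K[]): Pick<T, K> {
  const out = {} as Pick<T, K>;
  for (const k of keys) out[k] = payload[k];
  return out;
}

// compact: serialize without whitespace; a rough heuristic is ~4 characters per token.
function compact(payload: unknown): string {
  return JSON.stringify(payload);
}

const full: Classification = { category: 'crash', priority: 'p0', confidence: 0.93 };
const hop = compact(select(full, 'category', 'priority'));
// hop === '{"category":"crash","priority":"p0"}', a fraction of the full analysis text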


The Adversarial Debate Process

Rather than writing code directly, every implementation task went through a structured adversarial debate with 4 AI agents running on Claude Opus 4.6:

  • A1 - Architect: System design, module boundaries, interface contracts
  • A2 - Pragmatist: YAGNI enforcement, shipping speed, practical tradeoffs
  • A3 - Skeptic: Bug finder, edge case hunter
  • A4 - Specialist: Compiler theory, domain expertise

7-Step Process per task: Research → Analysis → Cross-Critique → Convergence → TDD Implementation → Review → Memory Update

Forced Dissenter: The agent with the highest confidence must argue against their own position. This broke confirmation bias in every single task.

Ratchet System: Once confirmed, design decisions are locked. No circular debates.

Up to ~14 agent calls per task across 8 tasks, for ~99 agent calls in total.
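A sketch of how the forced-dissenter rule and the ratchet might look in orchestration code (all names are illustrative; the post does not show the harness internals):

interface Position {
  agent: 'A1' | 'A2' | 'A3' | 'A4';
  argument: string;
  confidence: number;
}

// Forced Dissenter: the most confident agent must argue against its own position.
function pickForcedDissenter(positions: Position[]): Position {
  return positions.reduce((top, p) => (p.confidence > top.confidence ? p : top));
}

// Ratchet: once confirmed, a decision is locked and never re-debated.
const ratchet = new Set<string>();
const lockDecision = (id: string): void => void ratchet.add(id);
const isLocked = (id: string): boolean => ratchet.has(id);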


The Language Design

The specification started in Korean, defining everything from lexical structure to memory systems. The brainstorming session narrowed scope:

  • TypeScript as implementation language — fast iteration with vitest, Node.js ecosystem alignment
  • Hand-written recursive descent parser — full control over error messages with caret-style diagnostics
  • End-to-end scope — parser through codegen in one milestone, not "parser first, codegen later"
  • k-suffix notation — 4k means 4000, 2k/1k means budget-in 2000 / budget-out 1000 (see the sketch after this list)
  • Pipe transforms on edges — | select(findings) | compact instead of nested blocks
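A minimal sketch of the k-suffix expansion, assuming budgets reach the parser as strings (both helpers are hypothetical):

// "4k" -> 4000, "500" -> 500
function parseKSuffix(text: string): number {
  return text.endsWith('k') ? Number(text.slice(0, -1)) * 1000 : Number(text);
}

// "2k/1k" -> { budgetIn: 2000, budgetOut: 1000 }
function parseBudget(text: string): { budgetIn: number; budgetOut: number } {
  const [inPart, outPart] = text.split('/') as [string, string];
  return { budgetIn: parseKSuffix(inPart), budgetOut: parseKSuffix(outPart) };
}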

Implementation: 8 Tasks via Adversarial Debate

T1: Scaffolding

tsc-only build, ESM, NodeNext resolution. A3 argued against pre-creating empty directories (YAGNI won). GraftError class with SourceLocation and caret-style formatting locked in.
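A sketch of that error shape, shown here already extending Error (the fix A3 forced in T2). Only the class names and the caret style come from the post; the exact fields and formatting are assumptions:

// Sketch of GraftError with caret-style diagnostics (1-based columns assumed).
interface SourceLocation {
  line: number;
  column: number;
}

class GraftError extends Error {
  constructor(message: string, public location: SourceLocation, public sourceLine: string) {
    super(message);
    this.name = 'GraftError';
  }

  format(): string {
    const caret = ' '.repeat(this.location.column - 1) + '^';
    return `${this.location.line}:${this.location.column} error: ${this.message}\n` +
           `  ${this.sourceLine}\n  ${caret}`;
  }
}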

T2: Lexer

A3 found critical bug #1: GraftError didn't extend Error, which breaks every instanceof check and vitest's toThrow(). A3 and A4 together found a second bug in float parsing: given input like 42.}, the lexer would consume the . as part of a float literal without checking that a digit follows the dot.
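A sketch of the dot-lookahead fix (the scanner shape is hypothetical):

// Hypothetical number scanner showing the fix: only consume '.' when a
// digit follows, so "42.}" lexes as Int(42), '.', '}' instead of crashing.
function scanNumber(src: string, pos: number): { text: string; isFloat: boolean; next: number } {
  const isDigit = (ch: string) => ch >= '0' && ch <= '9';
  let i = pos;
  while (i < src.length && isDigit(src[i])) i++;
  if (src[i] === '.' && isDigit(src[i + 1] ?? '')) {
    i++;
    while (i < src.length && isDigit(src[i])) i++;
    return { text: src.slice(pos, i), isFloat: true, next: i };
  }
  return { text: src.slice(pos, i), isFloat: false, next: i };
}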

T3: AST

A3 proposed narrow literal unions (name: 'String' | 'Int' | 'Float' | 'Bool' instead of name: string). A2 opposed (YAGNI). A3 won — the narrow unions caught 2 codegen bugs in T6.
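The difference, sketched (the AST node shape is an assumption based on the builtin types the post names):

// Loose: any string passes the TypeScript checker.
interface TypeRefLoose { name: string; }

// Narrow: only Graft's builtin type names are assignable.
interface TypeRef { name: 'String' | 'Int' | 'Float' | 'Bool'; }

const ok: TypeRef = { name: 'Float' };
// const typo: TypeRef = { name: 'Flaot' }; // compile error; TypeRefLoose would accept it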

T4: Parser — The Show-Stopper Bug

A3-Skeptic found the CRITICAL keyword-identifier collision:

The parser's expectIdentifier() only accepted Identifier tokens. But keywords like input, compact, and skip are valid field names. The lexer tokenizes input as a keyword token, so expectIdentifier() rejected it; every common English word used as a field name would crash the parser.

produces Result {
  input: String     // "input" is TokenType.Input, not Identifier
  output: String    // "output" is TokenType.Output
  model: String     // "model" is TokenType.Model
}

Fix: expectIdentifierOrKeyword() backed by a KEYWORD_TYPES Set. All three other agents missed this.
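A sketch of the fix; expectIdentifierOrKeyword and KEYWORD_TYPES are the names from the debate, everything around them is assumed:

enum TokenType { Identifier, Input, Output, Model, Compact, Skip /* ... */ }
interface Token { type: TokenType; text: string; }

// Keyword token types that are also legal as field names.
const KEYWORD_TYPES = new Set<TokenType>([
  TokenType.Input, TokenType.Output, TokenType.Model, TokenType.Compact, TokenType.Skip,
]);

// Used at field-name positions instead of expectIdentifier().
function expectIdentifierOrKeyword(peek: () => Token, advance: () => Token): Token {
  const token = peek();
  if (token.type === TokenType.Identifier || KEYWORD_TYPES.has(token.type)) {
    return advance();
  }
  throw new Error(`expected identifier, got '${token.text}'`);
}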

T5: Analyzer

Three-class decomposition: ScopeChecker (name validation), TypeChecker (field validation), TokenEstimator (budget analysis). A3 found that the retry_then_fallback worst-case estimate omitted the fallback cost.
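The corrected arithmetic, sketched (names and retry semantics are assumptions; the post only states that the fallback term was missing):

// Worst case for retry_then_fallback: every primary attempt fails,
// then the fallback model runs once. The bug: the estimate stopped
// at primaryAttempts * primaryCost and omitted fallbackCost.
function worstCaseTokens(primaryAttempts: number, primaryCost: number, fallbackCost: number): number {
  return primaryAttempts * primaryCost + fallbackCost;
}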

T6: Code Generator

Five codegen files: agents, hooks, orchestration, settings, and the codegen orchestrator. A2 found a locale dependency in toLocaleString(): on a Japanese-locale machine, the generated numbers format differently. Fix: always pass 'en-US'.
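A minimal illustration of the bug class and the fix:

(4000).toLocaleString();        // locale-dependent: "4,000" here, "4 000" under fr-FR
(4000).toLocaleString('en-US'); // pinned: "4,000" on every machine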

T7: Compiler Pipeline & CLI

A3 found that a .gft file with contexts and nodes but no graph declaration compiled with silent success and produced degenerate output. Fix: an explicit guard.
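A sketch of the guard, assuming the pipeline holds a parsed program with an optional graph (all names hypothetical):

interface Program { contexts: unknown[]; nodes: unknown[]; graph?: unknown; }

// Runs before codegen; previously a graph-less file fell through
// and emitted degenerate output.
function checkHasGraph(program: Program, file: string): void {
  if (!program.graph) {
    throw new Error(`${file}: no graph declaration; nothing to compile`);
  }
}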

T8: Final Verification

1-agent smoke test. All 110 tests passing, all 14 benchmarks passing, clean build.


A3-Skeptic: 100% Hit Rate

| Task | Bug | Severity |
|------|-----|----------|
| T2 | GraftError not extending Error | Critical |
| T2 | Float parsing 42.} | Medium |
| T3 | Loose name: string vs narrow literals | Medium |
| T4 | Keyword-identifier collision | Critical (show-stopper) |
| T5 | Fallback cost omitted from worst case | Medium |
| T7 | No-graph silent success | Critical |

At least one significant finding in every task from T2–T7.


Honest Retrospective

What worked: The adversarial debate found real bugs that would have shipped. YAGNI discipline kept the compiler at ~2,000 lines. TypeScript + vitest was the right toolchain.

What didn't work: The plan document's code snippets went stale immediately. 3-4 agent calls per task were spent catching mechanical mismatches. The toLocaleString bug should have been caught by a linter, not a 4-agent debate.

What I'd do differently: Start with the benchmark suite for acceptance criteria. Use 2 agents instead of 4 for simple tasks (T1, T3, T8). Regenerate plan code from source before each task.

Stats

| Metric | Value |
|--------|-------|
| Source lines | ~1,936 |
| Unit tests | 110 |
| Benchmarks | 14 |
| Agent calls | ~99 |
| Ratchet-locked decisions | 46 |
| Critical bugs caught | 3 |
| Dependencies (runtime) | 1 (commander) |

Try It

git clone https://github.com/JSLEEKR/graft.git
cd graft && npm install && npm run build
node dist/index.js compile examples/hello.gft --out-dir ./output
node dist/index.js check examples/hello.gft

Built with Claude Opus 4.6 via Claude Code. March 2026.