agent-test-recorder

Record and replay LLM API calls for deterministic testing

The VCR for AI agents. Record real LLM interactions once, replay them deterministically forever. Zero API calls. Zero cost. Zero flakiness.

Why

LLM calls are non-deterministic. The same prompt returns different outputs every time, making tests flaky and impossible to assert on.
Every test run costs money. Running your test suite against OpenAI or Anthropic in CI burns tokens on every push.
Ephemeral execution is invisible. Agent tool calls, multi-turn chains, and streaming responses vanish after execution. Cassettes capture the full trace for debugging and regression testing.

agent-test-recorder solves all three: record real interactions once with mode: 'auto', commit the cassette files, and replay them forever.

How It Works

First run (API key required):

  Your Test ──► OpenAI API ──► Response
                    │
              ┌─────▼─────┐
              │  Cassette  │  ← saved as JSON
              │  (.json)   │
              └────────────┘

Every run after (no API key, no cost):

  Your Test ──► Cassette ──► Same Response
                (replay)     (deterministic)

Think of it like a VCR for API calls. Record once, replay forever.

When to Use This

| Scenario | How |
|---|---|
| CI/CD without API keys | Record cassettes locally, commit to git, CI replays them |
| Deterministic assertions | Same input always returns same output -- no more flaky tests |
| Cost control | Record once, run tests 1000 times for free |
| Regression detection | Change your prompt? Cassette mismatch tells you immediately |
| Debugging agent chains | Cassettes capture every step of multi-turn conversations |
| Model migration | Switch gpt-4o → gpt-4o-mini, re-record with mode: 'update' |

Features

| Feature | Description |
|---|---|
| 5 recording modes | record, playback, auto, passthrough, update |
| OpenAI adapter | chat.completions.create with streaming and tool_calls support |
| OpenAI embeddings | embeddings.create with token usage tracking |
| Anthropic adapter | messages.create with streaming events and tool_use blocks |
| Fetch adapter | Generic HTTP interception with URL pattern matching |
| Streaming support | Records SSE streams as single objects, replays as async iterables |
| Incomplete stream handling | Detects stream errors, marks responses with incomplete: true |
| SHA-256 matching | Content-based request hashing with deterministic key sorting |
| ignoreFields | Strip dynamic fields (default: ['stream']) before hashing |
| Custom matcher | Full control via user-provided (request, recorded) => boolean function |
| Similarity scoring | On mismatch, reports closest cassette entry with Jaccard similarity |
| Multi-turn ordering | Nth call with same hash returns Nth recording |
| Atomic writes | Writes to temp file then renames -- no partial cassette corruption |
| Cassette versioning | meta.version field with CassetteVersionError on mismatch |
| Cassette migration | migrateCassette() upgrades old cassettes; readCassette supports autoMigrate option |
| Header masking | Auto-strips Authorization and x-api-key headers before saving |
| Vitest + Jest plugins | withCassette() helper with setup/teardown hooks |
| CLI tools | list, show, diff, prune, stats, update commands |
| Event hooks | onRecord, onReplay, onMismatch callbacks in RecorderConfig |
| Recorder introspection | getRecordedCount(), getInteractionCount(), getSummary() methods |
| Error recording | Records 4xx/5xx error responses and replays them faithfully |

Quick Start

npm install agent-test-recorder

import { Recorder, OpenAIAdapter } from 'agent-test-recorder';
import OpenAI from 'openai';

const client = new OpenAI();
const recorder = new Recorder({
  cassettePath: './__cassettes__/my-test.json',
  mode: 'auto', // Replay recorded interactions; record any that are missing
});
const adapter = new OpenAIAdapter(recorder);

await recorder.start();
adapter.instrument(client);

// First run hits the real API, subsequent runs replay from cassette
const response = await client.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'What is 2+2?' }],
});

adapter.restore(client);
await recorder.stop();

That's it. On the first run, the real API is called and the response is saved to __cassettes__/my-test.json. On every subsequent run, the cassette is replayed instantly -- no API call, no cost, same result every time.

Recommended Project Structure

my-agent/
├── src/
│   └── agent.ts
├── tests/
│   ├── __cassettes__/          ← commit these to git!
│   │   ├── math-test.json
│   │   ├── tool-use-test.json
│   │   └── multi-turn.json
│   ├── agent.test.ts
│   └── test-helpers.ts         ← shared adapter setup
├── .gitignore
└── package.json

Key rule: commit your cassettes to git. This is what makes CI/CD work without API keys. Cassettes are plain JSON, so you can review LLM response changes in pull request diffs.
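One way to make CI depend only on the committed cassettes is to pick the mode from the environment. The sketch below is an assumption, not part of the library: pickMode and the RECORD_MODE variable are hypothetical names, while CI is the variable that most CI providers set. It selects 'playback' in CI, so a missing cassette fails with CassetteNotFoundError instead of quietly calling the real API.

```typescript
type RecordMode = 'record' | 'playback' | 'auto' | 'passthrough' | 'update';

// Hypothetical helper (not exported by agent-test-recorder):
// choose a recording mode from environment variables.
function pickMode(env: Record<string, string | undefined>): RecordMode {
  // RECORD_MODE is an assumed opt-in for re-recording, e.g. after a model change.
  if (env['RECORD_MODE'] === 'update') return 'update';
  // In CI we only replay; locally we record whatever is missing.
  return env['CI'] ? 'playback' : 'auto';
}
```

The result could then be passed as the mode option, for example new Recorder({ cassettePath, mode: pickMode(process.env) }).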

Practical Tips

| Tip | Why |
|---|---|
| Use auto mode as default | Records on first run, replays after -- no manual switching |
| Commit cassettes to git | CI runs without API keys, PR diffs show response changes |
| Set ignoreFields: ['timestamp', 'request_id'] | Dynamic fields break matching -- strip them |
| Use setup/teardown in withCassette | Avoid boilerplate adapter setup in every test |
| Run mode: 'update' after model changes | Re-records all cassettes with the new model |
| Check getSummary() in test output | Shows how many calls were recorded vs replayed |
| Use onRecord hook for cost tracking | Log token usage during recording sessions |

Recording Modes

| Mode | Cassette exists | No cassette |
|---|---|---|
| record | Overwrite with new recording | Create new cassette |
| playback | Replay recorded responses | Throw CassetteNotFoundError |
| auto | Replay known, record new interactions | Record live calls, create cassette |
| passthrough | Ignore cassette (real API calls) | Ignore (real API calls) |
| update | Real call, overwrite cassette | Real call, create cassette |

auto is the recommended default: new interactions are recorded on first run, replayed thereafter.
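The mode table can be read as a small decision function. The following is a sketch of that reading only; decideAction and the Action names are hypothetical and do not reflect how the Recorder is implemented internally.

```typescript
type RecordMode = 'record' | 'playback' | 'auto' | 'passthrough' | 'update';
type Action =
  | 'call-and-record'   // hit the real API and save the interaction
  | 'call-only'         // hit the real API, leave the cassette alone
  | 'replay'            // return the recorded response
  | 'throw-not-found'   // CassetteNotFoundError
  | 'throw-mismatch';   // CassetteMismatchError

// Hypothetical helper mirroring the table above.
function decideAction(mode: RecordMode, cassetteExists: boolean, hasMatch: boolean): Action {
  switch (mode) {
    case 'record':
    case 'update':
      return 'call-and-record';
    case 'passthrough':
      return 'call-only';
    case 'playback':
      if (!cassetteExists) return 'throw-not-found';
      return hasMatch ? 'replay' : 'throw-mismatch';
    case 'auto':
      return cassetteExists && hasMatch ? 'replay' : 'call-and-record';
  }
}
```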

SDK Adapters

OpenAI

import { Recorder, OpenAIAdapter } from 'agent-test-recorder';
import OpenAI from 'openai';

const recorder = new Recorder({ cassettePath: './cassettes/test.json', mode: 'auto' });
const adapter = new OpenAIAdapter(recorder);
const client = new OpenAI();

await recorder.start();
adapter.instrument(client); // Patches chat.completions.create and embeddings.create

Chat completions (regular and streaming):

// Regular call
const response = await client.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Hello' }],
});

// Streaming call -- works transparently
const stream = await client.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Hello' }],
  stream: true,
});
for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? '');
}

Embeddings:

const embedding = await client.embeddings.create({
  model: 'text-embedding-3-small',
  input: 'The quick brown fox',
});
// Recorded with token usage (prompt_tokens / total_tokens)

adapter.restore(client); // Restores both chat.completions.create and embeddings.create
await recorder.stop();

Captures: model, messages, token usage, tool_calls in responses, embeddings vectors.

Anthropic

import { Recorder, AnthropicAdapter } from 'agent-test-recorder';
import Anthropic from '@anthropic-ai/sdk';

const recorder = new Recorder({ cassettePath: './cassettes/test.json', mode: 'auto' });
const adapter = new AnthropicAdapter(recorder);
const client = new Anthropic();

await recorder.start();
adapter.instrument(client); // Patches client.messages.create

// Regular call
const response = await client.messages.create({
  model: 'claude-sonnet-4-20250514',
  max_tokens: 1024,
  messages: [{ role: 'user', content: 'Hello' }],
});

// Streaming call
const stream = await client.messages.create({
  model: 'claude-sonnet-4-20250514',
  max_tokens: 1024,
  messages: [{ role: 'user', content: 'Count to 3' }],
  stream: true,
});
for await (const event of stream) {
  // Replays: message_start, content_block_start, content_block_delta,
  //          content_block_stop, message_delta, message_stop
}

adapter.restore(client);
await recorder.stop();

Captures: tool_use content blocks, input/output token counts, streaming events.

Fetch (Generic HTTP)

import { Recorder, FetchAdapter } from 'agent-test-recorder';

const recorder = new Recorder({ cassettePath: './cassettes/test.json', mode: 'auto' });
const adapter = new FetchAdapter(recorder, {
  urlPattern: /api\.openai\.com|api\.anthropic\.com/, // Only intercept matching URLs
});

await recorder.start();
adapter.instrument(); // Patches globalThis.fetch (no client argument)

const response = await fetch('https://api.openai.com/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'Authorization': 'Bearer sk-...', // Auto-stripped before saving
  },
  body: JSON.stringify({ model: 'gpt-4o', messages: [] }),
});

adapter.restore(); // Restores original globalThis.fetch
await recorder.stop();

URLs that do not match urlPattern pass through to the real fetch.
Authorization and x-api-key headers are automatically stripped before saving.
The HTTP status code and status text are recorded alongside the response body.
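The two filtering rules above (URL pattern and header masking) can be sketched in a few lines. This is an illustration under assumptions: shouldIntercept and maskHeaders are hypothetical names, and the case-insensitive header comparison is a design choice made here, not confirmed library behavior.

```typescript
// Header names that must never reach a cassette file (compared in lowercase).
const SENSITIVE_HEADERS = new Set(['authorization', 'x-api-key']);

// Hypothetical helper: only URLs matching the pattern are recorded or replayed.
// Avoid the `g` flag on the pattern, since RegExp.test is stateful with it.
function shouldIntercept(url: string, urlPattern: RegExp): boolean {
  return urlPattern.test(url);
}

// Hypothetical helper: return a copy of the headers without credentials.
function maskHeaders(headers: Record<string, string>): Record<string, string> {
  const safe: Record<string, string> = {};
  for (const name of Object.keys(headers)) {
    if (!SENSITIVE_HEADERS.has(name.toLowerCase())) safe[name] = headers[name];
  }
  return safe;
}
```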

Test Framework Integration

Vitest

import { withCassette } from 'agent-test-recorder/vitest';
import { OpenAIAdapter } from 'agent-test-recorder';
import OpenAI from 'openai';

describe('my agent', () => {
  // Basic usage -- auto mode, cassette at __cassettes__/answer-test.json
  it('answers correctly', withCassette('answer-test', async (recorder) => {
    const client = new OpenAI();
    const adapter = new OpenAIAdapter(recorder);
    adapter.instrument(client);
    const result = await client.chat.completions.create({
      model: 'gpt-4o',
      messages: [{ role: 'user', content: 'What is 2+2?' }],
    });
    adapter.restore(client);
    expect(result.choices[0].message.content).toContain('4');
  }));

  // With options
  it('re-records when needed', withCassette('update-test', { mode: 'update' }, async (recorder) => {
    // ...
  }));

  // With setup/teardown hooks. The client lives outside setup so the test body can use it.
  const sharedClient = new OpenAI();
  it('uses setup hook', withCassette('hook-test', {
    setup: (recorder) => {
      const adapter = new OpenAIAdapter(recorder);
      adapter.instrument(sharedClient);
      return {
        teardown: () => adapter.restore(sharedClient),
      };
    },
  }, async (recorder) => {
    // sharedClient is already instrumented here
  }));
});

CassetteOptions

| Option | Type | Default | Description |
|---|---|---|---|
| mode | RecordMode | 'auto' | Recording mode |
| basePath | string | process.cwd() | Base directory for cassette storage |
| matcherOptions | MatcherOptions | {} | Matcher configuration |
| setup | (recorder: Recorder) => { teardown: () => void } | -- | Setup/teardown hooks |

withCassette stores cassettes at <basePath>/__cassettes__/<name>.json.

Jest

import { withCassette } from 'agent-test-recorder/jest';

// Same API as Vitest
test('answers correctly', withCassette('answer-test', async (recorder) => {
  // ...
}));

Matcher Configuration

The Matcher class controls how incoming requests are matched against recorded interactions.

import { Recorder } from 'agent-test-recorder';

const recorder = new Recorder({
  cassettePath: './cassettes/test.json',
  mode: 'auto',
  matcherOptions: {
    // Strip these fields before computing the match hash (default: ['stream'])
    ignoreFields: ['stream', 'request_id', 'timestamp'],

    // Or provide a fully custom matcher (when set, hash-based matching is skipped)
    custom: (request, recorded) => {
      return request.params.model === recorded.params.model
        && JSON.stringify(request.params.messages) === JSON.stringify(recorded.params.messages);
    },
  },
});

| Option | Type | Default | Description |
|---|---|---|---|
| ignoreFields | string[] | ['stream'] | Fields to strip from params before SHA-256 hashing |
| custom | (request: InteractionRequest, recorded: InteractionRequest) => boolean | -- | Override matching logic entirely; when set, hash-based matching is skipped |
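To make the hash-based matching concrete, here is a minimal sketch of content hashing with sorted keys and ignored fields. It is an assumption about the general technique, not the library's actual code: stableStringify and computeMatchKey are hypothetical names, and the exact payload that the real Matcher hashes may differ.

```typescript
import { createHash } from 'node:crypto';

type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Serialize with object keys sorted, so key order never changes the hash.
function stableStringify(value: Json): string {
  if (value === null || typeof value !== 'object') return JSON.stringify(value);
  if (Array.isArray(value)) return `[${value.map((v) => stableStringify(v)).join(',')}]`;
  const keys = Object.keys(value).sort();
  return `{${keys.map((k) => `${JSON.stringify(k)}:${stableStringify(value[k])}`).join(',')}}`;
}

// Hypothetical helper: drop ignored top-level fields, then hash what remains.
function computeMatchKey(
  provider: string,
  method: string,
  params: { [key: string]: Json },
  ignoreFields: string[] = ['stream'],
): string {
  const kept: { [key: string]: Json } = {};
  for (const key of Object.keys(params)) {
    if (!ignoreFields.includes(key)) kept[key] = params[key];
  }
  const digest = createHash('sha256')
    .update(stableStringify({ provider, method, params: kept }))
    .digest('hex');
  return `sha256:${digest}`;
}
```

With this scheme, a streaming and a non-streaming call with otherwise identical params produce the same key, which is why 'stream' is a sensible default for ignoreFields.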

Similarity scoring: When no match is found in playback mode, the Matcher computes Jaccard similarity between the request and all recorded interactions, reporting the closest match in the error message.
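Jaccard similarity is the size of the intersection of two sets divided by the size of their union. The sketch below shows one way to rank recorded requests by it. The tokenizer (word-like tokens from the serialized request) and the names tokenize, jaccard, and closestMatch are assumptions for illustration; the library may build its sets differently.

```typescript
// Hypothetical tokenizer: lowercase word-like tokens from the serialized value.
function tokenize(value: unknown): Set<string> {
  const text = JSON.stringify(value).toLowerCase();
  return new Set<string>(text.match(/[a-z0-9_.+-]+/g) ?? []);
}

// Jaccard similarity: |A ∩ B| / |A ∪ B|, always between 0 and 1.
function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 1;
  let shared = 0;
  a.forEach((token) => {
    if (b.has(token)) shared++;
  });
  return shared / (a.size + b.size - shared);
}

// Return the recorded entry most similar to the request, with its score.
function closestMatch<T>(request: T, recorded: T[]): { entry: T; score: number } | undefined {
  const requestTokens = tokenize(request);
  let best: { entry: T; score: number } | undefined;
  for (const entry of recorded) {
    const score = jaccard(requestTokens, tokenize(entry));
    if (!best || score > best.score) best = { entry, score };
  }
  return best;
}
```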

Cassette Format

Cassettes are plain JSON files, human-readable and version-controlled.

{
  "meta": {
    "version": 1,
    "name": "my-test",
    "createdAt": "2026-03-25T00:00:00.000Z",
    "updatedAt": "2026-03-25T00:00:00.000Z"
  },
  "interactions": [
    {
      "request": {
        "provider": "openai",
        "method": "chat.completions.create",
        "params": { "model": "gpt-4o", "messages": [{ "role": "user", "content": "Hello" }] }
      },
      "response": {
        "body": { "choices": [{ "message": { "role": "assistant", "content": "Hi!" } }] },
        "tokenUsage": { "prompt": 10, "completion": 3 },
        "latency": 450,
        "streaming": false
      },
      "timestamp": "2026-03-25T00:00:00.000Z",
      "matchKey": "sha256:abc123..."
    }
  ]
}
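Because cassettes are plain JSON, they are easy to post-process. As an illustration, the sketch below sums recorded token usage per provider from the interactions array. The interfaces are trimmed-down assumptions that cover only the fields used here, and tokensByProvider is a hypothetical helper rather than a library export.

```typescript
interface TokenUsage { prompt: number; completion: number }

// Minimal slice of the cassette format shown above.
interface InteractionSketch {
  request: { provider: string };
  response: { tokenUsage?: TokenUsage };
}

// Hypothetical helper: total prompt + completion tokens per provider.
function tokensByProvider(interactions: InteractionSketch[]): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const { request, response } of interactions) {
    const usage = response.tokenUsage;
    if (!usage) continue; // tokenUsage is optional in the cassette format
    totals[request.provider] = (totals[request.provider] ?? 0) + usage.prompt + usage.completion;
  }
  return totals;
}
```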

Atomic writes: Cassettes are written to a temp file (<path>.<uuid>.tmp) then renamed, preventing corruption from crashes or concurrent writes.
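The write-then-rename pattern can be sketched as follows. This is a synchronous illustration under assumptions (the library's writeCassette is async, and writeJsonAtomic is a hypothetical name). The temp file is created next to the target so that the rename stays on one filesystem, where it replaces the file in a single step on POSIX systems.

```typescript
import { randomUUID } from 'node:crypto';
import { mkdirSync, readFileSync, readdirSync, renameSync, writeFileSync } from 'node:fs';
import { tmpdir } from 'node:os';
import { dirname, join } from 'node:path';

// Hypothetical helper: readers see either the old file or the new one, never a partial write.
function writeJsonAtomic(filePath: string, data: unknown): void {
  mkdirSync(dirname(filePath), { recursive: true });
  const tempPath = `${filePath}.${randomUUID()}.tmp`; // unique per writer
  writeFileSync(tempPath, JSON.stringify(data, null, 2) + '\n', 'utf8');
  renameSync(tempPath, filePath); // swap the finished file into place
}

// Usage: write a tiny cassette-shaped object to a scratch directory and read it back.
const target = join(tmpdir(), `cassette-sketch-${randomUUID()}`, 'demo.json');
writeJsonAtomic(target, { meta: { version: 1, name: 'demo' }, interactions: [] });
const roundTrip = JSON.parse(readFileSync(target, 'utf8'));
const leftovers = readdirSync(dirname(target)).filter((file) => file.endsWith('.tmp'));
```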

CLI

# List all cassettes in a directory (recursive)
agent-test-recorder list
agent-test-recorder list -d ./tests

# Display cassette contents (name, version, dates, interactions)
agent-test-recorder show ./cassettes/my-test.json

# Compare two cassette files (added/removed interactions by matchKey)
agent-test-recorder diff cassette-a.json cassette-b.json

# Delete cassettes older than N days
agent-test-recorder prune --older-than 30d
agent-test-recorder prune --older-than 7d -d ./tests

# Show aggregate statistics (count, total interactions, size, by provider)
agent-test-recorder stats
agent-test-recorder stats -d ./tests

# Migrate cassettes to latest format version
agent-test-recorder update ./cassettes/old-test.json
agent-test-recorder update --all
agent-test-recorder update --all -d ./tests

| Command | Description |
|---|---|
| list | Recursively find and list all valid cassette files |
| show <cassette> | Display cassette metadata and interaction summary |
| diff <a> <b> | Compare two cassettes by matchKey (added/removed counts) |
| prune --older-than <Nd> | Delete cassettes with createdAt older than N days |
| stats | Aggregate stats: cassette count, interaction count, size, breakdown by provider |
| update <cassette> | Migrate a cassette file to the latest format version |
| update --all | Migrate all cassettes in a directory |
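For a sense of what a matchKey-based comparison involves, here is a small sketch of the idea behind diff. It is an assumption about the approach, not the CLI's implementation; diffByMatchKey and CassetteLike are hypothetical names.

```typescript
// Only the field the comparison needs.
interface CassetteLike {
  interactions: { matchKey: string }[];
}

// Hypothetical helper: keys present only in b are "added", only in a are "removed".
function diffByMatchKey(a: CassetteLike, b: CassetteLike): { added: string[]; removed: string[] } {
  const keysA = new Set(a.interactions.map((i) => i.matchKey));
  const keysB = new Set(b.interactions.map((i) => i.matchKey));
  return {
    added: Array.from(keysB).filter((key) => !keysA.has(key)),
    removed: Array.from(keysA).filter((key) => !keysB.has(key)),
  };
}
```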

Edge Cases

| Situation | Behavior |
|---|---|
| Same request multiple times | Nth call with same hash returns Nth recording (ordered matching) |
| No match in cassette (playback) | CassetteMismatchError with closest match and Jaccard similarity score |
| No match in cassette (auto) | Record live call, append to existing cassette |
| API error responses (4xx/5xx) | Recorded as-is and replayed faithfully |
| Streaming drops mid-stream | Caught, marked incomplete: true, finish_reason: 'error' |
| Parallel requests | Content-based hash matching -- order-independent |
| Sensitive headers | Authorization and x-api-key auto-stripped by Fetch adapter |
| Corrupted cassette JSON | CassetteCorruptError with file path and parse error details |
| Cassette version mismatch | CassetteVersionError with hint to run migration command |
| Concurrent test writes | Atomic write via temp file + rename prevents corruption |
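The ordered-matching rule (the Nth call with a given hash returns the Nth recording with that hash) can be sketched with a per-key counter. ReplayCursor is a hypothetical class written for illustration; it also shows why requests with different hashes can arrive in any order without affecting each other.

```typescript
interface RecordedEntry {
  matchKey: string;
  body: unknown;
}

// Hypothetical helper: tracks how many times each matchKey has been served.
class ReplayCursor {
  private readonly entries: RecordedEntry[];
  private readonly served = new Map<string, number>();

  constructor(entries: RecordedEntry[]) {
    this.entries = entries;
  }

  // Return the next unserved recording for this key, or undefined on a mismatch.
  next(matchKey: string): RecordedEntry | undefined {
    const seen = this.served.get(matchKey) ?? 0;
    const candidates = this.entries.filter((entry) => entry.matchKey === matchKey);
    const entry = candidates[seen];
    if (entry) this.served.set(matchKey, seen + 1);
    return entry;
  }
}
```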

Error Messages

All error classes include actionable hints:

CassetteMismatchError: No matching recording found

  Request:
    provider: openai
    method: chat.completions.create
    model: gpt-4o
    first message: "What is 3+3?"

  Closest match in cassette (similarity: 0.85):
    provider: openai
    method: chat.completions.create
    model: gpt-4o
    first message: "What is 2+2?"

  Hint: Run with mode: 'auto' to record this new interaction.

CassetteNotFoundError: Cassette not found: ./cassettes/test.json
  Hint: Run with mode: 'record' or 'auto' to create it.

CassetteVersionError: Cassette version mismatch: ./cassettes/test.json
  Found: v0, Expected: v1
  Hint: Run `agent-test-recorder update ./cassettes/test.json` to migrate.

Real-World Examples

Testing an AI agent end-to-end

import { withCassette } from 'agent-test-recorder/vitest';
import { OpenAIAdapter } from 'agent-test-recorder';
import OpenAI from 'openai';
import { MyAgent } from '../src/agent.js';

const client = new OpenAI();

describe('MyAgent', () => {
  it('handles a multi-step research task', withCassette('research-task', {
    setup: (recorder) => {
      const adapter = new OpenAIAdapter(recorder);
      adapter.instrument(client);
      return { teardown: () => adapter.restore(client) };
    },
  }, async () => {
    const agent = new MyAgent(client);
    const result = await agent.research('What causes northern lights?');

    // These assertions are deterministic because the LLM responses are replayed
    expect(result.sources).toHaveLength(3);
    expect(result.summary).toContain('solar wind');
    expect(result.toolCalls).toContain('web_search');
  }));
});

Cost monitoring during recording

import { Recorder } from 'agent-test-recorder';

let totalTokens = 0;

const recorder = new Recorder({
  cassettePath: './cassettes/expensive-test.json',
  mode: 'record',
  onRecord: (interaction) => {
    const usage = interaction.response.tokenUsage;
    if (usage) {
      totalTokens += usage.prompt + usage.completion;
      console.log(`[${interaction.request.params.model}] ${usage.prompt}+${usage.completion} tokens`);
    }
  },
});

// ... run your test ...

await recorder.stop();
console.log(`Total tokens used: ${totalTokens}`);
// Next run with mode: 'auto' → 0 tokens, 0 cost

Using multiple providers in one test

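// Assumes openaiClient, anthropicClient, and a shared messages array are defined in the enclosing scope
it('compares OpenAI and Anthropic', withCassette('provider-compare', {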
  setup: (recorder) => {
    const oai = new OpenAIAdapter(recorder);
    const ant = new AnthropicAdapter(recorder);
    oai.instrument(openaiClient);
    ant.instrument(anthropicClient);
    return {
      teardown: () => {
        oai.restore(openaiClient);
        ant.restore(anthropicClient);
      },
    };
  },
}, async () => {
  const [gptResult, claudeResult] = await Promise.all([
    openaiClient.chat.completions.create({ model: 'gpt-4o', messages }),
    anthropicClient.messages.create({ model: 'claude-sonnet-4-20250514', max_tokens: 1024, messages }),
  ]);
  // Both responses are recorded in the same cassette and replayed together
}));

API Reference

Exported Classes

| Class | Description |
|---|---|
| Recorder | Core record/replay engine |
| Matcher | Request matching and similarity scoring |
| BaseAdapter | Abstract base class for SDK adapters |
| OpenAIAdapter | Adapter for OpenAI SDK (chat.completions.create, embeddings.create) |
| AnthropicAdapter | Adapter for Anthropic SDK (messages.create) |
| FetchAdapter | Adapter for generic HTTP via globalThis.fetch |

Exported Functions

| Function | Signature | Description |
|---|---|---|
| createEmptyCassette | (name: string) => Cassette | Create a new empty cassette with metadata |
| readCassette | (filePath: string, options?: { autoMigrate?: boolean }) => Promise<Cassette> | Read and validate a cassette file; auto-migrate old versions when autoMigrate: true |
| writeCassette | (filePath: string, cassette: Cassette) => Promise<void> | Atomically write a cassette to disk |
| migrateCassette | (cassette: any) => Cassette | Upgrade a cassette object to the latest version |

Exported Error Classes

| Error | When thrown |
|---|---|
| CassetteMismatchError | No matching recorded interaction in playback mode |
| CassetteNotFoundError | Cassette file does not exist in playback mode |
| CassetteCorruptError | Cassette file is not valid JSON |
| CassetteVersionError | Cassette meta.version does not match CASSETTE_VERSION |

Exported Types

type RecordMode = 'record' | 'playback' | 'auto' | 'passthrough' | 'update';

interface RecorderConfig {
  cassettePath: string;
  mode: RecordMode;
  matcherOptions?: MatcherOptions;
  onRecord?: (interaction: Interaction) => void;
  onReplay?: (interaction: Interaction) => void;
  onMismatch?: (request: InteractionRequest) => void;
}

interface MatcherOptions {
  ignoreFields?: string[];
  custom?: (request: InteractionRequest, recorded: InteractionRequest) => boolean;
}

interface Cassette {
  meta: CassetteMeta;
  interactions: Interaction[];
}

interface CassetteMeta {
  version: number;
  name: string;
  createdAt: string;
  updatedAt: string;
}

interface Interaction {
  request: InteractionRequest;
  response: InteractionResponse;
  timestamp: string;
  matchKey: string;
}

interface InteractionRequest {
  provider: string;
  method: string;
  params: Record<string, unknown>;
}

interface InteractionResponse {
  body: Record<string, unknown>;
  tokenUsage?: { prompt: number; completion: number };
  latency?: number;
  streaming?: boolean;
  incomplete?: boolean;
}

interface FetchAdapterOptions {
  urlPattern: RegExp;
}

interface CassetteOptions {
  mode?: RecordMode;
  basePath?: string;
  matcherOptions?: MatcherOptions;
  setup?: (recorder: Recorder) => { teardown: () => void };
}

const CASSETTE_VERSION = 1;

Recorder Methods

| Method | Signature | Description |
|---|---|---|
| start() | () => Promise<void> | Load cassette (if exists), prepare for recording/playback |
| handle() | (request, handler) => Promise<InteractionResponse> | Match or record an interaction |
| stop() | () => Promise<void> | Save cassette (if recording), release resources |
| getRecordedCount() | () => number | Number of interactions recorded in this session |
| getInteractionCount() | () => number | Total interactions (recorded + existing cassette) |
| getSummary() | () => RecorderSummary | Session summary: mode, recorded, replayed, cassettePath, totalInteractions |

Adapter Methods

| Method | OpenAI | Anthropic | Fetch |
|---|---|---|---|
| instrument() | instrument(client) | instrument(client) | instrument() |
| restore() | restore(client) | restore(client) | restore() |

License

MIT -- see LICENSE