Skip to main content
The EvalTest class runs a single test scenario multiple times and provides statistical metrics like accuracy, precision, and recall.

Import

import { EvalTest } from "@mcpjam/sdk";

Constructor

new EvalTest(options: EvalTestOptions)

Parameters

options
EvalTestOptions
required
Configuration for the evaluation test.

EvalTestOptions

PropertyTypeRequiredDescription
namestringYesUnique identifier for the test
testTestFunctionYesThe test function to run

TestFunction Type

type TestFunction = (agent: TestAgent) => Promise<boolean>
The test function receives a TestAgent and must return a boolean:
  • true = test passed
  • false = test failed

Example

const test = new EvalTest({
  name: "addition-accuracy",
  test: async (agent) => {
    const result = await agent.prompt("Add 2 and 3");
    return result.hasToolCall("add");
  },
});

Methods

run()

Executes the test multiple times.
run(agent: TestAgent, options: RunOptions): Promise<void>

Parameters

ParameterTypeDescription
agentTestAgentThe agent to test with
optionsRunOptionsRun configuration

RunOptions

PropertyTypeRequiredDefaultDescription
iterationsnumberYes-Number of test runs
concurrencynumberNo1Parallel test runs
retriesnumberNo0Retry failed tests
timeoutMsnumberNo60000Timeout per test (ms)
onProgressProgressCallbackNo-Progress callback

ProgressCallback Type

type ProgressCallback = (completed: number, total: number) => void

Example

await test.run(agent, {
  iterations: 30,
  concurrency: 5,
  retries: 2,
  timeoutMs: 30000,
  onProgress: (done, total) => {
    console.log(`${done}/${total}`);
  },
});

accuracy()

Returns the success rate (0.0 - 1.0).
accuracy(): number

Returns

number - Proportion of tests that passed.

Example

console.log(`Accuracy: ${(test.accuracy() * 100).toFixed(1)}%`);
// "Accuracy: 96.7%"

precision()

Returns the precision metric.
precision(): number

Returns

number - True positives / (True positives + False positives).

recall()

Returns the recall metric.
recall(): number

Returns

number - True positives / (True positives + False negatives).

truePositiveRate()

Returns the true positive rate (same as recall).
truePositiveRate(): number

falsePositiveRate()

Returns the false positive rate.
falsePositiveRate(): number

Returns

number - False positives / (False positives + True negatives).

averageTokenUse()

Returns the average tokens used per iteration.
averageTokenUse(): number

Returns

number - Mean token count.

Example

console.log(`Avg tokens: ${test.averageTokenUse()}`);

getResults()

Returns raw results from all iterations.
getResults(): TestResult[]

Returns

TestResult[] - Array of individual test results.

TestResult Type

PropertyTypeDescription
passedbooleanWhether the test passed
errorstring | undefinedError message if failed
latencyMsnumberTime taken
tokensnumberTokens used

Example

const results = test.getResults();

const failures = results.filter(r => !r.passed);
console.log(`Failures: ${failures.length}`);

for (const fail of failures) {
  console.log(`  Error: ${fail.error}`);
}

Properties

name

The test’s identifier.
test.name // "addition-accuracy"

Test Function Patterns

Simple Tool Check

test: async (agent) => {
  const result = await agent.prompt("Add 5 and 3");
  return result.hasToolCall("add");
}

Argument Validation

test: async (agent) => {
  const result = await agent.prompt("Add 10 and 20");
  const args = result.getToolArguments("add");
  return args?.a === 10 && args?.b === 20;
}

Response Content

test: async (agent) => {
  const result = await agent.prompt("What is 5 + 5?");
  return result.getText().includes("10");
}

Multiple Conditions

test: async (agent) => {
  const result = await agent.prompt("Calculate 5 * 3");
  return (
    result.hasToolCall("multiply") &&
    !result.hasError() &&
    result.getText().length > 0
  );
}

Multi-Turn Conversation

test: async (agent) => {
  const r1 = await agent.prompt("Create a project");
  const r2 = await agent.prompt("Add a task to it", { context: r1 });
  return r1.hasToolCall("createProject") && r2.hasToolCall("createTask");
}

With Validators

import { matchToolCallWithArgs } from "@mcpjam/sdk";

test: async (agent) => {
  const result = await agent.prompt("Add 2 and 3");
  return matchToolCallWithArgs("add", { a: 2, b: 3 }, result.getToolCalls());
}

Complete Example

import { MCPClientManager, TestAgent, EvalTest } from "@mcpjam/sdk";

async function main() {
  const manager = new MCPClientManager({
    everything: {
      command: "npx",
      args: ["-y", "@modelcontextprotocol/server-everything"],
    },
  });
  await manager.connectToServer("everything");

  const agent = new TestAgent({
    tools: await manager.getTools(),
    model: "anthropic/claude-sonnet-4-20250514",
    apiKey: process.env.ANTHROPIC_API_KEY,
    temperature: 0.1,
  });

  const test = new EvalTest({
    name: "addition",
    test: async (agent) => {
      const r = await agent.prompt("Add 2 and 3");
      return r.hasToolCall("add");
    },
  });

  console.log("Running evaluation...");

  await test.run(agent, {
    iterations: 30,
    concurrency: 5,
    onProgress: (done, total) => {
      process.stdout.write(`\r${done}/${total}`);
    },
  });

  console.log("\n\nResults:");
  console.log(`  Accuracy: ${(test.accuracy() * 100).toFixed(1)}%`);
  console.log(`  Precision: ${(test.precision() * 100).toFixed(1)}%`);
  console.log(`  Recall: ${(test.recall() * 100).toFixed(1)}%`);
  console.log(`  Avg tokens: ${test.averageTokenUse()}`);

  await manager.disconnectServer("everything");
}