The EvalTest class runs a single test scenario multiple times and provides statistical metrics like accuracy, precision, and recall.

Import

import { EvalTest } from "@mcpjam/sdk";

Constructor

new EvalTest(options: EvalTestConfig)

Parameters

options (EvalTestConfig, required)
Configuration for the evaluation test.

EvalTestConfig

| Property | Type | Required | Description |
| --- | --- | --- | --- |
| name | string | Yes | Unique identifier for the test |
| test | TestFunction | Yes | The test function to run |

TestFunction Type

type TestFunction = (agent: EvalAgent) => boolean | Promise<boolean>;
The test function receives an EvalAgent and must return a boolean:
  • true = test passed
  • false = test failed
Both TestAgent and mock agents implement the EvalAgent interface, so you can use either for testing.

Example

const test = new EvalTest({
  name: "addition-accuracy",
  test: async (agent) => {
    const result = await agent.prompt("Add 2 and 3");
    return result.hasToolCall("add");
  },
});

Methods

run()

Executes the test multiple times and returns detailed results.
run(agent: EvalAgent, options: EvalTestRunOptions): Promise<EvalRunResult>

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| agent | EvalAgent | The agent to test with (TestAgent or mock) |
| options | EvalTestRunOptions | Run configuration |

EvalTestRunOptions

| Property | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| iterations | number | Yes | - | Number of test runs |
| concurrency | number | No | 5 | Parallel test runs |
| retries | number | No | 0 | Retry failed tests |
| timeoutMs | number | No | 30000 | Per-iteration wall-clock timeout in ms. The active prompt is aborted at this deadline, then given a 1-second grace period to settle so partial tool calls and trace history can still be captured. |
| onProgress | ProgressCallback | No | - | Progress callback |
| onFailure | (report: string) => void | No | - | Called with a failure report if any iterations fail |
| mcpjam | MCPJamReportingConfig | No | - | Auto-save results to MCPJam |
Results are automatically saved to MCPJam after the run completes when an API key is available via mcpjam.apiKey or the MCPJAM_API_KEY environment variable. Set mcpjam.enabled: false to disable.
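The wall-clock timeout behavior can be approximated with a small self-contained sketch. This is plain TypeScript with no SDK dependency; the SDK's actual implementation (including the grace period for trace capture) may differ, and `withTimeout` is a hypothetical helper name:

```typescript
// Hypothetical sketch of a per-iteration wall-clock timeout.
// The real SDK also applies a grace period so partial traces
// can be captured; this sketch omits that step.
async function withTimeout<T>(work: Promise<T>, timeoutMs: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`iteration timed out after ${timeoutMs} ms`)),
      timeoutMs,
    );
  });
  try {
    // Whichever settles first wins: the work or the deadline.
    return await Promise.race([work, deadline]);
  } finally {
    clearTimeout(timer);
  }
}
```

Racing the prompt against a timer is the standard pattern for enforcing a deadline on work that cannot be forcibly stopped from the outside.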

ProgressCallback Type

type ProgressCallback = (completed: number, total: number) => void;

Example

await test.run(agent, {
  iterations: 30,
  concurrency: 5,
  retries: 2,
  timeoutMs: 30000,
  mcpjam: {
    suiteName: "Addition Eval",
    strict: false,
  },
  onProgress: (done, total) => {
    console.log(`${done}/${total}`);
  },
  onFailure: (report) => {
    console.error(report);
  },
});

accuracy()

Returns the success rate (0.0 - 1.0).
accuracy(): number

Returns

number - Proportion of tests that passed.

Example

console.log(`Accuracy: ${(test.accuracy() * 100).toFixed(1)}%`);
// "Accuracy: 96.7%"

precision()

Returns the precision metric.
precision(): number

Returns

number - True positives / (True positives + False positives).

recall()

Returns the recall metric.
recall(): number

Returns

number - True positives / (True positives + False negatives).

truePositiveRate()

Returns the true positive rate (same as recall).
truePositiveRate(): number

falsePositiveRate()

Returns the false positive rate.
falsePositiveRate(): number

Returns

number - False positives / (False positives + True negatives).
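For reference, all four metrics reduce to simple ratios over a confusion matrix. A self-contained sketch in plain TypeScript (the counts are hypothetical, and this does not call the SDK):

```typescript
// Confusion-matrix counts (hypothetical values for illustration).
const tp = 45; // true positives
const fp = 5;  // false positives
const fn = 3;  // false negatives
const tn = 47; // true negatives

const accuracy = (tp + tn) / (tp + fp + fn + tn);
const precision = tp / (tp + fp);
const recall = tp / (tp + fn); // also the true positive rate
const falsePositiveRate = fp / (fp + tn);

console.log(precision.toFixed(2)); // "0.90"
console.log(recall.toFixed(4));    // "0.9375"
```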

averageTokenUse()

Returns the average tokens used per iteration.
averageTokenUse(): number

Returns

number - Mean token count.

Example

console.log(`Avg tokens: ${test.averageTokenUse()}`);

getResults()

Returns the full run result from the last run.
getResults(): EvalRunResult | null

Returns

EvalRunResult | null - The run result, or null if run() hasn’t been called.

EvalRunResult Type

| Property | Type | Description |
| --- | --- | --- |
| iterations | number | Total iterations run |
| successes | number | Number that passed |
| failures | number | Number that failed |
| results | boolean[] | Pass/fail per iteration |
| iterationDetails | IterationResult[] | Detailed per-iteration results |
| tokenUsage | object | Aggregate and per-iteration token usage |
| latency | object | Latency stats (e2e, llm, mcp) with p50/p95 |

IterationResult Type

| Property | Type | Description |
| --- | --- | --- |
| passed | boolean | Whether this iteration passed |
| latencies | LatencyBreakdown[] | Latency per prompt in this iteration |
| tokens | { total, input, output } | Token usage |
| error | string \| undefined | Error message if failed |
| retryCount | number \| undefined | Number of retries attempted |
| prompts | PromptResult[] \| undefined | Prompt results from this iteration |
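To show how these fields compose, here is a sketch that derives the headline numbers from a hand-built object matching the EvalRunResult shape. The values are hypothetical and the object is constructed locally, not returned by the SDK:

```typescript
// Hypothetical EvalRunResult-shaped object (hand-built for illustration).
const result = {
  iterations: 5,
  successes: 4,
  failures: 1,
  results: [true, true, false, true, true],
  iterationDetails: [
    { passed: true, tokens: { total: 120, input: 80, output: 40 } },
    { passed: true, tokens: { total: 110, input: 75, output: 35 } },
    { passed: false, tokens: { total: 90, input: 70, output: 20 }, error: "no tool call" },
    { passed: true, tokens: { total: 130, input: 85, output: 45 } },
    { passed: true, tokens: { total: 100, input: 70, output: 30 } },
  ],
};

// accuracy() corresponds to successes / iterations.
const accuracy = result.successes / result.iterations;

// averageTokenUse() corresponds to the mean per-iteration total.
const avgTokens =
  result.iterationDetails.reduce((sum, it) => sum + it.tokens.total, 0) /
  result.iterationDetails.length;

console.log(accuracy);  // 0.8
console.log(avgTokens); // 110
```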

getName()

Returns the test’s name.
getName(): string

getConfig()

Returns the test’s configuration.
getConfig(): EvalTestConfig

getAllIterations()

Returns all iteration details from the last run.
getAllIterations(): IterationResult[]

getFailedIterations()

Returns only the failed iterations from the last run.
getFailedIterations(): IterationResult[]

Example

const failures = test.getFailedIterations();
console.log(`${failures.length} failures`);

for (const fail of failures) {
  console.log(`  Error: ${fail.error}`);
}

getSuccessfulIterations()

Returns only the successful iterations from the last run.
getSuccessfulIterations(): IterationResult[]
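Both accessors amount to filtering the iteration details by the passed flag. A self-contained sketch over hypothetical IterationResult-shaped objects (no SDK dependency; Iter is a local stand-in type):

```typescript
// Local stand-in for the IterationResult shape.
type Iter = {
  passed: boolean;
  tokens: { total: number; input: number; output: number };
  error?: string;
};

// Hypothetical iteration details for illustration.
const iterations: Iter[] = [
  { passed: true, tokens: { total: 120, input: 80, output: 40 } },
  { passed: false, tokens: { total: 90, input: 70, output: 20 }, error: "timeout" },
  { passed: true, tokens: { total: 100, input: 70, output: 30 } },
];

// Mirrors getSuccessfulIterations() / getFailedIterations().
const successes = iterations.filter((it) => it.passed);
const failures = iterations.filter((it) => !it.passed);

console.log(successes.length);             // 2
console.log(failures.map((f) => f.error)); // [ 'timeout' ]
```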

getFailureReport()

Returns a formatted failure report with traces from all failed iterations. Useful for debugging.
getFailureReport(): string

Example

await test.run(agent, { iterations: 30 });

if (test.accuracy() < 0.9) {
  console.error(test.getFailureReport());
}

Properties

name

The test’s identifier, accessed via getName().
test.getName(); // "addition-accuracy"

Test Function Patterns

Simple Tool Check

test: async (agent) => {
  const result = await agent.prompt("Add 5 and 3");
  return result.hasToolCall("add");
};

Argument Validation

test: async (agent) => {
  const result = await agent.prompt("Add 10 and 20");
  const args = result.getToolArguments("add");
  return args?.a === 10 && args?.b === 20;
};

Response Content

test: async (agent) => {
  const result = await agent.prompt("What is 5 + 5?");
  return result.getText().includes("10");
};

Multiple Conditions

test: async (agent) => {
  const result = await agent.prompt("Calculate 5 * 3");
  return (
    result.hasToolCall("multiply") &&
    !result.hasError() &&
    result.getText().length > 0
  );
};

Multi-Turn Conversation

test: async (agent) => {
  const r1 = await agent.prompt("Create a project");
  const r2 = await agent.prompt("Add a task to it", { context: r1 });
  return r1.hasToolCall("createProject") && r2.hasToolCall("createTask");
};

With Validators

import { matchToolCallWithArgs } from "@mcpjam/sdk";

test: async (agent) => {
  const result = await agent.prompt("Add 2 and 3");
  return matchToolCallWithArgs("add", { a: 2, b: 3 }, result.getToolCalls());
};

Complete Example

import { MCPClientManager, TestAgent, EvalTest } from "@mcpjam/sdk";

async function main() {
  const manager = new MCPClientManager({
    everything: {
      command: "npx",
      args: ["-y", "@modelcontextprotocol/server-everything"],
    },
  });
  await manager.connectToServer("everything");

  const agent = new TestAgent({
    tools: await manager.getTools(),
    model: "anthropic/claude-sonnet-4-20250514",
    apiKey: process.env.ANTHROPIC_API_KEY,
    temperature: 0.1,
  });

  const test = new EvalTest({
    name: "addition",
    test: async (agent) => {
      const r = await agent.prompt("Add 2 and 3");
      return r.hasToolCall("add");
    },
  });

  console.log("Running evaluation...");

  const result = await test.run(agent, {
    iterations: 30,
    concurrency: 5,
    mcpjam: { suiteName: "Addition Eval" },
    onProgress: (done, total) => {
      process.stdout.write(`\r${done}/${total}`);
    },
    onFailure: (report) => {
      console.error(report);
    },
  });

  console.log("\n\nResults:");
  console.log(`  Accuracy: ${(test.accuracy() * 100).toFixed(1)}%`);
  console.log(`  Precision: ${(test.precision() * 100).toFixed(1)}%`);
  console.log(`  Recall: ${(test.recall() * 100).toFixed(1)}%`);
  console.log(`  Avg tokens: ${test.averageTokenUse()}`);
  console.log(`  Iterations: ${result.iterations}`);
  console.log(`  Successes: ${result.successes}`);

  await manager.disconnectServer("everything");
}

main().catch(console.error);