# EvalTest

The `EvalTest` class runs a single test scenario multiple times and reports statistical metrics such as accuracy, precision, and recall.
## Import

```typescript
import { EvalTest } from "@mcpjam/sdk";
```
## Constructor

```typescript
new EvalTest(options: EvalTestConfig)
```
### Parameters

`options` - Configuration for the evaluation test.

### EvalTestConfig

| Property | Type | Required | Description |
|---|---|---|---|
| `name` | `string` | Yes | Unique identifier for the test |
| `test` | `TestFunction` | Yes | The test function to run |
### TestFunction Type

```typescript
type TestFunction = (agent: EvalAgent) => boolean | Promise<boolean>;
```

The test function receives an `EvalAgent` and must return a boolean:

- `true` - the test passed
- `false` - the test failed

Both `TestAgent` and mock agents implement the `EvalAgent` interface, so you can use either for testing.
### Example

```typescript
const test = new EvalTest({
  name: "addition-accuracy",
  test: async (agent) => {
    const result = await agent.prompt("Add 2 and 3");
    return result.hasToolCall("add");
  },
});
```
## Methods

### run()

Executes the test multiple times and returns detailed results.

```typescript
run(agent: EvalAgent, options: EvalTestRunOptions): Promise<EvalRunResult>
```

#### Parameters

| Parameter | Type | Description |
|---|---|---|
| `agent` | `EvalAgent` | The agent to test with (`TestAgent` or mock) |
| `options` | `EvalTestRunOptions` | Run configuration |
#### EvalTestRunOptions

| Property | Type | Required | Default | Description |
|---|---|---|---|---|
| `iterations` | `number` | Yes | - | Number of test runs |
| `concurrency` | `number` | No | `5` | Maximum number of iterations run in parallel |
| `retries` | `number` | No | `0` | Number of times to retry a failed iteration |
| `timeoutMs` | `number` | No | `30000` | Per-iteration wall-clock timeout in ms. The active prompt is aborted at this deadline, then given a 1-second grace period to settle so partial tool calls and trace history can still be captured. |
| `onProgress` | `ProgressCallback` | No | - | Progress callback |
| `onFailure` | `(report: string) => void` | No | - | Called with a failure report if any iterations fail |
| `mcpjam` | `MCPJamReportingConfig` | No | - | Auto-save results to MCPJam |
Results are automatically saved to MCPJam after the run completes when an API key is available via `mcpjam.apiKey` or the `MCPJAM_API_KEY` environment variable. Set `mcpjam.enabled: false` to disable.
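The precedence above can be sketched as a small predicate. This is illustrative only; `reportingActive` and the config shape shown here are assumptions for the sketch, not SDK internals:

```typescript
// Sketch of the documented precedence (not SDK code): reporting is turned off
// only by `enabled: false`; otherwise it activates when a key is found in
// `mcpjam.apiKey` or in the MCPJAM_API_KEY environment variable.
interface MCPJamReportingConfig {
  enabled?: boolean;
  apiKey?: string;
  suiteName?: string;
}

function reportingActive(
  cfg: MCPJamReportingConfig | undefined,
  env: Record<string, string | undefined>,
): boolean {
  if (cfg?.enabled === false) return false;
  const key = cfg?.apiKey ?? env.MCPJAM_API_KEY;
  return typeof key === "string" && key.length > 0;
}

console.log(reportingActive({ apiKey: "sk-test" }, {}));                 // true
console.log(reportingActive(undefined, { MCPJAM_API_KEY: "sk-env" }));   // true
console.log(reportingActive({ enabled: false, apiKey: "sk-test" }, {})); // false
```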
#### ProgressCallback Type

```typescript
type ProgressCallback = (completed: number, total: number) => void;
```
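Assuming the callback fires once per completed iteration, long runs can produce noisy logs; a small throttling wrapper (an illustrative helper, not part of the SDK) keeps output manageable:

```typescript
// Illustrative helper (not SDK code): wrap progress logging so it only fires
// every `n` completions, plus the final one.
type ProgressCallback = (completed: number, total: number) => void;

function throttledProgress(n: number): ProgressCallback {
  return (completed, total) => {
    if (completed % n === 0 || completed === total) {
      console.log(`${completed}/${total}`);
    }
  };
}

// Simulate a 30-iteration run reporting progress:
const report = throttledProgress(10);
for (let i = 1; i <= 30; i++) report(i, 30);
// logs 10/30, 20/30, 30/30
```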
#### Example

```typescript
await test.run(agent, {
  iterations: 30,
  concurrency: 5,
  retries: 2,
  timeoutMs: 30000,
  mcpjam: {
    suiteName: "Addition Eval",
    strict: false,
  },
  onProgress: (done, total) => {
    console.log(`${done}/${total}`);
  },
  onFailure: (report) => {
    console.error(report);
  },
});
```
### accuracy()

Returns the success rate (0.0 - 1.0).

#### Returns

`number` - Proportion of tests that passed.

#### Example

```typescript
console.log(`Accuracy: ${(test.accuracy() * 100).toFixed(1)}%`);
// "Accuracy: 96.7%"
```
### precision()

Returns the precision metric.

#### Returns

`number` - True positives / (true positives + false positives).

### recall()

Returns the recall metric.

#### Returns

`number` - True positives / (true positives + false negatives).
### truePositiveRate()

Returns the true positive rate (same as recall).

```typescript
truePositiveRate(): number
```

### falsePositiveRate()

Returns the false positive rate.

```typescript
falsePositiveRate(): number
```

#### Returns

`number` - False positives / (false positives + true negatives).
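These four metrics are the standard confusion-matrix ratios. A standalone sketch with hypothetical counts (illustrative arithmetic, not SDK code):

```typescript
// Hypothetical confusion-matrix counts from an evaluation run:
const tp = 27; // true positives
const fp = 1;  // false positives
const fn = 2;  // false negatives
const tn = 10; // true negatives

const precision = tp / (tp + fp);         // 27 / 28 ≈ 0.964
const recall = tp / (tp + fn);            // 27 / 29 ≈ 0.931 (= true positive rate)
const falsePositiveRate = fp / (fp + tn); // 1 / 11 ≈ 0.091

console.log(precision.toFixed(3), recall.toFixed(3), falsePositiveRate.toFixed(3));
// 0.964 0.931 0.091
```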
### averageTokenUse()

Returns the average number of tokens used per iteration.

```typescript
averageTokenUse(): number
```

#### Returns

`number` - Mean token count.

#### Example

```typescript
console.log(`Avg tokens: ${test.averageTokenUse()}`);
```
### getResults()

Returns the full run result from the last run.

```typescript
getResults(): EvalRunResult | null
```

#### Returns

`EvalRunResult | null` - The run result, or `null` if `run()` hasn't been called.
### EvalRunResult Type

| Property | Type | Description |
|---|---|---|
| `iterations` | `number` | Total iterations run |
| `successes` | `number` | Number that passed |
| `failures` | `number` | Number that failed |
| `results` | `boolean[]` | Pass/fail per iteration |
| `iterationDetails` | `IterationResult[]` | Detailed per-iteration results |
| `tokenUsage` | `object` | Aggregate and per-iteration token usage |
| `latency` | `object` | Latency stats (e2e, llm, mcp) with p50/p95 |
### IterationResult Type

| Property | Type | Description |
|---|---|---|
| `passed` | `boolean` | Whether this iteration passed |
| `latencies` | `LatencyBreakdown[]` | Latency per prompt in this iteration |
| `tokens` | `{ total, input, output }` | Token usage |
| `error` | `string \| undefined` | Error message if failed |
| `retryCount` | `number \| undefined` | Number of retries attempted |
| `prompts` | `PromptResult[] \| undefined` | Prompt results from this iteration |
### getName()

Returns the test's name.

### getConfig()

Returns the test's configuration.

```typescript
getConfig(): EvalTestConfig
```
### getAllIterations()

Returns all iteration details from the last run.

```typescript
getAllIterations(): IterationResult[]
```

### getFailedIterations()

Returns only the failed iterations from the last run.

```typescript
getFailedIterations(): IterationResult[]
```
#### Example

```typescript
const failures = test.getFailedIterations();
console.log(`${failures.length} failures`);
for (const fail of failures) {
  console.log(`  Error: ${fail.error}`);
}
```
### getSuccessfulIterations()

Returns only the successful iterations from the last run.

```typescript
getSuccessfulIterations(): IterationResult[]
```

### getFailureReport()

Returns a formatted failure report with traces from all failed iterations. Useful for debugging.

```typescript
getFailureReport(): string
```
#### Example

```typescript
await test.run(agent, { iterations: 30 });
if (test.accuracy() < 0.9) {
  console.error(test.getFailureReport());
}
```
## Properties

### name

The test's identifier, accessed via `getName()`.

```typescript
test.getName(); // "addition-accuracy"
```
## Test Function Patterns

### Tool Call Verification

```typescript
test: async (agent) => {
  const result = await agent.prompt("Add 5 and 3");
  return result.hasToolCall("add");
};
```
### Argument Validation

```typescript
test: async (agent) => {
  const result = await agent.prompt("Add 10 and 20");
  const args = result.getToolArguments("add");
  return args?.a === 10 && args?.b === 20;
};
```
### Response Content

```typescript
test: async (agent) => {
  const result = await agent.prompt("What is 5 + 5?");
  return result.getText().includes("10");
};
```
### Multiple Conditions

```typescript
test: async (agent) => {
  const result = await agent.prompt("Calculate 5 * 3");
  return (
    result.hasToolCall("multiply") &&
    !result.hasError() &&
    result.getText().length > 0
  );
};
```
### Multi-Turn Conversation

```typescript
test: async (agent) => {
  const r1 = await agent.prompt("Create a project");
  const r2 = await agent.prompt("Add a task to it", { context: r1 });
  return r1.hasToolCall("createProject") && r2.hasToolCall("createTask");
};
```
### With Validators

```typescript
import { matchToolCallWithArgs } from "@mcpjam/sdk";

test: async (agent) => {
  const result = await agent.prompt("Add 2 and 3");
  return matchToolCallWithArgs("add", { a: 2, b: 3 }, result.getToolCalls());
};
```
## Complete Example

```typescript
import { MCPClientManager, TestAgent, EvalTest } from "@mcpjam/sdk";

async function main() {
  const manager = new MCPClientManager({
    everything: {
      command: "npx",
      args: ["-y", "@modelcontextprotocol/server-everything"],
    },
  });
  await manager.connectToServer("everything");

  const agent = new TestAgent({
    tools: await manager.getTools(),
    model: "anthropic/claude-sonnet-4-20250514",
    apiKey: process.env.ANTHROPIC_API_KEY,
    temperature: 0.1,
  });

  const test = new EvalTest({
    name: "addition",
    test: async (agent) => {
      const r = await agent.prompt("Add 2 and 3");
      return r.hasToolCall("add");
    },
  });

  console.log("Running evaluation...");
  const result = await test.run(agent, {
    iterations: 30,
    concurrency: 5,
    mcpjam: { suiteName: "Addition Eval" },
    onProgress: (done, total) => {
      process.stdout.write(`\r${done}/${total}`);
    },
    onFailure: (report) => {
      console.error(report);
    },
  });

  console.log("\n\nResults:");
  console.log(`  Accuracy: ${(test.accuracy() * 100).toFixed(1)}%`);
  console.log(`  Precision: ${(test.precision() * 100).toFixed(1)}%`);
  console.log(`  Recall: ${(test.recall() * 100).toFixed(1)}%`);
  console.log(`  Avg tokens: ${test.averageTokenUse()}`);
  console.log(`  Iterations: ${result.iterations}`);
  console.log(`  Successes: ${result.successes}`);

  await manager.disconnectServer("everything");
}

main().catch(console.error);
```