The EvalSuite class groups multiple EvalTest instances and provides aggregate metrics across all tests.

Import

import { EvalSuite, EvalTest } from "@mcpjam/sdk";

Constructor

new EvalSuite(options?: EvalSuiteConfig)

Parameters

options: EvalSuiteConfig
Configuration for the evaluation suite.

EvalSuiteConfig

Property   Type                   Required   Description
name       string                 No         Name for the suite (defaults to "EvalSuite")
mcpjam     MCPJamReportingConfig  No         Auto-save results to MCPJam when the suite completes

When an API key is available (via mcpjam.apiKey or the MCPJAM_API_KEY environment variable), all test results are consolidated into a single run and saved to MCPJam after the suite completes. Individual EvalTest auto-saves are suppressed to avoid duplicate uploads. Set mcpjam.enabled: false to disable.

Example

const suite = new EvalSuite({ name: "Math Operations" });

With results saved to MCPJam:

const reportingSuite = new EvalSuite({
  name: "Math Operations",
  mcpjam: {
    suiteName: "Math Eval",
    passCriteria: { minimumPassRate: 90 },
  },
});
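Conversely, auto-save can be switched off even when MCPJAM_API_KEY is set, using the enabled flag noted above. A minimal sketch:

```typescript
// Keep results local: disable MCPJam auto-save for this suite.
const localSuite = new EvalSuite({
  name: "Math Operations",
  mcpjam: { enabled: false },
});
```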

Methods

add()

Adds a test to the suite.
add(test: EvalTest): void

Parameters

Parameter  Type      Description
test       EvalTest  The test to add

Example

suite.add(new EvalTest({
  name: "addition",
  test: async (agent) => {
    const r = await agent.prompt("Add 2 and 3");
    return r.hasToolCall("add");
  },
}));

suite.add(new EvalTest({
  name: "multiplication",
  test: async (agent) => {
    const r = await agent.prompt("Multiply 4 by 5");
    return r.hasToolCall("multiply");
  },
}));

run()

Runs all tests in the suite and returns aggregate results.
run(agent: EvalAgent, options: EvalTestRunOptions): Promise<EvalSuiteResult>

Parameters

Parameter  Type                Description
agent      EvalAgent           The agent to test with (TestAgent or mock)
options    EvalTestRunOptions  Run configuration (same as EvalTest)

EvalTestRunOptions

Property     Type                      Required  Default  Description
iterations   number                    Yes       -        Number of runs per test
concurrency  number                    No        5        Parallel runs per test
retries      number                    No        0        Retry failed tests
timeoutMs    number                    No        30000    Timeout per test (ms)
onProgress   ProgressCallback          No        -        Progress callback (tracks total across all tests)
onFailure    (report: string) => void  No        -        Called per test with a failure report if any iterations fail
mcpjam       MCPJamReportingConfig     No        -        Override suite-level MCPJam config
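Several of these options combined in one call (a sketch; assumes a suite and agent already constructed as in the examples elsewhere on this page):

```typescript
// Sketch: retries, a longer timeout, and a failure-report hook.
const result = await suite.run(agent, {
  iterations: 30,
  concurrency: 5,
  retries: 2,
  timeoutMs: 60_000,
  onFailure: (report) => console.error(report),
});
```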

EvalSuiteResult

Property              Type                        Description
tests                 Map<string, EvalRunResult>  Per-test results
aggregate.iterations  number                      Total iterations across all tests
aggregate.successes   number                      Total passes
aggregate.failures    number                      Total failures
aggregate.accuracy    number                      Overall accuracy (0.0 - 1.0)
aggregate.tokenUsage  { total, perTest[] }        Token usage
aggregate.latency     { e2e, llm, mcp }           Latency stats with p50/p95

Example

const result = await suite.run(agent, {
  iterations: 30,
  concurrency: 5,
});

console.log(`Overall: ${(result.aggregate.accuracy * 100).toFixed(1)}%`);
Tests within the suite run sequentially, but each test’s iterations can run concurrently based on the concurrency setting.
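The result shape can be explored with a plain object mocked to match the table above. This is a hypothetical sketch: the nested field names inside latency (p50, p95) are assumptions based on the "p50/p95" note, not confirmed SDK field names.

```typescript
// Hypothetical object shaped like EvalSuiteResult (see the table above).
// The nested latency field names are assumptions.
const mockResult = {
  tests: new Map([
    ["addition", { iterations: 30, successes: 29, failures: 1, accuracy: 29 / 30 }],
    ["multiplication", { iterations: 30, successes: 27, failures: 3, accuracy: 27 / 30 }],
  ]),
  aggregate: {
    iterations: 60,
    successes: 56,
    failures: 4,
    accuracy: 56 / 60,
    latency: { e2e: { p50: 820, p95: 1900 } },
  },
};

// Per-test breakdown, mirroring what a real run's `result.tests` would hold.
for (const [name, r] of mockResult.tests) {
  console.log(`${name}: ${(r.accuracy * 100).toFixed(1)}% (${r.successes}/${r.iterations})`);
}
console.log(`e2e p50: ${mockResult.aggregate.latency.e2e.p50}ms`);
```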

accuracy()

Returns the aggregate accuracy across all tests.
accuracy(): number

Returns

number - Average accuracy of all tests (0.0 - 1.0).

Example

console.log(`Suite accuracy: ${(suite.accuracy() * 100).toFixed(1)}%`);

get()

Retrieves a specific test by name.
get(name: string): EvalTest | undefined

Parameters

Parameter  Type    Description
name       string  The test name

Returns

EvalTest | undefined - The test, or undefined if not found.

Example

const addTest = suite.get("addition");
if (addTest) {
  console.log(`Addition: ${(addTest.accuracy() * 100).toFixed(1)}%`);
}

getAll()

Returns all tests in the suite.
getAll(): EvalTest[]

Returns

EvalTest[] - Array of all tests.

Example

for (const test of suite.getAll()) {
  console.log(`${test.getName()}: ${(test.accuracy() * 100).toFixed(1)}%`);
}

getName()

Returns the suite’s name.
getName(): string

size()

Returns the number of tests in the suite.
size(): number

getResults()

Returns the full suite results from the last run.
getResults(): EvalSuiteResult | null

recall()

Returns the aggregate recall across all tests. In the basic pass/fail evaluation context, this equals accuracy().
recall(): number

precision()

Returns the aggregate precision across all tests. In the basic pass/fail evaluation context, this equals accuracy().
precision(): number

truePositiveRate()

Returns aggregate true positive rate (same as recall).
truePositiveRate(): number

falsePositiveRate()

Returns aggregate false positive rate.
falsePositiveRate(): number
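As a reference for how these four metrics relate, here are the standard definitions computed over a hypothetical set of confusion counts (the SDK's internal bookkeeping is not documented here; this only illustrates the conventional formulas):

```typescript
// Standard metric definitions over hypothetical confusion counts:
// tp = true positives, fp = false positives, fn = false negatives, tn = true negatives.
const tp = 56, fp = 4, fn = 2, tn = 38;

const precision = tp / (tp + fp);          // of predicted positives, fraction correct
const recall = tp / (tp + fn);             // of actual positives, fraction caught
const truePositiveRate = recall;           // TPR is another name for recall
const falsePositiveRate = fp / (fp + tn);  // of actual negatives, fraction misflagged

console.log(precision.toFixed(3), recall.toFixed(3), falsePositiveRate.toFixed(3));
```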

averageTokenUse()

Returns average tokens per iteration across all tests.
averageTokenUse(): number

Properties

name

The suite’s name (via getName()).
suite.getName() // "Math Operations"

Complete Example

import { MCPClientManager, TestAgent, EvalSuite, EvalTest } from "@mcpjam/sdk";

async function main() {
  // Setup
  const manager = new MCPClientManager({
    everything: {
      command: "npx",
      args: ["-y", "@modelcontextprotocol/server-everything"],
    },
  });
  await manager.connectToServer("everything");

  const agent = new TestAgent({
    tools: await manager.getTools(),
    model: "anthropic/claude-sonnet-4-20250514",
    apiKey: process.env.ANTHROPIC_API_KEY,
    temperature: 0.1,
  });

  // Build suite (with auto-save to MCPJam)
  const suite = new EvalSuite({
    name: "Everything Server Tests",
    mcpjam: {
      suiteName: "Everything Server Eval",
      passCriteria: { minimumPassRate: 90 },
    },
  });

  suite.add(new EvalTest({
    name: "add",
    test: async (a) => (await a.prompt("Add 2+3")).hasToolCall("add"),
  }));

  suite.add(new EvalTest({
    name: "echo",
    test: async (a) => (await a.prompt("Echo 'test'")).hasToolCall("echo"),
  }));

  suite.add(new EvalTest({
    name: "longRunningOperation",
    test: async (a) => (await a.prompt("Run a long operation")).hasToolCall("longRunningOperation"),
  }));

  // Run
  console.log(`Running ${suite.getName()}...\n`);

  const result = await suite.run(agent, {
    iterations: 20,
    concurrency: 3,
    onProgress: (done, total) => {
      process.stdout.write(`\r  Progress: ${done}/${total}`);
    },
  });

  // Report
  console.log(`\n\nOverall: ${(suite.accuracy() * 100).toFixed(1)}%`);
  console.log(`Total iterations: ${result.aggregate.iterations}\n`);

  console.log("Per-test breakdown:");
  for (const test of suite.getAll()) {
    const pct = (test.accuracy() * 100).toFixed(1);
    console.log(`  ${test.getName()}: ${pct}%`);
  }

  // Access individual test
  const echoTest = suite.get("echo");
  if (echoTest) {
    console.log(`\nEcho test details:`);
    console.log(`  Precision: ${(echoTest.precision() * 100).toFixed(1)}%`);
    console.log(`  Recall: ${(echoTest.recall() * 100).toFixed(1)}%`);
    console.log(`  Avg tokens: ${echoTest.averageTokenUse()}`);
  }

  // Cleanup
  await manager.disconnectServer("everything");
}

Patterns

CI Quality Gate

await suite.run(agent, { iterations: 30 });

if (suite.accuracy() < 0.90) {
  console.error(`❌ Suite accuracy ${(suite.accuracy() * 100).toFixed(1)}% below 90% threshold`);
  process.exit(1);
}

console.log("✅ All quality gates passed");

Per-Test Thresholds

await suite.run(agent, { iterations: 30 });

const criticalTests = ["createOrder", "processPayment"];
let failed = false;

for (const name of criticalTests) {
  const test = suite.get(name);
  if (test && test.accuracy() < 0.95) {
    console.error(`❌ Critical test "${name}" below 95%`);
    failed = true;
  }
}

if (failed) process.exit(1);

Comparing Across Providers

const providers = [
  { model: "anthropic/claude-sonnet-4-20250514", key: "ANTHROPIC_API_KEY" },
  { model: "openai/gpt-4o", key: "OPENAI_API_KEY" },
];

for (const { model, key } of providers) {
  const agent = new TestAgent({
    tools,
    model,
    apiKey: process.env[key],
  });

  await suite.run(agent, { iterations: 20 });
  console.log(`${model}: ${(suite.accuracy() * 100).toFixed(1)}%`);
}