The EvalSuite class groups multiple EvalTest instances and provides aggregate metrics across all tests.

Import

import { EvalSuite, EvalTest } from "@mcpjam/sdk";

Constructor

new EvalSuite(options?: EvalSuiteConfig)

Parameters

options: EvalSuiteConfig
Configuration for the evaluation suite.

EvalSuiteConfig

Property   Type                   Required   Description
name       string                 No         Name for the suite (defaults to "EvalSuite")
mcpjam     MCPJamReportingConfig  No         Auto-save results to MCPJam when the suite completes

When an API key is available (via mcpjam.apiKey or the MCPJAM_API_KEY environment variable), all test results are consolidated into a single run and saved to MCPJam after the suite completes. Individual EvalTest auto-saves are suppressed to avoid duplicate uploads. Set mcpjam.enabled: false to disable.

Example

const suite = new EvalSuite({ name: "Math Operations" });

With results saved to MCPJam:

const reportingSuite = new EvalSuite({
  name: "Math Operations",
  mcpjam: {
    suiteName: "Math Eval",
    passCriteria: { minimumPassRate: 90 },
  },
});
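Conversely, auto-save can be switched off even when MCPJAM_API_KEY is set, using the enabled flag noted above. A minimal sketch:

```typescript
// Keep results local: disable MCPJam auto-save for this suite.
const localSuite = new EvalSuite({
  name: "Math Operations",
  mcpjam: { enabled: false },
});
```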

Methods

add()

Adds a test to the suite.
add(test: EvalTest): void

Parameters

Parameter  Type      Description
test       EvalTest  The test to add

Example

suite.add(new EvalTest({
  name: "addition",
  test: async (agent) => {
    const r = await agent.prompt("Add 2 and 3");
    return r.hasToolCall("add");
  },
}));

suite.add(new EvalTest({
  name: "multiplication",
  test: async (agent) => {
    const r = await agent.prompt("Multiply 4 by 5");
    return r.hasToolCall("multiply");
  },
}));

run()

Runs all tests in the suite and returns aggregate results.
run(agent: EvalAgent, options: EvalTestRunOptions): Promise<EvalSuiteResult>

Parameters

Parameter  Type                Description
agent      EvalAgent           The agent to test with (TestAgent or mock)
options    EvalTestRunOptions  Run configuration (same as EvalTest)

EvalTestRunOptions

Property     Type                      Required  Default  Description
iterations   number                    Yes       -        Number of runs per test
concurrency  number                    No        5        Parallel runs per test
retries      number                    No        0        Retry failed tests
timeoutMs    number                    No        30000    Timeout per test (ms)
onProgress   ProgressCallback          No        -        Progress callback (tracks total across all tests)
onFailure    (report: string) => void  No        -        Called per test with a failure report if any iterations fail
mcpjam       MCPJamReportingConfig     No        -        Override suite-level MCPJam config
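Several of these options combined in one call (a sketch; assumes a suite and agent already constructed as in the examples elsewhere on this page):

```typescript
// Sketch: retries, a longer timeout, and a failure-report hook.
const result = await suite.run(agent, {
  iterations: 30,
  concurrency: 5,
  retries: 2,
  timeoutMs: 60_000,
  onFailure: (report) => console.error(report),
});
```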

EvalSuiteResult

Property              Type                        Description
tests                 Map<string, EvalRunResult>  Per-test results
aggregate.iterations  number                      Total iterations across all tests
aggregate.successes   number                      Total passes
aggregate.failures    number                      Total failures
aggregate.accuracy    number                      Overall accuracy (0.0 - 1.0)
aggregate.tokenUsage  { total, perTest[] }        Token usage
aggregate.latency     { e2e, llm, mcp }           Latency stats with p50/p95

Example

const result = await suite.run(agent, {
  iterations: 30,
  concurrency: 5,
});

console.log(`Overall: ${(result.aggregate.accuracy * 100).toFixed(1)}%`);
Tests within the suite run sequentially, but each test’s iterations can run concurrently based on the concurrency setting.
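The result shape can be explored with a plain object mocked to match the table above. This is a hypothetical sketch: the nested field names inside latency (p50, p95) are assumptions based on the "p50/p95" note, not confirmed SDK field names.

```typescript
// Hypothetical object shaped like EvalSuiteResult (see the table above).
// The nested latency field names are assumptions.
const mockResult = {
  tests: new Map([
    ["addition", { iterations: 30, successes: 29, failures: 1, accuracy: 29 / 30 }],
    ["multiplication", { iterations: 30, successes: 27, failures: 3, accuracy: 27 / 30 }],
  ]),
  aggregate: {
    iterations: 60,
    successes: 56,
    failures: 4,
    accuracy: 56 / 60,
    latency: { e2e: { p50: 820, p95: 1900 } },
  },
};

// Per-test breakdown, mirroring what a real run's `result.tests` would hold.
for (const [name, r] of mockResult.tests) {
  console.log(`${name}: ${(r.accuracy * 100).toFixed(1)}% (${r.successes}/${r.iterations})`);
}
console.log(`e2e p50: ${mockResult.aggregate.latency.e2e.p50}ms`);
```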

accuracy()

Returns the aggregate accuracy across all tests.
accuracy(): number

Returns

number - Average accuracy of all tests (0.0 - 1.0).

Example

console.log(`Suite accuracy: ${(suite.accuracy() * 100).toFixed(1)}%`);

get()

Retrieves a specific test by name.
get(name: string): EvalTest | undefined

Parameters

Parameter  Type    Description
name       string  The test name

Returns

EvalTest | undefined - The test, or undefined if not found.

Example

const addTest = suite.get("addition");
if (addTest) {
  console.log(`Addition: ${(addTest.accuracy() * 100).toFixed(1)}%`);
}

getAll()

Returns all tests in the suite.
getAll(): EvalTest[]

Returns

EvalTest[] - Array of all tests.

Example

for (const test of suite.getAll()) {
  console.log(`${test.getName()}: ${(test.accuracy() * 100).toFixed(1)}%`);
}

getName()

Returns the suite’s name.
getName(): string

size()

Returns the number of tests in the suite.
size(): number

getResults()

Returns the full suite results from the last run.
getResults(): EvalSuiteResult | null

recall()

Returns the aggregate recall across all tests. In the basic pass/fail evaluation context, this equals accuracy().
recall(): number

precision()

Returns the aggregate precision across all tests. In the basic pass/fail evaluation context, this equals accuracy().
precision(): number

truePositiveRate()

Returns aggregate true positive rate (same as recall).
truePositiveRate(): number

falsePositiveRate()

Returns aggregate false positive rate.
falsePositiveRate(): number
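As a reference for how these four metrics relate, here are the standard definitions computed over a hypothetical set of confusion counts (the SDK's internal bookkeeping is not documented here; this only illustrates the conventional formulas):

```typescript
// Standard metric definitions over hypothetical confusion counts:
// tp = true positives, fp = false positives, fn = false negatives, tn = true negatives.
const tp = 56, fp = 4, fn = 2, tn = 38;

const precision = tp / (tp + fp);          // of predicted positives, fraction correct
const recall = tp / (tp + fn);             // of actual positives, fraction caught
const truePositiveRate = recall;           // TPR is another name for recall
const falsePositiveRate = fp / (fp + tn);  // of actual negatives, fraction misflagged

console.log(precision.toFixed(3), recall.toFixed(3), falsePositiveRate.toFixed(3));
```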

averageTokenUse()

Returns average tokens per iteration across all tests.
averageTokenUse(): number

Properties

name

The suite’s name (via getName()).
suite.getName() // "Math Operations"

Complete Example

import { MCPClientManager, TestAgent, EvalSuite, EvalTest } from "@mcpjam/sdk";

async function main() {
  // Setup
  const manager = new MCPClientManager({
    everything: {
      command: "npx",
      args: ["-y", "@modelcontextprotocol/server-everything"],
    },
  });
  await manager.connectToServer("everything");

  const agent = new TestAgent({
    tools: await manager.getTools(),
    model: "anthropic/claude-sonnet-4-20250514",
    apiKey: process.env.ANTHROPIC_API_KEY,
    temperature: 0.1,
  });

  // Build suite (with auto-save to MCPJam)
  const suite = new EvalSuite({
    name: "Everything Server Tests",
    mcpjam: {
      suiteName: "Everything Server Eval",
      passCriteria: { minimumPassRate: 90 },
    },
  });

  suite.add(new EvalTest({
    name: "add",
    test: async (a) => (await a.prompt("Add 2+3")).hasToolCall("add"),
  }));

  suite.add(new EvalTest({
    name: "echo",
    test: async (a) => (await a.prompt("Echo 'test'")).hasToolCall("echo"),
  }));

  suite.add(new EvalTest({
    name: "longRunningOperation",
    test: async (a) => (await a.prompt("Run a long operation")).hasToolCall("longRunningOperation"),
  }));

  // Run
  console.log(`Running ${suite.getName()}...\n`);

  const result = await suite.run(agent, {
    iterations: 20,
    concurrency: 3,
    onProgress: (done, total) => {
      process.stdout.write(`\r  Progress: ${done}/${total}`);
    },
  });

  // Report
  console.log(`\n\nOverall: ${(suite.accuracy() * 100).toFixed(1)}%`);
  console.log(`Total iterations: ${result.aggregate.iterations}\n`);

  console.log("Per-test breakdown:");
  for (const test of suite.getAll()) {
    const pct = (test.accuracy() * 100).toFixed(1);
    console.log(`  ${test.getName()}: ${pct}%`);
  }

  // Access individual test
  const echoTest = suite.get("echo");
  if (echoTest) {
    console.log(`\nEcho test details:`);
    console.log(`  Precision: ${(echoTest.precision() * 100).toFixed(1)}%`);
    console.log(`  Recall: ${(echoTest.recall() * 100).toFixed(1)}%`);
    console.log(`  Avg tokens: ${echoTest.averageTokenUse()}`);
  }

  // Cleanup
  await manager.disconnectServer("everything");
}

Patterns

CI Quality Gate

await suite.run(agent, { iterations: 30 });

if (suite.accuracy() < 0.90) {
  console.error(`❌ Suite accuracy ${(suite.accuracy() * 100).toFixed(1)}% below 90% threshold`);
  process.exit(1);
}

console.log("✅ All quality gates passed");

Per-Test Thresholds

await suite.run(agent, { iterations: 30 });

const criticalTests = ["createOrder", "processPayment"];
let failed = false;

for (const name of criticalTests) {
  const test = suite.get(name);
  if (test && test.accuracy() < 0.95) {
    console.error(`❌ Critical test "${name}" below 95%`);
    failed = true;
  }
}

if (failed) process.exit(1);

Comparing Across Providers

const providers = [
  { model: "anthropic/claude-sonnet-4-20250514", key: "ANTHROPIC_API_KEY" },
  { model: "openai/gpt-4o", key: "OPENAI_API_KEY" },
];

for (const { model, key } of providers) {
  const agent = new TestAgent({
    tools,
    model,
    apiKey: process.env[key],
  });

  await suite.run(agent, { iterations: 20 });
  console.log(`${model}: ${(suite.accuracy() * 100).toFixed(1)}%`);
}