The EvalSuite class groups multiple EvalTest instances and provides aggregate metrics across all tests.
Import
```typescript
import { EvalSuite, EvalTest } from "@mcpjam/sdk";
```
Constructor
```typescript
new EvalSuite(options?: EvalSuiteConfig)
```
Parameters
Configuration for the evaluation suite.
EvalSuiteConfig
| Property | Type | Required | Description |
|---|---|---|---|
| `name` | `string` | No | Name for the suite (defaults to `"EvalSuite"`) |
| `mcpjam` | `MCPJamReportingConfig` | No | Auto-save results to MCPJam when the suite completes |
When an API key is available (via `mcpjam.apiKey` or the `MCPJAM_API_KEY` environment variable), all test results are consolidated into a single run and saved to MCPJam after the suite completes. Individual `EvalTest` auto-saves are suppressed to avoid duplicate uploads. Set `mcpjam.enabled: false` to disable.
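For instance, auto-save can be disabled explicitly. The type below is a local stand-in built only from the fields documented on this page (`apiKey` and `enabled` are stated above; `suiteName` and `passCriteria` appear in the examples); the SDK's actual `MCPJamReportingConfig` may include more fields:

```typescript
// Local stand-in for MCPJamReportingConfig. Field names are taken from
// this page; the SDK's actual type may differ.
type MCPJamReportingConfigSketch = {
  apiKey?: string;
  enabled?: boolean;
  suiteName?: string;
  passCriteria?: { minimumPassRate: number };
};

// Disable auto-save even when MCPJAM_API_KEY is set in the environment.
const noUpload: MCPJamReportingConfigSketch = { enabled: false };

console.log(noUpload.enabled); // false
```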
Example
```typescript
const suite = new EvalSuite({ name: "Math Operations" });
```
With results saved to MCPJam:
```typescript
const reportingSuite = new EvalSuite({
  name: "Math Operations",
  mcpjam: {
    suiteName: "Math Eval",
    passCriteria: { minimumPassRate: 90 },
  },
});
```
Methods
add()
Adds a test to the suite.
```typescript
add(test: EvalTest): void
```
Parameters
| Parameter | Type | Description |
|---|---|---|
| `test` | `EvalTest` | The test to add |
Example
```typescript
suite.add(new EvalTest({
  name: "addition",
  test: async (agent) => {
    const r = await agent.prompt("Add 2 and 3");
    return r.hasToolCall("add");
  },
}));

suite.add(new EvalTest({
  name: "multiplication",
  test: async (agent) => {
    const r = await agent.prompt("Multiply 4 by 5");
    return r.hasToolCall("multiply");
  },
}));
```
run()
Runs all tests in the suite and returns aggregate results.
```typescript
run(agent: EvalAgent, options: EvalTestRunOptions): Promise<EvalSuiteResult>
```
Parameters
| Parameter | Type | Description |
|---|---|---|
| `agent` | `EvalAgent` | The agent to test with (`TestAgent` or mock) |
| `options` | `EvalTestRunOptions` | Run configuration (same as `EvalTest`) |
EvalTestRunOptions
| Property | Type | Required | Default | Description |
|---|---|---|---|---|
| `iterations` | `number` | Yes | - | Number of runs per test |
| `concurrency` | `number` | No | `5` | Parallel runs per test |
| `retries` | `number` | No | `0` | Retry failed tests |
| `timeoutMs` | `number` | No | `30000` | Timeout per test (ms) |
| `onProgress` | `ProgressCallback` | No | - | Progress callback (tracks total across all tests) |
| `onFailure` | `(report: string) => void` | No | - | Called per test with a failure report if any iterations fail |
| `mcpjam` | `MCPJamReportingConfig` | No | - | Override suite-level MCPJam config |
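As a sketch, here is a fully-populated options object using a local stand-in type. Field names come from the table above; the `ProgressCallback` signature `(done, total)` is an assumption consistent with the usage shown elsewhere on this page, and `mcpjam` is omitted for brevity:

```typescript
// Local stand-in for EvalTestRunOptions; field names from the table above.
// The ProgressCallback signature (done, total) is assumed; mcpjam omitted.
type EvalTestRunOptionsSketch = {
  iterations: number;
  concurrency?: number;
  retries?: number;
  timeoutMs?: number;
  onProgress?: (done: number, total: number) => void;
  onFailure?: (report: string) => void;
};

const options: EvalTestRunOptionsSketch = {
  iterations: 30,
  concurrency: 5,
  retries: 1,
  timeoutMs: 60_000,
  onFailure: (report) => console.error(report),
};

console.log(options.iterations); // 30
```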
EvalSuiteResult
| Property | Type | Description |
|---|---|---|
| `tests` | `Map<string, EvalRunResult>` | Per-test results |
| `aggregate.iterations` | `number` | Total iterations across all tests |
| `aggregate.successes` | `number` | Total passes |
| `aggregate.failures` | `number` | Total failures |
| `aggregate.accuracy` | `number` | Overall accuracy (0.0 - 1.0) |
| `aggregate.tokenUsage` | `{ total, perTest[] }` | Token usage |
| `aggregate.latency` | `{ e2e, llm, mcp }` | Latency stats with p50/p95 |
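The relationship between per-test results and the aggregate fields can be sketched with local stand-in types. This is illustrative only (not the SDK's internal implementation), and it assumes overall accuracy pools iterations across tests, i.e. total successes over total iterations:

```typescript
// Stand-in for the per-test result shape; illustrative only.
type RunResultSketch = { iterations: number; successes: number; failures: number };

// Pool counts across tests; accuracy = total successes / total iterations.
function aggregate(tests: Map<string, RunResultSketch>) {
  let iterations = 0;
  let successes = 0;
  let failures = 0;
  for (const r of tests.values()) {
    iterations += r.iterations;
    successes += r.successes;
    failures += r.failures;
  }
  return {
    iterations,
    successes,
    failures,
    accuracy: iterations === 0 ? 0 : successes / iterations,
  };
}

const results = new Map<string, RunResultSketch>([
  ["addition", { iterations: 30, successes: 29, failures: 1 }],
  ["multiplication", { iterations: 30, successes: 27, failures: 3 }],
]);

console.log(aggregate(results).accuracy.toFixed(3)); // "0.933" (56/60)
```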
Example
```typescript
const result = await suite.run(agent, {
  iterations: 30,
  concurrency: 5,
});

console.log(`Overall: ${(result.aggregate.accuracy * 100).toFixed(1)}%`);
```
Tests within the suite run sequentially, but each test’s iterations can run concurrently based on the concurrency setting.
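That scheduling can be sketched as a small worker pool: at most `concurrency` iterations in flight for the current test. This is a minimal illustration of the pattern, not the SDK's actual scheduler:

```typescript
// Run async tasks with at most `limit` in flight at once (worker-pool pattern).
async function runBounded<T>(tasks: Array<() => Promise<T>>, limit: number): Promise<T[]> {
  const results: T[] = new Array(tasks.length);
  let next = 0;
  const worker = async () => {
    while (next < tasks.length) {
      const i = next++; // claim the next task index
      results[i] = await tasks[i]();
    }
  };
  // Spawn up to `limit` workers that drain the task list cooperatively.
  await Promise.all(Array.from({ length: Math.min(limit, tasks.length) }, worker));
  return results;
}

async function demo() {
  // Six "iterations", at most three running concurrently.
  const iterations = Array.from({ length: 6 }, (_, i) => async () => i * i);
  console.log(await runBounded(iterations, 3)); // [ 0, 1, 4, 9, 16, 25 ]
}
demo();
```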
accuracy()
Returns the aggregate accuracy across all tests.
Returns
number - Average accuracy of all tests (0.0 - 1.0).
Example
```typescript
console.log(`Suite accuracy: ${(suite.accuracy() * 100).toFixed(1)}%`);
```
get()
Retrieves a specific test by name.
```typescript
get(name: string): EvalTest | undefined
```
Parameters
| Parameter | Type | Description |
|---|---|---|
| `name` | `string` | The test name |
Returns
EvalTest | undefined - The test, or undefined if not found.
Example
```typescript
const addTest = suite.get("addition");
if (addTest) {
  console.log(`Addition: ${(addTest.accuracy() * 100).toFixed(1)}%`);
}
```
getAll()
Returns all tests in the suite.
Returns
EvalTest[] - Array of all tests.
Example
```typescript
for (const test of suite.getAll()) {
  console.log(`${test.getName()}: ${(test.accuracy() * 100).toFixed(1)}%`);
}
```
getName()
Returns the suite’s name.
size()
Returns the number of tests in the suite.
getResults()
Returns the full suite results from the last run.
```typescript
getResults(): EvalSuiteResult | null
```
recall()
Returns aggregate recall (same as accuracy in basic context).
precision()
Returns aggregate precision (same as accuracy in basic context).
truePositiveRate()
Returns aggregate true positive rate (same as recall).
```typescript
truePositiveRate(): number
```
falsePositiveRate()
Returns aggregate false positive rate.
```typescript
falsePositiveRate(): number
```
averageTokenUse()
Returns average tokens per iteration across all tests.
```typescript
averageTokenUse(): number
```
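The calculation can be sketched as total tokens divided by total iterations, pooled across all tests. The data below is stand-in data, and this is an illustration of the math rather than the SDK's internals:

```typescript
// Average tokens per iteration, pooled across all tests (illustrative).
function averageTokens(perTest: Array<{ iterations: number; totalTokens: number }>): number {
  const iterations = perTest.reduce((sum, t) => sum + t.iterations, 0);
  const tokens = perTest.reduce((sum, t) => sum + t.totalTokens, 0);
  return iterations === 0 ? 0 : tokens / iterations;
}

console.log(averageTokens([
  { iterations: 20, totalTokens: 4000 },
  { iterations: 10, totalTokens: 1400 },
])); // 5400 / 30 = 180
```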
Properties
name
The suite’s name (via getName()).
```typescript
suite.getName(); // "Math Operations"
```
Complete Example
```typescript
import { MCPClientManager, TestAgent, EvalSuite, EvalTest } from "@mcpjam/sdk";

async function main() {
  // Setup
  const manager = new MCPClientManager({
    everything: {
      command: "npx",
      args: ["-y", "@modelcontextprotocol/server-everything"],
    },
  });
  await manager.connectToServer("everything");

  const agent = new TestAgent({
    tools: await manager.getTools(),
    model: "anthropic/claude-sonnet-4-20250514",
    apiKey: process.env.ANTHROPIC_API_KEY,
    temperature: 0.1,
  });

  // Build suite (with auto-save to MCPJam)
  const suite = new EvalSuite({
    name: "Everything Server Tests",
    mcpjam: {
      suiteName: "Everything Server Eval",
      passCriteria: { minimumPassRate: 90 },
    },
  });

  suite.add(new EvalTest({
    name: "add",
    test: async (a) => (await a.prompt("Add 2+3")).hasToolCall("add"),
  }));
  suite.add(new EvalTest({
    name: "echo",
    test: async (a) => (await a.prompt("Echo 'test'")).hasToolCall("echo"),
  }));
  suite.add(new EvalTest({
    name: "longRunningOperation",
    test: async (a) => (await a.prompt("Run a long operation")).hasToolCall("longRunningOperation"),
  }));

  // Run
  console.log(`Running ${suite.getName()}...\n`);
  const result = await suite.run(agent, {
    iterations: 20,
    concurrency: 3,
    onProgress: (done, total) => {
      process.stdout.write(`\r  Progress: ${done}/${total}`);
    },
  });

  // Report
  console.log(`\n\nOverall: ${(suite.accuracy() * 100).toFixed(1)}%`);
  console.log(`Total iterations: ${result.aggregate.iterations}\n`);

  console.log("Per-test breakdown:");
  for (const test of suite.getAll()) {
    const pct = (test.accuracy() * 100).toFixed(1);
    console.log(`  ${test.getName()}: ${pct}%`);
  }

  // Access an individual test
  const echoTest = suite.get("echo");
  if (echoTest) {
    console.log(`\nEcho test details:`);
    console.log(`  Precision: ${(echoTest.precision() * 100).toFixed(1)}%`);
    console.log(`  Recall: ${(echoTest.recall() * 100).toFixed(1)}%`);
    console.log(`  Avg tokens: ${echoTest.averageTokenUse()}`);
  }

  // Cleanup
  await manager.disconnectServer("everything");
}

main().catch(console.error);
```
Patterns
CI Quality Gate
```typescript
await suite.run(agent, { iterations: 30 });

if (suite.accuracy() < 0.9) {
  console.error(`❌ Suite accuracy ${(suite.accuracy() * 100).toFixed(1)}% below 90% threshold`);
  process.exit(1);
}
console.log("✅ All quality gates passed");
```
Per-Test Thresholds
```typescript
await suite.run(agent, { iterations: 30 });

const criticalTests = ["createOrder", "processPayment"];
let failed = false;
for (const name of criticalTests) {
  const test = suite.get(name);
  if (test && test.accuracy() < 0.95) {
    console.error(`❌ Critical test "${name}" below 95%`);
    failed = true;
  }
}
if (failed) process.exit(1);
```
Comparing Across Providers
```typescript
const providers = [
  { model: "anthropic/claude-sonnet-4-20250514", key: "ANTHROPIC_API_KEY" },
  { model: "openai/gpt-4o", key: "OPENAI_API_KEY" },
];

for (const { model, key } of providers) {
  const agent = new TestAgent({
    tools,
    model,
    apiKey: process.env[key],
  });
  await suite.run(agent, { iterations: 20 });
  console.log(`${model}: ${(suite.accuracy() * 100).toFixed(1)}%`);
}
```