# EvalTest

The `EvalTest` class runs a single test scenario multiple times and reports statistical metrics such as accuracy, precision, and recall.
## Import

```typescript
import { EvalTest } from "@mcpjam/sdk";
```
## Constructor

```typescript
new EvalTest(options: EvalTestConfig)
```
### Parameters

`options` - Configuration for the evaluation test.

### EvalTestConfig

| Property | Type | Required | Description |
|---|---|---|---|
| `name` | `string` | Yes | Unique identifier for the test |
| `test` | `TestFunction` | Yes | The test function to run |
### TestFunction Type

```typescript
type TestFunction = (agent: EvalAgent) => boolean | Promise<boolean>;
```

The test function receives an `EvalAgent` and must return a boolean:

- `true` - the test passed
- `false` - the test failed

Both `TestAgent` and mock agents implement the `EvalAgent` interface, so you can use either for testing.
### Example

```typescript
const test = new EvalTest({
  name: "addition-accuracy",
  test: async (agent) => {
    const result = await agent.prompt("Add 2 and 3");
    return result.hasToolCall("add");
  },
});
```
## Methods

### run()

Executes the test multiple times and returns detailed results.

```typescript
run(agent: EvalAgent, options: EvalTestRunOptions): Promise<EvalRunResult>
```

#### Parameters

| Parameter | Type | Description |
|---|---|---|
| `agent` | `EvalAgent` | The agent to test with (`TestAgent` or mock) |
| `options` | `EvalTestRunOptions` | Run configuration |
#### EvalTestRunOptions

| Property | Type | Required | Default | Description |
|---|---|---|---|---|
| `iterations` | `number` | Yes | - | Number of test runs |
| `concurrency` | `number` | No | `5` | Maximum number of iterations run in parallel |
| `retries` | `number` | No | `0` | Number of times to retry a failed iteration |
| `timeoutMs` | `number` | No | `30000` | Per-iteration wall-clock timeout in ms. The active prompt is aborted at this deadline, then given a 1-second grace period to settle so partial tool calls and trace history can still be captured. |
| `onProgress` | `ProgressCallback` | No | - | Progress callback |
| `onFailure` | `(report: string) => void` | No | - | Called with a failure report if any iterations fail |
| `mcpjam` | `MCPJamReportingConfig` | No | - | Auto-save results to MCPJam |
Results are automatically saved to MCPJam after the run completes when an API key is available via `mcpjam.apiKey` or the `MCPJAM_API_KEY` environment variable. Set `mcpjam.enabled: false` to disable.
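The precedence above can be sketched as a small predicate. This is illustrative only; `reportingActive` and the config shape shown here are assumptions for the sketch, not SDK internals:

```typescript
// Sketch of the documented precedence (not SDK code): reporting is turned off
// only by `enabled: false`; otherwise it activates when a key is found in
// `mcpjam.apiKey` or in the MCPJAM_API_KEY environment variable.
interface MCPJamReportingConfig {
  enabled?: boolean;
  apiKey?: string;
  suiteName?: string;
}

function reportingActive(
  cfg: MCPJamReportingConfig | undefined,
  env: Record<string, string | undefined>,
): boolean {
  if (cfg?.enabled === false) return false;
  const key = cfg?.apiKey ?? env.MCPJAM_API_KEY;
  return typeof key === "string" && key.length > 0;
}

console.log(reportingActive({ apiKey: "sk-test" }, {}));                 // true
console.log(reportingActive(undefined, { MCPJAM_API_KEY: "sk-env" }));   // true
console.log(reportingActive({ enabled: false, apiKey: "sk-test" }, {})); // false
```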
#### ProgressCallback Type

```typescript
type ProgressCallback = (completed: number, total: number) => void;
```
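Assuming the callback fires once per completed iteration, long runs can produce noisy logs; a small throttling wrapper (an illustrative helper, not part of the SDK) keeps output manageable:

```typescript
// Illustrative helper (not SDK code): wrap progress logging so it only fires
// every `n` completions, plus the final one.
type ProgressCallback = (completed: number, total: number) => void;

function throttledProgress(n: number): ProgressCallback {
  return (completed, total) => {
    if (completed % n === 0 || completed === total) {
      console.log(`${completed}/${total}`);
    }
  };
}

// Simulate a 30-iteration run reporting progress:
const report = throttledProgress(10);
for (let i = 1; i <= 30; i++) report(i, 30);
// logs 10/30, 20/30, 30/30
```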
#### Example

```typescript
await test.run(agent, {
  iterations: 30,
  concurrency: 5,
  retries: 2,
  timeoutMs: 30000,
  mcpjam: {
    suiteName: "Addition Eval",
    strict: false,
  },
  onProgress: (done, total) => {
    console.log(`${done}/${total}`);
  },
  onFailure: (report) => {
    console.error(report);
  },
});
```
### accuracy()

Returns the success rate (0.0 - 1.0).

#### Returns

`number` - Proportion of tests that passed.

#### Example

```typescript
console.log(`Accuracy: ${(test.accuracy() * 100).toFixed(1)}%`);
// "Accuracy: 96.7%"
```
### precision()

Returns the precision metric.

#### Returns

`number` - True positives / (true positives + false positives).

### recall()

Returns the recall metric.

#### Returns

`number` - True positives / (true positives + false negatives).
### truePositiveRate()

Returns the true positive rate (same as recall).

```typescript
truePositiveRate(): number
```

### falsePositiveRate()

Returns the false positive rate.

```typescript
falsePositiveRate(): number
```

#### Returns

`number` - False positives / (false positives + true negatives).
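These four metrics are the standard confusion-matrix ratios. A standalone sketch with hypothetical counts (illustrative arithmetic, not SDK code):

```typescript
// Hypothetical confusion-matrix counts from an evaluation run:
const tp = 27; // true positives
const fp = 1;  // false positives
const fn = 2;  // false negatives
const tn = 10; // true negatives

const precision = tp / (tp + fp);         // 27 / 28 ≈ 0.964
const recall = tp / (tp + fn);            // 27 / 29 ≈ 0.931 (= true positive rate)
const falsePositiveRate = fp / (fp + tn); // 1 / 11 ≈ 0.091

console.log(precision.toFixed(3), recall.toFixed(3), falsePositiveRate.toFixed(3));
// 0.964 0.931 0.091
```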
### averageTokenUse()

Returns the average number of tokens used per iteration.

```typescript
averageTokenUse(): number
```

#### Returns

`number` - Mean token count.

#### Example

```typescript
console.log(`Avg tokens: ${test.averageTokenUse()}`);
```
### getResults()

Returns the full run result from the last run.

```typescript
getResults(): EvalRunResult | null
```

#### Returns

`EvalRunResult | null` - The run result, or `null` if `run()` hasn't been called.
### EvalRunResult Type

| Property | Type | Description |
|---|---|---|
| `iterations` | `number` | Total iterations run |
| `successes` | `number` | Number that passed |
| `failures` | `number` | Number that failed |
| `results` | `boolean[]` | Pass/fail per iteration |
| `iterationDetails` | `IterationResult[]` | Detailed per-iteration results |
| `tokenUsage` | `object` | Aggregate and per-iteration token usage |
| `latency` | `object` | Latency stats (e2e, llm, mcp) with p50/p95 |
### IterationResult Type

| Property | Type | Description |
|---|---|---|
| `passed` | `boolean` | Whether this iteration passed |
| `latencies` | `LatencyBreakdown[]` | Latency per prompt in this iteration |
| `tokens` | `{ total, input, output }` | Token usage |
| `error` | `string \| undefined` | Error message if failed |
| `retryCount` | `number \| undefined` | Number of retries attempted |
| `prompts` | `PromptResult[] \| undefined` | Prompt results from this iteration |
### getName()

Returns the test's name.

### getConfig()

Returns the test's configuration.

```typescript
getConfig(): EvalTestConfig
```
### getAllIterations()

Returns all iteration details from the last run.

```typescript
getAllIterations(): IterationResult[]
```

### getFailedIterations()

Returns only the failed iterations from the last run.

```typescript
getFailedIterations(): IterationResult[]
```
#### Example

```typescript
const failures = test.getFailedIterations();
console.log(`${failures.length} failures`);
for (const fail of failures) {
  console.log(`  Error: ${fail.error}`);
}
```
### getSuccessfulIterations()

Returns only the successful iterations from the last run.

```typescript
getSuccessfulIterations(): IterationResult[]
```

### getFailureReport()

Returns a formatted failure report with traces from all failed iterations. Useful for debugging.

```typescript
getFailureReport(): string
```
#### Example

```typescript
await test.run(agent, { iterations: 30 });
if (test.accuracy() < 0.9) {
  console.error(test.getFailureReport());
}
```
## Properties

### name

The test's identifier, accessed via `getName()`.

```typescript
test.getName(); // "addition-accuracy"
```
## Test Function Patterns

### Tool Call Verification

```typescript
test: async (agent) => {
  const result = await agent.prompt("Add 5 and 3");
  return result.hasToolCall("add");
};
```
### Argument Validation

```typescript
test: async (agent) => {
  const result = await agent.prompt("Add 10 and 20");
  const args = result.getToolArguments("add");
  return args?.a === 10 && args?.b === 20;
};
```
### Response Content

```typescript
test: async (agent) => {
  const result = await agent.prompt("What is 5 + 5?");
  return result.getText().includes("10");
};
```
### Multiple Conditions

```typescript
test: async (agent) => {
  const result = await agent.prompt("Calculate 5 * 3");
  return (
    result.hasToolCall("multiply") &&
    !result.hasError() &&
    result.getText().length > 0
  );
};
```
### Multi-Turn Conversation

```typescript
test: async (agent) => {
  const r1 = await agent.prompt("Create a project");
  const r2 = await agent.prompt("Add a task to it", { context: r1 });
  return r1.hasToolCall("createProject") && r2.hasToolCall("createTask");
};
```
### With Validators

```typescript
import { matchToolCallWithArgs } from "@mcpjam/sdk";

test: async (agent) => {
  const result = await agent.prompt("Add 2 and 3");
  return matchToolCallWithArgs("add", { a: 2, b: 3 }, result.getToolCalls());
};
```
## Complete Example

```typescript
import { MCPClientManager, TestAgent, EvalTest } from "@mcpjam/sdk";

async function main() {
  const manager = new MCPClientManager({
    everything: {
      command: "npx",
      args: ["-y", "@modelcontextprotocol/server-everything"],
    },
  });
  await manager.connectToServer("everything");

  const agent = new TestAgent({
    tools: await manager.getTools(),
    model: "anthropic/claude-sonnet-4-20250514",
    apiKey: process.env.ANTHROPIC_API_KEY,
    temperature: 0.1,
  });

  const test = new EvalTest({
    name: "addition",
    test: async (agent) => {
      const r = await agent.prompt("Add 2 and 3");
      return r.hasToolCall("add");
    },
  });

  console.log("Running evaluation...");
  const result = await test.run(agent, {
    iterations: 30,
    concurrency: 5,
    mcpjam: { suiteName: "Addition Eval" },
    onProgress: (done, total) => {
      process.stdout.write(`\r${done}/${total}`);
    },
    onFailure: (report) => {
      console.error(report);
    },
  });

  console.log("\n\nResults:");
  console.log(`  Accuracy: ${(test.accuracy() * 100).toFixed(1)}%`);
  console.log(`  Precision: ${(test.precision() * 100).toFixed(1)}%`);
  console.log(`  Recall: ${(test.recall() * 100).toFixed(1)}%`);
  console.log(`  Avg tokens: ${test.averageTokenUse()}`);
  console.log(`  Iterations: ${result.iterations}`);
  console.log(`  Successes: ${result.successes}`);

  await manager.disconnectServer("everything");
}

main().catch(console.error);
```