The EvalTest class runs a single test scenario multiple times and provides statistical metrics like accuracy, precision, and recall.
Import
import { EvalTest } from "@mcpjam/sdk";
Constructor
new EvalTest(options: EvalTestOptions)
Parameters
Configuration for the evaluation test.
EvalTestOptions
| Property | Type | Required | Description |
|---|
name | string | Yes | Unique identifier for the test |
test | TestFunction | Yes | The test function to run |
TestFunction Type
type TestFunction = (agent: TestAgent) => Promise<boolean>
The test function receives a TestAgent and must return a boolean:
true = test passed
false = test failed
Example
const test = new EvalTest({
name: "addition-accuracy",
test: async (agent) => {
const result = await agent.prompt("Add 2 and 3");
return result.hasToolCall("add");
},
});
Methods
run()
Executes the test multiple times.
run(agent: TestAgent, options: RunOptions): Promise<void>
Parameters
| Parameter | Type | Description |
|---|
agent | TestAgent | The agent to test with |
options | RunOptions | Run configuration |
RunOptions
| Property | Type | Required | Default | Description |
|---|
iterations | number | Yes | - | Number of test runs |
concurrency | number | No | 1 | Parallel test runs |
retries | number | No | 0 | Retry failed tests |
timeoutMs | number | No | 60000 | Timeout per test (ms) |
onProgress | ProgressCallback | No | - | Progress callback |
ProgressCallback Type
type ProgressCallback = (completed: number, total: number) => void
Example
await test.run(agent, {
iterations: 30,
concurrency: 5,
retries: 2,
timeoutMs: 30000,
onProgress: (done, total) => {
console.log(`${done}/${total}`);
},
});
accuracy()
Returns the success rate (0.0 - 1.0).
Returns
number - Proportion of tests that passed.
Example
console.log(`Accuracy: ${(test.accuracy() * 100).toFixed(1)}%`);
// "Accuracy: 96.7%"
precision()
Returns the precision metric.
Returns
number - True positives / (True positives + False positives).
recall()
Returns the recall metric.
Returns
number - True positives / (True positives + False negatives).
truePositiveRate()
Returns the true positive rate (same as recall).
truePositiveRate(): number
falsePositiveRate()
Returns the false positive rate.
falsePositiveRate(): number
Returns
number - False positives / (False positives + True negatives).
averageTokenUse()
Returns the average tokens used per iteration.
averageTokenUse(): number
Returns
number - Mean token count.
Example
console.log(`Avg tokens: ${test.averageTokenUse()}`);
getResults()
Returns raw results from all iterations.
getResults(): TestResult[]
Returns
TestResult[] - Array of individual test results.
TestResult Type
| Property | Type | Description |
|---|
passed | boolean | Whether the test passed |
error | string | undefined | Error message if failed |
latencyMs | number | Time taken |
tokens | number | Tokens used |
Example
const results = test.getResults();
const failures = results.filter(r => !r.passed);
console.log(`Failures: ${failures.length}`);
for (const fail of failures) {
console.log(` Error: ${fail.error}`);
}
Properties
name
The test’s identifier.
test.name // "addition-accuracy"
Test Function Patterns
test: async (agent) => {
const result = await agent.prompt("Add 5 and 3");
return result.hasToolCall("add");
}
Argument Validation
test: async (agent) => {
const result = await agent.prompt("Add 10 and 20");
const args = result.getToolArguments("add");
return args?.a === 10 && args?.b === 20;
}
Response Content
test: async (agent) => {
const result = await agent.prompt("What is 5 + 5?");
return result.getText().includes("10");
}
Multiple Conditions
test: async (agent) => {
const result = await agent.prompt("Calculate 5 * 3");
return (
result.hasToolCall("multiply") &&
!result.hasError() &&
result.getText().length > 0
);
}
Multi-Turn Conversation
test: async (agent) => {
const r1 = await agent.prompt("Create a project");
const r2 = await agent.prompt("Add a task to it", { context: r1 });
return r1.hasToolCall("createProject") && r2.hasToolCall("createTask");
}
With Validators
import { matchToolCallWithArgs } from "@mcpjam/sdk";
test: async (agent) => {
const result = await agent.prompt("Add 2 and 3");
return matchToolCallWithArgs("add", { a: 2, b: 3 }, result.getToolCalls());
}
Complete Example
import { MCPClientManager, TestAgent, EvalTest } from "@mcpjam/sdk";
async function main() {
const manager = new MCPClientManager({
everything: {
command: "npx",
args: ["-y", "@modelcontextprotocol/server-everything"],
},
});
await manager.connectToServer("everything");
const agent = new TestAgent({
tools: await manager.getTools(),
model: "anthropic/claude-sonnet-4-20250514",
apiKey: process.env.ANTHROPIC_API_KEY,
temperature: 0.1,
});
const test = new EvalTest({
name: "addition",
test: async (agent) => {
const r = await agent.prompt("Add 2 and 3");
return r.hasToolCall("add");
},
});
console.log("Running evaluation...");
await test.run(agent, {
iterations: 30,
concurrency: 5,
onProgress: (done, total) => {
process.stdout.write(`\r${done}/${total}`);
},
});
console.log("\n\nResults:");
console.log(` Accuracy: ${(test.accuracy() * 100).toFixed(1)}%`);
console.log(` Precision: ${(test.precision() * 100).toFixed(1)}%`);
console.log(` Recall: ${(test.recall() * 100).toFixed(1)}%`);
console.log(` Avg tokens: ${test.averageTokenUse()}`);
await manager.disconnectServer("everything");
}