Why Run Evals?
A single test pass can be misleading:

- The LLM might get lucky on one attempt
- Temperature introduces randomness
- Different phrasings might fail where others succeed

Running a test many times lets you make statistical claims instead:

- “This tool is called correctly 97% of the time”
- “Arguments are correct in 90% of cases”
- “Average latency is 1.2 seconds”
EvalTest: Single Test Scenario
EvalTest runs one test function multiple times:
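The real EvalTest API may differ, but the core idea — repeat one async test function and aggregate pass/fail counts — can be sketched as follows (all names here are hypothetical, not MCPJam's actual API):

```typescript
// Hypothetical sketch: run one async test function many times and
// report the pass rate, which is what an EvalTest-style runner does.
type TestFn = () => Promise<boolean>;

async function runEvalTest(test: TestFn, iterations: number): Promise<number> {
  let passes = 0;
  for (let i = 0; i < iterations; i++) {
    try {
      if (await test()) passes++;
    } catch {
      // A thrown error counts as a failed iteration, not a crashed run.
    }
  }
  return passes / iterations; // pass rate in [0, 1]
}

// Usage: a trivial test that always passes.
runEvalTest(async () => true, 10).then((rate) => {
  console.log(`pass rate: ${rate}`);
});
```

Running the same function many times is what turns a single boolean into a pass-rate statistic.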
Writing Test Functions
Test functions receive an EvalAgent (implemented by TestAgent and mock agents) and return a boolean:
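As a hedged sketch — the real EvalAgent interface may differ — a test function drives the agent and returns true or false for pass/fail. The interface shape, tool name, and mock below are hypothetical:

```typescript
// Hypothetical minimal shape of the agent a test function receives.
interface EvalAgent {
  prompt(text: string): Promise<{
    toolCalls: { name: string; args: Record<string, unknown> }[];
  }>;
}

// A test function: prompt the agent, inspect the tool calls it made,
// and return a boolean verdict.
async function callsWeatherTool(agent: EvalAgent): Promise<boolean> {
  const result = await agent.prompt("What's the weather in Paris?");
  const call = result.toolCalls.find((c) => c.name === "get_weather");
  return call !== undefined && call.args["city"] === "Paris";
}

// A mock agent lets the test run without a live model.
const mockAgent: EvalAgent = {
  async prompt() {
    return { toolCalls: [{ name: "get_weather", args: { city: "Paris" } }] };
  },
};
```

Because the function only sees the agent interface, the same test runs unchanged against a real model or a mock.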
Run Options
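Runs typically accept options such as iteration count, concurrency, and sampling temperature. A hypothetical sketch of an options object with defaults (these names are assumptions, not the library's actual option names):

```typescript
// Hypothetical run options; the real library's names and defaults may differ.
interface RunOptions {
  iterations?: number;  // how many times to repeat the test
  concurrency?: number; // how many iterations run in parallel
  temperature?: number; // sampling temperature passed to the model
}

function withDefaults(opts: RunOptions): Required<RunOptions> {
  // Later properties win, so caller-supplied options override the defaults.
  return { iterations: 10, concurrency: 5, temperature: 0, ...opts };
}

console.log(withDefaults({ iterations: 50 }));
```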
Metrics
After running, access metrics such as the pass rate, argument accuracy, and average latency.

EvalSuite: Multiple Tests

Group related tests together so they run and report as one suite.

Save Results to MCPJam

Both EvalTest and EvalSuite can automatically save results to MCPJam when a run completes. Set MCPJAM_API_KEY in your environment and results are saved automatically.
Suite Results
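Suite results aggregate each test's metrics. As a sketch with hypothetical field names (the real result shape may differ):

```typescript
// Hypothetical shape of suite results: one entry per test.
interface TestResult {
  name: string;
  passRate: number; // fraction of iterations that passed, in [0, 1]
}

// Summarize a suite as a test count plus an average pass rate.
function summarizeSuite(results: TestResult[]): {
  total: number;
  averagePassRate: number;
} {
  const total = results.length;
  const averagePassRate =
    total === 0 ? 0 : results.reduce((sum, r) => sum + r.passRate, 0) / total;
  return { total, averagePassRate };
}

const summary = summarizeSuite([
  { name: "calls_weather_tool", passRate: 1 },
  { name: "handles_missing_city", passRate: 0.5 },
]);
console.log(summary); // { total: 2, averagePassRate: 0.75 }
```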
Choosing Iteration Count
| Scenario | Iterations | Why |
|---|---|---|
| Quick smoke test | 10 | Fast feedback during development |
| Regular testing | 30 | Good statistical significance |
| Pre-release | 50-100 | High confidence before shipping |
| Benchmarking | 100+ | Comparing models or changes |
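The iteration counts above trade speed against statistical confidence. A quick way to see why is the standard error of a measured pass rate, sqrt(p·(1−p)/n) — this short sketch (not part of the library) computes it for the table's iteration counts:

```typescript
// Standard error of a measured pass rate p over n iterations.
// More iterations -> narrower error bars on the reported percentage.
function standardError(p: number, n: number): number {
  return Math.sqrt((p * (1 - p)) / n);
}

// With a true pass rate around 90%, 10 runs leave roughly ±9.5
// percentage points of noise, while 100 runs narrow it to about ±3.
for (const n of [10, 30, 100]) {
  console.log(`n=${n}: ±${(standardError(0.9, n) * 100).toFixed(1)} points`);
}
```

This is why a 10-iteration smoke test can swing wildly between runs while a 100-iteration benchmark gives comparable numbers.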
Best Practices
Use Low Temperature
Lower temperature gives more deterministic results for testing.

Handle Rate Limits

Reduce concurrency for rate-limited APIs.

Test Edge Cases

Don’t just test the happy path.

Set Quality Thresholds

Fail CI if accuracy drops below a threshold.

Generate evals from the Inspector

You can also generate eval code from the MCPJam Inspector. Click ⋮ → Copy markdown for server evals on any server card, then paste it into an LLM. See the Quickstart for details. If you have an MCPJAM_API_KEY, the generated code will automatically save results to the Evals tab in the Inspector. Go to Settings > Workspace API Key to get your key.
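Returning to the quality-threshold practice above: a CI gate can be sketched as a script that exits non-zero when accuracy falls below the threshold, which fails the CI job. The result shape and values here are hypothetical:

```typescript
// Hypothetical CI quality gate: compare measured accuracy to a threshold.
function checkThreshold(accuracy: number, threshold: number): boolean {
  return accuracy >= threshold;
}

const accuracy = 0.92; // would come from the eval run's metrics
const THRESHOLD = 0.9;

if (!checkThreshold(accuracy, THRESHOLD)) {
  console.error(`Accuracy ${accuracy} is below threshold ${THRESHOLD}`);
  process.exit(1); // non-zero exit fails the CI job
}
console.log("Eval quality gate passed");
```

Wiring this into CI means a quality regression blocks the merge the same way a failing unit test would.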

