Evals System Architecture
The Evals system in MCPJam Inspector is a comprehensive testing framework designed to evaluate MCP (Model Context Protocol) server implementations. This guide provides a deep dive into the architecture, data flows, and key components to help you contribute effectively.
Overview
The Evals system allows developers to:
- Run automated tests against MCP servers to validate tool implementations
- Generate test cases using AI based on available server tools
- Track results in real-time with detailed metrics and analytics
- Compare expected vs actual behavior using agentic LLM loops
Key Features
- Multi-step wizard UI for test configuration
- Support for multiple LLM providers (OpenAI, Anthropic, DeepSeek, Ollama)
- Real-time result tracking via MCPJamBackend
- AI-powered test case generation
- Agentic execution with up to 20 conversation turns
- Token usage and performance metrics
Architecture Overview
The Evals system is composed of three main layers:
System Components
1. Client Layer (UI)
EvalRunner Component (client/src/components/evals/eval-runner.tsx)
The primary UI for configuring and launching evaluation runs.
Architecture: 4-Step Wizard
Step Details:
1. Select Servers: Choose from connected MCP servers
   - Filters: Only shows connected servers
   - Validation: At least one server required
2. Choose Model: Select LLM provider and model
   - Providers: OpenAI, Anthropic, DeepSeek, Ollama, MCPJam
   - Credential check: Validates API keys via hasToken()
3. Define Tests: Create or generate test cases
   - Manual entry: Title, query, expected tool calls, number of runs
   - AI generation: Click “Generate Tests” to create 6 test cases (2 easy, 2 medium, 2 hard)
4. Review & Run: Confirm and execute
   - Displays summary of configuration
   - POST to /api/mcp/evals/run
Results Components (client/src/components/evals/*)
Real-time display of evaluation results.
Component Hierarchy:
2. Server Layer (API)
Evals Routes (server/routes/mcp/evals.ts)
HTTP API endpoints for eval execution and test generation.
Endpoint: POST /api/mcp/evals/run
Request Schema:
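The exact schema is defined in server/routes/mcp/evals.ts; as a minimal sketch, the shape it validates looks roughly like the following, where the field names (`serverIds`, `llmConfig`, `tests`) are illustrative assumptions rather than the actual schema:

```typescript
import { z } from "zod";

// Illustrative only: field names are assumptions, not the actual schema.
const evalRunRequestSchema = z.object({
  serverIds: z.array(z.string()).min(1), // connected MCP servers to test against
  llmConfig: z.object({
    provider: z.enum(["openai", "anthropic", "deepseek", "ollama", "mcpjam"]),
    model: z.string(),
    apiKey: z.string().optional(), // omitted for backend (MCPJam) models
  }),
  tests: z.array(
    z.object({
      title: z.string(),
      query: z.string(),
      expectedToolCalls: z.array(z.string()),
      runs: z.number().int().min(1).default(1),
    })
  ),
});

type EvalRunRequest = z.infer<typeof evalRunRequestSchema>;
```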
Key helpers:
- resolveServerIdsOrThrow(): Case-insensitive server ID matching
- transformServerConfigsToEnvironment(): Converts server manager format to CLI format
- transformLLMConfigToLlmsConfig(): Routes API keys appropriately
Endpoint: POST /api/mcp/evals/generate-tests
Request Schema:
Test Generation Agent (server/services/eval-agent.ts)
Generates test cases using backend LLM.
Algorithm:
- Groups tools by server ID
- Creates system prompt with MCP agent instructions
- Creates user prompt with tool definitions and requirements
- Calls backend LLM (meta-llama/llama-3.3-70b-instruct)
- Parses JSON response
- Returns 6 test cases (2 easy, 2 medium, 2 hard)
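The flow above, roughly, in code — a conceptual sketch in which the function name, types, and prompt wording are illustrative rather than taken from eval-agent.ts:

```typescript
interface ToolDefinition {
  name: string;
  description?: string;
  serverId: string;
}

interface GeneratedTest {
  title: string;
  query: string;
  expectedToolCalls: string[];
  difficulty: "easy" | "medium" | "hard";
}

async function generateTests(
  tools: ToolDefinition[],
  callBackendLlm: (system: string, user: string) => Promise<string>
): Promise<GeneratedTest[]> {
  // 1. Group tools by server ID so the prompt reflects each server's surface area
  const byServer = new Map<string, ToolDefinition[]>();
  for (const tool of tools) {
    const group = byServer.get(tool.serverId) ?? [];
    group.push(tool);
    byServer.set(tool.serverId, group);
  }

  // 2-3. Build the system prompt (MCP agent instructions) and the user prompt
  //      (tool definitions plus the 2 easy / 2 medium / 2 hard requirement)
  const system = "You generate eval test cases for an MCP tool-calling agent. Respond with JSON only.";
  const user = [
    "Generate 6 test cases (2 easy, 2 medium, 2 hard) for these tools:",
    ...[...byServer.entries()].map(
      ([serverId, group]) => `${serverId}: ${group.map((t) => t.name).join(", ")}`
    ),
  ].join("\n");

  // 4-6. Call the backend model (meta-llama/llama-3.3-70b-instruct) and parse the JSON response
  const raw = await callBackendLlm(system, user);
  return JSON.parse(raw) as GeneratedTest[];
}
```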
3. CLI Layer (Execution Engine)
Runner (evals-cli/src/evals/runner.ts)
The core orchestrator that executes evaluation tests.
Entry Points:
- runEvalsWithApiKey(): CLI mode with API key authentication
- runEvalsWithAuth(): UI mode with Convex authentication
Execution Characteristics:
- Max 20 conversation turns to prevent infinite loops
- Token usage tracking (prompt + completion)
- Duration measurement
- Tool call recording
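A conceptual sketch of the agentic loop these points describe; the LlmClient and McpClient interfaces are placeholders, not the AI SDK or Mastra APIs the real runner uses:

```typescript
interface ToolCallRequest {
  name: string;
  args: unknown;
}

interface LlmTurn {
  text: string;
  toolCalls: ToolCallRequest[];
  promptTokens: number;
  completionTokens: number;
}

// Placeholder interfaces standing in for the real LLM and MCP clients.
interface LlmClient {
  complete(messages: { role: string; content: string }[]): Promise<LlmTurn>;
}
interface McpClient {
  callTool(name: string, args: unknown): Promise<string>;
}

const MAX_TURNS = 20; // hard cap to prevent infinite tool-calling loops

async function runIteration(query: string, llm: LlmClient, mcp: McpClient) {
  const startedAt = Date.now();
  const messages = [{ role: "user", content: query }];
  const actualToolCalls: string[] = [];
  const tokens = { prompt: 0, completion: 0 };

  for (let turn = 0; turn < MAX_TURNS; turn++) {
    const result = await llm.complete(messages);
    tokens.prompt += result.promptTokens;
    tokens.completion += result.completionTokens;
    messages.push({ role: "assistant", content: result.text });

    // No tool calls requested: the agent considers the task complete.
    if (result.toolCalls.length === 0) break;

    // Execute each requested tool against the MCP server and feed the result back.
    for (const call of result.toolCalls) {
      actualToolCalls.push(call.name);
      const output = await mcp.callTool(call.name, call.args);
      messages.push({ role: "tool", content: output });
    }
  }

  return { actualToolCalls, tokens, durationMs: Date.now() - startedAt };
}
```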
Evaluator (evals-cli/src/evals/evaluator.ts)
Compares expected vs actual tool calls to determine pass/fail status.
Logic:
- ✅ All expected tools must be called
- ⚠️ Additional unexpected tools are allowed (marked but don’t fail)
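That comparison boils down to a set difference; a minimal sketch (evaluator.ts may differ in detail):

```typescript
interface EvaluationResult {
  passed: boolean;
  missingTools: string[];    // expected but never called -> fails the test
  unexpectedTools: string[]; // called but not expected -> recorded, does not fail
}

function evaluateToolCalls(expected: string[], actual: string[]): EvaluationResult {
  const actualSet = new Set(actual);
  const expectedSet = new Set(expected);

  const missingTools = expected.filter((name) => !actualSet.has(name));
  const unexpectedTools = actual.filter((name) => !expectedSet.has(name));

  return { passed: missingTools.length === 0, missingTools, unexpectedTools };
}
```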
RunRecorder (evals-cli/src/db/tests.ts)
Database interface for persisting evaluation results.
Two Modes:
- API Key Mode (createRunRecorder): Uses CLI-based database client
- Auth Mode (createRunRecorderWithAuth): Uses Convex HTTP client
Data Models
Database Schema
TypeScript Interfaces
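The concrete schema lives in the Convex definitions and CLI types; as a rough orientation, the core shapes implied by the glossary at the end of this guide look something like the following (field names are illustrative):

```typescript
// Illustrative shapes only; the real interfaces and Convex schema may differ.
interface EvalSuite {
  id: string;
  model: string;
  serverIds: string[];
  testCases: TestCase[];
}

interface TestCase {
  id: string;
  title: string;
  query: string;
  expectedToolCalls: string[];
  runs: number; // how many iterations to execute for this test case
}

interface Iteration {
  testCaseId: string;
  status: "pending" | "running" | "passed" | "failed";
  actualToolCalls: string[];
  missingTools: string[];
  unexpectedTools: string[];
  tokens: { prompt: number; completion: number };
  durationMs: number;
}
```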
Integration Points
LLM Providers
The system supports multiple execution paths based on the selected model.
Provider Configuration:
MCP Server Integration
Connection Workflow:
Transport Support:
- STDIO: Command execution with stdin/stdout
- HTTP/SSE: Server-Sent Events
- Streamable HTTP: Custom streaming protocol
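As a rough illustration, a server environment covering a STDIO server and a remote HTTP server might look like this; the exact shape is defined by the CLI validators, so treat the field names as assumptions:

```typescript
// Illustrative only: field names are assumptions, not the validated CLI format.
const environment = {
  servers: {
    "local-files": {
      // STDIO: spawn a process and speak MCP over stdin/stdout
      command: "npx",
      args: ["-y", "@modelcontextprotocol/server-filesystem", "/tmp"],
    },
    "remote-api": {
      // HTTP/SSE or Streamable HTTP: connect to a hosted MCP endpoint
      url: "https://example.com/mcp",
      headers: { Authorization: "Bearer <token>" },
    },
  },
};
```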
MCPJam Backend
Database Actions:
Contributing Guide
Adding a New LLM Provider
- Update LLM config schema in evals-cli/src/utils/validators.ts (see the sketch after this list)
- Add provider case in evals-cli/src/evals/runner.ts
- Add to UI model list in client/src/hooks/use-chat.tsx
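A hedged sketch of the first two steps, assuming the schema uses a Zod provider enum and the runner switches on the provider; the real files may be organized differently, and `myprovider`/`createMyProviderModel` are placeholders:

```typescript
import { z } from "zod";

// 1. evals-cli/src/utils/validators.ts - extend the provider enum (assumed shape)
export const llmConfigSchema = z.object({
  provider: z.enum(["openai", "anthropic", "deepseek", "ollama", "mcpjam", "myprovider"]),
  model: z.string(),
  apiKey: z.string().optional(),
});
type LlmConfig = z.infer<typeof llmConfigSchema>;

// Stand-in for whatever model factory the new provider's SDK exposes.
declare function createMyProviderModel(model: string, apiKey?: string): unknown;

// 2. evals-cli/src/evals/runner.ts - route the new provider to its model factory
export function createModel(config: LlmConfig) {
  switch (config.provider) {
    case "myprovider":
      return createMyProviderModel(config.model, config.apiKey);
    // ...existing provider cases...
    default:
      throw new Error(`Unsupported provider: ${config.provider}`);
  }
}
```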
Adding a New MCP Transport
- Update server definition schema in evals-cli/src/utils/validators.ts
- Implement transport in Mastra MCP client (external library)
- Update config transformer in server/utils/eval-transformer.ts
Debugging Evals
Enable verbose logging:
Testing Changes
Run evals locally:
- Start development server: npm run dev
- Navigate to “Run evals” tab
- Configure and execute test
- Check browser console for errors
- View results in “Eval results” tab
Common Issues
Issue: Test cases are not created
- Check Convex auth token validity
- Verify CONVEX_URL and CONVEX_HTTP_URL environment variables
- Inspect browser network tab for failed requests
- Verify server connection status in ClientManager
- Check tool definitions in listTools() response
- Ensure tool names match exactly (case-sensitive)
- Confirm /streaming endpoint is accessible
- Check Convex auth token in request headers
- Verify model ID format (@mcpjam/...)
Performance Considerations
Optimization Strategies
- Parallel Execution: Run multiple test cases concurrently (see the sketch after this list)
- Tool Batching: Execute independent tools in parallel
- Database Batching: Batch iteration updates
- Caching: Cache tool definitions between iterations
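For example, parallel test-case execution could be approached with a simple worker pool; a sketch with a hypothetical runTestCase entry point:

```typescript
// Run items through `run` with at most `limit` in flight at once.
async function runConcurrently<T>(
  items: T[],
  limit: number,
  run: (item: T) => Promise<void>
): Promise<void> {
  const queue = [...items];
  const workers = Array.from({ length: limit }, async () => {
    for (let next = queue.shift(); next !== undefined; next = queue.shift()) {
      await run(next);
    }
  });
  await Promise.all(workers);
}

// Usage (runTestCase is a hypothetical per-test entry point):
// await runConcurrently(testCases, 4, (test) => runTestCase(test));
```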
Metrics
Key performance indicators:
- Average iteration duration: Time from start to finish
- Token usage per iteration: Prompt + completion tokens
- Tool execution time: Time spent in MCP calls
- Database write time: Time to persist results
- LLM response time: Time for each model call
These metrics are computed by the aggregation functions in helpers.ts.
Security Considerations
API Key Management
- Never commit API keys to version control
- Store keys in localStorage (client) or environment variables (CLI)
- Use Convex auth tokens for backend models (no API key exposure)
Input Validation
All inputs are validated with Zod schemas.
Error Handling
- Never expose internal errors to the client
- Sanitize error messages before logging
- Catch all exceptions in async functions
- Validate all external inputs (LLM responses, tool results)
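A sketch of how a route handler might apply these rules; the schema, runner, and sanitizer references are placeholders for the real implementations:

```typescript
import { z } from "zod";

// Placeholders standing in for the real request schema, runner, and log sanitizer.
declare const evalRunRequestSchema: z.ZodTypeAny;
declare function runEvals(input: unknown): Promise<object>;
declare function sanitizeError(error: unknown): string; // e.g. strip keys and paths before logging

async function handleRunRequest(body: unknown): Promise<{ status: number; body: object }> {
  try {
    const parsed = evalRunRequestSchema.parse(body); // validate external input with Zod
    const result = await runEvals(parsed);
    return { status: 200, body: result };
  } catch (error) {
    console.error("[evals] run failed:", sanitizeError(error)); // sanitize before logging
    return { status: 500, body: { error: "Eval run failed" } }; // never expose internals to the client
  }
}
```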
Future Enhancements
Potential areas for contribution:
- Parallel Test Execution: Run multiple test cases simultaneously
- Custom Evaluators: Support for user-defined pass/fail criteria
- Retry Logic: Automatic retry on transient failures
- Result Comparison: Compare results across different models
- Historical Analysis: Trend analysis of eval performance over time
- Export Results: Download results as CSV/JSON
- Shareable Suites: Share test configurations with team members
- Scheduling: Run evals on a schedule (cron-like)
Glossary
| Term | Definition |
| --- | --- |
| Eval Suite | A collection of test cases executed together |
| Test Case | A single test with a query and expected tool calls |
| Iteration | One execution of a test case (test cases can have multiple runs) |
| Agentic Loop | Iterative LLM conversation with tool calling |
| Tool Call | Invocation of an MCP server tool by the LLM |
| Expected Tools | Tools that should be called for a test to pass |
| Actual Tools | Tools that were actually called during execution |
| Missing Tools | Expected tools that were not called (causes failure) |
| Unexpected Tools | Tools called but not expected (logged, doesn’t fail) |
| RunRecorder | Interface for persisting eval results to database |
| MCPClient | Mastra client for communicating with MCP servers |
Resources
- MCP Specification: https://spec.modelcontextprotocol.io
- Mastra MCP Client: https://mastra.dev
- Convex Database: https://convex.dev
- Vercel AI SDK: https://sdk.vercel.ai
Questions?
If you have questions or need help contributing:
- Check the GitHub Issues
- Join our Discord community
- Read the main Contributing Guide