Your users connect to your MCP server from different clients, such as Claude Desktop and Cursor, and with different LLMs. MCP evals ensure that your MCP server works across all of these environments. Evals help you:
  • Discover workflows that break your server and get actionable steps for resolving them.
  • Benchmark your server’s performance and catch regressions in future changes.
  • Programmatically test queries against an MCP server with a single command. No more manual QA, one query at a time.

E2E testing (beta)

We built a CLI that performs MCP evals and end-to-end (E2E) testing. The CLI creates a simulated end-user environment and tests popular user flows. An example E2E test for the PayPal MCP server:
  1. Connect the PayPal MCP server to the testing agent. To simulate Claude Desktop, we can configure the agent to use a Claude model with a default system prompt.
  2. Give the agent a typical user query, such as “Create a refund for order ID 412”.
  3. Let the testing agent run the query.
  4. Check the testing agent’s trace to confirm that it called the create_refund tool and successfully created a refund.
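
The trace check in the last step can be sketched in a few lines of Python. This is only an illustration of the idea, not the CLI's actual implementation or output format: the ToolCall record, its fields, and the tool_succeeded helper are hypothetical names, and the trace is hand-written rather than captured from a real run. The is_error flag is assumed to reflect whether the tool result reported an error.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolCall:
    """One tool invocation recorded in the agent's trace (hypothetical shape)."""
    name: str
    arguments: dict[str, Any] = field(default_factory=dict)
    is_error: bool = False  # assumed: True when the tool result reported an error


def tool_succeeded(trace: list[ToolCall], tool_name: str) -> bool:
    """Return True if the trace has at least one non-error call to tool_name."""
    return any(call.name == tool_name and not call.is_error for call in trace)


# Hand-written trace for the query "Create a refund for order ID 412".
example_trace = [
    ToolCall(name="create_refund", arguments={"order_id": "412"}),
]

# The test passes only if create_refund was called and did not fail.
passed = tool_succeeded(example_trace, "create_refund")
print("PASS" if passed else "FAIL")
```

A check like this separates two failure modes: the agent never selecting the tool (often a tool description problem) and the tool being called but returning an error (often a server-side problem).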