Test Cases provide a comprehensive testing environment for evaluating your MCP server’s tool performance across different LLMs. Create test cases, generate tests automatically with AI, run evaluations, and analyze results with detailed metrics and visualizations.

Key Features

The Test Cases feature provides everything you need to evaluate MCP server reliability:
  • AI-Powered Test Generation - Automatically generate comprehensive test cases from your tool definitions
  • Negative Test Cases - Test edge cases where tools should NOT be triggered
  • Multi-Model Evaluation - Run tests across different LLM providers and models
  • Accuracy Metrics - View test statistics, see how results change across runs, and compare performance between different LLMs
  • Detailed Run Analysis - View test duration, token consumption, and model performance breakdowns
  • Batch Operations - Run entire test cases or individual tests with a single click

Getting Started

To start testing your MCP server:
  1. Connect your MCP server - Use the Servers tab to connect to your MCP server (a minimal example server to test against is sketched below)
  2. Navigate to Test Cases - Click the Test Cases tab in MCPJam Inspector
  3. Create tests - Either:
    • Click the plus icon to manually create a test case and configure the scenario, user prompt, expected tools, and expected output
    • Click the magic wand icon to auto-generate tests from your tools
  4. Configure models - Click Results & Runs in the sidebar, then use the Models dropdown to select which models to test against
  5. Run Tests - Click Run to execute all test cases, or run individual tests from the test case view
When using auto-generate, Claude Haiku creates realistic test scenarios from your tool definitions, including both positive cases (tools that should be called) and negative cases (tools that should NOT be called).
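If you don't yet have a server to point the Inspector at, a minimal stdio server is enough to exercise this workflow. The sketch below uses the official MCP TypeScript SDK; the get_weather tool and its canned response are illustrative placeholders, not part of MCPJam.

```typescript
// Minimal sketch of an MCP server to test against, using the official
// TypeScript SDK. The get_weather tool is a hypothetical example.
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

const server = new McpServer({ name: "weather-demo", version: "1.0.0" });

server.tool(
  "get_weather",
  "Get the current weather for a city",
  { city: z.string().describe("City name, e.g. Tokyo") },
  async ({ city }) => ({
    // Canned response for demonstration; a real server would call a weather API.
    content: [{ type: "text", text: `Weather in ${city}: 22°C, clear skies` }],
  })
);

// Expose the server over stdio so the Inspector can launch and connect to it.
await server.connect(new StdioServerTransport());
```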

Test Case Structure

Each test case contains:
  • Scenario - A description of the use case to test
  • User Prompt - The exact prompt or interaction that starts the test
  • Tool Triggered - Which tool(s) should be called
  • Expected Output - The output or behavior expected back from the MCP server
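Conceptually, each test case is a small record. The TypeScript shape below is only a mental model; the field names are illustrative, not MCPJam's internal schema.

```typescript
// Hypothetical shape of a test case; names are illustrative only.
interface TestCase {
  scenario: string;        // description of the use case under test
  prompt: string;          // the exact user prompt that starts the test
  expectedTools: string[]; // tools the model should call (empty for negative tests)
  expectedOutput: string;  // output or behavior expected back from the MCP server
}
```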

Positive vs Negative Tests

Positive Tests verify that your tools are correctly triggered when they should be. Examples include:
  • Single tool usage (“Get me the weather in Tokyo”)
  • Multiple tools in one request (“Find flights to Paris and check the weather there”)
Negative Tests verify that tools are NOT called when they shouldn’t be. Examples include:
  • Meta questions about tools (“What parameters does search accept?”)
  • Similar keywords without action intent (“I was reading about file systems”)
  • Ambiguous or conversational prompts
Negative tests are marked with a “NEG” badge in the sidebar for easy identification.
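Reusing the hypothetical TestCase shape above, a matching positive/negative pair might look like this (get_weather is the same illustrative tool as before):

```typescript
// Illustrative examples only.
const positive: TestCase = {
  scenario: "User explicitly asks for the current weather in a named city",
  prompt: "Get me the weather in Tokyo",
  expectedTools: ["get_weather"],
  expectedOutput: "Current conditions for Tokyo",
};

const negative: TestCase = {
  scenario: "User asks about the tool itself rather than for weather data",
  prompt: "What parameters does get_weather accept?",
  expectedTools: [], // no tool should be triggered
  expectedOutput: "An explanation of the tool's parameters, with no tool call",
};
```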

Managing Test Cases

Use the sidebar to manage your test cases:
  • Create new tests - Click the plus icon to add a test case manually
  • Generate tests with AI - Click the magic wand to auto-generate tests from your tools
  • Duplicate tests - Use the dropdown menu on any test to create a copy
  • Delete tests - Remove tests you no longer need

Running Tests

You can run tests in two ways:
  • Run a single test - Choose which of your configured models to use and click Run
  • Run all tests - Click Results & Runs, configure your models, then click the Run button
Running tests requires connected MCP servers. If the Run button is disabled, check that your servers are connected in the Servers tab.

Analyzing Results

Results & Runs View

Click Results & Runs in the sidebar to see overall analytics:
  • Accuracy Donut - Overall accuracy percentage (see the sketch at the end of this section)
  • Accuracy Chart - Pass rates across runs, with a line connecting successive runs
  • Performance by Model - Bar chart comparing pass rates across models
Use the dropdown to switch between “Runs” and “Test Cases” views.
Runs view:
  • Run History - Shows all your runs with their metrics (Run ID, Start time, Duration, Passed, Failed, Accuracy, Tokens)
Test Cases view:
  • Test Cases Table - List of all tests with Test Case Name, Iterations, Avg Accuracy, Avg Duration
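The Accuracy columns and the donut are shown as percentages. A plausible reading, consistent with the Passed/Failed columns above, is passed iterations divided by total iterations; the sketch below illustrates that assumption and is not MCPJam's actual implementation.

```typescript
// Assumed metric: accuracy = passed / (passed + failed), as a percentage.
interface RunSummary {
  passed: number;
  failed: number;
}

function accuracyPercent(run: RunSummary): number {
  const total = run.passed + run.failed;
  return total === 0 ? 0 : (run.passed / total) * 100;
}

// Example: 7 passed, 1 failed -> 87.5
console.log(accuracyPercent({ passed: 7, failed: 1 }));
```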

Run Detail View

When you click on a run:
  • Metrics Summary - Accuracy, Passed, Failed, Total, Duration
  • All Iterations Table - List of all test executions with:
    • Test name
    • Model used
    • Tools called
    • Tokens consumed
    • Duration
  • Run Summary Sidebar - Click “View run summary” to see:
    • Duration per Test (averaged across models)
    • Tokens per Test (averaged across models)
    • Performance by Model

Test Case Detail View

When you click on a test case:
  • Performance Across Runs - Line chart showing how this test performs over time
  • Performance by Model - Bar chart comparing pass rates across models
  • Iterations List - All executions of this test with:
    • Pass/fail status
    • Model used
    • Tool calls, tokens, duration
    • Run ID
  • Expanded Details - Click an iteration to see:
    • Expected vs Actual tool calls
    • Full trace showing the conversation, model reasoning, and tool execution details

Debugging Tests

Use the visual status indicators to quickly identify issues:
  • Green border - Test passed (expected tools were called)
  • Red border - Test failed (tools were not called as expected)
  • Yellow border - Test pending
  • Gray border - Test cancelled
For failed tests, expand the iteration to view:
  • Expected vs Actual tool calls (see what’s missing or unexpected)
  • Full conversation trace (understand why the model made different decisions)
If tests are consistently failing, check that your expected tool calls match your tool definitions. The LLM may be calling tools with slightly different argument formats.
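At its core, the expected-vs-actual comparison is a set difference over tool names. The sketch below is a minimal illustration of that idea, not MCPJam's matching logic; note that it ignores argument formats, which (as mentioned above) can also cause mismatches.

```typescript
// Minimal sketch: which expected tools were never called, and which calls
// were unexpected. Compares names only; argument formats are not checked here.
function diffToolCalls(expected: string[], actual: string[]) {
  const missing = expected.filter((t) => !actual.includes(t));
  const unexpected = actual.filter((t) => !expected.includes(t));
  return { missing, unexpected, passed: missing.length === 0 && unexpected.length === 0 };
}

// Example: the model called search_web instead of the expected get_weather.
console.log(diffToolCalls(["get_weather"], ["search_web"]));
// -> { missing: ["get_weather"], unexpected: ["search_web"], passed: false }
```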

Best Practices

Writing Effective Test Cases

  • Be specific - Include concrete values (dates, IDs, names) in prompts, as in the example after this list
  • Test realistic scenarios - Write prompts as real users would
  • Cover edge cases - Include negative tests for boundary conditions
  • Review the trace - When tests fail, view the full conversation to understand LLM reasoning
  • Iterate before submission - Ensure your test cases pass locally before submitting to app stores
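To illustrate the "be specific" guideline, compare a vague test case with one that pins concrete values (both reuse the hypothetical TestCase shape from earlier):

```typescript
// Illustrative only, reusing the hypothetical TestCase shape from earlier.
const tooVague: TestCase = {
  scenario: "User vaguely mentions weather",
  prompt: "Check the weather somewhere",
  expectedTools: ["get_weather"], // ambiguous: which city? should a tool fire at all?
  expectedOutput: "Unclear",
};

const specific: TestCase = {
  scenario: "User asks for the weather in a named city on a specific date",
  prompt: "Get me the weather in Tokyo on 2025-03-14",
  expectedTools: ["get_weather"],
  expectedOutput: "A forecast for Tokyo on 2025-03-14",
};
```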

For ChatGPT App Submission

If you’re building a ChatGPT app, OpenAI requires 5 positive test cases and 3 negative test cases. Use the AI generation feature to quickly create these, then customize as needed.
App store approval cycles can be lengthy. A single test case failure can cost you days of waiting - test thoroughly before submitting.