How I Test My MCP Agent Without Burning Tokens

**MyrinNew** · 04-15-2026, 10:00 PM

Last month I shipped an MCP agent that triages GitHub issues. It works great — until it silently breaks and nobody notices.

Here are the last three bugs I hit:

I tweaked the system prompt. The agent stopped calling create_issue and just summarised the bug report in plain text. CI didn't catch it — CI tests the code, not the agent behavior.
I swapped Sonnet for Haiku to save cost. The agent started calling list_issues four times before each create_issue. Integration tests still passed. Token bill tripled.
GitHub rate-limited me mid-test. The entire pytest suite went red. I rolled back a perfectly good change because I couldn't tell flake from regression.

Every one of those would have been caught by a tool that tests the agent's trajectory — which tools it picks, in what order, with what arguments — against a fast, hermetic mock.

That tool is mcptest. Here's how I use it.

The Scenario

I have an agent that reads a bug report and decides whether to:

Open a new issue if the bug is novel
Comment on an existing issue if a similar one is already filed
Do nothing if the report is spam or unclear

The agent is about thirty lines of Python wrapping an LLM call and the MCP client:

# agent.py
import asyncio, json, os, sys
from mcp import ClientSession
from mcp.client.stdio import StdioServerParameters, stdio_client

SYSTEM_PROMPT = """You are an issue triage assistant for acme/api.
Before filing a new issue, ALWAYS check list_issues for duplicates.
If a duplicate exists, call add_comment instead."""

async def main():
    fx = json.loads(os.environ["MCPTEST_FIXTURES"])[0]
    params = StdioServerParameters(
        command=sys.executable,
        args=["-m", "mcptest.mock_server", fx],
        env=os.environ.copy(),
    )
    user = sys.stdin.read()
    async with stdio_client(params) as (r, w):
        async with ClientSession(r, w) as session:
            await session.initialize()
            tools = await session.list_tools()
            # Your real agent calls an LLM here with SYSTEM_PROMPT + tools
            # and executes whatever tool_calls it returns
            plan = your_llm_plan(SYSTEM_PROMPT, user, tools)
            for call in plan:
                await session.call_tool(call.name, arguments=call.args)

asyncio.run(main())

That's it. The rest of this post is about testing what this agent does — which tools it picks, in what order — without ever calling a real LLM.
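While wiring things up, one way to keep the agent runnable without an LLM is to swap `your_llm_plan` for a deterministic stand-in. The sketch below is purely illustrative: the `ToolCall` shape, the keyword rules, and the name `fake_llm_plan` are my assumptions, not part of mcptest or the MCP SDK.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """Hypothetical shape for one planned tool call: a name plus arguments."""
    name: str
    args: dict = field(default_factory=dict)


def fake_llm_plan(system_prompt, user, tools=None):
    """Deterministic stand-in for an LLM planner, driven by keyword rules."""
    text = user.lower()
    repo = "acme/api"
    if text.startswith("spam"):
        return []  # spam or unclear report: do nothing
    if "dark mode" in text:
        # Issue number 12 mirrors the canned list_issues response in the fixture.
        return [
            ToolCall("list_issues", {"repo": repo, "query": "dark mode"}),
            ToolCall("add_comment", {"repo": repo, "number": 12, "body": user}),
        ]
    # Anything else is treated as a novel bug: duplicate check, then file it.
    return [
        ToolCall("list_issues", {"repo": repo, "query": user}),
        ToolCall("create_issue", {"repo": repo, "title": user, "body": user}),
    ]
```

A static plan like this cannot react to what `list_issues` actually returns, so it is only good for checking the plumbing between the agent and the mock server.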

I want to verify four things:

For a clear bug report, it calls create_issue exactly once.
It checks list_issues before create_issue (duplicate check).
When the server rate-limits it, it recovers gracefully.
It never calls delete_issue. Ever.
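To make those checks concrete, here is a rough Python sketch of what three of them mean when the trajectory is treated as an ordered list of (tool name, arguments) pairs. The helper names are hypothetical and this is not how mcptest implements its assertions; it only illustrates the semantics I have in mind.

```python
from collections import Counter

# A trajectory is modelled here as the ordered list of (tool_name, arguments) pairs.
trajectory = [
    ("list_issues", {"repo": "acme/api", "query": "login 500"}),
    ("create_issue", {"repo": "acme/api", "title": "Login page returns 500"}),
]


def call_count(traj, tool):
    """Number of times `tool` appears in the trajectory."""
    return Counter(name for name, _ in traj)[tool]


def called_in_order(traj, *tools):
    """True if `tools` appear as a subsequence of the trajectory's tool names."""
    names = iter(name for name, _ in traj)
    # `tool in names` advances the iterator, so each match must come after the previous one.
    return all(tool in names for tool in tools)


def never_called(traj, tool):
    """True if `tool` does not appear anywhere in the trajectory."""
    return call_count(traj, tool) == 0


assert call_count(trajectory, "create_issue") == 1
assert called_in_order(trajectory, "list_issues", "create_issue")
assert never_called(trajectory, "delete_issue")
```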

Step 1 — Mock the GitHub MCP Server in YAML

No code required. Declare the tools, canned responses, and error scenarios:

# fixtures/github.yaml
server:
  name: mock-github
  version: "1.0"

tools:
  - name: list_issues
    input_schema:
      type: object
      properties:
        repo: { type: string }
        query: { type: string }
      required: [repo]
    responses:
      - match: { query: "login 500" }
        return:
          issues: []  # No duplicate → should open a new one
      - match: { query: "dark mode" }
        return:
          issues: [{ number: 12, title: "Add dark mode" }]
      - default: true
        return: { issues: [] }

  - name: create_issue
    input_schema:
      type: object
      properties:
        repo: { type: string }
        title: { type: string }
        body: { type: string }
      required: [repo, title]
    responses:
      - match: { repo: "acme/api" }
        return: { number: 42, url: "https://github.com/acme/api/issues/42" }
      - default: true
        error: rate_limited  # Simulate real-world flake

  - name: add_comment
    input_schema:
      type: object
      properties:
        repo: { type: string }
        number: { type: integer }
        body: { type: string }
      required: [repo, number, body]
    responses:
      - default: true
        return: { ok: true }

  - name: delete_issue
    responses:
      - default: true
        return: { deleted: true }

errors:
  - name: rate_limited
    error_code: -32000
    message: "GitHub API rate limit exceeded"

That's a real MCP server. It speaks MCP over stdio, just like the real GitHub server. Your agent connects to it the same way.
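If the `responses` lists look opaque, this is roughly how I read them: rules are tried top to bottom, the first one that matches wins, and a rule carrying `error` fails the call instead of returning data. The sketch below encodes that reading in Python. The exact-equality matching and the `resolve` / `ToolError` names are my assumptions, so check the mcptest docs for the real matching rules.

```python
class ToolError(Exception):
    """Raised when the selected rule names an error instead of a return value."""


def _matches(rule, arguments):
    """Assumed semantics: a default rule always matches; otherwise every key
    listed under `match` must equal the corresponding call argument."""
    if rule.get("default"):
        return True
    wanted = rule.get("match")
    return wanted is not None and all(arguments.get(k) == v for k, v in wanted.items())


def resolve(rules, arguments):
    """Return the canned response of the first matching rule, or raise ToolError."""
    for rule in rules:
        if _matches(rule, arguments):
            if "error" in rule:
                raise ToolError(rule["error"])
            return rule["return"]
    raise ToolError("no matching response")


# The create_issue rules from the fixture, written as plain dicts.
create_issue_rules = [
    {"match": {"repo": "acme/api"},
     "return": {"number": 42, "url": "https://github.com/acme/api/issues/42"}},
    {"default": True, "error": "rate_limited"},
]
```

Under this reading, a `create_issue` call for `acme/api` gets issue 42 back, and a call for any other repo falls through to the default rule and fails with `rate_limited`.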

Step 2 — Write the Tests

# tests/test_triage.yaml
name: triage agent
fixtures:
  - ../fixtures/github.yaml
agent:
  command: python agent.py
cases:

  - name: opens a new issue for a novel bug
    input: "login page returns 500 on Safari"
    assertions:
      - tool_called: create_issue
      - tool_call_count: { tool: create_issue, count: 1 }
      - param_matches:
          tool: create_issue
          param: repo
          value: "acme/api"
      - no_errors: true

  - name: checks duplicates before creating
    input: "login page returns 500 on Safari"
    assertions:
      - tool_order:
          - list_issues
          - create_issue

  - name: comments on an existing duplicate instead of creating
    input: "add dark mode support"
    assertions:
      - tool_called: add_comment
      - tool_not_called: create_issue

  - name: never deletes anything
    input: "spam: buy crypto now!!!"
    assertions:
      - tool_not_called: delete_issue

  - name: recovers from rate-limit gracefully
    input: "file bug for org: wrongorg/wrongrepo"
    assertions:
      - tool_called: create_issue
      - error_handled: "rate limit"

Five test cases. Every one maps directly to a bug I've actually hit in production.
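The rate-limit case can only pass if the agent deals with the error instead of crashing. One common way to do that is a small retry wrapper with exponential backoff, sketched below. `RateLimited` and `call_with_retry` are hypothetical names, and I am assuming the client surfaces the rate limit as an exception. The post's fixture doesn't say how the error reaches the agent, so adapt the `except` clause to whatever your MCP client actually raises or returns.

```python
import asyncio


class RateLimited(Exception):
    """Hypothetical stand-in for however your MCP client surfaces a rate-limit error."""


async def call_with_retry(call, *, attempts=3, base_delay=0.01):
    """Await `call()` up to `attempts` times, doubling the delay after each rate-limit error.

    Returns (result, None) on success, or (None, last_error) once attempts are
    exhausted, so the caller can report the failure instead of crashing.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return await call(), None
        except RateLimited as exc:
            last_error = exc
            if attempt < attempts - 1:
                await asyncio.sleep(base_delay * 2 ** attempt)
    return None, last_error


async def _demo():
    calls = 0

    async def flaky():
        # Fails twice with a rate limit, then succeeds.
        nonlocal calls
        calls += 1
        if calls < 3:
            raise RateLimited("GitHub API rate limit exceeded")
        return {"number": 42}

    outcome = await call_with_retry(flaky)
    return outcome, calls


(result, error), calls = asyncio.run(_demo())
```

Returning the error rather than re-raising it is a design choice: it lets the agent tell the user "I could not file this, GitHub is rate-limiting me" and exit cleanly.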

Step 3 — Run It

$ mcptest run

mcptest results
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓
┃ Suite        ┃ Case                                ┃ Status ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩
│ triage agent │ opens a new issue for a novel bug   │ PASS   │
│ triage agent │ checks duplicates before creating   │ PASS   │
│ triage agent │ comments on existing duplicate      │ PASS   │
│ triage agent │ never deletes anything              │ PASS   │
│ triage agent │ recovers from rate-limit gracefully │ PASS   │
└──────────────┴─────────────────────────────────────┴────────┘

5 passed, 0 failed (5 total)
⏱ 1.6s

1.6 seconds. Zero tokens. No GitHub API calls. No rate-limit flake.

Step 4 — The Regression-Diff Trick

This is where the tool earns its keep.

First, snapshot the current (known-good) agent trajectories as baselines:

$ mcptest snapshot
✓ saved baseline for triage agent::opens a new issue... (2 tool calls)
✓ saved baseline for triage agent::checks duplicates... (2 tool calls)
✓ saved baseline for triage agent::comments on duplicate (2 tool calls)
...

Now go tweak the system prompt. Something innocuous — change this:

"You are a GitHub issue triage assistant. Check for duplicates before filing."

To this:

"You are a helpful assistant that handles GitHub issue reports."

No errors in the code. All unit tests still pass. Linter is happy.

$ mcptest diff --ci

✗ triage agent::checks duplicates before creating
tool_order REGRESSION:
baseline: list_issues → create_issue
current: create_issue ← list_issues was dropped!

✗ triage agent::comments on existing duplicate
tool_called REGRESSION:
baseline: add_comment was called
current: add_comment was never called

2 regression(s) across 5 case(s)
Exit: 1

Exit code 1 — CI blocks the merge. The agent silently lost its duplicate-check behavior because of a one-sentence prompt change.

This is the bug that cost me a weekend. I never want to hit it again.
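The idea behind the diff is simple enough to sketch in a few lines of Python: keep the baseline's ordered tool names, compare them with the current run, and turn any difference into a non-zero exit code. This is only my illustration of the concept, with a hypothetical `diff_trajectory` helper; it says nothing about how mcptest stores baselines or computes its report.

```python
def diff_trajectory(baseline, current):
    """Compare two ordered lists of tool names and describe what changed."""
    findings = []
    # dict.fromkeys de-duplicates while preserving first-seen order.
    dropped = [t for t in dict.fromkeys(baseline) if t not in current]
    added = [t for t in dict.fromkeys(current) if t not in baseline]
    if dropped:
        findings.append("dropped: " + ", ".join(dropped))
    if added:
        findings.append("added: " + ", ".join(added))
    if not dropped and not added and baseline != current:
        findings.append("order or call count changed")
    return findings


baseline = ["list_issues", "create_issue"]
current = ["create_issue"]  # the duplicate check silently disappeared

regressions = diff_trajectory(baseline, current)
exit_code = 1 if regressions else 0  # a non-zero exit is what lets CI block the merge
print(regressions, exit_code)
```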

Step 5 — Wire It into CI

# .github/workflows/agent-tests.yml
name: Agent tests
on: [pull_request]

jobs:
mcptest:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with: { python-version: "3.11" }

- run: pip install mcp-agent-test

- name: Run tests
run: mcptest run

- name: Diff against baselines
run: mcptest diff --ci

- name: Post PR summary
if: always()
run: mcptest github-comment
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

Now every PR that changes the prompt, the model, or the agent code gets its trajectories diffed against main. If behavior shifts, a comment lands on the PR with the exact tool-order delta. Reviewers see that the agent's behavior changed as clearly as they see that the code changed.

Why MCP-Specific Matters

Eval tooling is consolidating fast — independent evaluation startups keep getting folded into model-provider platforms. That's useful if you're all-in on one vendor. It's a lock-in risk if you're not.

mcptest is independent, MIT-licensed, and specifically shaped for MCP agents. The tool_called / tool_order / error_handled primitives exist because that's what an MCP trajectory actually looks like — not because someone ported a generic LLM-eval DSL.

MCP agent testing is particularly underserved. DeepEval is great for prompt evaluation. Inspect AI is great for benchmarks. Neither gives you "run your agent against a mock GitHub server and assert it didn't call delete_issue."

mcptest does.

Try It

pip install mcp-agent-test

# Scaffold a new project
mcptest init

# Or clone the quickstart
git clone https://github.com/josephgec/mcptest
cd mcptest/examples/quickstart
mcptest run

# Or install a pre-built fixture pack
mcptest install-pack github ./my-project
mcptest install-pack slack ./my-project
mcptest install-pack filesystem ./my-project

Six packs ship out of the box: GitHub, Slack, filesystem, database, HTTP, and git. Each one is a realistic mock with error scenarios baked in and tests that actually assert something.

📦 Source: github.com/josephgec/mcptest

📦 PyPI: pypi.org/project/mcp-agent-test

If you're building an MCP agent and haven't started writing tests yet, you're accumulating the same three bugs I accumulated. Start with one fixture and one test case. Catch the first prompt-change regression. Then you'll understand why MCP agents need this.

If this was useful, a ⭐ on the repo helps others find it.
