How I Test My MCP Agent Without Burning Tokens

Collapse
X
 
  • Time
  • Show
Clear All
new posts
  • MyrinNew
    Senior Member
    • Feb 2024
    • 5175

    #1

    How I Test My MCP Agent Without Burning Tokens

    Last month I shipped an MCP agent that triages GitHub issues. It works great — until it silently breaks and nobody notices.


    Here are the last three bugs I hit:

    1. I tweaked the system prompt. The agent stopped calling create_issue and just summarised the bug report in plain text. CI didn't catch it — CI tests the code, not the agent behavior.
    2. I swapped Sonnet for Haiku to save cost. The agent started calling list_issues four times before each create_issue. Integration tests still passed. Token bill tripled.
    3. GitHub rate-limited me mid-test. The entire pytest suite went red. I rolled back a perfectly good change because I couldn't tell flake from regression.


    Every one of those would have been caught by a tool that tests the agent's trajectory — which tools it picks, in what order, with what arguments — against a fast, hermetic mock.


    That tool is mcptest. Here's how I use it.





    The Scenario

    I have an agent that reads a bug report and decides whether to:
    • Open a new issue if the bug is novel
    • Comment on an existing issue if a similar one is already filed
    • Do nothing if the report is spam or unclear


    The agent is about thirty lines of Python wrapping an LLM call and the MCP client:






    # agent.py
    import asyncio, json, os, sys
    from mcp import ClientSession
    from mcp.client.stdio import StdioServerParameters, stdio_client

    SYSTEM_PROMPT = """You are an issue triage assistant for acme/api.
    Before filing a new issue, ALWAYS check list_issues for duplicates.
    If a duplicate exists, call add_comment instead."""

    async def main():
    fx = json.loads(os.environ["MCPTEST_FIXTURES"])[0]
    params = StdioServerParameters(
    command=sys.executable,
    args=["-m", "mcptest.mock_server", fx],
    env=os.environ.copy(),
    )
    user = sys.stdin.read()
    async with stdio_client(params) as (r, w):
    async with ClientSession(r, w) as session:
    await session.initialize()
    tools = await session.list_tools()
    # Your real agent calls an LLM here with SYSTEM_PROMPT + tools
    # and executes whatever tool_calls it returns
    plan = your_llm_plan(SYSTEM_PROMPT, user, tools)
    for call in plan:
    await session.call_tool(call.name, arguments=call.args)

    asyncio.run(main())







    That's it. The rest of this post is about testing what this agent does — which tools it picks, in what order — without ever calling a real LLM.


    I want to verify four things:

    1. For a clear bug report, it calls create_issue exactly once.
    2. It checks list_issues before create_issue (duplicate check).
    3. When the server rate-limits it, it recovers gracefully.
    4. It never calls delete_issue. Ever.





    Step 1 — Mock the GitHub MCP Server in YAML

    No code required. Declare the tools, canned responses, and error scenarios:






    # fixtures/github.yaml
    server:
    name: mock-github
    version: "1.0"

    tools:
    - name: list_issues
    input_schema:
    type: object
    properties:
    repo: { type: string }
    query: { type: string }
    required: [repo]
    responses:
    - match: { query: "login 500" }
    return:
    issues: [] # No duplicate → should open a new one
    - match: { query: "dark mode" }
    return:
    issues: [{ number: 12, title: "Add dark mode" }]
    - default: true
    return: { issues: [] }

    - name: create_issue
    input_schema:
    type: object
    properties:
    repo: { type: string }
    title: { type: string }
    body: { type: string }
    required: [repo, title]
    responses:
    - match: { repo: "acme/api" }
    return: { number: 42, url: "https://github.com/acme/api/issues/42" }
    - default: true
    error: rate_limited # Simulate real-world flake

    - name: add_comment
    input_schema:
    type: object
    properties:
    repo: { type: string }
    number: { type: integer }
    body: { type: string }
    required: [repo, number, body]
    responses:
    - default: true
    return: { ok: true }

    - name: delete_issue
    responses:
    - default: true
    return: { deleted: true }

    errors:
    - name: rate_limited
    error_code: -32000
    message: "GitHub API rate limit exceeded"







    That's a real MCP server. It speaks MCP over stdio, just like the real GitHub server. Your agent connects to it the same way.





    Step 2 — Write the Tests





    # tests/test_triage.yaml
    name: triage agent
    fixtures:
    - ../fixtures/github.yaml
    agent:
    command: python agent.py
    cases:

    - name: opens a new issue for a novel bug
    input: "login page returns 500 on Safari"
    assertions:
    - tool_called: create_issue
    - tool_call_count: { tool: create_issue, count: 1 }
    - param_matches:
    tool: create_issue
    param: repo
    value: "acme/api"
    - no_errors: true

    - name: checks duplicates before creating
    input: "login page returns 500 on Safari"
    assertions:
    - tool_order:
    - list_issues
    - create_issue

    - name: comments on an existing duplicate instead of creating
    input: "add dark mode support"
    assertions:
    - tool_called: add_comment
    - tool_not_called: create_issue

    - name: never deletes anything
    input: "spam: buy crypto now!!!"
    assertions:
    - tool_not_called: delete_issue

    - name: recovers from rate-limit gracefully
    input: "file bug for org: wrongorg/wrongrepo"
    assertions:
    - tool_called: create_issue
    - error_handled: "rate limit"







    Five test cases. Every one maps directly to a bug I've actually hit in production.





    Step 3 — Run It





    $ mcptest run

    mcptest results
    ┏━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ ━━━━┳━━━━━━━━┓
    ┃ Suite ┃ Case ┃ Status ┃
    ┡━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ ━━━━╇━━━━━━━━┩
    │ triage agent │ opens a new issue for a novel bug │ PASS │
    │ triage agent │ checks duplicates before creating │ PASS │
    │ triage agent │ comments on existing duplicate │ PASS │
    │ triage agent │ never deletes anything │ PASS │
    │ triage agent │ recovers from rate-limit gracefully│ PASS │
    └────────────────┴──────────────────────────────── ────┴────────┘

    5 passed, 0 failed (5 total)
    ⏱ 1.6s







    1.6 seconds. Zero tokens. No GitHub API calls. No rate-limit flake.





    Step 4 — The Regression-Diff Trick

    This is where the tool earns its keep.


    First, snapshot the current (known-good) agent trajectories as baselines:






    $ mcptest snapshot
    ✓ saved baseline for triage agent:pens a new issue... (2 tool calls)
    ✓ saved baseline for triage agent::checks duplicates... (2 tool calls)
    ✓ saved baseline for triage agent::comments on duplicate (2 tool calls)
    ...







    Now go tweak the system prompt. Something innocuous — change this:


    "You are a GitHub issue triage assistant. Check for duplicates before filing."


    To this:


    "You are a helpful assistant that handles GitHub issue reports."


    No [ERROR] in the code. All unit tests still pass. Linter is happy.






    $ mcptest diff --ci

    ✗ triage agent::checks duplicates before creating
    tool_order REGRESSION:
    baseline: list_issues → create_issue
    current: create_issue ← list_issues was dropped!

    ✗ triage agent::comments on existing duplicate
    tool_called REGRESSION:
    baseline: add_comment was called
    current: add_comment was never called

    2 regression(s) across 5 case(s)
    Exit: 1







    Exit code 1 — CI blocks the merge. The agent silently lost its duplicate-check behavior because of a one-sentence prompt change.


    This is the bug that cost me a weekend. I never want to hit it again.





    Step 5 — Wire It into CI





    # .github/workflows/agent-tests.yml
    name: Agent tests
    on: [pull_request]

    jobs:
    mcptest:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4
    - uses: actions/setup-python@v5
    with: { python-version: "3.11" }

    - run: pip install mcp-agent-test

    - name: Run tests
    run: mcptest run

    - name: Diff against baselines
    run: mcptest diff --ci

    - name: Post PR summary
    if: always()
    run: mcptest github-comment
    env:
    GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}







    Now every PR that changes the prompt, the model, or the agent code gets its trajectories diffed against main. If behavior shifts, a comment lands on the PR with the exact tool-order delta. Reviewers see agent behavior changed as clearly as they see code changed.





    Why MCP-Specific Matters

    Eval tooling is consolidating fast — independent evaluation startups keep getting folded into model-provider platforms. That's useful if you're all-in on one vendor. It's a lock-in risk if you're not.


    mcptest is independent, MIT-licensed, and specifically shaped for MCP agents. The tool_called / tool_order / error_handled primitives exist because that's what an MCP trajectory actually looks like — not because someone ported a generic LLM-eval DSL.


    MCP agent testing is particularly underserved. DeepEval is great for prompt evaluation. Inspect AI is great for benchmarks. Neither gives you "run your agent against a mock GitHub server and assert it didn't call delete_issue."


    mcptest does.





    Try It





    pip install mcp-agent-test

    # Scaffold a new project
    mcptest init

    # Or clone the quickstart
    git clone https://github.com/josephgec/mcptest
    cd mcptest/examples/quickstart
    mcptest run

    # Or install a pre-built fixture pack
    mcptest install-pack github ./my-project
    mcptest install-pack slack ./my-project
    mcptest install-pack filesystem ./my-project







    Six packs ship out of the box: GitHub, Slack, filesystem, database, HTTP, and git. Each one is a realistic mock with error scenarios baked in and tests that actually assert something.


    📦 Source: github.com/josephgec/mcptest

    📦 PyPI: pypi.org/project/mcp-agent-test





    If you're building an MCP agent and haven't started writing tests yet, you're accumulating the same three bugs I accumulated. Start with one fixture and one test case. Catch the first prompt-change regression. Then you'll understand why MCP agents need this.


    If this was useful, a ⭐ on the repo helps others find it.




    More...
Working...