MP-301b · Module 2

End-to-End Test Flows

3 min read

End-to-end tests simulate realistic multi-tool conversations. Instead of testing each tool in isolation, you write a test that mirrors what an LLM would do: call tool A, use data from A's response to call tool B, verify the final state. This catches integration bugs that single-tool tests miss — data format mismatches between tools, missing cross-references in tool descriptions, and ordering dependencies that break when tools are called in unexpected sequences.

The most valuable e2e pattern is the "golden path" test: a complete user workflow exercised through tool calls. For a CRM server, the golden path might be: search_customers → get_customer → update_customer → get_customer (verify the update persisted). For a knowledge base, it might be: index_document → search → get_document → delete_document → search (verify deletion). Each golden path test validates that the entire tool ecosystem works together, not just that individual tools respond correctly.

import { describe, it, expect, beforeAll, afterAll } from "vitest";
import { createTestPair } from "../helpers/fixture-server.js";

describe("CRM workflow e2e", () => {
  let client: Awaited<ReturnType<typeof createTestPair>>["client"];
  let cleanup: () => Promise<void>;

  beforeAll(async () => {
    ({ client, cleanup } = await createTestPair());
  });
  afterAll(() => cleanup());

  it("search → get → update → verify flow", async () => {
    // Step 1: Search for a customer
    const searchResult = await client.callTool({
      name: "search_customers",
      arguments: { query: "Acme" },
    });
    expect(searchResult.isError).toBeFalsy();
    const searchData = JSON.parse(searchResult.content[0].text);
    expect(searchData.results.length).toBeGreaterThan(0);

    // Step 2: Get full details using ID from search
    const customerId = searchData.results[0].id;
    const getResult = await client.callTool({
      name: "get_customer",
      arguments: { customer_id: customerId },
    });
    expect(getResult.isError).toBeFalsy();
    const customer = JSON.parse(getResult.content[0].text);
    expect(customer.id).toBe(customerId);

    // Step 3: Update using the same ID
    const updateResult = await client.callTool({
      name: "update_customer",
      arguments: { customer_id: customerId, status: "vip" },
    });
    expect(updateResult.isError).toBeFalsy();

    // Step 4: Verify update persisted
    const verifyResult = await client.callTool({
      name: "get_customer",
      arguments: { customer_id: customerId },
    });
    expect(verifyResult.isError).toBeFalsy();
    const updated = JSON.parse(verifyResult.content[0].text);
    expect(updated.status).toBe("vip");
  });
});

Do This

  • Test multi-tool workflows that mirror real LLM conversation patterns
  • Use data from one tool's response as input to the next — this catches format mismatches
  • Verify state changes persist across tool calls within a session
  • Limit e2e tests to 3-5 critical golden paths to keep the suite maintainable

Avoid This

  • Duplicate unit test coverage in e2e tests — each layer tests different things
  • Hardcode expected values that depend on fixture data ordering
  • Skip cleanup between e2e tests — state leakage causes cascading failures
  • Write e2e tests for every possible tool combination — focus on real user flows