RC-401h · Module 1
Regression Testing Prompts Like Code
If you would not ship a code change without running your test suite, you should not ship a prompt change without running your prompt regression suite. The engineering standard is identical — the artifact is different. Prompts are probabilistic, which means your regression suite is not looking for bit-perfect output equality. It is looking for behavioral consistency: does the new version of this prompt produce outputs that are structurally valid, semantically correct, and within acceptable variation bounds compared to the baseline?
A prompt regression suite has three layers. The first layer is schema validation — every test case must produce output that passes the schema contract. The second layer is golden-set comparison — a curated set of inputs with expected outputs, evaluated by either exact match (for deterministic fields) or LLM-as-judge (for semantic fields). The third layer is behavioral boundary testing — deliberately adversarial inputs designed to probe the prompt's guard rails: jailbreak attempts, malformed inputs, edge cases the prompt designer did not anticipate.
// Prompt regression test structure — run on every prompt version bump
// Framework: Vitest. LLM-as-judge uses a lightweight evaluation model.
import { describe, it, expect } from 'vitest';
import { callPrompt } from '../lib/promptLibrary';
import { validateHandoff } from '../schemas/handoff-payload.schema';
import { llmJudge } from '../lib/llmJudge';
// Golden-set cases and boundary fixtures; each golden case carries a `name`
// field so failures are reported by case name.
import { goldenSet, urgentInput, standardInput, malformedInput } from './fixtures/handoff.golden';

const PROMPT_REF = 'forge/staging/handoff/1.5.0';

describe(`Regression: ${PROMPT_REF}`, () => {
  // Layer 1: Schema validation on all golden-set inputs
  it.each(goldenSet)('produces valid schema output — case: $name', async ({ input }) => {
    const output = await callPrompt(PROMPT_REF, input);
    expect(() => validateHandoff(output)).not.toThrow();
  });

  // Layer 2: Semantic correctness on priority golden cases
  it('correctly sets priority=high for urgent escalation input', async () => {
    const output = await callPrompt(PROMPT_REF, urgentInput);
    const parsed = validateHandoff(output);
    expect(parsed.priority).toBe('high');
    expect(parsed.escalation_reason).toBeTruthy();
  });

  // Layer 2: LLM-as-judge for summary quality
  it('summary is coherent and within token budget', async () => {
    const output = await callPrompt(PROMPT_REF, standardInput);
    const parsed = validateHandoff(output);
    const judgment = await llmJudge({
      criterion: 'Is this summary coherent, accurate, and under 500 characters?',
      content: parsed.payload.summary,
    });
    expect(judgment.pass).toBe(true);
  });

  // Layer 3: Boundary — malformed input should not crash, should return low-priority default
  it('handles missing task_type gracefully', async () => {
    const output = await callPrompt(PROMPT_REF, malformedInput);
    const parsed = validateHandoff(output);
    expect(parsed.priority).toBe('low');
  });
});
- Build the Golden Set: Collect 20–50 real production inputs that cover the full behavioral range of the prompt. Label the expected output for deterministic fields. For semantic fields, have a human reviewer rate the baseline output as acceptable. This set is your regression anchor — any new version must match or exceed baseline on every case.
- Define the Pass Threshold: Set numeric pass thresholds, e.g. 95% of golden-set cases must pass schema validation, 90% must match deterministic fields exactly, and 85% must pass LLM-as-judge evaluation on semantic fields. These thresholds are recorded in the prompt library metadata. Promotion from staging to production requires meeting all three thresholds.
- Automate on Every Version Bump: Wire the regression suite into your CI pipeline. Any prompt version bump triggers a full suite run. The result — pass/fail/threshold scores — is attached to the library registration record. Promotion is blocked until the suite passes. No exceptions. The ten most dangerous words in prompt engineering are "we will test it more thoroughly after we ship it."
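A golden-set fixture can be as simple as a typed array of labeled cases. Here is a minimal sketch; the interface name, field names, and sample cases are illustrative assumptions, not a fixed format from the text:

```typescript
// Minimal golden-set shape — names and fields are illustrative, not a fixed API.
interface GoldenCase {
  name: string;                      // shown in test output when a case fails
  input: string;                     // real production input, anonymized
  expected: Record<string, string>;  // labeled deterministic fields (e.g. priority)
  semanticBaselineApproved: boolean; // human reviewer accepted the baseline output
}

export const goldenSet: GoldenCase[] = [
  {
    name: 'urgent-escalation',
    input: 'Customer reports total outage, contract renewal at risk.',
    expected: { priority: 'high' },
    semanticBaselineApproved: true,
  },
  {
    name: 'routine-followup',
    input: 'Please confirm the meeting time for next week.',
    expected: { priority: 'low' },
    semanticBaselineApproved: true,
  },
];
```

Keeping deterministic expectations (`expected`) separate from the human-approved semantic baseline flag makes it clear which fields are checked by exact match and which go to the LLM judge.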
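The promotion gate reduces to a pure function over the suite's per-layer pass rates. A sketch, using the threshold values quoted above; the `LayerResults` shape and `meetsThresholds` name are hypothetical:

```typescript
// Per-layer pass counts from one regression run — shape is illustrative.
interface LayerResults {
  schema: { passed: number; total: number };
  deterministic: { passed: number; total: number };
  judge: { passed: number; total: number };
}

// Thresholds recorded in prompt library metadata (values from the steps above).
const THRESHOLDS = { schema: 0.95, deterministic: 0.9, judge: 0.85 };

// Promotion from staging to production requires meeting all three thresholds.
function meetsThresholds(r: LayerResults): boolean {
  const rate = (x: { passed: number; total: number }) => x.passed / x.total;
  return (
    rate(r.schema) >= THRESHOLDS.schema &&
    rate(r.deterministic) >= THRESHOLDS.deterministic &&
    rate(r.judge) >= THRESHOLDS.judge
  );
}

console.log(meetsThresholds({
  schema: { passed: 48, total: 50 },        // 96% — clears 95%
  deterministic: { passed: 46, total: 50 }, // 92% — clears 90%
  judge: { passed: 44, total: 50 },         // 88% — clears 85%
})); // → true
```

Because all three rates are checked conjunctively, a prompt that excels on schema validity but slips on judge scores is still blocked, which is exactly the behavior a promotion gate needs.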