PM-201c · Module 2
The Golden Dataset
4 min read
The golden dataset is the non-negotiable foundation of prompt quality control. It is a set of input-output pairs that define what correct behavior looks like for a given prompt. Every time the prompt changes, it is run against the golden dataset and the outputs are compared to the expected results. If a new version passes the golden dataset, it is a candidate for production. If it fails, the change is not promoted. This is the minimum viable test infrastructure for a production prompt.
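The run-and-compare loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: `run_prompt` is a hypothetical stub standing in for the real prompt-plus-model call, and exact-match comparison is used only for brevity.

```python
def run_prompt(text: str) -> str:
    """Stand-in for the real prompt + model call (hypothetical stub)."""
    canned = {"What is 2+2?": "4", "Capital of France?": "Paris"}
    return canned.get(text, "")

def run_golden_dataset(golden: list[dict], run) -> list[bool]:
    """Run every sample through the prompt and compare against the expected
    output. Exact-match comparison is used here for brevity."""
    return [run(sample["input"]) == sample["expected"] for sample in golden]

golden = [
    {"input": "What is 2+2?", "expected": "4"},
    {"input": "Capital of France?", "expected": "Paris"},
]
results = run_golden_dataset(golden, run_prompt)
print(f"{sum(results)}/{len(results)} samples passed")  # 2/2 samples passed
```

In practice the comparison is rarely a string equality check; the point is only that every prompt change triggers the same loop over the same samples.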
- 1. Collect representative samples: Gather 15-30 input examples that cover the full range of inputs the prompt will encounter in production. Include typical cases, edge cases, and the specific inputs that have caused failures in the past. A golden dataset that only covers easy cases will not catch regressions on hard ones.
- 2. Define expected outputs: For each input, define what a correct output looks like. This does not require an exact string match; it requires a specification: correct format, required fields present, tone compliance, length within range, no hallucinated content. The expected output is a specification, not a verbatim answer.
- 3. Define the approval threshold: How many samples must pass for the prompt to be approved? 100%? 95%? Are some failure categories disqualifying even at low frequency (e.g., any hallucination)? Define the threshold and commit to it before running the test.
- 4. Maintain the dataset: When new failure modes are discovered in production, add a corresponding sample to the golden dataset. The dataset should grow to represent everything the prompt has been tested against. A static golden dataset that never grows is a quality gate with known blind spots.
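Steps 2 and 3 can be made concrete with a specification-style checker and an approval gate. The specific checks below (valid JSON, a required `summary` field, a length cap) and the `hallucination` failure label are illustrative assumptions, not requirements of the method.

```python
import json

def check_output(output: str) -> list[str]:
    """Check one output against a specification; return the names of failed
    checks. An empty list means the sample passed. The checks shown here
    (JSON format, required field, length cap) are example criteria."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return ["invalid_json"]
    failures = []
    if "summary" not in data:
        failures.append("missing_summary")
    if len(output) > 500:
        failures.append("too_long")
    return failures

def approve(all_failures: list[list[str]], threshold: float = 0.95,
            disqualifying: frozenset = frozenset({"hallucination"})) -> bool:
    """Approval gate: the pass rate must meet the threshold AND no
    disqualifying failure may appear anywhere, even once."""
    if any(d in f for f in all_failures for d in disqualifying):
        return False
    pass_rate = sum(1 for f in all_failures if not f) / len(all_failures)
    return pass_rate >= threshold

outputs = ['{"summary": "ok"}', 'not json']
failures = [check_output(o) for o in outputs]
print(approve(failures))  # 1 of 2 passed: below 0.95, prints False
```

Note that the gate enforces both rules from step 3 independently: a single disqualifying failure blocks promotion even if the overall pass rate clears the threshold.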
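Step 4 is mostly process, but the mechanics are simple. A sketch, assuming the golden dataset lives in a JSONL file (one sample per line, a hypothetical storage choice) with the expected output recorded as a specification note:

```python
import json

def add_regression_sample(path: str, input_text: str, spec: str) -> None:
    """Append a production failure to the golden-dataset file so the next
    prompt version is tested against it."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps({"input": input_text, "spec": spec}) + "\n")

def load_golden(path: str) -> list[dict]:
    """Load all samples from the JSONL golden-dataset file."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Appending rather than editing keeps the dataset monotonically growing, which is exactly the property the section calls for: every discovered failure mode stays in the gate permanently.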