RC-401h · Module 2

Instrumentation: Knowing When a Prompt Is Degrading

5 min read

Prompt degradation is the condition where a prompt that worked correctly begins producing outputs of declining quality — subtly, gradually, without throwing an error. The causes are varied: model drift after a provider update, input distribution shift as your user base evolves, context accumulation that pushes effective instructions past the model's attention peak, or schema contract drift where the consuming system started interpreting outputs differently. The symptom is always the same: production outcomes deteriorate while all the machinery continues running normally.

You cannot detect prompt degradation by waiting for user complaints. You detect it through instrumentation — a telemetry layer that continuously measures prompt output quality against quantifiable criteria and alerts you when those measurements trend in the wrong direction. CLAWMANDER's daily reports framework applies this principle to agent operations. This lesson applies it to the prompt system layer.

// Prompt telemetry middleware — wraps every prompt invocation
// Records quality signals for trend analysis and degradation detection

export interface PromptTelemetryEvent {
  prompt_ref: string;           // e.g., "forge/prod/system/1.4.0"
  invocation_id: string;        // uuid
  timestamp: string;            // ISO 8601
  input_tokens: number;
  output_tokens: number;
  latency_ms: number;
  schema_valid: boolean;        // did output pass the schema contract?
  schema_errors: string[];      // field-level validation failures
  judge_score?: number;         // 0–100, LLM-as-judge on sampled calls
  user_feedback?: 'positive' | 'negative' | null;
  model_id: string;             // exact model version used
}

// Degradation alert thresholds — register per prompt in library metadata
export interface DegradationThresholds {
  schema_valid_rate_min: number;    // e.g., 0.95 — alert if drops below 95%
  judge_score_min: number;          // e.g., 75 — alert if rolling avg drops below
  latency_p95_max_ms: number;       // e.g., 8000 — alert if p95 latency exceeds
  evaluation_window_hours: number;  // e.g., 24 — rolling window for threshold checks
}

// Called after every prompt invocation in the middleware layer
export async function recordTelemetry(event: PromptTelemetryEvent): Promise<void> {
  await telemetryStore.insert(event);
  await checkDegradationThresholds(event.prompt_ref);
}

// Runs on a schedule and on every telemetry write
async function checkDegradationThresholds(promptRef: string): Promise<void> {
  const thresholds = await promptLibrary.getThresholds(promptRef);
  const window = await telemetryStore.getWindow(promptRef, thresholds.evaluation_window_hours);
  if (window.length === 0) return; // no events yet: 0/0 would yield NaN, and every NaN comparison below would silently fail

  const schemaValidRate = window.filter(e => e.schema_valid).length / window.length;
  if (schemaValidRate < thresholds.schema_valid_rate_min) {
    await alertOps({ promptRef, metric: 'schema_valid_rate', value: schemaValidRate, threshold: thresholds.schema_valid_rate_min });
  }
  // ... repeat for judge_score and latency
}
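The elided judge-score and latency checks follow the same pattern. A minimal, self-contained sketch, assuming the same evaluation window of events; `percentile` and `rollingJudgeAverage` are illustrative helpers, not part of the middleware above:

```typescript
interface WindowEvent {
  latency_ms: number;
  judge_score?: number; // present only on judge-sampled calls
}

// Nearest-rank percentile over a copy of the values (does not mutate input)
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Rolling average over the sampled subset that carries a judge score
function rollingJudgeAverage(events: WindowEvent[]): number | null {
  const scored = events.filter(e => e.judge_score !== undefined);
  if (scored.length === 0) return null; // nothing sampled in this window: no signal
  return scored.reduce((sum, e) => sum + (e.judge_score as number), 0) / scored.length;
}

// Example window: three calls, two of them judge-sampled
const recentEvents: WindowEvent[] = [
  { latency_ms: 1200, judge_score: 82 },
  { latency_ms: 9500 },
  { latency_ms: 2100, judge_score: 74 },
];

const p95 = percentile(recentEvents.map(e => e.latency_ms), 95);
const avg = rollingJudgeAverage(recentEvents);
console.log(p95, avg); // → 9500 78
```

Note that the judge average is computed only over sampled calls, so a window with no samples returns null rather than a misleading zero.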

Sampling strategy matters. You cannot run an LLM-as-judge evaluation on every production call — the cost is prohibitive and the latency impact is unacceptable. Instead, sample 5–10% of production calls for judge scoring, stratified by input type so that coverage spans the prompt's full behavioral range. Schema validation, by contrast, runs on 100% of calls: it is lightweight and its signal-to-noise ratio is high. The judge evaluation runs only on the sample and feeds the rolling-average judge_score metric.
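One way to make the sampling decision reproducible is to hash the invocation id and compare the result against a per-stratum rate, so the same call always gets the same verdict. A sketch under assumptions: the `input_type` stratum key, the FNV-1a hash choice, and the rates shown are all illustrative, not part of the lesson's schema:

```typescript
// 32-bit FNV-1a hash — deterministic, fast, good enough for sampling
function fnv1a(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

// Per-stratum sampling rates: rare input types can be oversampled
// so the judge still sees them despite low traffic volume.
const SAMPLE_RATES: Record<string, number> = {
  default: 0.05,   // 5% baseline
  edge_case: 0.5,  // oversample a rare stratum
};

function shouldJudgeSample(invocationId: string, inputType: string): boolean {
  const rate = SAMPLE_RATES[inputType] ?? SAMPLE_RATES.default;
  // Map the hash onto [0, 1) and compare against the stratum's rate
  return fnv1a(invocationId) / 0x100000000 < rate;
}
```

Because the decision is a pure function of the invocation id, retries and replays of the same call land in the same bucket, which keeps the judged sample stable across reprocessing.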