PM-301i · Module 2
What to Monitor
4 min read
Production prompt monitoring falls into two categories: operational metrics (is the system working?) and quality metrics (is the prompt performing?). Most teams monitor operational metrics adequately. Most teams have no quality metrics at all. The result: they know when the API is down, but not when the prompt is quietly degrading.
Operational metrics: latency (p50, p95, p99 response times), token usage (input + output tokens per request), error rate (API errors, timeouts, malformed responses), and fallback rate (requests where the fallback path was triggered instead of the primary prompt). These are the table stakes. If you are not collecting these, start there.
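These four operational metrics can be collected with a small in-memory recorder. A minimal Python sketch, assuming per-request instrumentation; the `OperationalMetrics` class and the nearest-rank percentile helper are illustrative, not from any specific monitoring library:

```python
import statistics
from dataclasses import dataclass, field

@dataclass
class OperationalMetrics:
    """Illustrative in-memory recorder for one prompt's operational metrics."""
    latencies_ms: list = field(default_factory=list)
    input_tokens: list = field(default_factory=list)
    output_tokens: list = field(default_factory=list)
    errors: int = 0
    fallbacks: int = 0
    requests: int = 0

    def record(self, latency_ms, in_tok, out_tok, error=False, fallback=False):
        # Call once per request; error covers API errors, timeouts,
        # malformed responses; fallback marks the fallback path firing.
        self.requests += 1
        self.latencies_ms.append(latency_ms)
        self.input_tokens.append(in_tok)
        self.output_tokens.append(out_tok)
        if error:
            self.errors += 1
        if fallback:
            self.fallbacks += 1

    def summary(self):
        lat = sorted(self.latencies_ms)

        def pct(p):
            # Nearest-rank percentile over the sorted latencies.
            return lat[min(len(lat) - 1, int(p / 100 * len(lat)))]

        return {
            "p50_ms": pct(50), "p95_ms": pct(95), "p99_ms": pct(99),
            "avg_input_tokens": statistics.mean(self.input_tokens),
            "avg_output_tokens": statistics.mean(self.output_tokens),
            "error_rate": self.errors / self.requests,
            "fallback_rate": self.fallbacks / self.requests,
        }
```

In production this state would live in a metrics backend rather than process memory, but the shape of what to record per request is the same.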
Quality metrics require more design.

Format compliance rate: what percentage of outputs match the required format? This is automatable: parse the output and check it against the schema. A decline in format compliance is an early warning that model behavior has shifted or that an upstream data change has broken input preprocessing.

User correction rate: how often do users edit, regenerate, or override the prompt output? This is the strongest signal of quality degradation because it measures actual user response to output quality, not a proxy metric.

Downstream error rate: what percentage of outputs cause errors in the downstream system that consumes them? A prompt producing valid JSON that fails downstream schema validation is a quality failure even if format compliance is 100%.
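The format compliance check is the easiest of the three to automate. A minimal sketch using a hand-rolled field/type check on JSON outputs; the `REQUIRED_FIELDS` schema is a hypothetical example, and a real system might use a schema library instead:

```python
import json

# Hypothetical expected schema: required field name -> required type.
REQUIRED_FIELDS = {"summary": str, "sentiment": str, "confidence": float}

def is_compliant(raw_output: str) -> bool:
    """Return True if the output parses as JSON and matches the schema."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(
        name in data and isinstance(data[name], expected)
        for name, expected in REQUIRED_FIELDS.items()
    )

def compliance_rate(outputs: list[str]) -> float:
    """Fraction of outputs matching the schema, e.g. over one day's traffic."""
    return sum(is_compliant(o) for o in outputs) / len(outputs)
```

Running this over each day's outputs and plotting the rate gives the daily compliance signal described above; the downstream error rate then catches the outputs that pass this check but still break the consuming system.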
- Latency (p50/p95/p99): Track at the prompt level, not just the endpoint level. Different prompts have very different latency profiles depending on output length. Baseline each prompt independently and alert on deviation from its own baseline.
- Token Usage: Monitor average input and output tokens per request. Sharp increases in output token count indicate the prompt may be generating verbose or unintended content; sharp decreases indicate truncation or refusal. Both are quality signals.
- Format Compliance Rate: Parse every output and validate it against the expected schema. Track the daily compliance rate. This is your first-line quality signal: it is automatable, real-time, and sensitive to model behavior changes.
- User Correction Rate: Instrument user actions (regenerate, edit, reject) and track the rate over time. Rising correction rates indicate declining quality before it shows up as escalations or complaints.
- Downstream Error Rate: Track errors in downstream systems that trace back to prompt outputs. A downstream parse failure is a prompt quality failure, even if the output looks syntactically valid to the prompt monitoring system.
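The "alert on deviation from its own baseline" advice in the latency bullet can be sketched as a simple per-prompt check. This is an illustrative approach, not a prescribed one: keep a history of a daily metric (p95 latency, compliance rate, correction rate) for each prompt and flag values more than `n_sigma` standard deviations from that prompt's own mean. The function name and default threshold are assumptions:

```python
import statistics

def deviates_from_baseline(history: list[float], today: float,
                           n_sigma: float = 3.0) -> bool:
    """True if today's value is more than n_sigma std devs from the
    historical mean for this prompt. history holds prior daily values."""
    if len(history) < 2:
        return False  # not enough data to establish a baseline
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        # Perfectly flat history: any change at all is a deviation.
        return today != mean
    return abs(today - mean) > n_sigma * stdev
```

Because each prompt is compared only against its own history, a chatty summarization prompt and a terse classification prompt can coexist without one's normal behavior tripping alerts tuned for the other.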