PM-301d · Module 2

When Self-Critique Backfires

4 min read

Self-critique improves quality on many task types but actively degrades it on others. The failure mode is confident-wrong-answer amplification: when the model is wrong and confident, self-critique reinforces the wrong answer rather than correcting it. The model critiques from the same wrong frame that produced the error, finds no fault, and outputs the error with higher confidence.

  1. High-Risk Task Types for Self-Critique

     Factual claims outside the model's training data. Mathematical reasoning (the model re-computes using the same flawed method). Complex logical deduction (the model re-applies the same flawed inference). In these cases, self-critique does not improve accuracy; it confirms errors.
  2. Detection: Measure Pre- and Post-Critique Accuracy

     Test self-critique on your task type empirically. Run 50+ examples through both pipelines and compare accuracy before and after the critique pass. If post-critique accuracy is not at least 3 percentage points higher, self-critique is not helping and may be hurting. Measure; do not assume.
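This measurement can be sketched in a few lines. The helper name and the correctness flags below are illustrative: in practice you would collect a per-example right/wrong flag for the same labeled set, once without and once with the critique pass.

```python
def critique_lift(initial_correct, critiqued_correct, min_lift=0.03):
    """Compare accuracy before and after self-critique on the same examples.

    initial_correct / critiqued_correct: lists of booleans, one per example,
    marking whether the answer was right without / with the critique pass.
    """
    n = len(initial_correct)
    pre = sum(initial_correct) / n
    post = sum(critiqued_correct) / n
    lift = post - pre
    # Keep the critique step only if it clears the minimum-lift bar.
    verdict = "keep" if lift >= min_lift else "drop"
    return pre, post, lift, verdict

# Hypothetical run on 50 labeled examples: 80% accurate before critique,
# 82% after. A 2-point lift is below the 3-point bar, so drop the step.
pre, post, lift, verdict = critique_lift(
    [True] * 40 + [False] * 10,
    [True] * 41 + [False] * 9,
)
```

Note the comparison is paired: both pipelines see the same examples, so the lift is not confounded by sampling a different test set.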
  3. Alternatives When Self-Critique Fails

     Ensemble approaches: generate multiple independent responses and compare. External validation: route to a separate prompt or system that validates claims against a known-good source. Human-in-the-loop checkpoints for high-stakes claims. None of these are as cheap as self-critique, but they work when self-critique does not.
  4. Scope Self-Critique to Format, Not Fact

     On tasks where factual self-critique fails, scope the critique to format and structural dimensions only: completeness, format compliance, scope adherence. These are low-hallucination critique dimensions. Reserve factual critique for tasks where the model has reliable knowledge, or for tasks where you can ground the critique against provided documents.
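A format-scoped critique prompt might look like the following. The dimension list mirrors the ones above, but the exact wording is an illustrative template, not a tested one:

```python
# Illustrative critique prompt scoped to structural dimensions only.
# Factual accuracy is explicitly excluded from the reviewer's mandate.
FORMAT_CRITIQUE_PROMPT = """Review the draft below on these dimensions ONLY:
1. Completeness: does it address every part of the request?
2. Format compliance: does it match the requested output format?
3. Scope adherence: does it stay within the requested scope?
Do NOT evaluate factual accuracy or re-derive any claims.
List issues per dimension, or reply PASS if all three dimensions pass.

Draft:
{draft}"""

# Fill the template with the response to be critiqued.
prompt = FORMAT_CRITIQUE_PROMPT.format(draft="Q3 summary: revenue grew...")
```

The explicit "do NOT evaluate factual accuracy" line matters: without it, models tend to drift back into factual critique, reintroducing the failure mode this scoping is meant to avoid.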