CX-301c · Module 3

Data Quality for Prediction

3 min read

A predictive model built on dirty data produces confidently wrong answers, which are worse than no answers at all because they create false certainty. Data quality management for predictive health models means ensuring that every input to the model is accurate, current, and consistently measured. One bad data source contaminates the entire prediction: the CSM who enters interaction notes inconsistently produces a sentiment signal that oscillates randomly, and the product team that changes its usage metric definition mid-quarter produces a discontinuity in the adoption signal. Data quality is the foundation; everything else is decoration without it.

  1. Define Data Quality Standards. For each model input, define: what it measures, how it is collected, how often it is updated, and what constitutes a valid entry. For example, response velocity is measured in hours from send to reply, collected from email timestamps, updated weekly, and counts only personalized communications. The standard prevents the garbage-in problem that undermines every prediction model.
  2. Monitor Data Completeness. Track the percentage of accounts with complete model inputs. An account missing three of its seven inputs produces a prediction based on roughly half the data, which is roughly half a prediction. Set a completeness threshold: accounts below 80% data completeness should be flagged for manual assessment rather than trusted to the model.
  3. Audit Data Integrity Quarterly. Each quarter, sample 10% of accounts and verify that the model inputs match reality. Is the recorded response velocity consistent with actual email timestamps? Is the stakeholder count accurate? Does the adoption metric reflect real usage? Discrepancies reveal collection process failures that must be fixed to keep the model accurate.
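Step 1 can be made concrete as a small registry of input standards. The sketch below is illustrative, not a prescribed schema: the field names and the validity bounds for response velocity are assumptions, but the four elements mirror the standard described in step 1.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class InputStandard:
    name: str                           # what it measures
    source: str                         # how it is collected
    update_cadence_days: int            # how often it is updated
    is_valid: Callable[[object], bool]  # what constitutes a valid entry

# Example standard for response velocity, per the text: hours from send
# to reply, from email timestamps, updated weekly, personalized only.
# The 0–720 hour validity window is an illustrative assumption.
response_velocity = InputStandard(
    name="response_velocity_hours",
    source="email timestamps (send to reply), personalized messages only",
    update_cadence_days=7,
    is_valid=lambda v: isinstance(v, (int, float)) and 0 <= v <= 720,
)

print(response_velocity.is_valid(18.5))  # plausible reply time: valid
print(response_velocity.is_valid(-3))    # negative hours: invalid
```

Writing the validity rule as a callable lets each input carry its own definition of "garbage," so bad entries are rejected at collection time rather than discovered inside a prediction.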
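Step 2's completeness check reduces to a ratio and a routing decision. A minimal sketch, assuming each account is a dict of seven model inputs with `None` marking a missing value; the input names are hypothetical, but the 80% threshold and the manual-review routing come from the text.

```python
# Hypothetical list of the seven model inputs.
MODEL_INPUTS = [
    "response_velocity", "stakeholder_count", "adoption", "sentiment",
    "support_volume", "invoice_timeliness", "engagement_breadth",
]
COMPLETENESS_THRESHOLD = 0.80  # per the text: below this, don't trust the model

def completeness(account: dict) -> float:
    """Fraction of model inputs present (non-None) for this account."""
    present = sum(1 for f in MODEL_INPUTS if account.get(f) is not None)
    return present / len(MODEL_INPUTS)

def route(account: dict) -> str:
    """Send incomplete accounts to manual assessment instead of the model."""
    return "model" if completeness(account) >= COMPLETENESS_THRESHOLD else "manual_review"

acct = {"response_velocity": 12, "adoption": 0.6, "sentiment": None,
        "stakeholder_count": 4, "support_volume": 2}
print(round(completeness(acct), 2), route(acct))  # 4 of 7 inputs → manual_review
```

The point of the routing function is that completeness gates the prediction itself: a score computed from four of seven inputs is never surfaced as if it were a full-data score.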
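Step 3's quarterly audit can be sketched as sample-and-recompute: draw 10% of accounts, re-derive each input from the raw source, and flag disagreements. Everything here is an assumption about shape, not a real pipeline; `recompute` is a hypothetical stand-in for re-deriving a metric from source data, and the 5% tolerance is illustrative.

```python
import random

def audit(accounts: dict, recompute, sample_rate=0.10, tolerance=0.05, seed=0):
    """Sample accounts and flag recorded inputs that diverge from the source.

    accounts:  {account_id: {input_name: recorded_value}}
    recompute: account_id -> {input_name: value re-derived from raw data}
    Returns a list of (account_id, input_name, recorded, actual) discrepancies.
    """
    rng = random.Random(seed)  # seeded so the sample is reproducible
    n = max(1, int(len(accounts) * sample_rate))
    discrepancies = []
    for acct_id in rng.sample(sorted(accounts), n):
        actual = recompute(acct_id)
        for field, recorded in accounts[acct_id].items():
            act = actual.get(field)
            if act is None or abs(recorded - act) > tolerance * max(abs(act), 1):
                discrepancies.append((acct_id, field, recorded, act))
    return discrepancies

# Demo with fabricated data: one account's recorded velocity drifted
# from what the email timestamps actually show.
accounts = {f"acct-{i}": {"response_velocity": 10.0} for i in range(20)}
def recompute(acct_id):
    return {"response_velocity": 30.0 if acct_id == "acct-3" else 10.0}

issues = audit(accounts, recompute, sample_rate=1.0)  # full sweep for the demo
print(issues)
```

Each flagged tuple points at a specific account and input, which is what turns the audit into a diagnosis of the collection process rather than a pass/fail score.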