DS-201c · Module 2

Model Validation

3 min read

A model that performs well on the data it was trained on is not a good model. It is a model that memorized the answers. The only validation that matters is performance on data the model has never seen.

I have rejected more models for overfitting than for any other reason. The pattern is always the same: 95% accuracy on training data, 68% on new data. The team is excited about 95%. I am looking at 68%. Because 68% is the number that matters in production.

MODEL VALIDATION FRAMEWORK
===========================

STEP 1: TRAIN/TEST SPLIT
  Reserve 20-30% of data as a test set.
  NEVER touch the test set during model development.
  The test set is your final exam. You take it once.
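In practice, Step 1 is one shuffle and one cut. Here is a minimal sketch in plain Python (a library routine like scikit-learn's `train_test_split` does the same job, with extras like stratification); the 20% test fraction matches the guideline above:

```python
import random

def train_test_split(rows, labels, test_frac=0.2, seed=42):
    """Shuffle indices once, then carve off the final test_frac as the test set."""
    rng = random.Random(seed)              # fixed seed: the split is reproducible
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    cut = int(len(idx) * (1 - test_frac))
    train_idx, test_idx = idx[:cut], idx[cut:]
    return ([rows[i] for i in train_idx], [labels[i] for i in train_idx],
            [rows[i] for i in test_idx],  [labels[i] for i in test_idx])

X = [[x] for x in range(100)]
y = [x % 2 for x in range(100)]
X_train, y_train, X_test, y_test = train_test_split(X, y, test_frac=0.2)
# 80 training rows, 20 test rows -- and the test rows stay untouched until Step 4.
```

Fixing the seed matters: if the split changes between runs, your final-exam set quietly leaks into development.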

STEP 2: CROSS-VALIDATION (during development)
  Split training data into 5 folds.
  Train on 4 folds, validate on 1. Rotate 5 times.
  Average the 5 validation scores. This is your
  development accuracy estimate.
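The fold rotation needs no library. In this sketch, `train_and_score` is a hypothetical stand-in for whatever fit-then-evaluate routine you use; the demo scorer just predicts the training set's majority class:

```python
def cross_val_mean(X, y, train_and_score, k=5):
    """Train on k-1 folds, validate on the held-out fold, rotate k times."""
    n = len(X)
    folds = [list(range(start, n, k)) for start in range(k)]  # round-robin folds
    scores = []
    for held_out in folds:
        held = set(held_out)
        tr = [i for i in range(n) if i not in held]
        scores.append(train_and_score(
            [X[i] for i in tr], [y[i] for i in tr],
            [X[i] for i in held_out], [y[i] for i in held_out]))
    return sum(scores) / k  # your development accuracy estimate

def majority_baseline(X_tr, y_tr, X_val, y_val):
    """Toy model: always predict the most common training label."""
    guess = max(set(y_tr), key=y_tr.count)
    return sum(label == guess for label in y_val) / len(y_val)

X = [[i] for i in range(100)]
y = [1] * 80 + [0] * 20                      # 80/20 class balance
dev_estimate = cross_val_mean(X, y, majority_baseline)  # ~0.8 across the folds
```

The baseline also shows why accuracy alone can flatter you: always guessing the majority class already scores ~0.8 on this data.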

STEP 3: TEMPORAL VALIDATION (for time series)
  Train on months 1-12. Test on months 13-15.
  NEVER shuffle time series data randomly.
  The model must predict the future from the past,
  not the past from the future.
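The same cut in code. Each record carries a `month` field (a hypothetical schema for illustration); the split falls at month 12, exactly as in the example above:

```python
def temporal_split(records, last_train_month):
    """Train strictly on the past, test strictly on the future. No shuffling."""
    train = [r for r in records if r["month"] <= last_train_month]
    test = [r for r in records if r["month"] > last_train_month]
    return train, test

records = [{"month": m, "closed": m % 3 == 0} for m in range(1, 16)]
train, test = temporal_split(records, last_train_month=12)
# train covers months 1-12, test covers months 13-15;
# every training record predates every test record.
```

Libraries offer the same discipline (e.g. scikit-learn's `TimeSeriesSplit`), but the invariant is the one shown here: the maximum training timestamp is strictly before the minimum test timestamp.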

STEP 4: FINAL TEST
  Run the model on the held-out test set. Once.
  This is your production accuracy estimate.
  If it drops more than 5 points from cross-validation, you overfit.
  Go back to Step 2 and simplify the model.

STEP 5: PRODUCTION MONITORING
  Track prediction accuracy weekly in production.
  If accuracy degrades more than 5 points from test performance,
  the model needs retraining. Data drift is real.
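Step 5 reduces to one comparison per week. A sketch with invented weekly numbers; the 5-point tolerance is the threshold from above, and the same check works for the Step 4 cross-validation-versus-test comparison:

```python
def needs_retraining(production_acc, test_acc, tolerance=0.05):
    """True once production accuracy drifts more than tolerance below test."""
    return test_acc - production_acc > tolerance

test_acc = 0.82                              # from Step 4, the one-shot final test
weekly = {"W1": 0.81, "W2": 0.80, "W3": 0.76, "W4": 0.74}
flagged = [w for w, acc in weekly.items() if needs_retraining(acc, test_acc)]
# W3 and W4 breach the threshold -- schedule retraining before week 5
```

Automate this as a scheduled job, not a manual review: drift rarely announces itself.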

METRICS TO REPORT:
  Accuracy:   Overall correctness (misleading if imbalanced)
  Precision:  Of predicted positives, how many are correct
  Recall:     Of actual positives, how many did we catch
  AUC-ROC:    Overall discrimination ability (best summary)
  Calibration: Does 70% predicted probability = 70% actual?

Calibration is the metric most teams ignore and the one I care about most. A well-calibrated model means: when it says "70% probability of closing," roughly 70 out of 100 such deals actually close. A poorly calibrated model might say "70% probability" but only 45 out of 100 close. The predictions look confident. They are wrong.
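The check is simple to run: bucket predictions by stated probability and compare against the observed outcome rate. A sketch using the 70% example above (the deal counts are invented for illustration; scikit-learn's `calibration_curve` does this across all buckets at once):

```python
def observed_rate(probs, outcomes, lo, hi):
    """Among predictions scored in [lo, hi), what fraction actually came true?"""
    bucket = [won for p, won in zip(probs, outcomes) if lo <= p < hi]
    return sum(bucket) / len(bucket)

# 100 deals the model scored at 70% -- but only 45 of them closed
probs = [0.70] * 100
outcomes = [1] * 45 + [0] * 55
rate = observed_rate(probs, outcomes, 0.6, 0.8)
gap = 0.70 - rate          # ~0.25 calibration gap: confident and wrong
```

A well-calibrated model keeps that gap near zero in every bucket; a gap this large means the probabilities cannot be trusted for forecasting.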

CLOSER relies on calibrated predictions for pipeline forecasting. If my model says a deal has 85% probability and that number is calibrated, he can plan around it. If it is uncalibrated, the forecast is meaningless. Calibration is what makes prediction actionable.

Do This

  • Validate on held-out data the model never saw during training — this is your real accuracy
  • Use temporal validation for any time-based prediction — train on the past, test on the future
  • Track calibration alongside accuracy — a model that says 70% should be right 70% of the time

Avoid This

  • Report training accuracy as model performance — that measures memorization, not prediction
  • Shuffle time series data randomly — the model will learn from the future to predict the past
  • Deploy a model without production monitoring — all models degrade over time as data drifts