DS-301g · Module 2

Validation Methodology

3 min read

A model that performs well on training data but poorly on new data has memorized the past, not learned the patterns. Validation methodology prevents this.

For time series, use walk-forward validation. Train on months one through twelve, predict month thirteen. Train on months one through thirteen, predict month fourteen. Repeat. The model is always predicting data it has not seen.

For cross-sectional data, use k-fold cross-validation. Split the data into five folds, train on four, test on the fifth, and repeat five times so that each fold serves as the test set exactly once. The average performance across folds estimates real-world accuracy.

For both, never use the test set to tune the model. The test set is sacred: it simulates real-world performance, and any contamination invalidates the estimate.
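The walk-forward scheme described above can be sketched in a few lines. This is a minimal illustration, not a production setup: the `monthly_sales` series and the mean-of-history "model" are hypothetical stand-ins, chosen so the splitting logic stays visible.

```python
# Hypothetical 16 months of sales data (illustration only).
monthly_sales = [112, 118, 121, 130, 128, 135, 140, 138, 145, 150, 148, 155,
                 160, 162, 158, 170]

def walk_forward_splits(n_periods, min_train=12):
    """Yield (train_indices, test_index): an expanding training window
    that always predicts the next unseen period."""
    for t in range(min_train, n_periods):
        yield list(range(t)), t

errors = []
for train_idx, test_idx in walk_forward_splits(len(monthly_sales)):
    train = [monthly_sales[i] for i in train_idx]
    # Placeholder "model": predict the mean of the training window.
    prediction = sum(train) / len(train)
    errors.append(abs(prediction - monthly_sales[test_idx]))

# One error per out-of-sample month (months 13 through 16 here).
print(f"mean absolute error over {len(errors)} folds: "
      f"{sum(errors) / len(errors):.1f}")
```

Note that the training window only ever grows forward in time; no fold is ever evaluated on data the model has already seen.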
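The five-fold scheme can be sketched the same way. Again, the `data` and the trivial mean predictor are hypothetical; the point is that every observation lands in the test set exactly once.

```python
# Hypothetical (feature, target) pairs (illustration only).
data = [(x, 2.0 * x + 1.0) for x in range(20)]

def k_fold_splits(n, k=5):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    fold_size = n // k
    for f in range(k):
        test = list(range(f * fold_size, (f + 1) * fold_size))
        train = [i for i in range(n) if i not in test]
        yield train, test

fold_errors = []
for train, test in k_fold_splits(len(data), k=5):
    # Placeholder "model": predict the mean target of the training folds.
    train_mean = sum(data[i][1] for i in train) / len(train)
    fold_errors.append(
        sum(abs(train_mean - data[i][1]) for i in test) / len(test))

# The cross-validated score is the average over the five folds.
print(f"cross-validated MAE: {sum(fold_errors) / len(fold_errors):.2f}")
```

This sketch uses contiguous folds for clarity; in practice, cross-sectional data is usually shuffled before folding so that each fold is a representative sample.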