DS-301g · Module 1

Building Baseline Models

Every model improvement is measured against a baseline: the simplest reasonable prediction method for the problem. For time series, the baseline is the naive forecast (next period equals this period) or the seasonal naive forecast (next period equals the same period one season earlier, for example the same month last year). For classification, the baseline is the majority class (predict the most common outcome for every observation). For regression, the baseline is the mean (predict the training average for every observation). If your sophisticated model does not meaningfully outperform the baseline on held-out data, the sophistication is not justified. In practice, 30-40% of "machine learning" models deployed in business do not significantly outperform the naive baseline. They add complexity without adding accuracy. The baseline is the test that prevents this waste.
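
The four baselines described above are small enough to write by hand. The following is a minimal sketch using only the Python standard library; the function names and the sample data are illustrative assumptions, not part of the course material.

```python
from collections import Counter
from statistics import mean


def naive_forecast(history):
    """Predict the next period as the most recent observed value."""
    return history[-1]


def seasonal_naive_forecast(history, season_length):
    """Predict the next period as the value one full season earlier."""
    if len(history) < season_length:
        raise ValueError("history must cover at least one full season")
    return history[-season_length]


def majority_class(labels):
    """Return the most common label; predict it for every observation."""
    return Counter(labels).most_common(1)[0][0]


def mean_baseline(targets):
    """Return the mean target; predict it for every observation."""
    return mean(targets)


# Hypothetical monthly sales covering two years, for illustration only.
monthly_sales = [10, 12, 15, 11, 9, 14, 18, 20, 16, 13, 12, 22,
                 11, 13, 16, 12, 10, 15, 19, 21, 17, 14, 13, 24]

print(naive_forecast(monthly_sales))               # 24 (last month)
print(seasonal_naive_forecast(monthly_sales, 12))  # 11 (same month last year)
print(majority_class(["no", "no", "yes", "no"]))   # no
print(mean_baseline([2.0, 4.0, 6.0]))              # 4.0
```

In practice the majority class and the mean would be computed on the training data and then applied unchanged to the test data, so the baseline is evaluated under the same conditions as the model.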

Do This

  • Build the naive baseline first and measure its accuracy on the same test data the model will use: this is the bar the model must clear
  • Calculate the improvement over baseline for every model iteration: if the gain is less than 5%, question whether the added complexity is worth it
  • Report model accuracy alongside baseline accuracy: the delta is the model's actual contribution
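
One way to make the reporting habit above routine is a small helper that always returns the baseline accuracy, the model accuracy, and the delta together. This is a sketch under stated assumptions: the helper names are hypothetical, the data is made up, and `min_gain=0.05` reads the 5% rule as five percentage points of accuracy, which is only one possible interpretation.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def report_against_baseline(y_true, model_pred, baseline_pred, min_gain=0.05):
    """Report model accuracy next to baseline accuracy.

    min_gain is an assumed reading of the 5% rule as five percentage
    points of accuracy; adjust it to the definition your team uses.
    """
    baseline_acc = accuracy(y_true, baseline_pred)
    model_acc = accuracy(y_true, model_pred)
    delta = model_acc - baseline_acc
    return {
        "baseline_accuracy": baseline_acc,
        "model_accuracy": model_acc,
        "delta": delta,
        "clears_bar": delta >= min_gain,
    }


# Hypothetical labels and predictions, for illustration only.
y_true = ["no"] * 8 + ["yes"] * 2
baseline_pred = ["no"] * 10               # majority-class baseline
model_pred = ["no"] * 8 + ["yes", "no"]   # hypothetical model output

report = report_against_baseline(y_true, model_pred, baseline_pred)
print(report)
```

Here the baseline scores 0.8 and the hypothetical model scores 0.9, so the model's actual contribution is about ten percentage points rather than the headline 90%.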

Avoid This

  • Skip the baseline and report the model's accuracy in isolation: 85% accuracy means little if predicting the majority class already achieves 85%
  • Deploy a complex model that only marginally outperforms the baseline: the maintenance cost can outweigh the accuracy gain
  • Assume more complexity equals more accuracy: test it against the baseline instead and let the data decide
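
For forecasts, letting the data decide can be as simple as scoring the model and the naive forecast on the same periods with the same error metric. The sketch below uses mean absolute error; the series and the model's predictions are hypothetical values chosen for illustration, and the choice of MAE is an assumption rather than something the module prescribes.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def naive_one_step_forecasts(series):
    """Forecast each value from the second onward as the value before it."""
    return series[:-1]


# Hypothetical series and model output, for illustration only.
series = [100, 104, 103, 108, 110, 109, 115]
actual = series[1:]                           # periods being forecast
naive_pred = naive_one_step_forecasts(series)
model_pred = [103, 104, 106, 110, 110, 113]   # hypothetical model forecasts

naive_mae = mean_absolute_error(actual, naive_pred)
model_mae = mean_absolute_error(actual, model_pred)

# Share of the baseline's error that the model removes (higher is better;
# zero or negative means the model is no better than the naive forecast).
error_reduction = 1 - model_mae / naive_mae

print(f"naive MAE: {naive_mae:.2f}")
print(f"model MAE: {model_mae:.2f}")
print(f"error reduction vs baseline: {error_reduction:.0%}")
```

An error reduction near zero would suggest the extra complexity is not paying for itself, whatever the model's error looks like in isolation.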