DS-301g · Module 1

Building Baseline Models

Every model improvement is measured against a baseline: the simplest reasonable prediction method for the problem. For time series, the baseline is the naive forecast (the next period equals the current period) or the seasonal naive forecast (the next period equals the same period one season earlier, for example the same month last year). For classification, it is the majority class (predict the most common outcome for every observation). For regression, it is the mean (predict the average of the target for every observation). If a sophisticated model does not meaningfully outperform the baseline, its sophistication is not justified. In practice, 30-40% of "machine learning" models deployed in business do not significantly outperform the naive baseline: they add complexity without adding accuracy. Measuring against the baseline is the test that prevents this waste.
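
The four baselines described above are each only a few lines of code. The following is a minimal sketch in plain Python, not a reference implementation: the function names, the `season_length` parameter, and the sample numbers are all illustrative assumptions rather than part of the course material.

```python
from collections import Counter
from statistics import mean


def naive_forecast(history):
    """Time series: predict that the next period equals the most recent one."""
    return history[-1]


def seasonal_naive_forecast(history, season_length):
    """Time series: predict that the next period equals the value one season earlier."""
    if len(history) < season_length:
        raise ValueError("need at least one full season of history")
    return history[-season_length]


def majority_class_baseline(train_labels):
    """Classification: the most common training label, predicted for every observation."""
    return Counter(train_labels).most_common(1)[0][0]


def mean_baseline(train_targets):
    """Regression: the training mean, predicted for every observation."""
    return mean(train_targets)


# Made-up monthly values covering exactly one year (illustrative only).
monthly_sales = [100, 120, 90, 110, 105, 125, 95, 115, 108, 130, 98, 118]

print(naive_forecast(monthly_sales))               # the latest month
print(seasonal_naive_forecast(monthly_sales, 12))  # the same month one year earlier
print(majority_class_baseline(["no", "no", "yes", "no"]))
print(mean_baseline([10.0, 20.0, 30.0]))
```

Each function deliberately ignores every feature except the target's own history or distribution, which is what makes it a fair floor for a more complex model to beat.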

Do This

Build the naive baseline first and measure its accuracy: this is the bar the model must clear
Calculate the improvement over baseline for every model iteration: if it is less than 5%, question whether the added complexity is worth it
Report model accuracy alongside baseline accuracy: the delta is the model's actual contribution
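
One way to put the steps above into practice is to compute the error of the baseline and of the model on the same held-out data and report both together. The sketch below assumes mean absolute error as the metric and reads the 5% bar as a relative reduction in error; both choices, along with the names `relative_improvement` and `MIN_IMPROVEMENT` and the sample values, are illustrative assumptions.

```python
def mean_absolute_error(actuals, predictions):
    """Average absolute difference between actual and predicted values."""
    if not actuals or len(actuals) != len(predictions):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actuals, predictions)) / len(actuals)


def relative_improvement(baseline_error, model_error):
    """Fraction of the baseline's error that the model removes (positive is better)."""
    if baseline_error == 0:
        raise ValueError("baseline error is zero; relative improvement is undefined")
    return (baseline_error - model_error) / baseline_error


# Assumption: the 5% bar is interpreted as a 5% relative reduction in error.
MIN_IMPROVEMENT = 0.05

# Made-up held-out values (illustrative only).
actuals = [102, 98, 105, 110]
baseline_preds = [100, 102, 98, 105]  # naive forecast: the previous period's actual
model_preds = [101, 99, 103, 108]     # output of some hypothetical model

baseline_mae = mean_absolute_error(actuals, baseline_preds)
model_mae = mean_absolute_error(actuals, model_preds)
gain = relative_improvement(baseline_mae, model_mae)

# Report the model next to the baseline, never in isolation.
print(f"baseline MAE: {baseline_mae:.2f}")
print(f"model MAE:    {model_mae:.2f}")
print(f"improvement:  {gain:.1%}")
if gain < MIN_IMPROVEMENT:
    print("Improvement is below the bar; question the value of the model.")
```

For accuracy-style metrics, where higher is better, the same comparison applies with the sign flipped: report the model's score, the baseline's score, and the difference between them.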

Avoid This

Skip the baseline and report the model's accuracy in isolation: 85% accuracy means little if predicting the majority class already scores 85%
Deploy a complex model that only marginally outperforms the baseline: the maintenance cost usually exceeds the accuracy gain
Assume more complexity equals more accuracy: test it against the baseline and let the data decide
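
To see why an 85% accuracy figure needs context, consider a worked example on imbalanced labels. The data here is entirely hypothetical: 100 customers of whom 85 are retained, and a made-up model that is right 85 times. For brevity the majority class is taken from the same labels being scored; in a real evaluation it would come from the training set.

```python
from collections import Counter


def accuracy(actuals, predictions):
    """Share of observations where the prediction matches the actual label."""
    return sum(a == p for a, p in zip(actuals, predictions)) / len(actuals)


# Hypothetical labels: 85 retained customers followed by 15 churned customers.
actuals = ["retain"] * 85 + ["churn"] * 15

# Majority-class baseline: predict the most common label for everyone.
majority = Counter(actuals).most_common(1)[0][0]
baseline_preds = [majority] * len(actuals)

# Hypothetical model: 80 of the 85 "retain" rows and 5 of the 15 "churn" rows correct.
model_preds = ["retain"] * 80 + ["churn"] * 5 + ["churn"] * 5 + ["retain"] * 10

baseline_acc = accuracy(actuals, baseline_preds)
model_acc = accuracy(actuals, model_preds)

print(f"baseline accuracy: {baseline_acc:.0%}")
print(f"model accuracy:    {model_acc:.0%}")
print(f"delta:             {model_acc - baseline_acc:+.0%}")
```

In this constructed case the model and the baseline score the same, so the model's contribution over the baseline is zero despite the respectable headline number.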