DS-301g · Module 1
The Model Selection Framework
The model does not choose itself. The data characteristics, the prediction horizon, the explainability requirement, and the accuracy threshold together determine which model fits. A time series with strong seasonality points to SARIMA or Prophet. A time series with multiple external drivers points to regression with lagged variables or a gradient-boosted model. A classification problem points to logistic regression when explainability matters most, and to random forest or XGBoost when accuracy matters most.
The framework has three steps. First, identify the prediction type (numeric value, category, or probability). Second, assess the data characteristics (sample size, feature count, seasonality, stationarity). Third, apply the explainability filter: does the audience need to understand why the model predicts what it predicts? The intersection of these three determines the shortlist of appropriate models.
- Identify the Prediction Type: What is the model predicting? A continuous value (revenue next quarter), a category (will this deal close?), or a probability (what is the likelihood of churn?). The type determines the model family.
- Assess Data Characteristics: How much data is available? With fewer than two hundred observations, use simple models (linear regression, ARIMA). With more than one thousand, complex models (ensemble methods, neural networks) become viable. Seasonality, trend, and stationarity each constrain the choice further.
- Apply the Explainability Filter: Who will consume the predictions? If the audience is a CFO, the model must be explainable: "revenue is driven by these five factors with these weights." If the audience is an automated system, accuracy beats explainability. Match the model to the audience.
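The three steps can be sketched as a small rule-based function. This is only an illustration of the framework, not a definitive implementation. The `Problem` fields, the `shortlist` function, and the `COMPLEX_MODELS` set are hypothetical names introduced here. The sketch also makes some assumptions the text does not state: complex models are ruled out at or below one thousand observations, the same models count as both complex and hard to explain, seasonality is checked before external drivers, and the fallback candidates for non-seasonal series and non-time-series numeric problems are illustrative.

```python
from dataclasses import dataclass
from typing import List

# Assumption: the sketch treats these models as both "complex" and hard to explain.
COMPLEX_MODELS = {"gradient-boosted model", "random forest", "XGBoost"}

# From the text: complex models become viable above roughly 1,000 observations.
# Assumption: at or below this count, the sketch keeps only simple models.
COMPLEX_VIABLE_ABOVE = 1000


@dataclass
class Problem:
    prediction_type: str            # "numeric", "category", or "probability"
    n_observations: int
    is_time_series: bool = False
    strong_seasonality: bool = False
    external_drivers: bool = False
    needs_explanation: bool = True  # True for a human audience such as a CFO


def shortlist(p: Problem) -> List[str]:
    # Step 1: the prediction type (and time-series shape) picks the model family.
    if p.prediction_type == "numeric":
        if p.is_time_series and p.strong_seasonality:
            candidates = ["SARIMA", "Prophet"]
        elif p.is_time_series and p.external_drivers:
            candidates = ["regression with lagged variables", "gradient-boosted model"]
        elif p.is_time_series:
            candidates = ["ARIMA"]
        else:
            candidates = ["linear regression", "gradient-boosted model"]
    elif p.prediction_type in ("category", "probability"):
        candidates = ["logistic regression", "random forest", "XGBoost"]
    else:
        raise ValueError(f"unknown prediction type: {p.prediction_type!r}")

    # Step 2: with too little data, drop the complex models.
    if p.n_observations <= COMPLEX_VIABLE_ABOVE:
        candidates = [m for m in candidates if m not in COMPLEX_MODELS]

    # Step 3: the explainability filter.
    if p.needs_explanation:
        candidates = [m for m in candidates if m not in COMPLEX_MODELS]
    else:
        # Accuracy beats explainability: list complex models first (stable sort).
        candidates = sorted(candidates, key=lambda m: m not in COMPLEX_MODELS)
    return candidates


# Churn classification for a CFO, with plenty of data.
print(shortlist(Problem("category", 5000)))
# ['logistic regression']

# The same problem feeding an automated system.
print(shortlist(Problem("category", 5000, needs_explanation=False)))
# ['random forest', 'XGBoost', 'logistic regression']
```

The output is a shortlist, not a final choice. The candidates that survive all three steps still need to be compared against the accuracy threshold on held-out data.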