DR-301f · Module 1

Selection Bias in Source Material

Selection bias is especially dangerous because it is invisible in the data you have. It lives in the data you do not have. A customer satisfaction survey with an 85% score tells you little if mostly satisfied customers responded. A competitive analysis based on the five largest competitors can miss the startup that later disrupts the market. A hiring pattern analysis based on LinkedIn data misses the companies that post primarily on their own career sites. Selection bias distorts conclusions not by corrupting the data you see, but by ensuring that the data you see is an unrepresentative sample of reality.
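
The survey example can be made concrete with a few lines of arithmetic. The sketch below is illustrative only: the customer counts and response rates are invented assumptions, chosen so that a population that is only 50% satisfied still produces an 85% survey score.

```python
def observed_satisfaction(n_satisfied, n_unsatisfied, p_resp_sat, p_resp_unsat):
    """Expected share of satisfied customers among those who respond."""
    resp_sat = n_satisfied * p_resp_sat        # satisfied customers who answer
    resp_unsat = n_unsatisfied * p_resp_unsat  # unsatisfied customers who answer
    return resp_sat / (resp_sat + resp_unsat)


# Hypothetical population: 500 satisfied and 500 unsatisfied customers.
true_rate = 500 / (500 + 500)

# Hypothetical response rates: satisfied customers answer far more often.
survey_rate = observed_satisfaction(500, 500, p_resp_sat=0.34, p_resp_unsat=0.06)

print(f"true satisfaction:   {true_rate:.0%}")
print(f"survey satisfaction: {survey_rate:.0%}")
```

Nothing in the 200 responses is wrong or corrupted. The gap comes entirely from who chose to answer, which is why the survey data alone cannot reveal it.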

Ask: Who Is Missing?

For every dataset, ask what population it claims to represent and whether the sample actually covers that population. A "comprehensive market survey" of 200 companies may have been drawn from a single industry conference's attendee list. Such a survey is comprehensive within that event but severely biased against companies that do not attend conferences.

Ask: What Is Not Measured?

For every metric, ask what the measurement methodology excludes. Revenue growth measured from quarterly reports excludes companies that do not file quarterly. Market share measured by web traffic excludes companies whose business operates offline. The exclusions define the blind spots.

Triangulate to Compensate

Use multiple data sources with different selection biases to approximate the true population. If Source A over-represents large companies and Source B over-represents startups, combining them partially offsets each source's bias. Perfect compensation is impossible, but knowing which gaps remain lets you calibrate your confidence honestly.
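
The coverage check and the triangulation step can be sketched in Python. This is a minimal illustration, not a prescribed method: the segment names, the counts, the 5-point tolerance, and the `coverage_gaps` helper are all hypothetical, and the sketch assumes you have independent reference shares for the population (for example, from a business registry).

```python
def shares(counts):
    """Convert raw counts per segment into fractions of the total."""
    total = sum(counts.values())
    return {segment: n / total for segment, n in counts.items()}


def coverage_gaps(sample_counts, population_shares, tolerance=0.05):
    """Segments whose sample share falls more than `tolerance` below the population share."""
    sample_shares = shares(sample_counts)
    return sorted(
        segment
        for segment, pop_share in population_shares.items()
        if pop_share - sample_shares.get(segment, 0.0) > tolerance
    )


# Hypothetical reference shares for the population the analysis claims to describe.
population = {"large": 0.10, "mid": 0.30, "startup": 0.40, "offline": 0.20}

# Hypothetical sources: A over-represents large companies, B over-represents startups.
source_a = {"large": 60, "mid": 35, "startup": 5}
source_b = {"mid": 20, "startup": 80}

# Triangulate by pooling the two sources segment by segment.
pooled = {seg: source_a.get(seg, 0) + source_b.get(seg, 0) for seg in population}

print("gaps in A:     ", coverage_gaps(source_a, population))
print("gaps in B:     ", coverage_gaps(source_b, population))
print("gaps in pooled:", coverage_gaps(pooled, population))
```

With these made-up numbers, pooling closes the gaps for large, mid-sized, and startup companies, but the offline segment stays uncovered because neither source samples it. That remaining gap is the part to state explicitly when reporting confidence.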