DR-301a · Module 1
Data Normalization
3 min read
Raw collected data is unusable for analysis. Source A reports revenue in millions, Source B in thousands. Source A uses "Acme Corp," Source B uses "Acme Corporation," Source C uses the stock ticker "ACM." Source A timestamps data in UTC, Source B in Eastern Time, Source C in epoch milliseconds. Before any analysis can happen, all data must speak the same language. Normalization is the translation layer that makes cross-source analysis possible.
Data normalization has three core operations:
- Schema mapping: converting each source's data format into your internal schema. Every source has different field names, different data types, and different nesting structures; the schema mapper translates all of them into a common structure.
- Entity resolution: recognizing that "Acme Corp," "Acme Corporation," and "ACM" all refer to the same entity. This requires a master entity registry with aliases, identifiers, and disambiguation rules.
- Temporal alignment: converting all timestamps to a single timezone and format so that data from different sources can be compared chronologically.
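The three operations can be sketched as one normalization function. This is a minimal illustration, not a production design: the field names, the `ALIASES` registry, and the assumption that revenue arrives in thousands and timestamps in epoch milliseconds are all hypothetical.

```python
from datetime import datetime, timezone

# Hypothetical master entity registry: every known alias maps to one canonical ID.
ALIASES = {"acme corp": "ACME", "acme corporation": "ACME", "acm": "ACME"}

# Hypothetical per-source schema map: raw field name -> internal field name.
FIELD_MAP = {"company": "entity", "rev_thousands": "revenue_musd", "time_ms": "ts_utc"}

def normalize(raw: dict) -> dict:
    # Schema mapping: rename raw fields into the internal schema.
    rec = {FIELD_MAP.get(k, k): v for k, v in raw.items()}
    # Entity resolution: collapse aliases to the canonical entity ID.
    rec["entity"] = ALIASES.get(rec["entity"].strip().lower(), rec["entity"])
    # Unit conversion: thousands -> millions, so all sources agree.
    rec["revenue_musd"] = rec["revenue_musd"] / 1000
    # Temporal alignment: epoch milliseconds -> timezone-aware UTC datetime.
    rec["ts_utc"] = datetime.fromtimestamp(rec["ts_utc"] / 1000, tz=timezone.utc)
    return rec

row = normalize({"company": "Acme Corporation", "rev_thousands": 4200, "time_ms": 1700000000000})
# row["entity"] == "ACME"; row["revenue_musd"] == 4.2; row["ts_utc"] is tz-aware UTC
```

In a real pipeline each source would get its own field map and unit rules, but the shape stays the same: map, resolve, convert, align.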
Do This
- Define your internal schema before building collection adapters — the schema drives the adapter design
- Maintain a master entity registry with all known aliases, ticker symbols, and identifiers
- Normalize timestamps to UTC immediately on ingestion — never store local times
- Normalize numeric values to consistent units — always millions or always thousands, never mixed
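The UTC rule above is easy to enforce at the ingestion boundary. A minimal sketch, assuming timestamps arrive as ISO-8601 strings and that each source declares its timezone (the `to_utc` helper and its parameters are illustrative, not a real library API):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # stdlib in Python 3.9+

def to_utc(ts: str, source_tz: str = "UTC") -> datetime:
    """Parse an ISO-8601 timestamp and return a timezone-aware UTC datetime.
    Naive timestamps are interpreted in the source's declared timezone."""
    dt = datetime.fromisoformat(ts)
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=ZoneInfo(source_tz))
    return dt.astimezone(timezone.utc)

utc_dt = to_utc("2024-03-15 09:30:00", source_tz="America/New_York")
# -> 2024-03-15 13:30:00+00:00 (Eastern is UTC-4 under daylight saving in March)
```

Calling this once on every record at ingestion means downstream code never has to guess what timezone a stored value was in.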
Avoid This
- Normalizing only at query time, leaving data stored in its raw source formats; this is slow and error-prone at scale
- Assuming entity names are consistent across sources; they almost never are
- Treating normalization as a one-time ETL job; source formats change, and your normalization rules must adapt with them
- Skipping deduplication; the same news story from three RSS feeds creates three records unless you deduplicate
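The deduplication point can be sketched with a simple content key. The key function here is a hypothetical choice (lowercased title plus publication date); real pipelines often use fuzzier keys such as a canonical URL or a similarity hash of the body.

```python
import hashlib

def story_key(story: dict) -> str:
    """Hypothetical dedup key: hash of the normalized title plus the publish date."""
    basis = story["title"].strip().lower() + "|" + story["published"][:10]
    return hashlib.sha256(basis.encode()).hexdigest()

stories = [
    {"title": "Acme Corp Raises Guidance", "published": "2024-05-01T08:00:00Z"},
    {"title": "Acme Corp raises guidance", "published": "2024-05-01T09:15:00Z"},  # same story, second feed
]

seen: set[str] = set()
unique = []
for s in stories:
    k = story_key(s)
    if k not in seen:       # keep only the first copy of each story
        seen.add(k)
        unique.append(s)
# unique holds one record: both feed copies collapse to the same key
```

Whatever key you pick, the important property is that it survives normalization: two feeds' copies of the same story must hash to the same value after schema mapping and temporal alignment.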