SA-301e · Module 1

Data Quality Gates

3 min read

Bad data in a pipeline propagates to every downstream consumer. The dashboard shows wrong numbers. The ML model trains on corrupted inputs. The executive makes a decision based on a metric that was computed from null values that should have been rejected. Data quality gates are validation steps embedded in the pipeline that catch data problems before they propagate — the circuit breakers of data architecture.

  1. Schema Validation: Validate that incoming data matches the expected schema: correct types, required fields present, values within expected ranges. Schema validation catches structural problems at the extraction boundary, before transformation logic encounters unexpected formats. A schema validation failure halts the pipeline and alerts the team; it is cheaper to investigate a paused pipeline than to remediate corrupted downstream data.
  2. Volume Anomaly Detection: Monitor the row count at each pipeline stage. A source that normally produces 50,000 records and suddenly produces 500 indicates a source failure; one that suddenly produces 500,000 indicates duplication or a changed extraction scope. Volume anomalies are often the earliest signal that something is wrong with the data.
  3. Cross-Source Reconciliation: When data from multiple sources is combined, reconcile the totals. The sum of orders from the order pipeline should match the sum of orders from the financial pipeline. Discrepancies indicate data loss, duplication, or timing differences between extraction windows. Reconciliation is the data quality gate that catches problems no single-source validation can detect.
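A schema gate like the one in step 1 can be sketched in a few lines. This is a minimal illustration, not a production validator: the `SCHEMA` dict, field names, and `SchemaValidationError` are hypothetical, standing in for whatever schema tooling a real pipeline would use.

```python
# Illustrative schema: required fields, expected types, and value ranges.
SCHEMA = {
    "order_id": {"type": str, "required": True},
    "amount":   {"type": float, "required": True, "min": 0.0},
    "quantity": {"type": int, "required": True, "min": 1, "max": 10_000},
}

class SchemaValidationError(Exception):
    """Raised to halt the pipeline at the extraction boundary."""

def validate_record(record: dict) -> None:
    """Check one record against SCHEMA; raise on the first problem found."""
    for field, rules in SCHEMA.items():
        if field not in record:
            if rules.get("required"):
                raise SchemaValidationError(f"missing required field: {field}")
            continue
        value = record[field]
        if not isinstance(value, rules["type"]):
            raise SchemaValidationError(
                f"{field}: expected {rules['type'].__name__}, "
                f"got {type(value).__name__}"
            )
        if "min" in rules and value < rules["min"]:
            raise SchemaValidationError(f"{field}: {value} below min {rules['min']}")
        if "max" in rules and value > rules["max"]:
            raise SchemaValidationError(f"{field}: {value} above max {rules['max']}")
```

Raising an exception rather than logging and continuing is deliberate: it is the "halt and alert" behavior described above.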
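The volume check in step 2 reduces to comparing each stage's row count against a baseline. A sketch, assuming a baseline such as a trailing average is already available; the function name and ratio thresholds are illustrative, and real pipelines often use statistical bounds instead of fixed ratios.

```python
def check_volume(stage: str, actual: int, baseline: int,
                 low_ratio: float = 0.5, high_ratio: float = 2.0) -> bool:
    """Return True if the stage's row count falls inside the expected band.

    A count far below baseline suggests a source failure; far above suggests
    duplication or a changed extraction scope.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    ratio = actual / baseline
    if ratio < low_ratio:
        print(f"[{stage}] volume collapse: {actual} vs baseline {baseline}")
        return False
    if ratio > high_ratio:
        print(f"[{stage}] volume spike: {actual} vs baseline {baseline}")
        return False
    return True
```

With the article's numbers: 50,000 against a 50,000 baseline passes, while 500 or 500,000 both trip the gate.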
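Step 3's reconciliation is, at its core, a tolerance-bounded comparison of two independently computed totals. A minimal sketch, assuming the two totals have already been aggregated upstream; the 1% relative tolerance is an illustrative choice, not a recommendation.

```python
def reconcile(order_total: float, finance_total: float,
              tolerance: float = 0.01) -> bool:
    """Compare totals from two independent sources.

    A small relative tolerance absorbs rounding; a larger drift signals
    data loss, duplication, or misaligned extraction windows.
    """
    denom = max(abs(order_total), abs(finance_total), 1e-9)
    drift = abs(order_total - finance_total) / denom
    return drift <= tolerance
```

In practice the timing caveat from the text matters most: the two sources must be reconciled over the same extraction window, or the gate will fire on ordinary lag rather than real data loss.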