SA-301e · Module 1
Pipeline Orchestration
A data pipeline is a directed acyclic graph of tasks — extraction, transformation, validation, and loading steps with dependencies between them. Orchestration is the system that manages execution order, handles failures, retries individual steps, and provides visibility into pipeline health. Without orchestration, the pipeline is a collection of scripts run by cron jobs with failure notification via "someone noticed the dashboard is stale."
- **DAG Design.** Model the pipeline as a DAG where each node is an idempotent task and each edge is a dependency. "Task B depends on Task A" means Task B does not run until Task A completes successfully. The DAG makes dependencies explicit and prevents the circular dependencies that create pipeline deadlocks. Visualize the DAG: the diagram is the operational documentation.
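A minimal sketch of this model using Python's standard-library `graphlib`: the four task names are hypothetical, and each entry maps a task to the set of tasks it depends on. A topological sort produces a valid execution order, and a circular dependency is rejected up front instead of deadlocking at runtime.

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical extract -> transform -> validate -> load pipeline;
# each key maps to the set of tasks it depends on (its predecessors).
dag = {
    "extract": set(),
    "transform": {"extract"},
    "validate": {"transform"},
    "load": {"validate"},
}

# static_order() yields an execution order in which every task
# appears after all of its dependencies.
order = list(TopologicalSorter(dag).static_order())
print(order)  # ['extract', 'transform', 'validate', 'load']

# A circular dependency raises CycleError instead of hanging the pipeline.
try:
    list(TopologicalSorter({"a": {"b"}, "b": {"a"}}).static_order())
except CycleError:
    print("cycle detected")
```

Real orchestrators (Airflow, Dagster, Prefect) wrap this same idea in scheduling and state tracking, but the underlying structure is exactly this graph.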
- **Failure Handling.** Every task must define three behaviors: a retry policy (how many times and with what backoff), a failure action (skip downstream tasks, fail the entire pipeline, or alert and continue), and a recovery mechanism (rerun from the failed task, not from the beginning). A pipeline that restarts from the beginning on any failure wastes hours of successful processing.
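The retry-with-backoff and resume-from-failed-task behaviors can be sketched together: the runner below (names and parameters are illustrative, not from any particular orchestrator) retries each task with exponential backoff and records completed tasks in a checkpoint set, so a rerun skips work that already succeeded.

```python
import time

def run_pipeline(tasks, completed, retries=2, backoff_s=0.01):
    """Run (name, callable) pairs in order, skipping tasks already in
    `completed`, retrying failures with exponential backoff. Passing the
    same `completed` set to a rerun resumes from the failed task rather
    than restarting the whole pipeline."""
    for name, fn in tasks:
        if name in completed:
            continue  # succeeded in a prior run; don't redo the work
        for attempt in range(retries + 1):
            try:
                fn()
                completed.add(name)  # checkpoint the success
                break
            except Exception:
                if attempt == retries:
                    # Retries exhausted: surface the failure so the caller
                    # can apply its failure action (skip, fail, or alert).
                    raise RuntimeError(f"task {name!r} failed after {retries + 1} attempts")
                time.sleep(backoff_s * (2 ** attempt))  # exponential backoff
```

In a real system the checkpoint set would be durable state (a database row per task run), but the control flow is the same.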
- **Observability.** Track three metrics per pipeline: execution duration (is the pipeline slowing down?), task success rate (which tasks fail most frequently?), and data freshness (how old is the data in the target?). Alert on duration anomalies: a pipeline that normally runs in 20 minutes and takes 90 is signaling a problem even if it eventually succeeds.
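The duration-anomaly check can be as simple as comparing the current run against the median of recent runs. This is a minimal sketch; the function name and the 3x threshold are illustrative choices, and production systems typically use richer baselines (percentiles, seasonality).

```python
from statistics import median

def duration_alert(history_minutes, current_minutes, factor=3.0):
    """Flag a run whose duration exceeds `factor` times the median of
    recent run durations. The 3x factor is an assumed threshold."""
    if not history_minutes:
        return False  # no baseline yet, nothing to compare against
    return current_minutes > factor * median(history_minutes)

# A pipeline that normally runs in ~20 minutes suddenly takes 90:
print(duration_alert([19, 20, 21, 20, 22], 90))  # True
```

The same pattern applies to data freshness: compare the age of the newest record in the target against an expected maximum and alert on the gap.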