SA-301e · Module 3
Data Lake Architecture
A data lake stores raw data in its native format at any scale. The promise: flexibility to analyze data in ways that were not anticipated when it was collected. The reality: without architecture, the data lake becomes a data swamp — petabytes of data that nobody can find, understand, or trust. The architecture that prevents the swamp is not the storage technology. It is the organization, cataloging, and governance that make the data discoverable and reliable.
- Zone Architecture Organize the lake into zones: raw (data as ingested, immutable), cleaned (validated, deduplicated, schema-enforced), curated (transformed, business-logic-applied, consumption-ready). Each zone has different access patterns, quality guarantees, and governance requirements. Data progresses through zones as it is validated and transformed. Consumers access the curated zone. Data engineers access the cleaned zone. Only pipelines write to the raw zone.
- Partitioning Strategy Partition data by access pattern: date partitions for time-series queries, region partitions for geographic analysis, entity partitions for entity-specific lookups. Partitioning determines query performance — a query that scans one partition instead of the entire dataset is orders of magnitude faster. Choose the partition key based on the most common query pattern, not the most convenient ingestion format.
- Format Selection Columnar formats (Parquet, ORC) for analytical queries — they read only the columns a query needs. Row-oriented formats (Avro, JSON Lines) for streaming ingestion — they append records efficiently. Table formats such as Delta Lake, Apache Iceberg, and Apache Hudi add transactional capabilities on top of the lake: ACID transactions, schema evolution, and time travel. The format decision affects query performance, storage cost, and the operational model for data updates.
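The zone progression can be sketched in a few lines. This is a minimal, illustrative example — the `s3://lake/{zone}/{dataset}` layout, the field names, and the validation rules are assumptions, not a prescribed convention:

```python
def zone_path(zone: str, dataset: str) -> str:
    """Build the object-store prefix for a dataset in a given zone.
    (Bucket name and layout are illustrative.)"""
    assert zone in {"raw", "cleaned", "curated"}
    return f"s3://lake/{zone}/{dataset}"

def promote_to_cleaned(raw_records: list[dict]) -> list[dict]:
    """Validate and deduplicate raw records before they enter the
    cleaned zone; the raw input is never mutated (raw is immutable)."""
    seen = set()
    cleaned = []
    for rec in raw_records:
        # Schema enforcement: required fields must be present and non-null.
        if rec.get("id") is None or rec.get("ts") is None:
            continue           # a real pipeline would quarantine these
        if rec["id"] in seen:  # deduplicate on the business key
            continue
        seen.add(rec["id"])
        cleaned.append(rec)
    return cleaned

raw = [
    {"id": 1, "ts": "2024-05-01T00:00:00Z", "amount": 10.0},
    {"id": 1, "ts": "2024-05-01T00:00:00Z", "amount": 10.0},  # duplicate
    {"id": 2, "ts": None, "amount": 5.0},                     # missing ts
]
print(zone_path("raw", "orders"))    # s3://lake/raw/orders
print(len(promote_to_cleaned(raw)))  # 1
```

Note that promotion produces a new dataset in the next zone rather than modifying the raw copy — that immutability is what lets you replay the pipeline when validation logic changes.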
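Partition pruning is easiest to see with Hive-style date partitions. In this sketch, paths like `s3://lake/curated/events/dt=2024-05-01/` are an assumed layout; the point is that a date-range query enumerates only the matching prefixes instead of scanning the whole dataset:

```python
from datetime import date, timedelta

def partition_path(dataset: str, dt: date) -> str:
    """Hive-style partition prefix for one day of data (illustrative layout)."""
    return f"s3://lake/curated/{dataset}/dt={dt.isoformat()}"

def partitions_for_range(dataset: str, start: date, end: date) -> list[str]:
    """Enumerate only the partitions a date-range query must read;
    everything outside [start, end] is pruned without being touched."""
    days = (end - start).days + 1
    return [partition_path(dataset, start + timedelta(d)) for d in range(days)]

paths = partitions_for_range("events", date(2024, 5, 1), date(2024, 5, 3))
print(len(paths))  # 3 partitions scanned, however large the full history is
```

Query engines such as Spark, Trino, and Athena do this pruning automatically when the filter column matches the partition key — which is exactly why the key should follow the dominant query pattern.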