Part VII · The Data Plane
Chapter 90 ~23 min read

The lakehouse story: Parquet, Iceberg, Delta, Hudi

"The data lake was cheap and wrong. The data warehouse was right and expensive. The lakehouse is cheap, right, and slightly confusing."

For most of the 2010s, data stacks split cleanly into two layers: a data lake on object storage for cheap, raw, append-everything storage, and a data warehouse (Redshift, BigQuery, Snowflake) for expensive, structured, query-fast analytics. The split made sense because each system was good at what it did and bad at what the other did. The lake had no schema, no ACID guarantees, no indexing, no transactions, and no ability to do atomic operations — but it stored petabytes for pennies. The warehouse had strong schemas, fast queries, and ACID semantics — but you paid dollars per GB per month and moving data in was an ETL problem every team hated.

Then three open table formats landed within a few years of each other — Iceberg (Netflix, 2017), Delta Lake (Databricks, 2019), Hudi (Uber, 2017) — and all three solved the same problem: give the data lake ACID, schema evolution, time travel, and efficient updates without moving the data out of object storage. Combined with Parquet as the universal columnar format underneath, the result is the lakehouse: a system that looks like a warehouse at the query layer and a lake at the storage layer. You get the warehouse’s correctness with the lake’s cost profile. This chapter lays out how that trick works, what each format does differently, and why ML systems care.

Outline:

  1. Lake and warehouse: the old split.
  2. Parquet: the columnar foundation.
  3. The table-format problem.
  4. Apache Iceberg.
  5. Delta Lake.
  6. Apache Hudi.
  7. ACID on object storage.
  8. Time travel and schema evolution.
  9. Why lakehouses matter for ML.
  10. Picking one.
  11. The mental model.

90.1 Lake and warehouse: the old split

A data lake, classically, is a bunch of files in object storage. You dump CSV, JSON, Avro, ORC, Parquet, whatever, organized by date prefixes under a bucket. Readers discover files by listing the bucket. Writers append new files. Schemas are implicit — whatever the files happen to contain. There is no central authority, no transactions, no atomic commits. If two writers append to the same partition at the same time, either both writes succeed separately, or one silently overwrites the other, or you get a half-written file sitting in the prefix because the writer crashed mid-upload. Readers have to deal with all of this.

A data warehouse is a column-oriented database. Data is loaded via ETL, stored in the warehouse’s proprietary format, indexed and partitioned by the warehouse’s optimizer, queried via SQL. Snowflake, BigQuery, and Redshift are the three big commercial options. They are fast, consistent, and expensive. The cost model is roughly $20-40 per TB of storage per month plus compute for queries. For petabyte workloads, warehouses get expensive fast, and data ingestion is a chore (ETL jobs, schema migrations, historical backfills).

The traditional stack had both. The lake held everything raw and cheap. ETL jobs moved a curated subset into the warehouse. The warehouse served analytics. Data scientists used both: lake for training data and exploration, warehouse for dashboards and reporting. The split was operationally painful (everything existed twice, with reconciliation issues) and expensive (you paid warehouse rates for the curated subset).

The lakehouse vision is to collapse the two: keep everything on object storage, use Parquet as the universal format, and add a table format on top that gives you the warehouse-grade features. A query engine (Spark, Trino, DuckDB, Dremio, Databricks SQL, Snowflake’s Iceberg support, etc.) reads the Parquet files directly and respects the table format’s metadata. The data lives once, in one place, queried many ways.

[Figure: lakehouse layered architecture. Object storage (S3 / GCS / ADLS) at the base holds cheap, durable, columnar Parquet files at ~$0.023/GB/mo. A table format (Iceberg / Delta / Hudi) adds ACID commits, time travel, schema evolution, and partition metadata. A query engine (Spark, Trino, DuckDB, Databricks SQL, Snowflake) reads the Parquet directly, respecting the table-format metadata for pruning and ACID. Applications (ML training, feature pipelines, BI dashboards, data science notebooks) sit on top.]
The lakehouse collapses the old lake–warehouse split into three layers: cheap Parquet files on object storage, a table format that adds ACID semantics on top, and a query engine that reads both — one copy of data, warehouse-grade correctness.

90.2 Parquet: the columnar foundation

Apache Parquet is a columnar storage format for nested, hierarchical data, originally from Twitter and Cloudera (2013) and now the de facto open standard for analytical data on object storage. Every modern table format uses Parquet (or occasionally ORC, which is conceptually similar) as its underlying file format.

The key ideas:

Columnar layout. Within a file, values are stored column by column, not row by row. A query that reads three columns out of a hundred only reads those three columns, not the entire file. For analytical queries (sums, averages, filters over wide tables), this is a 10-50× IO reduction compared to row-oriented formats.

Compression per column. Each column has its own compression codec. String columns compress well with dictionary + Snappy/ZSTD. Integer columns use bit-packing and RLE. Timestamps use delta encoding. The compression is chosen per column and per file, so heterogeneous data compresses efficiently.

Row groups. A Parquet file is divided into row groups (typically 128 MB to 1 GB each). Within a row group, each column is a column chunk. Column chunks contain pages (typically 1 MB each). The row group is the unit of parallelism for readers: a query can skip entire row groups that don’t contain relevant data.

Statistics and min/max. Each column chunk stores the min and max values. A query with a predicate (WHERE ts > '2026-04-01') can skip row groups whose max ts is no greater than '2026-04-01'. This is predicate pushdown, and it’s the mechanism that makes Parquet fast for filtering.

Dictionary encoding. For columns with low cardinality (categorical fields, enum-like strings), Parquet stores a dictionary of distinct values and then stores integer references. String columns with thousands of rows but only a dozen distinct values compress to almost nothing.

Nested types. Parquet supports structs, lists, and maps via a Dremel-style encoding (repetition and definition levels). You can store JSON-like data natively without stringifying it.

The upshot: a well-written Parquet file compresses better than CSV by 10-20×, reads 5-50× faster for analytical queries, and is supported by every major query engine. It is the column format of the open-source data world.
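The min/max pruning described above can be sketched in a few lines. This is a toy model, not real Parquet machinery — the RowGroup shape and string timestamps are illustrative stand-ins for per-chunk statistics:

```python
# Sketch of min/max predicate pruning, the mechanism behind Parquet's
# statistics. The RowGroup class and its fields are illustrative, not
# part of any real Parquet API.
from dataclasses import dataclass

@dataclass
class RowGroup:
    min_ts: str
    max_ts: str

def prune(row_groups, predicate_min):
    """Keep only row groups whose max value could satisfy ts > predicate_min."""
    return [i for i, rg in enumerate(row_groups) if rg.max_ts > predicate_min]

groups = [
    RowGroup("2026-01-01", "2026-02-01"),
    RowGroup("2026-02-01", "2026-03-01"),
    RowGroup("2026-03-01", "2026-05-01"),
]
print(prune(groups, "2026-04-01"))  # only the last group can match -> [2]
```

A real reader applies exactly this logic using the stats in the file footer, then issues ranged GETs for the surviving row groups only.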

[Figure: Parquet columnar layout. Within row group 0 (128 MB), column chunks are stored contiguously: col 0 (ts, delta encoded) and col 3 (latency, bit-packed) are read via ranged GETs, while cols 1-2 and 4-9 are skipped. Each column chunk carries min/max/null_count stats that let the reader skip entire row groups. Reading 2 of 10 columns saves roughly 80% of the IO.]
Parquet's columnar layout means a query that reads 2 of 10 columns issues targeted ranged GETs for only those column chunks — a 5-10× IO reduction compared to a row-oriented format that would read every column.

The rough file layout a reader sees:

File magic: PAR1
Row group 0:
  Column 0 data + stats
  Column 1 data + stats
  ...
Row group 1:
  ...
File footer:
  Schema
  Per-row-group metadata with stats
  Key-value metadata
File magic: PAR1

Readers open the file, seek to the footer to read metadata, identify which row groups and columns they need, and issue ranged GETs for exactly those byte ranges. On S3, a well-optimized Parquet reader (e.g., DuckDB, Trino) needs only a dozen or so GETs to answer a selective query over a multi-GB file. This is why lakehouse queries on S3 can be fast.

90.3 The table-format problem

Parquet gives you efficient file-level storage, but a table is more than a single file. A real table is a set of files, potentially millions of them, possibly partitioned by date or tenant or other dimensions, with writes arriving continuously and readers expecting consistent snapshots. The problems you run into without a table format:

  • Concurrent writes collide. Two writers appending to the same partition don’t see each other. A reader might see one, both, or the half-baked state in between.
  • Schema evolution is unsafe. If one file has a column that another doesn’t, readers have to reconcile. Adding a column is easy; renaming, dropping, or changing a type is a nightmare.
  • Deletes require rewrites. Parquet files are immutable. To delete a row, you have to rewrite the entire file without it. Doing this without a coordination layer leaves the old file in place, and readers see both versions.
  • No atomic commits. Writing a new version of a file is not atomic relative to the pointer that tells readers which file to use. Readers might see torn writes.
  • No snapshot isolation. A long-running query that reads files from a table may see files that were added or removed mid-query. The result is inconsistent.
  • No efficient metadata. Listing millions of files in a prefix is expensive on S3. For tables with tens of millions of Parquet files, a plain listing is too slow.

A table format is a layer on top of Parquet that solves these. It provides a metadata file (or a log of metadata files) that tells you the current set of Parquet files in the table, their schemas, their partitioning, and the history of changes. Writes produce new Parquet files and a new metadata commit. Readers read the metadata once, then know exactly which files to read. Because the metadata commit is a single atomic file swap (using conditional PUT on S3 or a catalog-backed lock), concurrent writes serialize through it.
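The atomic-swap trick at the end of that paragraph can be sketched as a compare-and-swap on the metadata pointer. The Catalog class here is hypothetical; real implementations back this primitive with a transactional database, DynamoDB, or an S3 conditional PUT:

```python
# Sketch of the atomic-pointer-swap commit protocol. Data files are
# written first (additive, invisible); the commit is a single
# compare-and-swap of the current-metadata pointer.
import threading

class Catalog:
    def __init__(self, current):
        self.current = current
        self._lock = threading.Lock()

    def swap(self, expected, new):
        """Atomically move the table pointer from `expected` to `new`.
        Returns False (writer must retry) if another commit won the race."""
        with self._lock:
            if self.current != expected:
                return False
            self.current = new
            return True

catalog = Catalog("metadata-v1.json")
# Writer A commits first; writer B raced against the same base version,
# so its commit fails and it must re-read metadata, rebase, and retry.
assert catalog.swap("metadata-v1.json", "metadata-v2.json") is True
assert catalog.swap("metadata-v1.json", "metadata-v2b.json") is False
assert catalog.current == "metadata-v2.json"
```

Everything before the swap is invisible to readers, which is why crashes mid-write leave garbage files but never a corrupt table.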

Three table formats dominate: Iceberg, Delta Lake, and Hudi. All three do roughly the same thing with different design choices.

90.4 Apache Iceberg

Iceberg, originated at Netflix, is the most architecturally clean of the three. The design goal was to replace Hive-style partitioned tables (which rely on filesystem listings and have all the problems above) with a proper metadata format.

The Iceberg metadata hierarchy:

  • Table metadata (metadata.json): the current state of the table — schema, partition spec, current snapshot id, list of historical snapshots.
  • Snapshot: a point-in-time view of the table. Each snapshot points to a manifest list.
  • Manifest list: a list of manifests, each covering some subset of the table’s files. Manifest lists are usually Avro files.
  • Manifest: a file listing the Parquet data files in the table, along with file-level statistics (min/max, null counts, row counts). Also Avro.
  • Data files: the actual Parquet files.

When a reader wants to query the table, it:

  1. Reads metadata.json to find the current snapshot.
  2. Reads the snapshot’s manifest list to find the relevant manifests.
  3. Reads each manifest to get the list of data files and their stats.
  4. Filters data files based on query predicates (using the stats).
  5. Reads the surviving Parquet files.

[Figure: Iceberg metadata hierarchy. The catalog holds the atomic pointer to metadata.json (schema + snapshots). A snapshot is a point-in-time view pointing to a manifest list (a small Avro file). Each manifest carries a file list plus stats. The leaves are the Parquet data files on S3.]
Iceberg's metadata hierarchy lets the reader prune using manifest stats before touching a single data file — for a table with 10 million Parquet files, a predicate query may read fewer than 100 of them.

This sounds like a lot of round trips, but manifests are small (typically a few hundred KB each) and the stats let the reader prune aggressively before touching any data files. For a table with 10 million Parquet files, a query can easily prune down to a few dozen files without reading any of them.
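The read path can be sketched end to end. The dictionaries below stand in for the real metadata.json, Avro manifest lists, and manifests; only the pruning logic matters here:

```python
# Toy walk of the Iceberg read path: metadata -> snapshot -> manifest
# list -> manifests -> pruned data files. All structures are simplified
# stand-ins for the real JSON/Avro formats.
metadata = {
    "current_snapshot": {
        "manifest_list": [
            {"files": [
                {"path": "s3://t/a.parquet", "min_ts": "2026-01-01", "max_ts": "2026-02-01"},
                {"path": "s3://t/b.parquet", "min_ts": "2026-02-01", "max_ts": "2026-03-01"},
            ]},
            {"files": [
                {"path": "s3://t/c.parquet", "min_ts": "2026-04-02", "max_ts": "2026-05-01"},
            ]},
        ]
    }
}

def plan_scan(metadata, ts_lower):
    """Return the data files a query with `ts > ts_lower` must read."""
    files = []
    for manifest in metadata["current_snapshot"]["manifest_list"]:
        for f in manifest["files"]:
            if f["max_ts"] > ts_lower:  # file-level stats prune the rest
                files.append(f["path"])
    return files

print(plan_scan(metadata, "2026-04-01"))  # ['s3://t/c.parquet']
```

The pruning happens entirely in metadata: the reader decided to skip two of three data files without issuing a single GET against them.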

Writes produce new data files, new manifests (reusing manifests from the previous snapshot if they’re unchanged), and a new snapshot. The commit happens by writing a new metadata.json and atomically swapping the pointer. The swap uses a catalog — Iceberg tables are always associated with a catalog (Hive Metastore, AWS Glue, Nessie, JDBC, REST) that maintains the pointer to the current metadata file. The catalog provides the atomicity primitive: a conditional update of “the current metadata is X” to “the current metadata is Y, only if it’s still X.”

Iceberg’s killer features:

  • Hidden partitioning. Partitioning is defined in metadata, not in file paths. You can change the partition scheme without rewriting any data. Readers translate queries to partition predicates automatically.
  • Efficient schema evolution. Adding, dropping, renaming columns is a metadata-only change. No data rewrite.
  • Time travel. Every historical snapshot is accessible. You can query “the state of the table at snapshot N” or “at timestamp T.” Useful for debugging, reproducibility, and rollback.
  • Row-level deletes (v2 spec). Deletes can be recorded as “delete files” that mark specific rows or positions as removed, without rewriting the data files.
  • Format-independence. The spec doesn’t mandate Parquet; ORC and Avro data files work too.

Iceberg is the format most likely to win the open lakehouse war in 2026. It has the broadest engine support (Spark, Trino, Flink, Snowflake, DuckDB, Dremio, ClickHouse, Athena), the cleanest design, and the most active community outside a single company.

90.5 Delta Lake

Delta Lake, from Databricks, is the most widely deployed table format in the Databricks ecosystem. It is simpler in design than Iceberg and in some ways more pragmatic.

The Delta metadata hierarchy is a single transaction log stored in the _delta_log/ directory inside the table’s root:

my_table/
  _delta_log/
    00000000000000000000.json
    00000000000000000001.json
    00000000000000000002.json
    ...
  part-00000-....snappy.parquet
  part-00001-....snappy.parquet
  ...

Each numbered JSON file is a transaction — a list of actions, such as add (a Parquet file was added), remove (a file was removed), metadata (schema changed), protocol (format version changed). Readers reconstruct the current state by replaying all transactions in order, starting from the last checkpoint (a periodic snapshot of the full state, written every 10 transactions by default).
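The replay logic can be sketched directly. The action shapes here are simplified stand-ins for the real Delta JSON actions:

```python
# Sketch of Delta-style log replay: the reader folds add/remove actions,
# in commit order, into the current set of live files.
def replay(checkpoint_files, transactions):
    """Start from the checkpoint's file set, then apply later transactions."""
    live = set(checkpoint_files)
    for txn in transactions:            # each txn is a list of actions
        for action in txn:
            if action["op"] == "add":
                live.add(action["path"])
            elif action["op"] == "remove":
                live.discard(action["path"])
    return live

# The checkpoint captured full state up to some commit; only later
# transactions are replayed, which is why checkpoints bound read cost.
state = replay(
    checkpoint_files={"part-0.parquet", "part-1.parquet"},
    transactions=[
        [{"op": "add", "path": "part-2.parquet"}],
        [{"op": "remove", "path": "part-0.parquet"},
         {"op": "add", "path": "part-0-rewritten.parquet"}],
    ],
)
print(sorted(state))
```

Without checkpoints, a reader of a 100,000-commit table would fold 100,000 JSON files; with them, it folds one checkpoint plus at most a handful of recent transactions.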

Writes work by:

  1. Writing new Parquet files to the table directory.
  2. Writing a new transaction file with a sequence number one higher than the current max.
  3. The “commit” is the successful creation of that new transaction file. On S3, this uses a conditional PUT or a DynamoDB-based coordination to prevent two writers from creating the same sequence number.

Delta is simpler than Iceberg — there’s no manifest hierarchy, just the transaction log — which makes it easier to reason about but less efficient for very large tables. For tables with tens of millions of files, replaying the log (even with checkpoints) can become slow, whereas Iceberg’s manifest structure scales better.

Delta’s killer features:

  • ACID transactions across multiple file operations.
  • Time travel via transaction replay.
  • MERGE INTO for efficient upserts — the canonical SCD Type 2 and change-data-capture sink pattern.
  • VACUUM for garbage-collecting old files after time travel window expires.
  • Z-ordering for multi-dimensional clustering of data (improves predicate pushdown on multiple columns at once).
  • Column mapping (in recent versions) for true rename support.

Delta’s main tie-in is Databricks. Most non-Databricks engines can now read Delta tables — Delta OSS is Apache 2.0 licensed — and delta-rs, a Rust implementation, provides non-Spark readers and writers. But the deepest integration is still in Databricks products.

90.6 Apache Hudi

Hudi (Hadoop Upserts Deletes and Incrementals), from Uber, was the first of the three and was designed specifically for the upsert-heavy workloads Uber had: ride data streaming in, updates to existing rides, deletes for GDPR. Hudi’s design is the most different from the other two.

Hudi supports two table types:

  • Copy-on-write (CoW): updates rewrite the Parquet files containing the changed rows. Reads are fast (always reading the latest Parquet files), writes are slower. Similar in principle to Iceberg and Delta.
  • Merge-on-read (MoR): updates are written as small delta files (often Avro). Reads merge the delta files with the base Parquet files on the fly. Writes are fast, reads are slower. A background compaction process periodically merges deltas into the base.

MoR tables are Hudi’s differentiator. For workloads with high update rates, the compaction tradeoff is worth it: your writers aren’t blocked on full-file rewrites. For analytical workloads that are mostly read, CoW is usually better.
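Merge-on-read can be sketched as a key-based overlay at read time. The record shapes below are illustrative, not Hudi’s actual file formats:

```python
# Sketch of merge-on-read: base Parquet rows are merged with later delta
# records by record key at query time. Deletes are tombstones in the delta.
def read_mor(base_rows, delta_rows):
    """Overlay delta records (upserts and deletes) on the base file by key."""
    merged = {r["key"]: r for r in base_rows}
    for r in delta_rows:                # later deltas win
        if r.get("deleted"):
            merged.pop(r["key"], None)
        else:
            merged[r["key"]] = r
    return sorted(merged.values(), key=lambda r: r["key"])

base = [{"key": 1, "fare": 10}, {"key": 2, "fare": 20}]
deltas = [{"key": 2, "fare": 25}, {"key": 3, "fare": 30},
          {"key": 1, "deleted": True}]
print(read_mor(base, deltas))
# key 1 deleted, key 2 updated to 25, key 3 inserted
```

Compaction is just this merge run in the background: it materializes the overlay into new base files so future reads skip the delta step.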

Hudi also has native incremental query support — readers can ask for “everything that changed since timestamp T” and get just the delta, which is how downstream systems pull changes efficiently.

Hudi is less widely adopted outside specific use cases, and the format’s complexity (two table types, many tunables) makes it harder to pick up. It remains a strong option for update-heavy pipelines where Delta and Iceberg struggle.

90.7 ACID on object storage

The fundamental trick of all three formats: how do you get atomic commits on S3, which has no multi-object transactions?

The answer is that you don’t need multi-object transactions. You need one atomic operation: swapping the pointer to the current state. Everything else is additive — new data files are written before the commit, and they become visible only when the pointer swap succeeds.

The pointer swap is implemented differently:

  • Iceberg uses a catalog (Hive Metastore, Glue, Nessie, REST) as the source of truth for “the current metadata file.” The catalog provides a conditional-update API: “set the current metadata from X to Y if it’s still X.” Most catalogs implement this with a transactional database.
  • Delta uses a filesystem-level commit — the new transaction file’s sequence number is the commit point, and conflicts are resolved with conditional writes. On S3, Delta uses a DynamoDB coordination layer to prevent two writers from creating the same sequence number (newer S3 conditional writes can replace this).
  • Hudi uses a timeline of instant files and a locking scheme (optimistic concurrency control).

In all three, the commit is serializable — two concurrent writers either serialize through the catalog or detect a conflict and retry. Readers see a consistent snapshot as of the moment they read the current metadata. Isolation is snapshot isolation, the standard ACID guarantee for OLAP systems.
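Snapshot isolation falls out of the pointer design: a reader resolves the pointer exactly once at query start and never sees later commits. A toy sketch, with dictionaries standing in for metadata files:

```python
# Sketch of snapshot isolation on a lakehouse table: the reader pins the
# metadata pointer at query start, so a concurrent commit cannot change
# what the in-flight query sees. Structures are illustrative.
snapshots = {
    "v1": ["a.parquet", "b.parquet"],
    "v2": ["a.parquet", "c.parquet"],   # b rewritten as c by a later commit
}
table = {"current": "v1"}

def begin_read(table):
    """Pin the snapshot at query start; the pointer is read exactly once."""
    return table["current"]

pinned = begin_read(table)
table["current"] = "v2"                 # a writer commits mid-query
assert snapshots[pinned] == ["a.parquet", "b.parquet"]  # reader unaffected
assert snapshots[table["current"]] == ["a.parquet", "c.parquet"]
```

Old data files must therefore stay on S3 until no retained snapshot references them — which is exactly what retention policies and VACUUM-style garbage collection manage.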

The 2024 S3 conditional write feature makes the DynamoDB dance unnecessary for Delta on S3, simplifying the architecture.

90.8 Time travel and schema evolution

Time travel means querying the state of the table as of a past snapshot or timestamp. All three formats support it: each commit creates a new snapshot, and old snapshots remain accessible until they’re expired by a retention policy.

-- Iceberg
SELECT * FROM my_table VERSION AS OF 1234567890;
SELECT * FROM my_table TIMESTAMP AS OF '2026-04-01 00:00:00';

-- Delta
SELECT * FROM my_table VERSION AS OF 42;
SELECT * FROM my_table TIMESTAMP AS OF '2026-04-01';

Time travel costs little extra storage if the change rate is low — old snapshots reference the same Parquet files. It costs a lot if the change rate is high and a retention policy isn’t in place, because every rewritten file stays in the table directory until garbage-collected.
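The storage math can be made concrete: physical storage is the union of files referenced by any retained snapshot, so append-mostly tables pay almost nothing for history while high-churn tables pay for every rewrite. A toy illustration with made-up file names:

```python
# Sketch of time-travel storage cost: snapshots share unchanged files,
# and the bytes actually kept are the union across retained snapshots.
def storage_needed(snapshots):
    """Physical files kept = union of files referenced by any live snapshot."""
    return set().union(*snapshots.values())

low_churn = {
    "v1": {"a", "b", "c"},
    "v2": {"a", "b", "c", "d"},       # append-only: one new file
}
high_churn = {
    "v1": {"a", "b", "c"},
    "v2": {"a2", "b2", "c2"},         # full rewrite: nothing shared
}
print(len(storage_needed(low_churn)))   # 4 files for two snapshots
print(len(storage_needed(high_churn)))  # 6 files for two snapshots
```

Expiring old snapshots shrinks the union, which is what lets garbage collection reclaim the rewritten files.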

For ML systems, time travel is genuinely useful:

  • Reproduce a training run. The training code was run on the table as of snapshot N. You can reconstruct the exact training data by querying VERSION AS OF N.
  • Roll back a bad commit. A bug wrote corrupt data. Revert to the previous snapshot.
  • Debug inference quality regressions. The feature table looked different a week ago. Query “what was in this table when the model was trained?”

Schema evolution is how you change the table’s schema over time without breaking readers or rewriting data. The three formats support similar operations:

  • Add a column: always safe. New data has the column, old data has NULL. Metadata-only change.
  • Drop a column: metadata-only in Iceberg. Delta and Hudi may require rewriting files depending on version.
  • Rename a column: Iceberg supports it natively via column IDs. Delta supports it in newer versions with column mapping enabled. Hudi’s support varies.
  • Change a column’s type: supported only for “safe” promotions (int → bigint, float → double). Other changes require a new column and a migration.

The column-ID approach (Iceberg’s default, Delta’s column mapping) is key: internally, columns are referenced by stable numeric IDs, not by name. Renaming a column in the metadata doesn’t touch the data files — the IDs stay the same, only the name-to-ID map changes. This is how Iceberg handles rename as a metadata-only operation.
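The column-ID mechanism is easy to sketch. The field IDs and name-to-ID map below are simplified illustrations of what Iceberg keeps in its schema metadata:

```python
# Sketch of column-ID-based schema evolution: data files reference stable
# field IDs, so a rename only edits the name-to-ID map in table metadata.
schema_v1 = {"user_id": 1, "ts": 2, "latency_ms": 3}       # name -> field ID
data_file_row = {1: 42, 2: "2026-04-01", 3: 17}            # keyed by field ID

def rename_column(schema, old, new):
    """Metadata-only rename: the field ID, and thus the data, is untouched."""
    schema = dict(schema)
    schema[new] = schema.pop(old)
    return schema

schema_v2 = rename_column(schema_v1, "latency_ms", "latency")

def read_column(schema, row, name):
    """Resolve a name to its field ID, then read the data by ID."""
    return row[schema[name]]

# The same bytes on disk answer to both old and new schema versions.
assert read_column(schema_v1, data_file_row, "latency_ms") == 17
assert read_column(schema_v2, data_file_row, "latency") == 17
```

Name-based formats break here: a renamed column either loses its history or forces a rewrite, because old files only know the old name.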

90.9 Why lakehouses matter for ML

ML systems have three lakehouse-relevant problems:

1. Training data versioning. An ML team needs to know, for a given model, exactly what training data it was trained on. Without a table format, the answer is some combination of “the files in this S3 prefix at some time” and “a custom manifest we wrote.” With a table format, the answer is a snapshot id, reproducible indefinitely. Time travel gives you literal reproducibility: training on VERSION AS OF N is the same operation a week later or a year later.

2. Feature consistency between training and serving. The feature pipeline writes features to a lakehouse table. The training pipeline reads from the same table. The online feature store (Chapter 91) syncs from the same table. Without ACID, readers see inconsistent snapshots of the table and training data might not match what’s in production. With a table format, everyone reads from a consistent snapshot.

3. Efficient updates for changing data. Features change. User profiles change. Labels get corrected. Without a table format, updating a row means rewriting a partition and orchestrating a reader cutover. With Delta, Iceberg, or Hudi, updates are a MERGE INTO statement or an upsert operation, transactionally applied.

In addition, the operational story is better:

  • One storage layer for raw events, curated features, model-ready datasets, evaluation sets, and inference logs. One bucket, one format, one query engine.
  • Incremental processing is natural. Readers can query “what’s new since the last time I read” and get only the delta.
  • Cost is object storage cost. Storing 100 TB of training data in Iceberg on S3 is about $2,300/month; the same data in a warehouse costs several times that once compute and vendor pricing are included.

The downside is that lakehouses are newer than warehouses, the tooling is less mature, and ML teams often need to build their own feature-store-on-lakehouse infrastructure. Managed lakehouses (Databricks, Snowflake’s Iceberg, Dremio, Tabular) are filling the gap but add cost.

For an ML platform at any real scale, the lakehouse is the right storage layer for training data, labeled datasets, feature tables, and offline evaluation results. Iceberg is the default choice in the open-source ecosystem as of 2026.

90.10 Picking one

A short decision tree:

  • If you are on Databricks, use Delta. It’s the native format and the tooling is best. You can still read it from non-Databricks engines via Delta OSS.
  • If you want the broadest open-source support and you’re not tied to a single vendor, use Iceberg. It has the most engine support, the cleanest spec, and the most active non-single-vendor community.
  • If your workload is extremely update-heavy and you need merge-on-read semantics for write performance, use Hudi. This is a smaller category than it used to be; Iceberg and Delta have added upsert support.
  • If you’re on Snowflake, you can now use Snowflake’s managed Iceberg tables, which give you Snowflake’s query performance over Iceberg-formatted data on your own S3.
  • If you’re on AWS, Athena, Glue, EMR, and Redshift all speak Iceberg. Delta support is newer but present.

Do not pick based on current feature parity — the formats leapfrog each other yearly. Pick based on the ecosystem you’re already in and the specific features your workloads need (merge-on-read, hidden partitioning, Z-ordering).

90.11 The mental model

Eight points to take into Chapter 91:

  1. The lakehouse collapses lake and warehouse. Object storage for cheap capacity, table formats for warehouse-grade correctness.
  2. Parquet is the columnar foundation. Every table format uses it. Column pruning + predicate pushdown + compression = analytical speed.
  3. A table format adds atomic metadata on top of Parquet. The key trick is atomic pointer swapping of the current metadata file.
  4. Iceberg has a metadata → manifest list → manifest → data file hierarchy. Designed for large tables. Hidden partitioning, schema evolution by column ID.
  5. Delta has a transaction log in _delta_log/ with periodic checkpoints. Simpler, Databricks-native, broadly compatible.
  6. Hudi has CoW and MoR table types. MoR uses delta files merged on read for fast writes.
  7. Time travel and schema evolution are standard features. All three support them, with slightly different degrees of flexibility.
  8. For ML, lakehouses give you reproducible training data, consistent feature pipelines, and efficient updates, at object-storage prices.

In Chapter 91, the focus moves from the bulk data layer to the specialized ML-serving layer: feature stores.


Read it yourself

  • Ryan Blue’s Netflix OSS talk Iceberg: A Modern Table Format for Big Data and the Iceberg spec on iceberg.apache.org.
  • Michael Armbrust et al., Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores (VLDB 2020). The Delta paper.
  • The Apache Hudi paper Hudi: Uber Engineering’s Incremental Processing Framework on Hadoop.
  • Jacques Nadeau et al., Apache Parquet: the Parquet specification and format guide.
  • The Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics CIDR 2021 paper by the Databricks team.
  • Bauplan, Dremio, and Tabular blog posts on Iceberg internals — a surprising amount of depth is written up in those.

Practice

  1. A Parquet file has 10 row groups of 100 MB each. A query filters on ts > X where X is chosen such that only the last 2 row groups could match. How much data does the reader actually fetch?
  2. Walk through the steps of an Iceberg writer committing a new file: what does it write, in what order, and what’s the atomicity primitive?
  3. Why does Iceberg use a column-ID-based schema instead of column names? Construct a scenario where this matters.
  4. A Delta table has 100,000 transaction files and no checkpoints. A reader opens it. What’s the cost? Why do checkpoints exist?
  5. Compare copy-on-write and merge-on-read in Hudi. Construct a workload where each wins and quantify why.
  6. For an ML training pipeline that reads a features table and a labels table and joins them, how does Iceberg snapshot isolation help you reproduce the training run a month later?
  7. Stretch: Using DuckDB or PySpark, create an Iceberg table of 1 million rows, run a query that uses predicate pushdown, then evolve the schema by adding a column, then time-travel-query the old snapshot. Compare query latency before and after the schema evolution.