Technical primer: Iceberg changelogs in Python transforms

This guide explains how to use changelogs in Python transforms with Iceberg tables. The guide covers the two main changelog modes, their mechanics, and best practices for avoiding common pitfalls.

What is a changelog?

A changelog is a view of the changes (inserts, updates, and deletes) that have occurred in a table between two points in time (as represented by Iceberg snapshots ↗). Changelogs enable efficient incremental processing, allowing you to process only what has changed since the last run.

There are two primary ways to generate a changelog in Foundry:

  • Identifier changelog (recommended): One or more identifier columns provided; recommended for most use cases.
  • Net changes changelog: No identifier columns provided; use when input data does not have a reliable set of primary keys.

Identifier changelog (has identifier columns)

Use the "identifier changelog" mode when you have one or more columns in your input data that uniquely identify each row, together constituting a primary key. This mode is more performant than the "net changes changelog" mode and provides richer semantics, including update-before and update-after records.

Identifier changelog syntax

To generate an identifier changelog, run source.changelog and provide an array of primary keys:

Copied!
1 changelog_df = source.changelog(["id"])

Identifier changelog implementation details

Foundry's "identifier changelog" mode is implemented using the compute_updates option in Iceberg's create_changelog_view procedure.

With compute_updates:

  • Row insertions and deletions are represented as INSERT and DELETE changes.
  • Row updates are based on the identifier columns and are represented as a pair of UPDATE_BEFORE and UPDATE_AFTER rows, also known as "pre/post update images" ↗.

Foundry's API implements an additional step on top of Iceberg's create_changelog_view to get the latest update across a transaction range. For example, if a row has its value updated from 1 to 2 in one snapshot, and then from 2 to 3 in a subsequent snapshot, the resulting identifier changelog will only show an UPDATE_BEFORE of 2 and an UPDATE_AFTER of 3.

The Iceberg table specification supports identifier fields ↗ in metadata but does not enforce uniqueness. Because of that, Transforms changelogs do not currently rely on identifier fields in metadata.

Identifier changelog schema

The schema for the identifier changelog:

Column nameTypeDescription
All data columnsvariousAll columns from the source table
_change_typestringOne of INSERT, DELETE, UPDATE_BEFORE, UPDATE_AFTER
_change_ordinalintMonotonically increasing number for ordering changes

Identifier changelog change types

The change types for the identifier changelog:

Change typeDescription
UPDATE_BEFOREThe old version of the row for the identifier.
UPDATE_AFTERThe new version of the row for the identifier.
INSERTNew row for this identifier.
DELETERow removed for this identifier.

Identifier changelog example

An example identifier changelog:

idvalue_change_type_change_ordinal
id1val1UPDATE_BEFORE1
id1val2UPDATE_AFTER1
id2val2INSERT2

Net changes changelog (no identifier columns)

Use the "net changes changelog" mode when there is no set of identifier columns that uniquely identify entries in your input data. This approach provides more flexibility than the "identifier changelog" mode, but comes at the cost of performance.

Net changes changelog syntax

To generate a net changes changelog:

Copied!
1 changelog_df = source.changelog()

Hint: Use "net changes changelog" when you can't trust your primary keys or have duplicate rows in the source.

Net changes changelog implementation details

Foundry's "net changes changelog" mode is implemented using the net_changes option in Iceberg's create_changelog_view procedure.

In Iceberg, updating a row is implemented as a deletion and insertion of a row. There is no inherent notion of a row update in that nothing connects the deletion and insertion as semantically referring to the same row.

The "net changes changelog" is the simplest type of changelog in that it represents a row update as a pair of rows, one having as _change_type value DELETE and the other having _change_type value INSERT. (Deletions and insertions are represented by a single row with DELETE and INSERT respectively.)

The "net" in "net changes" refers to row updates (insertions, deletions, and updates) being combined across a range of snapshots. For example, if a row has its value updated from 1 to 2 in one snapshot, and 2 to 3 in a subsequent snapshot, then the "net changes changelog" will show a DELETE of 1 and an INSERT of 3. The 2 update is being skipped.

Net changes changelog schema

The schema for the net changes changelog:

Column nameTypeDescription
All data columnsvariousAll columns from the source table
_change_typestringINSERT or DELETE
_change_ordinalintMonotonically increasing number for ordering changes

Net changes changelog change types

The change types for the identifier changelog:

Change typeDescription
INSERTRow insertion
DELETERow deletion

Net changes changelog example

An example net changes changelog:

idvalue_change_type_change_ordinal
id1val1DELETE1
id1val2INSERT1

Example changelog evolution

This example demonstrates how an Iceberg table evolves across snapshots V0 to V3 and how changelogs are produced.

Example table evolution across snapshots

V0V1V2V3
[id1, val1][id1, val2][id1, val3][id1, val3]
[id2, val2][id1, val3]

Identifier changelog example evolution

The following tables show how the identifier changelog evolves across snapshots for the example Iceberg table.

Identifier changelog example: V0 -> V1

idvalue_change_type_change_ordinal
ID1VAL1UPDATE_BEFORE1
ID1VAL2UPDATE_AFTER1

Identifier changelog example: V1 -> V2

idvalue_change_type_change_ordinal
ID1VAL2UPDATE_BEFORE2
ID1VAL3UPDATE_AFTER2
ID2VAL2INSERT2

Identifier changelog example: V0 -> V2

idvalue_change_type_change_ordinal
ID1VAL1UPDATE_BEFORE1
ID1VAL3UPDATE_AFTER2
ID2VAL2INSERT2

Identifier changelog example: V0 -> V3 (failure case)

In this example, the identifier changelog from V0->V3 fails as there are duplicate rows for ID1 value of the primary key. This highlights the importance of unique identifier columns for the identifier changelog mode.

Net changes changelog example evolution

The following tables show how the net changes changelog evolves across snapshots for the example Iceberg table.

Net changes changelog example: V0 -> V1

idvalue_change_type_change_ordinal
ID1VAL1DELETE1
ID1VAL2INSERT1

Net changes changelog example: V1 -> V2

idvalue_change_type_change_ordinal
ID1VAL2DELETE2
ID1VAL3INSERT2
ID2VAL2INSERT2

Net changes changelog example: V0 -> V2

idvalue_change_type_change_ordinal
ID1VAL1DELETE1
ID1VAL2INSERT1
ID1VAL2DELETE2
ID1VAL3INSERT2
ID2VAL2INSERT2

Rows with strikethrough are deleted to build the net changes changelog:

idvalue_change_type_change_ordinal
ID1VAL1DELETE1
ID1VAL3INSERT2
ID2VAL2INSERT2

Net changes changelog example: V0 -> V3

idvalue_change_type_change_ordinal
ID1VAL1DELETE1
ID1VAL2INSERT1
ID1VAL2DELETE2
ID1VAL3INSERT2
ID2VAL2INSERT2
ID2VAL2DELETE3
ID1VAL3INSERT3

Net changes after strikethrough rows removed:

idvalue_change_type_change_ordinal
ID1VAL1DELETE1
ID1VAL3INSERT2
ID1VAL3INSERT3

Uniqueness requirements and common pitfalls

When working with changelogs, you must ensure uniqueness of identifier columns at two stages: the source table and the destination table.

Uniqueness in the source table (source.changelog([identifier_cols]))

When you call:

Copied!
1 changelog_df = source.changelog(["id"])

the API expects that the identifier columns (here, id) uniquely identify each row in your source table.

If there are duplicate rows for the same identifier in the source, changelog creation will fail:

ValueError: Duplicate rows found for identifier columns ['id'] in changelog view.

If you cannot guarantee uniqueness in your source, use the net changes changelog instead:

Copied!
1 changelog_df = source.changelog() # No identifier columns

Uniqueness in the destination table (output.apply_changelog(changelog_df, keys))

When you call:

Copied!
1 output.apply_changelog(changelog_df, ["id"])

the keys argument specifies the identifier columns for the destination table. These columns must match the identifier columns in your changelog dataframe. The destination table must also be uniquely keyed by these columns.

If there are multiple rows in the destination table that match the same key, or if the changelog dataframe contains duplicates for the key, apply_changelog will fail:

Exception: apply_changelog failed: duplicate keys detected in changelog dataframe for identifier columns ['id']

or

Exception: apply_changelog failed: multiple rows in destination table match key ['id']

Always ensure both your changelog dataframe and your destination table are deduplicated on the identifier columns before calling apply_changelog.

Deduplicate before applying changelog

If you want to use the identifier changelog for its performance and semantics, you must deduplicate your data before calling apply_changelog. For example:

Copied!
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 from pyspark.sql import functions as F from pyspark.sql.window import Window # Group rows by identifiers and order by change ordinal such that the latest # row comes first (`_row_number == 1`). This allows us to filter a net changes # changelog to only keep the latest row per identifier. window = Window.partitionBy("id").orderBy(F.col("_change_ordinal").desc()) deduped_df = ( changelog_df .withColumn("_row_number", F.row_number().over(window)) .filter("_row_number = 1") .drop("_row_number") ) output.apply_changelog(deduped_df, ["id"])

If you skip this step and there are duplicates, apply_changelog will fail as shown above.

Summary and best practices

  • Prefer identifier changelog if your identifier columns are unique in the source.
  • Use net changes changelog if you have duplicates or cannot trust your primary key in the source.
  • Always deduplicate both the changelog dataframe and ensure the destination table is uniquely keyed before apply_changelog to avoid errors.
  • Understand the semantics and performance tradeoffs of each changelog mode.

For further reading, see the Iceberg documentation on changelog views ↗.