Iceberg changelogs and CDC pipelines

Foundry transforms make it easy to build robust, scalable Change Data Capture (CDC) pipelines using Apache Iceberg’s changelog and snapshot features. With CDC in transforms, you can process only the records that were inserted, updated, or deleted since the last pipeline run, enabling incremental, low-latency data movement and processing.

In addition to existing support for append-only incremental transforms on datasets, Foundry now offers full CDC processing support for Iceberg tables as part of the transforms-tables library. This capability leverages Iceberg’s changelog views to retrieve inserts, updates, and deletes between Iceberg table snapshots.
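To make the idea of a changelog between two snapshots concrete, here is a minimal plain-Python sketch. It does not use any Foundry or Iceberg API: the function name `diff_snapshots` and the list-of-dicts snapshot representation are hypothetical, chosen only for illustration. The `_change_type` labels follow the ones used by Iceberg changelog views (INSERT, DELETE, UPDATE_BEFORE, UPDATE_AFTER). The sketch shows the shape of the result, not how Iceberg actually computes it.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def diff_snapshots(old: List[Row], new: List[Row], key: str) -> List[Row]:
    """Derive changelog rows describing how `old` became `new` (hypothetical helper)."""
    old_by_key = {row[key]: row for row in old}
    new_by_key = {row[key]: row for row in new}
    changes: List[Row] = []

    for k, old_row in old_by_key.items():
        new_row = new_by_key.get(k)
        if new_row is None:
            # Key disappeared between the two snapshots.
            changes.append({**old_row, "_change_type": "DELETE"})
        elif new_row != old_row:
            # Key exists in both snapshots with different values:
            # emit the before image followed by the after image.
            changes.append({**old_row, "_change_type": "UPDATE_BEFORE"})
            changes.append({**new_row, "_change_type": "UPDATE_AFTER"})

    for k, new_row in new_by_key.items():
        if k not in old_by_key:
            # Key appeared for the first time in the newer snapshot.
            changes.append({**new_row, "_change_type": "INSERT"})

    return changes


snapshot_1 = [{"id": 1, "status": "open"}, {"id": 2, "status": "open"}]
snapshot_2 = [{"id": 1, "status": "closed"}, {"id": 3, "status": "open"}]
changelog = diff_snapshots(snapshot_1, snapshot_2, key="id")
```

Here `changelog` holds an UPDATE_BEFORE/UPDATE_AFTER pair for `id` 1, a DELETE for `id` 2, and an INSERT for `id` 3. Unchanged rows produce no changelog entries, which is what keeps downstream processing incremental.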

Benefits of CDC processing

Using CDC with Iceberg tables offers several benefits:

  • Efficient incremental processing: CDC avoids reprocessing the entire dataset on every run, improving performance and reducing costs.
  • Streaming and real-time pipelines: CDC enables low-latency data movement by processing only new and changed records.
  • Audit and slowly changing dimensions (SCD): CDC lets you track before/after changes for full audit trails or SCD Type 2 implementations.
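As an illustration of the SCD Type 2 use case, the following plain-Python sketch maintains a history table from changelog rows. It is only a sketch under stated assumptions: the function `apply_scd2`, the `valid_from` and `valid_to` column names, and the dict-based rows are all hypothetical and are not part of the transforms API. The `_change_type` values are those used by Iceberg changelog views.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def apply_scd2(history: List[Row], changelog: List[Row], key: str, as_of: str) -> List[Row]:
    """Close the open version of each changed key and open a new version (hypothetical helper)."""
    result = [dict(row) for row in history]  # copy so the input is not mutated
    for change in changelog:
        kind = change["_change_type"]
        if kind in ("DELETE", "UPDATE_BEFORE"):
            # End-date the currently open version for this key.
            for row in result:
                if row[key] == change[key] and row["valid_to"] is None:
                    row["valid_to"] = as_of
        elif kind in ("INSERT", "UPDATE_AFTER"):
            # Open a new version carrying the latest attribute values.
            attrs = {k: v for k, v in change.items() if k != "_change_type"}
            result.append({**attrs, "valid_from": as_of, "valid_to": None})
    return result


history = [{"id": 1, "status": "open", "valid_from": "2024-01-01", "valid_to": None}]
changelog = [
    {"id": 1, "status": "open", "_change_type": "UPDATE_BEFORE"},
    {"id": 1, "status": "closed", "_change_type": "UPDATE_AFTER"},
    {"id": 2, "status": "open", "_change_type": "INSERT"},
]
updated = apply_scd2(history, changelog, key="id", as_of="2024-02-01")
```

After this run, the original version of `id` 1 is end-dated, a new open version of `id` 1 holds the updated status, and `id` 2 has its first open version. The before/after images in the changelog are what make it possible to close the old version without rescanning the whole source table.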

Quick start: using changelog views in Python transforms

You can use the Palantir transforms API to read changelogs from an Iceberg input table and apply them to an Iceberg output table:

    from transforms.api import incremental, transform
    from transforms.tables import TableInput, TableOutput

    @incremental(v2_semantics=True)
    @transform(
        source=TableInput("<PATH>/your_iceberg_input_table"),
        output=TableOutput("<PATH>/your_iceberg_output_table"),
    )
    def cdc_transform(ctx, source, output):
        # Read only the changes since the last run
        changelog_df = source.changelog(["your_primary_key"])
        # Apply your business logic to the changelog
        output.apply_changelog(changelog_df, ["your_primary_key"])
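Conceptually, applying a changelog by primary key means upserting the latest image of each changed row and removing deleted rows. The plain-Python sketch below illustrates that idea on an in-memory dict. It is not how `apply_changelog` is implemented in Foundry: the function `apply_changes` and the dict-based target are hypothetical stand-ins used only to show the semantics.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def apply_changes(target: Dict[Any, Row], changelog: List[Row], key: str) -> Dict[Any, Row]:
    """Apply changelog rows to a target keyed by primary key (hypothetical helper)."""
    result = dict(target)  # shallow copy so the input mapping is left untouched
    for change in changelog:
        kind = change["_change_type"]
        row = {k: v for k, v in change.items() if k != "_change_type"}
        if kind in ("INSERT", "UPDATE_AFTER"):
            # Upsert: the newest image of the row wins.
            result[row[key]] = row
        elif kind == "DELETE":
            # Remove the row if it is present.
            result.pop(row[key], None)
        # UPDATE_BEFORE rows carry the old image and need no action here,
        # because the matching UPDATE_AFTER overwrites the row.
    return result


target = {1: {"id": 1, "status": "open"}, 2: {"id": 2, "status": "open"}}
changelog = [
    {"id": 1, "status": "open", "_change_type": "UPDATE_BEFORE"},
    {"id": 1, "status": "closed", "_change_type": "UPDATE_AFTER"},
    {"id": 2, "status": "open", "_change_type": "DELETE"},
    {"id": 3, "status": "open", "_change_type": "INSERT"},
]
new_target = apply_changes(target, changelog, key="id")
```

In this example `new_target` contains `id` 1 with the updated status and the newly inserted `id` 3, while `id` 2 has been removed. Only the rows named in the changelog are touched, which is why the primary key must be supplied on both the read and the apply side.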

For more detailed guides, see the following sections, which provide changelog code examples and a technical primer, including a walkthrough of an example with no primary keys in the input.