Iceberg tables [Beta]

Beta

Iceberg table support is in the beta phase of development and may not be available on your enrollment. Functionality may change during active development. Contact Palantir Support to request enabling Iceberg tables.

Apache Iceberg ↗ is a widely adopted open-source table format and is available in Foundry via Iceberg tables as an alternative resource type for representing tabular data.

Foundry offers Iceberg both as managed tables and as virtual tables via an external catalog or storage provider.

What is Iceberg?

Apache Iceberg ↗ is an open table format that has gained significant traction in the data and analytics community due to benefits around scalability, performance, and broad ecosystem support. In particular, the wide adoption of the Iceberg format specification enables a broad array of integrations and interoperability across modern data ecosystems.

The Apache Iceberg project includes the Apache Iceberg table format specification ↗, as well as a set of engine connectors that support the Iceberg specification, such as Spark ↗, Flink ↗ and more.

Beyond the core Apache Iceberg project, there is a growing ecosystem of connectors and engines that support the Iceberg format.

Foundry Iceberg catalog

Apache Iceberg defines an Iceberg REST Catalog ↗ specification, which outlines a set of endpoints and behaviors that a service must implement to function as an Iceberg REST catalog.

By adhering to this specification, the service becomes compatible with a growing number of compute engines that support the Iceberg REST Catalog.

Foundry now natively implements this Iceberg REST Catalog specification. In addition, Foundry supports connectivity to third-party Iceberg REST catalogs such as Databricks Unity Catalog ↗.

Iceberg tables vs. datasets

As Palantir expands coverage for Iceberg tables, the features and benefits available for Foundry Iceberg tables will grow over time and limitations will be removed.

The sections below summarize the current state of support compared to datasets while Iceberg tables are in the beta phase of development.

Benefits of Iceberg tables

The following benefits of Iceberg tables are currently available in Foundry:

  • Interoperability: Open Iceberg format means third-party tools can more easily read and write Palantir Iceberg tables.
  • Compaction: Support for automated compaction without affecting incremental reads.
  • Row edits: Support for DELETE, UPDATE, and MERGE INTO statements, which allow you to conditionally modify rows without the need to re-snapshot (see the sketch after this list).
  • Changelogs: Incrementally consume row deletions and updates.
  • Enriched table history: Enhanced history view surfacing Iceberg table metadata ↗.
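
For illustration, the following is a minimal sketch of such row-level edits issued through Spark SQL, assuming a Spark session already connected to the Foundry Iceberg catalog (see the Jupyter® notebook section below). The table path, columns, and the updates view are hypothetical.

# All paths, columns, and the 'updates' view below are hypothetical examples.
spark.sql("DELETE FROM `/.../taxis` WHERE trip_distance < 1.0")

spark.sql("UPDATE `/.../taxis` SET store_and_fwd_flag = 'N' WHERE store_and_fwd_flag IS NULL")

# 'updates' is assumed to be a temporary view containing new or changed rows.
spark.sql("""
    MERGE INTO `/.../taxis` AS target
    USING updates AS source
    ON target.trip_id = source.trip_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")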

The interface for viewing Iceberg table history can be seen in the following screenshot:

[Screenshot: Iceberg table history view]

Notable differences between Iceberg tables and Foundry datasets

There are several notable differences worth calling out between the behavior of Iceberg tables and Foundry datasets today. Palantir is exploring features to improve consistency on these dimensions and is working toward parity across the two formats.

  • Default branches: Iceberg’s main branch is called main, whereas Foundry’s main branch is called master. In Foundry’s integration with Iceberg, main and master are treated as the same, which means a Foundry job running on master will write to the Iceberg table’s main branch.
  • Schema evolution on branches: In Iceberg, branches do not have their own schemas. Instead, branches track the "current" schema and the current schema is shared across a table by all branches. This means that you cannot alter the schema on a branch without also changing the schema on main.
  • Automatic schema evolution: Iceberg is strict about the schema when writing to an existing Iceberg table. Any change in schema must be made explicitly via an ALTER TABLE command (see the sketch after this list).
  • Partially completed jobs: Foundry dataset transactions are "all-or-nothing", meaning that if a job writes to an output dataset and then fails, any intermediate progress is discarded. This is not true for jobs on Iceberg tables, which are not "all-or-nothing"; Iceberg snapshots are independent from Foundry transactions and any partial update happening within an aborted job remains even if the job fails.
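
For illustration, a minimal sketch of such an explicit schema change through Spark SQL, assuming a Spark session connected to the Foundry Iceberg catalog and a hypothetical table path and column:

# Hypothetical path and column; the new column must be added explicitly
# before any write that includes it.
spark.sql("ALTER TABLE `/.../taxis` ADD COLUMNS (tip_amount double)")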

Foundry functionality not yet available for Iceberg tables

Note that certain Foundry features are not currently supported for Iceberg tables.

Iceberg terminology disambiguation

Iceberg introduces terms that do not always have direct equivalents in Foundry:

| Iceberg term | Meaning | Foundry disambiguation |
| --- | --- | --- |
| Table metadata ↗ | Metadata describing the structure, schema, and state of an Iceberg table. | Foundry also tracks metadata for tables and datasets across all formats, but this metadata is managed internally by Foundry services. For Iceberg tables, Foundry exposes the native Iceberg metadata explicitly as a JSON file via the Iceberg catalog. |
| Snapshots ↗ | A record of the table's state at a specific point in time. Iceberg creates a new snapshot with every data modification. | An Iceberg snapshot is roughly equivalent to a Foundry transaction. However, Foundry also uses the term "snapshot" differently: in Foundry, a "snapshot" refers to a transaction that fully replaces all data in a dataset, while an Iceberg snapshot captures all types of data operations, including incremental changes. |
| Append snapshot | An operation where only data files were added and no files were removed. | Same as a Foundry append transaction. |
| Replace snapshot | An operation where data and delete files were added and removed without changing table data (such as compaction, changing the data file format, or relocating data files). | No direct Foundry equivalent. |
| Overwrite snapshot | An operation where data and delete files were added and removed in a logical overwrite operation. Unlike replace, overwrites change the logical set of records in the table. | Equivalent to a Foundry update transaction. |
| Delete snapshot | An operation where data files were removed and their contents logically deleted, and/or "delete files" were added to delete rows. In Iceberg, "delete files" are metadata files that store information about the rows deleted from a table. | Logically similar to a Foundry delete transaction, but Iceberg can express row deletions via delete files, unlike Foundry delete transactions. |

Storage options

You can choose your preferred storage configuration for Iceberg tables according to the following options:

| Storage | Storage type | Iceberg status |
| --- | --- | --- |
| Bring-your-own-bucket (AWS, Azure) | Managed | Beta |
| Virtual tables | External | Beta |
| Foundry managed storage | Managed | Coming soon |

Palantir Support can provide assistance in setting up bring-your-own-bucket storage.

Using Iceberg in Foundry

Foundry managed Iceberg tables can be created in two ways: either within Foundry via Transforms, or outside Foundry via an external compute engine.

Accessing catalog metadata

Foundry exposes Iceberg catalog metadata explicitly as a JSON file in the dataset application under the Details tab.

Catalog metadata in dataset application:

[Screenshot: Iceberg metadata file in the dataset application]

Python transforms

Iceberg tables can be used as inputs and outputs in Python transforms using the transforms.tables API, which can be imported from the transforms-tables package.

Example: Generate a simple Iceberg table

from transforms.api import transform, TransformContext
from transforms.tables import TableOutput, TableTransformOutput


@transform(
    output=TableOutput("/.../Output")
)
def compute(ctx: TransformContext, output: TableTransformOutput):
    df_custom = ctx.spark_session.createDataFrame([["Hello"], ["World"]], schema=["phrase"])
    output.write_dataframe(df_custom)

Example: Iceberg table output, Iceberg table input

from transforms.api import transform
from transforms.tables import TableInput, TableOutput, TableTransformInput, TableTransformOutput


@transform(
    source_table=TableInput("/.../Input"),
    output_table=TableOutput("/.../Output")
)
def compute(source_table: TableTransformInput, output_table: TableTransformOutput):
    output_table.write_dataframe(source_table.dataframe())

Example: Iceberg table output, dataset input

from transforms.api import transform, Input, TransformInput
from transforms.tables import TableOutput, TableTransformOutput


@transform(
    source_dataset=Input("/.../Input"),
    output_table=TableOutput("/.../Output")
)
def compute(source_dataset: TransformInput, output_table: TableTransformOutput):
    output_table.write_dataframe(source_dataset.dataframe())

Example: Dataset output, Iceberg table input

from transforms.api import transform, Output, TransformOutput
from transforms.tables import TableInput, TableTransformInput


@transform(
    source_table=TableInput("/.../Input"),
    output=Output("/.../Output")
)
def compute(source_table: TableTransformInput, output: TransformOutput):
    output.write_dataframe(source_table.dataframe())

Jupyter® notebook in Code Workspaces

The instructions below provide current details on how to read from and write to Iceberg tables in a Jupyter® notebook. This beta workflow will be simplified in the future.

Set up Code Workspaces to use Iceberg

  1. PySpark setup: Set up a code workspace to use PySpark following the instructions in FAQ: Can I use PySpark in Code Workspaces?.
  2. Upload Iceberg JARs: Download the Spark 3.5 with Scala 2.12 and aws-bundle JARs from Iceberg's releases page ↗. Create a new folder called /libs, and upload the JARs into this folder.
  3. Network policy: Import the network policy for your Iceberg storage bucket into your code workspace.

Example Jupyter® notebook code

To begin, create a Spark session. Note that running this code will prompt you to enter a user token, which can be generated in your account settings. See User-generated tokens for a step-by-step guide on creating a token.

from pyspark.sql import SparkSession
from getpass import getpass

spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("foundry")
    .config("spark.jars", "file:///home/user/repo/libs/iceberg-spark-runtime-3.5_2.12-1.9.1.jar,file:///home/user/repo/libs/iceberg-aws-bundle-1.9.1.jar")
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.foundry", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.foundry.type", "rest")
    .config("spark.sql.catalog.foundry.uri", "https://<your_foundry_url>/iceberg")
    .config("spark.sql.catalog.foundry.default-namespace", "foundry")
    .config("spark.sql.catalog.foundry.token", getpass("Foundry token:"))
    .config("spark.sql.defaultCatalog", "foundry")
    .getOrCreate()
)

Iceberg's documentation ↗ provides more context on the above parameters, which are used to establish connectivity to the Iceberg catalog. Remember to update the spark.jars file paths using the names of the JARs you uploaded in step 2.

Of these, the Foundry-specific Iceberg catalog parameters ↗ are:

| Parameter | Value | Description |
| --- | --- | --- |
| spark.sql.catalog.foundry | org.apache.iceberg.spark.SparkCatalog | Catalog implementation class ↗. |
| spark.sql.catalog.foundry.type | rest | Underlying catalog implementation type, i.e. REST. |
| spark.sql.catalog.foundry.uri | https://<your_foundry_url>/iceberg | URL for the REST catalog. |
| spark.sql.catalog.foundry.default-namespace | foundry | Default namespace for the catalog. |
| spark.sql.catalog.foundry.token | getpass("Foundry token:") | Prompts for token access credentials. |
Escaping whitespace

If your path contains whitespace, you must ensure that the space is correctly escaped. With Spark, you can use backticks (`) to escape whitespace, for example `/.../My folder/Iceberg table`. With PyIceberg, you can use URL encoding.
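
A minimal sketch of both approaches, using a hypothetical path:

from urllib.parse import quote

# Spark: backticks preserve the space in the path (hypothetical path).
df = spark.table("`/.../My folder/Iceberg table`")

# PyIceberg: URL-encode the space instead.
encoded = quote("/.../My folder/Iceberg table")  # '/.../My%20folder/Iceberg%20table'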

Now you can use your Spark session to read from and write to your Iceberg tables. For example, following the Iceberg documentation's quickstart guide ↗, you can create a table and insert rows.

from pyspark.sql.types import DoubleType, FloatType, LongType, StructType, StructField, StringType

schema = StructType([
    StructField("vendor_id", LongType(), True),
    StructField("trip_id", LongType(), True),
    StructField("trip_distance", FloatType(), True),
    StructField("fare_amount", DoubleType(), True),
    StructField("store_and_fwd_flag", StringType(), True)
])

df = spark.createDataFrame([], schema)
df.writeTo("`/.../taxis`").create()

schema = spark.table("`/.../taxis`").schema

data = [
    (1, 1000371, 1.8, 15.32, "N"),
    (2, 1000372, 2.5, 22.15, "N"),
    (2, 1000373, 0.9, 9.01, "N"),
    (1, 1000374, 8.4, 42.13, "Y")
]

df = spark.createDataFrame(data, schema)
df.writeTo("`/.../taxis`").append()
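
To check that the rows landed, you can read the table back with the same session (using the same hypothetical path as above):

spark.table("`/.../taxis`").show()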

Creating Iceberg tables externally

Iceberg's open table format allows you to read and write Foundry Iceberg tables using external engines.

For example, the image below uses PyIceberg ↗ to create a Foundry table from a Jupyter notebook running on your computer. You could perform the same exercise with any engine that supports Iceberg REST catalogs.

[Screenshot: Jupyter notebook creating an Iceberg table with PyIceberg]
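
As a minimal sketch of what such a workflow might look like, the following assumes PyIceberg and PyArrow are installed locally, mirrors the catalog properties from the Spark configuration above, and uses a hypothetical table path, column, and token handling:

import pyarrow as pa
from pyiceberg.catalog import load_catalog

# Connect to the Foundry Iceberg REST catalog (same URI as in the Spark example above).
catalog = load_catalog(
    "foundry",
    **{
        "uri": "https://<your_foundry_url>/iceberg",
        "token": "<your_foundry_token>",  # a user-generated Foundry token
    },
)

# Hypothetical table path under the default 'foundry' namespace.
data = pa.table({"phrase": ["Hello", "World"]})
table = catalog.create_table(("foundry", "/.../greetings"), schema=data.schema)
table.append(data)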


Jupyter®, JupyterLab®, and the Jupyter® logos are trademarks or registered trademarks of NumFOCUS.

All third-party trademarks (including logos and icons) referenced remain the property of their respective owners. No affiliation or endorsement is implied.