Iceberg table support is in the beta phase of development and may not be available on your enrollment. Functionality may change during active development. Contact Palantir Support to request enabling Iceberg tables.
Apache Iceberg ↗ is a widely adopted open-source table format and is available in Foundry via Iceberg tables as an alternative resource type for representing tabular data.
Foundry offers Iceberg both as managed tables and as virtual tables via an external catalog or storage provider.
Apache Iceberg ↗ is an open table format that has gained significant traction in the data and analytics community due to benefits around scalability, performance, and broad ecosystem support. In particular, the wide adoption of the Iceberg format specification enables a broad array of integrations and interoperability across modern data ecosystems.
The Apache Iceberg project includes the Apache Iceberg table format specification ↗, as well as a set of engine connectors that support the Iceberg specification, such as Spark ↗, Flink ↗ and more.
Beyond the core Apache Iceberg project, there is a growing ecosystem of connectors and engines that support the Iceberg format.
Apache Iceberg defines an Iceberg REST Catalog ↗ specification, which outlines a set of endpoints and behaviors that a service must implement to function as an Iceberg REST catalog.
By adhering to this specification, the service becomes compatible with a growing number of compute engines that support the Iceberg REST Catalog.
Foundry now natively implements this Iceberg REST Catalog specification. In addition, Foundry also supports connectivity to third-party Iceberg REST catalogs such as Databricks Unity Catalog ↗.
As Palantir expands coverage for Iceberg tables, the features and benefits available for Foundry Iceberg tables will grow over time and limitations will be removed.
The following summarizes the current state of support compared to datasets while Iceberg tables are in the beta phase of development.
The following benefits of Iceberg tables are currently available in Foundry:
- `DELETE`, `UPDATE`, and `MERGE INTO` statements, which allow you to conditionally modify rows without the need to re-snapshot.

The interface for viewing Iceberg table history can be seen in the following screenshot:
There are several notable differences worth calling out between the behavior of Iceberg tables and Foundry datasets today. Note that Palantir is exploring features to improve consistency on these dimensions and working towards parity across the two formats.
- Iceberg's default branch is called `main`, whereas Foundry's main branch is called `master`. In Foundry's integration with Iceberg, `main` and `master` are treated as the same, which means a Foundry job running on `master` will write to an Iceberg table's `main`.
- … `main`.
- … `ALTER TABLE` command.

Note that certain Foundry features are not currently supported for Iceberg tables.
Iceberg introduces terms that do not always have direct equivalents in Foundry:
Iceberg term | Meaning | Foundry disambiguation |
---|---|---|
Table metadata ↗ | Metadata describing the structure, schema, and state of an Iceberg table. | Foundry also tracks metadata for tables and datasets across all formats, but this metadata is managed internally by Foundry services. For Iceberg tables, Foundry exposes the native Iceberg metadata explicitly as a JSON file via the Iceberg catalog. |
Snapshots ↗ | A record of the table's state at a specific point in time. Iceberg creates a new snapshot with every data modification. | An Iceberg snapshot is roughly equivalent to a Foundry transaction. However, Foundry also uses the term "snapshot" differently: in Foundry, a "snapshot" refers to a transaction that fully replaces all data in a dataset, while an Iceberg snapshot captures all types of data operations, including incremental changes. |
Append snapshot | An operation where only data files were added and no files were removed. | Same as Foundry append transaction. |
Replace snapshot | An operation where data and delete files were added and removed without changing table data (such as compaction, changing the data file format, or relocating data files). | No direct Foundry equivalent. |
Overwrite snapshot | An operation where data and delete files were added and removed in a logical overwrite operation. Unlike replace, overwrites change the logical set of records in the table. | Equivalent to a Foundry update transaction. |
Delete snapshots | An operation where data files were removed and their contents logically deleted and/or "delete files" were added to delete rows. In Iceberg, "delete files" are metadata files that store information about the rows deleted from a table. | Logically similar to a Foundry delete transaction, but Iceberg can express row deletions via delete files, unlike Foundry delete transactions. |
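To make the snapshot terminology above more concrete, the following is a minimal sketch of how snapshots and history can be inspected through Spark SQL, assuming a Spark session already configured against an Iceberg catalog (as described later on this page). The table name `db.taxis` and the exact identifier form are assumptions, not Foundry-specific syntax.

```python
# Hedged sketch: inspecting Iceberg metadata tables with Spark SQL.
# Assumes `spark` is a session configured with an Iceberg catalog;
# `db.taxis` is a hypothetical table identifier.

# Each row is an Iceberg snapshot: committed_at, snapshot_id, parent_id,
# operation ("append", "overwrite", "delete", "replace"), and summary metadata.
spark.sql("SELECT * FROM db.taxis.snapshots").show(truncate=False)

# History records which snapshot was current on the main branch over time,
# the closest analogue to browsing a Foundry dataset's transaction history.
spark.sql("SELECT * FROM db.taxis.history").show(truncate=False)
```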
You can choose your preferred storage configuration for Iceberg tables according to the following options:
Storage | Storage type | Iceberg status |
---|---|---|
Bring-your-own-bucket (AWS, Azure) | Managed | Beta |
Virtual tables | External | Beta |
Foundry managed storage | Managed | Coming soon |
Palantir support can provide assistance in setting up bring-your-own-bucket storage.
Foundry managed Iceberg tables can be created in two ways: either within Foundry via Transforms, or outside Foundry via an external compute engine.
Foundry exposes Iceberg catalog metadata explicitly as a JSON file in the dataset application under the Details tab.
Catalog metadata in dataset application:
Iceberg tables can be used as inputs and outputs in Python transforms using the `transforms.tables` API, which can be imported from the `transforms-tables` package.
```python
# Create an Iceberg table output from a DataFrame built inside the transform.
from transforms.api import transform, TransformContext
from transforms.tables import TableOutput, TableTransformOutput


@transform(
    output=TableOutput("/.../Output")
)
def compute(ctx: TransformContext, output: TableTransformOutput):
    df_custom = ctx.spark_session.createDataFrame([["Hello"], ["World"]], schema=["phrase"])
    output.write_dataframe(df_custom)
```
```python
# Read from an Iceberg table input and write to an Iceberg table output.
from transforms.api import transform
from transforms.tables import TableInput, TableOutput, TableTransformInput, TableTransformOutput


@transform(
    source_table=TableInput("/.../Input"),
    output_table=TableOutput("/.../Output")
)
def compute(source_table: TableTransformInput, output_table: TableTransformOutput):
    output_table.write_dataframe(source_table.dataframe())
```
```python
# Read from a regular Foundry dataset input and write to an Iceberg table output.
from transforms.api import transform, Input, TransformInput
from transforms.tables import TableOutput, TableTransformOutput


@transform(
    source_dataset=Input("/.../Input"),
    output_table=TableOutput("/.../Output")
)
def compute(source_dataset: TransformInput, output_table: TableTransformOutput):
    output_table.write_dataframe(source_dataset.dataframe())
```
```python
# Read from an Iceberg table input and write to a regular Foundry dataset output.
from transforms.api import transform, Output, TransformOutput
from transforms.tables import TableInput, TableTransformInput


@transform(
    source_table=TableInput("/.../Input"),
    output=Output("/.../Output")
)
def compute(source_table: TableTransformInput, output: TransformOutput):
    output.write_dataframe(source_table.dataframe())
```
The instructions below describe how to read and write Iceberg tables from a Jupyter® notebook. This beta workflow will be simplified in the future.
Download the `Spark 3.5 with Scala 2.12` and `aws-bundle` JARs from Iceberg's releases page ↗. Create a new folder called `/libs`, and upload the JARs into this folder.

To begin, create a Spark session. Note that running this code will prompt you to enter a user token, which can be generated in your account settings. See User-generated tokens for a step-by-step guide on creating a token.
```python
from pyspark.sql import SparkSession
from getpass import getpass

spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("foundry")
    .config("spark.jars", "file:///home/user/repo/libs/iceberg-spark-runtime-3.5_2.12-1.9.1.jar,file:///home/user/repo/libs/iceberg-aws-bundle-1.9.1.jar")
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.foundry", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.foundry.type", "rest")
    .config("spark.sql.catalog.foundry.uri", "https://<your_foundry_url>/iceberg")
    .config("spark.sql.catalog.foundry.default-namespace", "foundry")
    .config("spark.sql.catalog.foundry.token", getpass("Foundry token:"))
    .config("spark.sql.defaultCatalog", "foundry")
    .getOrCreate()
)
```
Iceberg's documentation ↗ provides more context on the above parameters, which are used to establish connectivity to the Iceberg catalog. Remember to update the `spark.jars` file paths using the names of the JARs you uploaded in Step 2.
Of these, the Foundry-specific Iceberg catalog parameters ↗ are:
Parameter | Value | Description |
---|---|---|
spark.sql.catalog.foundry | org.apache.iceberg.spark.SparkCatalog | Catalog implementation class ↗. |
spark.sql.catalog.foundry.type | rest | Underlying catalog implementation type, i.e., REST. |
spark.sql.catalog.foundry.uri | https://<your_foundry_url>/iceberg | URL for the REST catalog. |
spark.sql.catalog.foundry.default-namespace | foundry | Default namespace for the catalog. |
spark.sql.catalog.foundry.token | getpass("Foundry token:") | Prompts for token access credentials. |
If your path contains whitespace, you must ensure that the space is correctly escaped. With Spark, you can use backticks (`` ` ``) to escape whitespace, for example `` `/.../My folder/Iceberg table` ``. With PyIceberg, you can use URL encoding.
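As a rough illustration of both escaping styles (the folder path below is hypothetical, and `spark` is the session created above):

```python
from urllib.parse import quote

# Spark SQL identifiers: wrap the Foundry path in backticks so the space is
# treated as part of the table name.
spark.table("`/Org/Project/My folder/Iceberg table`").show()

# PyIceberg: URL-encode the whitespace in the path instead.
encoded = quote("/Org/Project/My folder/Iceberg table")
# encoded == "/Org/Project/My%20folder/Iceberg%20table"
```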
Now you can use your Spark session to read and write from your Iceberg tables. For example, following the Iceberg documentation's quickstart guide ↗, you can create a table and insert rows.
```python
from pyspark.sql.types import DoubleType, FloatType, LongType, StructType, StructField, StringType

schema = StructType([
    StructField("vendor_id", LongType(), True),
    StructField("trip_id", LongType(), True),
    StructField("trip_distance", FloatType(), True),
    StructField("fare_amount", DoubleType(), True),
    StructField("store_and_fwd_flag", StringType(), True)
])

df = spark.createDataFrame([], schema)
df.writeTo("`/.../taxis`").create()

schema = spark.table("`/.../taxis`").schema

data = [
    (1, 1000371, 1.8, 15.32, "N"),
    (2, 1000372, 2.5, 22.15, "N"),
    (2, 1000373, 0.9, 9.01, "N"),
    (1, 1000374, 8.4, 42.13, "Y")
]

df = spark.createDataFrame(data, schema)
df.writeTo("`/.../taxis`").append()
```
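With data in the table, you could also try the row-level modification statements mentioned earlier on this page. The following is a hedged sketch that reuses the `spark` session and `schema` from above; the `updates` view and the specific values are purely illustrative.

```python
# Correct a single fare in place, without re-snapshotting the table.
spark.sql("UPDATE `/.../taxis` SET fare_amount = 10.01 WHERE trip_id = 1000373")

# Delete rows that match a condition.
spark.sql("DELETE FROM `/.../taxis` WHERE store_and_fwd_flag = 'Y'")

# Register a small set of new or corrected trips as a temporary view, then merge
# it into the table: matching trip_ids are updated, unmatched rows are inserted.
updates = [
    (2, 1000373, 0.9, 10.50, "N"),
    (1, 1000375, 3.1, 18.40, "N"),
]
spark.createDataFrame(updates, schema).createOrReplaceTempView("updates")

spark.sql("""
    MERGE INTO `/.../taxis` AS t
    USING updates AS u
    ON t.trip_id = u.trip_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```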
Iceberg's open table format allows you to read and write Foundry Iceberg tables using external engines.
For example, the image below uses PyIceberg ↗ to create a Foundry table from a Jupyter notebook running on your computer. You could perform the same exercise with any engine that supports Iceberg REST catalogs.
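For reference, a minimal PyIceberg sketch along those lines is shown below. It assumes a recent PyIceberg version; the catalog properties, namespace, table identifier, and token handling are assumptions to adapt to your environment.

```python
import pyarrow as pa
from pyiceberg.catalog import load_catalog

# Connect to Foundry's Iceberg REST catalog (URL and token are placeholders).
catalog = load_catalog(
    "foundry",
    type="rest",
    uri="https://<your_foundry_url>/iceberg",
    token="<your_foundry_token>",
)

# Create a table from a PyArrow schema, then append a few rows.
# The identifier (namespace "foundry", table path "/.../greetings") is illustrative.
data = pa.table({"phrase": ["Hello", "World"]})
table = catalog.create_table(("foundry", "/.../greetings"), schema=data.schema)
table.append(data)

# Read the table back into Arrow to confirm the rows were written.
print(table.scan().to_arrow().to_pydict())
```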
Jupyter®, JupyterLab®, and the Jupyter® logos are trademarks or registered trademarks of NumFOCUS.
All third-party trademarks (including logos and icons) referenced remain the property of their respective owners. No affiliation or endorsement is implied.