The instructions below describe how to read from and write to Iceberg tables in a Jupyter® notebook. This beta workflow will be simplified in the future.
Download the `iceberg-spark-runtime` JAR for Spark 3.5 with Scala 2.12 and the `iceberg-aws-bundle` JAR from Iceberg's releases page ↗. Create a new folder called `/libs`, and upload the JARs into this folder.

To begin, create a Spark session. Note that running this code will prompt you to enter a user token, which can be generated in your account settings. See User-generated tokens for a step-by-step guide on creating a token.
```python
from pyspark.sql import SparkSession
from getpass import getpass

spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("foundry")
    .config("spark.jars", "file:///home/user/repo/libs/iceberg-spark-runtime-3.5_2.12-1.9.1.jar,file:///home/user/repo/libs/iceberg-aws-bundle-1.9.1.jar")
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.foundry", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.foundry.type", "rest")
    .config("spark.sql.catalog.foundry.uri", "https://<your_foundry_url>/iceberg")
    .config("spark.sql.catalog.foundry.default-namespace", "foundry")
    .config("spark.sql.catalog.foundry.token", getpass("Foundry token:"))
    .config("spark.sql.defaultCatalog", "foundry")
    .getOrCreate()
)
```
Iceberg's documentation ↗ provides more context on the above parameters, which are used to establish connectivity to the Iceberg catalog. Remember to update the `spark.jars` file paths to match the names of the JARs you uploaded to `/libs`.
Of these, the Foundry-specific Iceberg catalog parameters ↗ are:
| Parameter | Value | Description |
|---|---|---|
| `spark.sql.catalog.foundry` | `org.apache.iceberg.spark.SparkCatalog` | Catalog implementation class ↗. |
| `spark.sql.catalog.foundry.type` | `rest` | Underlying catalog implementation type (REST). |
| `spark.sql.catalog.foundry.uri` | `https://<your_foundry_url>/iceberg` | URL of the REST catalog. |
| `spark.sql.catalog.foundry.default-namespace` | `foundry` | Default namespace for the catalog. |
| `spark.sql.catalog.foundry.token` | `getpass("Foundry token:")` | Prompts for token access credentials. |
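Once the session is created, you can sanity-check connectivity before reading or writing anything. As a sketch (the output depends on your Foundry environment), listing namespaces exercises the REST catalog end to end:

```python
# List the namespaces visible to your token; a successful call confirms
# the catalog URI, JARs, and token are wired up correctly.
spark.sql("SHOW NAMESPACES").show()
```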
If your path contains whitespace, you must ensure that the space is correctly escaped. With Spark, you can use backticks (`` ` ``) to escape whitespace, for example `` `/.../My folder/Iceberg table` ``. With PyIceberg, you can use URL encoding.
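The following sketch shows both escaping styles side by side, reusing the example path above; the path itself is a placeholder:

```python
from urllib.parse import quote

# Spark: wrap the full path in backticks so the embedded spaces survive parsing.
spark.table("`/.../My folder/Iceberg table`").show()

# PyIceberg: URL-encode the path instead; quote() leaves "/" unescaped by default.
print(quote("/.../My folder/Iceberg table"))  # /.../My%20folder/Iceberg%20table
```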
Now you can use your Spark session to read from and write to your Iceberg tables. For example, following the Iceberg documentation's quickstart guide ↗, you can create a table and insert rows:
```python
from pyspark.sql.types import DoubleType, FloatType, LongType, StructType, StructField, StringType

schema = StructType([
    StructField("vendor_id", LongType(), True),
    StructField("trip_id", LongType(), True),
    StructField("trip_distance", FloatType(), True),
    StructField("fare_amount", DoubleType(), True),
    StructField("store_and_fwd_flag", StringType(), True)
])

# Create an empty table with the schema above.
df = spark.createDataFrame([], schema)
df.writeTo("`/.../taxis`").create()

# Append rows using the schema of the newly created table.
schema = spark.table("`/.../taxis`").schema
data = [
    (1, 1000371, 1.8, 15.32, "N"),
    (2, 1000372, 2.5, 22.15, "N"),
    (2, 1000373, 0.9, 9.01, "N"),
    (1, 1000374, 8.4, 42.13, "Y")
]
df = spark.createDataFrame(data, schema)
df.writeTo("`/.../taxis`").append()
```
Alternatively, you can use Spark SQL to create and populate the same table:
spark.sql(""" CREATE TABLE `/.../taxis` ( vendor_id bigint, trip_id bigint, trip_distance float, fare_amount double, store_and_fwd_flag string ) PARTITIONED BY (vendor_id); """) spark.sql(""" INSERT INTO `/.../taxis` VALUES (1, 1000371, 1.8, 15.32, 'N'), (2, 1000372, 2.5, 22.15, 'N'), (2, 1000373, 0.9, 9.01, 'N'), (1, 1000374, 8.4, 42.13, 'Y'); """)
Jupyter®, JupyterLab®, and the Jupyter® logos are trademarks or registered trademarks of NumFOCUS.
All third-party trademarks (including logos and icons) referenced remain the property of their respective owners. No affiliation or endorsement is implied.