Connect Foundry to Databricks to leverage a range of capabilities on top of data, compute, and models available within Databricks.
Capability | Status |
---|---|
Exploration | 🟢 Generally available |
Bulk import | 🟢 Generally available |
Incremental | 🟢 Generally available |
Virtual tables | 🟢 Generally available |
Compute pushdown | 🟢 Generally available |
External models | 🟢 Generally available |
The Databricks connector now offers enhanced functionality when using virtual tables to expose the features of Delta Lake and Apache Iceberg. Refer to the Virtual Tables section of this documentation for information and details on how to configure the connector to enable this functionality.
Learn more about setting up a connector in Foundry.
The following configuration options are available for the Databricks connector:
Option | Required? | Description |
---|---|---|
Hostname | Yes | The hostname of the Databricks workspace. |
HTTP Path | Yes | The Databricks compute resource’s HTTP Path value. |
Please refer to the official Databricks documentation ↗ for information on how to obtain these values.
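The values above map directly onto the parameters used by Databricks client libraries. As a quick, hedged way to confirm them before configuring the source, the sketch below uses the open source databricks-sql-connector Python package (not part of Foundry); the hostname, HTTP path, and token shown are placeholders.

```python
# Minimal connectivity check with the open source databricks-sql-connector package.
# pip install databricks-sql-connector
# The hostname, HTTP path, and token below are placeholders; substitute your own values.
from databricks import sql

with sql.connect(
    server_hostname="adb-5555555555555555.19.azuredatabricks.net",  # the Hostname option
    http_path="/sql/1.0/warehouses/abcdef1234567890",               # the HTTP Path option
    access_token="<personal-access-token>",                         # see the authentication methods below
) as connection:
    with connection.cursor() as cursor:
        cursor.execute("SELECT 1")
        print(cursor.fetchone())  # (1,) confirms the workspace and compute resource are reachable
```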
You can authenticate with Databricks in the following ways:
Method | Description | Documentation |
---|---|---|
Basic authentication [Legacy] | Authenticate with a user account using username and password. Basic authentication is legacy and not recommended in production. | Basic authentication ↗ |
OAuth machine-to-machine | Authenticate as a service principal using OAuth. Create a service principal in Databricks and generate an OAuth secret to obtain a client ID and secret. | OAuth for service principals (OAuth M2M) ↗ |
Personal access token | Authenticate as a user or service principal using a personal access token. | Personal access tokens (PAT) ↗ |
Workload identity federation [Recommended] | Authenticate as a service principal using workload identity federation. Workload identity federation allows workloads running in Foundry to access Databricks APIs without the need for Databricks secrets. Create a service principal federation policy in Databricks and follow the displayed instructions to allow the source to securely authenticate as a service principal. | Databricks OAuth token federation ↗. Refer to our OIDC documentation for an overview of how OpenID Connect (OIDC) is supported in Foundry. |
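For OAuth machine-to-machine authentication, the service principal's client ID and secret are exchanged for a short-lived access token using the standard client credentials flow. The hedged sketch below illustrates that exchange with the requests library; the token endpoint path and all-apis scope follow Databricks' published OAuth M2M flow but should be verified against the linked documentation, and the client ID and secret are placeholders. Foundry performs this exchange for you once the source is configured.

```python
# Illustrative OAuth M2M token request (client credentials flow) against a Databricks workspace.
# The endpoint path and scope are assumptions to verify against the Databricks OAuth documentation.
import requests

WORKSPACE = "https://adb-5555555555555555.19.azuredatabricks.net"  # placeholder workspace URL

response = requests.post(
    f"{WORKSPACE}/oidc/v1/token",
    auth=("<service-principal-client-id>", "<oauth-secret>"),  # placeholders
    data={"grant_type": "client_credentials", "scope": "all-apis"},
    timeout=30,
)
response.raise_for_status()
access_token = response.json()["access_token"]  # short-lived bearer token for Databricks APIs
```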
If you are using a direct connection for connectivity between Databricks and Foundry, the appropriate egress policies must be added when setting up the source in the Data Connection application. If you are using an agent runtime, the server running the agent must have suitable network access.
The Databricks connector requires network access to the Hostname provided in Configuration options on port 443. This grants access for Foundry to connect to the Databricks workspace and Unity Catalog REST APIs.
The Virtual Tables section of this documentation provides details on external access in Unity Catalog and the functionality it enables. External access requires network connectivity to a table's storage location (managed or external). Egress policies will need to be created for each storage location to benefit from the features enabled by external access.
Refer to the official Databricks documentation ↗ for more information on external access and how to determine the storage locations of tables.
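To work out which storage locations need egress policies, you can look up a table's location through the Unity Catalog API. The hedged sketch below uses the official databricks-sdk Python package; the catalog, schema, table name, and credentials are placeholders.

```python
# Look up a Unity Catalog table's storage location so the matching egress policy can be created.
# pip install databricks-sdk
# Table name and credentials are placeholders.
from databricks.sdk import WorkspaceClient

client = WorkspaceClient(
    host="https://adb-5555555555555555.19.azuredatabricks.net",
    token="<personal-access-token>",
)

table = client.tables.get("my_catalog.my_schema.my_table")
print(table.table_type)        # TableType.MANAGED or TableType.EXTERNAL
print(table.storage_location)  # e.g. abfss://<container>@<account>.dfs.core.windows.net/...
```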
Below we provide example egress policies that may need to be configured to ensure network connectivity to Databricks.
Type | URL | DNS | Port |
---|---|---|---|
Databricks workspace | https://adb-5555555555555555.19.azuredatabricks.net/ | adb-5555555555555555.19.azuredatabricks.net | 443 |
Azure storage location [1] | abfss://<container-name>@<account-name>.dfs.core.windows.net/<table-directory> | <account-name>.dfs.core.windows.net <account-name>.blob.core.windows.net | 443 |
Google Cloud Storage (GCS) storage location | gs://<bucket-path>/<table-directory> | storage.googleapis.com | 443 |
S3 storage location | s3://<bucket-path>/<table-directory> | <bucket-path>.s3.<region>.amazonaws.com | 443 |
[1] Be sure to include both the blob.core.&lt;endpoint&gt; and dfs.core.&lt;endpoint&gt; domains when configuring access to Azure storage locations. The &lt;endpoint&gt; may vary depending on the Azure Cloud environment.
In a limited number of cases (depending on your Foundry and Databricks environments), it may be necessary to establish a connection via PrivateLink. This is typically the case when both Foundry and Databricks are hosted by the same cloud service provider (for example, AWS-AWS or Azure-Azure). If you believe this applies to your setup, contact your Palantir representative for additional guidance.
For egress policies that depend on an S3 bucket in the same region as your Foundry instance, ensure you have completed the additional configuration steps detailed in our Amazon S3 bucket policy documentation for the affected bucket(s).
You may additionally need to pass in a JDBC property to allow self-signed certificates.
How to identify if this property is needed:

- If the Databricks server presents a self-signed certificate and the connection fails certificate verification, add the AllowSelfSignedCerts=1 property.

How to add the property allowing self-signed certificates:

- In the source configuration, add a JDBC property with AllowSelfSignedCerts as the key and 1 as the value.

When the AllowSelfSignedCerts property is set to 1, SSL verification is disabled. In this case, the connector does not verify the server certificate against the trust store, and does not verify whether the server's host name matches the common name or subject alternative names in the server certificate.
This JDBC property and others are outlined in the Databricks driver documentation ↗. The JDBC properties outlined in this documentation are specific to the Databricks driver and will differ from other source types.
The server must provide the full certificate chain in order for SSL verification to work. The certificate chain for the Databricks server can be obtained by running the command openssl s_client -connect {hostname}:{port} -showcerts. To verify the certificate chain, use the OpenSSL command line utility or any other available tool.
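As one way to check whether the presented chain verifies against a standard trust store (and therefore whether AllowSelfSignedCerts is needed at all), the hedged sketch below uses only the Python standard library; the hostname is a placeholder.

```python
# Check whether the Databricks server's certificate chain verifies against the system trust store.
# A verification failure here suggests an incomplete chain or a self-signed certificate.
import socket
import ssl

hostname = "adb-5555555555555555.19.azuredatabricks.net"  # placeholder workspace hostname

context = ssl.create_default_context()  # system trust store, hostname checking enabled
try:
    with socket.create_connection((hostname, 443), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            subject = dict(item[0] for item in tls.getpeercert()["subject"])
            print("Certificate verified for:", subject.get("commonName"))
except ssl.SSLCertVerificationError as error:
    print("Certificate verification failed:", error.verify_message)
```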
Virtual tables allow you to connect to data registered in Databricks Unity Catalog. This allows you to read from and write to tables in Databricks from Foundry, as well as push down compute to Databricks from pipelines in Foundry. This section provides additional details on using virtual tables with Databricks; it is not applicable when syncing to Foundry datasets.
The Databricks connector now offers enhanced functionality when using virtual tables to expose the features of Delta Lake and Apache Iceberg. This functionality requires external access to be enabled in Unity Catalog. When enabled, external access allows Foundry to access tables using the Unity REST API and Iceberg REST catalog, and read and write data in the underlying storage locations. Unity Catalog credential vending is used to ensure secure access to cloud object storage. In addition to enhanced functionality, this can also improve the performance of reads and writes against these tables.
Refer to the official Databricks documentation ↗ for more information on external access and how to determine the storage locations of tables. Refer to the Networking section of this documentation for details on enabling network access to storage locations.
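When external access is enabled, Unity Catalog credential vending issues short-lived, scoped credentials for a table's storage location. The hedged sketch below illustrates the shape of that exchange against the Unity Catalog REST API; the endpoint path, payload, and table ID are assumptions drawn from Databricks' credential vending documentation and should be verified there. Foundry performs this exchange automatically.

```python
# Illustration of Unity Catalog credential vending: exchanging a table ID for short-lived
# storage credentials. The endpoint path and payload are assumptions to verify against the
# Databricks documentation; Foundry handles this for you when external access is enabled.
import requests

WORKSPACE = "https://adb-5555555555555555.19.azuredatabricks.net"  # placeholder workspace URL
TOKEN = "<oauth-or-personal-access-token>"                         # placeholder
TABLE_ID = "<unity-catalog-table-uuid>"                            # placeholder; returned by the Tables API

response = requests.post(
    f"{WORKSPACE}/api/2.1/unity-catalog/temporary-table-credentials",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"table_id": TABLE_ID, "operation": "READ"},
    timeout=30,
)
response.raise_for_status()
credentials = response.json()
# The response contains cloud-specific short-lived credentials (for example, an Azure SAS
# token or AWS temporary keys) along with the table's storage URL.
```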
If external access is not enabled, or if the format of the Unity Catalog object is not supported (for example, views or materialized views), connections to Databricks will be made using JDBC. JDBC is the same mechanism used for syncs. Refer to the official Databricks documentation ↗ for more information on JDBC connectivity to Databricks.
The table below highlights the virtual table capabilities that are supported for Databricks.
Capability | Status |
---|---|
Bulk registration | 🟡 Beta |
Automatic registration | 🟢 Generally available |
Table inputs | 🟢 Generally available: Code Repositories, Pipeline Builder |
Table outputs | 🟢 Generally available: Code Repositories, Pipeline Builder |
Incremental pipelines | 🟢 Generally available [2] |
Compute pushdown | 🟢 Generally available |
Consult the virtual tables documentation for details on the supported Foundry workflows where Databricks tables can be used as inputs or outputs. Functionality may vary depending on whether external access is enabled.
The following table provides a summary of the supported formats and workflows when external access is or is not enabled.
Unity Catalog object | External access required | Format | Table inputs | Table outputs |
---|---|---|---|---|
Managed table | Yes | Avro ↗, Delta ↗, Parquet ↗ | ✔️ | |
Managed table | Yes | Iceberg ↗ | ✔️ | ✔️ |
External table | Yes | Delta | ✔️ | ✔️ |
External table | Yes | Avro, Parquet | ✔️ | |
Managed table | No | Table ↗, View ↗, Materialized view | ✔️ | |
External table | No | Table, View, Materialized view | ✔️ | |
[2] To enable incremental support for Spark pipelines backed by Databricks virtual tables, external access must be enabled; incremental computation requires the ability to directly interact with Delta or Iceberg tables. Incremental compute on top of Delta tables relies on Change Data Feed ↗. Incremental compute on top of Iceberg tables relies on Incremental Reads ↗.
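For reference, the hedged sketch below shows what an incremental read over a Delta table's Change Data Feed looks like in plain PySpark as run on Databricks; it illustrates the underlying mechanism only and is not the Foundry transforms API. The table name and starting version are placeholders, CDF must be enabled on the table, and an active SparkSession named spark is assumed (as in a Databricks notebook).

```python
# Illustration of the Delta Change Data Feed mechanism that incremental reads rely on.
# Plain PySpark on Databricks, not the Foundry transforms API.
# Table name and starting version are placeholders; CDF must be enabled on the table.
changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 10)           # read only changes committed after version 10
    .table("my_catalog.my_schema.my_table")
)

# Each row carries _change_type (insert, update_preimage, update_postimage, delete),
# _commit_version, and _commit_timestamp metadata columns.
changes.filter("_change_type != 'update_preimage'").show()
```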
When using virtual tables, remember the following source configuration requirements:
- A Databricks compute resource must be specified in the HTTP path field. Refer to the official Databricks documentation ↗ for more information on how to get connection details for a Databricks compute resource. See the Connection Details section above for more details.
Foundry offers the ability to push down compute to Databricks when using virtual tables. When using Databricks virtual tables registered to the same source as inputs and outputs to a pipeline, it is possible to fully federate compute to Databricks. This capability leverages Databricks Connect ↗ and is currently available in Python transforms. See the Python documentation for details on how to push down compute to Databricks.
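Outside of Foundry, Databricks Connect itself looks like the hedged sketch below: a Spark session whose work executes remotely on Databricks compute. It is illustrative only; in Foundry, the session is provided by the Python transforms integration described in the linked documentation. The host, token, cluster ID, and table name are placeholders.

```python
# Standalone Databricks Connect example (illustrative; not the Foundry transforms API).
# pip install databricks-connect
# Host, token, cluster ID, and table name are placeholders.
from databricks.connect import DatabricksSession

spark = (
    DatabricksSession.builder.remote(
        host="https://adb-5555555555555555.19.azuredatabricks.net",
        token="<personal-access-token>",
        cluster_id="<cluster-id>",
    ).getOrCreate()
)

# The aggregation below is planned locally but executed on the Databricks cluster.
orders = spark.table("my_catalog.my_schema.orders")
orders.groupBy("status").count().show()
```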
Databricks models registered in Unity Catalog can be integrated into Foundry as external models.
Refer to the official Databricks documentation ↗ for more information on making models available in Unity Catalog, and to the guide on setting up Databricks external models in Foundry.
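For context, a served Databricks model is typically invoked over its model serving endpoint. The hedged sketch below shows that call pattern directly against Databricks; the endpoint name, token, and input schema are placeholders, and Foundry's external model integration issues these calls for you once the model is configured.

```python
# Direct invocation of a Databricks model serving endpoint (illustrative only).
# Endpoint name, token, and input schema are placeholders.
import requests

WORKSPACE = "https://adb-5555555555555555.19.azuredatabricks.net"  # placeholder workspace URL
ENDPOINT = "my-serving-endpoint"                                   # placeholder serving endpoint name
TOKEN = "<oauth-or-personal-access-token>"                         # placeholder

response = requests.post(
    f"{WORKSPACE}/serving-endpoints/{ENDPOINT}/invocations",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"dataframe_records": [{"feature_a": 1.0, "feature_b": "x"}]},
    timeout=60,
)
response.raise_for_status()
print(response.json())  # model predictions
```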