Databricks

Connect Foundry to Databricks to leverage a range of capabilities on top of data, compute, and models available within Databricks.

Supported capabilities

| Capability | Status |
| --- | --- |
| Exploration | 🟢 Generally available |
| Bulk import | 🟢 Generally available |
| Incremental | 🟢 Generally available |
| Virtual tables | 🟢 Generally available |
| Compute pushdown | 🟢 Generally available |
| External models | 🟢 Generally available |

The Databricks connector now offers enhanced functionality when using virtual tables to expose the features of Delta Lake and Apache Iceberg. Refer to the Virtual tables section of this documentation for details on how to configure the connector to enable this functionality.

Setup

  1. Open the Data Connection application and select + New Source in the upper right corner of the screen.
  2. Select Databricks from the available connector types.
  3. Choose to use a direct connection over the Internet or to connect through an intermediary agent.
  4. Follow the additional configuration prompts to continue the setup of your connector using the information in the sections below.

Learn more about setting up a connector in Foundry.

Connection details

The following configuration options are available for the Databricks connector:

| Option | Required? | Description |
| --- | --- | --- |
| Hostname | Yes | The hostname of the Databricks workspace. |
| HTTP Path | Yes | The HTTP Path value of the Databricks compute resource. |

Please refer to the official Databricks documentation ↗ for information on how to obtain these values.
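
As a point of reference, the hypothetical sketch below shows the typical shape of these values; the warehouse ID is a placeholder, and the real values should be copied from the Connection details tab of your Databricks compute resource.

```python
# Hypothetical example values only; copy the real Hostname and HTTP Path from
# the Connection details tab of your Databricks compute resource.
connection_details = {
    "hostname": "adb-5555555555555555.19.azuredatabricks.net",
    # SQL warehouses typically expose an HTTP Path of this shape; all-purpose
    # clusters use a /sql/protocolv1/... path instead.
    "http_path": "/sql/1.0/warehouses/<warehouse-id>",
}
```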

Authentication

You can authenticate with Databricks in the following ways:

| Method | Description | Documentation |
| --- | --- | --- |
| Basic authentication [Legacy] | Authenticate with a user account using a username and password. Basic authentication is legacy and not recommended for production. | Basic authentication ↗ |
| OAuth machine-to-machine | Authenticate as a service principal using OAuth. Create a service principal in Databricks and generate an OAuth secret to obtain a client ID and secret. | OAuth for service principals (OAuth M2M) ↗ |
| Personal access token | Authenticate as a user or service principal using a personal access token. | Personal access tokens (PAT) ↗ |
| Workload identity federation [Recommended] | Authenticate as a service principal using workload identity federation, which allows workloads running in Foundry to access Databricks APIs without the need for Databricks secrets. Create a service principal federation policy in Databricks and follow the displayed instructions to allow the source to securely authenticate as a service principal. | Databricks OAuth token federation ↗ |

Refer to our OIDC documentation for an overview of how OpenID Connect (OIDC) is supported in Foundry.
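
If you want to verify OAuth machine-to-machine credentials outside Foundry before configuring the source, a minimal sketch along the following lines can help; it assumes the Databricks workspace-level token endpoint and the all-apis scope described in the Databricks OAuth M2M documentation, and every value shown is a placeholder.

```python
# Sanity-check OAuth M2M (service principal) credentials against Databricks.
# Hostname, client ID, and client secret are placeholders.
import requests

HOSTNAME = "adb-5555555555555555.19.azuredatabricks.net"

response = requests.post(
    f"https://{HOSTNAME}/oidc/v1/token",
    auth=("<client-id>", "<client-secret>"),
    data={"grant_type": "client_credentials", "scope": "all-apis"},
    timeout=30,
)
response.raise_for_status()
print("Token acquired; expires in", response.json()["expires_in"], "seconds")
```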

Networking

If you are using a direct connection for connectivity between Databricks and Foundry, the appropriate egress policies must be added when setting up the source in the Data Connection application. If you are using an agent runtime, the server running the agent must have suitable network access.

The Databricks connector requires network access on port 443 to the hostname provided in the connection details. This allows Foundry to connect to the Databricks workspace and Unity Catalog REST APIs.
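
For agent-based setups, a quick TCP check from the server running the agent can confirm that the workspace hostname is reachable on port 443; the sketch below is illustrative only and uses a placeholder hostname.

```python
# Minimal connectivity check from the agent host to the Databricks workspace.
# Substitute your workspace hostname for the placeholder value.
import socket

HOSTNAME = "adb-5555555555555555.19.azuredatabricks.net"

try:
    with socket.create_connection((HOSTNAME, 443), timeout=5):
        print(f"TCP connection to {HOSTNAME}:443 succeeded")
except OSError as err:
    print(f"Cannot reach {HOSTNAME}:443 - check DNS and firewall rules: {err}")
```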

External access to storage locations (virtual tables only)

The Virtual Tables section of this documentation provides details on external access in Unity Catalog and the functionality it enables. External access requires network connectivity to a table's storage location (managed or external). Egress policies will need to be created for each storage location to benefit from the features enabled by external access.

Refer to the official Databricks documentation ↗ for more information on external access and how to determine the storage locations of tables.

Examples

Below we provide example egress policies that may need to be configured to ensure network connectivity to Databricks.

| Type | URL | DNS | Port |
| --- | --- | --- | --- |
| Databricks workspace | https://adb-5555555555555555.19.azuredatabricks.net/ | adb-5555555555555555.19.azuredatabricks.net | 443 |
| Azure storage location [1] | abfss://<container-name>@<account-name>.dfs.core.windows.net/<table-directory> | <account-name>.dfs.core.windows.net, <account-name>.blob.core.windows.net | 443 |
| Google Cloud Storage (GCS) storage location | gs://<bucket-path>/<table-directory> | storage.googleapis.com | 443 |
| S3 storage location | s3://<bucket-path>/<table-directory> | <bucket-path>.s3.<region>.amazonaws.com | 443 |

[1] Be sure to include both the blob.core.<endpoint> and dfs.core.<endpoint> domains when configuring access to Azure storage locations. The <endpoint> suffix may vary depending on the Azure cloud environment.

In a limited number of cases (depending on your Foundry and Databricks environments), it may be necessary to establish a connection via PrivateLink. This is typically the case when both Foundry and Databricks are hosted by the same CSP (for example, AWS-AWS or Azure-Azure). If you believe this applies to your setup, contact your Palantir representative for additional guidance.

For egress policies that depend on an S3 bucket in the same region as your Foundry instance, ensure you have completed the additional configuration steps detailed in our Amazon S3 bucket policy documentation for the affected bucket(s).

More options: SSL and hostname validation

You may additionally need to pass in a JDBC property to allow self-signed certificates.

How to identify if this property is needed:

  • SSL connections validate server certificates. Normally, SSL validation happens through a certificate chain. By default, both agent and direct connection runtimes trust most industry-standard certificate chains.
  • If the server to which you are connecting has a self-signed certificate, or if a firewall performs TLS interception on the connection, the connector must trust the certificate. Learn more about using certificates in agent-based connections.
  • If you are creating a direct connection and are using a self-signed certificate, you will need to add the AllowSelfSignedCerts=1 JDBC property.

How to add the property allowing self-signed certificates:

  • At the bottom of the Connection details page, under Connection settings, select More options, then JDBC properties.
  • Under JDBC properties configuration, select Add property, then New property, then enter AllowSelfSignedCerts as the key and 1 as the value.

When the AllowSelfSignedCerts property is set to 1, SSL verification is disabled. In this case, the connector does not verify the server certificate against the trust store, nor does it check that the server's hostname matches the common name or subject alternative names in the server certificate.

This JDBC property and others are outlined in the Databricks driver documentation ↗. The JDBC properties outlined in this documentation are specific to the Databricks driver and will differ from other source types.

The server must provide the full certificate chain in order for SSL verification to work. The certificate chain for the Databricks server can be obtained by running the command openssl s_client -connect {hostname}:{port} -showcerts. To verify the certificate chain, use the OpenSSL command line utility or any other available tool.
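
As an optional aid, the hedged sketch below performs the same check from Python against the system trust store; a verification failure suggests that the certificate needs to be added to the agent trust store or that the AllowSelfSignedCerts=1 property is required. The hostname is a placeholder.

```python
# Check whether the Databricks server certificate validates against the
# system trust store. The hostname is a placeholder.
import socket
import ssl

HOSTNAME = "adb-5555555555555555.19.azuredatabricks.net"

context = ssl.create_default_context()  # uses the system trust store
try:
    with socket.create_connection((HOSTNAME, 443), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=HOSTNAME) as tls:
            print("Certificate validated; issuer:", tls.getpeercert()["issuer"])
except ssl.SSLCertVerificationError as err:
    # Self-signed certificate or TLS interception by a firewall; the trust
    # store or AllowSelfSignedCerts=1 configuration described above applies.
    print("Certificate verification failed:", err)
```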

Virtual tables

Virtual tables allow you to connect to data registered in Databricks Unity Catalog. You can read from and write to Databricks tables from Foundry, as well as push down compute to Databricks from Foundry pipelines. This section provides additional details on using virtual tables with Databricks and is not applicable when syncing to Foundry datasets.

The Databricks connector now offers enhanced functionality when using virtual tables to expose the features of Delta Lake and Apache Iceberg. This functionality requires external access to be enabled in Unity Catalog. When enabled, external access allows Foundry to access tables using the Unity REST API and Iceberg REST catalog, and read and write data in the underlying storage locations. Unity Catalog credential vending is used to ensure secure access to cloud object storage. In addition to enhanced functionality, this can also improve the performance of reads and writes against these tables.

Refer to the official Databricks documentation ↗ for more information on external access and how to determine the storage locations of tables. Refer to the Networking section of this documentation for details on enabling network access to storage locations.

If external access is not enabled, or if the format of the Unity Catalog object is not supported (for example, views or materialized views), connections to Databricks will be made using JDBC. JDBC is the same mechanism used for syncs. Refer to the official Databricks documentation ↗ for more information on JDBC connectivity to Databricks.
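
For an outside-Foundry illustration of this SQL-warehouse access path, the sketch below uses the Databricks SQL Connector for Python, which is analogous to the JDBC mechanism described above but is not the driver Foundry uses internally; all connection values and object names are placeholders.

```python
# Query a Unity Catalog object through a SQL warehouse. This mirrors the
# JDBC-style access path; every value below is a placeholder.
from databricks import sql  # pip install databricks-sql-connector

with sql.connect(
    server_hostname="adb-5555555555555555.19.azuredatabricks.net",
    http_path="/sql/1.0/warehouses/<warehouse-id>",
    access_token="<personal-access-token>",
) as connection:
    with connection.cursor() as cursor:
        cursor.execute("SELECT * FROM <catalog>.<schema>.<view> LIMIT 10")
        for row in cursor.fetchall():
            print(row)
```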

The table below highlights the virtual table capabilities that are supported for Databricks.

| Capability | Status |
| --- | --- |
| Bulk registration | 🟡 Beta |
| Automatic registration | 🟢 Generally available |
| Table inputs | 🟢 Generally available: Code Repositories, Pipeline Builder |
| Table outputs | 🟢 Generally available: Code Repositories, Pipeline Builder |
| Incremental pipelines | 🟢 Generally available [2] |
| Compute pushdown | 🟢 Generally available |

Consult the virtual tables documentation for details on the supported Foundry workflows where Databricks tables can be used as inputs or outputs. Functionality may vary depending on whether external access is enabled.

The following table provides a summary of the supported formats and workflows when external access is or is not enabled.

| Unity Catalog object | External access required | Format | Table inputs | Table outputs |
| --- | --- | --- | --- | --- |
| Managed table | Yes | Avro ↗, Delta ↗, Parquet ↗ | ✔️ | |
| Managed table | Yes | Iceberg ↗ | ✔️ | ✔️ |
| External table | Yes | Delta | ✔️ | ✔️ |
| External table | Yes | Avro, Parquet | ✔️ | |
| Managed table | No | Table ↗, View ↗, Materialized view | ✔️ | |
| External table | No | Table, view, materialized view | ✔️ | |

[2] To enable incremental support for Spark pipelines backed by Databricks virtual tables, external access must be enabled; incremental computation requires the ability to directly interact with Delta or Iceberg tables. Incremental compute on top of Delta tables relies on Change Data Feed ↗. Incremental compute on top of Iceberg tables relies on Incremental Reads ↗.
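
For background on the Delta mechanism referenced in this footnote, a Change Data Feed read in Spark looks roughly like the sketch below; this shows the underlying Databricks/Delta API rather than how Foundry incremental transforms are authored, and the table name and version numbers are placeholders.

```python
# Illustrative Delta Change Data Feed read. Foundry manages incremental state
# for virtual-table pipelines itself; this only shows the underlying mechanism.
# Assumes a Spark session attached to Databricks compute; placeholders throughout.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 5)   # first commit version not yet processed
    .option("endingVersion", 10)    # optional upper bound
    .table("<catalog>.<schema>.<table>")
)
changes.select("_change_type", "_commit_version", "_commit_timestamp").show()
```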

Source configuration requirements

When using virtual tables, remember the following source configuration requirements:

  • You must set up the source as a direct connection. Virtual tables do not support use of intermediary agents.
  • Ensure that bi-directional connectivity and allowlisting are established as described in the Networking section of this documentation, including the recommended networking to storage locations.
  • If using virtual tables in Code Repositories, refer to the Virtual Tables documentation for details of additional source configuration required.
  • You must specify a warehouse in the connection details, using the HTTP path field. Refer to the official Databricks documentation ↗ for more information on how to get connection details for a Databricks compute resource.
  • The credentials provided must have usage privileges on the warehouse.

See the Connection Details section above for more details.

Compute pushdown

Foundry offers the ability to push down compute to Databricks when using virtual tables. When using Databricks virtual tables registered to the same source as inputs and outputs to a pipeline, it is possible to fully federate compute to Databricks. This capability leverages Databricks Connect ↗ and is currently available in Python transforms. See the Python documentation for details on how to push down compute to Databricks.
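
For context on the underlying mechanism, a Databricks Connect session runs standard PySpark DataFrame code on Databricks compute, roughly as in the hedged sketch below; within Foundry the session wiring is handled by the platform when pushdown is configured, so this is only an outside-Foundry illustration with placeholder values.

```python
# Standalone Databricks Connect illustration (pip install databricks-connect).
# In Foundry the session is provided by the platform when pushdown is enabled;
# all values below are placeholders.
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.remote(
    host="https://adb-5555555555555555.19.azuredatabricks.net",
    token="<personal-access-token>",
    cluster_id="<cluster-id>",
).getOrCreate()

df = spark.table("<catalog>.<schema>.<table>")
summary = df.groupBy("<column>").count()  # executed on Databricks compute
summary.show()
```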

External models

Databricks models registered in Unity Catalog can be integrated with Foundry as external models.

Refer to the official Databricks documentation ↗ for more information on making models available in Unity Catalog, and to the guide on setting up Databricks external models in Foundry.