Connect Foundry to Google Cloud Storage to sync files between Foundry datasets and storage buckets.
| Capability | Status |
| --- | --- |
| Exploration | 🟢 Generally available |
| Bulk import | 🟢 Generally available |
| Incremental | 🟢 Generally available |
| Virtual tables | 🟢 Generally available |
| Export tasks | 🟡 Sunset |
| File exports | 🟢 Generally available |
The connector can transfer files of any type into Foundry datasets. File formats are preserved, and no schemas are applied during or after the transfer. Apply any necessary schema to the output dataset, or write a downstream transformation to access the data.
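If ingested files need structure downstream, a Python transform can parse them with the dataset filesystem API. The following is a minimal sketch assuming the ingested files are CSVs; the dataset paths are placeholders, not real Foundry paths.

```python
# A minimal sketch of a downstream transform, assuming the ingested files
# are CSVs. The dataset paths below are placeholders.
import csv

from pyspark.sql import Row
from transforms.api import transform, Input, Output


@transform(
    raw=Input("/Project/gcs_ingested_files"),  # placeholder path
    parsed=Output("/Project/parsed_dataset"),  # placeholder path
)
def parse_raw_csvs(ctx, raw, parsed):
    rows = []
    fs = raw.filesystem()
    # List every CSV in the schemaless input dataset and parse it.
    for status in fs.ls(glob="*.csv"):
        with fs.open(status.path) as fh:
            rows.extend(Row(**r) for r in csv.DictReader(fh))
    parsed.write_dataframe(ctx.spark_session.createDataFrame(rows))
```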
There is no limit to the size of transferable files. However, network issues can result in failures of large-scale transfers. In particular, direct cloud syncs that take more than two days to run will be interrupted. To avoid network issues, we recommend using smaller file sizes and limiting the number of files that are ingested in every execution of the sync. Syncs can be scheduled to run frequently.
Learn more about setting up a connector in Foundry.
You must have a Google Cloud IAM service account ↗ to proceed with Google Cloud Storage authentication and setup.
The following roles are required on the bucket being accessed:

- **Storage Object Viewer**: Read data.
- **Storage Object Creator**: Export data to Google Cloud Storage.
- **Storage Object Admin**: Required for deleting files from Google Cloud Storage after importing them into Foundry, and for exports of incremental datasets that use UPDATE transactions and overwrite files.

Learn more about required roles in the Google Cloud documentation on access control ↗.
Choose from one of the available authentication methods:
- **GCP instance**: Refer to the Google Cloud documentation ↗ for information on how to set up instance-based authentication.
- **JSON credentials**: Refer to the Google Cloud documentation ↗ for information on how to create and download a JSON service account key file.
- **PKCS8 auth**: Requires entering specific credential information from the JSON service account key file. Refer to the Google Cloud documentation ↗ for information on creating the key file.
- **Workload Identity Federation (OIDC)**: Follow the displayed source system configuration instructions to set up OIDC. Refer to the Google Cloud documentation ↗ for details on Workload Identity Federation, and our documentation for details on how OIDC works with Foundry.
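If you use JSON credentials, you can optionally sanity-check the key file and bucket roles before configuring the connector. The sketch below uses the google-cloud-storage Python client library; the key file name and bucket name are placeholders.

```python
# Optional local sanity check for a JSON service account key, using the
# google-cloud-storage client library. "key.json" and "my-bucket" are
# placeholders for your own key file and bucket name.
from google.cloud import storage

client = storage.Client.from_service_account_json("key.json")
bucket = client.bucket("my-bucket")

# Storage Object Viewer should allow listing objects.
for blob in client.list_blobs("my-bucket", max_results=5):
    print(blob.name)

# Storage Object Creator should allow writing a test object.
bucket.blob("foundry-connectivity-check.txt").upload_from_string("ok")
```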
The Google Cloud Storage connector requires network access to the following domains on port 443:
- `storage.googleapis.com`
- `oauth2.googleapis.com` (only required when using JSON credentials or PKCS8 auth)
- `sts.googleapis.com` (only required when using Workload Identity Federation)
- `iamcredentials.googleapis.com` (only required when using Workload Identity Federation)

The following configuration options are available for the Google Cloud Storage connector:
| Option | Required? | Description |
| --- | --- | --- |
| Project Id | Yes | The ID of the project containing the Cloud Storage bucket. |
| Bucket name | Yes | The name of the bucket to read data from and write data to. |
| Credentials settings | Yes | Configure using the authentication guidance shown above. |
| Proxy settings | No | Enable to use a proxy while connecting to Google Cloud Storage. |
The Google Cloud Storage connector uses the file-based sync interface. See documentation on configuring file-based syncs.
This section provides additional details about using virtual tables from a Google Cloud Storage source. This section is not applicable when syncing to Foundry datasets.
The table below highlights the virtual table capabilities that are supported for Google Cloud Storage.
| Capability | Status |
| --- | --- |
| Bulk registration | 🔴 Not available |
| Automatic registration | 🔴 Not available |
| Table inputs | 🟢 Generally available: Avro ↗, Delta ↗, Iceberg ↗, Parquet ↗ in Code Repositories, Pipeline Builder |
| Table outputs | 🔴 Not available |
| Incremental pipelines | 🟢 Generally available for Delta tables: APPEND only (details)<br>🟢 Generally available for Iceberg tables: APPEND only (details)<br>🔴 Not available for Parquet tables |
| Compute pushdown | 🔴 Not available |
Consult the virtual tables documentation for details on the supported Foundry workflows where tables stored in Google Cloud Storage can be used as inputs or outputs.
When using virtual tables, remember the following source configuration requirements:

- The source must be configured with **JSON credentials**, **PKCS8 auth**, or **Workload Identity Federation (OIDC)**. Other credential options are not supported when using virtual tables.

To enable incremental support for pipelines backed by virtual tables, ensure that Change Data Feed ↗ is enabled on the source Delta table. The `current` and `added` read modes in Python Transforms are supported, as sketched below. The `_change_type`, `_commit_version`, and `_commit_timestamp` columns will be made available in Python Transforms.
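A minimal sketch of such an incremental transform follows; the dataset paths are placeholders.

```python
# A minimal sketch of an incremental Python transform reading a Delta-backed
# virtual table with Change Data Feed enabled. Dataset paths are placeholders.
from transforms.api import transform, incremental, Input, Output


@incremental()
@transform(
    source=Input("/Project/delta_virtual_table"),  # placeholder path
    out=Output("/Project/processed"),              # placeholder path
)
def process_new_rows(source, out):
    # "added" reads only the rows appended since the previous build;
    # use "current" to read the full table instead.
    df = source.dataframe("added")
    # The CDF metadata columns are available alongside the table's own columns.
    out.write_dataframe(
        df.drop("_change_type", "_commit_version", "_commit_timestamp")
    )
```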
An Iceberg catalog is required to load virtual tables backed by an Apache Iceberg table. To learn more about Iceberg catalogs, see the Apache Iceberg documentation ↗. All Iceberg tables registered on a source must use the same Iceberg catalog.
Tables will be created using Iceberg metadata files in GCS. A `warehousePath` indicating the location of these metadata files must be provided when registering a table.
Incremental support relies on Iceberg Incremental Reads ↗ and is currently append-only. The `current` and `added` read modes in Python Transforms are supported.
Virtual tables using Parquet rely on schema inference. At most 100 files will be used to determine the schema.
The connector can copy files from a Foundry dataset to any location in the Google Cloud Storage bucket.
To begin exporting data, you must configure an export task. Navigate to the Project folder that contains the Google Cloud Storage connector to which you want to export. Right-click the connector name, then select **Create Data Connection Task**.
In the left panel of the Data Connection view:

- Verify that the **Source** name matches the connector you want to use.
- Add an **Input** named `inputDataset`. The input dataset is the Foundry dataset being exported.
- Add an **Output** named `outputDataset`. The output dataset is used to run, schedule, and monitor the task.

The labels for the connector and input dataset that appear in the left side panel do not reflect the names defined in the YAML.
Use the following options when creating the export task YAML:
| Option | Required? | Description |
| --- | --- | --- |
| `directoryPath` | Yes | The directory in Cloud Storage where files will be written. |
| `excludePaths` | No | A list of regular expressions; files with names matching these expressions will not be exported. |
| `uploadConfirmation` | No | When the value is `exportedFiles`, the output dataset will contain a list of the files that were exported. |
| `retriesPerFile` | No | If experiencing network failures, increase this number to allow the export job to retry uploads to Cloud Storage before failing the entire job. |
| `createTransactionFolders` | No | When enabled, data will be written to a subfolder within the specified `directoryPath`. Each subfolder has a unique name based on the time the transaction was committed in Foundry. |
| `threads` | No | The number of threads used to upload files in parallel. Increase the number to use more resources; ensure that exports running on agents have enough agent resources to handle the increased parallelization. |
| `incrementalType` | No | For datasets that are built incrementally, set to `incremental` to export only transactions that occurred since the previous export. |
Example task configuration:
```yaml
type: export-google-cloud-storage
directoryPath: directory/to/export/to
excludePaths:
  - ^_.*
  - ^spark/_.*
uploadConfirmation: exportedFiles
incrementalType: incremental
retriesPerFile: 0
createTransactionFolders: true
threads: 0
```
After you configure the export task, select **Save** in the upper right corner.