After creating a file-based sync using exploration, you can update the configuration in the Configurations tab of the sync page.
While Foundry file-based syncs offer low-level settings for greater flexibility and configuration, most use cases will follow a known mode. The following table documents known modes and the low-level settings required to achieve the desired behavior, as well as settings that could be contradictory with those modes.
Mode | Transaction type | Filters | Description | Contradictory settings |
---|---|---|---|---|
SNAPSHOT (default) | SNAPSHOT | None | Each run will ingest all files nested in the external system's subdirectory, including files ingested in previous runs, and commit a SNAPSHOT transaction to the output dataset containing exactly those files. The output Foundry dataset view will contain a single SNAPSHOT transaction containing all files. | Exclude files already synced; Limit number of files; At least N files (if there are fewer than N nested files in the specified subfolder of the external system, this setting will yield an empty transaction and result in 0 files being ingested. Otherwise, this setting has no effect.) |
APPEND | APPEND | Exclude files already synced | The output dataset view will contain a collection of APPEND transactions, which in aggregate contain all nested files ever present during any job run from the specified subfolder. Each run will ingest all files that have not yet been ingested, keyed by file path name, and commit an APPEND transaction to the output dataset. | Exclude files already synced with the Last modified date or File size option (files are ingested again in an APPEND transaction when their Last modified date or File size change, respectively. To allow updates to existing files, review the incremental with UPDATE ingestion mode.) |
UPDATE | UPDATE | Exclude files already synced; Exclude files already synced with the Last modified date option; Exclude files already synced with the File size option | The output dataset view will contain a collection of UPDATE transactions, which in aggregate contain the latest version of all nested files ever present during any job run from the specified subfolder. Each run will ingest all files that have not yet been ingested or have since changed, keyed by file path name, and commit an UPDATE transaction to the output dataset. Only use this mode if modifications to existing files are a non-negotiable behavior of the external system. While ingestion is incremental in the sense that only files that are new or changed are ingested in a given run, downstream pipelines cannot run incrementally, as the output dataset (input to the downstream pipelines) is not append-only. | |
SNAPSHOT | SNAPSHOT | Exclude files already synced | The output dataset view will contain a single SNAPSHOT transaction containing only files that were never present in any previous job run. Each run will ingest all files that have not yet been ingested, keyed by file path name, and commit a SNAPSHOT transaction to the output dataset, containing exactly those files. This mode is useful when only "recent" files (files that were created in the external system between the second-to-last and last run) are relevant to downstream pipelines and operations. Files ingested in previous runs will not be visible in the output dataset view. | Limit number of files |
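The difference between the modes is easier to see with a toy model of how a dataset view is resolved from its transactions. The following Python sketch is illustrative only: the `Transaction` class and `dataset_view` function are hypothetical names, not part of any Foundry API, and the rule that an APPEND may not touch an existing path is an assumption of this model.

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    kind: str    # "SNAPSHOT", "APPEND", or "UPDATE"
    files: dict  # file path -> version marker (stands in for file content)


def dataset_view(transactions):
    """Resolve the files visible in the view from an ordered list of transactions."""
    view = {}
    for txn in transactions:
        if txn.kind == "SNAPSHOT":
            # A SNAPSHOT replaces everything that came before it.
            view = dict(txn.files)
        elif txn.kind == "APPEND":
            # Assumption of this model: an APPEND only adds new paths.
            overlap = view.keys() & txn.files.keys()
            if overlap:
                raise ValueError(f"APPEND would modify existing files: {sorted(overlap)}")
            view.update(txn.files)
        elif txn.kind == "UPDATE":
            # An UPDATE may add new paths and overwrite existing ones.
            view.update(txn.files)
        else:
            raise ValueError(f"Unknown transaction type: {txn.kind}")
    return view


history = [
    Transaction("SNAPSHOT", {"a.csv": 1}),
    Transaction("APPEND", {"b.csv": 1}),
    Transaction("UPDATE", {"a.csv": 2}),
]
print(dataset_view(history))
```

In this model, appending a further SNAPSHOT transaction to `history` leaves only that transaction's files in the view, which mirrors the behavior described for the SNAPSHOT mode combined with Exclude files already synced.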
It is always safe to specify the subfolder and optional regex, in addition to filters that limit the file types desired in the output. Such filters include Last modified after to exclude outdated files or Path does not match to exclude files with a certain file extension, such as .sh executable files.
Only the Exclude files already synced, At least N files, and Limit number of files filters are tightly coupled to the desired sync mode and might interfere with it.
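As a rough illustration of what the Last modified after and Path does not match filters do, the sketch below applies both to an in-memory file listing. The `SourceFile` class and `apply_filters` function are hypothetical, and matching the pattern against the full path with Python's `re.fullmatch` is an assumption; the regex flavor and matching rules used by Data Connection may differ.

```python
import re
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class SourceFile:
    path: str
    last_modified: datetime


def apply_filters(files, modified_after, excluded_path_pattern):
    """Keep files modified after the cutoff whose path does not match the pattern."""
    excluded = re.compile(excluded_path_pattern)
    return [
        f for f in files
        if f.last_modified > modified_after and not excluded.fullmatch(f.path)
    ]


listing = [
    SourceFile("data/a.csv", datetime(2024, 5, 1)),
    SourceFile("data/old.csv", datetime(2020, 1, 1)),   # outdated
    SourceFile("scripts/run.sh", datetime(2024, 5, 1)),  # executable script
]
kept = apply_filters(listing, datetime(2023, 1, 1), r".*\.sh")
print([f.path for f in kept])
```

Neither filter depends on what earlier runs ingested, which is why filters of this kind do not interfere with the sync mode.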
Configuration options for file-based syncs include the following:
Parameter | Required? | Default | Description |
---|---|---|---|
Subfolder | Yes | | Specify the location of files within the connector that will be synced into Foundry. |
Filters | No | | Apply filters to limit the files synced into Foundry. |
Transformers | No | | Apply transformers to data before it is synced into Foundry. |
Completion strategies | No | | Enable to delete files and/or empty parent directories after a successful sync. Requires write permission on the source filesystem. |
Syncs will include all nested files and folders from the specified subfolder.
Filters allow you to filter source files before they are imported into Foundry. The supported filter types are:
Transformers allow you to perform basic file transformations (compression or decryption, for example) before uploading to Foundry. During a sync, the files chosen for ingest will be modified per the chosen transformer.
Rather than using Data Connection transformers, we recommend performing data transformations in Foundry with Pipeline Builder and Code Repositories to benefit from provenance and branching.
The following transformers are supported in Data Connection:
^(.*/) with /.
Completion strategies provide a method of deleting files and empty parent directories after a successful batch sync of those files into a Foundry dataset. This may be useful when data is synced by writing to an intermediate S3 bucket or other file storage system that Foundry reads from. If the data read by Foundry is already a short-lived copy, it is generally safe to delete once the data has been read and successfully written to Foundry.
Completion strategies are subject to several important limitations and caveats. These limitations and potential mitigations or alternatives are described below.
Completion strategies are only supported when using an agent worker runtime. When using a direct connection or agent proxy runtime, we recommend implementing the functionality provided by completion strategies as a downstream external transform instead.
As an example, assume you have a direct connection to an S3 bucket containing the files foo.txt and bar.txt. You want to use a file batch sync to copy them to a dataset, and then delete the files from S3. The recommended way to achieve this does not use completion strategies. Instead, do the following:
Note that this approach is retryable if any deletion calls fail, and guarantees that data is successfully committed to Foundry before attempting to perform any deletions. This approach is also compatible with incremental file batch syncs.
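The retry property noted above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the external transform itself: `delete_synced_files` is a hypothetical helper, `committed_paths` stands for the file paths already committed to the Foundry dataset, and `delete_fn` stands for whatever call removes a single file from the source system.

```python
def delete_synced_files(committed_paths, delete_fn):
    """Delete each committed file from the source; return the paths that still need deleting.

    Only paths already committed to the dataset are passed in, so a failed
    deletion never loses data. It only leaves a file to retry on the next run.
    """
    failed = []
    for path in committed_paths:
        try:
            delete_fn(path)
        except Exception:
            # Best effort: remember the path and keep going with the rest.
            failed.append(path)
    return failed


deleted = []
remaining = delete_synced_files(["foo.txt", "bar.txt"], deleted.append)
print(deleted, remaining)
```

Calling the function again with the returned list retries only the deletions that failed.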
Completion strategies are best effort, meaning that they do not guarantee that data will be effectively removed. The following are some situations that may cause completion strategies to fail:
In general, we recommend using an alternative to completion strategies wherever possible. Custom completion strategies are no longer supported.
This guide is recommended for users setting up a new sync or troubleshooting a slow or unreliable sync. If your sync is already working reliably, you do not need to take any action.
Syncing a large number of files into a single dataset can be challenging for many reasons.
Consider a sync intended to upload a million files. After crawling the source system and uploading all but one file, a network issue causes the entire sync to fail. All of the work done up to that point would be lost because syncs are transactional; if the sync fails, the entire transaction also fails.
Network issues are one of several common causes of sync failure, resulting in hours of lost work and unnecessary load on source systems and agents. Even without network issues or errors, syncing a large number of files can take a long time.
If the dataset grows over time, the time to sync the data as a SNAPSHOT increases. This is because each SNAPSHOT transaction syncs all of the data from the source into Foundry, including data that was already synced in earlier runs. Instead, use syncs that are configured with transaction type APPEND to import your data incrementally. Since you will be syncing smaller, discrete chunks of data, each successful sync acts as a checkpoint; a sync failure will result in a minimal amount of duplicated work rather than requiring a complete re-run. Additionally, your dataset syncs will run faster as you no longer need to upload all of your data for every sync.
APPEND syncs
APPEND transactions require additional configuration to run successfully.
By default, files synced into Foundry are not filtered. However, APPEND syncs require filters to prevent the same files from being imported again. We recommend using the Exclude files already synced and Limit number of files filters to control how many files get imported into Foundry in a single sync. Finally, schedule your sync to remain up to date with your source system.
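To see how these two filters turn one large sync into a series of checkpoints, consider the following simulation. It is a sketch only: `select_batch` and `run_until_caught_up` are hypothetical names, and picking files in sorted path order is an assumption made so the example is deterministic.

```python
def select_batch(source_paths, already_synced, limit):
    """Mimic Exclude files already synced plus Limit number of files for one run."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    new_files = sorted(p for p in source_paths if p not in already_synced)
    return new_files[:limit]


def run_until_caught_up(source_paths, limit):
    """Simulate scheduled runs; each batch stands for one committed APPEND transaction."""
    synced = set()
    batches = []
    while True:
        batch = select_batch(source_paths, synced, limit)
        if not batch:
            break  # nothing new in the source, so the sync is up to date
        batches.append(batch)
        synced.update(batch)  # a committed batch is never uploaded again
    return batches


paths = [f"file_{i}.csv" for i in range(5)]
for number, batch in enumerate(run_until_caught_up(paths, limit=2), start=1):
    print(number, batch)
```

If one run in the series fails, only that run's batch has to be uploaded again, because the files committed by earlier runs are excluded from later ones.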