File-based syncs

After creating a file-based sync using exploration, you can update the configuration in the Configurations tab of the sync page.

Conceptual file-based ingestion modes

While Foundry file-based syncs offer low-level settings for greater flexibility and configuration, most use cases follow a known mode. The following sections document these known modes, the low-level settings required to achieve each desired behavior, and the settings that conflict with each mode.

Batch mirror with SNAPSHOT (default)

  • Transaction type: SNAPSHOT
  • Filters: None

Each run will ingest all files nested in the external system's subdirectory, including files ingested in previous runs, and commit a SNAPSHOT transaction to the output dataset containing exactly those files. The output Foundry dataset view will contain a single SNAPSHOT transaction containing all files.

Contradictory settings

  • Filters: Exclude files already synced
  • Filters: Limit number of files
    • Results in the output Foundry dataset view containing only a non-deterministic subset of the desired files if the limit is lower than the total number of available files.
  • Filters: At least N files
    • If there are not N nested files in the specified subfolder of the external system, this setting will yield an empty transaction and result in 0 files being ingested. Otherwise, this setting has no effect.
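The gating behavior of "At least N files" can be modeled in a few lines. This is an illustrative sketch (the function name is made up), not the actual sync implementation:

```python
def apply_at_least_n(files: list, n: int) -> list:
    # "At least N files": if fewer than n files remain after filtering,
    # nothing is ingested (an empty transaction). Otherwise the filter
    # has no effect and all filtered files pass through.
    return files if len(files) >= n else []
```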

Incremental mirror with APPEND

  • Transaction type: APPEND
  • Filters: Exclude files already synced

The output dataset view will contain a collection of APPEND transactions, which in aggregate contain all nested files ever present during any job run from the specified subfolder. Each run will ingest all files that have not yet been ingested, keyed by file path name, and commit an APPEND transaction to the output dataset.
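As a mental model, the file selection for this mode behaves like a set difference keyed on file path. The following is a simplified sketch (names are illustrative, not the actual sync implementation):

```python
def files_to_append(source_paths: list, already_synced: set) -> list:
    # "Exclude files already synced": only paths never ingested in any
    # previous run are included in the next APPEND transaction.
    return [p for p in source_paths if p not in already_synced]
```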

Contradictory settings

  • Filters: Exclude files already synced with the Last modified date or File size option
    • These options would incorrectly re-ingest existing files, keyed by file path name, in an APPEND transaction whenever their Last modified date or File size changes, respectively. To allow updates to existing files, review the incremental with UPDATE ingestion mode.

Incremental mirror with UPDATE

  • Transaction type: UPDATE
  • Filters: Exclude files already synced
  • One or both of:
    • Filters: Exclude files already synced with the Last modified date option
    • Filters: Exclude files already synced with the File size option

The output dataset view will contain a collection of UPDATE transactions, which in aggregate contain the latest version of all nested files ever present during any job run from the specified subfolder. Each run will ingest all files that have not yet been ingested or have since changed, keyed by file path name, and commit an UPDATE transaction to the output dataset.
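The selection logic for this mode can be modeled by keying each file on its path and comparing its metadata against what was previously synced. A hedged sketch (the function name and the metadata tuple are illustrative):

```python
def files_to_update(source_files: dict, synced: dict) -> list:
    # Both dicts map file path -> (last_modified, size).
    # A file is (re-)ingested if it is new, or if its last-modified
    # date or size differs from the previously synced version.
    return [path for path, meta in source_files.items()
            if synced.get(path) != meta]
```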

Caveat

Only use this mode if modifications to existing files are a non-negotiable behavior of the external system. While ingestion is incremental in the sense that only files that are new or changed are ingested in a given run, downstream pipelines cannot run incrementally, as the output dataset (input to the downstream pipelines) is not append-only.

Trailing window with SNAPSHOT

  • Transaction type: SNAPSHOT
  • Filters: Exclude files already synced

The output dataset view will contain a single SNAPSHOT transaction containing only files that were never present in any previous job run. Each run will ingest all files that have not yet been ingested, keyed by file path name, and commit a SNAPSHOT transaction to the output dataset, containing exactly those files.

This mode is useful when only "recent" files (files that were created in the external system between the second-to-last and last run) are relevant to downstream pipelines and operations. Files ingested in previous runs will not be visible in the output dataset view.

Contradictory settings

  • Filters: Limit number of files
    • When the number of files created in the external system during a given window exceeds the specified limit, a non-deterministic subset of those files will be ingested, and the remainder will be deferred to a subsequent window. This number may grow rapidly over time, destroying the "recency" intended in the output dataset.

It is always safe to specify the subfolder and optional regex, in addition to filters that limit the file types desired in the output. Such filters include Last modified after to exclude outdated files or Path does not match to exclude files with a certain file extension, such as .sh executable files.

Only the Exclude files already synced, At least N files, and Limit number of files filters are tightly coupled to the desired sync mode and might interfere with it.

Configure file-based syncs

Configuration options for file-based syncs include the following:

Parameter             | Required? | Description
Subfolder             | Yes       | Specify the location of files within the connector that will be synced into Foundry.
Filters               | No        | Apply filters to limit the files synced into Foundry.
Transformers          | No        | Apply transformers to data before it is synced into Foundry.
Completion strategies | No        | Enable to delete files and/or empty parent directories after a successful sync. Requires write permission on the source filesystem.

Syncs will include all nested files and folders from the specified subfolder.

Filters

Filters allow you to filter source files before they are imported into Foundry. The supported filter types are:

  • Exclude files already synced: Only sync files that were added or modified in size or date since the last sync.
  • Path matches: Only sync files with a path (relative to the root of the connector) that matches the regular expression.
  • Path does not match: Only sync files with a path (relative to the root of the connector) that does not match the regular expression.
  • Last modified after: Only sync files that have been modified after a specified date and time.
  • File size is between: Only sync files with a size between the specified minimum and maximum byte value.
  • Any file has path matching: If any file has a relative path matching the regular expression, sync all files in the subfolder that are not otherwise filtered.
  • At least N files: Sync all filtered files only if there are at least N files remaining.
  • Limit number of files: Limit the number of files to keep per transaction. This option can increase the reliability of incremental syncs.
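To illustrate how the path-based filters evaluate, the checks can be approximated with Python's re module. The patterns and paths below are examples, and paths are relative to the connector root; this is a simplified model, not the filter implementation itself:

```python
import re

def passes_path_filters(path, must_match=None, must_not_match=None):
    # "Path matches": keep only paths matching the regular expression.
    if must_match and not re.search(must_match, path):
        return False
    # "Path does not match": drop paths matching the regular expression.
    if must_not_match and re.search(must_not_match, path):
        return False
    return True
```

For example, `passes_path_filters("exports/run1/data.csv", must_match=r"\.csv$", must_not_match=r"\.sh$")` keeps the file, while a path ending in `.sh` would be dropped.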

Transformers

Transformers allow you to perform basic file transformations (compression or decryption, for example) before uploading to Foundry. During a sync, the files chosen for ingest will be modified per the chosen transformer.

Rather than using Data Connection transformers, we recommend performing data transformations in Foundry with Pipeline Builder and Code Repositories to benefit from provenance and branching.

The following transformers are supported in Data Connection:

  • Compress with Gzip
  • Concatenate multiple files
    • Join multiple files into a single file.
  • Rename files
    • Replace all occurrences of a given filename substring with a new substring.
    • Drop the directory path from the filename by replacing ^(.*/) with /.
  • Decrypt with PGP
    • Decrypt files that have been encrypted with PGP encryption.
    • Requires that the agent system has PGP keys configured.
    • Unavailable for syncs running on direct connections.
  • Append timestamp to filenames
    • Add a timestamp in a custom format to the filename of each file ingested.
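The two rename behaviors above can be expressed with plain string and regex operations. This sketch is for illustration only (the function names are made up):

```python
import re

def rename_substring(filename: str, old: str, new: str) -> str:
    # Replace all occurrences of a given filename substring with a new one.
    return filename.replace(old, new)

def drop_directory_path(path: str) -> str:
    # Replacing ^(.*/) with / strips the directory portion, leaving a
    # leading slash plus the base filename.
    return re.sub(r"^(.*/)", "/", path)
```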

Completion strategies

Completion strategies provide a method of deleting files and empty parent directories after a successful batch sync of those files into a Foundry dataset. This may be useful when data is synced by writing to an intermediate S3 bucket or other file storage system that Foundry reads from. If the data read by Foundry is already a short-lived copy, it is generally safe to delete once the data has been read and successfully written to Foundry.

Limitations of completion strategies and alternatives

Completion strategies are subject to several important limitations and caveats. These limitations and potential mitigations or alternatives are described below.

Completion strategy support

Completion strategies are only supported when using an agent worker runtime. When using a direct connection or agent proxy runtime, we recommend implementing the functionality provided by completion strategies as a downstream external transform instead.

As an example, assume you have a direct connection to an S3 bucket containing the files foo.txt and bar.txt. You want to use a file batch sync to copy them to a dataset, and then delete the files from S3. The recommended way to achieve this does not use completion strategies; instead, do the following:

  • Configure a batch sync without any completion strategies and schedule it to run.
  • Write a downstream external transform job which is scheduled to run when the sync output dataset is updated, taking the synced data as an input.
  • In that external transform, write Python transforms code that iterates through the files that appeared in the synced dataset and calls S3 to delete those files from the bucket.

Note that this approach is retryable if any deletion calls fail, and guarantees that data is successfully committed to Foundry before attempting to perform any deletions. This approach is also compatible with incremental file batch syncs.
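The deletion step inside the external transform could look like the following sketch. The function and the return convention are illustrative; in a real job the client would be a boto3 S3 client (whose `delete_object(Bucket=..., Key=...)` call is shown here) and the keys would come from listing the synced dataset's files:

```python
def delete_synced_files(s3_client, bucket: str, keys: list) -> list:
    # Delete each synced file from the source bucket. Failed keys are
    # collected and returned rather than raised, so a scheduled re-run
    # can retry only the deletions that did not succeed.
    failed = []
    for key in keys:
        try:
            # boto3 S3 client call: delete_object(Bucket=..., Key=...)
            s3_client.delete_object(Bucket=bucket, Key=key)
        except Exception:
            failed.append(key)
    return failed
```

Returning the failed keys (instead of failing the whole job) is what makes the approach retryable, as noted above.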

Completion strategies are best effort

Completion strategies are best effort, meaning they do not guarantee that source files will actually be removed. The following situations may cause completion strategies to fail:

  1. Completion strategies will not be retried if the agent worker runtime crashes or is restarted after the batch sync commits data to Foundry, but before the completion strategies run.
  2. If the credentials used to connect do not have write permissions, the batch sync may successfully read data and commit to Foundry, but fail to perform the deletions specified by the completion strategy.

In general, we recommend using an alternative to completion strategies wherever possible. Custom completion strategies are no longer supported.

Optimize file-based syncs

Warning

This guide is recommended for users setting up a new sync or troubleshooting a slow or unreliable sync. If your sync is already working reliably, you do not need to take any action.

Syncing a large number of files into a single dataset can be challenging for many reasons.

Consider a sync intended to upload a million files. After crawling the source system and uploading all but one file, a network issue causes the entire sync to fail. All of the work done up to that point would be lost because syncs are transactional; if the sync fails, the entire transaction also fails.

Network issues are one of several common causes of sync failure, resulting in hours of lost work and unnecessary load on source systems and agents. Even without network issues or errors, syncing a large number of files can take a long time.

If the dataset grows over time, the time to sync the data as a SNAPSHOT increases. This is because SNAPSHOT transactions sync all of the data from the dataset into Foundry. Instead, use syncs that are configured with transaction type APPEND to import your data incrementally. Since you will be syncing smaller, discrete chunks of data, you will create an effective checkpoint; a sync failure will result in a minimal amount of duplicated work rather than requiring a complete re-run. Additionally, your dataset syncs will run faster as you no longer need to upload all of your data for every sync.

Configure incremental APPEND syncs

APPEND transactions require additional configuration to run successfully.

By default, files synced into Foundry are not filtered. However, APPEND syncs require filters to prevent the same files from being imported repeatedly. We recommend using the Exclude files already synced and Limit number of files filters to control how many files get imported into Foundry in a single sync. Finally, schedule your sync to remain up to date with your source system.
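Putting the two recommended filters together, each run behaves roughly like the following simulation. This is a mental model only; sorting is used here purely for determinism, whereas the real file order within a run is not guaranteed:

```python
def run_append_sync(source_paths: set, synced: set, limit: int) -> list:
    # "Exclude files already synced" drops previously ingested paths;
    # "Limit number of files" caps the batch committed per transaction.
    batch = sorted(p for p in source_paths if p not in synced)[:limit]
    # Checkpoint: these paths will be excluded on the next run, so a
    # failure only loses one small batch, not the entire history.
    synced.update(batch)
    return batch
```

Each call represents one scheduled run: the first run ingests up to `limit` new files, and subsequent runs pick up the remainder.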