| Class | Description |
|---|---|
| Check | Wraps up an Expectation such that it can be registered with Data Health. |
| FileStatus | A collections.namedtuple capturing details about a FoundryFS file. |
| FileSystem(foundry_fs[, read_only]) | A filesystem object for reading and writing dataset files. |
| IncrementalTransformContext(ctx, is_incremental) | TransformContext with added functionality for incremental computation. |
| IncrementalTransformInput(tinput[, prev_txrid]) | TransformInput with added functionality for incremental computation. |
| IncrementalTransformOutput(toutput[, …]) | TransformOutput with added functionality for incremental computation. |
| Input(alias) | Specification of a transform input. |
| Output(alias[, sever_permissions]) | Specification of a transform output. |
| Pipeline() | An object for grouping a collection of Transform objects. |
| Transform(compute_func[, inputs, outputs, ...]) | A callable object that describes a single step of computation. |
| TransformContext(foundry_connector[, parameters]) | Context object that can optionally be injected into the compute function of a transform. |
| TransformInput(rid, branch, txrange, …) | The input object passed into Transform objects at runtime. |
| LightweightInput(alias) | The input object passed into Lightweight Transform objects at runtime. |
| IncrementalLightweightInput(alias) | The input object passed into incremental Lightweight Transform objects at runtime. |
| TransformOutput(rid, branch, txrid, …) | The output object passed into Transform objects at runtime. |
| LightweightOutput(alias) | The output object passed into Lightweight Transform objects at runtime. |
Check
class transforms.api.Check
Wraps up an Expectation such that it can be registered with Data Health.
expectation
name
is_incremental
on_error
description
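A minimal sketch of attaching a Check to an output, assuming the transforms.expectations module (imported as E) and hypothetical dataset paths:

```python
from transforms.api import transform_df, Check, Input, Output
from transforms import expectations as E  # assumed expectations module

@transform_df(
    Output(
        "/examples/customers_clean",          # hypothetical output path
        checks=Check(
            E.primary_key("customer_id"),     # the wrapped Expectation
            "customer_id is unique",          # name shown in Data Health
            on_error="FAIL",                  # fail the build if the check fails
        ),
    ),
    customers=Input("/examples/customers_raw"),  # hypothetical input path
)
def clean_customers(customers):
    return customers.dropDuplicates(["customer_id"])
```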
FileStatus
class transforms.api.FileStatus
A collections.namedtuple capturing details about a FoundryFS file.
Create new instance of FileStatus(path, size, modified)
count(value) → integer – return number of occurrences of value
index(value[, start[, stop]]) → integer – return first index of value
modified
path
size
FileSystem
class transforms.api.FileSystem(foundry_fs, read_only=False)
A filesystem object for reading and writing dataset files.
files(glob=None, regex='.*', show_hidden=False, packing_heuristic=None)
Returns a DataFrame ↗ containing the paths accessible within this dataset. The DataFrame ↗ is partitioned by file size, where each partition contains file paths whose combined size is at most spark.files.maxPartitionBytes bytes, or a single file if that file is larger than spark.files.maxPartitionBytes. The size of a file is calculated as its on-disk file size plus spark.files.openCostInBytes. The glob pattern supports globstar; for example, to match all files ending in .pdf recursively, use **/*.pdf. Hidden files are those prefixed with . or _. The packing_heuristic can be ffd (First Fit Decreasing) or wfd (Worst Fit Decreasing). While wfd tends to produce a less even distribution, it is much faster, so wfd is recommended for datasets containing a very large number of files. If a heuristic is not specified, one will be selected automatically.
ls(glob=None, regex='.*', show_hidden=False)
Yields FileStatus objects describing the logical path, file size (bytes), and modified timestamp (ms since January 1, 1970 UTC). Hidden files are those prefixed with . or _.
open(path, mode='r', **kwargs)
Opens a file in the dataset; the remaining kwargs are keyword arguments passed to io.open() ↗.
hadoop_path
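A sketch of listing files from an input dataset inside a transform, assuming hypothetical dataset paths:

```python
from transforms.api import transform, Input, Output

@transform(
    raw=Input("/examples/raw_documents"),     # hypothetical dataset paths
    out=Output("/examples/document_index"),
)
def index_documents(ctx, raw, out):
    fs = raw.filesystem()
    # ls() yields FileStatus tuples; glob narrows the listing to PDF files.
    # fs.open(path) could be used to stream an individual file's contents.
    statuses = list(fs.ls(glob="**/*.pdf"))
    rows = [(s.path, s.size, s.modified) for s in statuses]
    df = ctx.spark_session.createDataFrame(rows, ["path", "size", "modified"])
    out.write_dataframe(df)
```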
IncrementalTransformContext
class transforms.api.IncrementalTransformContext(ctx, is_incremental)
TransformContext with added functionality for incremental computation.
auth_header
fallback_branches
is_incremental
parameters
spark_session
abort_job
IncrementalTransformInput
class transforms.api.IncrementalTransformInput(tinput, prev_txrid=None)
TransformInput with added functionality for incremental computation.
dataframe(mode='added')
Returns a pyspark.sql.DataFrame for the given read mode.
filesystem(mode='added')
pandas()
branch
path
rid
IncrementalTransformOutput
class transforms.api.IncrementalTransformOutput(toutput, prev_txrid=None, mode='replace')
TransformOutput with added functionality for incremental computation.
abort()
dataframe(mode='current', schema=None)
Raises ValueError ↗ – If no schema is passed when using mode 'previous'.
filesystem(mode='current')
Raises NotImplementedError ↗ – Not currently supported.
pandas(mode='current')
set_mode(mode)
The write mode cannot be changed after data has been written.
write_dataframe(df, partition_cols=None, bucket_cols=None, bucket_count=None, sort_by=None, output_format=None, options=None)
options – Passed through to org.apache.spark.sql.DataFrameWriter#option(String, String).
write_pandas(pandas_df)
branch
path
rid
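A sketch of an incremental transform combining an incremental input read with an incremental output write, assuming the incremental() decorator and hypothetical dataset paths:

```python
from transforms.api import transform, incremental, Input, Output

@incremental()
@transform(
    events=Input("/examples/events"),          # hypothetical dataset paths
    out=Output("/examples/events_appended"),
)
def append_new_events(events, out):
    # On an incremental run, mode='added' returns only the rows appended to the
    # input since the previous build; on a snapshot run it returns all rows.
    new_rows = events.dataframe(mode='added')
    # 'modify' appends to the previous output; set_mode('replace') would rewrite it.
    out.set_mode('modify')
    out.write_dataframe(new_rows)
```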
Input
class transforms.api.Input(alias, branch, stop_propagating, stop_requiring, checks)
Specification of a transform input.
checks – One or more Check objects.
The failure behaviour is either continue or fail; if not specified, it defaults to fail.
Output
class transforms.api.Output(alias=None, sever_permissions=False, checks=None)
Specification of a transform output.
checks – One or more Check objects.
Pipeline
class transforms.api.Pipeline
An object for grouping a collection of Transform objects.
add_transforms(*transforms)
Raises ValueError ↗ – If multiple Transform objects write to the same Output alias.
discover_transforms(*modules)
Any Transform object (as constructed by the transforms decorators) found in the given modules will be registered to the pipeline.
>>> import myproject
>>> p = Pipeline()
>>> p.discover_transforms(myproject)
Each module found is imported. Try to avoid executing code at the module level.
transforms
Transform
class transforms.api.Transform(compute_func, inputs=None, outputs=None, profile=None)
A callable object that describes a single step of computation.
A Transform consists of a number of Input specs, a number of Output specs, and a compute function.
It is idiomatic to construct a Transform object using the provided decorators: transform(), transform_df(), and transform_pandas().
Note: The original compute function is exposed via the Transform’s __call__ method.
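For example, a minimal transform built with transform_df() might look like the following sketch (dataset paths are hypothetical):

```python
from transforms.api import transform_df, Input, Output

@transform_df(
    Output("/examples/clean_events"),          # hypothetical output path
    events=Input("/examples/raw_events"),      # hypothetical input path
)
def clean_events(events):
    # 'events' arrives as a pyspark.sql.DataFrame; the returned DataFrame is
    # written to the declared Output.
    return events.dropDuplicates()
```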
compute(ctx=None, **kwargs)
The input objects are passed as keyword arguments using the names given in the Input specs; kwargs is a shorthand for keyword arguments.
version
Two semantically equivalent transforms should have the same version; for example, the SQL query select A, B from foo; should be the same version as the SQL query select A, B from (select * from foo);.
Raises ValueError ↗ – If it fails to compute an object hash of the compute function.
TransformContext
class transforms.api.TransformContext(foundry_connector, parameters=None)
Context object that can optionally be injected into the compute function of a transform.
auth_header
fallback_branches
parameters
spark_session
abort_job
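A sketch of a compute function that has the context injected, assuming a hypothetical output path:

```python
from transforms.api import transform, Output

@transform(
    out=Output("/examples/reference_data"),    # hypothetical output path
)
def build_reference_data(ctx, out):
    # ctx is injected because the compute function declares a 'ctx' parameter
    df = ctx.spark_session.createDataFrame(
        [("EUR", 1.0), ("USD", 1.08)],
        ["currency", "rate"],
    )
    out.write_dataframe(df)
```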
TransformInput
class transforms.api.TransformInput(rid, branch, txrange, dfreader, fsbuilder)
The input object passed into Transform objects at runtime.
dataframe()
filesystem()
pandas()
branch
path
rid
column_descriptions
column_typeclasses
LightweightInput
class transforms.api.LightweightInput(alias)
Its aim is to mimic a subset of the API of TransformInput by delegating to the Foundry Data Sidecar while extending it with support for various data formats.
dataframe()
See pandas().
filesystem()
pandas()
arrow()
polars(lazy: Optional[bool]=False)
Whether a polars LazyFrame or DataFrame is returned is controlled by the lazy parameter.
path()
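A sketch of a Lightweight transform reading its input as a polars LazyFrame, assuming the lightweight() decorator, the polars library, and hypothetical dataset paths and column names:

```python
import polars as pl
from transforms.api import lightweight, transform, Input, Output

@lightweight()
@transform(
    src=Input("/examples/measurements"),       # hypothetical dataset paths
    out=Output("/examples/sensor_averages"),
)
def sensor_averages(src, out):
    # With @lightweight, 'src' is a LightweightInput and no Spark session is used
    lf = src.polars(lazy=True)                 # returns a polars LazyFrame
    result = lf.group_by("sensor_id").agg(pl.col("value").mean()).collect()
    out.write_table(result)
```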
IncrementalLightweightInput
class transforms.api.IncrementalLightweightInput(alias)
Its aim is to mimic a subset of the API of IncrementalTransformInput by delegating to the Foundry Data Sidecar while extending it with support for various data formats. It is the incremental counterpart of LightweightInput.
dataframe(mode)
See pandas().
filesystem()
pandas(mode)
arrow(mode)
polars(lazy=False, mode)
Whether a polars LazyFrame or DataFrame is returned is controlled by the lazy parameter.
path(mode)
TransformOutput
class transforms.api.TransformOutput(rid, branch, txrid, dfreader, dfwriter, fsbuilder)
The output object passed into Transform objects at runtime.
abort()
dataframe()
filesystem()
pandas()
set_mode(mode)
write_dataframe(df, partition_cols=None, bucket_cols=None, bucket_count=None, sort_by=None, output_format=None, options=None, column_descriptions=None, column_typeclasses=None)
bucket_cols – Required if bucket_count is given.
bucket_count – Required if bucket_cols is given.
options – Passed through to org.apache.spark.sql.DataFrameWriter#option(String, String).
write_pandas(pandas_df)
branch
path
rid
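A sketch of writing an output with partitioning and column descriptions, assuming hypothetical dataset paths and column names:

```python
from transforms.api import transform, Input, Output

@transform(
    sales=Input("/examples/sales"),            # hypothetical dataset paths
    out=Output("/examples/sales_partitioned"),
)
def partition_sales(sales, out):
    df = sales.dataframe()
    out.write_dataframe(
        df,
        partition_cols=["sale_date"],                           # hive-style partitioning of output files
        column_descriptions={"sale_date": "Date of the sale"},  # attached to the output column metadata
    )
```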
LightweightOutput
class transforms.api.LightweightOutput(alias)
Its aim is to mimic a subset of the API of TransformOutput by delegating to the Foundry Data Sidecar.
filesystem()
dataframe()
See pandas().
pandas()
arrow()
polars(lazy: Optional[bool]=False)
Whether a polars LazyFrame or DataFrame is returned is controlled by the lazy parameter.
path()
write_pandas(pandas_df)
Delegates to write_table.
write_table(df)
Writes a dataframe, or the files at a given path, to the output dataset. In case a path is given (either as str or pathlib.Path), its value must match the value returned by path_for_write_table.
df (dataframe or path) – The dataframe to write.
path_for_write_table
The path to be used with write_table.
set_mode(mode)
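A sketch of a Lightweight transform writing a pandas result with write_table, assuming the lightweight() decorator and hypothetical dataset paths:

```python
from transforms.api import lightweight, transform, Input, Output

@lightweight()
@transform(
    src=Input("/examples/raw_numbers"),        # hypothetical dataset paths
    out=Output("/examples/number_summary"),
)
def summarise_numbers(src, out):
    df = src.pandas()                          # LightweightInput -> pandas.DataFrame
    summary = df.describe().reset_index()
    out.write_table(summary)                   # write_table also accepts polars/arrow tables or a path
```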