| Class | Description |
|---|---|
| Check | Wraps up an Expectation such that it can be registered with Data Health. |
| FileStatus | A collections.namedtuple capturing details about a FoundryFS file. |
| FileSystem(foundry_fs[, read_only]) | A filesystem object for reading and writing dataset files. |
| IncrementalTransformContext(ctx, is_incremental) | TransformContext with added functionality for incremental computation. |
| IncrementalTransformInput(tinput[, prev_txrid]) | TransformInput with added functionality for incremental computation. |
| IncrementalTransformOutput(toutput[, …]) | TransformOutput with added functionality for incremental computation. |
| Input(alias) | Specification of a transform input. |
| Output(alias[, sever_permissions]) | Specification of a transform output. |
| Pipeline() | An object for grouping a collection of Transform objects. |
| Transform(compute_func[, inputs, outputs, ...]) | A callable object that describes a single step of computation. |
| TransformContext(foundry_connector[, parameters]) | Context object that can optionally be injected into the compute function of a transform. |
| TransformInput(rid, branch, txrange, …) | The input object passed into Transform objects at runtime. |
| LightweightInput(alias) | The input object passed into Lightweight Transform objects at runtime. |
| IncrementalLightweightInput(alias) | The input object passed into incremental Lightweight Transform objects at runtime. |
| TransformOutput(rid, branch, txrid, …) | The output object passed into Transform objects at runtime. |
| LightweightOutput(alias) | The output object passed into Lightweight Transform objects at runtime. |
Check
class transforms.api.Check
Wraps up an Expectation such that it can be registered with Data Health.
expectation
name
is_incremental
on_error
description
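A minimal sketch of attaching a Check to an output, assuming the transforms.expectations module (imported as E) and hypothetical dataset paths:

```python
from transforms.api import transform_df, Check, Input, Output
from transforms import expectations as E  # assumed expectations module

@transform_df(
    Output(
        "/examples/customers_clean",          # hypothetical output path
        checks=Check(
            E.primary_key("customer_id"),     # the wrapped Expectation
            "customer_id is unique",          # name shown in Data Health
            on_error="FAIL",                  # fail the build if the check fails
        ),
    ),
    customers=Input("/examples/customers_raw"),  # hypothetical input path
)
def clean_customers(customers):
    return customers.dropDuplicates(["customer_id"])
```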
FileStatus
class transforms.api.FileStatus
A collections.namedtuple capturing details about a FoundryFS file.
Create new instance of FileStatus(path, size, modified)
count(value) → integer – return number of occurrences of value
index(value[, start[, stop]]) → integer – return first index of value
modified
path
size
FileSystem
class transforms.api.FileSystem(foundry_fs, read_only=False)
A filesystem object for reading and writing dataset files.
files(glob=None, regex='.*', show_hidden=False, packing_heuristic=None)
Returns a DataFrame ↗ containing the paths accessible within this dataset. The DataFrame ↗ is partitioned by file size, where each partition contains file paths whose combined size is at most spark.files.maxPartitionBytes bytes, or a single file if that file is larger than spark.files.maxPartitionBytes. The size of a file is calculated as its on-disk file size plus spark.files.openCostInBytes. The glob pattern supports globstar; for example, to match all files ending in .pdf recursively, use **/*.pdf. Hidden files are those prefixed with . or _. The packing_heuristic can be ffd (First Fit Decreasing) or wfd (Worst Fit Decreasing). While wfd tends to produce a less even distribution, it is much faster, so wfd is recommended for datasets containing a very large number of files. If a heuristic is not specified, one will be selected automatically.
ls(glob=None, regex='.*', show_hidden=False)
Yields FileStatus objects describing the logical path, file size (bytes), and modified timestamp (ms since January 1, 1970 UTC). Hidden files are those prefixed with . or _.
open(path, mode='r', **kwargs)
Opens a file in the dataset; the remaining kwargs are keyword arguments passed to io.open() ↗.
hadoop_path
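A sketch of listing files from an input dataset inside a transform, assuming hypothetical dataset paths:

```python
from transforms.api import transform, Input, Output

@transform(
    raw=Input("/examples/raw_documents"),     # hypothetical dataset paths
    out=Output("/examples/document_index"),
)
def index_documents(ctx, raw, out):
    fs = raw.filesystem()
    # ls() yields FileStatus tuples; glob narrows the listing to PDF files.
    # fs.open(path) could be used to stream an individual file's contents.
    statuses = list(fs.ls(glob="**/*.pdf"))
    rows = [(s.path, s.size, s.modified) for s in statuses]
    df = ctx.spark_session.createDataFrame(rows, ["path", "size", "modified"])
    out.write_dataframe(df)
```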
IncrementalTransformContext
class transforms.api.IncrementalTransformContext(ctx, is_incremental)
TransformContext with added functionality for incremental computation.
auth_header
fallback_branches
is_incremental
parameters
spark_session
abort_job
IncrementalTransformInput
class transforms.api.IncrementalTransformInput(tinput, prev_txrid=None)
TransformInput with added functionality for incremental computation.
dataframe(mode='added')
Returns a pyspark.sql.DataFrame for the given read mode.
filesystem(mode='added')
pandas()
branch
path
rid
IncrementalTransformOutput
class transforms.api.IncrementalTransformOutput(toutput, prev_txrid=None, mode='replace')
TransformOutput with added functionality for incremental computation.
abort()
dataframe(mode='current', schema=None)
Raises ValueError ↗ – If no schema is passed when using mode 'previous'.
filesystem(mode='current')
Raises NotImplementedError ↗ – Not currently supported.
pandas(mode='current')
set_mode(mode)
The write mode cannot be changed after data has been written.
write_dataframe(df, partition_cols=None, bucket_cols=None, bucket_count=None, sort_by=None, output_format=None, options=None)
options – Passed through to org.apache.spark.sql.DataFrameWriter#option(String, String).
write_pandas(pandas_df)
branch
path
rid
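A sketch of an incremental transform combining an incremental input read with an incremental output write, assuming the incremental() decorator and hypothetical dataset paths:

```python
from transforms.api import transform, incremental, Input, Output

@incremental()
@transform(
    events=Input("/examples/events"),          # hypothetical dataset paths
    out=Output("/examples/events_appended"),
)
def append_new_events(events, out):
    # On an incremental run, mode='added' returns only the rows appended to the
    # input since the previous build; on a snapshot run it returns all rows.
    new_rows = events.dataframe(mode='added')
    # 'modify' appends to the previous output; set_mode('replace') would rewrite it.
    out.set_mode('modify')
    out.write_dataframe(new_rows)
```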
Input
class transforms.api.Input(alias, branch, stop_propagating, stop_requiring, checks)
Specification of a transform input.
checks – One or more Check objects.
The failure behaviour is either continue or fail; if not specified, it defaults to fail.
Output
class transforms.api.Output(alias=None, sever_permissions=False, checks=None)
Specification of a transform output.
checks – One or more Check objects.
Pipeline
class transforms.api.Pipeline
An object for grouping a collection of Transform objects.
add_transforms(*transforms)
Raises ValueError ↗ – If multiple Transform objects write to the same Output alias.
discover_transforms(*modules)
Any Transform object (as constructed by the transforms decorators) found in the given modules will be registered to the pipeline.
>>> import myproject
>>> p = Pipeline()
>>> p.discover_transforms(myproject)
Each module found is imported. Try to avoid executing code at the module level.
transforms
Transform
class transforms.api.Transform(compute_func, inputs=None, outputs=None, profile=None)
A callable object that describes a single step of computation.
A Transform consists of a number of Input specs, a number of Output specs, and a compute function.
It is idiomatic to construct a Transform object using the provided decorators: transform(), transform_df(), and transform_pandas().
Note: The original compute function is exposed via the Transform’s __call__ method.
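For example, a minimal transform built with transform_df() might look like the following sketch (dataset paths are hypothetical):

```python
from transforms.api import transform_df, Input, Output

@transform_df(
    Output("/examples/clean_events"),          # hypothetical output path
    events=Input("/examples/raw_events"),      # hypothetical input path
)
def clean_events(events):
    # 'events' arrives as a pyspark.sql.DataFrame; the returned DataFrame is
    # written to the declared Output.
    return events.dropDuplicates()
```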
compute(ctx=None, **kwargs)
The input objects are passed as keyword arguments using the names given in the Input specs; kwargs is a shorthand for keyword arguments.
version
Two semantically equivalent transforms should have the same version; for example, the SQL query select A, B from foo; should be the same version as the SQL query select A, B from (select * from foo);.
Raises ValueError ↗ – If it fails to compute an object hash of the compute function.
TransformContext
class transforms.api.TransformContext(foundry_connector, parameters=None)
Context object that can optionally be injected into the compute function of a transform.
auth_header
fallback_branches
parameters
spark_session
abort_job
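A sketch of a compute function that has the context injected, assuming a hypothetical output path:

```python
from transforms.api import transform, Output

@transform(
    out=Output("/examples/reference_data"),    # hypothetical output path
)
def build_reference_data(ctx, out):
    # ctx is injected because the compute function declares a 'ctx' parameter
    df = ctx.spark_session.createDataFrame(
        [("EUR", 1.0), ("USD", 1.08)],
        ["currency", "rate"],
    )
    out.write_dataframe(df)
```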
TransformInput
class transforms.api.TransformInput(rid, branch, txrange, dfreader, fsbuilder)
The input object passed into Transform objects at runtime.
dataframe()
filesystem()
pandas()
branch
path
rid
column_descriptions
column_typeclasses
LightweightInput
class transforms.api.LightweightInput(alias)
Its aim is to mimic a subset of the API of TransformInput by delegating to the Foundry Data Sidecar while extending it with support for various data formats.
dataframe()
See pandas().
filesystem()
pandas()
arrow()
polars(lazy: Optional[bool]=False)
Whether a polars LazyFrame or DataFrame is returned is controlled by the lazy parameter.
path()
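A sketch of a Lightweight transform reading its input as a polars LazyFrame, assuming the lightweight() decorator, the polars library, and hypothetical dataset paths and column names:

```python
import polars as pl
from transforms.api import lightweight, transform, Input, Output

@lightweight()
@transform(
    src=Input("/examples/measurements"),       # hypothetical dataset paths
    out=Output("/examples/sensor_averages"),
)
def sensor_averages(src, out):
    # With @lightweight, 'src' is a LightweightInput and no Spark session is used
    lf = src.polars(lazy=True)                 # returns a polars LazyFrame
    result = lf.group_by("sensor_id").agg(pl.col("value").mean()).collect()
    out.write_table(result)
```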
IncrementalLightweightInput
class transforms.api.IncrementalLightweightInput(alias)
Its aim is to mimic a subset of the API of IncrementalTransformInput by delegating to the Foundry Data Sidecar while extending it with support for various data formats. It is the incremental counterpart of LightweightInput.
dataframe(mode)
See pandas().
filesystem()
pandas(mode)
arrow(mode)
polars(lazy=False, mode)
Whether a polars LazyFrame or DataFrame is returned is controlled by the lazy parameter.
path(mode)
TransformOutput
class transforms.api.TransformOutput(rid, branch, txrid, dfreader, dfwriter, fsbuilder)
The output object passed into Transform objects at runtime.
abort()
dataframe()
filesystem()
pandas()
set_mode(mode)
write_dataframe(df, partition_cols=None, bucket_cols=None, bucket_count=None, sort_by=None, output_format=None, options=None, column_descriptions=None, column_typeclasses=None)
bucket_cols – Required if bucket_count is given.
bucket_count – Required if bucket_cols is given.
options – Passed through to org.apache.spark.sql.DataFrameWriter#option(String, String).
write_pandas(pandas_df)
branch
path
rid
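A sketch of writing an output with partitioning and column descriptions, assuming hypothetical dataset paths and column names:

```python
from transforms.api import transform, Input, Output

@transform(
    sales=Input("/examples/sales"),            # hypothetical dataset paths
    out=Output("/examples/sales_partitioned"),
)
def partition_sales(sales, out):
    df = sales.dataframe()
    out.write_dataframe(
        df,
        partition_cols=["sale_date"],                           # hive-style partitioning of output files
        column_descriptions={"sale_date": "Date of the sale"},  # attached to the output column metadata
    )
```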
LightweightOutput
class transforms.api.LightweightOutput(alias)
Its aim is to mimic a subset of the API of TransformOutput by delegating to the Foundry Data Sidecar.
filesystem()
dataframe()
See pandas().
pandas()
arrow()
polars(lazy: Optional[bool]=False)
Whether a polars LazyFrame or DataFrame is returned is controlled by the lazy parameter.
path()
write_pandas(pandas_df)
Delegates to write_table.
write_table(df)
Writes a dataframe, or the files at a given path, to the output dataset. In case a path is given (either as str or pathlib.Path), its value must match the value returned by path_for_write_table.
df (dataframe or path) – The dataframe to write.
path_for_write_table
The path to be used with write_table.
set_mode(mode)
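A sketch of a Lightweight transform writing a pandas result with write_table, assuming the lightweight() decorator and hypothetical dataset paths:

```python
from transforms.api import lightweight, transform, Input, Output

@lightweight()
@transform(
    src=Input("/examples/raw_numbers"),        # hypothetical dataset paths
    out=Output("/examples/number_summary"),
)
def summarise_numbers(src, out):
    df = src.pandas()                          # LightweightInput -> pandas.DataFrame
    summary = df.describe().reset_index()
    out.write_table(summary)                   # write_table also accepts polars/arrow tables or a path
```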