
transforms.api.IncrementalTableTransformInput

class transforms.api.IncrementalTableTransformInput(table_tinput, from_version)

TableTransformInput with added functionality for incremental computation.

property batch_incremental_configuration

The configuration for an incremental input that will be read in batches.

  • Type: BatchIncrementalConfiguration

property branch

The branch of the dataset.

property catalog

Returns the name of the table’s Spark catalog, intended for use in Spark procedures, if supported by the underlying table type.

  • Returns: The name of the table’s catalog.
  • Return type: str ↗

  • Raises: ValueError – If the underlying table type does not expose a Spark catalog.

changelog(identifier_columns=None)

Creates a changelog view for the given table from the last processed snapshot ID.

Note: Only supported for Iceberg tables.

If identifier columns are provided, this creates an identifier-based changelog. This changelog type gives you the last change performed on each row uniquely identified by the given identifier columns. It is more performant and allows greater flexibility when performing row edits.

Without identifier columns, it creates a net-changes changelog. This changelog type gives you the coalesced changes performed on the rows by cancelling out DELETEs and INSERTs over the snapshot range. Computing this requires a high amount of data shuffling, so it is slower than an identifier-based changelog.

If this changelog is intended to be used in updating an output table, the identifier columns used when creating this changelog should match the identifier columns used to update the output table.

See the Iceberg create_changelog_view ↗ documentation for more information.

  • Parameters: identifier_columns (List[str], optional) – The list of columns that uniquely identify each row, if present.
  • Returns: Temporary changelog view with original table schema along with _change_type, _change_ordinal and _commit_snapshot_id columns. For an identifier-based changelog, _change_type can either be INSERT, DELETE, UPDATE_AFTER or UPDATE_BEFORE. For a net-changes changelog, _change_type can either be INSERT or DELETE.
  • Return type: DataFrame
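The change-type semantics described above can be pictured with a plain-Python sketch of applying an identifier-based changelog to a keyed table. This is an illustration of the semantics only, not the Iceberg implementation; the column names follow the return description above, and `apply_changelog` and its in-memory table shape are hypothetical:

```python
def apply_changelog(table, changelog_rows, id_cols=("id",)):
    """Apply identifier-based changelog rows to an in-memory table.

    table: {identifier_tuple: row_dict}
    changelog_rows: rows carrying _change_type, _change_ordinal and
    _commit_snapshot_id columns, applied in commit order.
    """
    meta = ("_change_type", "_change_ordinal", "_commit_snapshot_id")
    for row in sorted(changelog_rows, key=lambda r: r["_change_ordinal"]):
        key = tuple(row[c] for c in id_cols)
        data = {k: v for k, v in row.items() if k not in meta}
        if row["_change_type"] in ("INSERT", "UPDATE_AFTER"):
            table[key] = data            # new row, or the post-update image
        elif row["_change_type"] == "DELETE":
            table.pop(key, None)         # row removed over the snapshot range
        # UPDATE_BEFORE carries the pre-update image; nothing to apply
    return table

current = {("a",): {"id": "a", "value": 1}}
changes = [
    {"id": "a", "value": 1, "_change_type": "UPDATE_BEFORE",
     "_change_ordinal": 0, "_commit_snapshot_id": 10},
    {"id": "a", "value": 2, "_change_type": "UPDATE_AFTER",
     "_change_ordinal": 0, "_commit_snapshot_id": 10},
    {"id": "b", "value": 3, "_change_type": "INSERT",
     "_change_ordinal": 1, "_commit_snapshot_id": 11},
]
result = apply_changelog(current, changes, id_cols=("id",))
```

If the changelog is used to update an output table, this is why the identifier columns must match on both sides: the key used to apply each change must identify the same row in the output.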

property column_descriptions

The column descriptions of the dataset.

  • Type: Dict[str, str]

property column_typeclasses

The column typeclasses of the dataset.

  • Type: Dict[str, str]

dataframe(mode='added', options=None)

Return a pyspark.sql.DataFrame for the given read mode.

The changelog read mode, for Iceberg tables only, returns the changelog view for the given table from the last processed snapshot ID. Unlike the changelog() method, it always creates a net-changes changelog, which is less performant but supports tables without identifier columns (a list of columns that uniquely identify each row). This read mode is deprecated; use changelog() instead.

  • Parameters:
    • mode (str, optional) – The read mode, one of current, previous, added, modified, removed, or changelog. Defaults to added.
    • options (dict, optional) – Additional Spark read options to pass when reading the table.
  • Returns: The DataFrame for the table.
  • Return type: DataFrame
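The non-changelog read modes can be pictured with a small sketch over two snapshots of a keyed table. This is an illustration of the mode semantics only, not how the platform computes them, and `split_by_mode` is a hypothetical helper:

```python
def split_by_mode(previous, current):
    """Classify rows into views matching the incremental read modes.

    previous/current: {key: row} snapshots of the table before and
    after the unprocessed transactions.
    """
    added = {k: v for k, v in current.items() if k not in previous}
    removed = {k: v for k, v in previous.items() if k not in current}
    modified = {k: v for k, v in current.items()
                if k in previous and previous[k] != v}
    return {
        "current": current,    # everything now in the table
        "previous": previous,  # everything before the new transactions
        "added": added,        # rows that are new
        "modified": modified,  # rows whose contents changed
        "removed": removed,    # rows that disappeared
    }

views = split_by_mode(
    previous={"a": 1, "b": 2},
    current={"a": 1, "b": 3, "c": 4},
)
```

Here views["added"] contains only the new row keyed "c", views["modified"] the changed row keyed "b", and views["removed"] is empty.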

property end_transaction_rid

The ending transaction of the input dataset.

filesystem(mode='added')

Construct a FileSystem object for reading from FoundryFS for the given read mode.

Only current, previous and added modes are supported.

  • Parameters: mode (str, optional) – The read mode, one of current, previous, or added. Defaults to added.
  • Returns: A filesystem object for the given view.
  • Return type: FileSystem

property identifier

Returns the fully qualified, catalog-prefixed Spark V2 identifier of the table, if supported by the underlying table type.

  • Returns: The fully qualified identifier of the table.
  • Return type: str ↗

  • Raises: ValueError – If the underlying table type does not expose a Spark V2 identifier.

pandas()

A pandas dataframe containing the full view of the dataset.

  • Return type: pandas.DataFrame

property path

The Compass path of the dataset.

property rid

The resource identifier of the dataset.

property start_transaction_rid

The starting transaction of the input dataset.