transforms.api.TransformOutput

class transforms.api.TransformOutput(rid, branch, txrid, dfreader, dfwriter, fsbuilder, mode='replace')

The output object passed into Transform objects at runtime.
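
A minimal sketch of how a TransformOutput is obtained in practice (the dataset paths and the parameter names out and source are hypothetical): the Output specification in the decorator is resolved to a TransformOutput and passed into the compute function at runtime.

    from transforms.api import transform, Input, Output

    @transform(
        out=Output("/examples/output_dataset"),    # bound to a TransformOutput at runtime
        source=Input("/examples/input_dataset"),   # bound to a TransformInput at runtime
    )
    def compute(out, source):
        # Read the input as a PySpark DataFrame and write it to the output unchanged.
        out.write_dataframe(source.dataframe())

The later sketches on this page reuse the out handle from this example.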

abort()

Aborts all work on this output. Any work done on writers from this output before or after calling this method will be ignored.
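
One possible pattern, assuming the compute function and handles from the sketch above: abort the output when there is nothing worth writing, so that everything done on its writers is ignored.

    def compute(out, source):
        rows = source.dataframe()
        if rows.count() == 0:
            out.abort()                  # all work done on this output is ignored
        else:
            out.write_dataframe(rows)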

property batch_incremental_configuration

The configuration for an incremental input that will be read in batches.

  • Type: BatchIncrementalConfiguration

property branch

The branch of the dataset.

property column_descriptions

The column descriptions of the dataset.

  • Type: Dict[str, str]

property column_typeclasses

The column typeclasses of the dataset.

  • Type: Dict[str, str]

dataframe()

Return a pyspark.sql.DataFrame containing the full view of the dataset.

property end_transaction_rid

The ending transaction of the output dataset.

filesystem()

Construct a FileSystem object for writing to FoundryFS.

  • Returns: A FileSystem object for writing to Foundry.
  • Return type: FileSystem
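
A sketch of writing a raw file through the returned FileSystem, assuming the out handle from the first example (the file name and contents are illustrative):

    fs = out.filesystem()
    # Open a writable file in the output dataset and stream CSV text into it.
    with fs.open("summary.csv", "w") as f:
        f.write("id,count\n")
        f.write("1,42\n")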

classmethod from_transform_output(instance, delegate)

Sets fields in a TransformOutput instance to the values from the delegate TransformOutput.

pandas()

Return a pandas.DataFrame containing the full view of the dataset.

property path

The Compass path of the dataset.

property rid

The resource identifier of the dataset.

set_mode(mode)

Change the write mode of the dataset.

  • Parameters: mode (str) – The write mode, one of replace, modify, or append. In replace mode, anything written replaces the dataset. In modify mode, anything written is added to the dataset. In append mode, anything written is added to the dataset and will not override existing files.

The write mode cannot be changed after data has been written.

History

  • Added in version 1.61.0.
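
A sketch of typical usage, assuming the compute function from the first example: switch the mode before anything is written so that the new rows are added to the existing dataset rather than replacing it.

    def compute(out, source):
        out.set_mode("modify")                    # must happen before any data is written
        out.write_dataframe(source.dataframe())   # rows are added to the existing dataset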

property start_transaction_rid

The starting transaction of the output dataset.

write_dataframe(df, partition_cols=None, bucket_cols=None, bucket_count=None, sort_by=None, output_format=None, options=None, column_descriptions=None, column_typeclasses=None)

Write the given DataFrame to the dataset.

  • Parameters:
    • df (pyspark.sql.DataFrame) – The PySpark DataFrame to write.
    • partition_cols (List[str], optional) – Column partitioning to use when writing data.
    • bucket_cols (List[str], optional) – The columns by which to bucket the data. Must be specified if bucket_count is given.
    • bucket_count (int, optional) – The number of buckets. Must be specified if bucket_cols is given.
    • sort_by (List[str], optional) – The columns by which to sort the bucketed data.
    • output_format (str, optional) – The output file format; defaults to parquet.
    • options (dict, optional) – Extra options to pass through to org.apache.spark.sql.DataFrameWriter#option(String, String).
    • column_descriptions (Dict[str, str], optional) – Map of column names to their string descriptions. This map is intersected with the columns of the DataFrame, and each description must be no longer than 800 characters.
    • column_typeclasses (Dict[str, List[Dict[str, str]]], optional) – Map of column names to their column typeclasses. Each typeclass in the list is a Dict[str, str] in which only two keys are valid: name and kind. Each maps to the desired string value, up to a maximum of 100 characters. An example column_typeclasses value would be {"my_column": [{"name": "my_typeclass_name", "kind": "my_typeclass_kind"}]}.
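
A sketch of a write_dataframe call exercising several of these parameters, assuming df is a pyspark.sql.DataFrame and out is the handle from the first example (the column, description, and typeclass values are illustrative):

    out.write_dataframe(
        df,
        partition_cols=["event_date"],    # partition the output files by this column
        column_descriptions={"event_date": "Date on which the event was recorded"},
        column_typeclasses={"event_date": [{"name": "my_typeclass_name", "kind": "my_typeclass_kind"}]},
    )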

write_pandas(pandas_df)

Write the given pandas.DataFrame to the dataset.
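
A minimal sketch, assuming the out handle from the first example: build a small pandas DataFrame and write it as the full view of the output dataset.

    import pandas as pd

    summary = pd.DataFrame({"id": [1, 2], "count": [42, 7]})
    out.write_pandas(summary)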