transforms.api.FileSystem

class transforms.api.FileSystem(foundry_fs, read_only=False)

A filesystem object for reading and writing raw dataset files in Spark transforms.

For lightweight, single-node transforms, see transforms.api.FoundryDataSidecarFileSystem.

files(glob=None, regex='.*', show_hidden=False, packing_heuristic=None)

Create a DataFrame containing the paths accessible within this dataset.

The DataFrame is partitioned by file size: each partition contains file paths whose combined size is at most spark.files.maxPartitionBytes, or a single file if that file alone exceeds spark.files.maxPartitionBytes. A file's size is calculated as its on-disk size plus spark.files.openCostInBytes.

  • Parameters:
    • glob (str, optional) – A Unix file-matching pattern. Also supports globstar.
    • regex (str, optional) – A regex pattern against which to match filenames.
    • show_hidden (bool, optional) – Include hidden files, that is, those prefixed with ‘.’ or ‘_’.
    • packing_heuristic (str, optional) – The heuristic to use for bin-packing files into Spark partitions: ffd (first fit decreasing) or wfd (worst fit decreasing). wfd tends to produce a less even distribution but is much faster, so it is recommended for datasets containing a very large number of files. If no heuristic is specified, one is selected automatically.
  • Returns: A DataFrame of (path, size, modified)
  • Return type: pyspark.sql.DataFrame
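
The partitioning rule and the two heuristics above can be illustrated with a small pure-Python sketch. This is the textbook form of first fit decreasing and worst fit decreasing, not the library's implementation, which the documentation does not show. The function name pack_files, its parameters, and the sample file sizes are all hypothetical.

```python
from typing import Dict, List


def pack_files(sizes: Dict[str, int], max_partition_bytes: int,
               open_cost_bytes: int, heuristic: str = "ffd") -> List[List[str]]:
    """Group file paths into partitions (illustrative sketch only)."""
    if heuristic not in ("ffd", "wfd"):
        raise ValueError(f"unknown heuristic: {heuristic!r}")

    # Cost of a file = on-disk size + per-file open cost.
    # "Decreasing": process the most expensive files first.
    items = sorted(
        ((size + open_cost_bytes, path) for path, size in sizes.items()),
        reverse=True,
    )

    bins = []  # each bin is [remaining_bytes, [paths]]
    for cost, path in items:
        if cost > max_partition_bytes:
            # An oversized file gets a partition of its own; -1 marks it closed.
            bins.append([-1, [path]])
            continue
        fitting = [b for b in bins if b[0] >= cost]
        if not fitting:
            bins.append([max_partition_bytes - cost, [path]])
            continue
        if heuristic == "ffd":
            target = fitting[0]  # first bin with enough room
        else:
            target = max(fitting, key=lambda b: b[0])  # bin with the most room
        target[0] -= cost
        target[1].append(path)
    return [paths for _, paths in bins]


# Hypothetical file sizes in bytes.
sizes = {"a": 60, "b": 50, "c": 40, "d": 30, "big": 200}
ffd = pack_files(sizes, max_partition_bytes=100, open_cost_bytes=10, heuristic="ffd")
wfd = pack_files(sizes, max_partition_bytes=100, open_cost_bytes=10, heuristic="wfd")
```

With these inputs the two heuristics differ only in where the smallest file lands: first fit puts it in the earliest partition that still has room, worst fit in the partition with the most room left.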

property hadoop_path

Fetches the Hadoop path of the dataset, which can be used for code that requires direct Hadoop IO.

  • Returns: The Hadoop path of the dataset backing this FileSystem or None
  • Return type: string
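
Because hadoop_path may be None, code that depends on direct Hadoop IO may want to check for that before using the value. A minimal sketch, assuming only that the object exposes a hadoop_path attribute; the helper name require_hadoop_path and the example path are hypothetical.

```python
from types import SimpleNamespace


def require_hadoop_path(fs) -> str:
    """Return fs.hadoop_path, or raise if the FileSystem has none."""
    path = fs.hadoop_path
    if path is None:
        raise ValueError("FileSystem has no Hadoop path; use ls() and open() instead")
    return path


# Stand-in object for local experimentation; not part of transforms.api.
fake_fs = SimpleNamespace(hadoop_path="hdfs://example/dataset")
resolved = require_hadoop_path(fake_fs)
```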

ls(glob=None, regex='.*', show_hidden=False)

Recurses through all directories and lists all files matching the given patterns, starting from the root directory of the dataset.

  • Parameters:
    • glob (str, optional) – A Unix file-matching pattern. Also supports globstar.
    • regex (str, optional) – A regex pattern against which to match filenames.
    • show_hidden (bool, optional) – Include hidden files, that is, those prefixed with ‘.’ or ‘_’.
  • Yields: FileStatus – The logical path, file size (bytes), and modified timestamp (ms since January 1, 1970 UTC)
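
As an illustration of consuming ls(), the sketch below totals the size of matching files and converts the newest modified timestamp from milliseconds to a UTC datetime. It assumes each yielded record exposes path, size, and modified attributes, which matches the fields listed above but is not confirmed here. The summarize helper and the StandInFileSystem class are hypothetical; the stand-in exists only so the sketch runs outside Foundry, and it approximates glob matching with fnmatch.

```python
import fnmatch
from collections import namedtuple
from datetime import datetime, timezone

# Assumed record shape, based on the fields listed for ls().
FileStatus = namedtuple("FileStatus", ["path", "size", "modified"])


def summarize(fs, glob=None):
    """Return (file_count, total_bytes, newest_modified_utc) for matching files."""
    count, total, newest_ms = 0, 0, None
    for status in fs.ls(glob=glob):
        count += 1
        total += status.size
        if newest_ms is None or status.modified > newest_ms:
            newest_ms = status.modified
    newest = None
    if newest_ms is not None:
        # modified is in milliseconds since the Unix epoch.
        newest = datetime.fromtimestamp(newest_ms / 1000, tz=timezone.utc)
    return count, total, newest


class StandInFileSystem:
    """Hypothetical stand-in that mimics only ls(glob=...); not transforms.api."""

    def __init__(self, statuses):
        self._statuses = list(statuses)

    def ls(self, glob=None):
        for status in self._statuses:
            if glob is None or fnmatch.fnmatch(status.path, glob):
                yield status


fs = StandInFileSystem([
    FileStatus("a.csv", 100, 1_700_000_000_000),
    FileStatus("b.csv", 250, 1_700_000_060_000),
    FileStatus("notes.txt", 10, 1_600_000_000_000),
])
count, total, newest = summarize(fs, glob="*.csv")
```

Since ls() yields results rather than returning a list, the helper accumulates in a single pass instead of materializing every record.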

open(path, mode='r', **kwargs)

Open a FoundryFS file in the given mode.

  • Parameters:
    • path (str) – The logical path of the file in the dataset.
    • mode (str) – File opening mode, defaults to read.
    • **kwargs – Remaining keyword args passed to io.open().
  • Returns: A Python file-like object attached to the stream.
  • Return type: File
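
A minimal sketch of reading a CSV file through open(), passing encoding and newline as extra keyword arguments, which per the description above are forwarded to io.open(). The read_csv_rows helper, the LocalStandInFileSystem class, and the file name people.csv are hypothetical; the stand-in maps logical paths onto a temporary local directory so the example runs outside Foundry.

```python
import csv
import io
import os
import tempfile


def read_csv_rows(fs, path):
    """Read a CSV file at a logical dataset path into a list of dicts."""
    # encoding and newline travel through **kwargs to io.open().
    with fs.open(path, "r", encoding="utf-8", newline="") as f:
        return list(csv.DictReader(f))


class LocalStandInFileSystem:
    """Hypothetical stand-in that mimics only open(); not transforms.api."""

    def __init__(self, root):
        self._root = root

    def open(self, path, mode="r", **kwargs):
        return io.open(os.path.join(self._root, path), mode, **kwargs)


root = tempfile.mkdtemp()
fs = LocalStandInFileSystem(root)

# Write a small sample file so there is something to read back.
with fs.open("people.csv", "w", encoding="utf-8", newline="") as f:
    f.write("name,age\nAda,36\nGrace,45\n")

rows = read_csv_rows(fs, "people.csv")
```

Using the returned object as a context manager closes the underlying stream even if parsing fails partway through.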