Code-defined input filtering

Apart from the Sampled and Full dataset input strategy configurations described in the preview transforms documentation, VS Code preview also supports a code-defined filters option. This option allows you to specify a custom filtering strategy implemented directly in your code. When applicable, custom filtering strategies leverage pushdown predicates to ensure that only the most relevant data samples are used in preview.

Structured inputs are supported for both Spark and lightweight transforms; unstructured inputs, such as raw files, are supported for Spark only. You can select any eligible function in your repository from the multi-select dropdown menu, in any order you prefer.

Filters are applied in the order they appear in the selection box. The Palantir extension for Visual Studio Code automatically discovers all eligible filters anywhere in the project codebase and shows them in the selection dropdown menu.
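Because filters are applied in order, selecting two filters is conceptually equivalent to composing the functions in that order. The sketch below (with hypothetical function names, not part of the extension's API) illustrates why the order you choose can change the preview result:

```python
from pyspark.sql import DataFrame
from pyspark.sql import functions as F


def ocean_only(df: DataFrame) -> DataFrame:
    """Keep only the rows whose Habitat column is Ocean."""
    return df.filter(F.col("Habitat") == "Ocean")


def first_ten(df: DataFrame) -> DataFrame:
    """Keep only the first ten rows."""
    return df.limit(10)


# Selecting ocean_only, then first_ten previews first_ten(ocean_only(df)):
# up to ten ocean animals. The reverse order previews ocean_only(first_ten(df)):
# only the ocean animals that happen to be among the first ten rows.
```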

Configure code-defined input filtering.

To create an eligible preview filter from a Python function, follow the rules listed below.

The function must be:

  • directly defined in the global scope of its module
  • fully type-annotated with one of the following annotations:
    • For Spark transforms: (pyspark.sql.DataFrame) -> pyspark.sql.DataFrame
    • For lightweight transforms: (polars.LazyFrame) -> polars.LazyFrame
    • For raw files: (collections.abc.Iterator[transforms.api.FileStatus]) -> collections.abc.Iterator[transforms.api.FileStatus]

The function should NOT (counter-examples are sketched after this list):

  • be nested
  • be guarded by if, with, for, or other statements
  • be part of a class
  • be imported from somewhere else
  • be a variable assigned to a function
  • be an async function
  • be a private function (that is, its name cannot start with _)
  • have any decorators applied to it
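
For contrast, the following sketch collects patterns that break these rules; none of these hypothetical functions would be discovered as eligible filters:

```python
from pyspark.sql import DataFrame


def outer() -> None:
    def nested_filter(df: DataFrame) -> DataFrame:  # nested: not eligible
        return df.limit(10)


if True:
    def guarded_filter(df: DataFrame) -> DataFrame:  # guarded by a statement: not eligible
        return df.limit(10)


class Filters:
    def method_filter(self, df: DataFrame) -> DataFrame:  # part of a class: not eligible
        return df.limit(10)


aliased_filter = lambda df: df.limit(10)  # variable assigned to a function: not eligible


async def async_filter(df: DataFrame) -> DataFrame:  # async: not eligible
    return df.limit(10)


def _private_filter(df: DataFrame) -> DataFrame:  # name starts with _: not eligible
    return df.limit(10)


def identity_decorator(fn):
    return fn


@identity_decorator
def decorated_filter(df: DataFrame) -> DataFrame:  # decorated: not eligible
    return df.limit(10)


def untyped_filter(df):  # missing type annotations: not eligible
    return df.limit(10)
```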

The following example lists some eligible functions that can be used for preview filters:

```python
from pyspark.sql import DataFrame
from pyspark.sql import functions as F
from collections.abc import Iterator
from transforms.api import FileStatus
import itertools as it
import polars as pl


def limit_files(files: Iterator[FileStatus]) -> Iterator[FileStatus]:
    """Limit the number of files in a file system listing."""
    return it.islice(files, 10)


def ocean_animals_only(df: DataFrame) -> DataFrame:
    """Get only animals living in the Ocean"""
    return df.filter(F.col("Habitat") == "Ocean")


def grassland_animals_only_lightweight(df: pl.LazyFrame) -> pl.LazyFrame:
    """Get only animals living on Grassland"""
    return df.filter(pl.col("Habitat") == "Grassland")
```

You can receive immediate feedback on whether a function is eligible as a code-defined preview filter through the CodeLens hint displayed above it.

The CodeLens hint displays above eligible filter functions to indicate they are valid preview filters.

Add parameters to preview filters

Code-defined preview filters can also accept parameters, allowing you to create more generic and reusable filter functions that can be easily adapted to different scenarios. To add parameters to your preview filter functions, define them in the function signature along with their type annotations. Then, mark them as keyword-only arguments by separating them from the first parameter with the * symbol. The Palantir extension for Visual Studio Code will automatically detect these parameters and prompt you to provide values for them when you select the filter for preview.

The following example shows a preview filter function with parameters. Note the * symbol before the habitat parameter:

```python
from pyspark.sql import DataFrame
from pyspark.sql import functions as F


def filter_animals_by_habitat(df: DataFrame, *, habitat: str) -> DataFrame:
    """Get only animals living in the given habitat."""
    return df.filter(F.col("Habitat") == habitat)
```
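
Outside the preview flow, the * is ordinary Python keyword-only syntax, so a direct call must pass habitat by name. A minimal sketch, assuming the filter_animals_by_habitat function defined above and a hypothetical local Spark session:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
animals_df = spark.createDataFrame(
    [("Shark", "Ocean"), ("Lion", "Grassland")],
    ["Animal", "Habitat"],
)

filter_animals_by_habitat(animals_df, habitat="Ocean")   # OK: keyword argument
# filter_animals_by_habitat(animals_df, "Ocean")         # TypeError: habitat is keyword-only
```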

Currently supported parameter types are str, int, float, and bool. When you select a filter with parameters for preview, a prompt will appear for each parameter, allowing you to enter the desired value. The entered values are used when the filter is applied during the preview run. You can also provide default values for parameters in the function signature; a default value is used whenever you do not enter a value at the prompt.

```python
from pyspark.sql import DataFrame
from pyspark.sql import functions as F


def example_filter_with_params(
    df: DataFrame,
    *,
    int_param: int,
    float_param: float = 3.3,
    str_param: str = "Hello world",
    bool_param: bool = True,
) -> DataFrame:
    """This is an example function with many parameters of different types"""
    print(f"int_param: {int_param}, float_param: {float_param}, str_param: {str_param}, bool_param: {bool_param}")
    return df.limit(int_param)
```
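
As a hedged illustration of how the defaults behave, calling example_filter_with_params directly in plain Python (reusing the hypothetical animals_df from the earlier sketch) only requires the parameter without a default; the others fall back to their declared values, mirroring what happens when you leave a prompt empty:

```python
# int_param has no default, so it must always be provided;
# float_param, str_param, and bool_param fall back to 3.3, "Hello world", and True.
result = example_filter_with_params(animals_df, int_param=5)

# Any default can still be overridden explicitly by name.
result = example_filter_with_params(animals_df, int_param=5, bool_param=False)
```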

Applying a code-defined filter with parameters.