This guide provides an overview of debugging techniques available in Python transforms. More information on errors and exceptions can be found in the Python documentation ↗.
A debugger is a useful tool for identifying and resolving issues in your Python transforms. You can set breakpoints to pause transform execution and examine variables, view DataFrames, and understand functions and libraries.
Learn more about the debugger for your chosen IDE:
A traceback in Python is an error message that contains the sequence of function calls that led to an error, also known as a stack trace in other programming languages. In Python, any unhandled exceptions will result in a traceback, and the most recent call will be at the bottom.
Most Python transforms runtime failures surface as tracebacks, so it is important to understand how to read them.
Consider the following code example:
Copied!1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
class Stats(object): nums = [] def add(self, n): self.nums.append(n) def sum(self): return sum(self.nums) def avg(self): return self.sum() / len(self.nums) def main(): stats = Statistics() stats.add(1) stats.add(2) stats.add(3) print(stats.avg())
Running this code results in the following traceback:
Copied!1 2 3 4 5 6
Traceback (most recent call last): File "test.py", line 26, in <module> main() File "test.py", line 16, in main stats = Statistics() NameError: global name 'Statistics' is not defined
Unlike stack traces in other programming languages, Python tracebacks show the most recent call last. From the bottom-up, the traceback shows the following:
NameError
↗. There are many built-in Python exception classes ↗, but it is also possible for code to define its own exception classes.global name 'Statistics' is not defined
. This message contains the most useful information for debugging purposes.File "test.py", line 26, in <module>
followed by the line of code in question (line 16).Using this traceback, we can see that the exception occurs at line 16 of test.py
in the main
method. Specifically, the line of code causing the error is stats = Statistics()
, and the exception thrown is NameError
. From this, we can deduce that the name Statistics
does not exist. Looking back at the example code, it appears that the name Stats
should have been used instead of Statistics
.
For logging, you can use the following options:
print
statements. This method is supported for standard (lightweight) transforms, but not for Spark transforms.INFO
-level logs and higher are saved.Logs are available in VS Code under Output and in the Builds application under Actions > View logs.
The following code example demonstrates how you can output logs to help with debugging:
Copied!1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
from transforms.api import transform, Input, Output import polars as pl import logging log = logging.getLogger(__name__) @transform.using( output=Output("/path/output"), input=Input("/path/input"), ) def my_compute_function(output, input): input_df = input.polars() log.info("Number of rows: %d", input_df.height) print("Number of columns: " + str(input_df.width)) output.write_table(input_df)
Copied!1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
from transforms.api import transform, Input, Output import pandas as pd import logging log = logging.getLogger(__name__) @transform.using( output=Output("/path/output"), input=Input("/path/input"), ) def my_compute_function(output, input): input_df = input.pandas() log.info("Number of rows: %d", input_df.shape[0]) print("Number of columns:" + str(input_df.shape[1])) output.write_table(input_df)
Copied!1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
from transforms.api import transform_df, Input, Output from myproject.datasets import utils import logging log = logging.getLogger(__name__) @transform_df( Output("/path/output"), input=Input("/path/input"), ) def my_compute_function(input): log.info("Number of rows: %d", input.count()) log.info("Number of columns: %d", len(input.columns)) return input
You can find additional information about working with Spark logs in the Spark documentation.