Python transforms support multiple query engines to handle different data processing needs. Choosing the right engine ensures optimal performance, cost efficiency, and developer productivity for your use case.
The available query engines are pandas ↗, Polars ↗, and Spark ↗. All three provide DataFrame APIs for simple, intuitive data manipulation.
Pandas and Polars are available through our standard single-node compute offering, known as lightweight. Spark can be used for distributed compute. For more information, refer to the PySpark documentation.
The following table shows Foundry feature availability across compute paradigms.
Feature | Single node (Lightweight) | Distributed (Spark) |
---|---|---|
Incremental transforms | ✓ | ✓ |
External transforms | ✓ | ✓ |
Python modeling API: snapshot | ✓ | ✓ |
Python modeling API: incremental | ✓ | ✗ |
Media set API: snapshot | ✓ | ✓ |
Media set API: incremental | ✓ | ✓ |
Abort transactions | ✓ | ✓ |
Dataset unmarking 1 | ✓ (except sever_permissions) | ✓ |
Source unmarking | ✗ | ✓ |
Data expectations | Limited | ✓ |
Tables API | Compute pushdown only | In-Foundry compute only |
Read output enforcing schema 2 | ✗ | ✓ |
Allowed run duration parameter | ✗ | ✓ |
Run as user parameter (deprecated) | ✗ | ✓ |
Resource metrics | ✓ | ✗ |
1 Single node transforms only support stop_propagating and stop_requiring; sever_permissions is not supported.
2 PySpark transforms allow you to read data written to an incremental output with a specific schema. This is necessary during the first dataset transaction, as no schema has yet been committed. This is not supported in single-node transforms.
Best for: Quick iteration, exploratory analysis, and small datasets
Pandas ↗ is a common data manipulation library in the Python ecosystem. It excels at rapid prototyping and provides an extensive ecosystem of compatible libraries. Use pandas when getting started with a new transform, or when your team needs to move quickly with familiar tools.
Key characteristics:
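As a quick illustration of the pandas style of rapid iteration, the sketch below runs a small group-by aggregation. The data and column names are illustrative, and this is plain pandas rather than the Foundry transforms API:

```python
import pandas as pd

# Illustrative data; in a transform this would come from an input dataset.
df = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "sales": [100, 80, 120, 60],
})

# Eager execution: each step runs immediately, which makes
# interactive exploration and debugging straightforward.
summary = df.groupby("region", as_index=False)["sales"].sum()
```

Eager, step-by-step execution is what makes pandas convenient for exploratory work: every intermediate result can be inspected immediately.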
Best for: Production data pipelines and medium-scale data processing
Polars ↗ should be your default choice for production transforms. Built on Apache Arrow with a Rust core, it delivers excellent performance through columnar storage and lazy evaluation. Polars combines the ease of DataFrame operations with the performance needed for production workloads.
Key characteristics:
Best for: Large-scale data processing and organizational data foundations
Spark ↗ is designed for distributed computing at scale. While it has higher overhead for small operations, it is the only option when your data exceeds single-node capacity or when building critical organizational datasets that require maximum scalability.
Key characteristics:
The size recommendations here do not apply to all queries. Single-node compute can be more performant and consume fewer resources well into the terabyte scale. Refer to the Polars lazy API documentation on larger-than-memory data transformations for more information on how to use Polars streaming. Queries that do not require all data to be loaded into memory at once will scale to arbitrary size on a single node.
We recommend starting with Polars as your default choice for production transforms. Switch to pandas when you need quick iteration or specific ecosystem libraries. Move to Spark only when data scale demands it, typically above 50GB.
Characteristic | Pandas | Polars | PySpark |
---|---|---|---|
Optimal (uncompressed) data size | < 1GB | 1-50GB | > 50GB |
Optimal number of rows* | < 1 million | 1-200 million | > 200 million |
Startup overhead | Minimal | Minimal | Significant |
Memory efficiency | Poor | Excellent | Good |
Processing speed (small data) | Fast | Fast | Slow |
Processing speed (medium data) | Poor | Excellent | Good |
Processing speed (large data) | Not suitable | Variable | Excellent |
Parallel execution | No | Single-node | Distributed |
Memory spilling | No | Limited | Automatic |
* The number of rows tolerable to each query engine will vary greatly depending on the schema. These numbers are given as a rough guide for common cases.