Python transforms support multiple query engines to handle different data processing needs. Choosing the right engine ensures optimal performance, cost efficiency, and developer productivity for your use case.
The available query engines are pandas ↗, Polars ↗, and Spark ↗. All three provide DataFrame APIs for simple, intuitive data manipulation.
Pandas and Polars are available through our standard single-node compute offering, known as lightweight. Spark can be used for distributed compute. For more information, refer to the PySpark documentation.
The following table shows Foundry feature availability across compute paradigms.
Feature | Single node (Lightweight) | Distributed (Spark) |
---|---|---|
Incremental transforms | ✓ | ✓ |
External transforms | ✓ | ✓ |
Python modeling API: snapshot | ✓ | ✓ |
Python modeling API: incremental | ✓ | ✗ |
Media set API: snapshot | ✓ | ✓ |
Media set API: incremental | ✓ | ✓ |
Abort transactions | ✓ | ✓ |
Dataset unmarking 1 | ✓ (except sever_permissions) | ✓ |
Source unmarking | ✗ | ✓ |
Data expectations | Limited | ✓ |
Tables API | Compute pushdown only | In-Foundry compute only |
Read output enforcing schema 2 | ✗ | ✓ |
Allowed run duration parameter | ✗ | ✓ |
Run as user parameter (deprecated) | ✗ | ✓ |
Resource metrics | ✓ | ✗ |
1 Single node transforms only support stop_propagating and stop_requiring; sever_permissions is not supported.
2 PySpark transforms allow you to read data written to an incremental output with a specific schema. This is necessary during the first dataset transaction, as no schema has yet been committed. This is not supported in single-node transforms.
Best for: Quick iteration, exploratory analysis, and small datasets
Pandas ↗ is a common data manipulation library in the Python ecosystem. It excels at rapid prototyping and provides an extensive ecosystem of compatible libraries. Use pandas when getting started with a new transform, or when your team needs to move quickly with familiar tools.
Key characteristics:
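As a quick illustration of the pandas style of rapid iteration, the sketch below runs a small group-by aggregation. The data and column names are illustrative, and this is plain pandas rather than the Foundry transforms API:

```python
import pandas as pd

# Illustrative data; in a transform this would come from an input dataset.
df = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "sales": [100, 80, 120, 60],
})

# Eager execution: each step runs immediately, which makes
# interactive exploration and debugging straightforward.
summary = df.groupby("region", as_index=False)["sales"].sum()
```

Eager, step-by-step execution is what makes pandas convenient for exploratory work: every intermediate result can be inspected immediately.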
Best for: Production data pipelines and medium-scale data processing
Polars ↗ should be your default choice for production transforms. Built on Apache Arrow with a Rust core, it delivers excellent performance through columnar storage and lazy evaluation. Polars combines the ease of DataFrame operations with the performance needed for production workloads.
Key characteristics:
Best for: Large-scale data processing and organizational data foundations
Spark ↗ is designed for distributed computing at scale. While it has higher overhead for small operations, it is the only option when your data exceeds single-node capacity or when building critical organizational datasets that require maximum scalability.
Key characteristics:
The size recommendations here do not apply to all queries. Single-node compute can be more performant and consume fewer resources well into the terabyte scale. Refer to the Polars lazy API documentation on larger-than-memory data transformations for more information on how to use Polars streaming. Queries that do not require all data to be loaded into memory at once will scale to arbitrary size on a single node.
We recommend starting with Polars as your default choice for production transforms. Switch to pandas when you need quick iteration or specific ecosystem libraries. Move to Spark only when data scale demands it, typically above 50GB.
Characteristic | Pandas | Polars | PySpark |
---|---|---|---|
Optimal (uncompressed) data size | < 1GB | 1-50GB | > 50GB |
Optimal number of rows* | < 1 million | 1-200 million | > 200 million |
Startup overhead | Minimal | Minimal | Significant |
Memory efficiency | Poor | Excellent | Good |
Processing speed (small data) | Fast | Fast | Slow |
Processing speed (medium data) | Poor | Excellent | Good |
Processing speed (large data) | Not suitable | Variable | Excellent |
Parallel execution | No | Single-node | Distributed |
Memory spilling | No | Limited | Automatic |
* The number of rows tolerable to each query engine will vary greatly depending on the schema. These numbers are given as a rough guide for common cases.