Using scikit-learn and SparkML pipelines

Sunsetted functionality

The documentation below describes the foundry_ml library, which is no longer recommended for use in the platform. Instead, use the palantir_models library. You can also learn how to migrate a model from the foundry_ml framework to palantir_models through an example.

The foundry_ml library will be removed on October 31, 2025, in line with the planned deprecation of Python 3.9.

Many machine learning libraries, such as scikit-learn and SparkML, expose a notion of a "Pipeline" for encapsulating a sequence of transformations.
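As a brief illustration (a generic sketch, not tied to the examples below), a scikit-learn Pipeline chains named steps so that fitting and prediction run each step in order:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# A two-step pipeline: scale the inputs, then fit a linear model.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", LinearRegression()),
])
# pipeline.fit(X, y) runs StandardScaler.fit_transform, then LinearRegression.fit;
# pipeline.predict(X_new) applies the scaler before predicting.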

While foundry_ml's native wrappers for both of these packages support saving pipeline objects directly, we recommend exploding the pipeline into its individual transformations, wrapping each in a Stage object, and re-combining the stages into a Model. This enables easier re-use of stages across models, easier swapping of individual stages for implementations from other packages, and greater transparency when viewing the model preview.
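For instance, here is a hypothetical sketch of that swappability, using only the Stage and Model constructors shown in the examples below. The names X_train, column_transformer, poly, and lr are assumed to be fitted objects like those in the scikit-learn example; the scikit-learn StandardScaler stage is replaced by a separately fitted MinMaxScaler without touching the other stages:

from sklearn.preprocessing import MinMaxScaler
from foundry_ml import Model, Stage

# Fit an alternative scaler on the vectorizer's output (hypothetical names).
minmax_scaler = MinMaxScaler().fit(column_transformer.transform(X_train))

# Recombine: every stage except the scaler is reused unchanged.
model = Model(
    Stage(column_transformer),
    Stage(minmax_scaler, output_column_name="features"),  # swapped-in stage
    Stage(poly, output_column_name="features"),
    Stage(lr))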

See below for examples of how to perform this explode-wrap-recombine procedure for scikit-learn and for SparkML.

Both examples use Code Workbook syntax and the housing data from the Modeling Objectives tutorial. They show how to construct a trained ML pipeline, convert it into a Model, and use parameters to capture the relevant input and output columns for passing data between stages.

Example with scikit-learn Pipeline

This example shows how to save a scikit-learn Pipeline ↗. Note the explicit use of the output_column_name parameter.

def sklearn_pipeline_model(housing):
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.preprocessing import StandardScaler
    from sklearn.compose import make_column_transformer
    from sklearn.pipeline import Pipeline
    from foundry_ml import Model, Stage

    X_train = housing[['median_income', "housing_median_age", "total_rooms"]]
    y_train = housing['median_house_value']

    # Create vectorizer
    column_transformer = make_column_transformer(
        ('passthrough', ['median_income', "housing_median_age", "total_rooms"])
    )

    # Vectorizer -> StandardScaler -> PolynomialExpansion -> LinearRegression
    pipeline = Pipeline([
        ("preprocessing", column_transformer),
        ("ss", StandardScaler()),
        ("pf", PolynomialFeatures(degree=2)),
        ("lr", LinearRegression())
    ])

    pipeline.fit(X_train, y_train)

    # Expand out pipeline
    # can also be created via list comprehension: Model(*[Stage(s[1]) for s in pipeline.steps])
    model = Model(
        Stage(pipeline["preprocessing"]),
        Stage(pipeline["ss"], output_column_name="features"),
        Stage(pipeline["pf"], output_column_name="features"),
        Stage(pipeline["lr"]))

    return model
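The returned model can then be used for scoring in a downstream Code Workbook transform. The sketch below assumes foundry_ml's Model.transform method for applying all stages to an input DataFrame; the exact name of the prediction column depends on the final stage:

def sklearn_predictions(sklearn_pipeline_model, housing):
    # Applies each Stage in order: vectorize, scale, expand, predict.
    # The LinearRegression stage appends its output column ("prediction"
    # by default in the scikit-learn wrapper; treated as an assumption here).
    return sklearn_pipeline_model.transform(housing)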

Example with SparkML pipeline

This example shows how to save a SparkML pipeline ↗.

def spark_pipeline(housing):
    from pyspark.ml.feature import PolynomialExpansion
    from pyspark.ml.feature import StandardScaler
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression
    from pyspark.ml import Pipeline
    from foundry_ml import Model, Stage

    assembler = VectorAssembler(
        inputCols=['median_income', "housing_median_age", "total_rooms"],
        outputCol="features"
    )

    scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures",
                            withStd=True, withMean=False)

    polyExpansion = PolynomialExpansion(degree=2, inputCol="scaledFeatures",
                                        outputCol="polyFeatures")

    lr = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8,
                          featuresCol='polyFeatures', labelCol='median_house_value')

    pipeline = Pipeline(stages=[assembler, scaler, polyExpansion, lr])

    # Fit the pipeline to the training data.
    pipeline = pipeline.fit(housing)

    # Explode pipeline
    model = Model(*[Stage(s) for s in pipeline.stages])

    return model
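As with the scikit-learn example, the recombined model can score data in a downstream transform. This sketch likewise assumes foundry_ml's Model.transform method for applying each wrapped SparkML stage in sequence:

def spark_predictions(spark_pipeline, housing):
    # Each wrapped SparkML stage adds its output column (features,
    # scaledFeatures, polyFeatures, then the regression's prediction).
    return spark_pipeline.transform(housing)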