Using scikit-learn and SparkML pipelines

Sunsetted functionality

The documentation below describes the foundry_ml library, which is no longer recommended for use in the platform. Instead, use the palantir_models library. You can also learn how to migrate a model from the foundry_ml framework to palantir_models through an example.

The foundry_ml library will be removed on October 31, 2025, in line with the planned deprecation of Python 3.9.

Many machine learning libraries, such as scikit-learn and SparkML, expose a notion of a "Pipeline" for encapsulating a sequence of transformations.
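As a brief illustration (a generic sketch, not tied to the examples below), a scikit-learn Pipeline chains named steps so that fitting and prediction run each step in order:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# A two-step pipeline: scale the inputs, then fit a linear model.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", LinearRegression()),
])
# pipeline.fit(X, y) runs StandardScaler.fit_transform, then LinearRegression.fit;
# pipeline.predict(X_new) applies the scaler before predicting.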

While foundry_ml's native wrappers for both of these packages support saving pipeline objects directly, we recommend exploding the pipeline into its individual transformations, wrapping each in a Stage object, and re-combining the stages into a Model. This enables easier re-use of stages across models, easier swapping of individual stages for implementations from other packages, and greater transparency when viewing the model preview.
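For instance, here is a hypothetical sketch of that swappability, using only the Stage and Model constructors shown in the examples below. The names X_train, column_transformer, poly, and lr are assumed to be fitted objects like those in the scikit-learn example; the scikit-learn StandardScaler stage is replaced by a separately fitted MinMaxScaler without touching the other stages:

from sklearn.preprocessing import MinMaxScaler
from foundry_ml import Model, Stage

# Fit an alternative scaler on the vectorizer's output (hypothetical names).
minmax_scaler = MinMaxScaler().fit(column_transformer.transform(X_train))

# Recombine: every stage except the scaler is reused unchanged.
model = Model(
    Stage(column_transformer),
    Stage(minmax_scaler, output_column_name="features"),  # swapped-in stage
    Stage(poly, output_column_name="features"),
    Stage(lr))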

See below for examples of how to perform this explode-wrap-recombine procedure for scikit-learn and for SparkML.

Both examples use Code Workbook syntax and the housing data from the Modeling Objectives tutorial. They show how to construct a trained ML pipeline, convert it into a Model, and use parameters to capture the relevant input and output columns for passing data between stages.

Example with scikit-learn Pipeline

This example shows how to save a scikit-learn Pipeline ↗. Note the explicit use of the output_column_name parameter.

def sklearn_pipeline_model(housing):
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.preprocessing import StandardScaler
    from sklearn.compose import make_column_transformer
    from sklearn.pipeline import Pipeline
    from foundry_ml import Model, Stage

    X_train = housing[['median_income', "housing_median_age", "total_rooms"]]
    y_train = housing['median_house_value']

    # Create vectorizer
    column_transformer = make_column_transformer(
        ('passthrough', ['median_income', "housing_median_age", "total_rooms"])
    )

    # Vectorizer -> StandardScaler -> PolynomialExpansion -> LinearRegression
    pipeline = Pipeline([
        ("preprocessing", column_transformer),
        ("ss", StandardScaler()),
        ("pf", PolynomialFeatures(degree=2)),
        ("lr", LinearRegression())
    ])

    pipeline.fit(X_train, y_train)

    # Expand out pipeline
    # can also be created via list comprehension: Model(*[Stage(s[1]) for s in pipeline.steps])
    model = Model(
        Stage(pipeline["preprocessing"]),
        Stage(pipeline["ss"], output_column_name="features"),
        Stage(pipeline["pf"], output_column_name="features"),
        Stage(pipeline["lr"]))

    return model
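The returned model can then be used for scoring in a downstream Code Workbook transform. The sketch below assumes foundry_ml's Model.transform method for applying all stages to an input DataFrame; the exact name of the prediction column depends on the final stage:

def sklearn_predictions(sklearn_pipeline_model, housing):
    # Applies each Stage in order: vectorize, scale, expand, predict.
    # The LinearRegression stage appends its output column ("prediction"
    # by default in the scikit-learn wrapper; treated as an assumption here).
    return sklearn_pipeline_model.transform(housing)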

Example with SparkML pipeline

This example shows how to save a SparkML pipeline ↗.

def spark_pipeline(housing):
    from pyspark.ml.feature import PolynomialExpansion
    from pyspark.ml.feature import StandardScaler
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression
    from pyspark.ml import Pipeline
    from foundry_ml import Model, Stage

    assembler = VectorAssembler(
        inputCols=['median_income', "housing_median_age", "total_rooms"],
        outputCol="features"
    )

    scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures",
                            withStd=True, withMean=False)

    polyExpansion = PolynomialExpansion(degree=2, inputCol="scaledFeatures",
                                        outputCol="polyFeatures")

    lr = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8,
                          featuresCol='polyFeatures', labelCol='median_house_value')

    pipeline = Pipeline(stages=[assembler, scaler, polyExpansion, lr])

    # Fit the pipeline to the training data.
    pipeline = pipeline.fit(housing)

    # Explode pipeline
    model = Model(*[Stage(s) for s in pipeline.stages])

    return model
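As with the scikit-learn example, the recombined model can score data in a downstream transform. This sketch likewise assumes foundry_ml's Model.transform method for applying each wrapped SparkML stage in sequence:

def spark_predictions(spark_pipeline, housing):
    # Each wrapped SparkML stage adds its output column (features,
    # scaledFeatures, polyFeatures, then the regression's prediction).
    return spark_pipeline.transform(housing)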