
Use scikit-learn and SparkML pipelines

Sunset

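The documentation below describes the foundry_ml library, which is no longer recommended for use in the platform. Use the palantir_models library instead. You can also work through an example of migrating a model from the foundry_ml framework to the palantir_models framework.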

The foundry_ml library will be removed on October 31, 2025, in line with the planned deprecation of Python 3.9.

Many machine learning libraries, such as scikit-learn and SparkML, provide the concept of a "Pipeline" that encapsulates a sequence of transformations.
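As a rough illustration of the concept, the sketch below chains plain Python callables the way a pipeline chains transformations. `ToyPipeline` and its steps are hypothetical names made up for this illustration; real scikit-learn and SparkML pipelines also handle fitting, parameters, and persistence.

```python
from typing import Callable, List, Sequence, Tuple

Step = Tuple[str, Callable[[List[float]], List[float]]]


class ToyPipeline:
    """Hypothetical stand-in for the pipeline concept: named steps applied in order."""

    def __init__(self, steps: Sequence[Step]) -> None:
        self.steps = list(steps)

    def transform(self, values: List[float]) -> List[float]:
        # Feed the output of each step into the next one.
        for _name, func in self.steps:
            values = func(values)
        return values


def center(values: List[float]) -> List[float]:
    # Subtract the mean so the values are centered on zero.
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def square(values: List[float]) -> List[float]:
    # Replace each value with its square.
    return [v * v for v in values]


toy_pipeline = ToyPipeline([("center", center), ("square", square)])
result = toy_pipeline.transform([1.0, 2.0, 3.0])
```

The caller only sees one `transform` call; the order and identity of the individual steps stay hidden inside the pipeline object, which is the property the rest of this page works around.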

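Although the native wrappers that foundry_ml provides for both packages support saving pipeline objects, we recommend breaking the pipeline into its individual transformations, wrapping each one in a Stage object, and recombining them into a Model. This makes it easier to reuse stages across models, makes it easier to swap an individual stage for an implementation from another package, and adds functionality and transparency when viewing a model preview.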
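The reuse and swapping benefits can be sketched without any platform code. In the example below, `ToyStage` and `ToyModel` are hypothetical stand-ins written for this illustration; they are not the foundry_ml `Stage` and `Model` classes and only mimic the idea of recombining independent stages.

```python
from typing import Callable, List


class ToyStage:
    """Hypothetical stand-in for a wrapped transformation (not the foundry_ml Stage class)."""

    def __init__(self, name: str, func: Callable[[List[float]], List[float]]) -> None:
        self.name = name
        self.func = func


class ToyModel:
    """Hypothetical stand-in for a model recombined from individual stages."""

    def __init__(self, *stages: ToyStage) -> None:
        self.stages = list(stages)

    def transform(self, values: List[float]) -> List[float]:
        # Apply the stages in the order they were passed in.
        for stage in self.stages:
            values = stage.func(values)
        return values


# One scaling stage, defined once and shared by both models.
halve = ToyStage("halve", lambda xs: [x / 2 for x in xs])

# Two interchangeable final stages.
add_one = ToyStage("add_one", lambda xs: [x + 1 for x in xs])
negate = ToyStage("negate", lambda xs: [-x for x in xs])

model_a = ToyModel(halve, add_one)
model_b = ToyModel(halve, negate)  # same first stage, different last stage

output_a = model_a.transform([2.0, 4.0])
output_b = model_b.transform([2.0, 4.0])
```

Because each stage is a separate object, the same `halve` stage appears in both models, and replacing the last stage does not require rebuilding anything else.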

See the examples below for how to perform this decompose, wrap, and recombine process for scikit-learn and SparkML.

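Both examples use Code Workbook syntax and the housing data from the Modeling Objectives tutorial. They show how to construct a trained ML pipeline, convert it into a Model, and capture the input and output column parameters needed to pass data between stages.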

Example using a scikit-learn Pipeline

This example shows how to save a scikit-learn Pipeline. Note the explicit use of the output_column_name parameter.

def sklearn_pipeline_model(housing):

    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.preprocessing import StandardScaler
    from sklearn.compose import make_column_transformer
    from sklearn.pipeline import Pipeline

    from foundry_ml import Model, Stage

    # Select the features and the target variable
    X_train = housing[['median_income', "housing_median_age", "total_rooms"]]
    y_train = housing['median_house_value']

    # Create the column transformer, which passes the three feature columns through unchanged
    column_transformer = make_column_transformer(
        ('passthrough', ['median_income', "housing_median_age", "total_rooms"])
    )
    # Create the pipeline: column selection -> standardization -> polynomial expansion -> linear regression
    pipeline = Pipeline([
        ("preprocessing", column_transformer),
        ("ss", StandardScaler()),  # standardization
        ("pf", PolynomialFeatures(degree=2)),  # polynomial feature expansion
        ("lr", LinearRegression())  # linear regression model
    ])

    # Fit the pipeline
    pipeline.fit(X_train, y_train)

    # Unroll the fitted pipeline into individual stages
    # This can also be done with a list comprehension: Model(*[Stage(s[1]) for s in pipeline.steps])
    model = Model(Stage(pipeline["preprocessing"]),  # column selection stage
                  Stage(pipeline["ss"], output_column_name="features"),  # standardization stage
                  Stage(pipeline["pf"], output_column_name="features"),  # polynomial expansion stage
                  Stage(pipeline["lr"]))  # linear regression stage

    return model

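This code builds a machine learning model pipeline for house price prediction. It first selects the features and the target variable from the housing dataset, then fits the model through a sequence of steps: a column transformer, standardization, polynomial feature expansion, and linear regression. Finally, it uses the Model and Stage classes from the foundry_ml library to represent each step as a separate stage of the model.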
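One way to sanity-check the unrolling, independent of foundry_ml, is to apply the fitted steps one at a time and compare the result with the pipeline's own prediction. The sketch below makes two assumptions: it uses synthetic data in place of the housing dataset, and it uses a reduced pipeline without the column transformer. It relies only on scikit-learn's `Pipeline.steps` attribute and name-based indexing.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the housing features and target (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X[:, 0] * 2.0 + X[:, 1] ** 2 + rng.normal(scale=0.1, size=50)

pipeline = Pipeline([
    ("ss", StandardScaler()),
    ("pf", PolynomialFeatures(degree=2)),
    ("lr", LinearRegression()),
])
pipeline.fit(X, y)

# Each entry of pipeline.steps is a (name, fitted estimator) tuple.
step_names = [name for name, _ in pipeline.steps]

# Apply the fitted steps one by one, as the unrolled stages would.
scaled = pipeline["ss"].transform(X)
expanded = pipeline["pf"].transform(scaled)
manual_prediction = pipeline["lr"].predict(expanded)

# The step-by-step result should match the pipeline's own prediction.
matches = np.allclose(manual_prediction, pipeline.predict(X))
```

If `matches` is true, the individual fitted steps carry everything needed to reproduce the pipeline, which is what makes wrapping them one by one a safe substitute for saving the pipeline object as a whole.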

Example using a SparkML Pipeline

This example shows how to save a SparkML Pipeline.

def spark_pipeline(housing):
    from pyspark.ml.feature import PolynomialExpansion
    from pyspark.ml.feature import StandardScaler
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression
    from pyspark.ml import Pipeline
    from foundry_ml import Model, Stage

    # Use VectorAssembler to combine the selected input columns into a single feature column
    assembler = VectorAssembler(
        inputCols=['median_income', "housing_median_age", "total_rooms"],
        outputCol="features"
        )

    # Scale the feature column to unit standard deviation (withMean=False, so values are not centered)
    scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures",
                        withStd=True, withMean=False)

    # Apply a degree-2 polynomial expansion to the scaled feature column
    polyExpansion = PolynomialExpansion(degree=2, inputCol="scaledFeatures", outputCol="polyFeatures")

    # Configure the linear regression to read the polynomial feature column as input
    lr = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8, featuresCol='polyFeatures', labelCol='median_house_value')

    # Define a Pipeline containing all of the processing stages
    pipeline = Pipeline(stages=[assembler, scaler, polyExpansion, lr])

    # Fit the Pipeline on the training data; fit() returns a fitted PipelineModel
    pipeline = pipeline.fit(housing)

    # Wrap each fitted stage in a Stage and recombine them into a Model
    model = Model(*[Stage(s) for s in pipeline.stages])

    return model

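The code above shows how to build a machine learning Pipeline with PySpark that processes the housing data and fits a linear regression. It covers feature assembly, scaling, polynomial feature expansion, and training the linear regression model.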
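Because the SparkML stages are linked only by column names, a mismatch between one stage's output column and the next stage's input column is an easy mistake to make. The sketch below is a plain-Python check of that wiring. The `(name, inputs, output)` tuples and the `check_wiring` helper are hypothetical, written for this illustration; the column names mirror those in the example above, and `prediction` is assumed as the output column of the regression stage.

```python
from typing import List, Optional, Sequence, Set, Tuple

# (stage name, input columns, output column); hypothetical description of the stages.
StageSpec = Tuple[str, Sequence[str], Optional[str]]


def check_wiring(dataset_columns: Sequence[str], stages: Sequence[StageSpec]) -> List[str]:
    """Return a message for every input column that no earlier stage provides."""
    available: Set[str] = set(dataset_columns)
    problems: List[str] = []
    for name, inputs, output in stages:
        for column in inputs:
            if column not in available:
                problems.append(f"{name}: missing input column '{column}'")
        if output is not None:
            # Later stages can read what this stage writes.
            available.add(output)
    return problems


housing_columns = ["median_income", "housing_median_age", "total_rooms", "median_house_value"]

stages = [
    ("assembler", ["median_income", "housing_median_age", "total_rooms"], "features"),
    ("scaler", ["features"], "scaledFeatures"),
    ("polyExpansion", ["scaledFeatures"], "polyFeatures"),
    ("lr", ["polyFeatures", "median_house_value"], "prediction"),
]

problems = check_wiring(housing_columns, stages)

# A typo in one column name is reported against the stage that reads it.
broken = [("scaler", ["feature"], "scaledFeatures")]
broken_problems = check_wiring(housing_columns, broken)
```

For the stages as written, `problems` is empty: each stage reads a column that either exists in the dataset or was written by an earlier stage.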