Impressive! Scikit-Learn's New Release Is Another Major Upgrade

Date: 2023-07-21

Source: Internet

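The scikit-learn 1.3 release includes many bug fixes and improvements and introduces several important new features, among them target encoding and missing-value support in decision trees. For an exhaustive list of all changes, see the release notes.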

https://scikit-learn.org/stable/whats_new/v1.3.html#changes-1-3

Install the latest version with pip:

pip install --upgrade scikit-learn

Or with conda:

conda install -c conda-forge scikit-learn

Feature 1: Metadata routing

https://scikit-learn.org/stable/auto_examples/miscellaneous/plot_metadata_routing.html

This release introduces a new way of routing metadata such as sample_weight, which affects how meta-estimators such as pipeline.Pipeline and model_selection.GridSearchCV route metadata.

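Although the infrastructure for this feature is included in this release, the work is still in progress and not all meta-estimators support it yet. You can read more about the feature in the Metadata Routing User Guide.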

Feature 2: HDBSCAN: hierarchical density-based clustering

https://scikit-learn.org/stable/modules/generated/sklearn.cluster.HDBSCAN.html

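By running a modified version of cluster.DBSCAN over multiple epsilon values simultaneously, cluster.HDBSCAN finds clusters of varying densities, which makes it more robust to parameter selection than cluster.DBSCAN.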

import numpy as np
from sklearn.cluster import HDBSCAN
from sklearn.datasets import load_digits
from sklearn.metrics import v_measure_score

X, true_labels = load_digits(return_X_y=True)
print(f"number of digits: {len(np.unique(true_labels))}")

hdbscan = HDBSCAN(min_cluster_size=15).fit(X)
# Label -1 marks noise points; drop them before counting and scoring.
non_noisy_labels = hdbscan.labels_[hdbscan.labels_ != -1]
print(f"number of clusters found: {len(np.unique(non_noisy_labels))}")

print(v_measure_score(true_labels[hdbscan.labels_ != -1], non_noisy_labels))

Feature 3: TargetEncoder

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.TargetEncoder.html

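preprocessing.TargetEncoder is well suited to categorical features with high cardinality. It encodes each category using a shrunk estimate of the mean target value of the observations belonging to that category.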

import numpy as np
from sklearn.preprocessing import TargetEncoder

X = np.array([["cat"] * 30 + ["dog"] * 20 + ["snake"] * 38], dtype=object).T
y = [90.3] * 30 + [20.4] * 20 + [21.2] * 38

enc = TargetEncoder(random_state=0)
X_trans = enc.fit_transform(X, y)

enc.encodings_

Feature 4: Missing value support in decision trees

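The tree.DecisionTreeClassifier and tree.DecisionTreeRegressor classes now support missing values. For each candidate threshold on the non-missing data, the splitter evaluates the split twice: once with all missing values sent to the left node and once with all of them sent to the right node.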

import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([0, 1, 6, np.nan]).reshape(-1, 1)
y = [0, 0, 1, 1]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
tree.predict(X)

Feature 5: Validation Curve

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.ValidationCurveDisplay.html

You can now call ValidationCurveDisplay.from_estimator to create a ValidationCurveDisplay instance that plots a validation curve.

import numpy as np  # needed for np.geomspace below
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ValidationCurveDisplay

X, y = make_classification(1000, 10, random_state=0)

_ = ValidationCurveDisplay.from_estimator(
    LogisticRegression(),
    X,
    y,
    param_name="C",
    param_range=np.geomspace(1e-5, 1e3, num=9),
    score_type="both",
    score_name="Accuracy",
)

Feature 6: Gamma loss

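With loss="gamma", the ensemble.HistGradientBoostingRegressor class supports the Gamma deviance loss function. This loss is suited to modeling strictly positive targets with a right-skewed distribution.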

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_low_rank_matrix
from sklearn.ensemble import HistGradientBoostingRegressor

n_samples, n_features = 500, 10
rng = np.random.RandomState(0)
X = make_low_rank_matrix(n_samples, n_features, random_state=rng)
coef = rng.uniform(low=-10, high=20, size=n_features)
y = rng.gamma(shape=2, scale=np.exp(X @ coef) / 2)

gbdt = HistGradientBoostingRegressor(loss="gamma")
cross_val_score(gbdt, X, y).mean()
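For reference, the mean Gamma deviance of predictions ŷ against strictly positive targets y is 2 · mean(log(ŷ / y) + y / ŷ − 1). The sketch below evaluates that expression by hand and compares it with sklearn.metrics.mean_gamma_deviance. The sample values are made up for illustration, and the snippet says nothing about how the gradient boosting implementation scales the loss internally.

```python
import numpy as np
from sklearn.metrics import mean_gamma_deviance

y_true = np.array([2.0, 0.5, 1.0, 4.0])  # made-up strictly positive targets
y_pred = np.array([0.5, 0.5, 2.0, 2.0])  # made-up strictly positive predictions

# Each term is zero when y_pred == y_true and grows as the ratio drifts from 1.
manual = 2 * np.mean(np.log(y_pred / y_true) + y_true / y_pred - 1)
reference = mean_gamma_deviance(y_true, y_pred)
print(manual, reference)
```

Because the deviance depends only on the ratio between prediction and target, it penalizes relative rather than absolute errors, which is why it fits right-skewed positive targets.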

Feature 7: Grouping infrequent categories

Like preprocessing.OneHotEncoder, preprocessing.OrdinalEncoder now supports grouping infrequent categories into a single output for each feature. The parameters that enable this grouping are min_frequency and max_categories.

import numpy as np
from sklearn.preprocessing import OrdinalEncoder

X = np.array(
    [["dog"] * 5 + ["cat"] * 20 + ["rabbit"] * 10 + ["snake"] * 3], dtype=object
).T

enc = OrdinalEncoder(min_frequency=6).fit(X)
enc.infrequent_categories_
