Chapter 03: scikit-learn in Practice, from Linear Regression to Random Forests
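"scikit-learn's design philosophy is that every algorithm exposes the same interface (fit/predict/score). Learn one, and you have learned how to use them all. It is a model of engineering elegance."
1. sklearn's Unified Interface
The core design idea of sklearn is that every algorithm follows the same interface: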
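from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
# Example data so the loop runs on its own (any classification dataset would do)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Every model is used in exactly the same way.
# Note: score() returns accuracy for classifiers and R² for regressors,
# so only compare models that solve the same kind of task.
for model in [LogisticRegression(max_iter=1000), DecisionTreeClassifier(), RandomForestClassifier(), SVC()]:
    model.fit(X_train, y_train)  # train
    y_pred = model.predict(X_test)  # predict
    score = model.score(X_test, y_test)  # evaluate
    print(f"{type(model).__name__}: {score:.4f}")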
This means that switching algorithms costs almost nothing, so you can quickly compare several algorithms on the same task.
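The same fit/predict/score pattern carries over to regressors; the only difference is that score() then reports R² instead of accuracy. The sketch below is a minimal illustration on synthetic data generated with make_regression (the dataset and the reg_scores name are assumptions made for this example, not part of the chapter's data).

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (an assumption for illustration only)
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

reg_scores = {}
for ModelClass in [LinearRegression, DecisionTreeRegressor, RandomForestRegressor]:
    model = ModelClass()
    model.fit(X_train, y_train)              # same call as for classifiers
    r2 = model.score(X_test, y_test)         # R² for regressors, not accuracy
    reg_scores[ModelClass.__name__] = r2
    print(f"{ModelClass.__name__}: R² = {r2:.4f}")
```

Because the target here is generated by a linear function plus noise, the linear model should do well; on real data the ranking can differ.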
2. Linear Models
Linear regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
# Load the data
housing = fetch_california_housing()
X, y = housing.data, housing.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Standardize: fit the scaler on the training set only, then reuse it on the test set
# (regularized linear models and coefficient comparisons are sensitive to feature scale)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Ordinary linear regression
lr = LinearRegression()
lr.fit(X_train_scaled, y_train)
y_pred = lr.predict(X_test_scaled)
print(f"R²: {r2_score(y_test, y_pred):.4f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.4f}")
# Inspect the coefficients, sorted by absolute value
feature_names = housing.feature_names
for name, coef in sorted(zip(feature_names, lr.coef_), key=lambda x: abs(x[1]), reverse=True):
    print(f" {name}: {coef:.4f}")
# Regularized variants (to reduce overfitting); fit and evaluate them the same way as lr
ridge = Ridge(alpha=1.0)  # L2 regularization: shrinks all coefficients
lasso = Lasso(alpha=0.1)  # L1 regularization: drives some coefficients to 0 (feature selection)
elastic = ElasticNet(alpha=0.1, l1_ratio=0.5)  # a mix of the two
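The regularized estimators above are only constructed, not fitted. To make the L1-versus-L2 difference concrete, here is a minimal sketch that fits all four linear models and counts how many coefficients end up exactly zero. It uses synthetic data from make_regression rather than the housing data, and alpha=1.0 rather than the chapter's 0.1; both are assumptions chosen so the effect is easy to see.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge

# Synthetic data: 20 features, only 5 of which actually drive the target
X, y = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0
)

models = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=1.0),
    "ElasticNet": ElasticNet(alpha=1.0, l1_ratio=0.5),
}

zero_counts = {}
for name, model in models.items():
    model.fit(X, y)
    # Count coefficients that were driven exactly to zero
    zero_counts[name] = int(np.sum(model.coef_ == 0))
    max_abs = np.abs(model.coef_).max()
    print(f"{name:>16}: {zero_counts[name]:2d} zero coefficients, max |coef| = {max_abs:.2f}")
```

Ridge shrinks coefficients toward zero without eliminating them, while Lasso removes features outright, which is why it doubles as a feature selector.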
Logistic regression (classification)
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
lr = LogisticRegression(max_iter=1000, C=1.0)  # C is the inverse of regularization strength
lr.fit(X_train_s, y_train)
# Predict probabilities, not just class labels
proba = lr.predict_proba(X_test_s)
print("Predicted probabilities (first 5):", proba[:5].round(3))
print(f"Accuracy: {lr.score(X_test_s, y_test):.4f}")
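Predicted probabilities are useful because you can choose your own decision threshold instead of accepting the default. The sketch below shows that predict() matches a 0.5 cutoff on the class-1 probability, and what a stricter cutoff does; the 0.8 threshold is a hypothetical value picked for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

clf = LogisticRegression(max_iter=1000, C=1.0).fit(X_train_s, y_train)

proba = clf.predict_proba(X_test_s)  # shape (n_samples, 2); columns follow clf.classes_
p_class1 = proba[:, 1]               # probability of class 1 (benign in this dataset)

default_pred = clf.predict(X_test_s)
manual_pred = (p_class1 > 0.5).astype(int)  # reproduces the default decision rule
strict_pred = (p_class1 > 0.8).astype(int)  # hypothetical stricter threshold

print("Rows sum to 1:", np.allclose(proba.sum(axis=1), 1.0))
print("Matches predict():", np.array_equal(default_pred, manual_pred))
print("Class-1 predictions at 0.5 vs 0.8:", manual_pred.sum(), strict_pred.sum())
```

Raising the threshold trades recall on class 1 for precision; which direction matters depends on the cost of each kind of error.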
3. Tree Models
Decision trees
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)
# max_depth limits the depth of the tree to reduce overfitting
dt = DecisionTreeClassifier(max_depth=4, random_state=42)
dt.fit(X_train, y_train)
print(f"Training accuracy: {dt.score(X_train, y_train):.4f}")
print(f"Test accuracy: {dt.score(X_test, y_test):.4f}")
# Print the learned decision rules as text
print("\nDecision rules:")
print(export_text(dt, feature_names=iris.feature_names))
# Feature importances, highest first
for name, importance in sorted(
    zip(iris.feature_names, dt.feature_importances_),
    key=lambda x: x[1], reverse=True
):
    print(f" {name}: {importance:.4f}")
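To see why max_depth matters, it helps to compare training and test accuracy at several depths. The following is a minimal sketch on the same iris split; the particular depth values and the depth_scores name are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

depth_scores = {}
for depth in [1, 2, 4, None]:  # None lets the tree grow until it stops splitting
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    train_acc = tree.score(X_train, y_train)
    test_acc = tree.score(X_test, y_test)
    depth_scores[depth] = (train_acc, test_acc)
    print(
        f"max_depth={str(depth):>4}: train={train_acc:.4f}, "
        f"test={test_acc:.4f}, actual depth={tree.get_depth()}"
    )
```

A growing gap between training and test accuracy as depth increases is the usual sign of overfitting. Iris is small and easy, so the gap may be modest here; noisier datasets tend to show it more clearly.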
Random forests
A random forest is an ensemble of many decision trees and is often a strong baseline for ML tasks on tabular data:
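from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
# Uses the iris split (X_train, X_test, y_train, y_test) from the decision tree example
rf = RandomForestClassifier(
    n_estimators=100,  # number of trees
    max_depth=None,  # maximum depth of each tree (None = grow fully)
    min_samples_split=2,  # minimum number of samples required to split a node
    max_features="sqrt",  # features considered at each split (sqrt is the usual choice for classification)
    n_jobs=-1,  # use all CPU cores in parallel
    random_state=42
)
rf.fit(X_train, y_train)
print(f"Random forest accuracy: {rf.score(X_test, y_test):.4f}")
# Out-of-bag (OOB) score: an estimate of generalization that needs no separate test set
rf_oob = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
rf_oob.fit(X_train, y_train)
print(f"OOB score: {rf_oob.oob_score_:.4f}")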
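A fitted forest also exposes feature importances, averaged over its trees, and the individual trees are available through estimators_. The sketch below prints the OOB score next to the test score and reports each importance with its spread across trees; the variable names are assumptions for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
forest.fit(X_train, y_train)

oob = forest.oob_score_
test_acc = forest.score(X_test, y_test)
print(f"OOB score: {oob:.4f}   test accuracy: {test_acc:.4f}")

# Mean importance per feature, plus the standard deviation across individual trees
importances = forest.feature_importances_
spread = np.std([t.feature_importances_ for t in forest.estimators_], axis=0)
for i in np.argsort(importances)[::-1]:
    print(f" {iris.feature_names[i]}: {importances[i]:.4f} (std across trees {spread[i]:.4f})")
```

A large spread means the trees disagree about how useful a feature is, so treat its rank with some caution.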
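4. Gradient Boosting: XGBoost and LightGBM
Gradient boosting is a frequent winner in Kaggle competitions and often outperforms random forests on tabular data: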
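# pip install xgboost lightgbm
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Binary classification data (eval_metric="logloss" below is a binary metric)
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)
# XGBoost
xgb = XGBClassifier(
    n_estimators=200,
    max_depth=6,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    eval_metric="logloss",
    random_state=42,
    verbosity=0
)
# eval_set is only monitored here; for early stopping or tuning,
# pass a separate validation split rather than the test set
xgb.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    verbose=False
)
print(f"XGBoost: {accuracy_score(y_test, xgb.predict(X_test)):.4f}")
# LightGBM (often faster than XGBoost, a common choice for large datasets)
lgbm = LGBMClassifier(
    n_estimators=200,
    max_depth=6,
    learning_rate=0.1,
    random_state=42,
    verbose=-1
)
lgbm.fit(X_train, y_train)
print(f"LightGBM: {lgbm.score(X_test, y_test):.4f}")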
5. Quickly Comparing Several Algorithms
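from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
import pandas as pd
# Reload the breast cancer data so features and labels come from the same split
# (earlier examples reuse the names X_train and y_train for other datasets)
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)
models = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Decision tree": DecisionTreeClassifier(max_depth=5),
    "Random forest": RandomForestClassifier(n_estimators=100, n_jobs=-1),
    "Gradient boosting": GradientBoostingClassifier(n_estimators=100),
    "SVM": SVC(kernel="rbf", C=1.0),
}
results = []
for name, model in models.items():
    # Scale inside a pipeline so each fold fits its own scaler (no leakage across folds)
    pipe = make_pipeline(StandardScaler(), model)
    # 5-fold cross-validation
    scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring="accuracy", n_jobs=-1)
    results.append({
        "model": name,
        "mean": scores.mean(),
        "std": scores.std(),
    })
df_results = pd.DataFrame(results).sort_values("mean", ascending=False)
print(df_results.to_string(index=False))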
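Accuracy alone can hide differences between models, so it is often worth collecting several metrics in one pass. A minimal sketch with cross_validate, which accepts a list of scorer names; the choice of metrics and of a shuffled StratifiedKFold is an assumption made for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scaling lives inside the pipeline, so it is refit on each training fold
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

metrics = ["accuracy", "roc_auc", "f1"]
cv_results = cross_validate(pipe, X, y, cv=cv, scoring=metrics)

for metric in metrics:
    scores = cv_results[f"test_{metric}"]  # one score per fold
    print(f"{metric:>8}: {scores.mean():.4f} ± {scores.std():.4f}")
```

The same call works for any estimator in the comparison above: swap the final pipeline step and rerun.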
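Chapter Summary
- The core value of sklearn is its unified interface (fit/predict/score): switching algorithms costs almost nothing.
- Linear models are a good starting point: they are interpretable, fast to train, and well suited to high-dimensional sparse data.
- Random forests are a strong baseline for most tabular-data tasks: averaging many trees makes them robust to noise.
- XGBoost and LightGBM are common first choices in competitions and in industry: gradient boosting plus regularization, with high accuracy.
- Evaluate models with cross-validation (cross_val_score); it is more reliable than a single train/test split.
Key action item: Today, run the "Quickly Comparing Several Algorithms" code on a real dataset (for example, Kaggle's Titanic) and compare the cross-validation scores of the five algorithms. Don't fixate on any single algorithm; first build an intuition for which kinds of algorithms do well on which kinds of problems.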
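Prompt Templates for This Chapter
Algorithm selection advisor
I have a machine learning problem. Please recommend the most suitable algorithm:
Problem type: [classification / regression / clustering]
Data size: [rows × columns]
Feature types: [numeric / categorical / text / image / mixed]
Target metric: [accuracy / AUC / RMSE / latency, etc.]
Special constraints: [interpretability requirements / training-time limits / deployment limits, etc.]
Please recommend:
1. A first-choice algorithm (with reasons)
2. Alternative algorithms (2-3)
3. Algorithms that do not suit this scenario (with reasons)
4. Suggestions for key hyperparameters
→ Continue reading: Chapter 04, Model Evaluation: Has Your Model Really Learned Anything?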