Feature Engineering in Practice: A Systematic Approach to Building High-Quality Features from Raw Data

2026/6/29 14:55:58

1. The Hidden Value of Feature Engineering: What Sets the Model's Ceiling

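In the deep learning era, the importance of feature engineering is often underestimated. The claim that "end-to-end learning makes feature engineering obsolete" is a dangerous misconception. For tabular data, time series, and recommender systems, feature engineering remains a key factor in setting the ceiling on model performance. A single well-designed feature can yield a larger gain than a change of model architecture.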

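According to one set of statistics on Kaggle competitions, the top 10 teams in structured-data competitions use broadly similar model architectures (mainly LightGBM and XGBoost), and it is the differences in feature engineering that decide the final ranking. Winning teams typically use 3 to 5 times as many features as the baseline, including many cross features and statistical features built from domain knowledge.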

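Feature engineering has three core pain points: feature leakage, which inflates offline evaluation and then causes a sharp drop online; feature dimensionality explosion, which leads to overfitting and high compute cost; and feature distribution drift, which makes historical features stop working on new data. These problems are especially pronounced in industrial settings, because production data distributions are far more complex than competition data.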

2. The Feature Engineering Workflow: A Closed Loop from Data Understanding to Feature Validation

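High-quality feature engineering is not random trial and error. It is systematic work with a defined process: from data understanding through feature validation, every step has an explicit goal and an explicit acceptance criterion.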

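flowchart TB
    A["Data understanding and exploration"] --> B["Feature construction"]
    B --> C["Feature encoding and transformation"]
    C --> D["Feature selection and dimensionality reduction"]
    D --> E["Feature validation and leakage detection"]

    subgraph construct["Feature construction strategies"]
        B1["Numeric features: aggregation / binning / crosses"]
        B2["Categorical features: target encoding / frequency encoding"]
        B3["Time features: periodicity / trend / lag"]
        B4["Text features: TF-IDF / embeddings / statistics"]
    end

    subgraph validate["Feature validation"]
        E1["Feature importance ranking"]
        E2["Feature stability metrics<br/>PSI / CSI"]
        E3["Leakage detection<br/>time-travel check"]
        E4["Ablation experiments<br/>remove features one at a time"]
    end

    B --> B1 & B2 & B3 & B4
    E --> E1 & E2 & E3 & E4
    E -.->|iterate| B

    style B fill:#4ecdc4,color:#fff
    style E fill:#ff6b6b,color:#fff
    style E3 fill:#ffe66d,color:#333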

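Feature leakage is the most damaging problem in feature engineering. It occurs when a feature carries information about the target that would not be available at prediction time, so the model gets to "peek" at the answer during training. Common leakage scenarios include: statistical features computed over data that includes the future (for example, mean-encoding a categorical feature with the mean over the full dataset); time travel (building a feature for time T from data at T+1); and leakage during preprocessing (for example, standardizing on the full dataset before splitting it into training and test sets).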
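The preprocessing case can be made concrete with a small sketch. It assumes a single numeric feature and a time-ordered split; the helper names (`time_based_split`, `fit_standardizer`, `apply_standardizer`) and the toy data are hypothetical and exist only for illustration. The point is that the scaling statistics come from the training rows alone and are then reused unchanged on the later rows.

```python
import numpy as np


def time_based_split(timestamps, cutoff):
    """Return boolean masks: rows strictly before the cutoff, and the rest."""
    timestamps = np.asarray(timestamps)
    train_mask = timestamps < cutoff
    return train_mask, ~train_mask


def fit_standardizer(train_values):
    """Estimate mean and (population) std from training rows only."""
    mean = float(np.mean(train_values))
    std = float(np.std(train_values))
    # Guard against a constant column, where std would be zero.
    return mean, (std if std > 0 else 1.0)


def apply_standardizer(values, mean, std):
    """Scale any rows with statistics that were fitted elsewhere."""
    return (np.asarray(values, dtype=float) - mean) / std


# Hypothetical toy data: a day index and a feature that drifts upward over time.
days = np.arange(10)
feature = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])

train_mask, test_mask = time_based_split(days, cutoff=7)

# Leak-free: statistics from days 0..6 only.
mean, std = fit_standardizer(feature[train_mask])
train_scaled = apply_standardizer(feature[train_mask], mean, std)
test_scaled = apply_standardizer(feature[test_mask], mean, std)

# Leaky variant, for contrast: the mean already "knows" about days 7..9.
leaky_mean = float(feature.mean())
```

With this toy data the training-only mean is 4.0 while the full-data mean is 5.5, so the leaky variant shifts every training row using values from the future.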

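Feature stability is a dimension that industrial settings cannot ignore. A feature that performs well on the training set can lose its predictive power quickly if its distribution drifts in production data. The Population Stability Index (PSI) is a common tool for measuring changes in a feature's distribution; PSI > 0.2 usually indicates a significant shift.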
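For reference, the usual binned form of PSI can be written as follows. The symbols are notation introduced here, not taken from the text above: $B$ is the number of bins, $e_i$ is the share of the baseline (expected) sample falling in bin $i$, and $a_i$ is the share of the new (actual) sample in the same bin.

```latex
\mathrm{PSI} = \sum_{i=1}^{B} \left(a_i - e_i\right) \ln\frac{a_i}{e_i}
```

As a small worked case with two bins, $e = (0.5, 0.5)$ and $a = (0.6, 0.4)$ give $0.1 \ln 1.2 + (-0.1) \ln 0.8 \approx 0.0182 + 0.0223 \approx 0.041$, well below the 0.2 threshold. Every term is non-negative, since $a_i - e_i$ and $\ln(a_i / e_i)$ always share the same sign, so PSI is zero only when the two binned distributions match.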

3. A Production-Grade Feature Engineering Approach and Its Implementation

3.1 Feature Construction: Numeric, Categorical, and Time Features

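import numpy as np
import pandas as pd
from typing import Dict, List, Optional
from sklearn.model_selection import KFold


class FeatureEngineer:
    """Feature engineering toolkit covering numeric, categorical, and time features."""

    def __init__(self):
        self.encoding_maps: Dict[str, dict] = {}
        self.bin_edges: Dict[str, np.ndarray] = {}

    # ---- Numeric features ----
    def create_statistical_features(
        self,
        df: pd.DataFrame,
        group_cols: List[str],
        value_col: str,
    ) -> pd.DataFrame:
        """Aggregate statistics of value_col within each group.

        Useful for user behavior aggregates, item statistics, regional metrics, etc.
        Note: the aggregation granularity must match the prediction granularity,
        and the rows being aggregated must not include future data, to avoid leakage.
        """
        agg_stats = df.groupby(group_cols)[value_col].agg(
            mean="mean",
            std="std",
            median="median",
            min="min",
            max="max",
            skew="skew",
            count="count",
        ).reset_index()
        # Naming convention: <original column>_<statistic>
        agg_stats.columns = (
            group_cols
            + [f"{value_col}_{stat}" for stat in agg_stats.columns[len(group_cols):]]
        )
        return agg_stats

    def create_interaction_features(
        self,
        df: pd.DataFrame,
        col_a: str,
        col_b: str,
        operations: Optional[List[str]] = None,
    ) -> pd.DataFrame:
        """Cross features: arithmetic interactions between two numeric columns.

        Supported operations: multiply, divide (ratio), subtract (difference).
        Useful for price vs. sales volume, duration vs. frequency ratios, etc.
        """
        if operations is None:
            operations = ["multiply", "divide", "subtract"]
        result = df.copy()
        if "multiply" in operations:
            result[f"{col_a}_x_{col_b}"] = df[col_a] * df[col_b]
        if "divide" in operations:
            # Add a small epsilon to avoid division by zero
            # (assumes col_b is non-negative; a value of -1e-8 would still divide by zero)
            result[f"{col_a}_div_{col_b}"] = df[col_a] / (df[col_b] + 1e-8)
        if "subtract" in operations:
            result[f"{col_a}_minus_{col_b}"] = df[col_a] - df[col_b]
        return result

    # ---- Categorical features ----
    def target_encode_kfold(
        self,
        df: pd.DataFrame,
        col: str,
        target: str,
        n_folds: int = 5,
        smoothing: float = 10.0,
    ) -> pd.Series:
        """K-fold target encoding: the standard way to avoid leaking the target.

        Core idea: encode each validation fold with target means computed on
        the other (training) folds only.
        The smoothing parameter sets the weight of the prior mean:
            encoding = (count * mean + smoothing * prior_mean) / (count + smoothing)
        The larger the smoothing, the more low-frequency categories shrink
        toward the prior mean, which guards against overfitting.
        """
        global_mean = df[target].mean()
        encoded = pd.Series(np.nan, index=df.index, dtype=float)
        kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)

        for train_idx, val_idx in kf.split(df):
            train_fold = df.iloc[train_idx]
            val_fold = df.iloc[val_idx]
            # Use the training-fold mean as the prior so that the validation
            # fold's own targets never enter its encoding
            fold_prior = train_fold[target].mean()
            # Per-category target mean and count on the training folds
            stats = train_fold.groupby(col)[target].agg(["mean", "count"])
            # Smoothing: low-frequency categories shrink toward the prior
            smoothed_mean = (
                stats["count"] * stats["mean"] + smoothing * fold_prior
            ) / (stats["count"] + smoothing)
            # Map onto the validation fold; unseen categories fall back to the prior
            fold_encoded = val_fold[col].map(smoothed_mean).astype(float).fillna(fold_prior)
            # Assign by position to avoid any index-alignment surprises
            encoded.iloc[val_idx] = fold_encoded.to_numpy()

        # Save the full-training-set mapping for encoding the test set
        full_stats = df.groupby(col)[target].agg(["mean", "count"])
        self.encoding_maps[col] = (
            (full_stats["count"] * full_stats["mean"] + smoothing * global_mean)
            / (full_stats["count"] + smoothing)
        ).to_dict()
        return encoded

    # ---- Time features ----
    def create_time_features(
        self,
        df: pd.DataFrame,
        time_col: str,
    ) -> pd.DataFrame:
        """Extract time features: basic components and cyclical encodings.

        Time features aim to capture periodicity and trend:
        - Periodicity: sin/cos encoding of hour, day of week, month
        - Trend: time elapsed since a reference point
        - Interval: time since the previous event
        This method covers the basic components plus sin/cos encoding for
        hour and day of week.
        """
        result = df.copy()
        dt = pd.to_datetime(df[time_col])

        # Basic time components
        result[f"{time_col}_hour"] = dt.dt.hour
        result[f"{time_col}_dayofweek"] = dt.dt.dayofweek
        result[f"{time_col}_month"] = dt.dt.month
        result[f"{time_col}_is_weekend"] = (dt.dt.dayofweek >= 5).astype(int)

        # Cyclical encoding: sin/cos keeps the cycle continuous.
        # For example, hour=23 and hour=0 are 23 apart in the raw encoding,
        # but on the sin/cos unit circle they are adjacent points, exactly
        # as far apart as hour=0 and hour=1.
        result[f"{time_col}_hour_sin"] = np.sin(2 * np.pi * dt.dt.hour / 24)
        result[f"{time_col}_hour_cos"] = np.cos(2 * np.pi * dt.dt.hour / 24)
        result[f"{time_col}_dow_sin"] = np.sin(2 * np.pi * dt.dt.dayofweek / 7)
        result[f"{time_col}_dow_cos"] = np.cos(2 * np.pi * dt.dt.dayofweek / 7)
        return result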

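The target encoder above stores a per-category mapping in `encoding_maps` for later use on the test set, but the step that applies it is not shown. A minimal sketch of that step might look like the following. The function names, the example mapping, and the `train_global_mean` value are hypothetical; the only assumption about the mapping is that it is a plain dict of category to encoded value, which is the shape `to_dict()` produces.

```python
import pandas as pd


def smoothed_mean(count: int, cat_mean: float, prior: float, smoothing: float) -> float:
    """Smoothing formula from the encoder: shrink a category mean toward the prior."""
    return (count * cat_mean + smoothing * prior) / (count + smoothing)


def apply_target_encoding(values: pd.Series, mapping: dict, default: float) -> pd.Series:
    """Encode new data with a mapping learned on training data only.

    Categories missing from the mapping map to NaN and then fall back to
    `default` (typically the training-set global target mean).
    """
    return values.map(mapping).astype(float).fillna(default)


# Hypothetical mapping, in the shape stored under encoding_maps[col].
city_mapping = {"beijing": 0.62, "shanghai": 0.48}
train_global_mean = 0.55

test_cities = pd.Series(["beijing", "shenzhen", "shanghai"])
encoded = apply_target_encoding(test_cities, city_mapping, train_global_mean)

# A category seen only twice, both positive, is pulled most of the way back
# to the prior of 0.5 when smoothing is 10: (2 * 1.0 + 10 * 0.5) / 12.
rare_category_value = smoothed_mean(count=2, cat_mean=1.0, prior=0.5, smoothing=10.0)
```

Here "shenzhen" never appeared in training, so it receives the fallback value rather than a missing value that a downstream model would have to handle.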
3.2 Feature Selection and Leakage Detection

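from typing import List

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score


class FeatureValidator:
    """Feature validation tools: importance ranking, stability checks, leakage checks."""

    @staticmethod
    def feature_importance_ranking(
        X: pd.DataFrame,
        y: pd.Series,
        top_k: int = 20,
    ) -> pd.DataFrame:
        """Rank features by random-forest importance.

        Strength: captures non-linear relationships and feature interactions.
        Limitation: impurity-based importance favors high-cardinality features,
        so it should be combined with other methods.
        """
        rf = RandomForestClassifier(
            n_estimators=100,
            max_depth=10,
            random_state=42,
            n_jobs=-1,
        )
        rf.fit(X, y)
        importance = pd.DataFrame({
            "feature": X.columns,
            "importance": rf.feature_importances_,
        }).sort_values("importance", ascending=False)
        return importance.head(top_k)

    @staticmethod
    def calculate_psi(
        expected: np.ndarray,
        actual: np.ndarray,
        n_bins: int = 10,
    ) -> float:
        """Population Stability Index (PSI): measures change in a feature's distribution.

        PSI < 0.1: distribution is stable
        0.1 <= PSI < 0.2: slight change, worth monitoring
        PSI >= 0.2: significant change, the feature may no longer be valid
        """
        # Equal-frequency bins, with edges taken from the expected (baseline) sample
        breakpoints = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
        breakpoints[0] = -np.inf
        breakpoints[-1] = np.inf

        expected_counts = np.histogram(expected, bins=breakpoints)[0]
        actual_counts = np.histogram(actual, bins=breakpoints)[0]

        # Convert to proportions; add-one smoothing keeps every bin non-empty,
        # which avoids log(0) and division by zero
        expected_pct = (expected_counts + 1) / (expected_counts.sum() + n_bins)
        actual_pct = (actual_counts + 1) / (actual_counts.sum() + n_bins)

        psi = np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct))
        return float(psi)

    @staticmethod
    def detect_leakage(
        X: pd.DataFrame,
        y: pd.Series,
        threshold: float = 0.95,
    ) -> List[str]:
        """Leakage detection: flag features that separate the target suspiciously well.

        A single-feature AUC above 0.95 (or below 0.05) usually signals leakage,
        but it still needs human judgment: some strong signals are legitimate
        (for example, medical indicators).
        """
        leakage_features = []
        for col in X.columns:
            # Only numeric columns can be scored directly as a ranking
            if not pd.api.types.is_numeric_dtype(X[col]):
                continue
            try:
                # Missing values are filled with 0 purely to make the score computable
                auc = roc_auc_score(y, X[col].fillna(0))
                if auc > threshold or auc < (1 - threshold):
                    leakage_features.append(f"{col} (AUC={auc:.4f})")
            except ValueError:
                # e.g. y has a single class or is not binary; skip this column
                pass
        return leakage_features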
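To see how the PSI thresholds behave, the sketch below applies the same formula to synthetic data. It is a standalone re-implementation for illustration, not the `FeatureValidator` method itself: it bins with `np.searchsorted` and `np.bincount` instead of `np.histogram`, and the function name `psi`, the random seed, and the sample sizes are all assumptions chosen for the demo.

```python
import numpy as np


def psi(expected, actual, n_bins=10):
    """PSI with equal-frequency bins fitted on `expected` and add-one smoothing."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Interior quantile edges; the outermost bins are open-ended.
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))[1:-1]
    e_counts = np.bincount(np.searchsorted(edges, expected, side="right"), minlength=n_bins)
    a_counts = np.bincount(np.searchsorted(edges, actual, side="right"), minlength=n_bins)
    e_pct = (e_counts + 1) / (e_counts.sum() + n_bins)
    a_pct = (a_counts + 1) / (a_counts.sum() + n_bins)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))


rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, 10_000)  # e.g. the training-period feature values
same = rng.normal(0.0, 1.0, 10_000)      # fresh sample from the same distribution
shifted = rng.normal(1.0, 1.0, 10_000)   # mean shifted by one standard deviation

psi_same = psi(baseline, same)
psi_shifted = psi(baseline, shifted)

print(f"same distribution:    PSI = {psi_same:.4f}")
print(f"shifted distribution: PSI = {psi_shifted:.4f}")
```

A fresh sample from the same distribution lands below the 0.1 "stable" band, while a one-standard-deviation mean shift lands well above 0.2. In practice the baseline would be the training window and the comparison sample a recent production window.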