Label Leakage

The inadvertent inclusion of target (label) information in training features, which can inflate validation metrics and conceal true model generalization issues.

Definition

Label leakage occurs when features inadvertently encode the target label (e.g., a timestamp or ID that correlates perfectly with the outcome), causing the model to “cheat” rather than learn real patterns. The result is overly optimistic validation scores followed by failure in production, where the leaky signal is unavailable. Governance demands rigorous feature-label correlation analysis, hold-out test sets drawn from different timeframes, and pipeline checks to prevent leakage during feature engineering.
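The correlation analysis mentioned above can be sketched as a simple screening pass. This is a minimal illustration, not a production check: the function name, threshold, and toy dataset (including the leaky `has_cancellation_reason` flag) are all hypothetical.

```python
import numpy as np
import pandas as pd

def flag_leaky_features(df: pd.DataFrame, label: str, threshold: float = 0.95) -> list:
    """Return feature names whose absolute correlation with the label
    exceeds the threshold -- a common symptom of label leakage."""
    corr = df.corr(numeric_only=True)[label].drop(label)
    return corr[corr.abs() > threshold].index.tolist()

# Toy data: one genuine feature, one leaky feature that mirrors the label.
rng = np.random.default_rng(0)
churn = rng.integers(0, 2, size=200)
df = pd.DataFrame({
    "tenure_months": rng.integers(1, 60, size=200),  # genuine pre-outcome feature
    "has_cancellation_reason": churn,                # leaky: set only after churn
    "churn": churn,
})
print(flag_leaky_features(df, "churn"))  # → ['has_cancellation_reason']
```

A near-perfect correlation does not prove leakage on its own, but any flagged feature warrants a manual check of when in the customer lifecycle it becomes available.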

Real-World Example

A churn-prediction dataset includes a “cancellation_reason” column that is populated only after the customer has left, so it predicts churn perfectly. After discovering this leakage, the team removes the column, retrains using only pre-cancellation features, and validates performance on an entirely new cohort—revealing the model's true predictive power.
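The cleanup described above can be sketched as two steps: drop post-outcome columns, then hold out a later cohort for validation. The column names, cutoff date, and toy rows below are illustrative assumptions, not the team's actual schema.

```python
import pandas as pd

# Assumed leaky fields: populated only after the outcome is known.
POST_OUTCOME_COLUMNS = ["cancellation_reason"]

def build_training_frames(df: pd.DataFrame, cutoff: str):
    """Remove leaky columns, then split by signup cohort so that
    validation customers come from an entirely later timeframe."""
    clean = df.drop(columns=POST_OUTCOME_COLUMNS)
    train = clean[clean["signup_date"] < cutoff]
    valid = clean[clean["signup_date"] >= cutoff]
    return train, valid

# Toy dataset (ISO date strings compare correctly as text).
df = pd.DataFrame({
    "signup_date": ["2023-01-05", "2023-02-10", "2023-06-01", "2023-07-15"],
    "tenure_months": [12, 3, 1, 2],
    "cancellation_reason": [None, "price", None, "support"],
    "churn": [0, 1, 0, 1],
})
train, valid = build_training_frames(df, cutoff="2023-06-01")
print(list(train.columns))  # → ['signup_date', 'tenure_months', 'churn']
```

Splitting by signup date rather than at random is what makes the validation cohort “entirely new”: no customer appears on both sides of the cutoff, so the leaky column cannot sneak back in through a shared record.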