Handling Missing Data

Techniques such as deletion, imputation, and explicit modeling of missingness that address gaps in a dataset so that models stay accurate and fair.

Definition

Missing values can bias a model or reduce its accuracy. Governance covers the choice among three strategies: deletion (removing incomplete records), imputation (filling gaps with the mean, the median, or a model-based estimate), and explicit missing-indicator features that flag which values were absent. Each choice must be documented, its impact on downstream fairness evaluated, and the pipeline configured to handle missing values in production the same way it did in training.
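As an illustration only, the three strategies can be sketched in plain Python. The records, column names, and helper functions below are hypothetical and not part of any particular library; the sketch assumes missing values are represented as None and uses the median as the fill value.

```python
from statistics import median

# Hypothetical toy records; None marks a missing value.
records = [
    {"age": 34, "income": 52000.0},
    {"age": 45, "income": None},
    {"age": 29, "income": 38000.0},
    {"age": 51, "income": None},
    {"age": 40, "income": 61000.0},
]


def delete_incomplete(rows, column):
    """Deletion: keep only records whose value in `column` is present."""
    return [dict(r) for r in rows if r[column] is not None]


def fit_median(rows, column):
    """Learn the fill value once, on training data, so production can reuse it."""
    return median(r[column] for r in rows if r[column] is not None)


def impute_with_indicator(rows, column, fill_value):
    """Imputation plus a missing-indicator feature; the input rows are not mutated."""
    out = []
    for r in rows:
        missing = r[column] is None
        new = dict(r)
        new[f"was_{column}_missing"] = int(missing)
        new[column] = fill_value if missing else r[column]
        out.append(new)
    return out


deleted = delete_incomplete(records, "income")
fill = fit_median(records, "income")
imputed = impute_with_indicator(records, "income", fill)
```

Fitting the fill value separately from applying it is one way to keep training and production consistent: the stored value, not a value recomputed from live data, is what gets applied at inference time.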

Real-World Example

In a credit-risk dataset, 15% of income values are missing. The team compares mean imputation, KNN imputation, and a predictive imputation model. They choose KNN imputation because it has the lowest RMSE, add a binary “was_income_missing” feature, and validate that the imputation does not skew approval rates for disadvantaged groups.
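A comparison like this one could be run roughly as follows. This is a minimal sketch under stated assumptions, not the team's actual procedure: it assumes RMSE is measured by hiding incomes that are known and scoring each imputer against them, it uses a single hypothetical predictor (years_employed) for the nearest-neighbor search, and it covers only mean and KNN imputation. All data values, group labels, and function names are made up for illustration.

```python
import math

# Hypothetical rows: (years_employed, income). Every income here is observed,
# so some can be hidden on purpose and each imputer scored against the truth.
rows = [
    (1, 24000.0), (2, 25500.0), (3, 29000.0), (4, 33000.0), (5, 34500.0),
    (6, 38000.0), (7, 40500.0), (8, 45000.0), (9, 46000.0), (10, 51000.0),
]
masked_idx = {2, 7}  # positions whose income is hidden for the evaluation


def mean_impute(train, _query_years):
    """Mean imputation: ignore the query and return the mean observed income."""
    return sum(income for _, income in train) / len(train)


def knn_impute(train, query_years, k=2):
    """KNN imputation: average the incomes of the k rows closest in years_employed."""
    nearest = sorted(train, key=lambda r: abs(r[0] - query_years))[:k]
    return sum(income for _, income in nearest) / len(nearest)


def masked_rmse(imputer, rows, masked_idx):
    """RMSE of an imputer on the deliberately hidden values."""
    train = [r for i, r in enumerate(rows) if i not in masked_idx]
    squared_errors = [
        (imputer(train, rows[i][0]) - rows[i][1]) ** 2 for i in sorted(masked_idx)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def approval_rates(decisions):
    """Map each group to its approval rate, given (group, approved) pairs."""
    totals, approved = {}, {}
    for group, ok in decisions:
        totals[group] = totals.get(group, 0) + 1
        approved[group] = approved.get(group, 0) + int(ok)
    return {g: approved[g] / totals[g] for g in totals}


mean_rmse = masked_rmse(mean_impute, rows, masked_idx)
knn_rmse = masked_rmse(knn_impute, rows, masked_idx)

# Hypothetical model decisions after imputation, tagged with a group label.
rates = approval_rates([("A", True), ("A", False), ("B", True), ("B", True)])
```

Comparing approval_rates across the candidate imputers, and against a baseline without imputation, is one simple way to check whether the chosen method shifts outcomes for a particular group.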