Root Cause Analysis
A structured investigation to determine the underlying reasons for AI system failures or unexpected behaviors, guiding corrective actions.
Definition
A post-incident methodology—5 Whys, fishbone diagrams—that traces failure events (e.g., model misclassifications, system outages) back through data pipelines, model logic, and infrastructure layers to identify the primary defect. Root-cause analysis documents findings and remediation plans, ensures corrective actions address systemic issues, and prevents recurrence. Governance requires RCA for any high-severity incident, with executive review of outcomes.
Real-World Example
After a recommendation engine began suggesting inappropriate content, the team performed root-cause analysis: they found a data-ingestion script had merged test and production data, skewing training. They fixed the script, revalidated the model, and added unit tests to the ingestion pipeline—eliminating the risk of future cross-environment contamination.