Operational Resilience

The ability of AI systems and their supporting infrastructure to anticipate, withstand, recover from, and adapt to disruptions or adverse events.

Definition

Involves redundancy (failover servers), disaster-recovery plans (backups, warm-standby clusters), autoscaling to handle traffic spikes, and chaos-engineering drills to test system robustness. Governance requires defining resilience SLAs (RTO, RPO), conducting regular drills, and embedding resilience requirements into system design and procurement.

Real-World Example

A healthcare AI for critical-care monitoring runs in an active-active cloud deployment across two regions. If one region fails, traffic seamlessly shifts to the other. Quarterly chaos tests simulate region outages, verifying automatic failover within the one-minute SLA—ensuring uninterrupted patient monitoring.