Fault Tolerance
The ability of an AI system to continue operating correctly even when some components fail or produce errors.
Definition
Fault tolerance relies on architectural patterns (redundant components, graceful degradation, checkpointing, and transaction rollbacks) that keep AI services available and safe under partial failures. Governance frameworks prescribe fault-injection testing (chaos engineering), failure-mode analyses, and clear service-level objectives for recovery time and service continuity.
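Two of these patterns, redundancy and graceful degradation, can be illustrated with a minimal Python sketch. The names classify_with_fallback, broken_replica, and healthy_replica are hypothetical and exist only for illustration; a production system would add timeouts, logging, and alerting.

```python
from typing import Callable, Sequence, Tuple


def classify_with_fallback(
    image: bytes,
    replicas: Sequence[Callable[[bytes], str]],
    degraded_answer: str = "unknown",
) -> Tuple[str, bool]:
    """Try each redundant replica in order; degrade gracefully if all fail.

    Returns (label, degraded), where degraded is True when no replica answered.
    """
    for replica in replicas:
        try:
            return replica(image), False
        except Exception:
            # Skip the failed replica and try the next one.
            continue
    # Every replica failed: return a safe default instead of crashing.
    return degraded_answer, True


# Hypothetical replicas used only to exercise the sketch.
def broken_replica(image: bytes) -> str:
    raise ConnectionError("replica unavailable")


def healthy_replica(image: bytes) -> str:
    return "cat"


print(classify_with_fallback(b"img", [broken_replica, healthy_replica]))
print(classify_with_fallback(b"img", [broken_replica]))
```

The first call is answered by the second replica. The second call has no working replica, so it returns the degraded default and sets the flag, which lets the caller decide how to treat a low-confidence answer.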
Real-World Example
A cloud-based image-classification API runs on multiple container instances behind a load balancer. If one instance crashes during a high-volume event, the load balancer automatically shifts traffic to the healthy instances, and the crashed instance restarts without user impact. The ops team regularly performs chaos tests to validate these fault-tolerance mechanisms.
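The scenario above can be simulated in a few lines of Python. This is a toy in-process model, not a real orchestrator or load balancer: the classes Instance and LoadBalancer and the function chaos_test are hypothetical names chosen for illustration, and the fault is injected by flipping a flag rather than by killing a container.

```python
import random


class Instance:
    """Hypothetical stand-in for one container running the classifier."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.healthy = True

    def classify(self, image: bytes) -> str:
        if not self.healthy:
            raise RuntimeError(f"{self.name} is down")
        return "cat"


class LoadBalancer:
    """Round-robin balancer that skips instances that fail a request."""

    def __init__(self, instances) -> None:
        self.instances = list(instances)
        self._next = 0

    def handle(self, image: bytes) -> str:
        # Try each instance at most once per request.
        for _ in range(len(self.instances)):
            instance = self.instances[self._next % len(self.instances)]
            self._next += 1
            try:
                return instance.classify(image)
            except RuntimeError:
                continue
        raise RuntimeError("no healthy instances")


def chaos_test(balancer: LoadBalancer, requests: int = 100, seed: int = 0) -> int:
    """Take down one random instance, then count successfully served requests."""
    rng = random.Random(seed)
    victim = rng.choice(balancer.instances)
    victim.healthy = False  # inject the fault
    served = sum(1 for _ in range(requests) if balancer.handle(b"img") == "cat")
    victim.healthy = True  # simulate the restart
    return served


balancer = LoadBalancer([Instance("a"), Instance("b"), Instance("c")])
print(chaos_test(balancer))
```

With three instances and one taken down, every request is still served because the balancer retries on the remaining instances. If all instances are down, handle raises an error, which is the condition a recovery-time objective would be measured against.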