Guardrails

Predefined constraints or checks, both technical and policy-based, embedded in AI systems to prevent unsafe or non-compliant behavior at runtime.

Definition

Defensive layers, both algorithmic (confidence thresholds, input sanitization, adversarial detectors) and procedural (approval gates, human-in-the-loop review), that enforce policy boundaries. Guardrails constrain what a system may output (e.g., no hate speech), restrict the domains in which it may decide autonomously, and trigger fail-safe actions when a check fails. Effective guardrail governance requires reviewing and updating constraints as new threats emerge and monitoring for bypass attempts.
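
As a minimal sketch, the algorithmic layers can be composed as an ordered chain of checks in which the first failing check determines the fail-safe action. All names here (`sanitize_input`, `policy_filter`, `run_guardrails`, the banned-marker lists) are illustrative assumptions, not a real library's API:

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    ALLOW = "allow"
    BLOCK = "block"        # fail-safe: refuse the output outright
    ESCALATE = "escalate"  # fail-safe: route to a human reviewer


@dataclass
class ModelOutput:
    text: str
    confidence: float  # model's self-reported confidence, 0.0-1.0


def sanitize_input(prompt: str) -> Verdict:
    """Reject prompts containing known injection patterns (toy blocklist)."""
    banned_markers = ("ignore previous instructions", "system override")
    if any(m in prompt.lower() for m in banned_markers):
        return Verdict.BLOCK
    return Verdict.ALLOW


def policy_filter(output: ModelOutput) -> Verdict:
    """Block outputs violating content policy (stand-in for a classifier)."""
    blocked_terms = ("hate_speech_marker",)
    if any(t in output.text for t in blocked_terms):
        return Verdict.BLOCK
    return Verdict.ALLOW


def confidence_threshold(output: ModelOutput, minimum: float = 0.7) -> Verdict:
    """Escalate low-confidence outputs to human review."""
    return Verdict.ALLOW if output.confidence >= minimum else Verdict.ESCALATE


def run_guardrails(prompt: str, output: ModelOutput) -> Verdict:
    """Run checks in order; the first non-ALLOW verdict short-circuits."""
    checks = (
        lambda: sanitize_input(prompt),
        lambda: policy_filter(output),
        lambda: confidence_threshold(output),
    )
    for check in checks:
        verdict = check()
        if verdict is not Verdict.ALLOW:
            return verdict
    return Verdict.ALLOW
```

Ordering the checks from cheap and deterministic (sanitization, blocklists) to probabilistic (confidence) means clear violations are caught before any confidence score is consulted, so a hard block never depends on model uncertainty.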

Real-World Example

A content-moderation AI has guardrails that block profanity and extremist content. When the NLP filter’s confidence falls below 70%, the system routes the post to a human moderator instead of publishing it automatically, ensuring that unsafe content never reaches the feed without oversight.
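
A sketch of that routing rule, assuming a hypothetical `classify` stub in place of the real NLP filter (the labels and thresholds below are illustrative):

```python
def classify(post_text: str) -> tuple[str, float]:
    """Stand-in for the real NLP filter: returns (label, confidence)."""
    text = post_text.lower()
    if "extremist-marker" in text:
        return "extremist", 0.95   # clear violation
    if "maybe-bad" in text:
        return "profanity", 0.55   # ambiguous case
    return "safe", 0.90


CONFIDENCE_THRESHOLD = 0.70  # below this, a human moderator must review


def moderate_post(post_text: str) -> str:
    """Apply the guardrail: block clear violations, escalate uncertainty."""
    label, confidence = classify(post_text)

    if label != "safe" and confidence >= CONFIDENCE_THRESHOLD:
        return "blocked"            # confident violation: hard block
    if confidence < CONFIDENCE_THRESHOLD:
        return "queued_for_human"   # uncertain: human-in-the-loop
    return "published"              # confident and safe: auto-publish
```

With these stubs, a clear violation is blocked outright, the ambiguous case lands in the human queue, and everything else publishes automatically, matching the routing behavior described above.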