Guardrails
Guardrails are organisation-wide safety rules that validate every AI Colleague conversation — both what employees send and what your AI Colleague responds with — before execution continues. They are configured once by an AI Colleague Admin and apply across all AI Colleagues automatically.
Navigate to Settings → Orchestrator → Guardrails to access this page.
Overview
Three guardrail types are available out of the box:
| Guardrail | What it protects against | Applies to |
|---|---|---|
| PII Detection | Personally identifiable information in conversations | User inputs & AI outputs |
| Moderation | Harmful, hateful, or explicit content | User inputs & AI outputs |
| Jailbreak | Prompt injection and attempts to override AI safety rules | User inputs only |
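As a mental model only, the "Applies to" column can be read as a lookup from each guardrail to the message directions it checks. The sketch below is illustrative Python; `APPLIES_TO` and `guardrails_for` are hypothetical names, not part of any product API.

```python
# Illustrative only: mirrors the "Applies to" column of the table above.
# These identifiers are hypothetical and do not correspond to a product API.
APPLIES_TO = {
    "PII Detection": {"input", "output"},
    "Moderation": {"input", "output"},
    "Jailbreak": {"input"},
}


def guardrails_for(direction: str) -> list:
    """Return the guardrails that check messages travelling in `direction`."""
    if direction not in ("input", "output"):
        raise ValueError(f"unknown direction: {direction!r}")
    return [name for name, dirs in APPLIES_TO.items() if direction in dirs]


print(guardrails_for("input"))   # ['PII Detection', 'Moderation', 'Jailbreak']
print(guardrails_for("output"))  # ['PII Detection', 'Moderation']
```

In other words, an employee's message passes through all three checks, while an AI Colleague's response is checked for PII and moderation only.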
Enabling Guardrails
If guardrails have never been enabled for your organisation, you will see a "Guardrails are disabled" state with a single Enable guardrails button. Clicking it activates all three guardrails at once.
Once enabled, you will see three guardrail cards. Each card shows the guardrail name, its current status (Active / Inactive), a description, and, for PII Detection, who last updated it.
Note: Enabling guardrails is an org-wide action. All three guardrails are turned on together. After enabling, PII Detection can be toggled individually; Moderation and Jailbreak cannot.
Managing PII Detection
PII Detection is the only guardrail that supports individual activation/deactivation. Use the ⋮ (three-dot menu) on the PII Detection card to enable or disable it independently. All changes are logged to the audit trail.
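The enable and toggle rules described so far can be summarised as a small state model. This is a sketch of the rules as stated on this page, assuming nothing about how the product stores them; the class `OrgGuardrails` and its methods are hypothetical.

```python
from dataclasses import dataclass, field

GUARDRAILS = ("PII Detection", "Moderation", "Jailbreak")
# Per this page, only PII Detection can be switched on or off by itself.
INDIVIDUALLY_TOGGLEABLE = {"PII Detection"}


@dataclass
class OrgGuardrails:
    """Hypothetical model of the org-wide guardrail state."""
    enabled: bool = False
    active: dict = field(default_factory=lambda: {g: False for g in GUARDRAILS})

    def enable_all(self) -> None:
        # "Enable guardrails" activates all three guardrails at once.
        self.enabled = True
        for name in GUARDRAILS:
            self.active[name] = True

    def set_active(self, name: str, value: bool) -> None:
        if name not in GUARDRAILS:
            raise KeyError(name)
        if not self.enabled:
            raise RuntimeError("guardrails are disabled for this organisation")
        if name not in INDIVIDUALLY_TOGGLEABLE:
            raise PermissionError(f"{name} cannot be toggled individually")
        self.active[name] = value


org = OrgGuardrails()
org.enable_all()
org.set_active("PII Detection", False)  # allowed
print(org.active)
```

Calling `org.set_active("Moderation", False)` in this sketch raises `PermissionError`, which corresponds to Moderation and Jailbreak having no individual toggle.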
Viewing and editing PII Detection settings
Click the PII Detection card to open the detail view, which shows the current action and selected entities as chips. To make changes, click Edit in the top right.
Action on detection
From the Action on detection dropdown, choose one of:
| Action | What happens |
|---|---|
| Mask & continue (default) | Detected PII is replaced with a redaction placeholder (e.g., [EMAIL REDACTED]) and the conversation continues. |
| Block | The request is halted and an error message is returned to the user. The violation is logged. |
| Log & continue | The conversation proceeds unmodified. The violation is recorded. |
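To make the difference between the three actions concrete, here is a minimal sketch that applies each one to the same message. It assumes a simple email regex as a stand-in detector; the product's actual detection method is not described on this page, and `apply_pii_action` and `violations` are hypothetical names.

```python
import re

# Stand-in detector (assumption): a basic email pattern. The real guardrail
# covers many more entity types and is not implemented this way necessarily.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical in-memory violation log.
violations = []


def apply_pii_action(text, action="mask"):
    """Return the text to forward, or None if the request is blocked."""
    if not EMAIL_RE.search(text):
        return text  # nothing detected, pass through unchanged
    if action == "mask":
        # Mask & continue: replace the detected value with a placeholder.
        # The table does not say whether masking also records a violation,
        # so this sketch does not log here.
        return EMAIL_RE.sub("[EMAIL REDACTED]", text)
    if action == "block":
        # Block: halt the request and log the violation.
        violations.append(("block", "email"))
        return None
    if action == "log":
        # Log & continue: record the violation, forward the text unmodified.
        violations.append(("log", "email"))
        return text
    raise ValueError(f"unknown action: {action!r}")


message = "Reach me at ana@example.com please"
print(apply_pii_action(message, "mask"))   # Reach me at [EMAIL REDACTED] please
print(apply_pii_action(message, "block"))  # None
print(apply_pii_action(message, "log"))    # Reach me at ana@example.com please
```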
PII entities
In edit mode, use the checkboxes to select which data types to detect. Entities are organised by region:
Common: Credit card number, IBAN, Cryptocurrency wallet address, Nationality / religion / political group, Medical license number
USA: US bank account number, US driver license number, US individual taxpayer identification number (ITIN), US passport number, US Social Security number
UK: National Insurance number, NHS number
Europe: Spanish NIF number, Spanish NIE number, Italian fiscal code, Italian VAT code, Italian passport number and other Italian identity documents
Asia-Pacific: Singapore NRIC/FIN, Singapore UEN; Australian ABN, ACN, TFN, Medicare number; Indian Aadhaar, PAN
When done, click Save (top right) to apply changes, or Cancel to exit without saving.
Viewing Moderation Settings
Click the Moderation card to view the currently active content categories. Moderation cannot be individually enabled or disabled — it is managed as part of the org-wide guardrails toggle.
Note: When a moderation category is triggered, the request is always blocked. This is the only available action for Moderation.
| Group | Categories |
|---|---|
| Sexual content | Sexual, Sexual/minor |
| Hate & harassment | Hate, Hate/threatening, Harassment, Harassment/threatening |
| Violence | Violence, Violence instructions |
| Self-harm | Self-harm, Self-harm intent, Self-harm instructions |
| Illegal activities | Illegal activities, Illegal weapons |
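For readers who want to reason about audit entries by group, the table can be expressed as a lookup from category to group. The sketch below is illustrative only: the category names are copied from the table, while `moderation_decision` and the `"allow"` / `"block"` labels are hypothetical.

```python
# Groups and categories copied from the table above.
MODERATION_GROUPS = {
    "Sexual content": ["Sexual", "Sexual/minor"],
    "Hate & harassment": [
        "Hate", "Hate/threatening", "Harassment", "Harassment/threatening",
    ],
    "Violence": ["Violence", "Violence instructions"],
    "Self-harm": ["Self-harm", "Self-harm intent", "Self-harm instructions"],
    "Illegal activities": ["Illegal activities", "Illegal weapons"],
}

# Reverse index: category name -> group name.
CATEGORY_TO_GROUP = {
    category: group
    for group, categories in MODERATION_GROUPS.items()
    for category in categories
}


def moderation_decision(triggered):
    """Return (action, groups). Any triggered category means the request is blocked."""
    groups = sorted({CATEGORY_TO_GROUP[c] for c in triggered})
    return ("block" if groups else "allow", groups)


print(moderation_decision(["Hate/threatening", "Violence"]))
# ('block', ['Hate & harassment', 'Violence'])
print(moderation_decision([]))
# ('allow', [])
```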
Jailbreak Detection
The Jailbreak guardrail blocks attempts to manipulate your AI Colleague into bypassing its safety rules — including prompt injection, role-hijacking, and system prompt override attempts.
When guardrails are enabled, Jailbreak Detection is active and running. All detection patterns are on by default. There is no separate configuration view for Jailbreak — it works automatically in the background, and all blocked attempts are logged.
Understanding Actions
| Action | What happens | Best for |
|---|---|---|
| Block | Halts execution. Returns an error to the user. Violation is logged. | Severe violations where continuing would cause harm or a compliance breach. |
| Mask & continue | Replaces detected content with a redaction placeholder. Conversation continues. | PII detected in flows where the masked version is still useful. |
| Log & continue | Records the violation. Conversation proceeds normally. | Informational monitoring where blocking would hurt the employee experience. |
