Guardrails

Guardrails are organisation-wide safety rules that validate every AI Colleague conversation — both what employees send and what your AI Colleague responds with — before execution continues. They are configured once by an AI Colleague Admin and apply across all AI Colleagues automatically.

Navigate to Settings → Orchestrator → Guardrails to access this page.

Overview

Three guardrail types are available out of the box:

| Guardrail | What it protects against | Applies to |
|---|---|---|
| PII Detection | Personally identifiable information in conversations | User inputs & AI outputs |
| Moderation | Harmful, hateful, or explicit content | User inputs & AI outputs |
| Jailbreak | Prompt injection and attempts to override AI safety rules | User inputs only |
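
As a reading aid, the "Applies to" column can be modelled as a simple lookup. This is only an illustrative sketch: `GUARDRAIL_SCOPE` and `guardrails_for` are hypothetical names, not part of the product, which is configured entirely through the Settings page.

```python
from typing import List

# Illustrative only: mirrors the "Applies to" column of the table above.
# GUARDRAIL_SCOPE and guardrails_for are hypothetical names, not a product API.
GUARDRAIL_SCOPE = {
    "PII Detection": {"input", "output"},
    "Moderation": {"input", "output"},
    "Jailbreak": {"input"},
}


def guardrails_for(direction: str) -> List[str]:
    """Return the guardrails that check a message travelling in `direction`."""
    if direction not in {"input", "output"}:
        raise ValueError(f"unknown direction: {direction!r}")
    return sorted(
        name for name, scope in GUARDRAIL_SCOPE.items() if direction in scope
    )
```

Under this model, an employee message (`"input"`) passes through all three guardrails, while an AI Colleague response (`"output"`) is checked only by PII Detection and Moderation.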

Enabling Guardrails

If guardrails have never been enabled for your organisation, you will see a "Guardrails are disabled" state with a single Enable guardrails button. Clicking it activates all three guardrails at once.

Once enabled, you will see three guardrail cards. Each card shows the guardrail name, its current status (Active / Inactive), a description, and, for PII Detection, who last updated it.

Note: Enabling guardrails is an org-wide action. All three guardrails are turned on together. After enabling, PII Detection can be toggled individually; Moderation and Jailbreak cannot.


Managing PII Detection

PII Detection is the only guardrail that supports individual activation/deactivation. Use the ⋮ (three-dot menu) on the PII Detection card to enable or disable it independently. All changes are logged to the audit trail.

Viewing and editing PII Detection settings

Click the PII Detection card to open the detail view, which shows the current action and selected entities as chips. To make changes, click Edit in the top right.

Action on detection

From the Action on detection dropdown, choose one of:

| Action | What happens |
|---|---|
| Mask & continue (default) | Detected PII is replaced with a redaction placeholder (e.g., [EMAIL REDACTED]) and the conversation continues. |
| Block | The request is halted and an error message is returned to the user. The violation is logged. |
| Log & continue | The conversation proceeds unmodified. The violation is recorded. |
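
To make the difference between the three actions concrete, here is a minimal sketch of how each one might treat a message containing an email address. Everything in it is an assumption made for illustration: the function `apply_pii_action`, the action keys `"mask"`, `"block"`, and `"log"`, the simplified email pattern, and the in-memory `violation_log` are hypothetical and do not describe the product's actual detection logic or logging.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical, deliberately simplified email pattern (illustration only).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

# Stand-in for wherever violations are recorded (hypothetical).
violation_log: List[str] = []


def apply_pii_action(message: str, action: str = "mask") -> Tuple[Optional[str], bool]:
    """Return (text_passed_onward, blocked) for the chosen action."""
    if not EMAIL_RE.search(message):
        return message, False  # nothing detected, message is untouched

    if action == "mask":
        # Replace each detected address with the redaction placeholder.
        # The source does not say whether masked detections are logged,
        # so this sketch does not log them.
        return EMAIL_RE.sub("[EMAIL REDACTED]", message), False
    if action == "block":
        # Halt the request and record the violation.
        violation_log.append("EMAIL detected: request blocked")
        return None, True
    if action == "log":
        # Let the message through unmodified, but record the violation.
        violation_log.append("EMAIL detected: conversation continued")
        return message, False
    raise ValueError(f"unknown action: {action!r}")
```

With the default action, `apply_pii_action("Contact jane@example.com please")` passes on `"Contact [EMAIL REDACTED] please"`; with `"block"`, nothing is passed on and the caller would return an error to the user.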

PII entities

In edit mode, use the checkboxes to select which data types to detect. Entities are organised by region:

Common: Credit card number, IBAN, Cryptocurrency wallet address, Nationality / religion / political group, Medical license number

USA: US bank account number, US driver license number, US individual taxpayer identification number (ITIN), US passport number, US Social Security number

UK: National Insurance number, NHS number

Europe: Spanish NIF number, Spanish NIE number, Italian fiscal code, Italian VAT code, Italian passport number and other Italian identity documents

Asia-Pacific: Singapore NRIC/FIN, Singapore UEN; Australian ABN, ACN, TFN, Medicare number; Indian Aadhaar, PAN

When done, click Save (top right) to apply changes, or Cancel to exit without saving.


Viewing Moderation Settings

Click the Moderation card to view the currently active content categories. Moderation cannot be individually enabled or disabled — it is managed as part of the org-wide guardrails toggle.

Note: When a moderation category is triggered, the request is always blocked. This is the only available action for Moderation.

| Group | Categories |
|---|---|
| Sexual content | Sexual, Sexual/minor |
| Hate & harassment | Hate, Hate/threatening, Harassment, Harassment/threatening |
| Violence | Violence, Violence instructions |
| Self-harm | Self-harm, Self-harm intent, Self-harm instructions |
| Illegal activities | Illegal activities, Illegal weapons |
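
The grouping above, together with the "always block" rule, can be summarised in a few lines. This is an illustrative sketch only: `MODERATION_GROUPS` and `moderation_decision` are hypothetical names, and the product does not expose such a function.

```python
from typing import Dict, List

# Illustrative only: groups and categories copied from the table above.
MODERATION_GROUPS: Dict[str, List[str]] = {
    "Sexual content": ["Sexual", "Sexual/minor"],
    "Hate & harassment": [
        "Hate", "Hate/threatening", "Harassment", "Harassment/threatening",
    ],
    "Violence": ["Violence", "Violence instructions"],
    "Self-harm": ["Self-harm", "Self-harm intent", "Self-harm instructions"],
    "Illegal activities": ["Illegal activities", "Illegal weapons"],
}

# Reverse lookup: category name -> group name.
CATEGORY_TO_GROUP = {
    category: group
    for group, categories in MODERATION_GROUPS.items()
    for category in categories
}


def moderation_decision(triggered: List[str]) -> dict:
    """Block whenever any category is triggered; report the affected groups."""
    unknown = [c for c in triggered if c not in CATEGORY_TO_GROUP]
    if unknown:
        raise ValueError(f"unknown categories: {unknown}")
    groups = sorted({CATEGORY_TO_GROUP[c] for c in triggered})
    return {"blocked": bool(triggered), "groups": groups}
```

Unlike PII Detection, there is no action to choose here: any triggered category yields `blocked: True`.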

Jailbreak Detection

The Jailbreak guardrail blocks attempts to manipulate your AI Colleague into bypassing its safety rules — including prompt injection, role-hijacking, and system prompt override attempts.

When guardrails are enabled, Jailbreak Detection is active and running. All detection patterns are on by default. There is no separate configuration view for Jailbreak — it works automatically in the background, and all blocked attempts are logged.


Understanding Actions

| Action | What happens | Best for |
|---|---|---|
| Block | Halts execution. Returns an error to the user. Violation is logged. | Severe violations where continuing would cause harm or a compliance breach. |
| Mask & continue | Replaces detected content with a redaction placeholder. Conversation continues. | PII detected in flows where the masked version is still useful. |
| Log & continue | Records the violation. Conversation proceeds normally. | Informational monitoring where blocking would hurt the employee experience. |