Guardrails

Guardrails are organisation-wide safety rules that validate every AI Colleague conversation — both what employees send and what your AI Colleague responds with — before execution continues. They are configured once by an AI Colleague Admin and apply across all AI Colleagues automatically.

Navigate to Settings → Orchestrator → Guardrails to access this page.

Overview

Three guardrail types are available out of the box:

Guardrail	What it protects against	Applies to
PII Detection	Personally identifiable information in conversations	User inputs & AI outputs
Moderation	Harmful, hateful, or explicit content	User inputs & AI outputs
Jailbreak	Prompt injection and attempts to override AI safety rules	User inputs only
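
The "Applies to" column can be read as a lookup from message direction to the guardrails that run on it. The following is a minimal Python sketch of that reading, not the product's implementation; `Direction`, `GUARDRAIL_SCOPE`, and `guardrails_for` are hypothetical names introduced purely for illustration.

```python
from enum import Enum


class Direction(Enum):
    USER_INPUT = "user_input"
    AI_OUTPUT = "ai_output"


# Scope of each guardrail, copied from the overview table.
# The names and structure are illustrative assumptions.
GUARDRAIL_SCOPE = {
    "PII Detection": {Direction.USER_INPUT, Direction.AI_OUTPUT},
    "Moderation": {Direction.USER_INPUT, Direction.AI_OUTPUT},
    "Jailbreak": {Direction.USER_INPUT},
}


def guardrails_for(direction):
    """Return the names of the guardrails that check this direction."""
    return sorted(
        name for name, scope in GUARDRAIL_SCOPE.items() if direction in scope
    )


print(guardrails_for(Direction.USER_INPUT))
print(guardrails_for(Direction.AI_OUTPUT))
```

In this sketch, user inputs pass through all three guardrails, while AI outputs are checked only by PII Detection and Moderation.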

Enabling Guardrails

If guardrails have never been enabled for your organisation, you will see a "Guardrails are disabled" state with a single Enable guardrails button. Clicking it activates all three guardrails at once.

Once enabled, you will see three guardrail cards. Each card shows the guardrail name, its current status (Active / Inactive), a description, and, for PII Detection only, who last updated it.

📘
Note
Enabling guardrails is an org-wide action. All three guardrails are turned on together. After enabling, PII Detection can be individually toggled — Moderation and Jailbreak cannot.

Managing PII Detection

PII Detection is the only guardrail that supports individual activation/deactivation. Use the ⋮ (three-dot menu) on the PII Detection card to enable or disable it independently. All changes are logged to the audit trail.
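
The enablement rules can be summarised as a small state model: one org-wide action turns everything on, and only PII Detection accepts an individual toggle afterwards. The sketch below is an assumption-laden illustration in Python; the `GuardrailSettings` class and its methods are hypothetical and do not correspond to any documented API.

```python
class GuardrailSettings:
    """Hypothetical in-memory model of the org-wide guardrail state."""

    NAMES = ("PII Detection", "Moderation", "Jailbreak")
    INDIVIDUALLY_TOGGLEABLE = {"PII Detection"}

    def __init__(self):
        # Organisations start in the "Guardrails are disabled" state.
        self.enabled = False
        self.active = {name: False for name in self.NAMES}

    def enable_guardrails(self):
        """Org-wide action: turns on all three guardrails together."""
        self.enabled = True
        for name in self.NAMES:
            self.active[name] = True

    def set_active(self, name, value):
        """Toggle a single guardrail; only PII Detection allows this."""
        if not self.enabled:
            raise RuntimeError("Enable guardrails for the organisation first")
        if name not in self.INDIVIDUALLY_TOGGLEABLE:
            raise ValueError(f"{name} cannot be toggled individually")
        self.active[name] = value


settings = GuardrailSettings()
settings.enable_guardrails()
settings.set_active("PII Detection", False)
print(settings.active)
```

Calling `settings.set_active("Moderation", False)` raises a `ValueError` in this model, mirroring the rule that Moderation and Jailbreak cannot be toggled on their own.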

Viewing and editing PII Detection settings

Click the PII Detection card to open the detail view, which shows the current action and selected entities as chips. To make changes, click Edit in the top right.

Action on detection

From the Action on detection dropdown, choose one of:

Action	What happens
Mask & continue (default)	Detected PII is replaced with a redaction placeholder (e.g., `[EMAIL REDACTED]`) and the conversation continues.
Block	The request is halted and an error message is returned to the user. The violation is logged.
Log & continue	The conversation proceeds unmodified. The violation is recorded.
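
To make the difference between the three actions concrete, here is a minimal Python sketch that applies each one to a message containing an email address. It is an illustration under stated assumptions, not the product's detection logic: the regular expression, the `Outcome` type, the `apply_pii_action` function, the action identifiers, and the error text are all hypothetical. Only the `[EMAIL REDACTED]` placeholder comes from the table.

```python
import re
from dataclasses import dataclass, field

# Hypothetical, deliberately simple email detector; real PII detection
# covers many more entity types than this one pattern.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


@dataclass
class Outcome:
    allowed: bool  # False means the request was halted
    text: str      # text passed onward, or the error shown to the user
    violations: list = field(default_factory=list)


def apply_pii_action(message, action):
    """Apply one of the three configured actions to a message."""
    if not EMAIL_RE.search(message):
        return Outcome(True, message)
    if action == "mask_and_continue":
        # The documentation does not say whether masked detections are
        # also logged, so this sketch records nothing here.
        return Outcome(True, EMAIL_RE.sub("[EMAIL REDACTED]", message))
    if action == "block":
        return Outcome(False, "Request blocked: PII detected.", ["EMAIL"])
    if action == "log_and_continue":
        return Outcome(True, message, ["EMAIL"])
    raise ValueError(f"Unknown action: {action}")


msg = "Please email jane.doe@example.com about the invoice."
for action in ("mask_and_continue", "block", "log_and_continue"):
    print(action, "->", apply_pii_action(msg, action))
```

With masking, the message continues as "Please email [EMAIL REDACTED] about the invoice."; with blocking, nothing is passed on; with logging, the original text continues unchanged and the detection is recorded.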

PII entities

In edit mode, use the checkboxes to select which data types to detect. Entities are organised by region:

Common: Credit card number, IBAN, Cryptocurrency wallet address, Nationality / religion / political group, Medical license number

USA: US bank account number, US driver license number, US individual taxpayer identification number (ITIN), US passport number, US Social Security number

UK: National Insurance number, NHS number

Europe: Spanish NIF number, Spanish NIE number, Italian fiscal code, Italian VAT code, Italian passport number and other Italian identity documents

Asia-Pacific: Singapore NRIC/FIN, Singapore UEN; Australian ABN, ACN, TFN, Medicare number; Indian Aadhaar, PAN

When done, click Save (top right) to apply changes, or Cancel to exit without saving.

Viewing Moderation Settings

Click the Moderation card to view the currently active content categories. Moderation cannot be individually enabled or disabled — it is managed as part of the org-wide guardrails toggle.

📘
Note
When a moderation category is triggered, the request is always blocked. This is the only available action for Moderation.

Group	Categories
Sexual content	Sexual, Sexual/minor
Hate & harassment	Hate, Hate/threatening, Harassment, Harassment/threatening
Violence	Violence, Violence instructions
Self-harm	Self-harm, Self-harm intent, Self-harm instructions
Illegal activities	Illegal activities, Illegal weapons
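
The grouping above, combined with the block-only rule, can be sketched as a simple lookup. The Python below is illustrative only: `MODERATION_GROUPS`, `CATEGORY_TO_GROUP`, and `moderation_decision` are hypothetical names, and the classifier that decides which categories are triggered is out of scope.

```python
# Groups and categories copied from the table; the data structure
# itself is an assumption made for illustration.
MODERATION_GROUPS = {
    "Sexual content": ["Sexual", "Sexual/minor"],
    "Hate & harassment": [
        "Hate", "Hate/threatening", "Harassment", "Harassment/threatening",
    ],
    "Violence": ["Violence", "Violence instructions"],
    "Self-harm": ["Self-harm", "Self-harm intent", "Self-harm instructions"],
    "Illegal activities": ["Illegal activities", "Illegal weapons"],
}

# Reverse index: category name -> group name.
CATEGORY_TO_GROUP = {
    category: group
    for group, categories in MODERATION_GROUPS.items()
    for category in categories
}


def moderation_decision(triggered_categories):
    """Block if any category is triggered; blocking is the only action."""
    groups = sorted({CATEGORY_TO_GROUP[c] for c in triggered_categories})
    return {"action": "block" if groups else "allow", "groups": groups}


print(moderation_decision(["Hate/threatening", "Violence"]))
print(moderation_decision([]))
```

There is no branch for masking or logging here, because a triggered Moderation category always blocks the request.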

Jailbreak Detection

The Jailbreak guardrail blocks attempts to manipulate your AI Colleague into bypassing its safety rules — including prompt injection, role-hijacking, and system prompt override attempts.

When guardrails are enabled, Jailbreak Detection is active, with all detection patterns on by default. There is no separate configuration view for Jailbreak: it runs automatically in the background, and all blocked attempts are logged.

Understanding Actions

Action	What happens	Best for
Block	Halts execution. Returns an error to the user. Violation is logged.	Severe violations where continuing would cause harm or a compliance breach.
Mask & continue	Replaces detected content with a redaction placeholder. Conversation continues.	PII detected in flows where the masked version is still useful.
Log & continue	Records the violation. Conversation proceeds normally.	Informational monitoring where blocking would hurt the employee experience.

Guardrails Violation in Run History

Related Article