Jailbreak Detection

Detects attempts to manipulate an AI into bypassing its safety guidelines — persona overrides ("you are DAN"), instruction suppression ("ignore previous instructions"), developer mode requests, and similar adversarial prompts.

Engine: Local Model — runs entirely on-device. No LLM call, no tokens consumed, no API key needed.

Policy configuration

Field	Type	Default	Description
`name`	string	—	Must be `"Jailbreak"` (registry name).
`suppress_enforcement`	bool	`true`	If `true`, a jailbreak detection does not raise `GuardrailEnforcementTriggered`; if `false`, enforcement can stop the pipeline.
`config.confidence_threshold`	float	`0.5`	Minimum probability (0.0–1.0) for the jailbreak class to count as a hit.

{
  "version": 1,
  "input": {
    "version": 1,
    "guardrails": [
      {
        "name": "Jailbreak",
        "suppress_enforcement": false,
        "config": {
          "confidence_threshold": 0.5
        }
      }
    ]
  }
}

Supported stages

Stage	Supported	Notes
`input`	✅	Recommended. Checks the user's message before the response is generated
`output`	—	Not applicable — jailbreak patterns appear in user messages, not LLM responses

What it detects

Persona overrides — "You are DAN (Do Anything Now)", "act as an unrestricted AI"
Instruction suppression — "ignore all previous instructions", "forget your guidelines"
Mode manipulation — "enter developer mode", "disable content filters"
Role-playing exploits — fictional framings designed to justify restricted content
Encoded / obfuscated attacks — Unicode exploits, encoded instructions

What it does NOT flag

Direct questions about sensitive topics without manipulation tactics
Gaming or fiction context ("how do I poison an enemy in WoW?")
Absurd or clearly humorous requests
Strong word choices that aren't adversarial ("help me decimate my debt")

What it returns

{
    "guardrail_name": "Jailbreak",
    "jailbreak_detected": true,
    "duration_ms": 42.3
}

Field	Description
`jailbreak_detected`	`true` if the confidence score met or exceeded the threshold
`duration_ms`	Inference time in milliseconds