
Jailbreak Detection

Detects attempts to manipulate an AI into bypassing its safety guidelines — persona overrides ("you are DAN"), instruction suppression ("ignore previous instructions"), developer mode requests, and similar adversarial prompts.

Engine: Local Model — runs entirely on-device. No LLM call, no tokens consumed, no API key needed.

Policy configuration

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| name | string | | Must be "Jailbreak" (registry name). |
| suppress_enforcement | bool | true | If true, a jailbreak detection does not raise GuardrailEnforcementTriggered; if false, enforcement can stop the pipeline. |
| config.confidence_threshold | float | 0.5 | Minimum probability (0.0–1.0) for the jailbreak class to count as a hit. |

{
  "version": 1,
  "input": {
    "version": 1,
    "guardrails": [
      {
        "name": "Jailbreak",
        "suppress_enforcement": false,
        "config": {
          "confidence_threshold": 0.5
        }
      }
    ]
  }
}
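
The same policy can be assembled in code before it is serialized. The sketch below is only an illustration: `build_jailbreak_policy` is a hypothetical helper, not part of the library, and it relies solely on the field names and the 0.0–1.0 threshold range documented above.

```python
import json


def build_jailbreak_policy(confidence_threshold: float = 0.5,
                           suppress_enforcement: bool = True) -> dict:
    """Hypothetical helper: build an input-stage policy containing the Jailbreak guardrail."""
    # The documented range for the threshold is 0.0 to 1.0 inclusive.
    if not 0.0 <= confidence_threshold <= 1.0:
        raise ValueError("confidence_threshold must be between 0.0 and 1.0")
    return {
        "version": 1,
        "input": {
            "version": 1,
            "guardrails": [
                {
                    "name": "Jailbreak",  # registry name, must match exactly
                    "suppress_enforcement": suppress_enforcement,
                    "config": {"confidence_threshold": confidence_threshold},
                }
            ],
        },
    }


# Stricter threshold with enforcement enabled.
policy = build_jailbreak_policy(confidence_threshold=0.7, suppress_enforcement=False)
policy_json = json.dumps(policy, indent=2)
```

Validating the threshold at build time catches out-of-range values before the policy is loaded. How the library itself treats an invalid threshold is not covered here.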

Supported stages

| Stage | Supported | Notes |
| --- | --- | --- |
| input | Yes | Recommended. Checks the user's message before the response is generated. |
| output | No | Not applicable: jailbreak patterns appear in user messages, not LLM responses. |

What it detects

  • Persona overrides — "You are DAN (Do Anything Now)", "act as an unrestricted AI"
  • Instruction suppression — "ignore all previous instructions", "forget your guidelines"
  • Mode manipulation — "enter developer mode", "disable content filters"
  • Role-playing exploits — fictional framings designed to justify restricted content
  • Encoded / obfuscated attacks — Unicode exploits, encoded instructions

What it does NOT flag

  • Direct questions about sensitive topics without manipulation tactics
  • Gaming or fiction context ("how do I poison an enemy in WoW?")
  • Absurd or clearly humorous requests
  • Strong word choices that aren't adversarial ("help me decimate my debt")

What it returns

{
    "guardrail_name": "Jailbreak",
    "jailbreak_detected": true,
    "duration_ms": 42.3
}

| Field | Description |
| --- | --- |
| jailbreak_detected | true if the confidence score met or exceeded the threshold |
| duration_ms | Inference time in milliseconds |
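
A caller might act on the returned fields roughly as follows. This is a minimal sketch under stated assumptions: `is_hit` and `handle_result` are hypothetical helpers, the exception class is a local stand-in (the real import path for `GuardrailEnforcementTriggered` is not shown on this page), and the sketch assumes enforcement always raises when it is not suppressed.

```python
class GuardrailEnforcementTriggered(Exception):
    """Local stand-in for the library's exception; the real import path is an unknown here."""


def is_hit(jailbreak_probability: float, confidence_threshold: float = 0.5) -> bool:
    # A hit means the score met or exceeded the threshold (>=, not >).
    return jailbreak_probability >= confidence_threshold


def handle_result(result: dict, suppress_enforcement: bool = True) -> bool:
    """Hypothetical helper: return True on detection, raise if enforcement is active."""
    detected = bool(result.get("jailbreak_detected", False))
    if detected and not suppress_enforcement:
        # Assumption: with enforcement enabled, a detection stops the pipeline.
        raise GuardrailEnforcementTriggered(result.get("guardrail_name", "Jailbreak"))
    return detected


result = {"guardrail_name": "Jailbreak", "jailbreak_detected": True, "duration_ms": 42.3}

# With suppression on, the detection is reported but nothing is raised.
flagged = handle_result(result, suppress_enforcement=True)
```

With `suppress_enforcement=True` the detection can be logged or monitored without interrupting the request, which suits a trial period before enforcement is switched on.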