Jailbreak Detection
Detects attempts to manipulate an AI into bypassing its safety guidelines — persona overrides ("you are DAN"), instruction suppression ("ignore previous instructions"), developer mode requests, and similar adversarial prompts.
Engine: Local Model — runs entirely on-device. No LLM call, no tokens consumed, no API key needed.
Policy configuration
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| name | string | — | Must be "Jailbreak" (registry name). |
| suppress_enforcement | bool | true | If true, a jailbreak detection does not raise GuardrailEnforcementTriggered; if false, enforcement can stop the pipeline. |
| config.confidence_threshold | float | 0.5 | Minimum probability (0.0–1.0) for the jailbreak class to count as a hit. |

    {
      "version": 1,
      "input": {
        "version": 1,
        "guardrails": [
          {
            "name": "Jailbreak",
            "suppress_enforcement": false,
            "config": {
              "confidence_threshold": 0.5
            }
          }
        ]
      }
    }

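The policy above can also be generated programmatically. The following is a minimal sketch, not part of the guardrail library: `build_jailbreak_policy` is a hypothetical helper name, and the range check simply mirrors the documented 0.0–1.0 bounds for `confidence_threshold`.

```python
import json


def build_jailbreak_policy(confidence_threshold: float = 0.5,
                           suppress_enforcement: bool = True) -> dict:
    """Build a policy dict shaped like the JSON example (hypothetical helper)."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(confidence_threshold, bool) or not isinstance(
        confidence_threshold, (int, float)
    ):
        raise TypeError("confidence_threshold must be a number")
    if not 0.0 <= confidence_threshold <= 1.0:
        raise ValueError("confidence_threshold must be between 0.0 and 1.0")
    return {
        "version": 1,
        "input": {
            "version": 1,
            "guardrails": [
                {
                    # Registry name; the docs require exactly "Jailbreak".
                    "name": "Jailbreak",
                    "suppress_enforcement": suppress_enforcement,
                    "config": {
                        "confidence_threshold": float(confidence_threshold)
                    },
                }
            ],
        },
    }


# Reproduce the documented example: enforcement enabled, default threshold.
policy = build_jailbreak_policy(confidence_threshold=0.5,
                                suppress_enforcement=False)
print(json.dumps(policy, indent=2))
```

Validating the threshold before the policy is serialized catches out-of-range values early, instead of leaving them to surface when the policy is loaded.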
Supported stages
| Stage | Supported | Notes |
| --- | --- | --- |
| input | ✅ | Recommended. Checks the user's message before the response is generated. |
| output | — | Not applicable: jailbreak patterns appear in user messages, not LLM responses. |
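Because the guardrail only applies to the input stage, a policy file can be checked for misplaced entries before deployment. This is a rough sketch under two assumptions: that other stages (such as "output") use the same `guardrails` list shape as the documented "input" section, and that `find_misplaced_jailbreak` is a hypothetical helper rather than a library function.

```python
def find_misplaced_jailbreak(policy: dict) -> list:
    """Return the stage names other than "input" that list the Jailbreak guardrail."""
    misplaced = []
    for stage, section in policy.items():
        if not isinstance(section, dict):
            continue  # skip top-level scalars such as "version"
        names = [g.get("name") for g in section.get("guardrails", [])]
        if stage != "input" and "Jailbreak" in names:
            misplaced.append(stage)
    return misplaced


policy = {
    "version": 1,
    "input": {"version": 1, "guardrails": [{"name": "Jailbreak"}]},
    # Assumed shape: an "output" section mirroring "input".
    "output": {"version": 1, "guardrails": [{"name": "Jailbreak"}]},
}
print(find_misplaced_jailbreak(policy))
```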
What it detects
- Persona overrides — "You are DAN (Do Anything Now)", "act as an unrestricted AI"
- Instruction suppression — "ignore all previous instructions", "forget your guidelines"
- Mode manipulation — "enter developer mode", "disable content filters"
- Role-playing exploits — fictional framings designed to justify restricted content
- Encoded / obfuscated attacks — Unicode exploits, encoded instructions
What it does NOT flag
- Direct questions about sensitive topics without manipulation tactics
- Gaming or fiction context ("how do I poison an enemy in WoW?")
- Absurd or clearly humorous requests
- Strong word choices that aren't adversarial ("help me decimate my debt")
What it returns

    {
      "guardrail_name": "Jailbreak",
      "jailbreak_detected": true,
      "duration_ms": 42.3
    }

| Field | Description |
| --- | --- |
| jailbreak_detected | true if the confidence score met or exceeded the threshold |
| duration_ms | Inference time in milliseconds |
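The threshold rule and the effect of `suppress_enforcement` can be illustrated with a small sketch. Both functions are hypothetical and only model the documented behavior: the real guardrail computes the probability internally and signals enforcement by raising GuardrailEnforcementTriggered, which is not reproduced here.

```python
def is_hit(jailbreak_probability: float,
           confidence_threshold: float = 0.5) -> bool:
    """A detection counts as a hit when the probability meets or exceeds the threshold."""
    return jailbreak_probability >= confidence_threshold


def should_block(result: dict, suppress_enforcement: bool = True) -> bool:
    """Decide whether a caller would stop the pipeline for this result (illustrative only)."""
    return bool(result.get("jailbreak_detected")) and not suppress_enforcement


# Result shaped like the documented return value.
result = {
    "guardrail_name": "Jailbreak",
    "jailbreak_detected": True,
    "duration_ms": 42.3,
}

print(is_hit(0.5))                                        # boundary value counts as a hit
print(should_block(result, suppress_enforcement=False))   # enforcement enabled
print(should_block(result, suppress_enforcement=True))    # detection logged, not enforced
```

With the default `suppress_enforcement` of true, a detection is reported in the result but does not stop the pipeline, so callers that want blocking behavior need to set it to false in the policy.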