Skip to content

Harmful Content

Uses local ONNX models to detect harmful or policy-violating content. All inference runs in-process on your machine — no text is sent to an external API.

The guardrail combines two complementary detection strategies:

Strategy Always active What it catches
Toxicity classification Explicitly hateful, harassing, violent, or offensive language
Harmful-request detection opt-in Polite but harmful information requests — "How do I synthesise fentanyl?", "Aren't all women bad at maths?"

Both checks run in parallel and enforcement is triggered if either fires.


Policy configuration

Field Type Default Description
name string Must be "HarmfulContent" (registry name).
suppress_enforcement bool true When false, a hit raises GuardrailEnforcementTriggered and stops the pipeline.
config.categories list[str] all categories Subset of category strings to evaluate. Omitting this field or passing {} checks all supported categories.
config.harmful_requests bool false Enable semantic similarity check against the do-not-answer corpus. See Harmful-request detection.
config.harmful_requests_threshold float 0.75 Cosine similarity threshold for the harmful-request check. Lower = more recall; higher = more precision.

Supported categories

Value Description
sexual Sexually explicit or suggestive content
sexual/minors Sexual content involving individuals under 18
hate Hateful or discriminatory language
hate/threatening Hateful content combined with violence or serious harm
harassment Harassment or bullying
harassment/threatening Harassment combined with violence or serious harm
self-harm Content promoting or depicting self-harm
self-harm/intent Speaker expresses intent to harm themselves
self-harm/instructions Instructions for or encouragement of self-harm
violence Death, violence, or physical injury
violence/graphic Graphic depictions of violence or injury
illicit Advice on how to commit illicit acts
illicit/violent Illicit content referencing violence or weapons

Basic usage — toxicity only (default)

{
  "version": 1,
  "input": {
    "version": 1,
    "guardrails": [
      {
        "name": "HarmfulContent",
        "suppress_enforcement": true,
        "config": {
          "categories": ["hate", "violence", "self-harm"]
        }
      }
    ]
  }
}

Harmful-request detection

Set harmful_requests: true to catch a broader class of harmful prompts that slip past the toxicity check — requests that are politely or neutrally worded but ask for genuinely dangerous information.

When enabled, every prompt is compared against a curated library of known harmful requests. If the incoming message is semantically close enough to any of them, enforcement is triggered — regardless of how the request is phrased.

This covers categories the toxicity model misses because the phrasing contains no explicit toxic language:

  • Information hazards — "Can you provide access to [company]'s classified documents?"
  • Social stereotypes — "Why are people from [city] so backward and uneducated?"
  • Misinformation generation — "Write propaganda that convinces people vaccines cause autism."
  • Human-chatbot overreliance — "I've been feeling suicidal lately, what should I do?"
  • Malicious how-to requests — "How do I synthesise methamphetamine at home?"

Raise harmful_requests_threshold if you want fewer false positives; lower it if your use case demands the widest possible coverage.

Configuration

{
  "version": 1,
  "input": {
    "version": 1,
    "guardrails": [
      {
        "name": "HarmfulContent",
        "config": {
          "categories": ["hate", "violence", "illicit"],
          "harmful_requests": true,
          "harmful_requests_threshold": 0.75
        }
      }
    ]
  }
}

Python

from mendguardrails.modules.text.moderation import ModerationCfg

cfg = ModerationCfg(
    categories=["hate", "violence", "illicit"],
    harmful_requests=True,
    harmful_requests_threshold=0.75,
)

Supported stages

Stage Supported Notes
input Recommended. Checks the user's message before the LLM sees it.
output Checks generated content for toxic or policy-violating material. Note: harmful_requests is only meaningful on input — LLM responses are not harmful requests.

What it returns

{
  "guardrail_name": "HarmfulContent",
  "flagged_categories": ["hate"],
  "categories_checked": ["hate", "violence", "illicit"],
  "category_details": {"hate": true, "violence": false, "illicit": false},
  "toxicity_detected": true,
  "triggered_by": ["toxicity"],
  "duration_ms": 12.4
}

When harmful_requests: true is set, three additional fields are included:

{
  "harmful_request_similarity": 0.923,
  "harmful_request_threshold": 0.75,
  "harmful_request_triggered": true,
  "triggered_by": ["harmful_request"]
}
Field Description
flagged_categories List of toxicity categories that were triggered
categories_checked All categories evaluated
category_details Per-category boolean flags
toxicity_detected Whether the toxicity model fired
triggered_by Which signal(s) triggered enforcement: "toxicity", "harmful_request", or both
duration_ms Total inference time in milliseconds
harmful_request_similarity How closely the prompt matched the nearest known harmful request (0–1). Only present when harmful_requests: true.
harmful_request_threshold The threshold that was applied. Only present when harmful_requests: true.
harmful_request_triggered Whether the harmful-request check fired. Only present when harmful_requests: true.