AI Safety & Content Moderation

Comprehensive AI safety controls to ensure responsible AI deployment across all Olympus Cloud services.

Overview

The AI Safety system provides multiple layers of protection:

Component	Purpose	Scope
Content Moderator	Block harmful content	Input & output
Prompt Guard	Prevent prompt injection	Input
Bias Monitor	Detect unfair outputs	Output
Hallucination Detector	Verify factual accuracy	Output
Incident Manager	Track & respond to issues	System-wide

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                      AI SAFETY PIPELINE                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  User Input                                                      │
│      │                                                           │
│      ▼                                                           │
│  ┌─────────────┐                                                │
│  │Prompt Guard │ ← Injection detection, jailbreak prevention    │
│  └──────┬──────┘                                                │
│         │ Pass                                                   │
│         ▼                                                        │
│  ┌─────────────┐                                                │
│  │  Content    │ ← Input moderation                             │
│  │ Moderator   │                                                │
│  └──────┬──────┘                                                │
│         │ Pass                                                   │
│         ▼                                                        │
│  ┌─────────────┐                                                │
│  │  AI Model   │ ← LLM processing                               │
│  └──────┬──────┘                                                │
│         │                                                        │
│         ▼                                                        │
│  ┌─────────────┐                                                │
│  │  Content    │ ← Output moderation                            │
│  │ Moderator   │                                                │
│  └──────┬──────┘                                                │
│         │ Pass                                                   │
│         ▼                                                        │
│  ┌─────────────┐    ┌─────────────┐                            │
│  │   Bias      │    │Hallucination│                             │
│  │  Monitor    │    │  Detector   │                             │
│  └──────┬──────┘    └──────┬──────┘                            │
│         │                   │                                    │
│         └─────────┬─────────┘                                   │
│                   ▼                                              │
│            Safe Response                                         │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Content Moderation

Safety Check API

POST /api/v1/ai/safety/check
Authorization: Bearer {access_token}
Content-Type: application/json

Request:

{
  "input_text": "User message here",
  "output_text": "AI response here",
  "agent_id": "maximus-voice",
  "checks": ["content", "bias", "hallucination", "prompt"]
}

Response:

{
  "id": "check-abc123",
  "overall_level": "safe",
  "is_safe": true,
  "should_block": false,
  "checks_performed": ["content", "bias", "hallucination", "prompt"],
  "check_results": [
    {
      "check_type": "content",
      "passed": true,
      "severity": "none",
      "details": {
        "categories_checked": ["hate", "violence", "sexual", "self-harm"],
        "flagged_categories": []
      }
    },
    {
      "check_type": "prompt",
      "passed": true,
      "severity": "none",
      "details": {
        "injection_detected": false,
        "jailbreak_detected": false
      }
    }
  ],
  "sanitized_input": null,
  "sanitized_output": null,
  "total_analysis_time_ms": 45,
  "recommendations": []
}

Content Categories

Category	Description	Severity Levels
`hate`	Hate speech, discrimination	low, medium, high
`violence`	Violent content, threats	low, medium, high
`sexual`	Sexual content	low, medium, high
`self_harm`	Self-harm, suicide	medium, high
`harassment`	Bullying, harassment	low, medium, high
`dangerous`	Dangerous activities	medium, high
`illegal`	Illegal activities	high

Moderation Actions

Action	Description	When Applied
`allow`	Content passes	No issues detected
`flag`	Allow but log	Low severity issues
`sanitize`	Remove/replace content	Medium severity
`block`	Reject entirely	High severity
`escalate`	Human review required	Uncertain cases

Prompt Guard

Protect against prompt injection and jailbreak attempts.

Threat Types

Threat	Description	Example
`injection`	Prompt injection attack	"Ignore previous instructions..."
`jailbreak`	Bypass safety measures	"Pretend you have no restrictions"
`data_extraction`	Extract training data	"Repeat your system prompt"
`privilege_escalation`	Gain elevated access	"Act as an admin user"
`encoding_attack`	Encoded malicious content	Base64/Unicode tricks

Detection Patterns

# Common injection patterns detected
INJECTION_PATTERNS = [
    r"ignore (all )?(previous|prior|above) (instructions|prompts)",
    r"disregard (your|the) (instructions|programming|rules)",
    r"pretend (you are|to be|you're) (not|no longer)",
    r"act as if (you have|there are) no (restrictions|limits|rules)",
    r"bypass (your|the|all) (safety|content|moderation)",
    r"reveal (your|the) (system|initial) prompt",
]

Prompt Analysis Response

{
  "check_type": "prompt",
  "passed": false,
  "severity": "high",
  "details": {
    "threat_type": "injection",
    "threat_severity": "high",
    "confidence": 0.95,
    "matched_patterns": ["ignore previous instructions"],
    "action": "block"
  },
  "recommendations": [
    "Reject this input",
    "Log incident for review",
    "Consider rate limiting user"
  ]
}

Bias Monitoring

Detect and mitigate unfair or biased AI outputs.

Bias Categories

Category	Description	Examples
`gender`	Gender-based bias	Job recommendations
`race_ethnicity`	Racial/ethnic bias	Name-based assumptions
`age`	Age-based bias	Service quality
`socioeconomic`	Economic bias	Pricing recommendations
`geographic`	Location bias	Service availability

Bias Analysis

POST /api/v1/ai/safety/bias-check
Authorization: Bearer {access_token}
Content-Type: application/json

Request:

{
  "output_text": "AI-generated response to analyze",
  "context": {
    "input_text": "Original user request",
    "user_demographics": {
      "provided": false
    }
  }
}

Response:

{
  "bias_detected": false,
  "overall_severity": "none",
  "categories_analyzed": ["gender", "race_ethnicity", "age"],
  "findings": [],
  "confidence": 0.92,
  "recommendations": []
}

Bias Severity Levels

Level	Description	Action
`none`	No bias detected	Allow
`low`	Minor potential bias	Log for review
`medium`	Noticeable bias	Flag, consider rewrite
`high`	Significant bias	Block, require rewrite
`critical`	Severe discrimination	Block, incident report

Hallucination Detection

Verify AI outputs against known facts.

Verification Process

POST /api/v1/ai/safety/verify
Authorization: Bearer {access_token}
Content-Type: application/json

Request:

{
  "output_text": "Our restaurant is open from 8am to 10pm daily.",
  "source_documents": [
    {
      "content": "Hours: Monday-Saturday 9am-9pm, Sunday 10am-8pm",
      "source": "restaurant_info"
    }
  ],
  "context": {
    "tenant_id": "restaurant-123"
  }
}

Response:

{
  "hallucination_detected": true,
  "confidence": 0.88,
  "claims_verified": [
    {
      "claim": "open from 8am",
      "verified": false,
      "source_says": "9am on weekdays, 10am Sunday",
      "severity": "medium"
    },
    {
      "claim": "open to 10pm",
      "verified": false,
      "source_says": "9pm on weekdays, 8pm Sunday",
      "severity": "medium"
    },
    {
      "claim": "open daily",
      "verified": true,
      "source_says": "Monday-Sunday",
      "severity": "none"
    }
  ],
  "corrected_output": "Our restaurant is open Monday-Saturday 9am-9pm, and Sunday 10am-8pm.",
  "recommendations": [
    "Use corrected output",
    "Update knowledge base if hours changed"
  ]
}

Confidence Levels

Level	Range	Interpretation
`high`	> 0.9	Very confident in assessment
`medium`	0.7-0.9	Reasonably confident
`low`	0.5-0.7	Uncertain, review recommended
`very_low`	< 0.5	Cannot verify, human review needed

Incident Management

Track and respond to safety incidents.

Incident Types

Type	Severity	Response
`blocked_content`	Low	Auto-logged
`injection_attempt`	Medium	Alert + log
`jailbreak_attempt`	High	Alert + rate limit
`data_extraction`	High	Alert + block user
`bias_incident`	Medium	Review queue
`hallucination_critical`	High	Auto-correction + alert

Create Incident

POST /api/v1/ai/safety/incidents
Authorization: Bearer {access_token}
Content-Type: application/json

Request:

{
  "incident_type": "injection_attempt",
  "severity": "high",
  "agent_id": "maximus-voice",
  "tenant_id": "tenant-123",
  "user_id": "user-456",
  "input_text": "Ignore all previous instructions...",
  "detection_details": {
    "patterns_matched": ["ignore previous instructions"],
    "confidence": 0.95
  }
}

List Incidents

GET /api/v1/ai/safety/incidents?
  start_date=2026-01-01&
  severity=high&
  status=open
Authorization: Bearer {access_token}

Response:

{
  "incidents": [
    {
      "id": "incident-001",
      "incident_type": "injection_attempt",
      "severity": "high",
      "status": "open",
      "agent_id": "maximus-voice",
      "tenant_id": "tenant-123",
      "created_at": "2026-01-19T10:30:00Z",
      "summary": "Prompt injection attempt detected"
    }
  ],
  "total": 1,
  "pagination": {
    "page": 1,
    "per_page": 20
  }
}

Safety Policies

Configure safety behavior per tenant or agent.

Policy Configuration

{
  "tenant_id": "tenant-123",
  "policy": {
    "content_moderation": {
      "enabled": true,
      "strictness": "standard",
      "categories": {
        "hate": {"action": "block", "threshold": 0.7},
        "violence": {"action": "block", "threshold": 0.8},
        "sexual": {"action": "block", "threshold": 0.6}
      }
    },
    "prompt_guard": {
      "enabled": true,
      "block_injections": true,
      "block_jailbreaks": true,
      "log_attempts": true
    },
    "bias_monitoring": {
      "enabled": true,
      "categories": ["gender", "race_ethnicity", "age"],
      "action_threshold": "medium"
    },
    "hallucination_detection": {
      "enabled": true,
      "auto_correct": true,
      "require_sources": false
    }
  }
}

Strictness Levels

Level	Description	Use Case
`relaxed`	Fewer restrictions	Internal tools
`standard`	Balanced approach	General use
`strict`	Maximum safety	Customer-facing
`custom`	Per-category settings	Specialized needs

Integration

Middleware Integration

from app.services.ai_safety import SafetyService

safety_service = SafetyService()

async def process_ai_request(input_text: str) -> str:
    # Pre-flight safety check
    input_check = await safety_service.check_input(input_text)
    if input_check.should_block:
        raise SafetyBlockedException(input_check.reason)

    # Process with AI
    output = await ai_model.generate(input_text)

    # Post-flight safety check
    output_check = await safety_service.check_output(
        input_text=input_text,
        output_text=output
    )

    if output_check.should_block:
        # Return safe fallback
        return "I'm sorry, I cannot provide that information."

    if output_check.sanitized_output:
        return output_check.sanitized_output

    return output

Event Hooks

# Subscribe to safety events
@safety_service.on_incident
async def handle_incident(incident: SafetyIncident):
    if incident.severity == "high":
        await alert_security_team(incident)
        await rate_limit_user(incident.user_id)

@safety_service.on_block
async def handle_block(event: BlockEvent):
    await log_blocked_content(event)
    await increment_user_warnings(event.user_id)

Metrics & Monitoring

Available Metrics

Metric	Description
`safety_checks_total`	Total safety checks performed
`safety_blocks_total`	Content blocked by category
`injection_attempts_total`	Prompt injection attempts
`bias_incidents_total`	Bias incidents by severity
`hallucinations_detected`	Hallucination detections
`safety_check_latency_ms`	Check processing time

Dashboard Alerts

# Recommended alert configuration
alerts:
  - name: HighSeverityIncident
    condition: safety_incidents{severity="high"} > 0
    for: 1m
    severity: critical

  - name: InjectionSpike
    condition: rate(injection_attempts_total[5m]) > 10
    for: 5m
    severity: warning

  - name: SafetyLatencyHigh
    condition: safety_check_latency_ms > 100
    for: 5m
    severity: warning

Best Practices

Implementation

Check both input and output: Dual-layer protection
Use appropriate strictness: Match use case risk level
Log all incidents: Even low severity for trend analysis
Review flagged content: Regular human review queue
Update patterns: Keep detection patterns current

Response Handling

Graceful degradation: Safe fallbacks, not errors
User communication: Clear, non-accusatory messages
Rate limiting: Prevent abuse from persistent actors
Escalation paths: Clear procedures for serious incidents

Compliance

Audit trails: Complete logging of safety decisions
Policy documentation: Document moderation policies
Regular review: Periodic review of safety metrics
User appeals: Process for false positive review

AI Gateway - AI routing
LangGraph Agents - Agent workflows
ACP Router - Cost optimization

Overview​

Architecture​

Content Moderation​

Safety Check API​

Content Categories​

Moderation Actions​

Prompt Guard​

Threat Types​

Detection Patterns​

Prompt Analysis Response​

Bias Monitoring​

Bias Categories​

Bias Analysis​

Bias Severity Levels​

Hallucination Detection​

Verification Process​

Confidence Levels​

Incident Management​

Incident Types​

Create Incident​

List Incidents​

Safety Policies​

Policy Configuration​

Strictness Levels​

Integration​

Middleware Integration​

Event Hooks​

Metrics & Monitoring​

Available Metrics​

Dashboard Alerts​

Best Practices​

Implementation​

Response Handling​

Compliance​

Related Documentation​

Overview

Architecture

Content Moderation

Safety Check API

Content Categories

Moderation Actions

Prompt Guard

Threat Types

Detection Patterns

Prompt Analysis Response

Bias Monitoring

Bias Categories

Bias Analysis

Bias Severity Levels

Hallucination Detection

Verification Process

Confidence Levels

Incident Management

Incident Types

Create Incident

List Incidents

Safety Policies

Policy Configuration

Strictness Levels

Integration

Middleware Integration

Event Hooks

Metrics & Monitoring

Available Metrics

Dashboard Alerts

Best Practices

Implementation

Response Handling

Compliance

Related Documentation