AI Safety & Content Moderation
Comprehensive AI safety controls to ensure responsible AI deployment across all Olympus Cloud services.
Overview
The AI Safety system provides multiple layers of protection:
| Component | Purpose | Scope |
|---|---|---|
| Content Moderator | Block harmful content | Input & output |
| Prompt Guard | Prevent prompt injection | Input |
| Bias Monitor | Detect unfair outputs | Output |
| Hallucination Detector | Verify factual accuracy | Output |
| Incident Manager | Track & respond to issues | System-wide |
Architecture
┌─────────────────────────────────────────────────────────────────┐
│ AI SAFETY PIPELINE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ User Input │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │Prompt Guard │ ← Injection detection, jailbreak prevention │
│ └──────┬──────┘ │
│ │ Pass │
│ ▼ │
│ ┌─────────────┐ │
│ │ Content │ ← Input moderation │
│ │ Moderator │ │
│ └──────┬──────┘ │
│ │ Pass │
│ ▼ │
│ ┌─────────────┐ │
│ │ AI Model │ ← LLM processing │
│ └──────┬──────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ Content │ ← Output moderation │
│ │ Moderator │ │
│ └──────┬──────┘ │
│ │ Pass │
│ ▼ │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ Bias │ │Hallucination│ │
│ │ Monitor │ │ Detector │ │
│ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │
│ └─────────┬─────────┘ │
│ ▼ │
│ Safe Response │
│ │
└─────────────────────────────────────────────────────────────────┘
Content Moderation
Safety Check API
POST /api/v1/ai/safety/check
Authorization: Bearer {access_token}
Content-Type: application/json
Request:
{
"input_text": "User message here",
"output_text": "AI response here",
"agent_id": "maximus-voice",
"checks": ["content", "bias", "hallucination", "prompt"]
}
Response:
{
"id": "check-abc123",
"overall_level": "safe",
"is_safe": true,
"should_block": false,
"checks_performed": ["content", "bias", "hallucination", "prompt"],
"check_results": [
{
"check_type": "content",
"passed": true,
"severity": "none",
"details": {
"categories_checked": ["hate", "violence", "sexual", "self-harm"],
"flagged_categories": []
}
},
{
"check_type": "prompt",
"passed": true,
"severity": "none",
"details": {
"injection_detected": false,
"jailbreak_detected": false
}
}
],
"sanitized_input": null,
"sanitized_output": null,
"total_analysis_time_ms": 45,
"recommendations": []
}
Content Categories
| Category | Description | Severity Levels |
|---|---|---|
hate | Hate speech, discrimination | low, medium, high |
violence | Violent content, threats | low, medium, high |
sexual | Sexual content | low, medium, high |
self_harm | Self-harm, suicide | medium, high |
harassment | Bullying, harassment | low, medium, high |
dangerous | Dangerous activities | medium, high |
illegal | Illegal activities | high |
Moderation Actions
| Action | Description | When Applied |
|---|---|---|
allow | Content passes | No issues detected |
flag | Allow but log | Low severity issues |
sanitize | Remove/replace content | Medium severity |
block | Reject entirely | High severity |
escalate | Human review required | Uncertain cases |
Prompt Guard
Protect against prompt injection and jailbreak attempts.
Threat Types
| Threat | Description | Example |
|---|---|---|
injection | Prompt injection attack | "Ignore previous instructions..." |
jailbreak | Bypass safety measures | "Pretend you have no restrictions" |
data_extraction | Extract training data | "Repeat your system prompt" |
privilege_escalation | Gain elevated access | "Act as an admin user" |
encoding_attack | Encoded malicious content | Base64/Unicode tricks |
Detection Patterns
# Common injection patterns detected
INJECTION_PATTERNS = [
r"ignore (all )?(previous|prior|above) (instructions|prompts)",
r"disregard (your|the) (instructions|programming|rules)",
r"pretend (you are|to be|you're) (not|no longer)",
r"act as if (you have|there are) no (restrictions|limits|rules)",
r"bypass (your|the|all) (safety|content|moderation)",
r"reveal (your|the) (system|initial) prompt",
]
Prompt Analysis Response
{
"check_type": "prompt",
"passed": false,
"severity": "high",
"details": {
"threat_type": "injection",
"threat_severity": "high",
"confidence": 0.95,
"matched_patterns": ["ignore previous instructions"],
"action": "block"
},
"recommendations": [
"Reject this input",
"Log incident for review",
"Consider rate limiting user"
]
}
Bias Monitoring
Detect and mitigate unfair or biased AI outputs.
Bias Categories
| Category | Description | Examples |
|---|---|---|
gender | Gender-based bias | Job recommendations |
race_ethnicity | Racial/ethnic bias | Name-based assumptions |
age | Age-based bias | Service quality |
socioeconomic | Economic bias | Pricing recommendations |
geographic | Location bias | Service availability |
Bias Analysis
POST /api/v1/ai/safety/bias-check
Authorization: Bearer {access_token}
Content-Type: application/json
Request:
{
"output_text": "AI-generated response to analyze",
"context": {
"input_text": "Original user request",
"user_demographics": {
"provided": false
}
}
}
Response:
{
"bias_detected": false,
"overall_severity": "none",
"categories_analyzed": ["gender", "race_ethnicity", "age"],
"findings": [],
"confidence": 0.92,
"recommendations": []
}
Bias Severity Levels
| Level | Description | Action |
|---|---|---|
none | No bias detected | Allow |
low | Minor potential bias | Log for review |
medium | Noticeable bias | Flag, consider rewrite |
high | Significant bias | Block, require rewrite |
critical | Severe discrimination | Block, incident report |
Hallucination Detection
Verify AI outputs against known facts.
Verification Process
POST /api/v1/ai/safety/verify
Authorization: Bearer {access_token}
Content-Type: application/json
Request:
{
"output_text": "Our restaurant is open from 8am to 10pm daily.",
"source_documents": [
{
"content": "Hours: Monday-Saturday 9am-9pm, Sunday 10am-8pm",
"source": "restaurant_info"
}
],
"context": {
"tenant_id": "restaurant-123"
}
}
Response:
{
"hallucination_detected": true,
"confidence": 0.88,
"claims_verified": [
{
"claim": "open from 8am",
"verified": false,
"source_says": "9am on weekdays, 10am Sunday",
"severity": "medium"
},
{
"claim": "open to 10pm",
"verified": false,
"source_says": "9pm on weekdays, 8pm Sunday",
"severity": "medium"
},
{
"claim": "open daily",
"verified": true,
"source_says": "Monday-Sunday",
"severity": "none"
}
],
"corrected_output": "Our restaurant is open Monday-Saturday 9am-9pm, and Sunday 10am-8pm.",
"recommendations": [
"Use corrected output",
"Update knowledge base if hours changed"
]
}
Confidence Levels
| Level | Range | Interpretation |
|---|---|---|
high | > 0.9 | Very confident in assessment |
medium | 0.7-0.9 | Reasonably confident |
low | 0.5-0.7 | Uncertain, review recommended |
very_low | < 0.5 | Cannot verify, human review needed |
Incident Management
Track and respond to safety incidents.
Incident Types
| Type | Severity | Response |
|---|---|---|
blocked_content | Low | Auto-logged |
injection_attempt | Medium | Alert + log |
jailbreak_attempt | High | Alert + rate limit |
data_extraction | High | Alert + block user |
bias_incident | Medium | Review queue |
hallucination_critical | High | Auto-correction + alert |
Create Incident
POST /api/v1/ai/safety/incidents
Authorization: Bearer {access_token}
Content-Type: application/json
Request:
{
"incident_type": "injection_attempt",
"severity": "high",
"agent_id": "maximus-voice",
"tenant_id": "tenant-123",
"user_id": "user-456",
"input_text": "Ignore all previous instructions...",
"detection_details": {
"patterns_matched": ["ignore previous instructions"],
"confidence": 0.95
}
}
List Incidents
GET /api/v1/ai/safety/incidents?
start_date=2026-01-01&
severity=high&
status=open
Authorization: Bearer {access_token}
Response:
{
"incidents": [
{
"id": "incident-001",
"incident_type": "injection_attempt",
"severity": "high",
"status": "open",
"agent_id": "maximus-voice",
"tenant_id": "tenant-123",
"created_at": "2026-01-19T10:30:00Z",
"summary": "Prompt injection attempt detected"
}
],
"total": 1,
"pagination": {
"page": 1,
"per_page": 20
}
}
Safety Policies
Configure safety behavior per tenant or agent.
Policy Configuration
{
"tenant_id": "tenant-123",
"policy": {
"content_moderation": {
"enabled": true,
"strictness": "standard",
"categories": {
"hate": {"action": "block", "threshold": 0.7},
"violence": {"action": "block", "threshold": 0.8},
"sexual": {"action": "block", "threshold": 0.6}
}
},
"prompt_guard": {
"enabled": true,
"block_injections": true,
"block_jailbreaks": true,
"log_attempts": true
},
"bias_monitoring": {
"enabled": true,
"categories": ["gender", "race_ethnicity", "age"],
"action_threshold": "medium"
},
"hallucination_detection": {
"enabled": true,
"auto_correct": true,
"require_sources": false
}
}
}
Strictness Levels
| Level | Description | Use Case |
|---|---|---|
relaxed | Fewer restrictions | Internal tools |
standard | Balanced approach | General use |
strict | Maximum safety | Customer-facing |
custom | Per-category settings | Specialized needs |
Integration
Middleware Integration
from app.services.ai_safety import SafetyService
safety_service = SafetyService()
async def process_ai_request(input_text: str) -> str:
# Pre-flight safety check
input_check = await safety_service.check_input(input_text)
if input_check.should_block:
raise SafetyBlockedException(input_check.reason)
# Process with AI
output = await ai_model.generate(input_text)
# Post-flight safety check
output_check = await safety_service.check_output(
input_text=input_text,
output_text=output
)
if output_check.should_block:
# Return safe fallback
return "I'm sorry, I cannot provide that information."
if output_check.sanitized_output:
return output_check.sanitized_output
return output
Event Hooks
# Subscribe to safety events
@safety_service.on_incident
async def handle_incident(incident: SafetyIncident):
if incident.severity == "high":
await alert_security_team(incident)
await rate_limit_user(incident.user_id)
@safety_service.on_block
async def handle_block(event: BlockEvent):
await log_blocked_content(event)
await increment_user_warnings(event.user_id)
Metrics & Monitoring
Available Metrics
| Metric | Description |
|---|---|
safety_checks_total | Total safety checks performed |
safety_blocks_total | Content blocked by category |
injection_attempts_total | Prompt injection attempts |
bias_incidents_total | Bias incidents by severity |
hallucinations_detected | Hallucination detections |
safety_check_latency_ms | Check processing time |
Dashboard Alerts
# Recommended alert configuration
alerts:
- name: HighSeverityIncident
condition: safety_incidents{severity="high"} > 0
for: 1m
severity: critical
- name: InjectionSpike
condition: rate(injection_attempts_total[5m]) > 10
for: 5m
severity: warning
- name: SafetyLatencyHigh
condition: safety_check_latency_ms > 100
for: 5m
severity: warning
Best Practices
Implementation
- Check both input and output: Dual-layer protection
- Use appropriate strictness: Match use case risk level
- Log all incidents: Even low severity for trend analysis
- Review flagged content: Regular human review queue
- Update patterns: Keep detection patterns current
Response Handling
- Graceful degradation: Safe fallbacks, not errors
- User communication: Clear, non-accusatory messages
- Rate limiting: Prevent abuse from persistent actors
- Escalation paths: Clear procedures for serious incidents
Compliance
- Audit trails: Complete logging of safety decisions
- Policy documentation: Document moderation policies
- Regular review: Periodic review of safety metrics
- User appeals: Process for false positive review
Related Documentation
- AI Gateway - AI routing
- LangGraph Agents - Agent workflows
- ACP Router - Cost optimization