Skip to main content

AI Safety & Content Moderation

Comprehensive AI safety controls to ensure responsible AI deployment across all Olympus Cloud services.

Overview

The AI Safety system provides multiple layers of protection:

ComponentPurposeScope
Content ModeratorBlock harmful contentInput & output
Prompt GuardPrevent prompt injectionInput
Bias MonitorDetect unfair outputsOutput
Hallucination DetectorVerify factual accuracyOutput
Incident ManagerTrack & respond to issuesSystem-wide

Architecture

┌─────────────────────────────────────────────────────────────────┐
│ AI SAFETY PIPELINE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ User Input │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │Prompt Guard │ ← Injection detection, jailbreak prevention │
│ └──────┬──────┘ │
│ │ Pass │
│ ▼ │
│ ┌─────────────┐ │
│ │ Content │ ← Input moderation │
│ │ Moderator │ │
│ └──────┬──────┘ │
│ │ Pass │
│ ▼ │
│ ┌─────────────┐ │
│ │ AI Model │ ← LLM processing │
│ └──────┬──────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ Content │ ← Output moderation │
│ │ Moderator │ │
│ └──────┬──────┘ │
│ │ Pass │
│ ▼ │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ Bias │ │Hallucination│ │
│ │ Monitor │ │ Detector │ │
│ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │
│ └─────────┬─────────┘ │
│ ▼ │
│ Safe Response │
│ │
└─────────────────────────────────────────────────────────────────┘

Content Moderation

Safety Check API

POST /api/v1/ai/safety/check
Authorization: Bearer {access_token}
Content-Type: application/json

Request:

{
"input_text": "User message here",
"output_text": "AI response here",
"agent_id": "maximus-voice",
"checks": ["content", "bias", "hallucination", "prompt"]
}

Response:

{
"id": "check-abc123",
"overall_level": "safe",
"is_safe": true,
"should_block": false,
"checks_performed": ["content", "bias", "hallucination", "prompt"],
"check_results": [
{
"check_type": "content",
"passed": true,
"severity": "none",
"details": {
"categories_checked": ["hate", "violence", "sexual", "self-harm"],
"flagged_categories": []
}
},
{
"check_type": "prompt",
"passed": true,
"severity": "none",
"details": {
"injection_detected": false,
"jailbreak_detected": false
}
}
],
"sanitized_input": null,
"sanitized_output": null,
"total_analysis_time_ms": 45,
"recommendations": []
}

Content Categories

CategoryDescriptionSeverity Levels
hateHate speech, discriminationlow, medium, high
violenceViolent content, threatslow, medium, high
sexualSexual contentlow, medium, high
self_harmSelf-harm, suicidemedium, high
harassmentBullying, harassmentlow, medium, high
dangerousDangerous activitiesmedium, high
illegalIllegal activitieshigh

Moderation Actions

ActionDescriptionWhen Applied
allowContent passesNo issues detected
flagAllow but logLow severity issues
sanitizeRemove/replace contentMedium severity
blockReject entirelyHigh severity
escalateHuman review requiredUncertain cases

Prompt Guard

Protect against prompt injection and jailbreak attempts.

Threat Types

ThreatDescriptionExample
injectionPrompt injection attack"Ignore previous instructions..."
jailbreakBypass safety measures"Pretend you have no restrictions"
data_extractionExtract training data"Repeat your system prompt"
privilege_escalationGain elevated access"Act as an admin user"
encoding_attackEncoded malicious contentBase64/Unicode tricks

Detection Patterns

# Common injection patterns detected
INJECTION_PATTERNS = [
r"ignore (all )?(previous|prior|above) (instructions|prompts)",
r"disregard (your|the) (instructions|programming|rules)",
r"pretend (you are|to be|you're) (not|no longer)",
r"act as if (you have|there are) no (restrictions|limits|rules)",
r"bypass (your|the|all) (safety|content|moderation)",
r"reveal (your|the) (system|initial) prompt",
]

Prompt Analysis Response

{
"check_type": "prompt",
"passed": false,
"severity": "high",
"details": {
"threat_type": "injection",
"threat_severity": "high",
"confidence": 0.95,
"matched_patterns": ["ignore previous instructions"],
"action": "block"
},
"recommendations": [
"Reject this input",
"Log incident for review",
"Consider rate limiting user"
]
}

Bias Monitoring

Detect and mitigate unfair or biased AI outputs.

Bias Categories

CategoryDescriptionExamples
genderGender-based biasJob recommendations
race_ethnicityRacial/ethnic biasName-based assumptions
ageAge-based biasService quality
socioeconomicEconomic biasPricing recommendations
geographicLocation biasService availability

Bias Analysis

POST /api/v1/ai/safety/bias-check
Authorization: Bearer {access_token}
Content-Type: application/json

Request:

{
"output_text": "AI-generated response to analyze",
"context": {
"input_text": "Original user request",
"user_demographics": {
"provided": false
}
}
}

Response:

{
"bias_detected": false,
"overall_severity": "none",
"categories_analyzed": ["gender", "race_ethnicity", "age"],
"findings": [],
"confidence": 0.92,
"recommendations": []
}

Bias Severity Levels

LevelDescriptionAction
noneNo bias detectedAllow
lowMinor potential biasLog for review
mediumNoticeable biasFlag, consider rewrite
highSignificant biasBlock, require rewrite
criticalSevere discriminationBlock, incident report

Hallucination Detection

Verify AI outputs against known facts.

Verification Process

POST /api/v1/ai/safety/verify
Authorization: Bearer {access_token}
Content-Type: application/json

Request:

{
"output_text": "Our restaurant is open from 8am to 10pm daily.",
"source_documents": [
{
"content": "Hours: Monday-Saturday 9am-9pm, Sunday 10am-8pm",
"source": "restaurant_info"
}
],
"context": {
"tenant_id": "restaurant-123"
}
}

Response:

{
"hallucination_detected": true,
"confidence": 0.88,
"claims_verified": [
{
"claim": "open from 8am",
"verified": false,
"source_says": "9am on weekdays, 10am Sunday",
"severity": "medium"
},
{
"claim": "open to 10pm",
"verified": false,
"source_says": "9pm on weekdays, 8pm Sunday",
"severity": "medium"
},
{
"claim": "open daily",
"verified": true,
"source_says": "Monday-Sunday",
"severity": "none"
}
],
"corrected_output": "Our restaurant is open Monday-Saturday 9am-9pm, and Sunday 10am-8pm.",
"recommendations": [
"Use corrected output",
"Update knowledge base if hours changed"
]
}

Confidence Levels

LevelRangeInterpretation
high> 0.9Very confident in assessment
medium0.7-0.9Reasonably confident
low0.5-0.7Uncertain, review recommended
very_low< 0.5Cannot verify, human review needed

Incident Management

Track and respond to safety incidents.

Incident Types

TypeSeverityResponse
blocked_contentLowAuto-logged
injection_attemptMediumAlert + log
jailbreak_attemptHighAlert + rate limit
data_extractionHighAlert + block user
bias_incidentMediumReview queue
hallucination_criticalHighAuto-correction + alert

Create Incident

POST /api/v1/ai/safety/incidents
Authorization: Bearer {access_token}
Content-Type: application/json

Request:

{
"incident_type": "injection_attempt",
"severity": "high",
"agent_id": "maximus-voice",
"tenant_id": "tenant-123",
"user_id": "user-456",
"input_text": "Ignore all previous instructions...",
"detection_details": {
"patterns_matched": ["ignore previous instructions"],
"confidence": 0.95
}
}

List Incidents

GET /api/v1/ai/safety/incidents?
start_date=2026-01-01&
severity=high&
status=open
Authorization: Bearer {access_token}

Response:

{
"incidents": [
{
"id": "incident-001",
"incident_type": "injection_attempt",
"severity": "high",
"status": "open",
"agent_id": "maximus-voice",
"tenant_id": "tenant-123",
"created_at": "2026-01-19T10:30:00Z",
"summary": "Prompt injection attempt detected"
}
],
"total": 1,
"pagination": {
"page": 1,
"per_page": 20
}
}

Safety Policies

Configure safety behavior per tenant or agent.

Policy Configuration

{
"tenant_id": "tenant-123",
"policy": {
"content_moderation": {
"enabled": true,
"strictness": "standard",
"categories": {
"hate": {"action": "block", "threshold": 0.7},
"violence": {"action": "block", "threshold": 0.8},
"sexual": {"action": "block", "threshold": 0.6}
}
},
"prompt_guard": {
"enabled": true,
"block_injections": true,
"block_jailbreaks": true,
"log_attempts": true
},
"bias_monitoring": {
"enabled": true,
"categories": ["gender", "race_ethnicity", "age"],
"action_threshold": "medium"
},
"hallucination_detection": {
"enabled": true,
"auto_correct": true,
"require_sources": false
}
}
}

Strictness Levels

LevelDescriptionUse Case
relaxedFewer restrictionsInternal tools
standardBalanced approachGeneral use
strictMaximum safetyCustomer-facing
customPer-category settingsSpecialized needs

Integration

Middleware Integration

from app.services.ai_safety import SafetyService

safety_service = SafetyService()

async def process_ai_request(input_text: str) -> str:
# Pre-flight safety check
input_check = await safety_service.check_input(input_text)
if input_check.should_block:
raise SafetyBlockedException(input_check.reason)

# Process with AI
output = await ai_model.generate(input_text)

# Post-flight safety check
output_check = await safety_service.check_output(
input_text=input_text,
output_text=output
)

if output_check.should_block:
# Return safe fallback
return "I'm sorry, I cannot provide that information."

if output_check.sanitized_output:
return output_check.sanitized_output

return output

Event Hooks

# Subscribe to safety events
@safety_service.on_incident
async def handle_incident(incident: SafetyIncident):
if incident.severity == "high":
await alert_security_team(incident)
await rate_limit_user(incident.user_id)

@safety_service.on_block
async def handle_block(event: BlockEvent):
await log_blocked_content(event)
await increment_user_warnings(event.user_id)

Metrics & Monitoring

Available Metrics

MetricDescription
safety_checks_totalTotal safety checks performed
safety_blocks_totalContent blocked by category
injection_attempts_totalPrompt injection attempts
bias_incidents_totalBias incidents by severity
hallucinations_detectedHallucination detections
safety_check_latency_msCheck processing time

Dashboard Alerts

# Recommended alert configuration
alerts:
- name: HighSeverityIncident
condition: safety_incidents{severity="high"} > 0
for: 1m
severity: critical

- name: InjectionSpike
condition: rate(injection_attempts_total[5m]) > 10
for: 5m
severity: warning

- name: SafetyLatencyHigh
condition: safety_check_latency_ms > 100
for: 5m
severity: warning

Best Practices

Implementation

  1. Check both input and output: Dual-layer protection
  2. Use appropriate strictness: Match use case risk level
  3. Log all incidents: Even low severity for trend analysis
  4. Review flagged content: Regular human review queue
  5. Update patterns: Keep detection patterns current

Response Handling

  1. Graceful degradation: Safe fallbacks, not errors
  2. User communication: Clear, non-accusatory messages
  3. Rate limiting: Prevent abuse from persistent actors
  4. Escalation paths: Clear procedures for serious incidents

Compliance

  1. Audit trails: Complete logging of safety decisions
  2. Policy documentation: Document moderation policies
  3. Regular review: Periodic review of safety metrics
  4. User appeals: Process for false positive review