Alerting & On-Call Platform
Custom, edge-first alerting and on-call management platform with AIOps capabilities for intelligent incident response.
Overview
The Olympus Cloud Alerting Platform is a fully custom, edge-first alerting and on-call management system that replaces external services such as PagerDuty. It runs primarily on the Cloudflare edge for GCP independence, with a GCP Cloud Run hot standby for Cloudflare independence, so alerting stays available even during a major outage at either provider.
Cost Savings
| Metric | Before | After | Savings |
|---|---|---|---|
| Annual Cost | $6,000-12,000 | $0 | 100% |
| MTTA | 5-10 min | Under 1 min (AI auto-ack) | 80%+ |
| False Positive Rate | 15-20% | Under 5% (AI filtering) | 70%+ |
| L1 Auto-Resolved | 0% | 40%+ | New capability |
Architecture
┌─────────────────────────────────────────────────────────────────┐
│ Alert Sources │
│ Cloud Monitoring │ Prometheus │ Sentry │ Custom │ Edge Metrics │
└─────────────────────────────────┬───────────────────────────────┘
│
┌─────────────────┴─────────────────┐
▼ ▼
┌───────────────────────────┐ ┌───────────────────────────┐
│ CLOUDFLARE EDGE │ │ GCP CLOUD RUN │
│ (Primary - Always On) │◄───►│ (Failover - Hot Standby)│
├───────────────────────────┤sync ├───────────────────────────┤
│ Alert Ingestion Worker │ │ Alert Ingestion Service │
│ AlertStateDO (state) │ │ Spanner (persistent) │
│ ScheduleDO (on-call) │ │ Schedule Service │
│ EscalationDO (policies) │ │ Escalation Service │
│ NotificationQueue (DO) │ │ Pub/Sub (notifications) │
│ AIOps Worker (edge AI) │ │ AIOps Service (Vertex) │
└───────────────┬───────────┘ └───────────────┬───────────┘
│ │
└─────────────────┬─────────────────┘
▼
┌─────────────────────────────────────────────────────────────────┐
│ AIOps Engine │
│ Anomaly Detection │ Predictive Alerting │ Alert Correlation │
└─────────────────────────────────┬───────────────────────────────┘
│
┌─────────────────┴─────────────────┐
▼ ▼
┌───────────────────────────┐ ┌───────────────────────────┐
│ Notification Hub │ │ AI Support Agent │
│ (Multi-Channel) │ │ Integration │
├───────────────────────────┤ ├───────────────────────────┤
│ Twilio SMS/Voice │ │ L1 Auto-Response │
│ SendGrid Email │ │ Runbook Execution │
│ Slack/Teams Webhooks │ │ RAG Knowledge Base │
│ Push Notifications │ │ Human Escalation │
│ Cockpit Real-time │ │ Postmortem Generation │
└───────────────────────────┘ └───────────────────────────┘
Alert Severity Levels
| Severity | Definition | Response Time | Notification |
|---|---|---|---|
| P1-Critical | Service outage, data loss risk | Under 5 min | Voice + SMS + All |
| P2-High | Major feature degraded | Under 15 min | SMS + Slack |
| P3-Medium | Minor feature impacted | Under 1 hour | Slack + Email |
| P4-Low | Performance degraded | Under 4 hours | |
| P5-Info | Informational only | Best effort | Log only |
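As a sketch, the routing in the severity table can be expressed as a severity-to-channel map. Channel names here are illustrative, and the P4 entry is an assumption, since the table leaves that cell blank:

```python
# Severity-to-channel routing, mirroring the table above. Channel names are
# illustrative; the P4 entry is an assumption (the table leaves it blank).
SEVERITY_CHANNELS = {
    "P1": ["voice", "sms", "slack", "email", "push"],  # "All" channels
    "P2": ["sms", "slack"],
    "P3": ["slack", "email"],
    "P4": ["email"],   # assumed: the table cell is empty
    "P5": [],          # log only
}

def channels_for(severity: str) -> list[str]:
    """Channels to notify for a severity; an empty list means log only."""
    return SEVERITY_CHANNELS.get(severity, ["slack"])
```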
Alert Sources
Supported Integrations
| Source | Method | Alert Types |
|---|---|---|
| Cloud Monitoring | Webhook | GCP metrics, uptime |
| Prometheus | AlertManager webhook | Custom metrics |
| Sentry | Webhook | Application errors |
| Custom | REST API | Any source |
| Edge Metrics | Workers | Edge performance |
Alert Ingestion API
POST /api/v1/alerts/ingest
Content-Type: application/json
{
  "source": "prometheus",
  "severity": "P2",
  "title": "High Memory Usage on auth-service",
  "description": "Memory usage exceeded 90% threshold",
  "service": "auth-service",
  "environment": "production",
  "labels": {
    "pod": "auth-service-abc123",
    "region": "us-central1"
  },
  "runbook_url": "https://docs.olympuscloud.ai/runbooks/memory"
}
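A minimal client for this endpoint might look like the following sketch. The endpoint URL is a placeholder, and the request is built with the standard library rather than any assumed SDK:

```python
import json
import urllib.request

# Placeholder endpoint; substitute your deployment's ingest URL.
INGEST_URL = "https://alerts.example.com/api/v1/alerts/ingest"

def build_alert(source, severity, title, service, environment="production", **extra):
    """Assemble a payload matching the ingestion schema above."""
    payload = {
        "source": source,
        "severity": severity,
        "title": title,
        "service": service,
        "environment": environment,
    }
    payload.update(extra)  # labels, description, runbook_url, ...
    return payload

def ingest_request(payload):
    """Build the POST request; pass it to urllib.request.urlopen() to send."""
    return urllib.request.Request(
        INGEST_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```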
On-Call Management
Schedule Types
| Type | Description | Use Case |
|---|---|---|
| Weekly Rotation | 7-day shifts | Primary coverage |
| Daily Rotation | 24-hour shifts | High-frequency |
| Follow-the-Sun | Timezone-based | Global teams |
| Custom | User-defined | Special needs |
Schedule Configuration
┌─────────────────────────────────────────────────────────────────┐
│ ON-CALL SCHEDULE: Platform Team Week of Jan 20 │
├─────────────────────────────────────────────────────────────────┤
│ │
│ PRIMARY │
│ ─────────────────────────────────────────────────────────── │
│ Mon-Sun: Alice Chen (alice@nebusai.com) │
│ Phone: (555) 123-4567 │
│ │
│ SECONDARY │
│ ─────────────────────────────────────────────────────────── │
│ Mon-Sun: Bob Kim (bob@nebusai.com) │
│ Phone: (555) 234-5678 │
│ │
│ OVERRIDES │
│ ─────────────────────────────────────────────────────────── │
│ Wed 6PM - Thu 6AM: Charlie (covering for Alice) │
│ │
│ NEXT WEEK │
│ ─────────────────────────────────────────────────────────── │
│ Primary: Bob Kim │
│ Secondary: Dana Lopez │
│ │
│ [Edit Schedule] [Add Override] [Swap Shift] │
│ │
└─────────────────────────────────────────────────────────────────┘
Override Management
| Action | How |
|---|---|
| PTO Coverage | Create override for date range |
| Shift Swap | Request swap, other confirms |
| Emergency | Manager override |
| Holiday | Calendar-based auto-override |
Escalation Policies
Multi-Tier Escalation
┌─────────────────────────────────────────────────────────────────┐
│ ESCALATION POLICY: Platform Critical │
├─────────────────────────────────────────────────────────────────┤
│ │
│ LEVEL 1 (0 min) │
│ ─────────────────────────────────────────────────────────── │
│ Target: Platform Team On-Call │
│ Notify: SMS + Voice │
│ Timeout: 5 minutes │
│ │
│ │ If not acknowledged in 5 min │
│ ▼ │
│ │
│ LEVEL 2 (5 min) │
│ ─────────────────────────────────────────────────────────── │
│ Target: Platform Team Secondary + AI Support Agent │
│ Notify: SMS + Voice + Slack │
│ Timeout: 5 minutes │
│ │
│ │ If not acknowledged in 5 min │
│ ▼ │
│ │
│ LEVEL 3 (10 min) │
│ ─────────────────────────────────────────────────────────── │
│ Target: Engineering Manager │
│ Notify: SMS + Voice + Email │
│ Timeout: 10 minutes │
│ │
│ │ If not acknowledged in 10 min │
│ ▼ │
│ │
│ LEVEL 4 (20 min) │
│ ─────────────────────────────────────────────────────────── │
│ Target: VP Engineering + CTO │
│ Notify: All channels │
│ │
└─────────────────────────────────────────────────────────────────┘
Escalation Settings
| Setting | Description |
|---|---|
| Timeout | Time before escalating to next level |
| Skip if Ack'd | Don't escalate if acknowledged |
| Parallel Mode | Notify all at level simultaneously |
| Sequential Mode | Try each person in order |
| Round-Robin | Rotate who gets notified first |
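The timing of the "Platform Critical" policy above can be sketched as a cumulative-timeout calculation: the sum of the per-level timeouts decides which level is paging at a given elapsed time, and acknowledgement stops escalation ("Skip if Ack'd"):

```python
# Per-level timeouts (minutes) from the policy above: L1->L2 after 5,
# L2->L3 after 5 more, L3->L4 after 10 more.
LEVEL_TIMEOUTS_MIN = [5, 5, 10]

def escalation_level(elapsed_min, acknowledged=False):
    """1-based level firing at `elapsed_min`; 0 once acknowledged."""
    if acknowledged:
        return 0  # "Skip if Ack'd": stop escalating
    boundary = 0
    for level, timeout in enumerate(LEVEL_TIMEOUTS_MIN, start=1):
        boundary += timeout
        if elapsed_min < boundary:
            return level
    return len(LEVEL_TIMEOUTS_MIN) + 1  # final level: VP Engineering + CTO
```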
Notification Hub
Supported Channels
| Channel | Provider | Use Case | Cost |
|---|---|---|---|
| SMS | Twilio | P1/P2 primary | $0.0079/msg |
| Voice | Twilio | P1 escalation | $0.013/min |
| Email | SendGrid | All alerts | Free (100/day) |
| Slack | Webhook | Team alerts | Free |
| Teams | Webhook | Enterprise | Free |
| Push | FCM/APNs | Mobile app | Free |
| WebSocket | Internal | Cockpit | Free |
Notification Templates
SMS Template (P1):
───────────────────────────────────────────────────────────────
[P1-CRITICAL] {service}: {title}
Ack: Reply ACK
Details: {short_url}
───────────────────────────────────────────────────────────────
Voice Script (TTS):
───────────────────────────────────────────────────────────────
"Priority one alert for {service}. {title}.
Press 1 to acknowledge. Press 2 to escalate.
Press 3 to hear more details."
───────────────────────────────────────────────────────────────
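Rendering the SMS template is a straightforward string fill. This sketch assumes the severity label arrives pre-formatted (e.g. "P1-CRITICAL") and truncates to 160 characters to keep the page in a single SMS segment; that cap is an assumption, not a platform guarantee:

```python
# SMS template from above; truncation to one 160-char segment is an assumption.
SMS_TEMPLATE = "[{severity}] {service}: {title}\nAck: Reply ACK\nDetails: {short_url}"

def render_sms(alert):
    """Fill the template and truncate to a single SMS segment."""
    msg = SMS_TEMPLATE.format(**alert)
    return msg if len(msg) <= 160 else msg[:157] + "..."
```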
Delivery Guarantees
| SLA | Target |
|---|---|
| SMS Delivery | Under 5 seconds |
| Voice Call | Under 10 seconds |
| Email Delivery | Under 30 seconds |
| Retry Attempts | 5 with backoff |
| Success Rate | 99.9% |
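The "5 retries with backoff" policy can be sketched as a generic wrapper around any delivery callable; the exponential schedule and base delay here are illustrative choices, not documented platform values:

```python
import time

def deliver_with_retry(send, attempts=5, base_delay=0.5):
    """Call `send()` up to `attempts` times with exponential backoff.

    Returns True on the first success, False once all attempts fail.
    """
    for i in range(attempts):
        try:
            send()
            return True
        except Exception:
            if i < attempts - 1:
                time.sleep(base_delay * 2 ** i)  # 0.5s, 1s, 2s, 4s, ...
    return False
```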
AIOps Engine
Anomaly Detection
ML-powered baseline learning and anomaly detection:
| Method | Description | Use Case |
|---|---|---|
| Z-Score | Statistical deviation | Simple metrics |
| Isolation Forest | Multivariate anomalies | Complex patterns |
| Prophet | Seasonal patterns | Time-series |
| IQR | Interquartile range | Outlier detection |
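The simplest of these methods, z-score detection, fits in a few lines: flag any point that deviates from the series baseline by more than a chosen number of standard deviations. The threshold of 3 here is a common convention, not a platform default:

```python
import statistics

def zscore_anomalies(series, threshold=3.0):
    """Return points whose deviation from the series mean exceeds
    `threshold` population standard deviations."""
    mean = statistics.fmean(series)
    stdev = statistics.pstdev(series)
    if stdev == 0:
        return []  # constant series: nothing deviates
    return [x for x in series if abs(x - mean) / stdev > threshold]
```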
Predictive Alerting
Predict issues before they occur:
| Feature | Example |
|---|---|
| Capacity | "Disk will reach 90% in 4 hours" |
| SLO Burn | "Error budget will exhaust in 2 days" |
| Traffic | "Unusual traffic spike predicted" |
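A capacity prediction like "disk will reach 90% in 4 hours" can be approximated by a linear fit over recent samples. This is a deliberately simple stand-in for whatever model the engine actually uses:

```python
def hours_until(samples, limit):
    """Least-squares linear fit over (hour, value) samples; projected hours
    from the last sample until `limit` is crossed, or None if flat/declining."""
    n = len(samples)
    xbar = sum(x for x, _ in samples) / n
    ybar = sum(y for _, y in samples) / n
    den = sum((x - xbar) ** 2 for x, _ in samples)
    slope = sum((x - xbar) * (y - ybar) for x, y in samples) / den
    if slope <= 0:
        return None  # usage is flat or shrinking: no crossing predicted
    intercept = ybar - slope * xbar
    last_x = samples[-1][0]
    return (limit - (slope * last_x + intercept)) / slope
```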
Alert Correlation
Reduce noise by grouping related alerts:
┌─────────────────────────────────────────────────────────────────┐
│ CORRELATED INCIDENT #4521 │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ROOT CAUSE (AI Identified) │
│ ─────────────────────────────────────────────────────────── │
│ Database connection pool exhaustion on orders-db │
│ Confidence: 92% │
│ │
│ RELATED ALERTS (4 grouped) │
│ ─────────────────────────────────────────────────────────── │
│ ✓ P1: Order Service - High Latency (symptom) │
│ ✓ P2: Payment Service - Timeout Errors (symptom) │
│ ✓ P2: API Gateway - 504 Errors (symptom) │
│ ⭐ P1: Orders-DB - Connection Pool Exhausted (root cause) │
│ │
│ AI RECOMMENDATION │
│ ─────────────────────────────────────────────────────────── │
│ Execute runbook: db-connection-pool-increase │
│ Confidence: 87% │
│ │
│ [Execute Runbook] [Acknowledge All] [View Details] │
│ │
└─────────────────────────────────────────────────────────────────┘
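A heavily simplified stand-in for the correlator is greedy time-window grouping: alerts that start close together join one incident. The real engine also uses service topology and ML similarity, which this sketch omits:

```python
def correlate(alerts, window_sec=300):
    """Group alerts started within `window_sec` of each group's first alert."""
    groups = []
    for alert in sorted(alerts, key=lambda a: a["started_at"]):
        if groups and alert["started_at"] - groups[-1][0]["started_at"] <= window_sec:
            groups[-1].append(alert)   # joins the open incident
        else:
            groups.append([alert])     # starts a new incident
    return groups
```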
AI Support Agent Integration
L1 Automated Response
The AI Support Agent automatically handles L1 incidents:
1. Receive Alert - agent receives the alert context
2. Run Diagnostics - execute diagnostic tools
3. Search Knowledge Base - query runbooks via RAG
4. Attempt Remediation - execute safe actions
5. Escalate if Needed - pass to a human if unresolved
Auto-Remediation Actions
| Action | Risk Level | Approval |
|---|---|---|
| Restart Service | Low | Auto |
| Clear Cache | Low | Auto |
| Scale Up | Medium | Auto |
| Rotate Logs | Low | Auto |
| Scale Down | Medium | Required |
| Failover | High | Required |
| Database Query | High | Required |
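The approval column above amounts to a gate in front of the agent's action executor. This sketch mirrors the table; the action identifiers are illustrative:

```python
# Actions the agent may run without a human, per the table above.
# Identifiers are illustrative, not the platform's actual action names.
AUTO_APPROVED = {"restart_service", "clear_cache", "scale_up", "rotate_logs"}

def requires_approval(action):
    """True when a human must approve before the agent may execute `action`."""
    return action not in AUTO_APPROVED
```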
Hey Maximus Commands
Voice commands in Cockpit:
- "Hey Maximus, what's the current P1 status?"
- "Hey Maximus, who is on-call for platform?"
- "Hey Maximus, acknowledge the auth service alert"
- "Hey Maximus, show me similar incidents"
- "Hey Maximus, execute the memory runbook"
- "Hey Maximus, snooze this alert for 30 minutes"
Cockpit Dashboard
Alert Overview
┌─────────────────────────────────────────────────────────────────┐
│ ALERTING DASHBOARD Live │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ACTIVE ALERTS │
│ ─────────────────────────────────────────────────────────── │
│ 🔴 P1: 1 🟠 P2: 3 🟡 P3: 8 🔵 P4: 12 │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ 🔴 P1 │ orders-db Connection Pool Exhausted │ │
│ │ │ Started: 5 min ago │ Status: AI Investigating │ │
│ │ │ On-Call: Alice Chen │ [Ack] [Escalate] [Snooze] │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ 🟠 P2 │ auth-service High Memory │ │
│ │ │ Started: 12 min ago │ Status: Acknowledged │ │
│ │ │ Ack'd by: Bob Kim │ [Resolve] [Escalate] │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ TODAY'S METRICS │
│ ─────────────────────────────────────────────────────────── │
│ Alerts Fired: 45 AI Auto-Resolved: 18 (40%) │
│ MTTA: 47 sec MTTR: 8 min 24 sec │
│ False Positives: 2 Noise Reduction: 68% │
│ │
└─────────────────────────────────────────────────────────────────┘
On-Call Widget
| Info | Display |
|---|---|
| Current Primary | Name, phone, avatar |
| Current Secondary | Name, phone, avatar |
| Next Rotation | Date and who's next |
| Your Schedule | Your upcoming shifts |
Maintenance Windows
Creating Windows
Suppress alerts during planned maintenance:
1. Go to Alerting > Maintenance
2. Click Schedule Maintenance
3. Select the services to suppress
4. Set the start/end time
5. Add a description for the audit trail
Window Types
| Type | Behavior |
|---|---|
| Full Suppress | No alerts at all |
| Reduce Severity | P1→P3, P2→P4 |
| Notify Only | Alert but don't page |
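The three window behaviors above can be sketched as a single transform applied to incoming alerts. The alert and window shapes (timestamps, mode names) are assumptions for illustration:

```python
# Window handling per the table above. Alert/window field names are assumed.
REDUCED = {"P1": "P3", "P2": "P4"}  # severity demotion for Reduce Severity mode

def apply_window(alert, window):
    """Return the possibly-modified alert, or None if fully suppressed."""
    if not (window["start"] <= alert["ts"] < window["end"]):
        return alert  # outside the window: untouched
    mode = window["mode"]
    if mode == "full_suppress":
        return None
    if mode == "reduce_severity":
        sev = REDUCED.get(alert["severity"], alert["severity"])
        return {**alert, "severity": sev}
    if mode == "notify_only":
        return {**alert, "page": False}  # alert fires but does not page
    return alert
```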
API Reference
List Active Alerts
GET /api/v1/alerts?status=active
# Response
{
  "alerts": [
    {
      "id": "alert-123",
      "severity": "P1",
      "title": "Database Connection Pool Exhausted",
      "service": "orders-db",
      "status": "firing",
      "started_at": "2026-01-18T14:30:00Z",
      "acknowledged_by": null,
      "escalation_level": 1
    }
  ]
}
Acknowledge Alert
POST /api/v1/alerts/{id}/acknowledge
{
  "user_id": "user-123",
  "note": "Investigating now"
}
Get On-Call Schedule
GET /api/v1/oncall/schedule/{team}
# Response
{
  "team": "platform",
  "current": {
    "primary": {"name": "Alice Chen", "email": "alice@nebusai.com"},
    "secondary": {"name": "Bob Kim", "email": "bob@nebusai.com"}
  },
  "next_rotation": "2026-01-27T00:00:00Z"
}
Best Practices
- Set appropriate severities - Not everything is P1
- Link runbooks - Every alert should have a runbook
- Tune thresholds - Reduce false positives
- Review on-call load - Spread alerts evenly
- Use maintenance windows - Planned work shouldn't page
- Trust the AI - Let it handle L1 issues
Related Documentation
- Incident Response Runbook - How to respond
- Cockpit Operations - Cockpit dashboard
- ACP AI Router - AI infrastructure