
Alerting & On-Call Platform

Custom, edge-first alerting and on-call management platform with AIOps capabilities for intelligent incident response.

Overview

The Olympus Cloud Alerting Platform is a fully custom, edge-first alerting and on-call management system that replaces external services such as PagerDuty. It runs primarily on Cloudflare's edge for independence from GCP, with a hot-standby fallback on GCP Cloud Run for independence from Cloudflare, so neither provider is a single point of failure and the platform is designed to stay available even during a major outage at either one.

Cost Savings

| Metric | Before | After | Savings |
|---|---|---|---|
| Annual Cost | $6,000-12,000 | $0 | 100% |
| MTTA | 5-10 min | Under 1 min (AI auto-ack) | 80%+ |
| False Positive Rate | 15-20% | Under 5% (AI filtering) | 70%+ |
| L1 Auto-Resolved | 0% | 40%+ | New capability |

Architecture

┌─────────────────────────────────────────────────────────────────┐
│ Alert Sources │
│ Cloud Monitoring │ Prometheus │ Sentry │ Custom │ Edge Metrics │
└─────────────────────────────────┬───────────────────────────────┘

┌─────────────────┴─────────────────┐
▼ ▼
┌───────────────────────────┐ ┌───────────────────────────┐
│ CLOUDFLARE EDGE │ │ GCP CLOUD RUN │
│ (Primary - Always On) │◄───►│ (Failover - Hot Standby)│
├───────────────────────────┤sync ├───────────────────────────┤
│ Alert Ingestion Worker │ │ Alert Ingestion Service │
│ AlertStateDO (state) │ │ Spanner (persistent) │
│ ScheduleDO (on-call) │ │ Schedule Service │
│ EscalationDO (policies) │ │ Escalation Service │
│ NotificationQueue (DO) │ │ Pub/Sub (notifications) │
│ AIOps Worker (edge AI) │ │ AIOps Service (Vertex) │
└───────────────┬───────────┘ └───────────────┬───────────┘
│ │
└─────────────────┬─────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│ AIOps Engine │
│ Anomaly Detection │ Predictive Alerting │ Alert Correlation │
└─────────────────────────────────┬───────────────────────────────┘

┌─────────────────┴─────────────────┐
▼ ▼
┌───────────────────────────┐ ┌───────────────────────────┐
│ Notification Hub │ │ AI Support Agent │
│ (Multi-Channel) │ │ Integration │
├───────────────────────────┤ ├───────────────────────────┤
│ Twilio SMS/Voice │ │ L1 Auto-Response │
│ SendGrid Email │ │ Runbook Execution │
│ Slack/Teams Webhooks │ │ RAG Knowledge Base │
│ Push Notifications │ │ Human Escalation │
│ Cockpit Real-time │ │ Postmortem Generation │
└───────────────────────────┘ └───────────────────────────┘

Alert Severity Levels

| Severity | Definition | Response Time | Notification |
|---|---|---|---|
| P1-Critical | Service outage, data loss risk | Under 5 min | Voice + SMS + All |
| P2-High | Major feature degraded | Under 15 min | SMS + Slack |
| P3-Medium | Minor feature impacted | Under 1 hour | Slack + Email |
| P4-Low | Performance degraded | Under 4 hours | Email |
| P5-Info | Informational only | Best effort | Log only |

Alert Sources

Supported Integrations

| Source | Method | Alert Types |
|---|---|---|
| Cloud Monitoring | Webhook | GCP metrics, uptime |
| Prometheus | AlertManager webhook | Custom metrics |
| Sentry | Webhook | Application errors |
| Custom | REST API | Any source |
| Edge Metrics | Workers | Edge performance |

Alert Ingestion API

POST /api/v1/alerts/ingest
Content-Type: application/json

{
  "source": "prometheus",
  "severity": "P2",
  "title": "High Memory Usage on auth-service",
  "description": "Memory usage exceeded 90% threshold",
  "service": "auth-service",
  "environment": "production",
  "labels": {
    "pod": "auth-service-abc123",
    "region": "us-central1"
  },
  "runbook_url": "https://docs.olympuscloud.ai/runbooks/memory"
}
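
Posting an alert amounts to serializing a payload like the one above. The sketch below validates and builds that payload in Python; the endpoint host, the helper names, and the validation rules are illustrative assumptions, not the platform's actual client library.

```python
import json
from urllib import request

VALID_SEVERITIES = {"P1", "P2", "P3", "P4", "P5"}
INGEST_URL = "https://example.com/api/v1/alerts/ingest"  # placeholder host

def build_alert(source, severity, title, service, environment,
                description="", labels=None, runbook_url=None):
    """Validate and serialize a payload for POST /api/v1/alerts/ingest."""
    if severity not in VALID_SEVERITIES:
        raise ValueError(f"severity must be one of {sorted(VALID_SEVERITIES)}")
    payload = {
        "source": source,
        "severity": severity,
        "title": title,
        "description": description,
        "service": service,
        "environment": environment,
        "labels": labels or {},
    }
    if runbook_url:
        payload["runbook_url"] = runbook_url
    return json.dumps(payload).encode("utf-8")

def send_alert(body):
    """POST the serialized alert (network call, shown for completeness)."""
    req = request.Request(INGEST_URL, data=body,
                         headers={"Content-Type": "application/json"})
    return request.urlopen(req)
```

Rejecting unknown severities at the client keeps malformed alerts out of the ingestion path entirely.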

On-Call Management

Schedule Types

| Type | Description | Use Case |
|---|---|---|
| Weekly Rotation | 7-day shifts | Primary coverage |
| Daily Rotation | 24-hour shifts | High-frequency |
| Follow-the-Sun | Timezone-based | Global teams |
| Custom | User-defined | Special needs |
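
Fixed-length rotations reduce to modular arithmetic over days elapsed. A minimal sketch, assuming a flat roster list and a known rotation start date (the names and dates below mirror the sample schedule, not real data):

```python
from datetime import date

def on_call_for(roster, rotation_start, day, shift_days=7):
    """Return who is on call on `day` for a fixed-length rotation.

    roster: names in rotation order
    rotation_start: date the first person's first shift began
    shift_days: 7 for weekly rotation, 1 for daily
    """
    elapsed = (day - rotation_start).days
    if elapsed < 0:
        raise ValueError("day precedes rotation start")
    index = (elapsed // shift_days) % len(roster)
    return roster[index]
```

With `roster = ["Alice Chen", "Bob Kim"]` starting Monday Jan 19, 2026, any day that week resolves to Alice and the following week to Bob, matching the schedule shown below. Overrides and follow-the-sun schedules would layer on top of this base calculation.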

Schedule Configuration

┌─────────────────────────────────────────────────────────────────┐
│ ON-CALL SCHEDULE: Platform Team Week of Jan 20 │
├─────────────────────────────────────────────────────────────────┤
│ │
│ PRIMARY │
│ ─────────────────────────────────────────────────────────── │
│ Mon-Sun: Alice Chen (alice@nebusai.com) │
│ Phone: (555) 123-4567 │
│ │
│ SECONDARY │
│ ─────────────────────────────────────────────────────────── │
│ Mon-Sun: Bob Kim (bob@nebusai.com) │
│ Phone: (555) 234-5678 │
│ │
│ OVERRIDES │
│ ─────────────────────────────────────────────────────────── │
│ Wed 6PM - Thu 6AM: Charlie (covering for Alice) │
│ │
│ NEXT WEEK │
│ ─────────────────────────────────────────────────────────── │
│ Primary: Bob Kim │
│ Secondary: Dana Lopez │
│ │
│ [Edit Schedule] [Add Override] [Swap Shift] │
│ │
└─────────────────────────────────────────────────────────────────┘

Override Management

| Action | How |
|---|---|
| PTO Coverage | Create override for date range |
| Shift Swap | Request swap, other confirms |
| Emergency | Manager override |
| Holiday | Calendar-based auto-override |

Escalation Policies

Multi-Tier Escalation

┌─────────────────────────────────────────────────────────────────┐
│ ESCALATION POLICY: Platform Critical │
├─────────────────────────────────────────────────────────────────┤
│ │
│ LEVEL 1 (0 min) │
│ ─────────────────────────────────────────────────────────── │
│ Target: Platform Team On-Call │
│ Notify: SMS + Voice │
│ Timeout: 5 minutes │
│ │
│ │ If not acknowledged in 5 min │
│ ▼ │
│ │
│ LEVEL 2 (5 min) │
│ ─────────────────────────────────────────────────────────── │
│ Target: Platform Team Secondary + AI Support Agent │
│ Notify: SMS + Voice + Slack │
│ Timeout: 5 minutes │
│ │
│ │ If not acknowledged in 5 min │
│ ▼ │
│ │
│ LEVEL 3 (10 min) │
│ ─────────────────────────────────────────────────────────── │
│ Target: Engineering Manager │
│ Notify: SMS + Voice + Email │
│ Timeout: 10 minutes │
│ │
│ │ If not acknowledged in 10 min │
│ ▼ │
│ │
│ LEVEL 4 (20 min) │
│ ─────────────────────────────────────────────────────────── │
│ Target: VP Engineering + CTO │
│ Notify: All channels │
│ │
└─────────────────────────────────────────────────────────────────┘

Escalation Settings

| Setting | Description |
|---|---|
| Timeout | Time before escalating to next level |
| Skip if Ack'd | Don't escalate if acknowledged |
| Parallel Mode | Notify all at level simultaneously |
| Sequential Mode | Try each person in order |
| Round-Robin | Rotate who gets notified first |
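
The "Skip if Ack'd" behavior can be sketched as a filter over level start offsets: a level fires only if its offset is reached before the acknowledgement arrives. The policy data below is transcribed from the "Platform Critical" diagram; the function itself is an illustrative model, not the platform's scheduler.

```python
# start offsets in minutes, per the "Platform Critical" policy above
POLICY = [(0, "Platform Team On-Call"),
          (5, "Platform Team Secondary + AI Support Agent"),
          (10, "Engineering Manager"),
          (20, "VP Engineering + CTO")]

def levels_notified(policy, ack_at_min=None):
    """Return the targets paged before the alert was acknowledged.

    ack_at_min: minutes until acknowledgement, or None if never ack'd.
    """
    if ack_at_min is None:
        return [target for _, target in policy]
    return [target for start, target in policy if start < ack_at_min]
```

An ack inside 5 minutes pages only Level 1; no ack at all walks the chain up to the VP Engineering + CTO level.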

Notification Hub

Supported Channels

| Channel | Provider | Use Case | Cost |
|---|---|---|---|
| SMS | Twilio | P1/P2 primary | $0.0079/msg |
| Voice | Twilio | P1 escalation | $0.013/min |
| Email | SendGrid | All alerts | Free (100/day) |
| Slack | Webhook | Team alerts | Free |
| Teams | Webhook | Enterprise | Free |
| Push | FCM/APNs | Mobile app | Free |
| WebSocket | Internal | Cockpit | Free |

Notification Templates

SMS Template (P1):
───────────────────────────────────────────────────────────────
[P1-CRITICAL] {service}: {title}
Ack: Reply ACK
Details: {short_url}
───────────────────────────────────────────────────────────────

Voice Script (TTS):
───────────────────────────────────────────────────────────────
"Priority one alert for {service}. {title}.
Press 1 to acknowledge. Press 2 to escalate.
Press 3 to hear more details."
───────────────────────────────────────────────────────────────
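
Rendering the SMS template is plain string interpolation; the only wrinkle is keeping the message inside a single 160-character segment. The truncation policy below (trim the title, keep the ack instruction and link intact) is an assumption for illustration:

```python
SMS_P1_TEMPLATE = "[P1-CRITICAL] {service}: {title}\nAck: Reply ACK\nDetails: {short_url}"

def render_sms(service, title, short_url, max_len=160):
    """Fill the P1 SMS template, trimming the title to fit one SMS segment."""
    msg = SMS_P1_TEMPLATE.format(service=service, title=title, short_url=short_url)
    if len(msg) > max_len:
        overflow = len(msg) - max_len
        # drop overflow+3 title chars, add a 3-char ellipsis: net -overflow
        title = title[:len(title) - overflow - 3] + "..."
        msg = SMS_P1_TEMPLATE.format(service=service, title=title, short_url=short_url)
    return msg
```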

Delivery Guarantees

| SLA | Target |
|---|---|
| SMS Delivery | Under 5 seconds |
| Voice Call | Under 10 seconds |
| Email Delivery | Under 30 seconds |
| Retry Attempts | 5 with backoff |
| Success Rate | 99.9% |
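
"5 with backoff" typically means capped exponential delays between attempts. A minimal sketch; the base delay, growth factor, and cap are illustrative constants, not the platform's actual tuning:

```python
def backoff_schedule(attempts=5, base=1.0, factor=2.0, cap=30.0):
    """Delays in seconds before each retry: exponential backoff with a cap."""
    return [min(cap, base * factor ** i) for i in range(attempts)]
```

Production retry loops usually also add random jitter to each delay so that simultaneous failures don't retry in lockstep.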

AIOps Engine

Anomaly Detection

ML-powered baseline learning and anomaly detection:

| Method | Description | Use Case |
|---|---|---|
| Z-Score | Statistical deviation | Simple metrics |
| Isolation Forest | Multivariate anomalies | Complex patterns |
| Prophet | Seasonal patterns | Time-series |
| IQR | Interquartile range | Outlier detection |
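
The Z-Score method is the simplest of the four: learn a mean and standard deviation from a baseline window, then flag points that deviate too far. A stdlib-only sketch (the 3-sigma threshold is a conventional default, not a documented platform setting):

```python
from statistics import mean, stdev

def zscore_anomalies(baseline, points, threshold=3.0):
    """Flag points more than `threshold` standard deviations from the baseline mean."""
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        # constant baseline: any deviation at all is anomalous
        return [p != mu for p in points]
    return [abs(p - mu) / sigma > threshold for p in points]
```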

Predictive Alerting

Predict issues before they occur:

| Feature | Example |
|---|---|
| Capacity | "Disk will reach 90% in 4 hours" |
| SLO Burn | "Error budget will exhaust in 2 days" |
| Traffic | "Unusual traffic spike predicted" |
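
A capacity prediction like "Disk will reach 90% in 4 hours" can be sketched as a least-squares linear extrapolation over recent samples; the real engine may well use Prophet or similar models, so treat this as an illustration of the idea only.

```python
def hours_until_threshold(samples, threshold=90.0):
    """Fit a line to (hour, percent_used) samples; return hours from the
    last sample until the line crosses `threshold`, or None if usage is
    flat or falling."""
    n = len(samples)
    xs = [x for x, _ in samples]
    ys = [y for _, y in samples]
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in samples) \
        / sum((x - mx) ** 2 for x in xs)
    if slope <= 0:
        return None
    intercept = my - slope * mx
    crossing = (threshold - intercept) / slope
    return max(0.0, crossing - xs[-1])
```

Three samples climbing 5% per hour from 70% predict the 90% crossing two hours after the last sample.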

Alert Correlation

Reduce noise by grouping related alerts:

┌─────────────────────────────────────────────────────────────────┐
│ CORRELATED INCIDENT #4521 │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ROOT CAUSE (AI Identified) │
│ ─────────────────────────────────────────────────────────── │
│ Database connection pool exhaustion on orders-db │
│ Confidence: 92% │
│ │
│ RELATED ALERTS (4 grouped) │
│ ─────────────────────────────────────────────────────────── │
│ ✓ P1: Order Service - High Latency (symptom) │
│ ✓ P2: Payment Service - Timeout Errors (symptom) │
│ ✓ P2: API Gateway - 504 Errors (symptom) │
│ ⭐ P1: Orders-DB - Connection Pool Exhausted (root cause) │
│ │
│ AI RECOMMENDATION │
│ ─────────────────────────────────────────────────────────── │
│ Execute runbook: db-connection-pool-increase │
│ Confidence: 87% │
│ │
│ [Execute Runbook] [Acknowledge All] [View Details] │
│ │
└─────────────────────────────────────────────────────────────────┘
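
The grouping step behind an incident like the one above can be sketched as time-proximity clustering: alerts that start within a short window of the previous alert in a group get merged. This is deliberately simplified (timestamps as epoch seconds rather than the API's ISO strings, and no topology or ML signals, which the real engine also uses):

```python
def correlate(alerts, window_sec=300):
    """Group alerts into incidents when each starts within `window_sec`
    of the previous alert in its group. `started_at` is epoch seconds."""
    groups = []
    for alert in sorted(alerts, key=lambda a: a["started_at"]):
        if groups and alert["started_at"] - groups[-1][-1]["started_at"] <= window_sec:
            groups[-1].append(alert)
        else:
            groups.append([alert])
    return groups
```

Four alerts firing within two minutes collapse into one incident; an unrelated alert fifteen minutes later starts a new one.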

AI Support Agent Integration

L1 Automated Response

The AI Support Agent automatically handles L1 incidents:

  1. Receive Alert - Agent gets alert context
  2. Run Diagnostics - Execute diagnostic tools
  3. Search Knowledge Base - Query runbooks via RAG
  4. Attempt Remediation - Execute safe actions
  5. Escalate if Needed - Pass to human if unresolved
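
The five steps above can be sketched as a short pipeline with each step injected as a callable, so that in a real agent each one could be a tool call. Everything here (function names, return values) is an illustrative model of the flow, not the agent's actual interface:

```python
def handle_l1(alert, diagnostics, find_runbook, remediate, escalate):
    """Sketch of the L1 flow: diagnose, search runbooks, attempt safe
    remediation, and fall back to a human if unresolved."""
    findings = diagnostics(alert)              # 2. run diagnostic tools
    runbook = find_runbook(alert, findings)    # 3. RAG runbook search
    if runbook and remediate(alert, runbook):  # 4. attempt safe remediation
        return "resolved"
    return escalate(alert, findings)           # 5. human escalation
```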

Auto-Remediation Actions

| Action | Risk Level | Approval |
|---|---|---|
| Restart Service | Low | Auto |
| Clear Cache | Low | Auto |
| Scale Up | Medium | Auto |
| Rotate Logs | Low | Auto |
| Scale Down | Medium | Required |
| Failover | High | Required |
| Database Query | High | Required |
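
A safe way to encode the table above is a deny-by-default gate: high risk always needs approval, and only explicitly allow-listed (action, risk) pairs run automatically. A minimal sketch of that check:

```python
# allow-list transcribed from the Auto-Remediation Actions table
AUTO_APPROVED = {("Restart Service", "Low"), ("Clear Cache", "Low"),
                 ("Scale Up", "Medium"), ("Rotate Logs", "Low")}

def requires_approval(action, risk):
    """Deny by default: only allow-listed pairs skip human approval."""
    if risk == "High":
        return True
    return (action, risk) not in AUTO_APPROVED
```

An unlisted action, even at Low risk, falls through to "approval required", which is the right failure mode for a remediation system.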

Hey Maximus Commands

Voice commands in Cockpit:

  • "Hey Maximus, what's the current P1 status?"
  • "Hey Maximus, who is on-call for platform?"
  • "Hey Maximus, acknowledge the auth service alert"
  • "Hey Maximus, show me similar incidents"
  • "Hey Maximus, execute the memory runbook"
  • "Hey Maximus, snooze this alert for 30 minutes"
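
Commands like these can be routed to handlers by intent matching. The regex-based sketch below is purely illustrative; the production assistant presumably uses an LLM-based intent parser rather than hand-written patterns:

```python
import re

# illustrative patterns and intent names, not the real command grammar
INTENTS = [
    (re.compile(r"who is on-?call for (?P<team>[\w-]+)", re.I), "get_oncall"),
    (re.compile(r"acknowledge the (?P<service>[\w-]+) (?:service )?alert", re.I), "ack_alert"),
    (re.compile(r"snooze this alert for (?P<minutes>\d+) minutes?", re.I), "snooze"),
]

def parse_command(text):
    """Return (intent, slots) for the first matching pattern, else (None, {})."""
    for pattern, intent in INTENTS:
        match = pattern.search(text)
        if match:
            return intent, match.groupdict()
    return None, {}
```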

Cockpit Dashboard

Alert Overview

┌─────────────────────────────────────────────────────────────────┐
│ ALERTING DASHBOARD Live │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ACTIVE ALERTS │
│ ─────────────────────────────────────────────────────────── │
│ 🔴 P1: 1 🟠 P2: 3 🟡 P3: 8 🔵 P4: 12 │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ 🔴 P1 │ orders-db Connection Pool Exhausted │ │
│ │ │ Started: 5 min ago │ Status: AI Investigating │ │
│ │ │ On-Call: Alice Chen │ [Ack] [Escalate] [Snooze] │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ 🟠 P2 │ auth-service High Memory │ │
│ │ │ Started: 12 min ago │ Status: Acknowledged │ │
│ │ │ Ack'd by: Bob Kim │ [Resolve] [Escalate] │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ TODAY'S METRICS │
│ ─────────────────────────────────────────────────────────── │
│ Alerts Fired: 45 AI Auto-Resolved: 18 (40%) │
│ MTTA: 47 sec MTTR: 8 min 24 sec │
│ False Positives: 2 Noise Reduction: 68% │
│ │
└─────────────────────────────────────────────────────────────────┘

On-Call Widget

| Info | Display |
|---|---|
| Current Primary | Name, phone, avatar |
| Current Secondary | Name, phone, avatar |
| Next Rotation | Date and who's next |
| Your Schedule | Your upcoming shifts |

Maintenance Windows

Creating Windows

Suppress alerts during planned maintenance:

  1. Go to Alerting > Maintenance
  2. Click Schedule Maintenance
  3. Select services to suppress
  4. Set start/end time
  5. Add description for audit

Window Types

| Type | Behavior |
|---|---|
| Full Suppress | No alerts at all |
| Reduce Severity | P1→P3, P2→P4 |
| Notify Only | Alert but don't page |
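
The three window types above can be modeled as a single transform applied to each incoming alert: drop it, downgrade its severity, or mark it non-paging. A minimal sketch, assuming alerts are plain dicts with a `severity` key (the `page` flag is an illustrative field name):

```python
SEVERITY_DOWNGRADE = {"P1": "P3", "P2": "P4"}  # from the Reduce Severity row

def apply_window(alert, window_type):
    """Apply a maintenance window; returns a modified copy of the alert,
    or None if it should be fully suppressed."""
    if window_type == "Full Suppress":
        return None
    alert = dict(alert)  # copy so the original is untouched
    if window_type == "Reduce Severity":
        alert["severity"] = SEVERITY_DOWNGRADE.get(alert["severity"],
                                                   alert["severity"])
    elif window_type == "Notify Only":
        alert["page"] = False
    return alert
```

Note that P3-P5 alerts pass through "Reduce Severity" unchanged, since the mapping only defines downgrades for P1 and P2.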

API Reference

List Active Alerts

GET /api/v1/alerts?status=active

# Response
{
  "alerts": [
    {
      "id": "alert-123",
      "severity": "P1",
      "title": "Database Connection Pool Exhausted",
      "service": "orders-db",
      "status": "firing",
      "started_at": "2026-01-18T14:30:00Z",
      "acknowledged_by": null,
      "escalation_level": 1
    }
  ]
}

Acknowledge Alert

POST /api/v1/alerts/{id}/acknowledge
{
  "user_id": "user-123",
  "note": "Investigating now"
}

Get On-Call Schedule

GET /api/v1/oncall/schedule/{team}

# Response
{
  "team": "platform",
  "current": {
    "primary": {"name": "Alice Chen", "email": "alice@nebusai.com"},
    "secondary": {"name": "Bob Kim", "email": "bob@nebusai.com"}
  },
  "next_rotation": "2026-01-27T00:00:00Z"
}

Best Practices

  1. Set appropriate severities - Not everything is P1
  2. Link runbooks - Every alert should have a runbook
  3. Tune thresholds - Reduce false positives
  4. Review on-call load - Spread alerts evenly
  5. Use maintenance windows - Planned work shouldn't page
  6. Trust the AI - Let it handle L1 issues