Alerting & On-Call Platform
Custom, edge-first alerting and on-call management platform with AIOps capabilities for intelligent incident response.
Overview
The Olympus Cloud Alerting Platform is a fully custom, edge-first alerting and on-call management system that replaces external services such as PagerDuty. It runs primarily on the Cloudflare edge for GCP independence, with a GCP Cloud Run hot standby for Cloudflare independence, so alerting stays available even during a major outage at either provider.
Cost Savings
| Metric | Before | After | Savings |
|---|---|---|---|
| Annual Cost | $6,000-12,000 | $0 | 100% |
| MTTA | 5-10 min | Under 1 min (AI auto-ack) | 80%+ |
| False Positive Rate | 15-20% | Under 5% (AI filtering) | 70%+ |
| L1 Auto-Resolved | 0% | 40%+ | New capability |
Architecture
┌─────────────────────────────────────────────────────────────────┐
│ Alert Sources │
│ Cloud Monitoring │ Prometheus │ Sentry │ Custom │ Edge Metrics │
└─────────────────────────────────┬───────────────────────────────┘
│
┌─────────────────┴─────────────────┐
▼ ▼
┌───────────────────────────┐ ┌───────────────────────────┐
│ CLOUDFLARE EDGE │ │ GCP CLOUD RUN │
│ (Primary - Always On) │◄───►│ (Failover - Hot Standby)│
├───────────────────────────┤sync ├───────────────────────────┤
│ Alert Ingestion Worker │ │ Alert Ingestion Service │
│ AlertStateDO (state) │ │ Spanner (persistent) │
│ ScheduleDO (on-call) │ │ Schedule Service │
│ EscalationDO (policies) │ │ Escalation Service │
│ NotificationQueue (DO) │ │ Pub/Sub (notifications) │
│ AIOps Worker (edge AI) │ │ AIOps Service (Vertex) │
└───────────────┬───────────┘ └───────────────┬───────────┘
│ │
└─────────────────┬─────────────────┘
▼
┌─────────────────────────────────────────────────────────────────┐
│ AIOps Engine │
│ Anomaly Detection │ Predictive Alerting │ Alert Correlation │
└─────────────────────────────────┬───────────────────────────────┘
│
┌─────────────────┴─────────────────┐
▼ ▼
┌───────────────────────────┐ ┌───────────────────────────┐
│ Notification Hub │ │ AI Support Agent │
│ (Multi-Channel) │ │ Integration │
├───────────────────────────┤ ├───────────────────────────┤
│ Twilio SMS/Voice │ │ L1 Auto-Response │
│ SendGrid Email │ │ Runbook Execution │
│ Slack/Teams Webhooks │ │ RAG Knowledge Base │
│ Push Notifications │ │ Human Escalation │
│ Cockpit Real-time │ │ Postmortem Generation │
└───────────────────────────┘ └───────────────────────────┘
Alert Severity Levels
| Severity | Definition | Response Time | Notification |
|---|---|---|---|
| P1-Critical | Service outage, data loss risk | Under 5 min | Voice + SMS + All |
| P2-High | Major feature degraded | Under 15 min | SMS + Slack |
| P3-Medium | Minor feature impacted | Under 1 hour | Slack + Email |
| P4-Low | Performance degraded | Under 4 hours | |
| P5-Info | Informational only | Best effort | Log only |
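As a sketch, the routing in the severity table can be expressed as a severity-to-channel map. Channel names here are illustrative, and the P4 entry is an assumption, since the table leaves that cell blank:

```python
# Severity-to-channel routing, mirroring the table above. Channel names are
# illustrative; the P4 entry is an assumption (the table leaves it blank).
SEVERITY_CHANNELS = {
    "P1": ["voice", "sms", "slack", "email", "push"],  # "All" channels
    "P2": ["sms", "slack"],
    "P3": ["slack", "email"],
    "P4": ["email"],   # assumed: the table cell is empty
    "P5": [],          # log only
}

def channels_for(severity: str) -> list[str]:
    """Channels to notify for a severity; an empty list means log only."""
    return SEVERITY_CHANNELS.get(severity, ["slack"])
```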
Alert Sources
Supported Integrations
| Source | Method | Alert Types |
|---|---|---|
| Cloud Monitoring | Webhook | GCP metrics, uptime |
| Prometheus | AlertManager webhook | Custom metrics |
| Sentry | Webhook | Application errors |
| Custom | REST API | Any source |
| Edge Metrics | Workers | Edge performance |
Alert Ingestion API
POST /api/v1/alerts/ingest
Content-Type: application/json
{
  "source": "prometheus",
  "severity": "P2",
  "title": "High Memory Usage on auth-service",
  "description": "Memory usage exceeded 90% threshold",
  "service": "auth-service",
  "environment": "production",
  "labels": {
    "pod": "auth-service-abc123",
    "region": "us-central1"
  },
  "runbook_url": "https://docs.olympuscloud.ai/runbooks/memory"
}
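A minimal client for this endpoint might look like the following sketch. The endpoint URL is a placeholder, and the request is built with the standard library rather than any assumed SDK:

```python
import json
import urllib.request

# Placeholder endpoint; substitute your deployment's ingest URL.
INGEST_URL = "https://alerts.example.com/api/v1/alerts/ingest"

def build_alert(source, severity, title, service, environment="production", **extra):
    """Assemble a payload matching the ingestion schema above."""
    payload = {
        "source": source,
        "severity": severity,
        "title": title,
        "service": service,
        "environment": environment,
    }
    payload.update(extra)  # labels, description, runbook_url, ...
    return payload

def ingest_request(payload):
    """Build the POST request; pass it to urllib.request.urlopen() to send."""
    return urllib.request.Request(
        INGEST_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```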
On-Call Management
Schedule Types
| Type | Description | Use Case |
|---|---|---|
| Weekly Rotation | 7-day shifts | Primary coverage |
| Daily Rotation | 24-hour shifts | High-frequency |
| Follow-the-Sun | Timezone-based | Global teams |
| Custom | User-defined | Special needs |
Schedule Configuration
┌─────────────────────────────────────────────────────────────────┐
│ ON-CALL SCHEDULE: Platform Team Week of Jan 20 │
├─────────────────────────────────────────────────────────────────┤
│ │
│ PRIMARY │
│ ─────────────────────────────────────────────────────────── │
│ Mon-Sun: Alice Chen (alice@nebusai.com) │
│ Phone: (555) 123-4567 │
│ │
│ SECONDARY │
│ ─────────────────────────────────────────────────────────── │
│ Mon-Sun: Bob Kim (bob@nebusai.com) │
│ Phone: (555) 234-5678 │
│ │
│ OVERRIDES │
│ ─────────────────────────────────────────────────────────── │
│ Wed 6PM - Thu 6AM: Charlie (covering for Alice) │
│ │
│ NEXT WEEK │
│ ─────────────────────────────────────────────────────────── │
│ Primary: Bob Kim │
│ Secondary: Dana Lopez │
│ │
│ [Edit Schedule] [Add Override] [Swap Shift] │
│ │
└─────────────────────────────────────────────────────────────────┘
Override Management
| Action | How |
|---|---|
| PTO Coverage | Create override for date range |
| Shift Swap | Request swap, other confirms |
| Emergency | Manager override |
| Holiday | Calendar-based auto-override |
Escalation Policies
Multi-Tier Escalation
┌─────────────────────────────────────────────────────────────────┐
│ ESCALATION POLICY: Platform Critical │
├─────────────────────────────────────────────────────────────────┤
│ │
│ LEVEL 1 (0 min) │
│ ─────────────────────────────────────────────────────────── │
│ Target: Platform Team On-Call │
│ Notify: SMS + Voice │
│ Timeout: 5 minutes │
│ │
│ │ If not acknowledged in 5 min │
│ ▼ │
│ │
│ LEVEL 2 (5 min) │
│ ─────────────────────────────────────────────────────────── │
│ Target: Platform Team Secondary + AI Support Agent │
│ Notify: SMS + Voice + Slack │
│ Timeout: 5 minutes │
│ │
│ │ If not acknowledged in 5 min │
│ ▼ │
│ │
│ LEVEL 3 (10 min) │
│ ─────────────────────────────────────────────────────────── │
│ Target: Engineering Manager │
│ Notify: SMS + Voice + Email │
│ Timeout: 10 minutes │
│ │
│ │ If not acknowledged in 10 min │
│ ▼ │
│ │
│ LEVEL 4 (20 min) │
│ ─────────────────────────────────────────────────────────── │
│ Target: VP Engineering + CTO │
│ Notify: All channels │
│ │
└─────────────────────────────────────────────────────────────────┘
Escalation Settings
| Setting | Description |
|---|---|
| Timeout | Time before escalating to next level |
| Skip if Ack'd | Don't escalate if acknowledged |
| Parallel Mode | Notify all at level simultaneously |
| Sequential Mode | Try each person in order |
| Round-Robin | Rotate who gets notified first |
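The timing of the "Platform Critical" policy above can be sketched as a cumulative-timeout calculation: the sum of the per-level timeouts decides which level is paging at a given elapsed time, and acknowledgement stops escalation ("Skip if Ack'd"):

```python
# Per-level timeouts (minutes) from the policy above: L1->L2 after 5,
# L2->L3 after 5 more, L3->L4 after 10 more.
LEVEL_TIMEOUTS_MIN = [5, 5, 10]

def escalation_level(elapsed_min, acknowledged=False):
    """1-based level firing at `elapsed_min`; 0 once acknowledged."""
    if acknowledged:
        return 0  # "Skip if Ack'd": stop escalating
    boundary = 0
    for level, timeout in enumerate(LEVEL_TIMEOUTS_MIN, start=1):
        boundary += timeout
        if elapsed_min < boundary:
            return level
    return len(LEVEL_TIMEOUTS_MIN) + 1  # final level: VP Engineering + CTO
```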
Notification Hub
Supported Channels
| Channel | Provider | Use Case | Cost |
|---|---|---|---|
| SMS | Twilio | P1/P2 primary | $0.0079/msg |
| Voice | Twilio | P1 escalation | $0.013/min |
| Email | SendGrid | All alerts | Free (100/day) |
| Slack | Webhook | Team alerts | Free |
| Teams | Webhook | Enterprise | Free |
| Push | FCM/APNs | Mobile app | Free |
| WebSocket | Internal | Cockpit | Free |
Notification Templates
SMS Template (P1):
───────────────────────────────────────────────────────────────
[P1-CRITICAL] {service}: {title}
Ack: Reply ACK
Details: {short_url}
───────────────────────────────────────────────────────────────
Voice Script (TTS):
───────────────────────────────────────────────────────────────
"Priority one alert for {service}. {title}.
Press 1 to acknowledge. Press 2 to escalate.
Press 3 to hear more details."
───────────────────────────────────────────────────────────────
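Rendering the SMS template is a straightforward string fill. This sketch assumes the severity label arrives pre-formatted (e.g. "P1-CRITICAL") and truncates to 160 characters to keep the page in a single SMS segment; that cap is an assumption, not a platform guarantee:

```python
# SMS template from above; truncation to one 160-char segment is an assumption.
SMS_TEMPLATE = "[{severity}] {service}: {title}\nAck: Reply ACK\nDetails: {short_url}"

def render_sms(alert):
    """Fill the template and truncate to a single SMS segment."""
    msg = SMS_TEMPLATE.format(**alert)
    return msg if len(msg) <= 160 else msg[:157] + "..."
```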
Delivery Guarantees
| SLA | Target |
|---|---|
| SMS Delivery | Under 5 seconds |
| Voice Call | Under 10 seconds |
| Email Delivery | Under 30 seconds |
| Retry Attempts | 5 with backoff |
| Success Rate | 99.9% |
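The "5 retries with backoff" policy can be sketched as a generic wrapper around any delivery callable; the exponential schedule and base delay here are illustrative choices, not documented platform values:

```python
import time

def deliver_with_retry(send, attempts=5, base_delay=0.5):
    """Call `send()` up to `attempts` times with exponential backoff.

    Returns True on the first success, False once all attempts fail.
    """
    for i in range(attempts):
        try:
            send()
            return True
        except Exception:
            if i < attempts - 1:
                time.sleep(base_delay * 2 ** i)  # 0.5s, 1s, 2s, 4s, ...
    return False
```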
AIOps Engine
Anomaly Detection
ML-powered baseline learning and anomaly detection:
| Method | Description | Use Case |
|---|---|---|
| Z-Score | Statistical deviation | Simple metrics |
| Isolation Forest | Multivariate anomalies | Complex patterns |
| Prophet | Seasonal patterns | Time-series |
| IQR | Interquartile range | Outlier detection |
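The simplest of these methods, z-score detection, fits in a few lines: flag any point that deviates from the series baseline by more than a chosen number of standard deviations. The threshold of 3 here is a common convention, not a platform default:

```python
import statistics

def zscore_anomalies(series, threshold=3.0):
    """Return points whose deviation from the series mean exceeds
    `threshold` population standard deviations."""
    mean = statistics.fmean(series)
    stdev = statistics.pstdev(series)
    if stdev == 0:
        return []  # constant series: nothing deviates
    return [x for x in series if abs(x - mean) / stdev > threshold]
```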
Predictive Alerting
Predict issues before they occur:
| Feature | Example |
|---|---|
| Capacity | "Disk will reach 90% in 4 hours" |
| SLO Burn | "Error budget will exhaust in 2 days" |
| Traffic | "Unusual traffic spike predicted" |
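A capacity prediction like "disk will reach 90% in 4 hours" can be approximated by a linear fit over recent samples. This is a deliberately simple stand-in for whatever model the engine actually uses:

```python
def hours_until(samples, limit):
    """Least-squares linear fit over (hour, value) samples; projected hours
    from the last sample until `limit` is crossed, or None if flat/declining."""
    n = len(samples)
    xbar = sum(x for x, _ in samples) / n
    ybar = sum(y for _, y in samples) / n
    den = sum((x - xbar) ** 2 for x, _ in samples)
    slope = sum((x - xbar) * (y - ybar) for x, y in samples) / den
    if slope <= 0:
        return None  # usage is flat or shrinking: no crossing predicted
    intercept = ybar - slope * xbar
    last_x = samples[-1][0]
    return (limit - (slope * last_x + intercept)) / slope
```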
Alert Correlation
Reduce noise by grouping related alerts:
┌─────────────────────────────────────────────────────────────────┐
│ CORRELATED INCIDENT #4521 │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ROOT CAUSE (AI Identified) │
│ ─────────────────────────────────────────────────────────── │
│ Database connection pool exhaustion on orders-db │
│ Confidence: 92% │
│ │
│ RELATED ALERTS (4 grouped) │
│ ─────────────────────────────────────────────────────────── │
│ ✓ P1: Order Service - High Latency (symptom) │
│ ✓ P2: Payment Service - Timeout Errors (symptom) │
│ ✓ P2: API Gateway - 504 Errors (symptom) │
│ ⭐ P1: Orders-DB - Connection Pool Exhausted (root cause) │
│ │
│ AI RECOMMENDATION │
│ ─────────────────────────────────────────────────────────── │
│ Execute runbook: db-connection-pool-increase │
│ Confidence: 87% │
│ │
│ [Execute Runbook] [Acknowledge All] [View Details] │
│ │
└─────────────────────────────────────────────────────────────────┘
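A heavily simplified stand-in for the correlator is greedy time-window grouping: alerts that start close together join one incident. The real engine also uses service topology and ML similarity, which this sketch omits:

```python
def correlate(alerts, window_sec=300):
    """Group alerts started within `window_sec` of each group's first alert."""
    groups = []
    for alert in sorted(alerts, key=lambda a: a["started_at"]):
        if groups and alert["started_at"] - groups[-1][0]["started_at"] <= window_sec:
            groups[-1].append(alert)   # joins the open incident
        else:
            groups.append([alert])     # starts a new incident
    return groups
```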
AI Support Agent Integration
L1 Automated Response
The AI Support Agent automatically handles L1 incidents:
1. Receive Alert - agent receives the alert context
2. Run Diagnostics - execute diagnostic tools
3. Search Knowledge Base - query runbooks via RAG
4. Attempt Remediation - execute safe actions
5. Escalate if Needed - pass to a human if unresolved
Auto-Remediation Actions
| Action | Risk Level | Approval |
|---|---|---|
| Restart Service | Low | Auto |
| Clear Cache | Low | Auto |
| Scale Up | Medium | Auto |
| Rotate Logs | Low | Auto |
| Scale Down | Medium | Required |
| Failover | High | Required |
| Database Query | High | Required |
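The approval column above amounts to a gate in front of the agent's action executor. This sketch mirrors the table; the action identifiers are illustrative:

```python
# Actions the agent may run without a human, per the table above.
# Identifiers are illustrative, not the platform's actual action names.
AUTO_APPROVED = {"restart_service", "clear_cache", "scale_up", "rotate_logs"}

def requires_approval(action):
    """True when a human must approve before the agent may execute `action`."""
    return action not in AUTO_APPROVED
```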
Hey Maximus Commands
Voice commands in Cockpit:
- "Hey Maximus, what's the current P1 status?"
- "Hey Maximus, who is on-call for platform?"
- "Hey Maximus, acknowledge the auth service alert"
- "Hey Maximus, show me similar incidents"
- "Hey Maximus, execute the memory runbook"
- "Hey Maximus, snooze this alert for 30 minutes"
Cockpit Dashboard
Alert Overview
┌─────────────────────────────────────────────────────────────────┐
│ ALERTING DASHBOARD Live │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ACTIVE ALERTS │
│ ─────────────────────────────────────────────────────────── │
│ 🔴 P1: 1 🟠 P2: 3 🟡 P3: 8 🔵 P4: 12 │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ 🔴 P1 │ orders-db Connection Pool Exhausted │ │
│ │ │ Started: 5 min ago │ Status: AI Investigating │ │
│ │ │ On-Call: Alice Chen │ [Ack] [Escalate] [Snooze] │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ 🟠 P2 │ auth-service High Memory │ │
│ │ │ Started: 12 min ago │ Status: Acknowledged │ │
│ │ │ Ack'd by: Bob Kim │ [Resolve] [Escalate] │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ TODAY'S METRICS │
│ ─────────────────────────────────────────────────────────── │
│ Alerts Fired: 45 AI Auto-Resolved: 18 (40%) │
│ MTTA: 47 sec MTTR: 8 min 24 sec │
│ False Positives: 2 Noise Reduction: 68% │
│ │
└─────────────────────────────────────────────────────────────────┘
On-Call Widget
| Info | Display |
|---|---|
| Current Primary | Name, phone, avatar |
| Current Secondary | Name, phone, avatar |
| Next Rotation | Date and who's next |
| Your Schedule | Your upcoming shifts |
Maintenance Windows
Creating Windows
Suppress alerts during planned maintenance:
1. Go to Alerting > Maintenance
2. Click Schedule Maintenance
3. Select the services to suppress
4. Set the start/end time
5. Add a description for the audit trail
Window Types
| Type | Behavior |
|---|---|
| Full Suppress | No alerts at all |
| Reduce Severity | P1→P3, P2→P4 |
| Notify Only | Alert but don't page |
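The three window behaviors above can be sketched as a single transform applied to incoming alerts. The alert and window shapes (timestamps, mode names) are assumptions for illustration:

```python
# Window handling per the table above. Alert/window field names are assumed.
REDUCED = {"P1": "P3", "P2": "P4"}  # severity demotion for Reduce Severity mode

def apply_window(alert, window):
    """Return the possibly-modified alert, or None if fully suppressed."""
    if not (window["start"] <= alert["ts"] < window["end"]):
        return alert  # outside the window: untouched
    mode = window["mode"]
    if mode == "full_suppress":
        return None
    if mode == "reduce_severity":
        sev = REDUCED.get(alert["severity"], alert["severity"])
        return {**alert, "severity": sev}
    if mode == "notify_only":
        return {**alert, "page": False}  # alert fires but does not page
    return alert
```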
API Reference
List Active Alerts
GET /api/v1/alerts?status=active
# Response
{
  "alerts": [
    {
      "id": "alert-123",
      "severity": "P1",
      "title": "Database Connection Pool Exhausted",
      "service": "orders-db",
      "status": "firing",
      "started_at": "2026-01-18T14:30:00Z",
      "acknowledged_by": null,
      "escalation_level": 1
    }
  ]
}
Acknowledge Alert
POST /api/v1/alerts/{id}/acknowledge
{
  "user_id": "user-123",
  "note": "Investigating now"
}
Get On-Call Schedule
GET /api/v1/oncall/schedule/{team}
# Response
{
  "team": "platform",
  "current": {
    "primary": {"name": "Alice Chen", "email": "alice@nebusai.com"},
    "secondary": {"name": "Bob Kim", "email": "bob@nebusai.com"}
  },
  "next_rotation": "2026-01-27T00:00:00Z"
}
Best Practices
- Set appropriate severities - Not everything is P1
- Link runbooks - Every alert should have a runbook
- Tune thresholds - Reduce false positives
- Review on-call load - Spread alerts evenly
- Use maintenance windows - Planned work shouldn't page
- Trust the AI - Let it handle L1 issues
Related Documentation
- Incident Response Runbook - How to respond
- Cockpit Operations - Cockpit dashboard
- ACP AI Router - AI infrastructure