Incident Response Runbook
Complete guide for handling production incidents in the Olympus Cloud platform.
Overview
This runbook provides step-by-step procedures for identifying, responding to, and resolving production incidents. All on-call engineers must be familiar with these procedures.
Incident Response Flow
┌──────────────────────────────────────────────────────────────┐
│                 Incident Response Lifecycle                  │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  DETECT       TRIAGE      RESPOND      RESOLVE       LEARN   │
│    │            │            │            │            │     │
│    ▼            ▼            ▼            ▼            ▼     │
│ ┌──────┐     ┌──────┐     ┌──────┐     ┌──────┐     ┌──────┐ │
│ │Alert │ ──► │Assess│ ──► │Engage│ ──► │ Fix  │ ──► │Review│ │
│ │Fires │     │Impact│     │Team  │     │Issue │     │Learn │ │
│ └──────┘     └──────┘     └──────┘     └──────┘     └──────┘ │
│                                                              │
│ < 5 min     < 15 min     < 30 min     < 4 hrs      < 3 days  │
│                                                              │
└──────────────────────────────────────────────────────────────┘
Severity Levels
Severity Classification
| Severity | Impact | Response Time | Examples |
|---|---|---|---|
| SEV-1 | Complete outage, all users affected | < 15 min | Platform down, data loss |
| SEV-2 | Major degradation, many users affected | < 30 min | Payment failures, slow response |
| SEV-3 | Partial degradation, some users affected | < 1 hour | Single feature broken |
| SEV-4 | Minor issue, few users affected | < 4 hours | Cosmetic issues, edge cases |
Severity Indicators
SEV-1 (Critical)
- Error rate > 50%
- Latency p99 > 30 seconds
- Complete service unavailability
- Data corruption or loss
- Security breach
SEV-2 (High)
- Error rate > 10%
- Latency p99 > 10 seconds
- Major feature unavailable
- Payment processing issues
- Voice AI service down
SEV-3 (Medium)
- Error rate > 5%
- Latency p99 > 5 seconds
- Single feature degraded
- Non-critical integrations failing
SEV-4 (Low)
- Error rate < 5%
- Minor UI issues
- Documentation errors
- Performance not meeting SLA but functional
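The indicator thresholds above can be sketched as a small triage helper. This is an illustrative sketch only (the `classify_severity` function name is an assumption, not an existing tool), and it should not replace human judgment during triage:

```shell
# Map error rate (integer percent) and p99 latency (integer seconds) to a
# severity label, following the indicator thresholds above.
# Hypothetical helper, not part of any existing tooling.
classify_severity() {
  error_rate=$1   # e.g. 15 for 15%
  p99_seconds=$2  # e.g. 12 for 12s
  if [ "$error_rate" -gt 50 ] || [ "$p99_seconds" -gt 30 ]; then
    echo "SEV-1"
  elif [ "$error_rate" -gt 10 ] || [ "$p99_seconds" -gt 10 ]; then
    echo "SEV-2"
  elif [ "$error_rate" -gt 5 ] || [ "$p99_seconds" -gt 5 ]; then
    echo "SEV-3"
  else
    echo "SEV-4"
  fi
}

classify_severity 15 12   # prints SEV-2
```

Note that either indicator alone is enough to raise the severity, matching the "or" semantics of the bullet lists above.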
On-Call Responsibilities
Primary On-Call
- Monitor PagerDuty for alerts
- Acknowledge incidents within 5 minutes
- Perform initial triage and assessment
- Engage secondary on-call if needed
- Escalate to Incident Commander for SEV-1/SEV-2
Secondary On-Call
- Available as backup for primary
- Join incident bridges when escalated
- Assist with investigation and remediation
- Take over if primary unavailable
Incident Commander (SEV-1/SEV-2 only)
- Coordinate response across teams
- Make decisions on rollback vs. fix-forward
- Manage external communications
- Ensure proper documentation
Detection and Alerting
Alert Sources
| Source | Monitors | Escalation |
|---|---|---|
| Cloud Monitoring | GCP metrics, uptime | PagerDuty |
| Datadog | APM, custom metrics | PagerDuty |
| Sentry | Application errors | Slack + PagerDuty |
| Cloudflare | Edge health, WAF | Slack + PagerDuty |
| Synthetic Monitoring | End-user flows | PagerDuty |
Critical Alerts
┌────────────────────────────────────────────────────────┐
│ Alert: High Error Rate                                 │
├────────────────────────────────────────────────────────┤
│ Service: api-gateway                                   │
│ Metric: error_rate > 10%                               │
│ Current Value: 15.3%                                   │
│ Duration: 5 minutes                                    │
│ Impact: SEV-2                                          │
│ Runbook: /docs/operations/runbooks/incident-response   │
│                                                        │
│ Recent Deployments:                                    │
│ - api-gateway v1.24.5 (deployed 15 min ago)            │
│ - platform-service v2.8.1 (deployed 2 hours ago)       │
│                                                        │
│ Quick Actions:                                         │
│ [View Logs] [View Traces] [Rollback] [Escalate]        │
└────────────────────────────────────────────────────────┘
Initial Response
Step 1: Acknowledge (< 5 minutes)
- Acknowledge in PagerDuty
  Click "Acknowledge" in the PagerDuty mobile app or web UI
- Join Slack channel
  #incident-response (for all incidents)
  #incident-sev1 (for SEV-1 only)
- Post initial status
  @here SEV-2 incident: High error rate on api-gateway
  Investigating. Will provide update in 15 minutes.
Step 2: Triage (< 15 minutes)
- Identify scope
  - Which services are affected?
  - Which regions?
  - How many users impacted?
- Check recent changes
  # View recent deployments
  gcloud run revisions list --service=api-gateway --limit=5
  # View recent config changes
  git log --oneline --since="2 hours ago" -- infrastructure/
- Determine severity
  - Update severity if the initial classification was wrong
  - Escalate if SEV-1 or SEV-2
- Initial hypothesis
  - Recent deployment?
  - External dependency issue?
  - Traffic spike?
  - Infrastructure problem?
Step 3: Engage Team (SEV-1/SEV-2)
- Page Incident Commander
  In PagerDuty: Escalate to the "Incident Commander" schedule
- Start incident bridge
  Google Meet: meet.google.com/inc-olympus-cloud
  Or use a Slack Huddle in #incident-sev1
- Assign roles
  - Incident Commander: Coordinates response
  - Communications: Updates stakeholders
  - Technical Lead: Drives investigation
  - Scribe: Documents timeline
Investigation
Log Analysis
# View Cloud Run logs for errors
gcloud logging read "resource.type=cloud_run_revision AND severity>=ERROR" \
--limit=50 --format=json
# View specific service logs
gcloud logging read "resource.labels.service_name=api-gateway" \
--limit=100 --format="table(timestamp,severity,jsonPayload.message)"
# Search for specific error patterns
gcloud logging read 'textPayload=~"database connection"' \
--limit=50
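Once logs are exported, a quick frequency count often surfaces the dominant failure mode faster than scrolling. A minimal sketch, assuming a plain-text export with `timestamp severity message` per line (the `errors.log` file here is generated inline purely for illustration):

```shell
# Generate a small sample export for illustration; in practice this file
# would come from the gcloud logging read commands above.
cat > errors.log <<'EOF'
2026-01-18T12:01:03Z ERROR database connection refused
2026-01-18T12:01:04Z ERROR database connection refused
2026-01-18T12:01:05Z ERROR upstream timeout
2026-01-18T12:01:06Z ERROR database connection refused
EOF

# Strip the timestamp and severity fields, then count identical messages,
# most frequent first.
cut -d' ' -f3- errors.log | sort | uniq -c | sort -rn
```

The top line of the output is usually a good starting hypothesis for the root cause.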
Metrics Analysis
- Check dashboards
  - Main dashboard: https://console.cloud.google.com/monitoring/dashboards/custom/olympus-main
  - Service dashboards: per-service dashboards
- Key metrics to check
  - Error rate
  - Latency percentiles (p50, p95, p99)
  - Request rate
  - CPU/memory utilization
  - Database connections
- Correlation
  - Compare metrics across services
  - Look for coincident changes
  - Check external dependencies
Trace Analysis
# Find slow traces
# In Cloud Trace, filter by:
# - Latency > 5s
# - Has error
# - Service name = api-gateway
Common Investigation Patterns
| Symptom | Check First | Common Cause |
|---|---|---|
| High error rate | Recent deployments | Bad code deploy |
| High latency | Database metrics | Query performance |
| Connection errors | Network/firewall | Config change |
| Memory issues | Container metrics | Memory leak |
| Intermittent failures | External deps | Third-party issue |
Remediation
Quick Fixes
Rollback Deployment
# List recent revisions
gcloud run revisions list --service=api-gateway
# Rollback to previous revision
gcloud run services update-traffic api-gateway \
--to-revisions=api-gateway-00001-abc=100
# Verify traffic shift
gcloud run services describe api-gateway --format='value(status.traffic)'
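Because rollback is the most common remediation, it can help to wrap the commands above in a small guard so the exact traffic shift is reviewed before it runs. A sketch under stated assumptions: the `rollback` function name and `--dry-run` flag are illustrative, not existing tooling:

```shell
# Hypothetical rollback helper: validates arguments and, in dry-run mode,
# prints the gcloud command instead of executing it.
rollback() {
  service=$1
  revision=$2
  mode=$3
  if [ -z "$service" ] || [ -z "$revision" ]; then
    echo "usage: rollback SERVICE REVISION [--dry-run]" >&2
    return 1
  fi
  cmd="gcloud run services update-traffic $service --to-revisions=$revision=100"
  if [ "$mode" = "--dry-run" ]; then
    # Show what would run without touching production traffic.
    echo "DRY RUN: $cmd"
  else
    $cmd
  fi
}

rollback api-gateway api-gateway-00001-abc --dry-run
```

Running with `--dry-run` first lets a second responder eyeball the revision name before traffic actually moves.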
Scale Up
# Increase min instances
gcloud run services update api-gateway \
--min-instances=10
# Increase memory/CPU
gcloud run services update api-gateway \
--memory=2Gi --cpu=2
Restart Service
# Force new instances by updating with no changes
gcloud run services update api-gateway --no-traffic
gcloud run services update-traffic api-gateway --to-latest
Database Operations (Cloud Spanner)
# Check for long-running transactions
gcloud spanner databases execute-sql olympus-db \
--instance=prod-olympus-spanner \
--sql="SELECT * FROM SPANNER_SYS.TXN_STATS_TOP_10MINUTE ORDER BY AVG_COMMIT_LATENCY_SECONDS DESC LIMIT 10"
# Scale Spanner nodes during incident
gcloud spanner instances update prod-olympus-spanner --nodes=5
Fix Categories
When choosing a remediation strategy, always prefer the lowest-risk option that resolves the issue. Rollback should be the default first response for deployment-related incidents. Hot fixes and database changes during active incidents carry significant risk and require Incident Commander approval for SEV-1/SEV-2 events.
| Category | Action | Risk Level |
|---|---|---|
| Rollback | Revert to last known good | Low |
| Scale | Add capacity | Low |
| Restart | Force new instances | Medium |
| Config change | Update runtime config | Medium |
| Hot fix | Deploy targeted fix | High |
| Database change | Modify data/schema | Very High |
Communication
Internal Updates
Post updates every 15-30 minutes in Slack:
📢 Incident Update (12:45 PM)
Status: INVESTIGATING
Severity: SEV-2
Duration: 30 minutes
Impact: Elevated error rates on order processing
Current actions:
- Identified root cause: database connection pool exhausted
- Scaling up connection pool from 50 to 100
- Monitoring for improvement
Next update: 1:00 PM or when resolved
External Communications (SEV-1/SEV-2)
- Status page update
  https://status.olympuscloud.ai
- Customer notification template
  Subject: [Olympus Cloud] Service Degradation - {Date}
  We are currently experiencing elevated error rates affecting
  [affected services]. Our team is actively investigating and
  working to resolve the issue.
  Impact: [Describe customer impact]
  Workaround: [If available]
  We will provide updates every 30 minutes until resolved.
- Resolution notification
  Subject: [Olympus Cloud] Service Restored - {Date}
  The service degradation has been resolved. All systems are
  now operating normally.
  Root cause: [Brief description]
  Duration: [X hours Y minutes]
  A detailed post-incident review will be conducted and
  shared within 3 business days.
Resolution
Verification Steps
- Confirm fix applied
  - Verify the deployment succeeded
  - Check that configuration changes took effect
- Monitor metrics
  - Error rate returning to baseline
  - Latency returning to baseline
  - No new error patterns
- Test functionality
  - Run smoke tests
  - Verify critical user flows
  - Check affected integrations
- Hold period
  - Monitor for 15-30 minutes after the fix
  - Ensure no regression
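The hold period can be made mechanical with a simple sampling loop. A sketch under stated assumptions: `current_error_rate` is a stub here, and in practice it would query your metrics backend (Cloud Monitoring or Datadog) rather than return a fixed value:

```shell
# Stubbed metric source for illustration; replace with a real metrics query.
current_error_rate() {
  echo 2   # percent
}

# Sample the error rate N times and flag a regression if any sample
# exceeds the baseline threshold.
hold_check() {
  samples=$1
  threshold=$2
  i=0
  while [ "$i" -lt "$samples" ]; do
    rate=$(current_error_rate)
    if [ "$rate" -gt "$threshold" ]; then
      echo "REGRESSION: error rate ${rate}% above ${threshold}%"
      return 1
    fi
    i=$((i + 1))
    # sleep 60   # in a real hold period, one sample per minute
  done
  echo "STABLE: $samples samples at or below ${threshold}%"
}

hold_check 3 5
```

Only close the incident once the loop completes without flagging a regression.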
Closing the Incident
- Update PagerDuty
  - Mark the incident as resolved
  - Add resolution notes
- Update Slack
  ✅ Incident Resolved
  Duration: 1 hour 15 minutes
  Root cause: Database connection pool exhaustion due to a
  connection leak in order-service
  Resolution: Deployed hotfix v1.24.6 with connection leak fix
  Post-incident review scheduled for [date/time]
- Update status page
  - Mark the incident as resolved
  - Add a brief resolution summary
Post-Incident
Post-Incident Review (PIR)
For SEV-1 and SEV-2 incidents, conduct a PIR within 3 business days.
PIR Template
# Post-Incident Review: [Incident Title]
## Summary
- **Date**:
- **Duration**:
- **Severity**:
- **Impact**:
## Timeline
| Time | Event |
|------|-------|
| 10:00 | Alert fired |
| 10:05 | On-call acknowledged |
| ... | ... |
## Root Cause
[Detailed root cause analysis]
## Contributing Factors
1. [Factor 1]
2. [Factor 2]
## Impact
- Users affected:
- Revenue impact:
- Data loss:
## Resolution
[How was it fixed]
## Action Items
| Action | Owner | Due Date |
|--------|-------|----------|
| Add monitoring for X | @engineer | 2026-01-25 |
## Lessons Learned
- What went well
- What could be improved
Action Items
Track action items in GitHub Issues with label postmortem:
gh issue create \
--title "PIR: Add connection pool monitoring" \
--label "postmortem,priority:high" \
--body "Action item from incident on 2026-01-18..."
Appendix
Contact Information
| Role | Contact |
|---|---|
| On-Call Primary | PagerDuty schedule |
| On-Call Secondary | PagerDuty schedule |
| Incident Commander | @ic-rotation |
| Security | @security-team |
| Database Admin | @dba-team |
Quick Links
Emergency Contacts
For critical vendor issues:
- GCP Support: 1-855-831-3592 (Enterprise)
- Cloudflare: Enterprise support portal
- Stripe: Dashboard escalation