Incident Response Runbook

Complete guide for handling production incidents in the Olympus Cloud platform.

Overview

This runbook provides step-by-step procedures for identifying, responding to, and resolving production incidents. All on-call engineers must be familiar with these procedures.

Incident Response Flow

┌──────────────────────────────────────────────────────────────────┐
│                   Incident Response Lifecycle                    │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│   DETECT       TRIAGE       RESPOND       RESOLVE       LEARN    │
│     │            │             │             │            │      │
│     ▼            ▼             ▼             ▼            ▼      │
│  ┌───────┐   ┌────────┐   ┌────────┐   ┌───────┐   ┌────────┐   │
│  │ Alert │──►│ Assess │──►│ Engage │──►│  Fix  │──►│ Review │   │
│  │ Fires │   │ Impact │   │  Team  │   │ Issue │   │ &Learn │   │
│  └───────┘   └────────┘   └────────┘   └───────┘   └────────┘   │
│                                                                  │
│  < 5 min     < 15 min     < 30 min     < 4 hrs     < 3 days     │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘

Severity Levels

Severity Classification

| Severity | Impact | Response Time | Examples |
|----------|--------|---------------|----------|
| SEV-1 | Complete outage, all users affected | < 15 min | Platform down, data loss |
| SEV-2 | Major degradation, many users affected | < 30 min | Payment failures, slow responses |
| SEV-3 | Partial degradation, some users affected | < 1 hour | Single feature broken |
| SEV-4 | Minor issue, few users affected | < 4 hours | Cosmetic issues, edge cases |

Severity Indicators

SEV-1 (Critical)

  • Error rate > 50%
  • Latency p99 > 30 seconds
  • Complete service unavailability
  • Data corruption or loss
  • Security breach

SEV-2 (High)

  • Error rate > 10%
  • Latency p99 > 10 seconds
  • Major feature unavailable
  • Payment processing issues
  • Voice AI service down

SEV-3 (Medium)

  • Error rate > 5%
  • Latency p99 > 5 seconds
  • Single feature degraded
  • Non-critical integrations failing

SEV-4 (Low)

  • Error rate < 5%
  • Minor UI issues
  • Documentation errors
  • Performance not meeting SLA but functional
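The error-rate thresholds above can be sketched as a small shell helper for quick classification during triage. `classify_sev` is a hypothetical name, not an existing tool, and it only looks at error rate (whole percentages); latency and feature-impact indicators still require human judgment.

```shell
# classify_sev RATE — map an integer error-rate percentage to a severity
# level using the thresholds listed above (error rate only).
classify_sev() {
  local rate=$1
  if   [ "$rate" -gt 50 ]; then echo "SEV-1"
  elif [ "$rate" -gt 10 ]; then echo "SEV-2"
  elif [ "$rate" -gt 5  ]; then echo "SEV-3"
  else                          echo "SEV-4"
  fi
}
```

For example, `classify_sev 15` prints SEV-2, matching the table above.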

On-Call Responsibilities

Primary On-Call

  • Monitor PagerDuty for alerts
  • Acknowledge incidents within 5 minutes
  • Perform initial triage and assessment
  • Engage secondary on-call if needed
  • Escalate to Incident Commander for SEV-1/SEV-2

Secondary On-Call

  • Available as backup for primary
  • Join incident bridges when escalated
  • Assist with investigation and remediation
  • Take over if primary unavailable

Incident Commander (SEV-1/SEV-2 only)

  • Coordinate response across teams
  • Make decisions on rollback vs. fix-forward
  • Manage external communications
  • Ensure proper documentation

Detection and Alerting

Alert Sources

| Source | Monitors | Escalation |
|--------|----------|------------|
| Cloud Monitoring | GCP metrics, uptime | PagerDuty |
| Datadog | APM, custom metrics | PagerDuty |
| Sentry | Application errors | Slack + PagerDuty |
| Cloudflare | Edge health, WAF | Slack + PagerDuty |
| Synthetic Monitoring | End-user flows | PagerDuty |

Critical Alerts

┌─────────────────────────────────────────────────────────────────┐
│ Alert: High Error Rate                                          │
├─────────────────────────────────────────────────────────────────┤
│ Service:       api-gateway                                      │
│ Metric:        error_rate > 10%                                 │
│ Current Value: 15.3%                                            │
│ Duration:      5 minutes                                        │
│ Impact:        SEV-2                                            │
│ Runbook:       /docs/operations/runbooks/incident-response      │
│                                                                 │
│ Recent Deployments:                                             │
│ - api-gateway v1.24.5 (deployed 15 min ago)                     │
│ - platform-service v2.8.1 (deployed 2 hours ago)                │
│                                                                 │
│ Quick Actions:                                                  │
│ [View Logs] [View Traces] [Rollback] [Escalate]                 │
└─────────────────────────────────────────────────────────────────┘

Initial Response

Step 1: Acknowledge (< 5 minutes)

  1. Acknowledge in PagerDuty

    Click "Acknowledge" in PagerDuty mobile app or web UI
  2. Join Slack channel

    #incident-response (for all incidents)
    #incident-sev1 (for SEV-1 only)
  3. Post initial status

    @here SEV-2 incident: High error rate on api-gateway
    Investigating. Will provide update in 15 minutes.

Step 2: Triage (< 15 minutes)

  1. Identify scope

    • Which services are affected?
    • Which regions?
    • How many users impacted?
  2. Check recent changes

    # View recent deployments
    gcloud run revisions list --service=api-gateway --limit=5

    # View recent config changes
    git log --oneline --since="2 hours ago" -- infrastructure/
  3. Determine severity

    • Update severity if initial classification was wrong
    • Escalate if SEV-1 or SEV-2
  4. Initial hypothesis

    • Recent deployment?
    • External dependency issue?
    • Traffic spike?
    • Infrastructure problem?
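To make the "recent deployment?" hypothesis check concrete, the triage step can be scripted. This is a minimal sketch: the `deploy_suspect` helper and its 30-minute default window are assumptions, not an existing tool. Feed it a deployment's creation time (as a Unix epoch) taken from `gcloud run revisions list`.

```shell
# deploy_suspect DEPLOY_EPOCH [WINDOW_MINUTES] — flag a deployment as a
# rollback candidate when it happened within the window (default 30 min).
deploy_suspect() {
  local deploy_epoch=$1 window=${2:-30}
  local age=$(( ( $(date +%s) - deploy_epoch ) / 60 ))
  if [ "$age" -le "$window" ]; then
    echo "suspect: deployed ${age}m ago"
  else
    echo "unlikely: deployed ${age}m ago"
  fi
}
```

A deployment that landed just before the alert fired is the strongest rollback candidate; anything hours old is probably not the trigger.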

Step 3: Engage Team (SEV-1/SEV-2)

  1. Page Incident Commander

    In PagerDuty: Escalate to "Incident Commander" schedule
  2. Start incident bridge

    Google Meet: meet.google.com/inc-olympus-cloud
    Or use Slack Huddle in #incident-sev1
  3. Assign roles

    • Incident Commander: Coordinates response
    • Communications: Updates stakeholders
    • Technical Lead: Drives investigation
    • Scribe: Documents timeline

Investigation

Log Analysis

# View Cloud Run logs for errors
gcloud logging read "resource.type=cloud_run_revision AND severity>=ERROR" \
--limit=50 --format=json

# View specific service logs
gcloud logging read "resource.labels.service_name=api-gateway" \
--limit=100 --format="table(timestamp,severity,jsonPayload.message)"

# Search for specific error patterns
gcloud logging read 'textPayload=~"database connection"' \
--limit=50
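To see which errors dominate, the JSON output from the commands above can be piped through `jq` (assumed installed). `top_errors` is a hypothetical helper name used here for illustration.

```shell
# top_errors — read a JSON array of log entries on stdin and print a
# count-sorted tally of error messages (jsonPayload.message, falling
# back to textPayload).
top_errors() {
  jq -r '.[] | (.jsonPayload.message // .textPayload // "unknown")' \
    | sort | uniq -c | sort -rn
}

# Example:
# gcloud logging read "severity>=ERROR" --limit=200 --format=json | top_errors
```

The most frequent message is usually the primary failure; lower-count entries are often downstream noise.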

Metrics Analysis

  1. Check dashboards

    • Main dashboard: https://console.cloud.google.com/monitoring/dashboards/custom/olympus-main
    • Service dashboards: Per-service dashboards
  2. Key metrics to check

    • Error rate
    • Latency percentiles (p50, p95, p99)
    • Request rate
    • CPU/Memory utilization
    • Database connections
  3. Correlation

    • Compare metrics across services
    • Look for coincident changes
    • Check external dependencies

Trace Analysis

# Find slow traces
# In Cloud Trace, filter by:
# - Latency > 5s
# - Has error
# - Service name = api-gateway

Common Investigation Patterns

| Symptom | Check First | Common Cause |
|---------|-------------|--------------|
| High error rate | Recent deployments | Bad code deploy |
| High latency | Database metrics | Query performance |
| Connection errors | Network/firewall | Config change |
| Memory issues | Container metrics | Memory leak |
| Intermittent failures | External deps | Third-party issue |

Remediation

Quick Fixes

Rollback Deployment

# List recent revisions
gcloud run revisions list --service=api-gateway

# Rollback to previous revision
gcloud run services update-traffic api-gateway \
--to-revisions=api-gateway-00001-abc=100

# Verify traffic shift
gcloud run services describe api-gateway --format='value(status.traffic)'
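
When scripting a rollback, picking the previous revision can be automated. This sketch assumes `gcloud run revisions list --format='value(metadata.name)'` prints revision names newest-first and that the second entry is the last known good; verify the target revision before shifting traffic.

```shell
# previous_revision — read revision names (newest first, one per line)
# on stdin and print the second one, i.e. the revision that served
# traffic before the current deploy.
previous_revision() {
  awk 'NR==2 { print; exit }'
}

# Example:
# prev=$(gcloud run revisions list --service=api-gateway \
#          --format='value(metadata.name)' | previous_revision)
# gcloud run services update-traffic api-gateway --to-revisions="$prev"=100
```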

Scale Up

# Increase min instances
gcloud run services update api-gateway \
--min-instances=10

# Increase memory/CPU
gcloud run services update api-gateway \
--memory=2Gi --cpu=2

Restart Service

# Force a new revision (and fresh instances) by touching an env var;
# updating traffic flags alone does not create a new revision
gcloud run services update api-gateway \
--update-env-vars=RESTART_TRIGGER="$(date +%s)"

Database Operations (Cloud Spanner)

# Check for long-running transactions
gcloud spanner databases execute-sql olympus-db \
--instance=prod-olympus-spanner \
--sql="SELECT * FROM SPANNER_SYS.TXN_STATS_TOP_10MINUTE ORDER BY AVG_COMMIT_LATENCY_SECONDS DESC LIMIT 10"

# Scale Spanner nodes during incident
gcloud spanner instances update prod-olympus-spanner --nodes=5

Fix Categories

⚠️ Warning

When choosing a remediation strategy, always prefer the lowest-risk option that resolves the issue. Rollback should be the default first response for deployment-related incidents. Hot fixes and database changes during active incidents carry significant risk and require Incident Commander approval for SEV-1/SEV-2 events.

| Category | Action | Risk Level |
|----------|--------|------------|
| Rollback | Revert to last known good | Low |
| Scale | Add capacity | Low |
| Restart | Force new instances | Medium |
| Config change | Update runtime config | Medium |
| Hot fix | Deploy targeted fix | High |
| Database change | Modify data/schema | Very High |

Communication

Internal Updates

Post updates every 15-30 minutes in Slack:

📢 Incident Update (12:45 PM)

Status: INVESTIGATING
Severity: SEV-2
Duration: 30 minutes
Impact: Elevated error rates on order processing

Current actions:
- Identified root cause: database connection pool exhausted
- Scaling up connection pool from 50 to 100
- Monitoring for improvement

Next update: 1:00 PM or when resolved
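
Updates in the format above can be generated consistently with a small formatter. `incident_update` is a hypothetical helper; pipe its output to whatever posts to Slack in your environment.

```shell
# incident_update STATUS SEVERITY DURATION IMPACT — print a Slack-ready
# incident update in the format shown above.
incident_update() {
  printf '📢 Incident Update (%s)\n\nStatus: %s\nSeverity: %s\nDuration: %s\nImpact: %s\n' \
    "$(date +%H:%M)" "$1" "$2" "$3" "$4"
}

# Example:
# incident_update INVESTIGATING SEV-2 "30 minutes" "Elevated error rates on order processing"
```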

External Communications (SEV-1/SEV-2)

  1. Status page update

    https://status.olympuscloud.ai
  2. Customer notification template

    Subject: [Olympus Cloud] Service Degradation - {Date}

    We are currently experiencing elevated error rates affecting
    [affected services]. Our team is actively investigating and
    working to resolve the issue.

    Impact: [Describe customer impact]
    Workaround: [If available]

    We will provide updates every 30 minutes until resolved.
  3. Resolution notification

    Subject: [Olympus Cloud] Service Restored - {Date}

    The service degradation has been resolved. All systems are
    now operating normally.

    Root cause: [Brief description]
    Duration: [X hours Y minutes]

    A detailed post-incident review will be conducted and
    shared within 3 business days.

Resolution

Verification Steps

  1. Confirm fix applied

    • Verify deployment succeeded
    • Check configuration changes took effect
  2. Monitor metrics

    • Error rate returning to baseline
    • Latency returning to baseline
    • No new error patterns
  3. Test functionality

    • Run smoke tests
    • Verify critical user flows
    • Check affected integrations
  4. Hold period

    • Monitor for 15-30 minutes after fix
    • Ensure no regression
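The "returning to baseline" check can be made concrete during the hold period. This sketch is an assumption, not an existing tool: the `recovered` name and the default 1-percentage-point tolerance are illustrative.

```shell
# recovered CURRENT BASELINE [TOLERANCE] — report whether the current
# error rate (percent, decimals allowed) is back within TOLERANCE
# percentage points of the baseline (default 1.0).
recovered() {
  awk -v cur="$1" -v base="$2" -v tol="${3:-1.0}" \
    'BEGIN { print (cur <= base + tol) ? "recovered" : "still elevated" }'
}
```

Run it a few times across the 15-30 minute hold window rather than once; a single good reading can mask a regression.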

Closing the Incident

  1. Update PagerDuty

    • Mark incident as resolved
    • Add resolution notes
  2. Update Slack

    ✅ Incident Resolved

    Duration: 1 hour 15 minutes
    Root cause: Database connection pool exhaustion due to
    connection leak in order-service

    Resolution: Deployed hotfix v1.24.6 with connection leak fix

    Post-incident review scheduled for [date/time]
  3. Update status page

    • Mark incident as resolved
    • Add brief resolution summary

Post-Incident

Post-Incident Review (PIR)

For SEV-1 and SEV-2 incidents, conduct a PIR within 3 business days.

PIR Template

# Post-Incident Review: [Incident Title]

## Summary
- **Date**:
- **Duration**:
- **Severity**:
- **Impact**:

## Timeline
| Time | Event |
|------|-------|
| 10:00 | Alert fired |
| 10:05 | On-call acknowledged |
| ... | ... |

## Root Cause
[Detailed root cause analysis]

## Contributing Factors
1. [Factor 1]
2. [Factor 2]

## Impact
- Users affected:
- Revenue impact:
- Data loss:

## Resolution
[How was it fixed]

## Action Items
| Action | Owner | Due Date |
|--------|-------|----------|
| Add monitoring for X | @engineer | 2026-01-25 |

## Lessons Learned
- What went well
- What could be improved

Action Items

Track action items in GitHub Issues with the postmortem label:

gh issue create \
--title "PIR: Add connection pool monitoring" \
--label "postmortem,priority:high" \
--body "Action item from incident on 2026-01-18..."

Appendix

Contact Information

| Role | Contact |
|------|---------|
| On-Call Primary | PagerDuty schedule |
| On-Call Secondary | PagerDuty schedule |
| Incident Commander | @ic-rotation |
| Security | @security-team |
| Database Admin | @dba-team |

Emergency Contacts

For critical vendor issues:

  • GCP Support: 1-855-831-3592 (Enterprise)
  • Cloudflare: Enterprise support portal
  • Stripe: Dashboard escalation