Incident Response Runbook

Complete guide for handling production incidents in the Olympus Cloud platform.

Overview

This runbook provides step-by-step procedures for identifying, responding to, and resolving production incidents. All on-call engineers must be familiar with these procedures.

Incident Response Flow

┌──────────────────────────────────────────────────────────────────┐
│                   Incident Response Lifecycle                    │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│   DETECT       TRIAGE       RESPOND       RESOLVE       LEARN    │
│     │            │             │             │            │      │
│     ▼            ▼             ▼             ▼            ▼      │
│  ┌───────┐   ┌────────┐   ┌────────┐   ┌───────┐   ┌────────┐   │
│  │ Alert │──►│ Assess │──►│ Engage │──►│  Fix  │──►│ Review │   │
│  │ Fires │   │ Impact │   │  Team  │   │ Issue │   │ &Learn │   │
│  └───────┘   └────────┘   └────────┘   └───────┘   └────────┘   │
│                                                                  │
│  < 5 min     < 15 min     < 30 min     < 4 hrs     < 3 days     │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘

Severity Levels

Severity Classification

| Severity | Impact | Response Time | Examples |
|----------|--------|---------------|----------|
| SEV-1 | Complete outage, all users affected | < 15 min | Platform down, data loss |
| SEV-2 | Major degradation, many users affected | < 30 min | Payment failures, slow responses |
| SEV-3 | Partial degradation, some users affected | < 1 hour | Single feature broken |
| SEV-4 | Minor issue, few users affected | < 4 hours | Cosmetic issues, edge cases |

Severity Indicators

SEV-1 (Critical)

  • Error rate > 50%
  • Latency p99 > 30 seconds
  • Complete service unavailability
  • Data corruption or loss
  • Security breach

SEV-2 (High)

  • Error rate > 10%
  • Latency p99 > 10 seconds
  • Major feature unavailable
  • Payment processing issues
  • Voice AI service down

SEV-3 (Medium)

  • Error rate > 5%
  • Latency p99 > 5 seconds
  • Single feature degraded
  • Non-critical integrations failing

SEV-4 (Low)

  • Error rate < 5%
  • Minor UI issues
  • Documentation errors
  • Performance not meeting SLA but functional
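The error-rate thresholds above can be sketched as a small shell helper for quick classification during triage. `classify_sev` is a hypothetical name, not an existing tool, and it only looks at error rate (whole percentages); latency and feature-impact indicators still require human judgment.

```shell
# classify_sev RATE — map an integer error-rate percentage to a severity
# level using the thresholds listed above (error rate only).
classify_sev() {
  local rate=$1
  if   [ "$rate" -gt 50 ]; then echo "SEV-1"
  elif [ "$rate" -gt 10 ]; then echo "SEV-2"
  elif [ "$rate" -gt 5  ]; then echo "SEV-3"
  else                          echo "SEV-4"
  fi
}
```

For example, `classify_sev 15` prints SEV-2, matching the table above.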

On-Call Responsibilities

Primary On-Call

  • Monitor PagerDuty for alerts
  • Acknowledge incidents within 5 minutes
  • Perform initial triage and assessment
  • Engage secondary on-call if needed
  • Escalate to Incident Commander for SEV-1/SEV-2

Secondary On-Call

  • Available as backup for primary
  • Join incident bridges when escalated
  • Assist with investigation and remediation
  • Take over if primary unavailable

Incident Commander (SEV-1/SEV-2 only)

  • Coordinate response across teams
  • Make decisions on rollback vs. fix-forward
  • Manage external communications
  • Ensure proper documentation

Detection and Alerting

Alert Sources

| Source | Monitors | Escalation |
|--------|----------|------------|
| Cloud Monitoring | GCP metrics, uptime | PagerDuty |
| Datadog | APM, custom metrics | PagerDuty |
| Sentry | Application errors | Slack + PagerDuty |
| Cloudflare | Edge health, WAF | Slack + PagerDuty |
| Synthetic Monitoring | End-user flows | PagerDuty |

Critical Alerts

┌─────────────────────────────────────────────────────────────────┐
│ Alert: High Error Rate                                          │
├─────────────────────────────────────────────────────────────────┤
│ Service:       api-gateway                                      │
│ Metric:        error_rate > 10%                                 │
│ Current Value: 15.3%                                            │
│ Duration:      5 minutes                                        │
│ Impact:        SEV-2                                            │
│ Runbook:       /docs/operations/runbooks/incident-response      │
│                                                                 │
│ Recent Deployments:                                             │
│ - api-gateway v1.24.5 (deployed 15 min ago)                     │
│ - platform-service v2.8.1 (deployed 2 hours ago)                │
│                                                                 │
│ Quick Actions:                                                  │
│ [View Logs] [View Traces] [Rollback] [Escalate]                 │
└─────────────────────────────────────────────────────────────────┘

Initial Response

Step 1: Acknowledge (< 5 minutes)

  1. Acknowledge in PagerDuty

    Click "Acknowledge" in PagerDuty mobile app or web UI
  2. Join Slack channel

    #incident-response (for all incidents)
    #incident-sev1 (for SEV-1 only)
  3. Post initial status

    @here SEV-2 incident: High error rate on api-gateway
    Investigating. Will provide update in 15 minutes.

Step 2: Triage (< 15 minutes)

  1. Identify scope

    • Which services are affected?
    • Which regions?
    • How many users impacted?
  2. Check recent changes

    # View recent deployments
    gcloud run revisions list --service=api-gateway --limit=5

    # View recent config changes
    git log --oneline --since="2 hours ago" -- infrastructure/
  3. Determine severity

    • Update severity if initial classification was wrong
    • Escalate if SEV-1 or SEV-2
  4. Initial hypothesis

    • Recent deployment?
    • External dependency issue?
    • Traffic spike?
    • Infrastructure problem?
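To make the "recent deployment?" hypothesis check concrete, the triage step can be scripted. This is a minimal sketch: the `deploy_suspect` helper and its 30-minute default window are assumptions, not an existing tool. Feed it a deployment's creation time (as a Unix epoch) taken from `gcloud run revisions list`.

```shell
# deploy_suspect DEPLOY_EPOCH [WINDOW_MINUTES] — flag a deployment as a
# rollback candidate when it happened within the window (default 30 min).
deploy_suspect() {
  local deploy_epoch=$1 window=${2:-30}
  local age=$(( ( $(date +%s) - deploy_epoch ) / 60 ))
  if [ "$age" -le "$window" ]; then
    echo "suspect: deployed ${age}m ago"
  else
    echo "unlikely: deployed ${age}m ago"
  fi
}
```

A deployment that landed just before the alert fired is the strongest rollback candidate; anything hours old is probably not the trigger.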

Step 3: Engage Team (SEV-1/SEV-2)

  1. Page Incident Commander

    In PagerDuty: Escalate to "Incident Commander" schedule
  2. Start incident bridge

    Google Meet: meet.google.com/inc-olympus-cloud
    Or use Slack Huddle in #incident-sev1
  3. Assign roles

    • Incident Commander: Coordinates response
    • Communications: Updates stakeholders
    • Technical Lead: Drives investigation
    • Scribe: Documents timeline

Investigation

Log Analysis

# View Cloud Run logs for errors
gcloud logging read "resource.type=cloud_run_revision AND severity>=ERROR" \
--limit=50 --format=json

# View specific service logs
gcloud logging read "resource.labels.service_name=api-gateway" \
--limit=100 --format="table(timestamp,severity,jsonPayload.message)"

# Search for specific error patterns
gcloud logging read 'textPayload=~"database connection"' \
--limit=50
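To see which errors dominate, the JSON output from the commands above can be piped through `jq` (assumed installed). `top_errors` is a hypothetical helper name used here for illustration.

```shell
# top_errors — read a JSON array of log entries on stdin and print a
# count-sorted tally of error messages (jsonPayload.message, falling
# back to textPayload).
top_errors() {
  jq -r '.[] | (.jsonPayload.message // .textPayload // "unknown")' \
    | sort | uniq -c | sort -rn
}

# Example:
# gcloud logging read "severity>=ERROR" --limit=200 --format=json | top_errors
```

The most frequent message is usually the primary failure; lower-count entries are often downstream noise.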

Metrics Analysis

  1. Check dashboards

    • Main dashboard: https://console.cloud.google.com/monitoring/dashboards/custom/olympus-main
    • Service dashboards: Per-service dashboards
  2. Key metrics to check

    • Error rate
    • Latency percentiles (p50, p95, p99)
    • Request rate
    • CPU/Memory utilization
    • Database connections
  3. Correlation

    • Compare metrics across services
    • Look for coincident changes
    • Check external dependencies

Trace Analysis

# Find slow traces
# In Cloud Trace, filter by:
# - Latency > 5s
# - Has error
# - Service name = api-gateway

Common Investigation Patterns

| Symptom | Check First | Common Cause |
|---------|-------------|--------------|
| High error rate | Recent deployments | Bad code deploy |
| High latency | Database metrics | Query performance |
| Connection errors | Network/firewall | Config change |
| Memory issues | Container metrics | Memory leak |
| Intermittent failures | External deps | Third-party issue |

Remediation

Quick Fixes

Rollback Deployment

# List recent revisions
gcloud run revisions list --service=api-gateway

# Rollback to previous revision
gcloud run services update-traffic api-gateway \
--to-revisions=api-gateway-00001-abc=100

# Verify traffic shift
gcloud run services describe api-gateway --format='value(status.traffic)'
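
When scripting a rollback, picking the previous revision can be automated. This sketch assumes `gcloud run revisions list --format='value(metadata.name)'` prints revision names newest-first and that the second entry is the last known good; verify the target revision before shifting traffic.

```shell
# previous_revision — read revision names (newest first, one per line)
# on stdin and print the second one, i.e. the revision that served
# traffic before the current deploy.
previous_revision() {
  awk 'NR==2 { print; exit }'
}

# Example:
# prev=$(gcloud run revisions list --service=api-gateway \
#          --format='value(metadata.name)' | previous_revision)
# gcloud run services update-traffic api-gateway --to-revisions="$prev"=100
```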

Scale Up

# Increase min instances
gcloud run services update api-gateway \
--min-instances=10

# Increase memory/CPU
gcloud run services update api-gateway \
--memory=2Gi --cpu=2

Restart Service

# Force a new revision (and fresh instances) by touching an env var;
# updating traffic flags alone does not create a new revision
gcloud run services update api-gateway \
--update-env-vars=RESTART_TRIGGER="$(date +%s)"

Database Operations (Cloud Spanner)

# Check for long-running transactions
gcloud spanner databases execute-sql olympus-db \
--instance=prod-olympus-spanner \
--sql="SELECT * FROM SPANNER_SYS.TXN_STATS_TOP_10MINUTE ORDER BY AVG_COMMIT_LATENCY_SECONDS DESC LIMIT 10"

# Scale Spanner nodes during incident
gcloud spanner instances update prod-olympus-spanner --nodes=5

Fix Categories

⚠️ Warning

When choosing a remediation strategy, always prefer the lowest-risk option that resolves the issue. Rollback should be the default first response for deployment-related incidents. Hot fixes and database changes during active incidents carry significant risk and require Incident Commander approval for SEV-1/SEV-2 events.

| Category | Action | Risk Level |
|----------|--------|------------|
| Rollback | Revert to last known good | Low |
| Scale | Add capacity | Low |
| Restart | Force new instances | Medium |
| Config change | Update runtime config | Medium |
| Hot fix | Deploy targeted fix | High |
| Database change | Modify data/schema | Very High |

Communication

Internal Updates

Post updates every 15-30 minutes in Slack:

📢 Incident Update (12:45 PM)

Status: INVESTIGATING
Severity: SEV-2
Duration: 30 minutes
Impact: Elevated error rates on order processing

Current actions:
- Identified root cause: database connection pool exhausted
- Scaling up connection pool from 50 to 100
- Monitoring for improvement

Next update: 1:00 PM or when resolved
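
Updates in the format above can be generated consistently with a small formatter. `incident_update` is a hypothetical helper; pipe its output to whatever posts to Slack in your environment.

```shell
# incident_update STATUS SEVERITY DURATION IMPACT — print a Slack-ready
# incident update in the format shown above.
incident_update() {
  printf '📢 Incident Update (%s)\n\nStatus: %s\nSeverity: %s\nDuration: %s\nImpact: %s\n' \
    "$(date +%H:%M)" "$1" "$2" "$3" "$4"
}

# Example:
# incident_update INVESTIGATING SEV-2 "30 minutes" "Elevated error rates on order processing"
```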

External Communications (SEV-1/SEV-2)

  1. Status page update

    https://status.olympuscloud.ai
  2. Customer notification template

    Subject: [Olympus Cloud] Service Degradation - {Date}

    We are currently experiencing elevated error rates affecting
    [affected services]. Our team is actively investigating and
    working to resolve the issue.

    Impact: [Describe customer impact]
    Workaround: [If available]

    We will provide updates every 30 minutes until resolved.
  3. Resolution notification

    Subject: [Olympus Cloud] Service Restored - {Date}

    The service degradation has been resolved. All systems are
    now operating normally.

    Root cause: [Brief description]
    Duration: [X hours Y minutes]

    A detailed post-incident review will be conducted and
    shared within 3 business days.

Resolution

Verification Steps

  1. Confirm fix applied

    • Verify deployment succeeded
    • Check configuration changes took effect
  2. Monitor metrics

    • Error rate returning to baseline
    • Latency returning to baseline
    • No new error patterns
  3. Test functionality

    • Run smoke tests
    • Verify critical user flows
    • Check affected integrations
  4. Hold period

    • Monitor for 15-30 minutes after fix
    • Ensure no regression
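The "returning to baseline" check can be made concrete during the hold period. This sketch is an assumption, not an existing tool: the `recovered` name and the default 1-percentage-point tolerance are illustrative.

```shell
# recovered CURRENT BASELINE [TOLERANCE] — report whether the current
# error rate (percent, decimals allowed) is back within TOLERANCE
# percentage points of the baseline (default 1.0).
recovered() {
  awk -v cur="$1" -v base="$2" -v tol="${3:-1.0}" \
    'BEGIN { print (cur <= base + tol) ? "recovered" : "still elevated" }'
}
```

Run it a few times across the 15-30 minute hold window rather than once; a single good reading can mask a regression.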

Closing the Incident

  1. Update PagerDuty

    • Mark incident as resolved
    • Add resolution notes
  2. Update Slack

    ✅ Incident Resolved

    Duration: 1 hour 15 minutes
    Root cause: Database connection pool exhaustion due to
    connection leak in order-service

    Resolution: Deployed hotfix v1.24.6 with connection leak fix

    Post-incident review scheduled for [date/time]
  3. Update status page

    • Mark incident as resolved
    • Add brief resolution summary

Post-Incident

Post-Incident Review (PIR)

For SEV-1 and SEV-2 incidents, conduct a PIR within 3 business days.

PIR Template

# Post-Incident Review: [Incident Title]

## Summary
- **Date**:
- **Duration**:
- **Severity**:
- **Impact**:

## Timeline
| Time | Event |
|------|-------|
| 10:00 | Alert fired |
| 10:05 | On-call acknowledged |
| ... | ... |

## Root Cause
[Detailed root cause analysis]

## Contributing Factors
1. [Factor 1]
2. [Factor 2]

## Impact
- Users affected:
- Revenue impact:
- Data loss:

## Resolution
[How was it fixed]

## Action Items
| Action | Owner | Due Date |
|--------|-------|----------|
| Add monitoring for X | @engineer | 2026-01-25 |

## Lessons Learned
- What went well
- What could be improved

Action Items

Track action items in GitHub Issues with the postmortem label:

gh issue create \
--title "PIR: Add connection pool monitoring" \
--label "postmortem,priority:high" \
--body "Action item from incident on 2026-01-18..."

Appendix

Contact Information

| Role | Contact |
|------|---------|
| On-Call Primary | PagerDuty schedule |
| On-Call Secondary | PagerDuty schedule |
| Incident Commander | @ic-rotation |
| Security | @security-team |
| Database Admin | @dba-team |

Emergency Contacts

For critical vendor issues:

  • GCP Support: 1-855-831-3592 (Enterprise)
  • Cloudflare: Enterprise support portal
  • Stripe: Dashboard escalation