Disaster Recovery Runbook

Complete disaster recovery procedures for Olympus Cloud infrastructure.

Overview

This document outlines the disaster recovery (DR) procedures for the Olympus Cloud platform. All critical services are designed for high availability and rapid recovery.

DR Philosophy

┌──────────────────────────────────────────────────┐
│             Disaster Recovery Layers             │
├──────────────────────────────────────────────────┤
│                                                  │
│  Layer 1: High Availability (HA)                 │
│  ├── Multi-zone deployment                       │
│  ├── Auto-scaling and self-healing               │
│  └── Load balancing with health checks           │
│                                                  │
│  Layer 2: Regional Resilience                    │
│  ├── Multi-region database replication           │
│  ├── Cross-region backups                        │
│  └── Regional failover capability                │
│                                                  │
│  Layer 3: Disaster Recovery                      │
│  ├── Point-in-time recovery                      │
│  ├── Cold standby infrastructure                 │
│  └── Documented recovery procedures              │
│                                                  │
└──────────────────────────────────────────────────┘

Recovery Objectives

RTO and RPO Targets

| Service Tier      | RTO (Recovery Time) | RPO (Recovery Point) | Examples              |
|-------------------|---------------------|----------------------|-----------------------|
| Tier 1 (Critical) | 1 hour              | 5 minutes            | Payment, Orders       |
| Tier 2 (High)     | 4 hours             | 1 hour               | Platform, AI Services |
| Tier 3 (Medium)   | 8 hours             | 4 hours              | Analytics, Reporting  |
| Tier 4 (Low)      | 24 hours            | 24 hours             | Logs, Archives        |

Service Classification

| Service             | Tier | RTO     | RPO     |
|---------------------|------|---------|---------|
| Order Processing    | 1    | 1 hour  | 5 min   |
| Payment Gateway     | 1    | 1 hour  | 5 min   |
| User Authentication | 1    | 1 hour  | 15 min  |
| Platform Service    | 2    | 4 hours | 1 hour  |
| AI Services         | 2    | 4 hours | 1 hour  |
| Analytics           | 3    | 8 hours | 4 hours |
| Reporting           | 3    | 8 hours | 4 hours |

Disaster Scenarios

Scenario 1: Single Service Failure

Impact: One service unavailable, others functioning

Recovery Steps:

  1. Auto-healing should restart failed instances
  2. If not recovered in 5 minutes:
    # Force redeploy
    gcloud run services update SERVICE --no-traffic
    gcloud run services update-traffic SERVICE --to-latest
  3. If still failing, roll back:
    gcloud run services update-traffic SERVICE \
      --to-revisions=PREVIOUS_REVISION=100
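
Whichever step restores the service, confirm it is actually serving before closing out. A minimal check (SERVICE is a placeholder; the region is assumed to be us-central1):

# Inspect the Ready conditions and the live traffic split
gcloud run services describe SERVICE \
  --region=us-central1 \
  --format="yaml(status.conditions, status.traffic)"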

Scenario 2: Database Failure

Impact: Data unavailable, services degraded or down

Recovery Steps:

For Cloud Spanner:

  1. Check instance status:
    gcloud spanner instances describe prod-olympus-spanner
  2. If regional outage, wait for GCP recovery
  3. For data corruption, restore from backup (then repoint services, as sketched below):
    gcloud spanner databases restore \
      --source-instance=prod-olympus-spanner \
      --source-backup=olympus-db-backup-latest \
      --destination-instance=prod-olympus-spanner \
      --destination-database=olympus-db-restored
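
Because the restore lands in a new database, services must be repointed at it. A minimal sketch, assuming services read the database ID from a SPANNER_DATABASE environment variable (hypothetical; match your actual service configuration):

# Repoint order-service at the restored database
# (SPANNER_DATABASE is an assumed env var name)
gcloud run services update order-service \
  --region=us-central1 \
  --update-env-vars="SPANNER_DATABASE=olympus-db-restored"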

For Cloud SQL (DEPRECATED; replaced by Cloud Spanner):

Cloud SQL is no longer in use. All OLTP data is in Cloud Spanner. These procedures are retained for reference only.

Scenario 3: Regional Outage

Impact: Primary region unavailable

Recovery Steps:

  1. Verify regional outage via GCP Status Dashboard
  2. Initiate regional failover:
    # Update DNS to point to the DR region
    gcloud dns record-sets update api.olympuscloud.ai \
      --type=A \
      --zone=olympuscloud-zone \
      --rrdatas=DR_REGION_IP \
      --ttl=60
  3. Scale up DR region:
    gcloud run services update api-gateway \
      --region=us-east1 \
      --min-instances=10
  4. Verify database replication:
    # Spanner instances are located by instance config; no --region flag
    gcloud spanner instances describe olympus-dr
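
Once DNS has propagated, an end-to-end check through the public hostname confirms the DR region is serving:

# Expect an HTTP success status served from the DR region
curl -fsSI https://api.olympuscloud.ai | head -n 1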

Scenario 4: Complete Infrastructure Failure

Impact: All services unavailable

Recovery Steps:

  1. Assess scope of failure
  2. Activate DR team and incident bridge
  3. Deploy from cold standby:
    # Apply Terraform to DR environment
    cd infrastructure/terraform
    terraform workspace select dr
    terraform apply -auto-approve
  4. Restore databases from cross-region backups
  5. Update DNS to DR endpoints
  6. Verify functionality with smoke tests (see the sketch below)
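
For the smoke tests in step 6, probing each service's health endpoint is often enough to confirm basic functionality. A minimal sketch, assuming each service exposes a /health endpoint (hypothetical; substitute your real checks):

# Probe each redeployed service in the DR region
for service in api-gateway platform-service order-service user-service; do
  url=$(gcloud run services describe $service --region=us-east1 \
    --format="value(status.url)")
  curl -fsS "$url/health" > /dev/null && echo "$service OK" || echo "$service FAILED"
done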

Scenario 5: Data Corruption

Impact: Corrupted data affecting service integrity

Recovery Steps:

  1. Identify corruption scope:
    • Which tables/records affected?
    • When did corruption occur?
  2. Stop writes to affected data:
    # Enable read-only mode
    gcloud run services update order-service \
      --set-env-vars="READ_ONLY_MODE=true"
  3. Determine recovery point:
    • Point-in-time recovery to before corruption
    • Or restore from last known good backup
  4. Restore data (Cloud SQL is decommissioned; use Spanner point-in-time recovery):
    # Back up the database at a pre-corruption timestamp
    # (production database name assumed to be olympus-db)
    gcloud spanner backups create olympus-db-pitr \
      --instance=prod-olympus-spanner \
      --database=olympus-db \
      --retention-period=30d \
      --version-time="2026-01-18T10:00:00Z"
    # Restore the backup to a new database
    gcloud spanner databases restore \
      --source-instance=prod-olympus-spanner \
      --source-backup=olympus-db-pitr \
      --destination-instance=prod-olympus-spanner \
      --destination-database=olympus-db-restored
  5. Validate data integrity (see the sketch below)
  6. Resume normal operations
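
For step 5, a quick integrity spot-check is to compare row counts between the live and restored databases (orders is an example table; the live database name olympus-db is assumed):

# Compare counts; investigate any large discrepancy before cutover
gcloud spanner databases execute-sql olympus-db \
  --instance=prod-olympus-spanner \
  --sql="SELECT COUNT(*) FROM orders"
gcloud spanner databases execute-sql olympus-db-restored \
  --instance=prod-olympus-spanner \
  --sql="SELECT COUNT(*) FROM orders"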

Backup Strategy

Backup Schedules

| Data Type              | Frequency         | Retention | Location           |
|------------------------|-------------------|-----------|--------------------|
| Spanner                | Continuous (PITR) | 7 days    | Same region        |
| Spanner Backups        | Daily             | 30 days   | Cross-region       |
| Cloud SQL (deprecated) | n/a               | n/a       | n/a                |
| Object Storage         | Continuous        | 90 days   | Multi-region       |
| Config/Secrets         | On change         | Forever   | Git/Secret Manager |

Backup Verification

Weekly Checks:

# Verify Spanner backups
gcloud spanner backups list --instance=prod-olympus-spanner

# Cloud SQL decommissioned — no backups to verify

# Verify storage replication
gsutil ls -L gs://olympus-backups-dr/
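
Listing backups confirms they exist, not that they are recent. A minimal freshness check, flagging a newest backup older than 25 hours (assumes GNU date for timestamp parsing):

# Alert if the most recent Spanner backup is stale
latest=$(gcloud spanner backups list --instance=prod-olympus-spanner \
  --sort-by=~createTime --limit=1 --format="value(createTime)")
age_h=$(( ($(date +%s) - $(date -d "$latest" +%s)) / 3600 ))
[ "$age_h" -le 25 ] && echo "backup OK (${age_h}h old)" || echo "backup STALE (${age_h}h old)"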

Monthly Restoration Test:

  1. Restore the database to a test instance (sketched below)
  2. Verify data integrity
  3. Run application tests
  4. Document results
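
A sketch of step 1, restoring the latest production backup into the staging instance and cleaning up afterwards. It assumes staging-olympus-spanner shares an instance configuration with production (Spanner requires this for cross-instance restores); adjust to your topology:

# Find the newest backup and restore it to a throwaway test database
backup=$(gcloud spanner backups list --instance=prod-olympus-spanner \
  --sort-by=~createTime --limit=1 --format="value(name.basename())")
gcloud spanner databases restore \
  --source-instance=prod-olympus-spanner \
  --source-backup="$backup" \
  --destination-instance=staging-olympus-spanner \
  --destination-database=olympus-db-restore-test

# ...run integrity checks and application tests here...

# Clean up the test database
gcloud spanner databases delete olympus-db-restore-test \
  --instance=staging-olympus-spanner --quiet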

Recovery Procedures

Database Recovery

Cloud Spanner Point-in-Time Recovery:

# Identify recovery timestamp (before incident)
RECOVERY_TIME="2026-01-18T10:00:00Z"

# Create a backup at that timestamp, then restore it to a new database
# Use the appropriate instance: dev-olympus-spanner, staging-olympus-spanner, or prod-olympus-spanner
# (production database name assumed to be olympus-db)
gcloud spanner backups create olympus-db-backup-20260118 \
  --instance=prod-olympus-spanner \
  --database=olympus-db \
  --retention-period=30d \
  --version-time="$RECOVERY_TIME"

gcloud spanner databases restore \
  --source-instance=prod-olympus-spanner \
  --source-backup=olympus-db-backup-20260118 \
  --destination-instance=prod-olympus-spanner \
  --destination-database=olympus-db-recovered

# Verify recovery
gcloud spanner databases execute-sql olympus-db-recovered \
  --instance=prod-olympus-spanner \
  --sql="SELECT COUNT(*) FROM orders"

Cloud SQL (DEPRECATED — replaced by Cloud Spanner):

Cloud SQL PITR procedures no longer apply. Use Spanner PITR above.

Service Recovery

Redeploy All Services:

# Trigger full redeploy via CI/CD
gh workflow run deploy.yml -f environment=production

# Or manually for each service
for service in api-gateway platform-service order-service user-service; do
  gcloud run services update $service --region=us-central1 --no-traffic
  gcloud run services update-traffic $service --to-latest
done

Rollback to Known Good State:

# Get last known good revision
gcloud run revisions list --service=api-gateway --limit=5

# Rollback each service
gcloud run services update-traffic api-gateway \
  --to-revisions=api-gateway-20260117-abc123=100

DNS Failover

Failover to DR Region:

# Update Cloud DNS
gcloud dns record-sets update api.olympuscloud.ai \
  --type=A \
  --zone=olympuscloud-zone \
  --rrdatas=DR_LOAD_BALANCER_IP \
  --ttl=60

# Update Cloudflare (if using)
curl -X PATCH "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records/$RECORD_ID" \
  -H "Authorization: Bearer $CF_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"content": "DR_IP"}'
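
With either provider, confirm the record has actually moved before declaring failover complete (assumes dig is available):

# Query a public resolver; repeat until the old 300s TTL has expired
dig +short api.olympuscloud.ai @1.1.1.1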

DR Testing

Test Schedule

| Test Type           | Frequency | Scope                   |
|---------------------|-----------|-------------------------|
| Backup Verification | Weekly    | Verify backups exist    |
| Restoration Test    | Monthly   | Restore single database |
| Service Recovery    | Quarterly | Redeploy services       |
| Regional Failover   | Annually  | Full DR drill           |

Quarterly DR Drill

Objectives:

  1. Verify DR runbooks are current
  2. Test team response time
  3. Identify gaps in procedures
  4. Update documentation

Drill Steps:

  1. Schedule drill (notify stakeholders)
  2. Simulate failure scenario
  3. Execute recovery procedures
  4. Measure RTO/RPO achieved
  5. Document lessons learned
  6. Update runbooks

DR Drill Checklist

# DR Drill Report: [Date]

## Scenario
- [ ] Scenario type: [Service/Database/Regional/Full]
- [ ] Systems tested: [List]

## Execution
- [ ] Drill start time: [Time]
- [ ] Recovery initiated: [Time]
- [ ] Full recovery: [Time]

## Metrics
- [ ] Actual RTO: [Duration]
- [ ] Target RTO: [Duration]
- [ ] RTO Met: [Yes/No]

## Issues Encountered
1. [Issue 1]
2. [Issue 2]

## Action Items
1. [Action 1] - Owner - Due Date
2. [Action 2] - Owner - Due Date

## Participants
- [Name 1] - Role
- [Name 2] - Role

Communication

DR Team Contacts

| Role                | Primary     | Backup           |
|---------------------|-------------|------------------|
| DR Coordinator      | @dr-coord   | @dr-coord-backup |
| Database Lead       | @dba-lead   | @dba-backup      |
| Infrastructure Lead | @infra-lead | @infra-backup    |
| Application Lead    | @app-lead   | @app-backup      |
| Communications      | @comms-lead | @comms-backup    |

Escalation Path

Level 1: On-Call Engineer
  ▼ (if DR needed)
Level 2: DR Coordinator
  ▼ (if major incident)
Level 3: VP Engineering
  ▼ (if company-wide impact)
Level 4: CEO

Communication Templates

DR Initiation:

Subject: [DR INITIATED] Olympus Cloud - [Date/Time]

A disaster recovery event has been initiated.

Scenario: [Description]
Impact: [Affected services]
ETA: [Expected recovery time]

DR Team has been activated. Updates every 30 minutes.

Bridge: [Meeting link]

DR Status Update:

Subject: [DR UPDATE] Olympus Cloud - [Date/Time]

Status: [In Progress / Recovering / Resolved]
Duration: [Time since initiation]

Progress:
- [Completed step 1]
- [Completed step 2]
- [In progress: step 3]

Next update: [Time]

DR Resolution:

Subject: [DR RESOLVED] Olympus Cloud - [Date/Time]

The disaster recovery event has been resolved.

Duration: [Total time]
Services Restored: [List]
Data Loss: [None / Description]

Post-incident review scheduled for [Date].

Infrastructure Documentation

Primary Region (us-central1)

| Resource      | ID/Name                             |
|---------------|-------------------------------------|
| VPC           | olympus-vpc-prod                    |
| Subnet        | olympus-subnet-central              |
| Cloud Run     | api-gateway, platform-service, etc. |
| Spanner       | prod-olympus-spanner                |
| Load Balancer | olympus-lb-prod                     |

DR Region (us-east1)

| Resource      | ID/Name              |
|---------------|----------------------|
| VPC           | olympus-vpc-dr       |
| Subnet        | olympus-subnet-east  |
| Cloud Run     | (deployed on demand) |
| Spanner       | olympus-dr (replica) |
| Load Balancer | olympus-lb-dr        |

Critical Configurations

# DR Configuration Summary
primary_region: us-central1
dr_region: us-east1

databases:
  spanner:
    production: prod-olympus-spanner
    dr_replica: olympus-dr
    backup_location: us
    backup_retention: 30d
  # Cloud SQL decommissioned (replaced by Cloud Spanner);
  # its backup_location/backup_retention entries no longer apply

dns:
  provider: cloudflare
  ttl: 300
  failover_ttl: 60

Post-DR Actions

Immediate (Within 24 hours)

  1. Verify all services operational
  2. Check data integrity
  3. Review metrics and logs (see the sketch below)
  4. Clear incident communications
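
For step 3, a quick scan for lingering errors across all Cloud Run services; a minimal sketch using Cloud Logging:

# Surface any errors logged in the last hour
gcloud logging read \
  'resource.type="cloud_run_revision" AND severity>=ERROR' \
  --freshness=1h --limit=20 \
  --format="table(timestamp, resource.labels.service_name, textPayload)"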

Short-term (Within 1 week)

  1. Conduct post-incident review
  2. Document lessons learned
  3. Update runbooks if needed
  4. File tickets for improvements

Long-term (Within 1 month)

  1. Implement improvements from PIR
  2. Update DR testing schedule
  3. Review backup strategies
  4. Train team on any new procedures