Disaster Recovery Runbook
Complete disaster recovery procedures for Olympus Cloud infrastructure.
Overview
This document outlines the disaster recovery (DR) procedures for the Olympus Cloud platform. All critical services are designed for high availability and rapid recovery.
DR Philosophy
┌─────────────────────────────────────────────────────────────────┐
│ Disaster Recovery Layers │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Layer 1: High Availability (HA) │
│ ├── Multi-zone deployment │
│ ├── Auto-scaling and self-healing │
│ └── Load balancing with health checks │
│ │
│ Layer 2: Regional Resilience │
│ ├── Multi-region database replication │
│ ├── Cross-region backups │
│ └── Regional failover capability │
│ │
│ Layer 3: Disaster Recovery │
│ ├── Point-in-time recovery │
│ ├── Cold standby infrastructure │
│ └── Documented recovery procedures │
│ │
└─────────────────────────────────────────────────────────────────┘
Recovery Objectives
RTO and RPO Targets
| Service Tier | RTO (Recovery Time) | RPO (Recovery Point) | Examples |
|---|---|---|---|
| Tier 1 (Critical) | 1 hour | 5 minutes | Payment, Orders |
| Tier 2 (High) | 4 hours | 1 hour | Platform, AI Services |
| Tier 3 (Medium) | 8 hours | 4 hours | Analytics, Reporting |
| Tier 4 (Low) | 24 hours | 24 hours | Logs, Archives |
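For drill tooling and automated checks, the tier targets above can be encoded as a simple lookup. This is an illustrative sketch, not an existing Olympus utility; the function names are hypothetical.

```shell
# Hypothetical helpers mapping a service tier (1-4) to its RTO/RPO target
# in minutes, mirroring the targets table above.
tier_rto_minutes() {
  case "$1" in
    1) echo 60 ;;      # Tier 1: 1 hour
    2) echo 240 ;;     # Tier 2: 4 hours
    3) echo 480 ;;     # Tier 3: 8 hours
    4) echo 1440 ;;    # Tier 4: 24 hours
    *) echo "unknown tier: $1" >&2; return 1 ;;
  esac
}

tier_rpo_minutes() {
  case "$1" in
    1) echo 5 ;;       # Tier 1: 5 minutes
    2) echo 60 ;;      # Tier 2: 1 hour
    3) echo 240 ;;     # Tier 3: 4 hours
    4) echo 1440 ;;    # Tier 4: 24 hours
    *) echo "unknown tier: $1" >&2; return 1 ;;
  esac
}
```

A drill script could then assert, for example, `[ "$MEASURED_RTO" -le "$(tier_rto_minutes 1)" ]` for a Tier 1 service.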
Service Classification
| Service | Tier | RTO | RPO |
|---|---|---|---|
| Order Processing | 1 | 1 hour | 5 min |
| Payment Gateway | 1 | 1 hour | 5 min |
| User Authentication | 1 | 1 hour | 15 min |
| Platform Service | 2 | 4 hours | 1 hour |
| AI Services | 2 | 4 hours | 1 hour |
| Analytics | 3 | 8 hours | 4 hours |
| Reporting | 3 | 8 hours | 4 hours |
Disaster Scenarios
Scenario 1: Single Service Failure
Impact: One service unavailable, others functioning
Recovery Steps:
- Auto-healing should restart failed instances
- If not recovered in 5 minutes:
# Force redeploy
gcloud run services update SERVICE --no-traffic
gcloud run services update-traffic SERVICE --to-latest
- If still failing, roll back:
gcloud run services update-traffic SERVICE \
  --to-revisions=PREVIOUS_REVISION=100
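The 5-minute auto-healing window above can be enforced mechanically rather than by eyeballing a clock. A minimal sketch, assuming the failure-start timestamp comes from monitoring; the function name and threshold handling are hypothetical.

```shell
# Decide whether to escalate from auto-healing to a forced redeploy.
# Args: epoch seconds when the failure began, current epoch seconds.
# Prints "redeploy" once the 5-minute (300 s) auto-healing window has elapsed.
should_force_redeploy() {
  failure_start=$1
  now=$2
  if [ $(( now - failure_start )) -ge 300 ]; then
    echo "redeploy"
  else
    echo "wait"
  fi
}
```

In practice the first argument would come from the alert timestamp, e.g. `should_force_redeploy "$ALERT_TS" "$(date +%s)"`.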
Scenario 2: Database Failure
Impact: Data unavailable, services degraded or down
Recovery Steps:
For Cloud Spanner:
- Check instance status:
gcloud spanner instances describe prod-olympus-spanner
- If the outage is regional, wait for GCP recovery
- For data corruption, restore from backup:
gcloud spanner databases create olympus-db-restored \
--instance=prod-olympus-spanner \
--source-backup=olympus-db-backup-latest
For Cloud SQL (DEPRECATED — Cloud SQL replaced by Cloud Spanner):
Cloud SQL is no longer in use. All OLTP data is in Cloud Spanner. These procedures are retained for reference only.
Scenario 3: Regional Outage
Impact: Primary region unavailable
Recovery Steps:
- Verify regional outage via GCP Status Dashboard
- Initiate regional failover:
# Update DNS to point to DR region
gcloud dns record-sets update api.olympuscloud.ai \
--type=A \
--zone=olympuscloud-zone \
--rrdatas=DR_REGION_IP
- Scale up the DR region:
gcloud run services update api-gateway \
--region=us-east1 \
--min-instances=10
- Verify database replication:
gcloud spanner instances describe olympus-dr
Scenario 4: Complete Infrastructure Failure
Impact: All services unavailable
Recovery Steps:
- Assess scope of failure
- Activate DR team and incident bridge
- Deploy from cold standby:
# Apply Terraform to DR environment
cd infrastructure/terraform
terraform workspace select dr
terraform apply -auto-approve
- Restore databases from cross-region backups
- Update DNS to DR endpoints
- Verify functionality with smoke tests
Scenario 5: Data Corruption
Impact: Corrupted data affecting service integrity
Recovery Steps:
- Identify corruption scope:
- Which tables/records affected?
- When did corruption occur?
- Stop writes to affected data:
# Enable read-only mode
gcloud run services update order-service \
--set-env-vars="READ_ONLY_MODE=true"
- Determine the recovery point:
- Point-in-time recovery to before corruption
- Or restore from last known good backup
- Restore data using Spanner point-in-time recovery (Cloud SQL is decommissioned):
# Create a backup of the database as of a timestamp before the corruption
gcloud spanner backups create olympus-db-pitr \
  --instance=prod-olympus-spanner \
  --database=olympus-db \
  --version-time="2026-01-18T10:00:00Z" \
  --retention-period=7d
# Restore into a new database from that backup
gcloud spanner databases create olympus-db-restored \
  --instance=prod-olympus-spanner \
  --source-backup=olympus-db-pitr
- Validate data integrity
- Resume normal operations
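Choosing the recovery point usually means backing off a safety margin from the first observed corruption. A sketch of that computation, assuming GNU date (Linux); the 10-minute default margin is illustrative, not an Olympus standard.

```shell
# Compute a point-in-time recovery timestamp a fixed safety margin before
# the first observed corruption. Requires GNU date.
# Args: corruption timestamp (ISO 8601 UTC), margin in minutes (default 10).
recovery_timestamp() {
  corruption_ts=$1
  margin_minutes=${2:-10}
  date -u -d "$corruption_ts - $margin_minutes minutes" +"%Y-%m-%dT%H:%M:%SZ"
}
```

The result can feed directly into `--version-time` on `gcloud spanner backups create`.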
Backup Strategy
Backup Schedules
| Data Type | Frequency | Retention | Location |
|---|---|---|---|
| Spanner | Continuous (PITR) | 7 days | Same region |
| Spanner Backups | Daily | 30 days | Cross-region |
| Cloud SQL (deprecated) | — | — | — |
| Object Storage | Continuous | 90 days | Multi-region |
| Config/Secrets | On change | Forever | Git/Secret Manager |
Backup Verification
Weekly Checks:
# Verify Spanner backups
gcloud spanner backups list --instance=prod-olympus-spanner
# Cloud SQL decommissioned — no backups to verify
# Verify storage replication
gsutil ls -L gs://olympus-backups-dr/
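The weekly check can go further than "backups exist" and assert freshness. The comparison itself can be isolated in a small testable helper; feeding it the real create time (as shown in the comment) is the only gcloud-dependent step. Function name and threshold are hypothetical.

```shell
# Assert the most recent backup is newer than a maximum age.
# Args: backup create time (epoch seconds), now (epoch seconds), max age in hours.
# In practice the create time would be taken from something like:
#   gcloud spanner backups list --instance=prod-olympus-spanner \
#     --sort-by=~createTime --limit=1 --format="value(createTime)"
backup_is_fresh() {
  created=$1
  now=$2
  max_age_hours=$3
  age=$(( now - created ))
  [ "$age" -le $(( max_age_hours * 3600 )) ]
}
```

A weekly cron job could alert whenever `backup_is_fresh` fails for the daily backup (max age 24 hours, per the schedule above).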
Monthly Restoration Test:
- Restore database to test instance
- Verify data integrity
- Run application tests
- Document results
Recovery Procedures
Database Recovery
Cloud Spanner Point-in-Time Recovery:
# Identify recovery timestamp (before incident)
RECOVERY_TIME="2026-01-18T10:00:00Z"
# Create a backup of the database as of that timestamp
# Use the appropriate instance: dev-olympus-spanner, staging-olympus-spanner, or prod-olympus-spanner
gcloud spanner backups create olympus-db-backup-20260118 \
  --instance=prod-olympus-spanner \
  --database=olympus-db \
  --version-time="$RECOVERY_TIME" \
  --retention-period=7d
# Create new database from that backup
gcloud spanner databases create olympus-db-recovered \
  --instance=prod-olympus-spanner \
  --source-backup=olympus-db-backup-20260118
# Verify recovery
gcloud spanner databases execute-sql olympus-db-recovered \
--instance=prod-olympus-spanner \
--sql="SELECT COUNT(*) FROM orders"
Cloud SQL (DEPRECATED — replaced by Cloud Spanner):
Cloud SQL PITR procedures no longer apply. Use Spanner PITR above.
Service Recovery
Redeploy All Services:
# Trigger full redeploy via CI/CD
gh workflow run deploy.yml -f environment=production
# Or manually for each service
for service in api-gateway platform-service order-service user-service; do
gcloud run services update $service --region=us-central1 --no-traffic
gcloud run services update-traffic $service --to-latest
done
Rollback to Known Good State:
# Get last known good revision
gcloud run revisions list --service=api-gateway --limit=5
# Rollback each service
gcloud run services update-traffic api-gateway \
--to-revisions=api-gateway-20260117-abc123=100
DNS Failover
Failover to DR Region:
# Update Cloud DNS
gcloud dns record-sets update api.olympuscloud.ai \
--type=A \
--zone=olympuscloud-zone \
--rrdatas=DR_LOAD_BALANCER_IP \
--ttl=60
# Update Cloudflare (if using)
curl -X PATCH "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records/$RECORD_ID" \
  -H "Authorization: Bearer $CF_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"content": "DR_IP"}'
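After updating records, confirm resolvers are actually serving the DR address before declaring failover complete. The comparison can be kept as a pure helper; the comment shows how the observed value would typically be obtained. Function name is hypothetical.

```shell
# Check whether the observed A record matches the expected DR address.
# Args: observed IP, expected DR IP.
# In practice the observed value would come from something like:
#   dig +short api.olympuscloud.ai @8.8.8.8 | head -n1
dns_failover_complete() {
  observed=$1
  expected=$2
  [ "$observed" = "$expected" ]
}
```

With the 60 s failover TTL, a polling loop around this check should converge within a minute or two of the record update.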
DR Testing
Test Schedule
| Test Type | Frequency | Scope |
|---|---|---|
| Backup Verification | Weekly | Verify backups exist |
| Restoration Test | Monthly | Restore single database |
| Service Recovery | Quarterly | Redeploy services |
| Regional Failover | Annually | Full DR drill |
Quarterly DR Drill
Objectives:
- Verify DR runbooks are current
- Test team response time
- Identify gaps in procedures
- Update documentation
Drill Steps:
- Schedule drill (notify stakeholders)
- Simulate failure scenario
- Execute recovery procedures
- Measure RTO/RPO achieved
- Document lessons learned
- Update runbooks
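The "Measure RTO/RPO achieved" step can be scripted from the timestamps recorded during the drill. A sketch, assuming GNU date (Linux) and ISO 8601 timestamps; the function name is hypothetical.

```shell
# Compute achieved RTO in whole minutes from the drill start and
# full-recovery timestamps (ISO 8601 UTC). Requires GNU date.
achieved_rto_minutes() {
  start=$(date -u -d "$1" +%s)
  end=$(date -u -d "$2" +%s)
  echo $(( (end - start) / 60 ))
}
```

The result can be compared against the tier target, e.g. `[ "$(achieved_rto_minutes "$START" "$END")" -le 60 ]` for a Tier 1 service.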
DR Drill Checklist
# DR Drill Report: [Date]
## Scenario
- [ ] Scenario type: [Service/Database/Regional/Full]
- [ ] Systems tested: [List]
## Execution
- [ ] Drill start time: [Time]
- [ ] Recovery initiated: [Time]
- [ ] Full recovery: [Time]
## Metrics
- [ ] Actual RTO: [Duration]
- [ ] Target RTO: [Duration]
- [ ] RTO Met: [Yes/No]
- [ ] Actual RPO: [Duration]
- [ ] Target RPO: [Duration]
- [ ] RPO Met: [Yes/No]
## Issues Encountered
1. [Issue 1]
2. [Issue 2]
## Action Items
1. [Action 1] - Owner - Due Date
2. [Action 2] - Owner - Due Date
## Participants
- [Name 1] - Role
- [Name 2] - Role
Communication
DR Team Contacts
| Role | Primary | Backup |
|---|---|---|
| DR Coordinator | @dr-coord | @dr-coord-backup |
| Database Lead | @dba-lead | @dba-backup |
| Infrastructure Lead | @infra-lead | @infra-backup |
| Application Lead | @app-lead | @app-backup |
| Communications | @comms-lead | @comms-backup |
Escalation Path
Level 1: On-Call Engineer
│
▼ (if DR needed)
Level 2: DR Coordinator
│
▼ (if major incident)
Level 3: VP Engineering
│
▼ (if company-wide impact)
Level 4: CEO
Communication Templates
DR Initiation:
Subject: [DR INITIATED] Olympus Cloud - [Date/Time]
A disaster recovery event has been initiated.
Scenario: [Description]
Impact: [Affected services]
ETA: [Expected recovery time]
DR Team has been activated. Updates every 30 minutes.
Bridge: [Meeting link]
DR Status Update:
Subject: [DR UPDATE] Olympus Cloud - [Date/Time]
Status: [In Progress / Recovering / Resolved]
Duration: [Time since initiation]
Progress:
- [Completed step 1]
- [Completed step 2]
- [In progress: step 3]
Next update: [Time]
DR Resolution:
Subject: [DR RESOLVED] Olympus Cloud - [Date/Time]
The disaster recovery event has been resolved.
Duration: [Total time]
Services Restored: [List]
Data Loss: [None / Description]
Post-incident review scheduled for [Date].
Infrastructure Documentation
Primary Region (us-central1)
| Resource | ID/Name |
|---|---|
| VPC | olympus-vpc-prod |
| Subnet | olympus-subnet-central |
| Cloud Run | api-gateway, platform-service, etc. |
| Spanner | prod-olympus-spanner |
| Load Balancer | olympus-lb-prod |
DR Region (us-east1)
| Resource | ID/Name |
|---|---|
| VPC | olympus-vpc-dr |
| Subnet | olympus-subnet-east |
| Cloud Run | (deployed on demand) |
| Spanner | olympus-dr (replica) |
| Load Balancer | olympus-lb-dr |
Critical Configurations
# DR Configuration Summary
primary_region: us-central1
dr_region: us-east1
databases:
  spanner:
    production: prod-olympus-spanner
    dr_replica: olympus-dr
    backup_location: us
    backup_retention: 30d
  # Cloud SQL decommissioned; replaced by Cloud Spanner
dns:
  provider: cloudflare
  ttl: 300
  failover_ttl: 60
Post-DR Actions
Immediate (Within 24 hours)
- Verify all services operational
- Check data integrity
- Review metrics and logs
- Send the all-clear and close out incident communications
Short-term (Within 1 week)
- Conduct post-incident review
- Document lessons learned
- Update runbooks if needed
- File tickets for improvements
Long-term (Within 1 month)
- Implement improvements from PIR
- Update DR testing schedule
- Review backup strategies
- Train team on any new procedures
Related Documentation
- Incident Response - For incident handling
- Database Operations - Database recovery details
- Scaling - Scaling DR resources
- On-Call Guide - Escalation procedures