Operational Runbooks

Step-by-step procedures for operating, troubleshooting, and maintaining the Olympus Cloud platform.

Core Runbooks

| Runbook | Description |
| --- | --- |
| Incident Response | Severity levels, escalation paths, and resolution workflows |
| Database Operations | Cloud Spanner maintenance, troubleshooting, and emergency procedures |
| On-Call Guide | On-call responsibilities, tools, and procedures |
| Scaling | Scaling procedures for Cloud Run, databases, and edge infrastructure |
| Monitoring & Alerts | Alert configuration, metric interpretation, and dashboard usage |
| Disaster Recovery | RTO/RPO targets, failover procedures, and recovery workflows |
| Deployment | Pre-flight checks, rollout strategies, and rollback procedures |
| Production Security | Security procedures, compliance verification, and security incident response |

Troubleshooting Guides

| Runbook | Description |
| --- | --- |
| Data Sync & IoT | Troubleshooting data synchronization and IoT device issues |
| Vision AI | Camera connectivity, model accuracy, and edge device troubleshooting |

Quick Reference

Severity Levels:

| Level | Response Time | Example |
| --- | --- | --- |
| SEV-1 (Critical) | 15 minutes | Complete service outage |
| SEV-2 (High) | 30 minutes | Major feature degraded |
| SEV-3 (Medium) | 2 hours | Minor feature issue |
| SEV-4 (Low) | Next business day | Cosmetic or minor bug |
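
The response-time targets above can be turned into a concrete deadline for a given incident. The following is a minimal sketch, not part of the Olympus Cloud tooling: the names `RESPONSE_TARGETS`, `next_business_day`, and `response_deadline` are hypothetical, and it assumes "next business day" means the end of the next Monday-to-Friday day, with holidays ignored.

```python
from datetime import datetime, time, timedelta

# Fixed response-time targets taken from the severity table.
# SEV-4 has no fixed duration and is handled separately below.
RESPONSE_TARGETS = {
    "SEV-1": timedelta(minutes=15),
    "SEV-2": timedelta(minutes=30),
    "SEV-3": timedelta(hours=2),
}


def next_business_day(day):
    """Return the first Monday-Friday date after `day` (holidays ignored)."""
    candidate = day + timedelta(days=1)
    while candidate.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        candidate += timedelta(days=1)
    return candidate


def response_deadline(severity, opened_at):
    """Return the datetime by which a first response is due."""
    if severity in RESPONSE_TARGETS:
        return opened_at + RESPONSE_TARGETS[severity]
    if severity == "SEV-4":
        # Assumption: the deadline is 23:59 on the next business day.
        due = next_business_day(opened_at.date())
        return datetime.combine(due, time(23, 59), tzinfo=opened_at.tzinfo)
    raise ValueError(f"Unknown severity: {severity}")
```

For example, a SEV-4 incident opened on a Friday afternoon would, under these assumptions, be due by the end of the following Monday.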

Key Contacts:

  • On-call rotation managed via PagerDuty
  • Escalation path: On-call engineer → Team lead → Engineering manager
  • Status page updates are posted for SEV-1 and SEV-2 incidents
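
The escalation path and the status page rule listed above can be expressed as a small lookup. This is an illustrative sketch only: `ESCALATION_PATH`, `next_escalation`, and `needs_status_page_update` are hypothetical names, and the code does not call PagerDuty or any status page service.

```python
# Escalation order as listed under Key Contacts.
ESCALATION_PATH = ["On-call engineer", "Team lead", "Engineering manager"]

# Severities that require a status page update.
STATUS_PAGE_SEVERITIES = {"SEV-1", "SEV-2"}


def next_escalation(current_role):
    """Return the next role in the escalation path, or None at the top."""
    index = ESCALATION_PATH.index(current_role)  # raises ValueError if unknown
    if index + 1 < len(ESCALATION_PATH):
        return ESCALATION_PATH[index + 1]
    return None


def needs_status_page_update(severity):
    """Return True when the severity calls for a status page update."""
    return severity in STATUS_PAGE_SEVERITIES
```

Keeping the path as an ordered list means a change to the escalation chain only needs to be made in one place.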