Operations Guide
This section covers deployment, monitoring, and operational procedures for the Olympus Cloud platform.
Infrastructure Overview
Services
Backend Services
| Service | Technology | Cloud Run | Purpose |
|---|---|---|---|
| Auth | Rust (Axum) | auth-service | Authentication |
| Platform | Rust (GraphQL) | platform-service | Multi-tenancy |
| Commerce | Rust (Axum) | commerce-service | Orders, payments |
| Gateway | Go (Chi) | api-gateway | Unified routing |
| AI/ML | Python (FastAPI) | ai-ml-service | ML models, analytics |
Edge Services
| Service | Technology | Worker | Purpose |
|---|---|---|---|
| Main Worker | TypeScript | olympus-edge | Auth, routing, caching |
| AI Gateway | TypeScript | ai-gateway | Model routing |
| Restaurant AI | TypeScript | restaurant-ai-chat | Support agents |
Deployment
Cloud Run
All backend services deploy via Cloud Build:
# Deploy a service
gcloud run deploy auth-service \
--image gcr.io/olympuscloud/auth-service:latest \
--region us-central1 \
--platform managed
Cloudflare Workers
# Deploy worker
cd infrastructure/cloudflare-workers
wrangler deploy --env production
Monitoring
Key Metrics
- Latency: P50, P95, P99 response times
- Error Rate: 4xx, 5xx responses
- Throughput: Requests per second
- Saturation: CPU, memory utilization
Dashboards
Runbooks
Operational procedures and emergency response guides:
Emergency Response
- Incident Response - First response procedures, severity levels, escalation
- Disaster Recovery - DR procedures, failover, data recovery
- On-Call Guide - On-call responsibilities, shift handoffs, escalation paths
Infrastructure Operations
- Database Operations - Spanner operations, backup/restore, query optimization
- Scaling - Auto-scaling, manual scaling, capacity planning
- Deployment - Deployment procedures, rollback, blue-green deployments
Monitoring & Alerting
- Monitoring & Alerts - Alert handling, threshold tuning, false positive management
- Production Security - Security incident response, access reviews, audit procedures
Specialized Systems
- Vision AI Operations - Vision AI system operations, model updates, camera troubleshooting
- Data Sync & IoT - Edge sync issues, IoT device management, OlympusEdge operations
Environments
| Environment | API URL | Purpose |
|---|---|---|
| Production | api.olympuscloud.ai | Live traffic |
| Staging | staging.api.olympuscloud.ai | Pre-production |
| Development | dev.api.olympuscloud.ai | Development |
On-Call
For production issues:
- Check status page
- Review Cloud Monitoring alerts
- Follow runbook for incident type
- Escalate if needed