Deployment Runbook
Procedures for deploying Olympus Cloud services to production environments.
Quick Reference
| Task | Command/Action |
|---|---|
| Deploy to staging | Merge to develop branch |
| Deploy to production | Merge to main branch |
| Quick rollback | gcloud run services update-traffic SERVICE --to-revisions=PREV=100 |
| Check deployment status | gcloud run revisions list --service=SERVICE |
| View deployment logs | gcloud logging read 'resource.type="cloud_run_revision"' |
Deployment Architecture
Environments
| Environment | Branch | URL | Auto-Deploy |
|---|---|---|---|
| Development | develop | dev.olympuscloud.ai | Yes |
| Staging | develop | staging.olympuscloud.ai | Yes |
| Production | main | olympuscloud.ai | Yes (with approval) |
Service Topology
┌─────────────────────────────────────────────────────────────────────┐
│ Cloudflare Edge │
│ (WAF, DDoS, CDN, Edge Workers) │
└────────────────────────────────┬────────────────────────────────────┘
│
┌───────────────────────┼───────────────────────┐
▼ ▼ ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ API Gateway │ │ Web Portal │ │ Mobile BFF │
│ (Go) │ │ (Flutter Web) │ │ (Rust) │
│ Cloud Run │ │ Cloud Run │ │ Cloud Run │
└────────┬────────┘ └────────┬────────┘ └────────┬────────┘
│ │ │
└──────────────────────┼──────────────────────┘
▼
┌──────────────────────┴──────────────────────┐
▼ ▼ ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Platform Service│ │ Commerce Service│ │ AI Service │
│ (Rust) │ │ (Rust) │ │ (Python) │
│ Cloud Run │ │ Cloud Run │ │ Cloud Run │
└────────┬────────┘ └────────┬────────┘ └────────┬────────┘
│ │ │
└──────────────────────┴──────────────────────┘
│
▼
┌───────────────────────┐
│ Cloud Spanner │
│ (Primary Database) │
└───────────────────────┘
Pre-Deployment Checklist
Before Every Production Deployment
- All tests pass in CI (
gh run view --webto check) - Code review approved by at least 1 reviewer
- No P0/P1 incidents currently active
- Deployment window is appropriate (not Friday 5pm)
- Rollback plan documented
- On-call engineer aware of deployment
For Database Migrations
- Migration tested in staging
- Migration is backward compatible
- Rollback migration prepared
- Estimated migration time documented
- Maintenance window scheduled (if needed)
For Infrastructure Changes
- Terraform plan reviewed
- No destructive changes without approval
- Cost impact assessed
- Security review if IAM changes
Standard Deployment Procedures
Deploying to Staging
Staging deployments are automatic on merge to develop:
# Create PR to develop
gh pr create --base develop --title "feat: your feature"
# After approval and merge, monitor deployment
gh run watch
# Check staging deployment
curl -s https://staging.olympuscloud.ai/health | jq
Deploying to Production
Production deployments require approval:
# 1. Create PR from develop to main
gh pr create --base main --head develop --title "Release $(date +%Y-%m-%d)"
# 2. Review deployment changes
gh pr diff
# 3. After approval, merge triggers deployment
gh pr merge --merge
# 4. Monitor deployment pipeline
gh run watch
# 5. Verify production health
curl -s https://api.olympuscloud.ai/health | jq
Manual Deployment (Emergency)
For emergency hotfixes bypassing normal CI/CD:
# 1. Build and push image directly
docker build -t gcr.io/olympuscloud-prod/api-gateway:hotfix-$(date +%Y%m%d) .
docker push gcr.io/olympuscloud-prod/api-gateway:hotfix-$(date +%Y%m%d)
# 2. Deploy to Cloud Run
gcloud run deploy api-gateway \
--image=gcr.io/olympuscloud-prod/api-gateway:hotfix-$(date +%Y%m%d) \
--region=us-central1 \
--project=olympuscloud-prod
# 3. Document in incident channel
# 4. Create follow-up PR with same changes
Canary Deployment Procedure
Traffic Splitting Strategy
| Phase | Canary Traffic | Duration | Success Criteria |
|---|---|---|---|
| Phase 1 | 5% | 10 min | Error rate under 0.1% |
| Phase 2 | 25% | 15 min | Error rate under 0.1%, latency p99 under 500ms |
| Phase 3 | 50% | 15 min | Error rate under 0.1%, no alerts |
| Phase 4 | 100% | - | Full rollout |
Executing Canary Deployment
# Deploy new revision without traffic
gcloud run deploy api-gateway \
--image=gcr.io/olympuscloud-prod/api-gateway:v2.1.0 \
--region=us-central1 \
--no-traffic \
--tag=canary
# Phase 1: 5% traffic to canary
gcloud run services update-traffic api-gateway \
--region=us-central1 \
--to-tags=canary=5
# Monitor for 10 minutes
# Check error rates, latency, alerts
# Phase 2: 25% traffic
gcloud run services update-traffic api-gateway \
--region=us-central1 \
--to-tags=canary=25
# Continue monitoring...
# Phase 3: 50% traffic
gcloud run services update-traffic api-gateway \
--region=us-central1 \
--to-tags=canary=50
# Phase 4: Full rollout
gcloud run services update-traffic api-gateway \
--region=us-central1 \
--to-latest
Monitoring During Canary
# Watch error rates
watch -n 5 'gcloud logging read \
"resource.type=cloud_run_revision AND severity>=ERROR" \
--limit=10 --format="table(timestamp,textPayload)"'
# Compare latency between revisions
gcloud monitoring metrics list \
--filter="metric.type:run.googleapis.com/request_latencies"
Rollback Procedures
Quick Rollback (Immediate)
When you need to rollback immediately:
# 1. List recent revisions
gcloud run revisions list \
--service=api-gateway \
--region=us-central1 \
--format="table(REVISION,ACTIVE,LAST_DEPLOYED)"
# 2. Route all traffic to previous revision
gcloud run services update-traffic api-gateway \
--region=us-central1 \
--to-revisions=api-gateway-00042-abc=100
# 3. Verify rollback
curl -s https://api.olympuscloud.ai/health | jq '.version'
Full Rollback (With Investigation)
For issues requiring investigation:
# 1. Reduce traffic to bad revision
gcloud run services update-traffic api-gateway \
--region=us-central1 \
--to-revisions=api-gateway-00042-abc=100
# 2. Capture logs from bad revision
gcloud logging read \
'resource.labels.revision_name="api-gateway-00043-def"' \
--limit=1000 \
--format=json > rollback-logs.json
# 3. Create incident for investigation
gh issue create --title "Rollback: api-gateway v2.1.0" \
--body "Rolled back due to: [describe issue]"
# 4. Delete bad revision (optional)
gcloud run revisions delete api-gateway-00043-def \
--region=us-central1
Database Rollback
If deployment included database migrations:
# 1. Check current migration version
psql $DATABASE_URL -c "SELECT version FROM schema_migrations ORDER BY version DESC LIMIT 5;"
# 2. Run rollback migration
./scripts/db-migrate.sh rollback
# 3. Verify rollback
psql $DATABASE_URL -c "SELECT version FROM schema_migrations ORDER BY version DESC LIMIT 1;"
Infrastructure Deployment (Terraform)
Terraform Workflow
# 1. Initialize (if needed)
cd infrastructure/terraform
terraform init -backend-config=environments/prod/backend.tfvars
# 2. Plan changes
terraform plan \
-var-file=environments/prod/terraform.tfvars \
-out=tfplan
# 3. Review plan carefully
terraform show tfplan
# 4. Apply (requires approval for production)
terraform apply tfplan
Common Infrastructure Changes
Scaling Cloud Run:
# In terraform.tfvars
cloud_run_services = {
api_gateway = {
min_instances = 2
max_instances = 100
cpu = "2"
memory = "2Gi"
}
}
Adding a new service:
# 1. Add service definition in Terraform
# 2. Plan and apply
terraform plan -var-file=environments/prod/terraform.tfvars
terraform apply
# 3. Deploy application code
gcloud run deploy NEW_SERVICE --image=gcr.io/...
Infrastructure Rollback
# 1. Find previous state version
gsutil ls -l gs://olympuscloud-terraform-state/terraform/state/
# 2. Download previous state (careful!)
gsutil cp gs://olympuscloud-terraform-state/terraform/state/PREVIOUS_VERSION terraform.tfstate.backup
# 3. Plan to revert
terraform plan -var-file=environments/prod/terraform.tfvars
# 4. Apply revert
terraform apply
Service-Specific Procedures
API Gateway (Go)
# Build
docker build -f services/api-gateway/Dockerfile \
-t gcr.io/olympuscloud-prod/api-gateway:$(git rev-parse --short HEAD) .
# Deploy
gcloud run deploy api-gateway \
--image=gcr.io/olympuscloud-prod/api-gateway:$(git rev-parse --short HEAD) \
--region=us-central1 \
--set-env-vars="ENV=production" \
--min-instances=2 \
--max-instances=100
Platform Service (Rust)
# Build (uses multi-stage Dockerfile)
docker build -f services/platform/Dockerfile \
-t gcr.io/olympuscloud-prod/platform-service:$(git rev-parse --short HEAD) .
# Deploy
gcloud run deploy platform-service \
--image=gcr.io/olympuscloud-prod/platform-service:$(git rev-parse --short HEAD) \
--region=us-central1 \
--memory=1Gi \
--cpu=2
AI Services (Python)
# Build with ML dependencies
docker build -f services/ai/Dockerfile \
-t gcr.io/olympuscloud-prod/ai-service:$(git rev-parse --short HEAD) .
# Deploy with GPU (if needed)
gcloud run deploy ai-service \
--image=gcr.io/olympuscloud-prod/ai-service:$(git rev-parse --short HEAD) \
--region=us-central1 \
--memory=4Gi \
--cpu=4
Edge Workers (Cloudflare)
# Deploy to staging
cd workers/api-router
wrangler deploy --env staging
# Test staging
curl -s https://staging.olympuscloud.ai/api/health
# Deploy to production
wrangler deploy --env production
Monitoring Deployments
Real-Time Monitoring
# Watch deployment progress
gcloud run services describe api-gateway \
--region=us-central1 \
--format="value(status.traffic)"
# Tail logs during deployment
gcloud logging tail \
'resource.type="cloud_run_revision" AND
resource.labels.service_name="api-gateway"' \
--format="value(textPayload)"
# Watch for errors
gcloud logging read \
'resource.type="cloud_run_revision" AND severity>=ERROR' \
--limit=20 \
--format="table(timestamp,resource.labels.revision_name,textPayload)"
Key Metrics to Watch
| Metric | Normal | Warning | Critical |
|---|---|---|---|
| Error rate | under 0.1% | over 0.5% | over 1% |
| Latency p99 | under 500ms | over 1s | over 2s |
| Instance count | Stable | Scaling up | At max |
| Memory usage | under 80% | over 85% | over 95% |
Post-Deployment Verification
# Health check
curl -s https://api.olympuscloud.ai/health | jq
# Version check
curl -s https://api.olympuscloud.ai/version | jq
# Smoke tests
./scripts/smoke-tests.sh production
# Check for error spikes
gcloud logging read \
'resource.type="cloud_run_revision" AND severity=ERROR' \
--limit=50 \
--freshness=10m
Troubleshooting Deployments
Deployment Stuck
# Check revision status
gcloud run revisions describe REVISION_NAME \
--region=us-central1 \
--format="yaml(status)"
# Common issues:
# - Image pull errors: Check image exists and permissions
# - Health check failing: Check /health endpoint
# - Resource limits: May need more CPU/memory
Container Won't Start
# Check container logs
gcloud logging read \
'resource.labels.revision_name="REVISION_NAME" AND
textPayload=~"error|Error|ERROR|panic|fatal"' \
--limit=100
# Common issues:
# - Missing environment variables
# - Database connection failures
# - Port binding issues (must use PORT env var)
High Error Rate After Deployment
# Identify error patterns
gcloud logging read \
'resource.type="cloud_run_revision" AND severity>=ERROR' \
--limit=200 \
--format=json | jq '.[] | .textPayload' | sort | uniq -c | sort -rn
# Quick mitigation: rollback
gcloud run services update-traffic SERVICE \
--to-revisions=PREVIOUS_REVISION=100
Escalation
| Situation | Action | Contact |
|---|---|---|
| Deployment pipeline failing | Check GitHub Actions | Platform team Slack |
| Cloud Run deployment issues | Check GCP console | SRE on-call |
| Terraform state issues | Do NOT modify manually | Infrastructure lead |
| Production incident during deploy | Stop deployment, rollback | Incident commander |