Deployment Runbook

Procedures for deploying Olympus Cloud services to production environments.

Quick Reference

Task	Command/Action
Deploy to staging	Merge to `develop` branch
Deploy to production	Merge to `main` branch
Quick rollback	`gcloud run services update-traffic SERVICE --to-revisions=PREV=100`
Check deployment status	`gcloud run revisions list --service=SERVICE`
View deployment logs	`gcloud logging read 'resource.type="cloud_run_revision"'`

Deployment Architecture

Environments

Environment	Branch	URL	Auto-Deploy
Development	`develop`	dev.olympuscloud.ai	Yes
Staging	`develop`	staging.olympuscloud.ai	Yes
Production	`main`	olympuscloud.ai	Yes (with approval)

Service Topology

┌─────────────────────────────────────────────────────────────────────┐
│                         Cloudflare Edge                              │
│  (WAF, DDoS, CDN, Edge Workers)                                     │
└────────────────────────────────┬────────────────────────────────────┘
                                 │
         ┌───────────────────────┼───────────────────────┐
         ▼                       ▼                       ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   API Gateway   │    │   Web Portal    │    │  Mobile BFF     │
│   (Go)          │    │   (Flutter Web) │    │  (Rust)         │
│   Cloud Run     │    │   Cloud Run     │    │  Cloud Run      │
└────────┬────────┘    └────────┬────────┘    └────────┬────────┘
         │                      │                      │
         └──────────────────────┼──────────────────────┘
                                ▼
         ┌──────────────────────┴──────────────────────┐
         ▼                      ▼                      ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│ Platform Service│    │ Commerce Service│    │   AI Service    │
│   (Rust)        │    │   (Rust)        │    │   (Python)      │
│   Cloud Run     │    │   Cloud Run     │    │   Cloud Run     │
└────────┬────────┘    └────────┬────────┘    └────────┬────────┘
         │                      │                      │
         └──────────────────────┴──────────────────────┘
                                │
                                ▼
                    ┌───────────────────────┐
                    │   Cloud Spanner       │
                    │   (Primary Database)  │
                    └───────────────────────┘

Pre-Deployment Checklist

Before Every Production Deployment

All tests pass in CI (gh run view --web to check)
Code review approved by at least 1 reviewer
No P0/P1 incidents currently active
Deployment window is appropriate (not Friday 5pm)
Rollback plan documented
On-call engineer aware of deployment

For Database Migrations

Migration tested in staging
Migration is backward compatible
Rollback migration prepared
Estimated migration time documented
Maintenance window scheduled (if needed)

For Infrastructure Changes

Terraform plan reviewed
No destructive changes without approval
Cost impact assessed
Security review if IAM changes

Standard Deployment Procedures

Deploying to Staging

Staging deployments are automatic on merge to develop:

# Create PR to develop
gh pr create --base develop --title "feat: your feature"

# After approval and merge, monitor deployment
gh run watch

# Check staging deployment
curl -s https://staging.olympuscloud.ai/health | jq

Deploying to Production

Production deployments require approval:

# 1. Create PR from develop to main
gh pr create --base main --head develop --title "Release $(date +%Y-%m-%d)"

# 2. Review deployment changes
gh pr diff

# 3. After approval, merge triggers deployment
gh pr merge --merge

# 4. Monitor deployment pipeline
gh run watch

# 5. Verify production health
curl -s https://api.olympuscloud.ai/health | jq

Manual Deployment (Emergency)

For emergency hotfixes bypassing normal CI/CD:

# 1. Build and push image directly
docker build -t gcr.io/olympuscloud-prod/api-gateway:hotfix-$(date +%Y%m%d) .
docker push gcr.io/olympuscloud-prod/api-gateway:hotfix-$(date +%Y%m%d)

# 2. Deploy to Cloud Run
gcloud run deploy api-gateway \
  --image=gcr.io/olympuscloud-prod/api-gateway:hotfix-$(date +%Y%m%d) \
  --region=us-central1 \
  --project=olympuscloud-prod

# 3. Document in incident channel
# 4. Create follow-up PR with same changes

Canary Deployment Procedure

Traffic Splitting Strategy

Phase	Canary Traffic	Duration	Success Criteria
Phase 1	5%	10 min	Error rate under 0.1%
Phase 2	25%	15 min	Error rate under 0.1%, latency p99 under 500ms
Phase 3	50%	15 min	Error rate under 0.1%, no alerts
Phase 4	100%	-	Full rollout

Executing Canary Deployment

# Deploy new revision without traffic
gcloud run deploy api-gateway \
  --image=gcr.io/olympuscloud-prod/api-gateway:v2.1.0 \
  --region=us-central1 \
  --no-traffic \
  --tag=canary

# Phase 1: 5% traffic to canary
gcloud run services update-traffic api-gateway \
  --region=us-central1 \
  --to-tags=canary=5

# Monitor for 10 minutes
# Check error rates, latency, alerts

# Phase 2: 25% traffic
gcloud run services update-traffic api-gateway \
  --region=us-central1 \
  --to-tags=canary=25

# Continue monitoring...

# Phase 3: 50% traffic
gcloud run services update-traffic api-gateway \
  --region=us-central1 \
  --to-tags=canary=50

# Phase 4: Full rollout
gcloud run services update-traffic api-gateway \
  --region=us-central1 \
  --to-latest

Monitoring During Canary

# Watch error rates
watch -n 5 'gcloud logging read \
  "resource.type=cloud_run_revision AND severity>=ERROR" \
  --limit=10 --format="table(timestamp,textPayload)"'

# Compare latency between revisions
gcloud monitoring metrics list \
  --filter="metric.type:run.googleapis.com/request_latencies"

Rollback Procedures

Quick Rollback (Immediate)

When you need to rollback immediately:

# 1. List recent revisions
gcloud run revisions list \
  --service=api-gateway \
  --region=us-central1 \
  --format="table(REVISION,ACTIVE,LAST_DEPLOYED)"

# 2. Route all traffic to previous revision
gcloud run services update-traffic api-gateway \
  --region=us-central1 \
  --to-revisions=api-gateway-00042-abc=100

# 3. Verify rollback
curl -s https://api.olympuscloud.ai/health | jq '.version'

Full Rollback (With Investigation)

For issues requiring investigation:

# 1. Reduce traffic to bad revision
gcloud run services update-traffic api-gateway \
  --region=us-central1 \
  --to-revisions=api-gateway-00042-abc=100

# 2. Capture logs from bad revision
gcloud logging read \
  'resource.labels.revision_name="api-gateway-00043-def"' \
  --limit=1000 \
  --format=json > rollback-logs.json

# 3. Create incident for investigation
gh issue create --title "Rollback: api-gateway v2.1.0" \
  --body "Rolled back due to: [describe issue]"

# 4. Delete bad revision (optional)
gcloud run revisions delete api-gateway-00043-def \
  --region=us-central1

Database Rollback

If deployment included database migrations:

# 1. Check current migration version
psql $DATABASE_URL -c "SELECT version FROM schema_migrations ORDER BY version DESC LIMIT 5;"

# 2. Run rollback migration
./scripts/db-migrate.sh rollback

# 3. Verify rollback
psql $DATABASE_URL -c "SELECT version FROM schema_migrations ORDER BY version DESC LIMIT 1;"

Infrastructure Deployment (Terraform)

Terraform Workflow

# 1. Initialize (if needed)
cd infrastructure/terraform
terraform init -backend-config=environments/prod/backend.tfvars

# 2. Plan changes
terraform plan \
  -var-file=environments/prod/terraform.tfvars \
  -out=tfplan

# 3. Review plan carefully
terraform show tfplan

# 4. Apply (requires approval for production)
terraform apply tfplan

Common Infrastructure Changes

Scaling Cloud Run:

# In terraform.tfvars
cloud_run_services = {
  api_gateway = {
    min_instances = 2
    max_instances = 100
    cpu           = "2"
    memory        = "2Gi"
  }
}

Adding a new service:

# 1. Add service definition in Terraform
# 2. Plan and apply
terraform plan -var-file=environments/prod/terraform.tfvars
terraform apply

# 3. Deploy application code
gcloud run deploy NEW_SERVICE --image=gcr.io/...

Infrastructure Rollback

# 1. Find previous state version
gsutil ls -l gs://olympuscloud-terraform-state/terraform/state/

# 2. Download previous state (careful!)
gsutil cp gs://olympuscloud-terraform-state/terraform/state/PREVIOUS_VERSION terraform.tfstate.backup

# 3. Plan to revert
terraform plan -var-file=environments/prod/terraform.tfvars

# 4. Apply revert
terraform apply

Service-Specific Procedures

API Gateway (Go)

# Build
docker build -f services/api-gateway/Dockerfile \
  -t gcr.io/olympuscloud-prod/api-gateway:$(git rev-parse --short HEAD) .

# Deploy
gcloud run deploy api-gateway \
  --image=gcr.io/olympuscloud-prod/api-gateway:$(git rev-parse --short HEAD) \
  --region=us-central1 \
  --set-env-vars="ENV=production" \
  --min-instances=2 \
  --max-instances=100

Platform Service (Rust)

# Build (uses multi-stage Dockerfile)
docker build -f services/platform/Dockerfile \
  -t gcr.io/olympuscloud-prod/platform-service:$(git rev-parse --short HEAD) .

# Deploy
gcloud run deploy platform-service \
  --image=gcr.io/olympuscloud-prod/platform-service:$(git rev-parse --short HEAD) \
  --region=us-central1 \
  --memory=1Gi \
  --cpu=2

AI Services (Python)

# Build with ML dependencies
docker build -f services/ai/Dockerfile \
  -t gcr.io/olympuscloud-prod/ai-service:$(git rev-parse --short HEAD) .

# Deploy with GPU (if needed)
gcloud run deploy ai-service \
  --image=gcr.io/olympuscloud-prod/ai-service:$(git rev-parse --short HEAD) \
  --region=us-central1 \
  --memory=4Gi \
  --cpu=4

Edge Workers (Cloudflare)

# Deploy to staging
cd workers/api-router
wrangler deploy --env staging

# Test staging
curl -s https://staging.olympuscloud.ai/api/health

# Deploy to production
wrangler deploy --env production

Monitoring Deployments

Real-Time Monitoring

# Watch deployment progress
gcloud run services describe api-gateway \
  --region=us-central1 \
  --format="value(status.traffic)"

# Tail logs during deployment
gcloud logging tail \
  'resource.type="cloud_run_revision" AND
   resource.labels.service_name="api-gateway"' \
  --format="value(textPayload)"

# Watch for errors
gcloud logging read \
  'resource.type="cloud_run_revision" AND severity>=ERROR' \
  --limit=20 \
  --format="table(timestamp,resource.labels.revision_name,textPayload)"

Key Metrics to Watch

Metric	Normal	Warning	Critical
Error rate	under 0.1%	over 0.5%	over 1%
Latency p99	under 500ms	over 1s	over 2s
Instance count	Stable	Scaling up	At max
Memory usage	under 80%	over 85%	over 95%

Post-Deployment Verification

# Health check
curl -s https://api.olympuscloud.ai/health | jq

# Version check
curl -s https://api.olympuscloud.ai/version | jq

# Smoke tests
./scripts/smoke-tests.sh production

# Check for error spikes
gcloud logging read \
  'resource.type="cloud_run_revision" AND severity=ERROR' \
  --limit=50 \
  --freshness=10m

Troubleshooting Deployments

Deployment Stuck

# Check revision status
gcloud run revisions describe REVISION_NAME \
  --region=us-central1 \
  --format="yaml(status)"

# Common issues:
# - Image pull errors: Check image exists and permissions
# - Health check failing: Check /health endpoint
# - Resource limits: May need more CPU/memory

Container Won't Start

# Check container logs
gcloud logging read \
  'resource.labels.revision_name="REVISION_NAME" AND
   textPayload=~"error|Error|ERROR|panic|fatal"' \
  --limit=100

# Common issues:
# - Missing environment variables
# - Database connection failures
# - Port binding issues (must use PORT env var)

High Error Rate After Deployment

# Identify error patterns
gcloud logging read \
  'resource.type="cloud_run_revision" AND severity>=ERROR' \
  --limit=200 \
  --format=json | jq '.[] | .textPayload' | sort | uniq -c | sort -rn

# Quick mitigation: rollback
gcloud run services update-traffic SERVICE \
  --to-revisions=PREVIOUS_REVISION=100

Escalation

Situation	Action	Contact
Deployment pipeline failing	Check GitHub Actions	Platform team Slack
Cloud Run deployment issues	Check GCP console	SRE on-call
Terraform state issues	Do NOT modify manually	Infrastructure lead
Production incident during deploy	Stop deployment, rollback	Incident commander

Quick Reference​

Deployment Architecture​

Environments​

Service Topology​

Pre-Deployment Checklist​

Before Every Production Deployment​

For Database Migrations​

For Infrastructure Changes​

Standard Deployment Procedures​

Deploying to Staging​

Deploying to Production​

Manual Deployment (Emergency)​

Canary Deployment Procedure​

Traffic Splitting Strategy​

Executing Canary Deployment​

Monitoring During Canary​

Rollback Procedures​

Quick Rollback (Immediate)​

Full Rollback (With Investigation)​

Database Rollback​

Infrastructure Deployment (Terraform)​

Terraform Workflow​

Common Infrastructure Changes​

Infrastructure Rollback​

Service-Specific Procedures​

API Gateway (Go)​

Platform Service (Rust)​

AI Services (Python)​

Edge Workers (Cloudflare)​

Monitoring Deployments​

Real-Time Monitoring​

Key Metrics to Watch​

Post-Deployment Verification​

Troubleshooting Deployments​

Deployment Stuck​

Container Won't Start​

High Error Rate After Deployment​

Escalation​

Related Documentation​