Skip to main content

Deployment Runbook

Procedures for deploying Olympus Cloud services to production environments.

Quick Reference

TaskCommand/Action
Deploy to stagingMerge to develop branch
Deploy to productionMerge to main branch
Quick rollbackgcloud run services update-traffic SERVICE --to-revisions=PREV=100
Check deployment statusgcloud run revisions list --service=SERVICE
View deployment logsgcloud logging read 'resource.type="cloud_run_revision"'

Deployment Architecture

Environments

EnvironmentBranchURLAuto-Deploy
Developmentdevelopdev.olympuscloud.aiYes
Stagingdevelopstaging.olympuscloud.aiYes
Productionmainolympuscloud.aiYes (with approval)

Service Topology

┌─────────────────────────────────────────────────────────────────────┐
│ Cloudflare Edge │
│ (WAF, DDoS, CDN, Edge Workers) │
└────────────────────────────────┬────────────────────────────────────┘

┌───────────────────────┼───────────────────────┐
▼ ▼ ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ API Gateway │ │ Web Portal │ │ Mobile BFF │
│ (Go) │ │ (Flutter Web) │ │ (Rust) │
│ Cloud Run │ │ Cloud Run │ │ Cloud Run │
└────────┬────────┘ └────────┬────────┘ └────────┬────────┘
│ │ │
└──────────────────────┼──────────────────────┘

┌──────────────────────┴──────────────────────┐
▼ ▼ ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Platform Service│ │ Commerce Service│ │ AI Service │
│ (Rust) │ │ (Rust) │ │ (Python) │
│ Cloud Run │ │ Cloud Run │ │ Cloud Run │
└────────┬────────┘ └────────┬────────┘ └────────┬────────┘
│ │ │
└──────────────────────┴──────────────────────┘


┌───────────────────────┐
│ Cloud Spanner │
│ (Primary Database) │
└───────────────────────┘

Pre-Deployment Checklist

Before Every Production Deployment

  • All tests pass in CI (gh run view --web to check)
  • Code review approved by at least 1 reviewer
  • No P0/P1 incidents currently active
  • Deployment window is appropriate (not Friday 5pm)
  • Rollback plan documented
  • On-call engineer aware of deployment

For Database Migrations

  • Migration tested in staging
  • Migration is backward compatible
  • Rollback migration prepared
  • Estimated migration time documented
  • Maintenance window scheduled (if needed)

For Infrastructure Changes

  • Terraform plan reviewed
  • No destructive changes without approval
  • Cost impact assessed
  • Security review if IAM changes

Standard Deployment Procedures

Deploying to Staging

Staging deployments are automatic on merge to develop:

# Create PR to develop
gh pr create --base develop --title "feat: your feature"

# After approval and merge, monitor deployment
gh run watch

# Check staging deployment
curl -s https://staging.olympuscloud.ai/health | jq

Deploying to Production

Production deployments require approval:

# 1. Create PR from develop to main
gh pr create --base main --head develop --title "Release $(date +%Y-%m-%d)"

# 2. Review deployment changes
gh pr diff

# 3. After approval, merge triggers deployment
gh pr merge --merge

# 4. Monitor deployment pipeline
gh run watch

# 5. Verify production health
curl -s https://api.olympuscloud.ai/health | jq

Manual Deployment (Emergency)

For emergency hotfixes bypassing normal CI/CD:

# 1. Build and push image directly
docker build -t gcr.io/olympuscloud-prod/api-gateway:hotfix-$(date +%Y%m%d) .
docker push gcr.io/olympuscloud-prod/api-gateway:hotfix-$(date +%Y%m%d)

# 2. Deploy to Cloud Run
gcloud run deploy api-gateway \
--image=gcr.io/olympuscloud-prod/api-gateway:hotfix-$(date +%Y%m%d) \
--region=us-central1 \
--project=olympuscloud-prod

# 3. Document in incident channel
# 4. Create follow-up PR with same changes

Canary Deployment Procedure

Traffic Splitting Strategy

PhaseCanary TrafficDurationSuccess Criteria
Phase 15%10 minError rate under 0.1%
Phase 225%15 minError rate under 0.1%, latency p99 under 500ms
Phase 350%15 minError rate under 0.1%, no alerts
Phase 4100%-Full rollout

Executing Canary Deployment

# Deploy new revision without traffic
gcloud run deploy api-gateway \
--image=gcr.io/olympuscloud-prod/api-gateway:v2.1.0 \
--region=us-central1 \
--no-traffic \
--tag=canary

# Phase 1: 5% traffic to canary
gcloud run services update-traffic api-gateway \
--region=us-central1 \
--to-tags=canary=5

# Monitor for 10 minutes
# Check error rates, latency, alerts

# Phase 2: 25% traffic
gcloud run services update-traffic api-gateway \
--region=us-central1 \
--to-tags=canary=25

# Continue monitoring...

# Phase 3: 50% traffic
gcloud run services update-traffic api-gateway \
--region=us-central1 \
--to-tags=canary=50

# Phase 4: Full rollout
gcloud run services update-traffic api-gateway \
--region=us-central1 \
--to-latest

Monitoring During Canary

# Watch error rates
watch -n 5 'gcloud logging read \
"resource.type=cloud_run_revision AND severity>=ERROR" \
--limit=10 --format="table(timestamp,textPayload)"'

# Compare latency between revisions
gcloud monitoring metrics list \
--filter="metric.type:run.googleapis.com/request_latencies"

Rollback Procedures

Quick Rollback (Immediate)

When you need to rollback immediately:

# 1. List recent revisions
gcloud run revisions list \
--service=api-gateway \
--region=us-central1 \
--format="table(REVISION,ACTIVE,LAST_DEPLOYED)"

# 2. Route all traffic to previous revision
gcloud run services update-traffic api-gateway \
--region=us-central1 \
--to-revisions=api-gateway-00042-abc=100

# 3. Verify rollback
curl -s https://api.olympuscloud.ai/health | jq '.version'

Full Rollback (With Investigation)

For issues requiring investigation:

# 1. Reduce traffic to bad revision
gcloud run services update-traffic api-gateway \
--region=us-central1 \
--to-revisions=api-gateway-00042-abc=100

# 2. Capture logs from bad revision
gcloud logging read \
'resource.labels.revision_name="api-gateway-00043-def"' \
--limit=1000 \
--format=json > rollback-logs.json

# 3. Create incident for investigation
gh issue create --title "Rollback: api-gateway v2.1.0" \
--body "Rolled back due to: [describe issue]"

# 4. Delete bad revision (optional)
gcloud run revisions delete api-gateway-00043-def \
--region=us-central1

Database Rollback

If deployment included database migrations:

# 1. Check current migration version
psql $DATABASE_URL -c "SELECT version FROM schema_migrations ORDER BY version DESC LIMIT 5;"

# 2. Run rollback migration
./scripts/db-migrate.sh rollback

# 3. Verify rollback
psql $DATABASE_URL -c "SELECT version FROM schema_migrations ORDER BY version DESC LIMIT 1;"

Infrastructure Deployment (Terraform)

Terraform Workflow

# 1. Initialize (if needed)
cd infrastructure/terraform
terraform init -backend-config=environments/prod/backend.tfvars

# 2. Plan changes
terraform plan \
-var-file=environments/prod/terraform.tfvars \
-out=tfplan

# 3. Review plan carefully
terraform show tfplan

# 4. Apply (requires approval for production)
terraform apply tfplan

Common Infrastructure Changes

Scaling Cloud Run:

# In terraform.tfvars
cloud_run_services = {
api_gateway = {
min_instances = 2
max_instances = 100
cpu = "2"
memory = "2Gi"
}
}

Adding a new service:

# 1. Add service definition in Terraform
# 2. Plan and apply
terraform plan -var-file=environments/prod/terraform.tfvars
terraform apply

# 3. Deploy application code
gcloud run deploy NEW_SERVICE --image=gcr.io/...

Infrastructure Rollback

# 1. Find previous state version
gsutil ls -l gs://olympuscloud-terraform-state/terraform/state/

# 2. Download previous state (careful!)
gsutil cp gs://olympuscloud-terraform-state/terraform/state/PREVIOUS_VERSION terraform.tfstate.backup

# 3. Plan to revert
terraform plan -var-file=environments/prod/terraform.tfvars

# 4. Apply revert
terraform apply

Service-Specific Procedures

API Gateway (Go)

# Build
docker build -f services/api-gateway/Dockerfile \
-t gcr.io/olympuscloud-prod/api-gateway:$(git rev-parse --short HEAD) .

# Deploy
gcloud run deploy api-gateway \
--image=gcr.io/olympuscloud-prod/api-gateway:$(git rev-parse --short HEAD) \
--region=us-central1 \
--set-env-vars="ENV=production" \
--min-instances=2 \
--max-instances=100

Platform Service (Rust)

# Build (uses multi-stage Dockerfile)
docker build -f services/platform/Dockerfile \
-t gcr.io/olympuscloud-prod/platform-service:$(git rev-parse --short HEAD) .

# Deploy
gcloud run deploy platform-service \
--image=gcr.io/olympuscloud-prod/platform-service:$(git rev-parse --short HEAD) \
--region=us-central1 \
--memory=1Gi \
--cpu=2

AI Services (Python)

# Build with ML dependencies
docker build -f services/ai/Dockerfile \
-t gcr.io/olympuscloud-prod/ai-service:$(git rev-parse --short HEAD) .

# Deploy with GPU (if needed)
gcloud run deploy ai-service \
--image=gcr.io/olympuscloud-prod/ai-service:$(git rev-parse --short HEAD) \
--region=us-central1 \
--memory=4Gi \
--cpu=4

Edge Workers (Cloudflare)

# Deploy to staging
cd workers/api-router
wrangler deploy --env staging

# Test staging
curl -s https://staging.olympuscloud.ai/api/health

# Deploy to production
wrangler deploy --env production

Monitoring Deployments

Real-Time Monitoring

# Watch deployment progress
gcloud run services describe api-gateway \
--region=us-central1 \
--format="value(status.traffic)"

# Tail logs during deployment
gcloud logging tail \
'resource.type="cloud_run_revision" AND
resource.labels.service_name="api-gateway"' \
--format="value(textPayload)"

# Watch for errors
gcloud logging read \
'resource.type="cloud_run_revision" AND severity>=ERROR' \
--limit=20 \
--format="table(timestamp,resource.labels.revision_name,textPayload)"

Key Metrics to Watch

MetricNormalWarningCritical
Error rateunder 0.1%over 0.5%over 1%
Latency p99under 500msover 1sover 2s
Instance countStableScaling upAt max
Memory usageunder 80%over 85%over 95%

Post-Deployment Verification

# Health check
curl -s https://api.olympuscloud.ai/health | jq

# Version check
curl -s https://api.olympuscloud.ai/version | jq

# Smoke tests
./scripts/smoke-tests.sh production

# Check for error spikes
gcloud logging read \
'resource.type="cloud_run_revision" AND severity=ERROR' \
--limit=50 \
--freshness=10m

Troubleshooting Deployments

Deployment Stuck

# Check revision status
gcloud run revisions describe REVISION_NAME \
--region=us-central1 \
--format="yaml(status)"

# Common issues:
# - Image pull errors: Check image exists and permissions
# - Health check failing: Check /health endpoint
# - Resource limits: May need more CPU/memory

Container Won't Start

# Check container logs
gcloud logging read \
'resource.labels.revision_name="REVISION_NAME" AND
textPayload=~"error|Error|ERROR|panic|fatal"' \
--limit=100

# Common issues:
# - Missing environment variables
# - Database connection failures
# - Port binding issues (must use PORT env var)

High Error Rate After Deployment

# Identify error patterns
gcloud logging read \
'resource.type="cloud_run_revision" AND severity>=ERROR' \
--limit=200 \
--format=json | jq '.[] | .textPayload' | sort | uniq -c | sort -rn

# Quick mitigation: rollback
gcloud run services update-traffic SERVICE \
--to-revisions=PREVIOUS_REVISION=100

Escalation

SituationActionContact
Deployment pipeline failingCheck GitHub ActionsPlatform team Slack
Cloud Run deployment issuesCheck GCP consoleSRE on-call
Terraform state issuesDo NOT modify manuallyInfrastructure lead
Production incident during deployStop deployment, rollbackIncident commander