Scaling Operations Runbook
Procedures for scaling Olympus Cloud infrastructure to meet demand.
Overview
Olympus Cloud uses auto-scaling for most workloads, but manual intervention may be needed for rapid scaling, cost optimization, or incident response.
Scaling Architecture
┌─────────────────────────────────────────────────────────────────┐
│ Scaling Layers │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Layer 1: Edge (Cloudflare) │
│ ├── Workers: Auto-scale, no limits │
│ ├── KV: Globally distributed │
│ └── R2: Unlimited storage │
│ │
│ Layer 2: Compute (Cloud Run) │
│ ├── Min Instances: 1-10 (configurable) │
│ ├── Max Instances: 100-1000 (configurable) │
│ └── Scale to zero: Disabled for production │
│ │
│ Layer 3: Database (Spanner) │
│ ├── Spanner: Node-based horizontal scaling │
│ └── ClickHouse Cloud: Auto-scales for OLAP workloads │
│ │
│ Layer 4: Async Processing (Pub/Sub + Cloud Tasks) │
│ ├── Pub/Sub: Auto-scales subscribers │
│ └── Cloud Tasks: Queue-based processing │
│ │
└─────────────────────────────────────────────────────────────────┘
Cloud Run Scaling
Current Configuration
| Service | Min Instances | Max Instances | CPU | Memory |
|---|---|---|---|---|
| api-gateway | 3 | 100 | 2 | 2Gi |
| platform-service | 2 | 50 | 2 | 4Gi |
| order-service | 2 | 100 | 2 | 2Gi |
| user-service | 2 | 50 | 1 | 1Gi |
| ai-service | 1 | 20 | 4 | 8Gi |
Scaling Triggers
Cloud Run auto-scales based on:
- Request concurrency (default: 80 per instance)
- CPU utilization (default: 60%)
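As a rough back-of-the-envelope check before adjusting these triggers (this is Little's law applied as a sizing heuristic, not a Cloud Run formula — the numbers below are illustrative):

```shell
#!/usr/bin/env bash
# Rough instance estimate via Little's law:
#   in-flight requests ≈ RPS × mean latency (seconds)
#   instances needed  ≈ ceil(in-flight / concurrency per instance)
estimate_instances() {
  local rps=$1 latency_ms=$2 concurrency=$3
  # ceil(rps * latency_ms / 1000 / concurrency) using integer math
  echo $(( (rps * latency_ms + 1000 * concurrency - 1) / (1000 * concurrency) ))
}

# 2000 RPS at 200 ms mean latency with the default concurrency of 80:
estimate_instances 2000 200 80   # → 5
```

If the estimate approaches a service's max instances from the table above, raise the max before the traffic arrives.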
Manual Scaling Commands
Increase Minimum Instances
# For traffic spike preparation
gcloud run services update api-gateway \
--min-instances=10 \
--region=us-central1
Increase Maximum Instances
# For unexpected demand
gcloud run services update api-gateway \
--max-instances=200 \
--region=us-central1
Adjust CPU/Memory
# For memory-intensive workloads
gcloud run services update ai-service \
--memory=16Gi \
--cpu=8 \
--region=us-central1
Adjust Concurrency
# Lower concurrency for heavy requests
gcloud run services update ai-service \
--concurrency=20 \
--region=us-central1
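One way to pick a lower concurrency value is to work backward from per-request memory footprint. This is a heuristic sketch with assumed numbers (the 300 MiB per-request figure and 20% headroom are illustrative, not measured):

```shell
# Pick a concurrency that keeps one instance within its memory limit.
# Inputs: instance memory (MiB), peak per-request memory (MiB), headroom %.
safe_concurrency() {
  local instance_mem_mb=$1 per_request_mb=$2 headroom_pct=$3
  # usable memory after reserving headroom, divided by per-request footprint
  echo $(( instance_mem_mb * (100 - headroom_pct) / 100 / per_request_mb ))
}

# ai-service: 8 GiB instance, ~300 MiB per request, 20% headroom:
safe_concurrency 8192 300 20   # → 21
```

CPU-bound services may need an even lower value than this memory-based bound suggests; validate with load testing.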
Pre-Scaling for Events
Before known high-traffic events:
1. **24 hours before**

   ```shell
   # Double minimum instances
   gcloud run services update api-gateway --min-instances=6 --region=us-central1
   gcloud run services update order-service --min-instances=4 --region=us-central1
   ```

2. **1 hour before**

   ```shell
   # Pre-warm by sending synthetic traffic so instances are ready
   hey -n 1000 -c 50 https://api.olympuscloud.ai/health
   ```

3. **During the event**
   - Monitor dashboards closely
   - Be ready to scale further

4. **After the event**

   ```shell
   # Return to normal once traffic normalizes
   gcloud run services update api-gateway --min-instances=3 --region=us-central1
   gcloud run services update order-service --min-instances=2 --region=us-central1
   ```
Database Scaling
Cloud Spanner
Check Current Utilization
# View CPU utilization (target: 45-65%)
gcloud monitoring read \
"spanner.googleapis.com/instance/cpu/smoothed_utilization" \
--filter='resource.labels.instance_id="prod-olympus-spanner"' \
--interval='now-1h'
Scale Up Nodes
# Increase from 3 to 5 nodes
gcloud spanner instances update prod-olympus-spanner \
--nodes=5
# Takes effect immediately
# Monitor for 15 minutes to verify
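To decide how many nodes to scale to, a simple first approximation (our assumption — it presumes load spreads evenly across nodes, which Spanner does not guarantee for hot-spotted workloads):

```shell
# Nodes needed to bring CPU down to a target utilization,
# assuming load distributes evenly across nodes.
spanner_nodes_needed() {
  local nodes=$1 cpu_pct=$2 target_pct=$3
  # ceil(nodes * cpu_pct / target_pct) using integer math
  echo $(( (nodes * cpu_pct + target_pct - 1) / target_pct ))
}

# 3 nodes at 85% CPU, targeting 55% (middle of the 45-65% band):
spanner_nodes_needed 3 85 55   # → 5
```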
Scale Down Nodes
# Only scale down when CPU < 40%
# Scale down gradually (one node at a time)
gcloud spanner instances update prod-olympus-spanner \
--nodes=4
# Wait 30 minutes, verify stable
gcloud spanner instances update prod-olympus-spanner \
--nodes=3
Scaling Guidelines
| CPU Utilization | Action |
|---|---|
| < 30% | Consider scaling down |
| 30-65% | Optimal range |
| 65-80% | Monitor closely |
| > 80% | Scale up immediately |
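The table above can be encoded directly, for example in an on-call helper script (a sketch; the function name is ours):

```shell
# Map Spanner CPU utilization (%) to the recommended action from the
# scaling guidelines table.
spanner_action() {
  local cpu=$1
  if   (( cpu < 30 ));  then echo "consider scaling down"
  elif (( cpu <= 65 )); then echo "optimal range"
  elif (( cpu <= 80 )); then echo "monitor closely"
  else                       echo "scale up immediately"
  fi
}

spanner_action 72   # → monitor closely
```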
Cloud SQL (DEPRECATED — replaced by Cloud Spanner)
Cloud SQL is no longer in use. All OLTP data is in Cloud Spanner. Spanner scales horizontally by adding nodes (see Spanner section above).
Edge Scaling (Cloudflare)
Workers
Cloudflare Workers auto-scale globally. No manual intervention needed.
Monitor Worker Performance
- Dashboard: Cloudflare Analytics
- Key metrics: CPU time, requests, errors
Adjust Worker Limits
# In wrangler.toml, adjust limits if needed
[limits]
cpu_ms = 50  # Default: 10 ms on the free plan, 50 ms on paid
Rate Limiting
Adjust Rate Limits for Traffic Spikes
# Using the Cloudflare API to update a rate-limit rule
curl -X PATCH "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/rate_limits/$RULE_ID" \
  -H "Authorization: Bearer $CF_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"threshold": 10000}'
Cache Configuration
Increase Cache Hit Ratio
# In Cloudflare Page Rules:
# - Cache Level: Cache Everything
# - Edge Cache TTL: 1 month
# - Browser Cache TTL: 1 hour
Pub/Sub & Cloud Tasks
Pub/Sub Scaling
Pub/Sub auto-scales. Monitor for:
- Subscription backlog
- Oldest unacked message age
Increase Ack Deadline
# For long-running subscribers
gcloud pubsub subscriptions update orders-subscription \
--ack-deadline=600
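A reasonable way to choose the deadline is to take observed p99 processing time plus a safety margin, capped at Pub/Sub's 600-second maximum. A sketch (the margin percentage is an assumption, not a Pub/Sub recommendation):

```shell
# Ack deadline = p99 processing time plus a margin, capped at the
# Pub/Sub maximum of 600 seconds.
ack_deadline() {
  local p99_s=$1 margin_pct=$2
  local d=$(( p99_s * (100 + margin_pct) / 100 ))
  if (( d > 600 )); then d=600; fi
  echo "$d"
}

# p99 of 240 s with a 50% margin:
ack_deadline 240 50   # → 360
```

Deadlines set too low cause redelivery storms under load; too high delays retry of genuinely failed messages.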
Cloud Tasks Scaling
Adjust Queue Rate
# Increase processing rate
gcloud tasks queues update order-processing \
--max-dispatches-per-second=500 \
--max-concurrent-dispatches=100
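When raising the dispatch rate to clear a backlog, it helps to estimate the drain time first (a lower bound — it ignores tasks still arriving):

```shell
# Seconds to drain a queue backlog at a given dispatch rate,
# assuming no new tasks arrive (worst case is longer).
drain_seconds() {
  local backlog=$1 rate=$2
  # ceil(backlog / rate) using integer math
  echo $(( (backlog + rate - 1) / rate ))
}

# 120,000 queued tasks at 500 dispatches/second:
drain_seconds 120000 500   # → 240
```

Also check that downstream services can absorb the new rate; otherwise the backlog simply moves.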
Clear Stuck Queue
# Pause queue
gcloud tasks queues pause order-processing
# Purge if needed
gcloud tasks queues purge order-processing
# Resume
gcloud tasks queues resume order-processing
Scaling Playbooks
Playbook: Traffic Spike
Symptoms: Increased latency, rising error rates
1. **Assess scale**

   ```shell
   # Check current scaling configuration (min/max instances)
   gcloud run services describe api-gateway --region=us-central1
   ```

2. **Increase capacity**

   ```shell
   # Raise minimum instances
   gcloud run services update api-gateway --min-instances=20 --region=us-central1
   gcloud run services update order-service --min-instances=10 --region=us-central1
   ```

3. **Monitor**
   - Watch latency and error rates
   - Verify instance count is increasing

4. **Scale the database if needed**

   ```shell
   gcloud spanner instances update prod-olympus-spanner --nodes=5
   ```
Playbook: Cost Optimization
Goal: Reduce costs while maintaining performance
1. **Review current utilization**
   - Check Cloud Run instance counts
   - Check Spanner CPU utilization
   - Review Spanner node count

2. **Identify over-provisioned resources**
   - Services with consistently low CPU utilization
   - Databases running below 30% CPU

3. **Scale down gradually**

   ```shell
   # Reduce min instances one service at a time
   gcloud run services update user-service --min-instances=1 --region=us-central1
   # Wait 1 hour and verify stability before the next change
   ```

4. **Monitor for regressions**
   - Set alerts for latency increases
   - Watch error rates
Playbook: New Feature Launch
Before Launch (1 week)
- Review expected traffic increase
- Estimate resource requirements
- Pre-scale critical services
Launch Day
- Double minimum instances
- Increase database capacity
- Monitor dashboards
Post-Launch (1 week)
- Analyze actual vs expected traffic
- Right-size resources
- Document for future launches
Monitoring and Alerts
Key Scaling Metrics
| Metric | Source | Threshold |
|---|---|---|
| Instance Count | Cloud Run | Near max = alert |
| CPU Utilization | Cloud Run | > 80% = alert |
| Spanner CPU | Spanner | > 65% = alert |
| Request Latency p99 | Cloud Monitoring | > 5s = alert |
| Error Rate | Cloud Monitoring | > 5% = alert |
Scaling Alerts
# Alert: Cloud Run at capacity
displayName: "Cloud Run Near Max Instances"
conditions:
  - displayName: "Instance count near max"
    conditionThreshold:
      filter: 'resource.type="cloud_run_revision" AND metric.type="run.googleapis.com/container/instance_count"'
      comparison: COMPARISON_GT
      thresholdValue: 80  # instances (~80% of api-gateway's max of 100)
      duration: "300s"
Cost Considerations
Scaling Cost Impact
| Resource | Scale Action | Cost Impact |
|---|---|---|
| Cloud Run min instances | +1 instance | ~$30/month |
| Cloud Run max instances | Increase limit | Only costs if used |
| Spanner node | +1 node | ~$900/month |
| ClickHouse Cloud | Scale replicas | Variable |
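The fixed-cost rows above can be combined into a quick estimate before approving a scaling change (a sketch using the rough per-unit figures from the table; ClickHouse is excluded because its cost is variable):

```shell
# Rough monthly cost delta (USD) for a scaling change, using the
# estimates above: ~$30 per always-on Cloud Run instance,
# ~$900 per Spanner node.
monthly_delta() {
  local extra_run_instances=$1 extra_spanner_nodes=$2
  echo $(( extra_run_instances * 30 + extra_spanner_nodes * 900 ))
}

# Adding 7 min instances and 2 Spanner nodes:
monthly_delta 7 2   # → 2010
```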
Cost-Effective Scaling
- Use min instances wisely - Only for consistent baseline
- Let auto-scale handle spikes - Cheaper than over-provisioning
- Scale Spanner carefully - Most expensive resource
- Use read replicas - Cheaper than scaling primary
Related Documentation
- Incident Response - Scaling during incidents
- Database Operations - Database-specific scaling
- On-Call Guide - When to scale during on-call
- Deployment - Canary deployments for gradual scaling