Scaling Operations Runbook
Procedures for scaling Olympus Cloud infrastructure to meet demand.
Overview
Olympus Cloud uses auto-scaling for most workloads, but manual intervention may be needed for rapid scaling, cost optimization, or incident response.
Scaling Architecture
┌─────────────────────────────────────────────────────────────────┐
│ Scaling Layers │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Layer 1: Edge (Cloudflare) │
│ ├── Workers: Auto-scale, no limits │
│ ├── KV: Globally distributed │
│ └── R2: Unlimited storage │
│ │
│ Layer 2: Compute (Cloud Run) │
│ ├── Min Instances: 1-10 (configurable) │
│ ├── Max Instances: 100-1000 (configurable) │
│ └── Scale to zero: Disabled for production │
│ │
│ Layer 3: Database (Spanner) │
│ ├── Spanner: Node-based horizontal scaling │
│ └── ClickHouse Cloud: Auto-scales for OLAP workloads │
│ │
│ Layer 4: Async Processing (Pub/Sub + Cloud Tasks) │
│ ├── Pub/Sub: Auto-scales subscribers │
│ └── Cloud Tasks: Queue-based processing │
│ │
└─────────────────────────────────────────────────────────────────┘
Cloud Run Scaling
Current Configuration
| Service | Min Instances | Max Instances | CPU | Memory |
|---|---|---|---|---|
| api-gateway | 3 | 100 | 2 | 2Gi |
| platform-service | 2 | 50 | 2 | 4Gi |
| order-service | 2 | 100 | 2 | 2Gi |
| user-service | 2 | 50 | 1 | 1Gi |
| ai-service | 1 | 20 | 4 | 8Gi |
Scaling Triggers
Cloud Run auto-scales based on:
- Request concurrency (default: 80 per instance)
- CPU utilization (default: 60%)
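As a rough back-of-the-envelope check before adjusting these triggers (this is Little's law applied as a sizing heuristic, not a Cloud Run formula — the numbers below are illustrative):

```shell
#!/usr/bin/env bash
# Rough instance estimate via Little's law:
#   in-flight requests ≈ RPS × mean latency (seconds)
#   instances needed  ≈ ceil(in-flight / concurrency per instance)
estimate_instances() {
  local rps=$1 latency_ms=$2 concurrency=$3
  # ceil(rps * latency_ms / 1000 / concurrency) using integer math
  echo $(( (rps * latency_ms + 1000 * concurrency - 1) / (1000 * concurrency) ))
}

# 2000 RPS at 200 ms mean latency with the default concurrency of 80:
estimate_instances 2000 200 80   # → 5
```

If the estimate approaches a service's max instances from the table above, raise the max before the traffic arrives.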
Manual Scaling Commands
Increase Minimum Instances
# For traffic spike preparation
gcloud run services update api-gateway \
--min-instances=10 \
--region=us-central1
Increase Maximum Instances
# For unexpected demand
gcloud run services update api-gateway \
--max-instances=200 \
--region=us-central1
Adjust CPU/Memory
# For memory-intensive workloads
gcloud run services update ai-service \
--memory=16Gi \
--cpu=8 \
--region=us-central1
Adjust Concurrency
# Lower concurrency for heavy requests
gcloud run services update ai-service \
--concurrency=20 \
--region=us-central1
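One way to pick a lower concurrency value is to work backward from per-request memory footprint. This is a heuristic sketch with assumed numbers (the 300 MiB per-request figure and 20% headroom are illustrative, not measured):

```shell
# Pick a concurrency that keeps one instance within its memory limit.
# Inputs: instance memory (MiB), peak per-request memory (MiB), headroom %.
safe_concurrency() {
  local instance_mem_mb=$1 per_request_mb=$2 headroom_pct=$3
  # usable memory after reserving headroom, divided by per-request footprint
  echo $(( instance_mem_mb * (100 - headroom_pct) / 100 / per_request_mb ))
}

# ai-service: 8 GiB instance, ~300 MiB per request, 20% headroom:
safe_concurrency 8192 300 20   # → 21
```

CPU-bound services may need an even lower value than this memory-based bound suggests; validate with load testing.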
Pre-Scaling for Events
Before known high-traffic events:
1. **24 hours before**

   ```shell
   # Double minimum instances
   gcloud run services update api-gateway --min-instances=6 --region=us-central1
   gcloud run services update order-service --min-instances=4 --region=us-central1
   ```

2. **1 hour before**

   ```shell
   # Pre-warm by sending synthetic traffic so instances are ready
   hey -n 1000 -c 50 https://api.olympuscloud.ai/health
   ```

3. **During the event**
   - Monitor dashboards closely
   - Be ready to scale further

4. **After the event**

   ```shell
   # Return to normal once traffic normalizes
   gcloud run services update api-gateway --min-instances=3 --region=us-central1
   gcloud run services update order-service --min-instances=2 --region=us-central1
   ```
Database Scaling
Cloud Spanner
Check Current Utilization
# View CPU utilization (target: 45-65%)
gcloud monitoring read \
"spanner.googleapis.com/instance/cpu/smoothed_utilization" \
--filter='resource.labels.instance_id="prod-olympus-spanner"' \
--interval='now-1h'
Scale Up Nodes
# Increase from 3 to 5 nodes
gcloud spanner instances update prod-olympus-spanner \
--nodes=5
# Takes effect immediately
# Monitor for 15 minutes to verify
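To decide how many nodes to scale to, a simple first approximation (our assumption — it presumes load spreads evenly across nodes, which Spanner does not guarantee for hot-spotted workloads):

```shell
# Nodes needed to bring CPU down to a target utilization,
# assuming load distributes evenly across nodes.
spanner_nodes_needed() {
  local nodes=$1 cpu_pct=$2 target_pct=$3
  # ceil(nodes * cpu_pct / target_pct) using integer math
  echo $(( (nodes * cpu_pct + target_pct - 1) / target_pct ))
}

# 3 nodes at 85% CPU, targeting 55% (middle of the 45-65% band):
spanner_nodes_needed 3 85 55   # → 5
```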
Scale Down Nodes
# Only scale down when CPU < 40%
# Scale down gradually (one node at a time)
gcloud spanner instances update prod-olympus-spanner \
--nodes=4
# Wait 30 minutes, verify stable
gcloud spanner instances update prod-olympus-spanner \
--nodes=3
Scaling Guidelines
| CPU Utilization | Action |
|---|---|
| < 30% | Consider scaling down |
| 30-65% | Optimal range |
| 65-80% | Monitor closely |
| > 80% | Scale up immediately |
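The table above can be encoded directly, for example in an on-call helper script (a sketch; the function name is ours):

```shell
# Map Spanner CPU utilization (%) to the recommended action from the
# scaling guidelines table.
spanner_action() {
  local cpu=$1
  if   (( cpu < 30 ));  then echo "consider scaling down"
  elif (( cpu <= 65 )); then echo "optimal range"
  elif (( cpu <= 80 )); then echo "monitor closely"
  else                       echo "scale up immediately"
  fi
}

spanner_action 72   # → monitor closely
```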
Cloud SQL (DEPRECATED — replaced by Cloud Spanner)
Cloud SQL is no longer in use. All OLTP data is in Cloud Spanner. Spanner scales horizontally by adding nodes (see Spanner section above).
Edge Scaling (Cloudflare)
Workers
Cloudflare Workers auto-scale globally. No manual intervention needed.
Monitor Worker Performance
- Dashboard: Cloudflare Analytics
- Key metrics: CPU time, requests, errors
Adjust Worker Limits
# In wrangler.toml, adjust limits if needed
[limits]
cpu_ms = 50  # Default: 10 ms on the free plan, 50 ms on paid
Rate Limiting
Adjust Rate Limits for Traffic Spikes
# Using the Cloudflare API to update a rate-limit rule
curl -X PATCH "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/rate_limits/$RULE_ID" \
  -H "Authorization: Bearer $CF_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"threshold": 10000}'
Cache Configuration
Increase Cache Hit Ratio
# In Cloudflare Page Rules:
# - Cache Level: Cache Everything
# - Edge Cache TTL: 1 month
# - Browser Cache TTL: 1 hour
Pub/Sub & Cloud Tasks
Pub/Sub Scaling
Pub/Sub auto-scales. Monitor for:
- Subscription backlog
- Oldest unacked message age
Increase Ack Deadline
# For long-running subscribers
gcloud pubsub subscriptions update orders-subscription \
--ack-deadline=600
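A reasonable way to choose the deadline is to take observed p99 processing time plus a safety margin, capped at Pub/Sub's 600-second maximum. A sketch (the margin percentage is an assumption, not a Pub/Sub recommendation):

```shell
# Ack deadline = p99 processing time plus a margin, capped at the
# Pub/Sub maximum of 600 seconds.
ack_deadline() {
  local p99_s=$1 margin_pct=$2
  local d=$(( p99_s * (100 + margin_pct) / 100 ))
  if (( d > 600 )); then d=600; fi
  echo "$d"
}

# p99 of 240 s with a 50% margin:
ack_deadline 240 50   # → 360
```

Deadlines set too low cause redelivery storms under load; too high delays retry of genuinely failed messages.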
Cloud Tasks Scaling
Adjust Queue Rate
# Increase processing rate
gcloud tasks queues update order-processing \
--max-dispatches-per-second=500 \
--max-concurrent-dispatches=100
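When raising the dispatch rate to clear a backlog, it helps to estimate the drain time first (a lower bound — it ignores tasks still arriving):

```shell
# Seconds to drain a queue backlog at a given dispatch rate,
# assuming no new tasks arrive (worst case is longer).
drain_seconds() {
  local backlog=$1 rate=$2
  # ceil(backlog / rate) using integer math
  echo $(( (backlog + rate - 1) / rate ))
}

# 120,000 queued tasks at 500 dispatches/second:
drain_seconds 120000 500   # → 240
```

Also check that downstream services can absorb the new rate; otherwise the backlog simply moves.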
Clear Stuck Queue
# Pause queue
gcloud tasks queues pause order-processing
# Purge if needed
gcloud tasks queues purge order-processing
# Resume
gcloud tasks queues resume order-processing
Scaling Playbooks
Playbook: Traffic Spike
Symptoms: Increased latency, rising error rates
1. **Assess scale**

   ```shell
   # Check current scaling configuration (min/max instances)
   gcloud run services describe api-gateway --region=us-central1
   ```

2. **Increase capacity**

   ```shell
   # Raise minimum instances
   gcloud run services update api-gateway --min-instances=20 --region=us-central1
   gcloud run services update order-service --min-instances=10 --region=us-central1
   ```

3. **Monitor**
   - Watch latency and error rates
   - Verify instance count is increasing

4. **Scale the database if needed**

   ```shell
   gcloud spanner instances update prod-olympus-spanner --nodes=5
   ```
Playbook: Cost Optimization
Goal: Reduce costs while maintaining performance
1. **Review current utilization**
   - Check Cloud Run instance counts
   - Check Spanner CPU utilization
   - Review Spanner node count

2. **Identify over-provisioned resources**
   - Services with consistently low CPU utilization
   - Databases running below 30% CPU

3. **Scale down gradually**

   ```shell
   # Reduce min instances one service at a time
   gcloud run services update user-service --min-instances=1 --region=us-central1
   # Wait 1 hour and verify stability before the next change
   ```

4. **Monitor for regressions**
   - Set alerts for latency increases
   - Watch error rates
Playbook: New Feature Launch
Before Launch (1 week)
- Review expected traffic increase
- Estimate resource requirements
- Pre-scale critical services
Launch Day
- Double minimum instances
- Increase database capacity
- Monitor dashboards
Post-Launch (1 week)
- Analyze actual vs expected traffic
- Right-size resources
- Document for future launches
Monitoring and Alerts
Key Scaling Metrics
| Metric | Source | Threshold |
|---|---|---|
| Instance Count | Cloud Run | Near max = alert |
| CPU Utilization | Cloud Run | > 80% = alert |
| Spanner CPU | Spanner | > 65% = alert |
| Request Latency p99 | Cloud Monitoring | > 5s = alert |
| Error Rate | Cloud Monitoring | > 5% = alert |
Scaling Alerts
# Alert: Cloud Run at capacity
displayName: "Cloud Run Near Max Instances"
conditions:
  - displayName: "Instance count near max"
    conditionThreshold:
      filter: 'resource.type="cloud_run_revision" AND metric.type="run.googleapis.com/container/instance_count"'
      comparison: COMPARISON_GT
      thresholdValue: 80  # instances (~80% of api-gateway's max of 100)
      duration: "300s"
Cost Considerations
Scaling Cost Impact
| Resource | Scale Action | Cost Impact |
|---|---|---|
| Cloud Run min instances | +1 instance | ~$30/month |
| Cloud Run max instances | Increase limit | Only costs if used |
| Spanner node | +1 node | ~$900/month |
| ClickHouse Cloud | Scale replicas | Variable |
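The fixed-cost rows above can be combined into a quick estimate before approving a scaling change (a sketch using the rough per-unit figures from the table; ClickHouse is excluded because its cost is variable):

```shell
# Rough monthly cost delta (USD) for a scaling change, using the
# estimates above: ~$30 per always-on Cloud Run instance,
# ~$900 per Spanner node.
monthly_delta() {
  local extra_run_instances=$1 extra_spanner_nodes=$2
  echo $(( extra_run_instances * 30 + extra_spanner_nodes * 900 ))
}

# Adding 7 min instances and 2 Spanner nodes:
monthly_delta 7 2   # → 2010
```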
Cost-Effective Scaling
- Use min instances wisely - Only for consistent baseline
- Let auto-scale handle spikes - Cheaper than over-provisioning
- Scale Spanner carefully - Most expensive resource
- Use read replicas - Cheaper than scaling primary
Related Documentation
- Incident Response - Scaling during incidents
- Database Operations - Database-specific scaling
- On-Call Guide - When to scale during on-call
- Deployment - Canary deployments for gradual scaling