Scaling Operations Runbook

Procedures for scaling Olympus Cloud infrastructure to meet demand.

Overview

Olympus Cloud uses auto-scaling for most workloads, but manual intervention may be needed for rapid scaling, cost optimization, or incident response.

Scaling Architecture

┌─────────────────────────────────────────────────────────┐
│                     Scaling Layers                      │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  Layer 1: Edge (Cloudflare)                             │
│  ├── Workers: Auto-scale, no limits                     │
│  ├── KV: Globally distributed                           │
│  └── R2: Unlimited storage                              │
│                                                         │
│  Layer 2: Compute (Cloud Run)                           │
│  ├── Min Instances: 1-10 (configurable)                 │
│  ├── Max Instances: 100-1000 (configurable)             │
│  └── Scale to zero: Disabled for production             │
│                                                         │
│  Layer 3: Database (Spanner)                            │
│  ├── Spanner: Node-based horizontal scaling             │
│  └── ClickHouse Cloud: Auto-scales for OLAP workloads   │
│                                                         │
│  Layer 4: Async Processing (Pub/Sub + Cloud Tasks)      │
│  ├── Pub/Sub: Auto-scales subscribers                   │
│  └── Cloud Tasks: Queue-based processing                │
│                                                         │
└─────────────────────────────────────────────────────────┘

Cloud Run Scaling

Current Configuration

Service            Min Instances   Max Instances   CPU   Memory
api-gateway        3               100             2     2Gi
platform-service   2               50              2     4Gi
order-service      2               100             2     2Gi
user-service       2               50              1     1Gi
ai-service         1               20              4     8Gi

Scaling Triggers

Cloud Run auto-scales based on:

  • Request concurrency (default: 80 per instance)
  • CPU utilization (default: 60%)
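These triggers also give a rough way to size capacity ahead of time: in-flight requests ≈ RPS × latency, and required instances ≈ in-flight ÷ concurrency. A minimal sketch with illustrative numbers (none of the values below come from this runbook):

```shell
#!/usr/bin/env bash
# Back-of-the-envelope instance estimate for a Cloud Run service.
RPS=2000        # expected peak requests per second (illustrative)
LAT_MS=200      # average request latency in ms (illustrative)
CONC=80         # requests handled concurrently per instance (default)

IN_FLIGHT=$(( RPS * LAT_MS / 1000 ))              # concurrent requests
INSTANCES=$(( (IN_FLIGHT + CONC - 1) / CONC ))    # ceiling division
echo "Estimated instances needed: $INSTANCES"
```

Compare the estimate against the service's max-instances setting to confirm there is headroom before a launch or event.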

Manual Scaling Commands

Increase Minimum Instances

# For traffic spike preparation
gcloud run services update api-gateway \
--min-instances=10 \
--region=us-central1

Increase Maximum Instances

# For unexpected demand
gcloud run services update api-gateway \
--max-instances=200 \
--region=us-central1

Adjust CPU/Memory

# For memory-intensive workloads
gcloud run services update ai-service \
--memory=16Gi \
--cpu=8 \
--region=us-central1

Adjust Concurrency

# Lower concurrency for heavy requests
gcloud run services update ai-service \
--concurrency=20 \
--region=us-central1

Pre-Scaling for Events

Before known high-traffic events:

  1. 24 hours before

    # Double minimum instances
    gcloud run services update api-gateway --min-instances=6
    gcloud run services update order-service --min-instances=4
  2. 1 hour before

    # Pre-warm by sending synthetic traffic
    # This ensures instances are ready
    hey -n 1000 -c 50 https://api.olympuscloud.ai/health
  3. During event

    • Monitor dashboards closely
    • Be ready to scale further
  4. After event

    # Return to normal (after traffic normalizes)
    gcloud run services update api-gateway --min-instances=3
    gcloud run services update order-service --min-instances=2
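The pre-event and post-event updates above can be batched into a single loop. A minimal sketch: the service=min pairs mirror the steps above, the region is an assumption, and the echo makes it a dry run (remove echo to apply):

```shell
#!/usr/bin/env bash
# Dry run: print the pre-event scaling commands without executing them.
REGION=us-central1   # assumed region

for entry in "api-gateway=6" "order-service=4"; do
  svc=${entry%%=*}   # text before '='  -> service name
  min=${entry##*=}   # text after '='   -> target min instances
  echo gcloud run services update "$svc" \
    --min-instances="$min" --region="$REGION"
done
```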

Database Scaling

Cloud Spanner

Check Current Utilization

# View CPU utilization (target: 45-65%) in the Spanner monitoring
# dashboard in the Cloud Console, or query the metric
#   spanner.googleapis.com/instance/cpu/smoothed_utilization
# via the Cloud Monitoring API, filtered to
#   resource.labels.instance_id="prod-olympus-spanner"
# over the last hour.

Scale Up Nodes

# Increase from 3 to 5 nodes
gcloud spanner instances update prod-olympus-spanner \
--nodes=5

# Takes effect immediately
# Monitor for 15 minutes to verify

Scale Down Nodes

# Only scale down when CPU < 40%
# Scale down gradually (one node at a time)
gcloud spanner instances update prod-olympus-spanner \
--nodes=4

# Wait 30 minutes, verify stable
gcloud spanner instances update prod-olympus-spanner \
--nodes=3

Scaling Guidelines

CPU Utilization   Action
< 30%             Consider scaling down
30-65%            Optimal range
65-80%            Monitor closely
> 80%             Scale up immediately
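The guidelines above can be encoded as a small helper for ad-hoc checks or scripts; a sketch, not an official tool:

```shell
#!/usr/bin/env bash
# Map a Spanner CPU utilization percentage to the recommended action.
scaling_action() {
  local cpu=$1
  if   [ "$cpu" -lt 30 ]; then echo "consider scaling down"
  elif [ "$cpu" -le 65 ]; then echo "optimal range"
  elif [ "$cpu" -le 80 ]; then echo "monitor closely"
  else                         echo "scale up immediately"
  fi
}

scaling_action 25   # -> consider scaling down
scaling_action 72   # -> monitor closely
```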

Cloud SQL (DEPRECATED — replaced by Cloud Spanner)

Cloud SQL is no longer in use. All OLTP data is in Cloud Spanner. Spanner scales horizontally by adding nodes (see Spanner section above).


Edge Scaling (Cloudflare)

Workers

Cloudflare Workers auto-scale globally. No manual intervention needed.

Monitor Worker Performance

Track request volume, CPU time, and error counts in the Cloudflare dashboard (Workers & Pages, then the Worker's metrics view), or stream live logs with wrangler tail.

Adjust Worker Limits

# In wrangler.toml, adjust limits if needed
[limits]
cpu_ms = 50  # Default: 10ms for free, 50ms for paid

Rate Limiting

Adjust Rate Limits for Traffic Spikes

# Using Cloudflare API to update rate limit rule
curl -X PATCH "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/ratelimits/$RULE_ID" \
-H "Authorization: Bearer $CF_TOKEN" \
-d '{"threshold": 10000}'

Cache Configuration

Increase Cache Hit Ratio

# In Cloudflare Page Rules:
# - Cache Level: Cache Everything
# - Edge Cache TTL: 1 month
# - Browser Cache TTL: 1 hour
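To verify the rules are working, compute the hit ratio from hit and miss counts. The counts below are illustrative; real numbers come from Cloudflare analytics:

```shell
#!/usr/bin/env bash
# Cache hit ratio from request counts (illustrative numbers).
HITS=920000
MISSES=80000
TOTAL=$(( HITS + MISSES ))
RATIO=$(( 100 * HITS / TOTAL ))   # integer percent
echo "Cache hit ratio: ${RATIO}%"
```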

Pub/Sub & Cloud Tasks

Pub/Sub Scaling

Pub/Sub auto-scales. Monitor for:

  • Subscription backlog
  • Oldest unacked message age

Increase Ack Deadline

# For long-running subscribers
gcloud pubsub subscriptions update orders-subscription \
--ack-deadline=600

Cloud Tasks Scaling

Adjust Queue Rate

# Increase processing rate
gcloud tasks queues update order-processing \
--max-dispatches-per-second=500 \
--max-concurrent-dispatches=100
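A useful sanity check when raising the rate: estimate how long the current backlog will take to drain. The backlog figure below is illustrative; read the real value from the queue's stats in the console:

```shell
#!/usr/bin/env bash
# Estimate drain time for a Cloud Tasks backlog at a given dispatch rate.
BACKLOG=90000   # tasks waiting (illustrative)
RATE=500        # --max-dispatches-per-second

DRAIN_SECS=$(( (BACKLOG + RATE - 1) / RATE ))   # ceiling division
echo "Approximate drain time: ${DRAIN_SECS}s (~$(( DRAIN_SECS / 60 )) min)"
```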

Clear Stuck Queue

# Pause queue
gcloud tasks queues pause order-processing

# Purge if needed
gcloud tasks queues purge order-processing

# Resume
gcloud tasks queues resume order-processing

Scaling Playbooks

Playbook: Traffic Spike

Symptoms: Increased latency, rising error rates

  1. Assess scale

    # Check which revision is currently serving (live instance counts are
    # on the Cloud Run metrics page / run.googleapis.com/container/instance_count)
    gcloud run services describe api-gateway \
    --format='value(status.latestReadyRevisionName)'
  2. Increase capacity

    # Increase min instances
    gcloud run services update api-gateway --min-instances=20
    gcloud run services update order-service --min-instances=10
  3. Monitor

    • Watch latency and error rates
    • Verify instance count increasing
  4. Scale database if needed

    gcloud spanner instances update prod-olympus-spanner --nodes=5

Playbook: Cost Optimization

Goal: Reduce costs while maintaining performance

  1. Review current utilization

    • Check Cloud Run instance counts
    • Check Spanner CPU utilization
    • Review Spanner node count
  2. Identify over-provisioned resources

    • Services with very low CPU
    • Database with < 30% CPU
  3. Scale down gradually

    # Reduce min instances (one at a time)
    gcloud run services update user-service --min-instances=1

    # Wait 1 hour, verify stability
  4. Monitor for regressions

    • Set alerts for latency increases
    • Watch error rates

Playbook: New Feature Launch

Before Launch (1 week)

  1. Review expected traffic increase
  2. Estimate resource requirements
  3. Pre-scale critical services

Launch Day

  1. Double minimum instances
  2. Increase database capacity
  3. Monitor dashboards

Post-Launch (1 week)

  1. Analyze actual vs expected traffic
  2. Right-size resources
  3. Document for future launches

Monitoring and Alerts

Key Scaling Metrics

Metric                Source             Threshold
Instance Count        Cloud Run          Near max = alert
CPU Utilization       Cloud Run          > 80% = alert
Spanner CPU           Spanner            > 65% = alert
Request Latency p99   Cloud Monitoring   > 5s = alert
Error Rate            Cloud Monitoring   > 5% = alert

Scaling Alerts

# Alert: Cloud Run at capacity
displayName: "Cloud Run Near Max Instances"
conditions:
  - displayName: "Instance count near max"
    conditionThreshold:
      filter: 'resource.type="cloud_run_revision" AND metric.type="run.googleapis.com/container/instance_count"'
      comparison: COMPARISON_GT
      thresholdValue: 80  # 80% of max
      duration: "300s"

Cost Considerations

Scaling Cost Impact

Resource                  Scale Action     Cost Impact
Cloud Run min instances   +1 instance      ~$30/month
Cloud Run max instances   Increase limit   Only costs if used
Spanner node              +1 node          ~$900/month
ClickHouse Cloud          Scale replicas   Variable
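Using the approximate unit prices above, the monthly delta of a scaling change can be estimated before applying it. The example deltas below are illustrative:

```shell
#!/usr/bin/env bash
# Rough monthly cost delta for a scaling change, using the table's prices.
RUN_INSTANCE_COST=30    # ~$/month per always-on Cloud Run instance
SPANNER_NODE_COST=900   # ~$/month per Spanner node

EXTRA_INSTANCES=7       # e.g. raising api-gateway min from 3 to 10
EXTRA_NODES=2           # e.g. raising Spanner from 3 to 5 nodes

DELTA=$(( EXTRA_INSTANCES * RUN_INSTANCE_COST + EXTRA_NODES * SPANNER_NODE_COST ))
echo "Estimated additional cost: ~\$${DELTA}/month"
```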

Cost-Effective Scaling

  1. Use min instances wisely - Only for consistent baseline
  2. Let auto-scale handle spikes - Cheaper than over-provisioning
  3. Scale Spanner carefully - Most expensive resource
  4. Use read replicas - Cheaper than scaling primary