Database Operations Runbook

Complete guide for database operations, maintenance, and troubleshooting.

Overview

Olympus Cloud uses Cloud Spanner as its primary transactional database (OLTP) for global scale, with ClickHouse Cloud for analytics (OLAP).

note

Cloud SQL (PostgreSQL) has been fully deprecated and removed. All workloads run on Cloud Spanner.

Database Architecture

┌─────────────────────────────────────────────────────────┐
│                  Database Architecture                  │
├─────────────────────────────────────────────────────────┤
│                                                         │
│                  ┌──────────────────┐                   │
│                  │   API Gateway    │                   │
│                  └────────┬─────────┘                   │
│                           │                             │
│         ┌─────────────────┼─────────────────┐           │
│         │                 │                 │           │
│         ▼                 ▼                 ▼           │
│   ┌────────────┐    ┌────────────┐    ┌────────────┐    │
│   │  Platform  │    │   Order    │    │    User    │    │
│   │  Service   │    │  Service   │    │  Service   │    │
│   └─────┬──────┘    └─────┬──────┘    └─────┬──────┘    │
│         │                 │                 │           │
│         └─────────────────┼─────────────────┘           │
│                           │                             │
│                           ▼                             │
│               ┌────────────────────────┐                │
│               │     Cloud Spanner      │                │
│               │(${ENV}-olympus-spanner)│                │
│               │                        │                │
│               │  ┌──────┐   ┌───────┐  │                │
│               │  │ DB 1 │   │ DB 2  │  │                │
│               │  └──────┘   └───────┘  │                │
│               └────────────────────────┘                │
│                                                         │
│   Multi-region: us-central1, us-east1                   │
│                                                         │
└─────────────────────────────────────────────────────────┘

Cloud Spanner Operations

Instance Details

Environment   Instance                 Config                Nodes
Production    prod-olympus-spanner     regional-us-central1  3
Staging       staging-olympus-spanner  regional-us-central1  1
Development   dev-olympus-spanner      regional-us-central1  1

Monitoring Spanner

Key Metrics to Monitor

Metric               Threshold  Action
CPU Utilization      > 65%      Scale up nodes
Storage Utilization  > 70%      Review data retention
Latency p99          > 100ms    Investigate queries
Lock Wait Time       > 1s       Check for deadlocks
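
The thresholds in this table can be encoded as a simple triage helper, e.g. in an alert handler. A hypothetical Python sketch (the metric keys and function are illustrative; only the thresholds and actions come from the table above):

```python
# Threshold, action pairs mirroring the monitoring table above.
# The metric keys themselves are hypothetical names for this sketch.
THRESHOLDS = {
    "cpu_utilization": (0.65, "Scale up nodes"),
    "storage_utilization": (0.70, "Review data retention"),
    "latency_p99_ms": (100, "Investigate queries"),
    "lock_wait_seconds": (1.0, "Check for deadlocks"),
}

def triage(metric: str, value: float):
    """Return the runbook action if the metric breaches its threshold, else None."""
    threshold, action = THRESHOLDS[metric]
    return action if value > threshold else None
```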

Check Current Status

# View instance details
gcloud spanner instances describe prod-olympus-spanner

# View database list (use appropriate instance for your environment)
gcloud spanner databases list --instance=${SPANNER_INSTANCE}

# Check CPU utilization
gcloud monitoring read "spanner.googleapis.com/instance/cpu/utilization" \
  --filter='resource.labels.instance_id="prod-olympus-spanner"' \
  --interval='now-1h'

Scaling Spanner

Scale Up (Add Nodes)

# Production: Increase from 3 to 5 nodes
gcloud spanner instances update prod-olympus-spanner \
  --nodes=5

# Verify scaling
gcloud spanner instances describe prod-olympus-spanner

Scale Down

# Only scale down during low traffic periods
# Check current load first
gcloud monitoring read "spanner.googleapis.com/instance/cpu/smoothed_utilization" \
  --filter='resource.labels.instance_id="prod-olympus-spanner"'

# Scale down if CPU < 40%
gcloud spanner instances update prod-olympus-spanner \
  --nodes=3
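
The scale-up and scale-down thresholds above form a hysteresis band: add nodes above 65% CPU, remove them below 40%, hold steady in between. A hypothetical Python sketch of that decision (the step sizes are illustrative, not a documented policy):

```python
def recommended_nodes(current_nodes: int, cpu_utilization: float) -> int:
    """Apply the runbook's hysteresis band to a smoothed CPU reading.

    Scale up above 65%, scale down below 40%, otherwise hold.
    Step sizes here are illustrative only."""
    if cpu_utilization > 0.65:
        return current_nodes + 2  # e.g. 3 -> 5, as in the scale-up example above
    if cpu_utilization < 0.40 and current_nodes > 1:
        return current_nodes - 1  # only during low-traffic periods
    return current_nodes
```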

Query Analysis

Find Slow Queries

-- Query statistics (last hour)
SELECT
  text_fingerprint,
  AVG(avg_latency_seconds) AS avg_latency,
  SUM(execution_count) AS total_executions,
  AVG(avg_rows) AS avg_rows
FROM SPANNER_SYS.QUERY_STATS_TOP_HOUR
WHERE avg_latency_seconds > 0.1
GROUP BY text_fingerprint
ORDER BY avg_latency DESC
LIMIT 20;

Analyze Query Plan

-- Get execution plan
EXPLAIN ANALYZE
SELECT * FROM orders
WHERE tenant_id = 'tenant-123'
AND created_at > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY);

Lock Analysis

-- Check for lock contention
SELECT
  lock_wait_seconds,
  sample_lock_requests
FROM SPANNER_SYS.LOCK_STATS_TOP_MINUTE
ORDER BY lock_wait_seconds DESC
LIMIT 10;

Backup and Restore

Automated Backups

Spanner has point-in-time recovery enabled. Retention: 7 days.

Create Manual Backup

# Create backup
gcloud spanner backups create olympus-db-backup-$(date +%Y%m%d) \
  --instance=${SPANNER_INSTANCE} \
  --database=olympus-db \
  --retention-period=30d \
  --async

# Check backup status
# Use the appropriate instance: dev-olympus-spanner, staging-olympus-spanner, or prod-olympus-spanner
gcloud spanner backups list --instance=${SPANNER_INSTANCE}
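
The naming and retention scheme above can be made explicit. A small Python sketch of the `$(date +%Y%m%d)` naming convention and the 30-day retention window (the helper names are hypothetical):

```python
from datetime import date, timedelta

def backup_name(day: date, database: str = "olympus-db") -> str:
    """Reproduce the olympus-db-backup-$(date +%Y%m%d) naming used above."""
    return f"{database}-backup-{day:%Y%m%d}"

def expires_on(created: date, retention_days: int = 30) -> date:
    """With --retention-period=30d, the backup is removed after this date."""
    return created + timedelta(days=retention_days)
```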

Restore from Backup

# Restore the backup into a new database
gcloud spanner databases restore \
  --source-instance=${SPANNER_INSTANCE} \
  --source-backup=olympus-db-backup-20260118 \
  --destination-instance=${SPANNER_INSTANCE} \
  --destination-database=olympus-db-restored

Cloud SQL (PostgreSQL) Operations - DEPRECATED

Deprecated

Cloud SQL (PostgreSQL) has been fully replaced by Cloud Spanner. The section below is retained for historical reference only. Do not use these procedures — all database operations should use the Cloud Spanner section above.

Instance Details

Environment  Instance            Version        Size
Production   olympus-pg-prod     PostgreSQL 15  db-custom-4-16384
Staging      olympus-pg-staging  PostgreSQL 15  db-f1-micro

Connection Management

Connect to Database

# Using Cloud SQL Proxy
cloud-sql-proxy olympuscloud-prod:us-central1:olympus-pg-prod &

# Connect via psql
psql "host=127.0.0.1 port=5432 dbname=olympus user=olympus_app"

Connection Pool Status

-- Check active connections
SELECT
  datname,
  count(*) AS connections,
  state
FROM pg_stat_activity
GROUP BY datname, state;

-- Check max connections
SHOW max_connections;

-- Kill connections idle for over 1 hour (state_change marks when the
-- session went idle; query_start only marks when its last query began)
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
  AND state_change < now() - interval '1 hour';

Query Troubleshooting

Find Slow Queries

-- Currently running queries
SELECT
  pid,
  now() - query_start AS duration,
  query,
  state
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY duration DESC;

-- Most time-consuming queries (requires pg_stat_statements; the columns
-- are named total_exec_time / mean_exec_time on PostgreSQL 13+)
SELECT
  query,
  calls,
  total_exec_time / 1000 AS total_seconds,
  mean_exec_time / 1000 AS avg_seconds
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;

Kill Long-Running Query

-- Graceful cancellation (replace 12345 with the target pid from pg_stat_activity)
SELECT pg_cancel_backend(12345);

-- Force termination
SELECT pg_terminate_backend(12345);

Analyze Table Statistics

-- Check table bloat
SELECT
  schemaname,
  relname,
  n_dead_tup,
  n_live_tup,
  round(n_dead_tup::numeric / NULLIF(n_live_tup, 0) * 100, 2) AS dead_ratio
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;

-- Run vacuum analyze
VACUUM ANALYZE tablename;
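
The dead_ratio computed by the SQL above can drive an automated vacuum decision. A hypothetical Python sketch (the 20% cutoff is an illustrative threshold, not a documented policy):

```python
def dead_ratio(n_dead_tup: int, n_live_tup: int):
    """Same ratio as the SQL above: dead tuples as a % of live tuples.
    Returns None when there are no live tuples (the SQL's NULLIF case)."""
    if n_live_tup == 0:
        return None
    return round(n_dead_tup / n_live_tup * 100, 2)

def needs_vacuum(n_dead_tup: int, n_live_tup: int, threshold_pct: float = 20.0) -> bool:
    """Hypothetical cutoff: flag tables whose dead ratio exceeds threshold_pct."""
    ratio = dead_ratio(n_dead_tup, n_live_tup)
    return ratio is not None and ratio > threshold_pct
```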

Index Management

Check Index Usage

-- Unused indexes
SELECT
  schemaname,
  relname,
  indexrelname,
  idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY relname;

-- Index size
SELECT
  indexrelname,
  pg_size_pretty(pg_relation_size(indexrelid)) AS size
FROM pg_stat_user_indexes
ORDER BY pg_relation_size(indexrelid) DESC;

Rebuild Index

-- Rebuild index (non-blocking)
REINDEX INDEX CONCURRENTLY index_name;

-- Rebuild all indexes on table
REINDEX TABLE CONCURRENTLY table_name;

Backup and Restore

Automated Backups

  • Daily automated backups: 7-day retention
  • Point-in-time recovery: Enabled

Create Manual Backup

# Create on-demand backup
gcloud sql backups create --instance=olympus-pg-prod

# List backups
gcloud sql backups list --instance=olympus-pg-prod

Restore from Backup

# Restore to same instance (DESTRUCTIVE)
gcloud sql backups restore BACKUP_ID \
  --restore-instance=olympus-pg-prod

# Restore to new instance (SAFE)
# (--point-in-time applies to PostgreSQL; the bin-log flags are MySQL-only)
gcloud sql instances clone olympus-pg-prod olympus-pg-restored \
  --point-in-time=...

Failover Operations

Check Replica Status

# List replicas
gcloud sql instances list --filter="masterInstanceName:olympus-pg-prod"

# Check replication lag
gcloud sql instances describe olympus-pg-replica

Initiate Failover

# Failover to replica (production impact)
gcloud sql instances failover olympus-pg-prod

# Verify new primary
gcloud sql instances describe olympus-pg-prod

Common Issues

High CPU Utilization

Symptoms

  • Slow query response
  • Connection timeouts
  • High latency alerts

Investigation

# For Spanner
gcloud monitoring read "spanner.googleapis.com/instance/cpu/utilization" \
  --filter='resource.labels.instance_id="prod-olympus-spanner"'

# For PostgreSQL: CPU is exposed as the
# cloudsql.googleapis.com/database/cpu/utilization metric in Cloud Monitoring
gcloud sql instances describe olympus-pg-prod

Resolution

  1. Identify problematic queries (see Query Analysis)
  2. Add missing indexes
  3. Scale up nodes/instance
  4. Enable query caching

Connection Exhaustion

Symptoms

  • "Too many connections" errors
  • New connections failing
  • Application connection timeouts

Investigation

-- PostgreSQL
SELECT count(*) FROM pg_stat_activity;
SHOW max_connections;

Resolution

  1. Increase max_connections
  2. Implement connection pooling (PgBouncer)
  3. Kill idle connections
  4. Review the application for connection leaks
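
For step 2, a widely-cited community heuristic for sizing a PostgreSQL pool (e.g. PgBouncer's default_pool_size) is roughly cores × 2 plus the effective spindle count. A hypothetical helper; treat the result as a starting point to load-test, not a rule:

```python
def pool_size(cpu_cores: int, spindle_count: int = 1) -> int:
    """Community sizing heuristic: cores * 2 + effective spindle count.
    A starting point for a connection pool size, to be validated
    under realistic load before adopting."""
    return cpu_cores * 2 + spindle_count
```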

Deadlocks

Symptoms

  • Transactions hanging
  • Lock wait timeouts
  • Application errors

Investigation

-- PostgreSQL: view blocked vs. blocking sessions
SELECT
  blocked_locks.pid AS blocked_pid,
  blocking_locks.pid AS blocking_pid,
  blocked_activity.query AS blocked_query
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity
  ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
  ON blocking_locks.locktype = blocked_locks.locktype
  AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
  AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
  AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
  AND blocking_locks.pid != blocked_locks.pid
WHERE NOT blocked_locks.granted
  AND blocking_locks.granted;

Resolution

  1. Kill blocking transaction
  2. Review transaction isolation levels
  3. Optimize lock ordering in application
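
Step 3, consistent lock ordering, means every transaction acquires its locks in one global order, which removes the circular-wait condition deadlocks require. A hypothetical Python sketch (the resource names and `transfer` helper are illustrative, not part of the Olympus codebase):

```python
import threading

# One lock per logical resource. Acquiring locks in a single global order
# (here: sorted by resource name) removes the circular-wait condition.
_locks = {name: threading.Lock() for name in ("accounts", "orders")}

def lock_in_order(*resources: str) -> list:
    """Return the locks for `resources` in a globally consistent order."""
    return [_locks[name] for name in sorted(set(resources))]

def transfer(a: str, b: str) -> None:
    """transfer(a, b) and transfer(b, a) acquire locks in the same order,
    so two concurrent transfers cannot deadlock each other."""
    first, second = lock_in_order(a, b)
    with first, second:
        pass  # ...perform the work while holding both locks...
```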

Replication Lag

Symptoms

  • Read replicas returning stale data
  • Monitoring alerts for lag

Investigation

# Check replica status for Cloud SQL; lag itself is exposed as the
# cloudsql.googleapis.com/database/replication/replica_lag metric
gcloud sql instances describe olympus-pg-replica

Resolution

  1. Scale up replica
  2. Reduce write load on primary
  3. Check network issues

Maintenance Procedures

Weekly Maintenance

  1. Review slow query logs
  2. Check index usage
  3. Verify backup status
  4. Review storage growth

Monthly Maintenance

  1. VACUUM ANALYZE on all tables
  2. Review and remove unused indexes
  3. Test backup restoration
  4. Review connection pool settings

Quarterly Maintenance

  1. Full backup test and DR drill
  2. Review and update scaling thresholds
  3. Performance baseline comparison
  4. Schema optimization review

Emergency Procedures

danger

Database emergency procedures involve destructive operations (failover, restore) that can cause data loss if executed incorrectly. Always confirm you are operating on the correct instance and environment before running any restore or failover command. Production operations require Incident Commander approval.

Database Corruption

  1. Stop all writes immediately
  2. Assess corruption scope
  3. Restore from latest backup
  4. Validate data integrity
  5. Resume operations

Complete Database Failure

  1. Failover to replica (if available)
  2. If no replica, restore from backup
  3. Update application connection strings
  4. Validate functionality
  5. Post-incident review

Data Loss Prevention

  1. Enable point-in-time recovery
  2. Cross-region backups for production
  3. Regular backup testing
  4. Read replicas for critical workloads

Monitoring Dashboards

Dashboard            Purpose            Link
Spanner Overview     Instance health    Cloud Console
PostgreSQL Overview  Instance health    Cloud Console
Query Insights       Query performance  Query Insights
Custom Metrics       Application-level  Cloud Monitoring