Database Operations Runbook
Complete guide for database operations, maintenance, and troubleshooting.
Overview
Olympus Cloud uses Cloud Spanner as its primary transactional database (OLTP) for global scale, with ClickHouse Cloud for analytics (OLAP).
Cloud SQL (PostgreSQL) has been fully deprecated and removed. All workloads run on Cloud Spanner.
Database Architecture
┌─────────────────────────────────────────────────────────────────┐
│ Database Architecture │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────────┐ │
│ │ API Gateway │ │
│ └────────┬─────────┘ │
│ │ │
│ ┌─────────────────┼─────────────────┐ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ Platform │ │ Order │ │ User │ │
│ │ Service │ │ Service │ │ Service │ │
│ └─────┬──────┘ └─────┬──────┘ └─────┬──────┘ │
│ │ │ │ │
│ └────────────────┼────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────┐ │
│ │ Cloud Spanner │ │
│ │ (${ENV}-olympus-spanner)│ │
│ │ │ │
│ │ ┌──────┐ ┌───────┐ │ │
│ │ │ DB 1 │ │ DB 2 │ │ │
│ │ └──────┘ └───────┘ │ │
│ └────────────────────────┘ │
│ │
│ Multi-region: us-central1, us-east1 │
│ │
└─────────────────────────────────────────────────────────────────┘
Cloud Spanner Operations
Instance Details
| Environment | Instance | Config | Nodes |
|---|---|---|---|
| Production | prod-olympus-spanner | regional-us-central1 | 3 |
| Staging | staging-olympus-spanner | regional-us-central1 | 1 |
| Development | dev-olympus-spanner | regional-us-central1 | 1 |
Monitoring Spanner
Key Metrics to Monitor
| Metric | Threshold | Action |
|---|---|---|
| CPU Utilization | > 65% | Scale up nodes |
| Storage Utilization | > 70% | Review data retention |
| Latency p99 | > 100ms | Investigate queries |
| Lock Wait Time | > 1s | Check for deadlocks |
Check Current Status
# View instance details
gcloud spanner instances describe prod-olympus-spanner
# View database list (use appropriate instance for your environment)
gcloud spanner databases list --instance=${SPANNER_INSTANCE}
# Check CPU utilization
gcloud monitoring read "spanner.googleapis.com/instance/cpu/utilization" \
--filter='resource.labels.instance_id="prod-olympus-spanner"' \
--interval='now-1h'
Scaling Spanner
Scale Up (Add Nodes)
# Production: Increase from 3 to 5 nodes
gcloud spanner instances update prod-olympus-spanner \
--nodes=5
# Verify scaling
gcloud spanner instances describe prod-olympus-spanner
Scale Down
# Only scale down during low traffic periods
# Check current load first
gcloud monitoring read "spanner.googleapis.com/instance/cpu/smoothed_utilization" \
--filter='resource.labels.instance_id="prod-olympus-spanner"'
# Scale down if CPU < 40%
gcloud spanner instances update prod-olympus-spanner \
--nodes=3
Query Analysis
Find Slow Queries
-- Query statistics (last hour)
SELECT
text_fingerprint,
AVG(avg_latency_seconds) as avg_latency,
SUM(execution_count) as total_executions,
AVG(avg_rows_returned) as avg_rows
FROM SPANNER_SYS.QUERY_STATS_TOP_HOUR
WHERE avg_latency_seconds > 0.1
ORDER BY avg_latency DESC
LIMIT 20;
Analyze Query Plan
-- Get execution plan
EXPLAIN ANALYZE
SELECT * FROM orders
WHERE tenant_id = 'tenant-123'
AND created_at > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY);
Lock Analysis
-- Check for lock contention
SELECT
lock_wait_seconds,
sample_lock_requests
FROM SPANNER_SYS.LOCK_STATS_TOP_MINUTE
ORDER BY lock_wait_seconds DESC
LIMIT 10;
Backup and Restore
Automated Backups Spanner has point-in-time recovery enabled. Retention: 7 days.
Create Manual Backup
# Create backup
gcloud spanner backups create olympus-db-backup-$(date +%Y%m%d) \
--instance=${SPANNER_INSTANCE} \
--database=olympus-db \
--retention-period=30d \
--async
# Check backup status
# Use the appropriate instance: dev-olympus-spanner, staging-olympus-spanner, or prod-olympus-spanner
gcloud spanner backups list --instance=${SPANNER_INSTANCE}
Restore from Backup
# Restore to new database
gcloud spanner databases create olympus-db-restored \
--instance=${SPANNER_INSTANCE} \
--source-database=olympus-db \
--source-backup=olympus-db-backup-20260118
Cloud SQL (PostgreSQL) Operations - DEPRECATED
Cloud SQL (PostgreSQL) has been fully replaced by Cloud Spanner. The section below is retained for historical reference only. Do not use these procedures — all database operations should use the Cloud Spanner section above.
Instance Details
| Environment | Instance | Version | Size |
|---|---|---|---|
| Production | olympus-pg-prod | PostgreSQL 15 | db-custom-4-16384 |
| Staging | olympus-pg-staging | PostgreSQL 15 | db-f1-micro |
Connection Management
Connect to Database
# Using Cloud SQL Proxy
cloud-sql-proxy olympuscloud-prod:us-central1:olympus-pg-prod &
# Connect via psql
psql "host=127.0.0.1 port=5432 dbname=olympus user=olympus_app"
Connection Pool Status
-- Check active connections
SELECT
datname,
count(*) as connections,
state
FROM pg_stat_activity
GROUP BY datname, state;
-- Check max connections
SHOW max_connections;
-- Kill idle connections over 1 hour
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
AND query_start < now() - interval '1 hour';
Query Troubleshooting
Find Slow Queries
-- Currently running queries
SELECT
pid,
now() - query_start AS duration,
query,
state
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY duration DESC;
-- Most time-consuming queries (if pg_stat_statements enabled)
SELECT
query,
calls,
total_time / 1000 AS total_seconds,
mean_time / 1000 AS avg_seconds
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 20;
Kill Long-Running Query
-- Graceful termination
SELECT pg_cancel_backend(pid);
-- Force termination
SELECT pg_terminate_backend(pid);
Analyze Table Statistics
-- Check table bloat
SELECT
schemaname,
relname,
n_dead_tup,
n_live_tup,
round(n_dead_tup::numeric / NULLIF(n_live_tup, 0) * 100, 2) AS dead_ratio
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;
-- Run vacuum analyze
VACUUM ANALYZE tablename;
Index Management
Check Index Usage
-- Unused indexes
SELECT
schemaname,
relname,
indexrelname,
idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY relname;
-- Index size
SELECT
indexrelname,
pg_size_pretty(pg_relation_size(indexrelid)) AS size
FROM pg_stat_user_indexes
ORDER BY pg_relation_size(indexrelid) DESC;
Rebuild Index
-- Rebuild index (non-blocking)
REINDEX INDEX CONCURRENTLY index_name;
-- Rebuild all indexes on table
REINDEX TABLE CONCURRENTLY table_name;
Backup and Restore
Automated Backups
- Daily automated backups: 7-day retention
- Point-in-time recovery: Enabled
Create Manual Backup
# Create on-demand backup
gcloud sql backups create --instance=olympus-pg-prod
# List backups
gcloud sql backups list --instance=olympus-pg-prod
Restore from Backup
# Restore to same instance (DESTRUCTIVE)
gcloud sql backups restore BACKUP_ID \
--restore-instance=olympus-pg-prod
# Restore to new instance (SAFE)
gcloud sql instances clone olympus-pg-prod olympus-pg-restored \
--bin-log-file-name=... \
--bin-log-position=...
Failover Operations
Check Replica Status
# List replicas
gcloud sql instances list --filter="masterInstanceName:olympus-pg-prod"
# Check replication lag
gcloud sql instances describe olympus-pg-replica
Initiate Failover
# Failover to replica (production impact)
gcloud sql instances failover olympus-pg-prod
# Verify new primary
gcloud sql instances describe olympus-pg-prod
Common Issues
High CPU Utilization
Symptoms
- Slow query response
- Connection timeouts
- High latency alerts
Investigation
# For Spanner
gcloud monitoring read "spanner.googleapis.com/instance/cpu/utilization" \
--filter='resource.labels.instance_id="prod-olympus-spanner"'
# For PostgreSQL
gcloud sql instances describe olympus-pg-prod --format='value(settings.databaseFlags)'
Resolution
- Identify problematic queries (see Query Analysis)
- Add missing indexes
- Scale up nodes/instance
- Enable query caching
Connection Exhaustion
Symptoms
- "Too many connections" errors
- New connections failing
- Application connection timeouts
Investigation
-- PostgreSQL
SELECT count(*) FROM pg_stat_activity;
SHOW max_connections;
Resolution
- Increase max_connections
- Implement connection pooling (PgBouncer)
- Kill idle connections
- Review connection leak in application
Deadlocks
Symptoms
- Transactions hanging
- Lock wait timeouts
- Application errors
Investigation
-- PostgreSQL: View locks
SELECT
blocked_locks.pid AS blocked_pid,
blocking_locks.pid AS blocking_pid,
blocked_activity.query AS blocked_query
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_locks blocking_locks ON blocked_locks.locktype = blocking_locks.locktype
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
WHERE NOT blocked_locks.granted;
Resolution
- Kill blocking transaction
- Review transaction isolation levels
- Optimize lock ordering in application
Replication Lag
Symptoms
- Read replicas returning stale data
- Monitoring alerts for lag
Investigation
# Check lag for Cloud SQL
gcloud sql instances describe olympus-pg-replica \
--format='value(replicaConfiguration.failoverTarget)'
Resolution
- Scale up replica
- Reduce write load on primary
- Check network issues
Maintenance Procedures
Weekly Maintenance
- Review slow query logs
- Check index usage
- Verify backup status
- Review storage growth
Monthly Maintenance
- VACUUM ANALYZE on all tables
- Review and remove unused indexes
- Test backup restoration
- Review connection pool settings
Quarterly Maintenance
- Full backup test and DR drill
- Review and update scaling thresholds
- Performance baseline comparison
- Schema optimization review
Emergency Procedures
Database emergency procedures involve destructive operations (failover, restore) that can cause data loss if executed incorrectly. Always confirm you are operating on the correct instance and environment before running any restore or failover command. Production operations require Incident Commander approval.
Database Corruption
- Stop all writes immediately
- Assess corruption scope
- Restore from latest backup
- Validate data integrity
- Resume operations
Complete Database Failure
- Failover to replica (if available)
- If no replica, restore from backup
- Update application connection strings
- Validate functionality
- Post-incident review
Data Loss Prevention
- Enable point-in-time recovery
- Cross-region backups for production
- Regular backup testing
- Read replicas for critical workloads
Monitoring Dashboards
| Dashboard | Purpose | Link |
|---|---|---|
| Spanner Overview | Instance health | Cloud Console |
| PostgreSQL Overview | Instance health | Cloud Console |
| Query Insights | Query performance | Query Insights |
| Custom Metrics | Application-level | Cloud Monitoring |
Related Documentation
- Incident Response - For database-related incidents
- Disaster Recovery - DR procedures including database
- Deployment - Database migration procedures