Monitoring & Alerts Runbook

Complete guide to observability, monitoring, and alerting for Olympus Cloud.

Overview

Olympus Cloud uses a multi-layered observability stack to ensure visibility into system health, performance, and user experience.

Observability Stack

┌─────────────────────────────────────────────────────────────────┐
│ Observability Stack │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Data Collection │ │
│ ├─────────────────────────────────────────────────────────────┤ │
│ │ Metrics │ Logs │ Traces │ Events │ │
│ │ (metrics) │ (stdout) │ (OpenTel) │ (Pub/Sub) │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Processing & Storage │ │
│ ├─────────────────────────────────────────────────────────────┤ │
│ │ Cloud Monitoring │ Cloud Logging │ Cloud Trace │ │
│ │ (metrics) │ (logs) │ (traces) │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Visualization & Alerting │ │
│ ├─────────────────────────────────────────────────────────────┤ │
│ │ Dashboards │ Alert Policies │ PagerDuty │ Slack │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘

Dashboards

Main Dashboards

| Dashboard | Purpose | Link |
|-----------|---------|------|
| Platform Health | Overall system status | View |
| Services | Per-service metrics | View |
| Database | Spanner + SQL metrics | View |
| Edge | Cloudflare metrics | Cloudflare |
| AI Services | Voice AI, LLM metrics | View |

Platform Health Dashboard

┌─────────────────────────────────────────────────────────────────┐
│ Platform Health Last 1 hour │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Availability Error Rate Latency p99 │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ 99.95% │ │ 0.12% │ │ 245ms │ │
│ │ ✅ │ │ ✅ │ │ ✅ │ │
│ └────────────┘ └────────────┘ └────────────┘ │
│ │
│ Request Rate (req/s) │
│ 500─┤ ▄▄▄▄▄▄▄▄ │
│ 400─┤ ▄▄▄▄████████▄▄▄▄ │
│ 300─┤ ▄▄▄▄████████████████▄▄▄▄ │
│ 200─┤▄▄▄▄████████████████████████▄▄▄▄ │
│ 100─┤ │
│ 0─┴────────────────────────────────────────────────── │
│ │
│ Service Status │
│ ├── api-gateway ✅ Healthy │
│ ├── platform-service ✅ Healthy │
│ ├── order-service ✅ Healthy │
│ ├── user-service ✅ Healthy │
│ └── ai-service ✅ Healthy │
│ │
└─────────────────────────────────────────────────────────────────┘

Creating Custom Dashboards

  1. Navigate to Cloud Monitoring > Dashboards
  2. Click "Create Dashboard"
  3. Add widgets:
    • Line charts for time series
    • Gauges for current values
    • Tables for top-N lists
    • Text for documentation

Metrics

Key Metrics to Monitor

Service Metrics

| Metric | Description | SLO |
|--------|-------------|-----|
| request_count | Total requests | - |
| request_latency_ms | Request duration | p99 < 500ms |
| error_rate | Errors / requests | < 0.1% |
| instance_count | Running instances | - |
| cpu_utilization | CPU usage | < 80% |
| memory_utilization | Memory usage | < 90% |
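The error_rate row above is a ratio of two counters; a minimal sketch of the SLO check (function names are illustrative, not from the codebase):

```python
def error_rate(error_count: int, request_count: int) -> float:
    """Fraction of requests that failed; 0.0 when there is no traffic."""
    if request_count == 0:
        return 0.0
    return error_count / request_count

def within_slo(error_count: int, request_count: int, slo: float = 0.001) -> bool:
    """True while the error rate stays under the < 0.1% target above."""
    return error_rate(error_count, request_count) < slo
```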

Database Metrics

| Metric | Description | SLO |
|--------|-------------|-----|
| spanner/cpu_utilization | Spanner CPU | < 65% |
| spanner/storage_used | Storage bytes | < 80% capacity |
| clickhouse/cpu_usage | ClickHouse Cloud CPU | < 80% |
| clickhouse/active_queries | Active queries | < 100 concurrent |

AI Service Metrics

| Metric | Description | SLO |
|--------|-------------|-----|
| voice_ai/transcription_accuracy | STT accuracy | > 95% |
| voice_ai/response_latency_ms | Voice response time | < 2000ms |
| llm/tokens_used | LLM token consumption | - |
| llm/cost_usd | LLM cost tracking | - |
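llm/cost_usd can be derived from llm/tokens_used; a hedged sketch with placeholder per-token prices (the real rates depend on the model and provider, and are not stated in this runbook):

```python
# Placeholder prices per 1K tokens -- NOT real rates.
PRICE_PER_1K_TOKENS_USD = {"input": 0.003, "output": 0.006}

def llm_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate one request's cost in USD from its token counts."""
    return (input_tokens / 1000) * PRICE_PER_1K_TOKENS_USD["input"] \
         + (output_tokens / 1000) * PRICE_PER_1K_TOKENS_USD["output"]
```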

Custom Metrics

Creating custom metrics in application code:

```python
# Python example
import time

from google.cloud import monitoring_v3

# PROJECT_ID is assumed to be set elsewhere in the service's config.

def record_custom_metric(value: float) -> None:
    client = monitoring_v3.MetricServiceClient()
    project_name = f"projects/{PROJECT_ID}"

    series = monitoring_v3.TimeSeries()
    series.metric.type = "custom.googleapis.com/order/processing_time_ms"
    series.resource.type = "cloud_run_revision"

    point = monitoring_v3.Point()
    point.value.double_value = value
    point.interval.end_time.seconds = int(time.time())
    series.points = [point]

    client.create_time_series(name=project_name, time_series=[series])
```

Logging

Log Locations

| Source | Log Name | Retention |
|--------|----------|-----------|
| Cloud Run | run.googleapis.com/requests | 30 days |
| Application | stdout/stderr | 30 days |
| Audit | cloudaudit.googleapis.com | 400 days |
| VPC Flow | compute.googleapis.com/vpc_flows | 30 days |

Log Queries

View Recent Errors

resource.type="cloud_run_revision"
severity>=ERROR
timestamp>="2026-01-18T00:00:00Z"

Find Specific Request

resource.type="cloud_run_revision"
httpRequest.requestUrl=~"orders/12345"

Search Application Logs

resource.type="cloud_run_revision"
resource.labels.service_name="order-service"
jsonPayload.user_id="user-abc123"

Aggregate Error Counts

resource.type="cloud_run_revision"
severity=ERROR
| GROUP BY jsonPayload.error_type
| COUNT
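Filters like the ones above can be assembled in code before passing them to `gcloud logging read` or the Logging API; a small helper (hypothetical, not part of the codebase):

```python
def logging_filter(service: str, severity: str = "", **json_fields: str) -> str:
    """Build a Cloud Logging filter string like the query examples above."""
    clauses = [
        'resource.type="cloud_run_revision"',
        f'resource.labels.service_name="{service}"',
    ]
    if severity:
        clauses.append(f"severity>={severity}")
    for key, value in json_fields.items():
        clauses.append(f'jsonPayload.{key}="{value}"')
    return "\n".join(clauses)
```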

Log-Based Metrics

Create metrics from logs for alerting:

```hcl
# Terraform example
resource "google_logging_metric" "payment_errors" {
  name   = "payment-errors"
  filter = <<-EOT
    resource.type="cloud_run_revision"
    resource.labels.service_name="order-service"
    jsonPayload.error_type="PaymentFailed"
  EOT

  metric_descriptor {
    metric_kind = "DELTA"
    value_type  = "INT64"
  }
}
```

Tracing

Trace Analysis

View Traces in Cloud Trace

  1. Navigate to Cloud Trace
  2. Filter by:
    • Service name
    • Latency threshold
    • Time range
    • HTTP method/status

Identify Slow Spans

# In Cloud Trace, filter:
latency > 1s
service = "order-service"

Trace Sampling

| Environment | Sampling Rate |
|-------------|---------------|
| Development | 100% |
| Staging | 50% |
| Production | 10% |
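Ratio-based sampling makes the keep/drop decision deterministic per trace ID, so every service handling a request agrees on whether to record it. OpenTelemetry's built-in TraceIdRatioBased sampler works on this principle; a simplified sketch of the idea:

```python
import hashlib

def should_sample(trace_id: str, rate: float) -> bool:
    """Deterministically keep roughly `rate` of traces (0.10 in production).

    Hashing the trace ID maps it to a uniform value in [0, 1); the same
    trace ID always lands in the same bucket, on every service.
    """
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate
```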

Adding Custom Spans

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def process_order(order_id):
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)

        with tracer.start_as_current_span("validate_order"):
            validate(order_id)

        with tracer.start_as_current_span("charge_payment"):
            charge(order_id)
```

Alert Policies

Current Alert Policies

| Alert | Condition | Severity |
|-------|-----------|----------|
| High Error Rate | > 5% errors for 5 min | P1 |
| High Latency | p99 > 2s for 5 min | P2 |
| Service Down | Uptime check fails | P1 |
| Database CPU High | > 80% for 10 min | P2 |
| Spanner CPU High | > 65% for 10 min | P2 |
| Memory Exhaustion | > 95% for 5 min | P1 |
| Certificate Expiry | < 14 days | P3 |
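Note that "> 5% errors for 5 min" means the condition must hold across the whole window, not spike once; a sketch of that evaluation (illustrative only -- Cloud Monitoring performs this server-side):

```python
def alert_should_fire(samples, threshold, duration_s, now):
    """samples: list of (unix_ts, value) pairs.

    Fire only if every sample in the trailing `duration_s` window exceeds
    `threshold`, and the window is not empty.
    """
    window = [v for ts, v in samples if ts >= now - duration_s]
    return bool(window) and all(v > threshold for v in window)
```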

Alert Policy Structure

```yaml
# Example alert policy
displayName: "High Error Rate - API Gateway"
documentation:
  content: |
    ## Error Rate Alert

    The API gateway error rate has exceeded 5% for 5 minutes.

    **Runbook**: /docs/operations/runbooks/incident-response

    **Quick Actions**:
    1. Check recent deployments
    2. View error logs
    3. Consider rollback

conditions:
  - displayName: "Error rate > 5%"
    conditionThreshold:
      filter: |
        resource.type = "cloud_run_revision" AND
        resource.labels.service_name = "api-gateway" AND
        metric.type = "run.googleapis.com/request_count"
      aggregations:
        - alignmentPeriod: "300s"
          perSeriesAligner: ALIGN_RATE
      comparison: COMPARISON_GT
      thresholdValue: 0.05
      duration: "300s"

notificationChannels:
  - projects/olympuscloud-prod/notificationChannels/pagerduty
  - projects/olympuscloud-prod/notificationChannels/slack
```

Creating New Alerts

  1. Via Console

    • Cloud Monitoring > Alerting > Create Policy
    • Define condition, notification, documentation
  2. Via Terraform

    resource "google_monitoring_alert_policy" "high_latency" {
      display_name = "High Latency Alert"
      combiner     = "OR"

      conditions {
        display_name = "Latency p99 > 2s"
        condition_threshold {
          filter          = "resource.type=\"cloud_run_revision\""
          comparison      = "COMPARISON_GT"
          threshold_value = 2000
          duration        = "300s"
        }
      }

      notification_channels = [google_monitoring_notification_channel.pagerduty.name]
    }

Alert Notification Channels

| Channel | Purpose | Config |
|---------|---------|--------|
| PagerDuty | On-call paging | Integration key |
| Slack | Team awareness | Webhook URL |
| Email | Backup notification | Email addresses |

Uptime Monitoring

Uptime Checks

| Check | Target | Frequency |
|-------|--------|-----------|
| API Health | api.olympuscloud.ai/health | 1 min |
| Platform Portal | portal.olympuscloud.ai | 1 min |
| Status Page | status.olympuscloud.ai | 1 min |
| Edge Health | edge.olympuscloud.ai/health | 1 min |

Creating Uptime Checks

```hcl
# Terraform example
resource "google_monitoring_uptime_check_config" "api_health" {
  display_name = "API Health Check"
  timeout      = "10s"
  period       = "60s"

  http_check {
    path         = "/health"
    port         = 443
    use_ssl      = true
    validate_ssl = true
  }

  monitored_resource {
    type = "uptime_url"
    labels = {
      host = "api.olympuscloud.ai"
    }
  }
}
```

SLOs and Error Budgets

Service Level Objectives

| Service | SLI | SLO Target |
|---------|-----|------------|
| API Gateway | Availability | 99.9% |
| API Gateway | Latency p99 | < 500ms |
| Order Service | Success rate | 99.95% |
| Voice AI | Response time | < 2s |

Error Budget

Error Budget = 100% - SLO Target

For 99.9% availability SLO:
- Error budget: 0.1%
- Monthly budget: 43.2 minutes downtime
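The same arithmetic applies to any SLO target; a small helper (illustrative):

```python
def error_budget_minutes(slo_percent: float, days: int = 30) -> float:
    """Allowed downtime per window: (100% - SLO) of the window, in minutes.

    For a 99.9% SLO over 30 days: 30 * 24 * 60 * 0.001 = 43.2 minutes.
    """
    budget_fraction = (100.0 - slo_percent) / 100.0
    return days * 24 * 60 * budget_fraction
```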

SLO Dashboard

┌─────────────────────────────────────────────────────────────────┐
│ SLO Dashboard Rolling 30d │
├─────────────────────────────────────────────────────────────────┤
│ │
│ API Gateway Availability │
│ ├── SLO Target: 99.9% │
│ ├── Current: 99.95% │
│ ├── Error Budget Used: 50% │
│ └── Status: ✅ Healthy │
│ │
│ API Gateway Latency │
│ ├── SLO Target: p99 < 500ms │
│ ├── Current: p99 = 245ms │
│ ├── Error Budget Used: 20% │
│ └── Status: ✅ Healthy │
│ │
│ Order Service Success Rate │
│ ├── SLO Target: 99.95% │
│ ├── Current: 99.98% │
│ ├── Error Budget Used: 40% │
│ └── Status: ✅ Healthy │
│ │
└─────────────────────────────────────────────────────────────────┘

Troubleshooting Observability

Missing Metrics

| Issue | Check | Fix |
|-------|-------|-----|
| No Cloud Run metrics | Service deployed? | Deploy service |
| No custom metrics | Permissions? | Grant monitoring.metricWriter |
| Delayed metrics | Propagation lag | Wait 2-3 minutes |

Missing Logs

| Issue | Check | Fix |
|-------|-------|-----|
| No logs appearing | Container running? | Check deployment |
| Logs not searchable | Index delay | Wait 1-2 minutes |
| Logs missing fields | JSON parsing | Fix log format |
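"Fix log format" usually means emitting one JSON object per line on stdout, which Cloud Logging parses into jsonPayload (a top-level severity field is mapped to the entry's severity); a sketch, with hypothetical field names:

```python
import json

def structured_entry(message: str, severity: str = "INFO", **fields) -> str:
    """Render one log line as JSON so Cloud Logging fills jsonPayload."""
    entry = {"message": message, "severity": severity, **fields}
    return json.dumps(entry)

# In the service, print one entry per line:
#   print(structured_entry("payment failed", severity="ERROR",
#                          error_type="PaymentFailed"))
```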

Alert Not Firing

| Issue | Check | Fix |
|-------|-------|-----|
| Condition not met | View metric | Adjust threshold |
| Notification failed | Channel config | Test channel |
| Silenced | Snooze active? | Unsnooze policy |