# Monitoring & Alerts Runbook
Complete guide to observability, monitoring, and alerting for Olympus Cloud.
## Overview
Olympus Cloud uses a multi-layered observability stack to ensure visibility into system health, performance, and user experience.
### Observability Stack

```
┌───────────────────────────────────────────────────────┐
│                    Data Collection                    │
├───────────────────────────────────────────────────────┤
│   Metrics   │    Logs     │   Traces    │   Events    │
│  (metrics)  │  (stdout)   │  (OpenTel)  │  (Pub/Sub)  │
└───────────────────────────────────────────────────────┘
                            │
                            ▼
┌───────────────────────────────────────────────────────┐
│                 Processing & Storage                  │
├───────────────────────────────────────────────────────┤
│ Cloud Monitoring  │  Cloud Logging  │   Cloud Trace   │
│    (metrics)      │     (logs)      │    (traces)     │
└───────────────────────────────────────────────────────┘
                            │
                            ▼
┌───────────────────────────────────────────────────────┐
│               Visualization & Alerting                │
├───────────────────────────────────────────────────────┤
│ Dashboards │ Alert Policies │ PagerDuty │    Slack    │
└───────────────────────────────────────────────────────┘
```
## Dashboards

### Main Dashboards
| Dashboard | Purpose | Link |
|---|---|---|
| Platform Health | Overall system status | View |
| Services | Per-service metrics | View |
| Database | Spanner + SQL metrics | View |
| Edge | Cloudflare metrics | Cloudflare |
| AI Services | Voice AI, LLM metrics | View |
### Platform Health Dashboard

```
Platform Health                                      Last 1 hour

  Availability         Error Rate         Latency p99
 ┌────────────┐      ┌────────────┐     ┌────────────┐
 │   99.95%   │      │   0.12%    │     │   245ms    │
 │     ✅     │      │     ✅     │     │     ✅     │
 └────────────┘      └────────────┘     └────────────┘

 Request Rate (req/s)
 500─┤              ▄▄▄▄▄▄▄▄
 400─┤          ▄▄▄▄████████▄▄▄▄
 300─┤      ▄▄▄▄████████████████▄▄▄▄
 200─┤  ▄▄▄▄████████████████████████▄▄▄▄
 100─┤
   0─┴──────────────────────────────────────

 Service Status
 ├── api-gateway        ✅ Healthy
 ├── platform-service   ✅ Healthy
 ├── order-service      ✅ Healthy
 ├── user-service       ✅ Healthy
 └── ai-service         ✅ Healthy
```
### Creating Custom Dashboards

1. Navigate to Cloud Monitoring > Dashboards
2. Click "Create Dashboard"
3. Add widgets:
   - Line charts for time series
   - Gauges for current values
   - Tables for top-N lists
   - Text for documentation
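Dashboards can also be created programmatically. Below is a minimal sketch using the `google-cloud-monitoring-dashboards` client; the dashboard name, widget title, and project ID are illustrative, not existing resources:

```python
# Sketch: create a one-widget dashboard via the Dashboards API.
from google.cloud import monitoring_dashboard_v1

client = monitoring_dashboard_v1.DashboardsServiceClient()

dashboard = monitoring_dashboard_v1.Dashboard(
    display_name="Order Service Overview",  # illustrative name
    grid_layout=monitoring_dashboard_v1.GridLayout(
        columns=2,
        widgets=[
            monitoring_dashboard_v1.Widget(
                title="Request latency",
                xy_chart=monitoring_dashboard_v1.XyChart(
                    data_sets=[
                        monitoring_dashboard_v1.XyChart.DataSet(
                            time_series_query=monitoring_dashboard_v1.TimeSeriesQuery(
                                time_series_filter=monitoring_dashboard_v1.TimeSeriesFilter(
                                    filter='metric.type="run.googleapis.com/request_latencies"'
                                )
                            )
                        )
                    ]
                ),
            )
        ],
    ),
)

client.create_dashboard(
    request={"parent": "projects/olympuscloud-prod", "dashboard": dashboard}
)
```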
## Metrics

### Key Metrics to Monitor

#### Service Metrics

| Metric | Description | SLO |
|---|---|---|
| `request_count` | Total requests | - |
| `request_latency_ms` | Request duration | p99 < 500ms |
| `error_rate` | Errors / requests | < 0.1% |
| `instance_count` | Running instances | - |
| `cpu_utilization` | CPU usage | < 80% |
| `memory_utilization` | Memory usage | < 90% |
#### Database Metrics

| Metric | Description | SLO |
|---|---|---|
| `spanner/cpu_utilization` | Spanner CPU | < 65% |
| `spanner/storage_used` | Storage bytes | < 80% of capacity |
| `clickhouse/cpu_usage` | ClickHouse Cloud CPU | < 80% |
| `clickhouse/active_queries` | Active queries | < 100 concurrent |
#### AI Service Metrics

| Metric | Description | SLO |
|---|---|---|
| `voice_ai/transcription_accuracy` | STT accuracy | > 95% |
| `voice_ai/response_latency_ms` | Voice response time | < 2000ms |
| `llm/tokens_used` | LLM token consumption | - |
| `llm/cost_usd` | LLM cost tracking | - |
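Any of these metrics can be spot-checked against its SLO outside the dashboards by reading the time series directly. A minimal sketch with the `monitoring_v3` API, assuming the production project ID and the Cloud Run latency metric:

```python
# Sketch: read the last hour of p99 request latency and flag SLO breaches.
import time

from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
now = int(time.time())

results = client.list_time_series(
    request={
        "name": "projects/olympuscloud-prod",  # assumed project
        "filter": (
            'metric.type="run.googleapis.com/request_latencies" '
            'resource.labels.service_name="order-service"'
        ),
        "interval": monitoring_v3.TimeInterval(
            {"end_time": {"seconds": now}, "start_time": {"seconds": now - 3600}}
        ),
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
        # Align the latency distribution to its 99th percentile per 5-minute window
        "aggregation": monitoring_v3.Aggregation(
            {
                "alignment_period": {"seconds": 300},
                "per_series_aligner": monitoring_v3.Aggregation.Aligner.ALIGN_PERCENTILE_99,
            }
        ),
    }
)

for series in results:
    for point in series.points:
        if point.value.double_value > 500:  # p99 SLO from the table above
            print("p99 over SLO:", point.value.double_value, "ms")
```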
### Custom Metrics

Creating custom metrics in application code:

```python
# Python example (google-cloud-monitoring)
import time

from google.cloud import monitoring_v3

PROJECT_ID = "olympuscloud-prod"  # set per environment

def record_custom_metric(value: float) -> None:
    client = monitoring_v3.MetricServiceClient()
    project_name = f"projects/{PROJECT_ID}"

    series = monitoring_v3.TimeSeries()
    series.metric.type = "custom.googleapis.com/order/processing_time_ms"
    series.resource.type = "cloud_run_revision"
    # Production code should also set series.resource.labels
    # (service_name, revision_name, location, configuration_name).

    # A point needs both an interval and a value; proto-plus accepts dicts
    point = monitoring_v3.Point(
        {
            "interval": {"end_time": {"seconds": int(time.time())}},
            "value": {"double_value": value},
        }
    )
    series.points = [point]
    client.create_time_series(name=project_name, time_series=[series])
```
## Logging

### Log Locations
| Source | Log Name | Retention |
|---|---|---|
| Cloud Run | run.googleapis.com/requests | 30 days |
| Application | stdout/stderr | 30 days |
| Audit | cloudaudit.googleapis.com | 400 days |
| VPC Flow | compute.googleapis.com/vpc_flows | 30 days |
### Log Queries

#### View Recent Errors

```
resource.type="cloud_run_revision"
severity>=ERROR
timestamp>="2026-01-18T00:00:00Z"
```

#### Find Specific Request

```
resource.type="cloud_run_revision"
httpRequest.requestUrl=~"orders/12345"
```

#### Search Application Logs

```
resource.type="cloud_run_revision"
resource.labels.service_name="order-service"
jsonPayload.user_id="user-abc123"
```
#### Aggregate Error Counts

The Logs Explorer query language does not support aggregation; use Log Analytics (SQL) for grouped counts:

```sql
-- Log Analytics example; adjust the table to your log bucket
SELECT
  JSON_VALUE(json_payload.error_type) AS error_type,
  COUNT(*) AS errors
FROM `olympuscloud-prod.global._Default._AllLogs`
WHERE severity = 'ERROR'
GROUP BY error_type
ORDER BY errors DESC
```
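The same queries can also be run from code. A short sketch with the `google-cloud-logging` client, assuming the production project ID:

```python
# Sketch: fetch the 20 most recent order-service errors programmatically.
from google.cloud import logging

client = logging.Client(project="olympuscloud-prod")  # assumed project

log_filter = (
    'resource.type="cloud_run_revision" '
    'resource.labels.service_name="order-service" '
    "severity>=ERROR"
)

for entry in client.list_entries(
    filter_=log_filter, order_by=logging.DESCENDING, max_results=20
):
    print(entry.timestamp, entry.severity, entry.payload)
```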
### Log-Based Metrics

Create metrics from logs for alerting:

```hcl
# Terraform example
resource "google_logging_metric" "payment_errors" {
  name   = "payment-errors"
  filter = <<-EOT
    resource.type="cloud_run_revision"
    resource.labels.service_name="order-service"
    jsonPayload.error_type="PaymentFailed"
  EOT

  metric_descriptor {
    metric_kind = "DELTA"
    value_type  = "INT64"
  }
}
```

The resulting metric is queryable and alertable as `logging.googleapis.com/user/payment-errors`.
## Tracing

### Trace Analysis

#### View Traces in Cloud Trace

- Navigate to Cloud Trace
- Filter by:
  - Service name
  - Latency threshold
  - Time range
  - HTTP method/status
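The console filter has an API equivalent. A hedged sketch using the Cloud Trace v1 client, assuming the production project ID:

```python
# Sketch: list recent traces slower than one second via the Trace API.
from google.cloud import trace_v1

client = trace_v1.TraceServiceClient()

traces = client.list_traces(
    request={
        "project_id": "olympuscloud-prod",  # assumed project
        "filter": "latency:1s",  # traces taking at least 1 second
        "view": trace_v1.ListTracesRequest.ViewType.COMPLETE,
    }
)

for t in traces:
    print(t.trace_id, len(t.spans), "spans")
```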
#### Identify Slow Spans

```
# In Cloud Trace, filter:
latency > 1s
service = "order-service"
```
### Trace Sampling
| Environment | Sampling Rate |
|---|---|
| Development | 100% |
| Staging | 50% |
| Production | 10% |
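These rates are configured in the OpenTelemetry SDK at service startup. A minimal sketch for the 10% production rate, assuming SDK defaults otherwise:

```python
# Sketch: sample 10% of new traces while honoring upstream sampling decisions.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# ParentBased defers to the caller's decision for propagated requests;
# TraceIdRatioBased samples 10% of the traces this service starts itself.
trace.set_tracer_provider(
    TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.10)))
)
```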
### Adding Custom Spans

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def process_order(order_id):
    # Parent span for the whole operation
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)

        # Child spans make each stage visible in the trace waterfall
        with tracer.start_as_current_span("validate_order"):
            validate(order_id)  # placeholder business logic
        with tracer.start_as_current_span("charge_payment"):
            charge(order_id)  # placeholder business logic
```
## Alert Policies

### Current Alert Policies
| Alert | Condition | Severity |
|---|---|---|
| High Error Rate | > 5% errors for 5 min | P1 |
| High Latency | p99 > 2s for 5 min | P2 |
| Service Down | Uptime check fails | P1 |
| Database CPU High | > 80% for 10 min | P2 |
| Spanner CPU High | > 65% for 10 min | P2 |
| Memory Exhaustion | > 95% for 5 min | P1 |
| Certificate Expiry | < 14 days | P3 |
### Alert Policy Structure

```yaml
# Example alert policy
displayName: "High Error Rate - API Gateway"
documentation:
  content: |
    ## Error Rate Alert

    The API gateway error rate has exceeded 5% for 5 minutes.

    **Runbook**: /docs/operations/runbooks/incident-response

    **Quick Actions**:
    1. Check recent deployments
    2. View error logs
    3. Consider rollback
conditions:
  - displayName: "Error rate > 5%"
    conditionThreshold:
      # Counts 5xx responses only; a true error percentage needs a ratio
      # condition (e.g. MQL dividing 5xx count by total request count).
      filter: |
        resource.type = "cloud_run_revision" AND
        resource.labels.service_name = "api-gateway" AND
        metric.type = "run.googleapis.com/request_count" AND
        metric.labels.response_code_class = "5xx"
      aggregations:
        - alignmentPeriod: "300s"
          perSeriesAligner: ALIGN_RATE
      comparison: COMPARISON_GT
      thresholdValue: 0.05
      duration: "300s"
notificationChannels:
  - projects/olympuscloud-prod/notificationChannels/pagerduty
  - projects/olympuscloud-prod/notificationChannels/slack
```
### Creating New Alerts

1. **Via Console**
   - Cloud Monitoring > Alerting > Create Policy
   - Define condition, notification, documentation

2. **Via Terraform**

   ```hcl
   resource "google_monitoring_alert_policy" "high_latency" {
     display_name = "High Latency Alert"
     combiner     = "OR"

     conditions {
       display_name = "Latency p99 > 2s"

       condition_threshold {
         # p99 of Cloud Run request latency, aligned per 5-minute window
         filter          = "resource.type = \"cloud_run_revision\" AND metric.type = \"run.googleapis.com/request_latencies\""
         comparison      = "COMPARISON_GT"
         threshold_value = 2000
         duration        = "300s"

         aggregations {
           alignment_period   = "300s"
           per_series_aligner = "ALIGN_PERCENTILE_99"
         }
       }
     }

     notification_channels = [google_monitoring_notification_channel.pagerduty.name]
   }
   ```
### Alert Notification Channels

| Channel | Purpose | Config |
|---|---|---|
| PagerDuty | On-call paging | Integration key |
| Slack | Team awareness | Webhook URL |
| Email | Backup notification | Email addresses |
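Channels can be provisioned via the API as well as the console. A hedged sketch creating the backup email channel; the address is illustrative:

```python
# Sketch: create an email notification channel programmatically.
from google.cloud import monitoring_v3

client = monitoring_v3.NotificationChannelServiceClient()

channel = monitoring_v3.NotificationChannel(
    type_="email",  # proto field `type`, exposed as `type_` in Python
    display_name="On-call backup email",
    labels={"email_address": "oncall@olympuscloud.ai"},  # assumed address
)

client.create_notification_channel(
    name="projects/olympuscloud-prod", notification_channel=channel
)
```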
## Uptime Monitoring

### Uptime Checks
| Check | Target | Frequency |
|---|---|---|
| API Health | api.olympuscloud.ai/health | 1 min |
| Platform Portal | portal.olympuscloud.ai | 1 min |
| Status Page | status.olympuscloud.ai | 1 min |
| Edge Health | edge.olympuscloud.ai/health | 1 min |
### Creating Uptime Checks

```hcl
# Terraform example
resource "google_monitoring_uptime_check_config" "api_health" {
  display_name = "API Health Check"
  timeout      = "10s"
  period       = "60s"

  http_check {
    path         = "/health"
    port         = 443
    use_ssl      = true
    validate_ssl = true
  }

  monitored_resource {
    type = "uptime_url"
    labels = {
      host = "api.olympuscloud.ai"
    }
  }
}
```
## SLOs and Error Budgets

### Service Level Objectives
| Service | SLI | SLO Target |
|---|---|---|
| API Gateway | Availability | 99.9% |
| API Gateway | Latency p99 | < 500ms |
| Order Service | Success rate | 99.95% |
| Voice AI | Response time | < 2s |
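SLOs like these can be defined as Cloud Monitoring resources. A hedged sketch for the API Gateway availability target, assuming the service is registered as `api-gateway` in Service Monitoring under the production project:

```python
# Sketch: define a 99.9% availability SLO over a rolling 30-day window.
from google.cloud import monitoring_v3

client = monitoring_v3.ServiceMonitoringServiceClient()

slo = monitoring_v3.ServiceLevelObjective(
    display_name="API Gateway availability 99.9% (rolling 30d)",
    goal=0.999,
    rolling_period={"seconds": 30 * 24 * 3600},
    service_level_indicator=monitoring_v3.ServiceLevelIndicator(
        basic_sli=monitoring_v3.BasicSli(
            availability=monitoring_v3.BasicSli.AvailabilityCriteria()
        )
    ),
)

client.create_service_level_objective(
    request={
        "parent": "projects/olympuscloud-prod/services/api-gateway",  # assumed
        "service_level_objective": slo,
    }
)
```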
### Error Budget

Error Budget = 100% - SLO Target

For a 99.9% availability SLO:

- Error budget: 0.1%
- Monthly budget: 43.2 minutes of downtime (0.1% of the 43,200 minutes in a 30-day month)
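The same arithmetic applies to any target and window; a minimal worked example:

```python
# Worked example: downtime allowed by an availability SLO over a 30-day window.
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    window_minutes = window_days * 24 * 60  # 43,200 for 30 days
    return (1 - slo_target) * window_minutes

print(error_budget_minutes(0.999))   # 43.2  (the 99.9% SLO above)
print(error_budget_minutes(0.9995))  # 21.6  (Order Service at 99.95%)
```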
### SLO Dashboard

```
SLO Dashboard                                        Rolling 30d

API Gateway Availability
├── SLO Target: 99.9%
├── Current: 99.95%
├── Error Budget Used: 50%
└── Status: ✅ Healthy

API Gateway Latency
├── SLO Target: p99 < 500ms
├── Current: p99 = 245ms
├── Error Budget Used: 20%
└── Status: ✅ Healthy

Order Service Success Rate
├── SLO Target: 99.95%
├── Current: 99.98%
├── Error Budget Used: 40%
└── Status: ✅ Healthy
```
## Troubleshooting Observability

### Missing Metrics
| Issue | Check | Fix |
|---|---|---|
| No Cloud Run metrics | Service deployed? | Deploy service |
| No custom metrics | Permissions? | Grant monitoring.metricWriter |
| Delayed metrics | Propagation lag | Wait 2-3 minutes |
### Missing Logs
| Issue | Check | Fix |
|---|---|---|
| No logs appearing | Container running? | Check deployment |
| Logs not searchable | Index delay | Wait 1-2 minutes |
| Logs missing fields | JSON parsing | Fix log format |
### Alert Not Firing
| Issue | Check | Fix |
|---|---|---|
| Condition not met | View metric | Adjust threshold |
| Notification failed | Channel config | Test channel |
| Silenced | Snooze active? | Unsnooze policy |
## Related Documentation
- Incident Response - Using monitoring during incidents
- On-Call Guide - Alert triage for on-call
- Scaling - Scaling based on metrics