# Monitoring & Alerts Runbook
Complete guide to observability, monitoring, and alerting for Olympus Cloud.
## Overview
Olympus Cloud uses a multi-layered observability stack to ensure visibility into system health, performance, and user experience.
### Observability Stack

```
┌───────────────────────────────────────────────────────┐
│                    Data Collection                    │
├───────────────────────────────────────────────────────┤
│   Metrics   │    Logs     │   Traces    │   Events    │
│  (metrics)  │  (stdout)   │  (OpenTel)  │  (Pub/Sub)  │
└───────────────────────────────────────────────────────┘
                            │
                            ▼
┌───────────────────────────────────────────────────────┐
│                 Processing & Storage                  │
├───────────────────────────────────────────────────────┤
│ Cloud Monitoring  │  Cloud Logging  │   Cloud Trace   │
│    (metrics)      │     (logs)      │    (traces)     │
└───────────────────────────────────────────────────────┘
                            │
                            ▼
┌───────────────────────────────────────────────────────┐
│               Visualization & Alerting                │
├───────────────────────────────────────────────────────┤
│ Dashboards │ Alert Policies │ PagerDuty │    Slack    │
└───────────────────────────────────────────────────────┘
```
## Dashboards

### Main Dashboards
| Dashboard | Purpose | Link |
|---|---|---|
| Platform Health | Overall system status | View |
| Services | Per-service metrics | View |
| Database | Spanner + SQL metrics | View |
| Edge | Cloudflare metrics | Cloudflare |
| AI Services | Voice AI, LLM metrics | View |
### Platform Health Dashboard

```
Platform Health                                      Last 1 hour

  Availability         Error Rate         Latency p99
 ┌────────────┐      ┌────────────┐     ┌────────────┐
 │   99.95%   │      │   0.12%    │     │   245ms    │
 │     ✅     │      │     ✅     │     │     ✅     │
 └────────────┘      └────────────┘     └────────────┘

 Request Rate (req/s)
 500─┤              ▄▄▄▄▄▄▄▄
 400─┤          ▄▄▄▄████████▄▄▄▄
 300─┤      ▄▄▄▄████████████████▄▄▄▄
 200─┤  ▄▄▄▄████████████████████████▄▄▄▄
 100─┤
   0─┴──────────────────────────────────────

 Service Status
 ├── api-gateway        ✅ Healthy
 ├── platform-service   ✅ Healthy
 ├── order-service      ✅ Healthy
 ├── user-service       ✅ Healthy
 └── ai-service         ✅ Healthy
```
### Creating Custom Dashboards

1. Navigate to Cloud Monitoring > Dashboards
2. Click "Create Dashboard"
3. Add widgets:
   - Line charts for time series
   - Gauges for current values
   - Tables for top-N lists
   - Text for documentation
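Dashboards can also be created programmatically. Below is a minimal sketch using the `google-cloud-monitoring-dashboards` client; the dashboard name, widget title, and project ID are illustrative, not existing resources:

```python
# Sketch: create a one-widget dashboard via the Dashboards API.
from google.cloud import monitoring_dashboard_v1

client = monitoring_dashboard_v1.DashboardsServiceClient()

dashboard = monitoring_dashboard_v1.Dashboard(
    display_name="Order Service Overview",  # illustrative name
    grid_layout=monitoring_dashboard_v1.GridLayout(
        columns=2,
        widgets=[
            monitoring_dashboard_v1.Widget(
                title="Request latency",
                xy_chart=monitoring_dashboard_v1.XyChart(
                    data_sets=[
                        monitoring_dashboard_v1.XyChart.DataSet(
                            time_series_query=monitoring_dashboard_v1.TimeSeriesQuery(
                                time_series_filter=monitoring_dashboard_v1.TimeSeriesFilter(
                                    filter='metric.type="run.googleapis.com/request_latencies"'
                                )
                            )
                        )
                    ]
                ),
            )
        ],
    ),
)

client.create_dashboard(
    request={"parent": "projects/olympuscloud-prod", "dashboard": dashboard}
)
```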
## Metrics

### Key Metrics to Monitor

#### Service Metrics

| Metric | Description | SLO |
|---|---|---|
| `request_count` | Total requests | - |
| `request_latency_ms` | Request duration | p99 < 500ms |
| `error_rate` | Errors / requests | < 0.1% |
| `instance_count` | Running instances | - |
| `cpu_utilization` | CPU usage | < 80% |
| `memory_utilization` | Memory usage | < 90% |
#### Database Metrics

| Metric | Description | SLO |
|---|---|---|
| `spanner/cpu_utilization` | Spanner CPU | < 65% |
| `spanner/storage_used` | Storage bytes | < 80% of capacity |
| `clickhouse/cpu_usage` | ClickHouse Cloud CPU | < 80% |
| `clickhouse/active_queries` | Active queries | < 100 concurrent |
#### AI Service Metrics

| Metric | Description | SLO |
|---|---|---|
| `voice_ai/transcription_accuracy` | STT accuracy | > 95% |
| `voice_ai/response_latency_ms` | Voice response time | < 2000ms |
| `llm/tokens_used` | LLM token consumption | - |
| `llm/cost_usd` | LLM cost tracking | - |
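Any of these metrics can be spot-checked against its SLO outside the dashboards by reading the time series directly. A minimal sketch with the `monitoring_v3` API, assuming the production project ID and the Cloud Run latency metric:

```python
# Sketch: read the last hour of p99 request latency and flag SLO breaches.
import time

from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
now = int(time.time())

results = client.list_time_series(
    request={
        "name": "projects/olympuscloud-prod",  # assumed project
        "filter": (
            'metric.type="run.googleapis.com/request_latencies" '
            'resource.labels.service_name="order-service"'
        ),
        "interval": monitoring_v3.TimeInterval(
            {"end_time": {"seconds": now}, "start_time": {"seconds": now - 3600}}
        ),
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
        # Align the latency distribution to its 99th percentile per 5-minute window
        "aggregation": monitoring_v3.Aggregation(
            {
                "alignment_period": {"seconds": 300},
                "per_series_aligner": monitoring_v3.Aggregation.Aligner.ALIGN_PERCENTILE_99,
            }
        ),
    }
)

for series in results:
    for point in series.points:
        if point.value.double_value > 500:  # p99 SLO from the table above
            print("p99 over SLO:", point.value.double_value, "ms")
```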
### Custom Metrics

Creating custom metrics in application code:

```python
# Python example (google-cloud-monitoring)
import time

from google.cloud import monitoring_v3

PROJECT_ID = "olympuscloud-prod"  # set per environment

def record_custom_metric(value: float) -> None:
    client = monitoring_v3.MetricServiceClient()
    project_name = f"projects/{PROJECT_ID}"

    series = monitoring_v3.TimeSeries()
    series.metric.type = "custom.googleapis.com/order/processing_time_ms"
    series.resource.type = "cloud_run_revision"
    # Production code should also set series.resource.labels
    # (service_name, revision_name, location, configuration_name).

    # A point needs both an interval and a value; proto-plus accepts dicts
    point = monitoring_v3.Point(
        {
            "interval": {"end_time": {"seconds": int(time.time())}},
            "value": {"double_value": value},
        }
    )
    series.points = [point]
    client.create_time_series(name=project_name, time_series=[series])
```
## Logging

### Log Locations
| Source | Log Name | Retention |
|---|---|---|
| Cloud Run | run.googleapis.com/requests | 30 days |
| Application | stdout/stderr | 30 days |
| Audit | cloudaudit.googleapis.com | 400 days |
| VPC Flow | compute.googleapis.com/vpc_flows | 30 days |
### Log Queries

#### View Recent Errors

```
resource.type="cloud_run_revision"
severity>=ERROR
timestamp>="2026-01-18T00:00:00Z"
```

#### Find Specific Request

```
resource.type="cloud_run_revision"
httpRequest.requestUrl=~"orders/12345"
```

#### Search Application Logs

```
resource.type="cloud_run_revision"
resource.labels.service_name="order-service"
jsonPayload.user_id="user-abc123"
```
#### Aggregate Error Counts

The Logs Explorer query language does not support aggregation; use Log Analytics (SQL) for grouped counts:

```sql
-- Log Analytics example; adjust the table to your log bucket
SELECT
  JSON_VALUE(json_payload.error_type) AS error_type,
  COUNT(*) AS errors
FROM `olympuscloud-prod.global._Default._AllLogs`
WHERE severity = 'ERROR'
GROUP BY error_type
ORDER BY errors DESC
```
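The same queries can also be run from code. A short sketch with the `google-cloud-logging` client, assuming the production project ID:

```python
# Sketch: fetch the 20 most recent order-service errors programmatically.
from google.cloud import logging

client = logging.Client(project="olympuscloud-prod")  # assumed project

log_filter = (
    'resource.type="cloud_run_revision" '
    'resource.labels.service_name="order-service" '
    "severity>=ERROR"
)

for entry in client.list_entries(
    filter_=log_filter, order_by=logging.DESCENDING, max_results=20
):
    print(entry.timestamp, entry.severity, entry.payload)
```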
### Log-Based Metrics

Create metrics from logs for alerting:

```hcl
# Terraform example
resource "google_logging_metric" "payment_errors" {
  name   = "payment-errors"
  filter = <<-EOT
    resource.type="cloud_run_revision"
    resource.labels.service_name="order-service"
    jsonPayload.error_type="PaymentFailed"
  EOT

  metric_descriptor {
    metric_kind = "DELTA"
    value_type  = "INT64"
  }
}
```

The resulting metric is queryable and alertable as `logging.googleapis.com/user/payment-errors`.
## Tracing

### Trace Analysis

#### View Traces in Cloud Trace

- Navigate to Cloud Trace
- Filter by:
  - Service name
  - Latency threshold
  - Time range
  - HTTP method/status
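The console filter has an API equivalent. A hedged sketch using the Cloud Trace v1 client, assuming the production project ID:

```python
# Sketch: list recent traces slower than one second via the Trace API.
from google.cloud import trace_v1

client = trace_v1.TraceServiceClient()

traces = client.list_traces(
    request={
        "project_id": "olympuscloud-prod",  # assumed project
        "filter": "latency:1s",  # traces taking at least 1 second
        "view": trace_v1.ListTracesRequest.ViewType.COMPLETE,
    }
)

for t in traces:
    print(t.trace_id, len(t.spans), "spans")
```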
#### Identify Slow Spans

```
# In Cloud Trace, filter:
latency > 1s
service = "order-service"
```
### Trace Sampling
| Environment | Sampling Rate |
|---|---|
| Development | 100% |
| Staging | 50% |
| Production | 10% |
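These rates are configured in the OpenTelemetry SDK at service startup. A minimal sketch for the 10% production rate, assuming SDK defaults otherwise:

```python
# Sketch: sample 10% of new traces while honoring upstream sampling decisions.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# ParentBased defers to the caller's decision for propagated requests;
# TraceIdRatioBased samples 10% of the traces this service starts itself.
trace.set_tracer_provider(
    TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.10)))
)
```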
### Adding Custom Spans

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def process_order(order_id):
    # Parent span for the whole operation
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)

        # Child spans make each stage visible in the trace waterfall
        with tracer.start_as_current_span("validate_order"):
            validate(order_id)  # placeholder business logic
        with tracer.start_as_current_span("charge_payment"):
            charge(order_id)  # placeholder business logic
```
## Alert Policies

### Current Alert Policies
| Alert | Condition | Severity |
|---|---|---|
| High Error Rate | > 5% errors for 5 min | P1 |
| High Latency | p99 > 2s for 5 min | P2 |
| Service Down | Uptime check fails | P1 |
| Database CPU High | > 80% for 10 min | P2 |
| Spanner CPU High | > 65% for 10 min | P2 |
| Memory Exhaustion | > 95% for 5 min | P1 |
| Certificate Expiry | < 14 days | P3 |
### Alert Policy Structure

```yaml
# Example alert policy
displayName: "High Error Rate - API Gateway"
documentation:
  content: |
    ## Error Rate Alert

    The API gateway error rate has exceeded 5% for 5 minutes.

    **Runbook**: /docs/operations/runbooks/incident-response

    **Quick Actions**:
    1. Check recent deployments
    2. View error logs
    3. Consider rollback
conditions:
  - displayName: "Error rate > 5%"
    conditionThreshold:
      # Counts 5xx responses only; a true error percentage needs a ratio
      # condition (e.g. MQL dividing 5xx count by total request count).
      filter: |
        resource.type = "cloud_run_revision" AND
        resource.labels.service_name = "api-gateway" AND
        metric.type = "run.googleapis.com/request_count" AND
        metric.labels.response_code_class = "5xx"
      aggregations:
        - alignmentPeriod: "300s"
          perSeriesAligner: ALIGN_RATE
      comparison: COMPARISON_GT
      thresholdValue: 0.05
      duration: "300s"
notificationChannels:
  - projects/olympuscloud-prod/notificationChannels/pagerduty
  - projects/olympuscloud-prod/notificationChannels/slack
```
### Creating New Alerts

1. **Via Console**
   - Cloud Monitoring > Alerting > Create Policy
   - Define condition, notification, documentation

2. **Via Terraform**

   ```hcl
   resource "google_monitoring_alert_policy" "high_latency" {
     display_name = "High Latency Alert"
     combiner     = "OR"

     conditions {
       display_name = "Latency p99 > 2s"

       condition_threshold {
         # p99 of Cloud Run request latency, aligned per 5-minute window
         filter          = "resource.type = \"cloud_run_revision\" AND metric.type = \"run.googleapis.com/request_latencies\""
         comparison      = "COMPARISON_GT"
         threshold_value = 2000
         duration        = "300s"

         aggregations {
           alignment_period   = "300s"
           per_series_aligner = "ALIGN_PERCENTILE_99"
         }
       }
     }

     notification_channels = [google_monitoring_notification_channel.pagerduty.name]
   }
   ```
### Alert Notification Channels

| Channel | Purpose | Config |
|---|---|---|
| PagerDuty | On-call paging | Integration key |
| Slack | Team awareness | Webhook URL |
| Email | Backup notification | Email addresses |
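Channels can be provisioned via the API as well as the console. A hedged sketch creating the backup email channel; the address is illustrative:

```python
# Sketch: create an email notification channel programmatically.
from google.cloud import monitoring_v3

client = monitoring_v3.NotificationChannelServiceClient()

channel = monitoring_v3.NotificationChannel(
    type_="email",  # proto field `type`, exposed as `type_` in Python
    display_name="On-call backup email",
    labels={"email_address": "oncall@olympuscloud.ai"},  # assumed address
)

client.create_notification_channel(
    name="projects/olympuscloud-prod", notification_channel=channel
)
```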
## Uptime Monitoring

### Uptime Checks
| Check | Target | Frequency |
|---|---|---|
| API Health | api.olympuscloud.ai/health | 1 min |
| Platform Portal | portal.olympuscloud.ai | 1 min |
| Status Page | status.olympuscloud.ai | 1 min |
| Edge Health | edge.olympuscloud.ai/health | 1 min |
### Creating Uptime Checks

```hcl
# Terraform example
resource "google_monitoring_uptime_check_config" "api_health" {
  display_name = "API Health Check"
  timeout      = "10s"
  period       = "60s"

  http_check {
    path         = "/health"
    port         = 443
    use_ssl      = true
    validate_ssl = true
  }

  monitored_resource {
    type = "uptime_url"
    labels = {
      host = "api.olympuscloud.ai"
    }
  }
}
```
## SLOs and Error Budgets

### Service Level Objectives
| Service | SLI | SLO Target |
|---|---|---|
| API Gateway | Availability | 99.9% |
| API Gateway | Latency p99 | < 500ms |
| Order Service | Success rate | 99.95% |
| Voice AI | Response time | < 2s |
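SLOs like these can be defined as Cloud Monitoring resources. A hedged sketch for the API Gateway availability target, assuming the service is registered as `api-gateway` in Service Monitoring under the production project:

```python
# Sketch: define a 99.9% availability SLO over a rolling 30-day window.
from google.cloud import monitoring_v3

client = monitoring_v3.ServiceMonitoringServiceClient()

slo = monitoring_v3.ServiceLevelObjective(
    display_name="API Gateway availability 99.9% (rolling 30d)",
    goal=0.999,
    rolling_period={"seconds": 30 * 24 * 3600},
    service_level_indicator=monitoring_v3.ServiceLevelIndicator(
        basic_sli=monitoring_v3.BasicSli(
            availability=monitoring_v3.BasicSli.AvailabilityCriteria()
        )
    ),
)

client.create_service_level_objective(
    request={
        "parent": "projects/olympuscloud-prod/services/api-gateway",  # assumed
        "service_level_objective": slo,
    }
)
```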
### Error Budget

Error Budget = 100% - SLO Target

For a 99.9% availability SLO:

- Error budget: 0.1%
- Monthly budget: 43.2 minutes of downtime (0.1% of the 43,200 minutes in a 30-day month)
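The same arithmetic applies to any target and window; a minimal worked example:

```python
# Worked example: downtime allowed by an availability SLO over a 30-day window.
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    window_minutes = window_days * 24 * 60  # 43,200 for 30 days
    return (1 - slo_target) * window_minutes

print(error_budget_minutes(0.999))   # 43.2  (the 99.9% SLO above)
print(error_budget_minutes(0.9995))  # 21.6  (Order Service at 99.95%)
```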
### SLO Dashboard

```
SLO Dashboard                                        Rolling 30d

API Gateway Availability
├── SLO Target: 99.9%
├── Current: 99.95%
├── Error Budget Used: 50%
└── Status: ✅ Healthy

API Gateway Latency
├── SLO Target: p99 < 500ms
├── Current: p99 = 245ms
├── Error Budget Used: 20%
└── Status: ✅ Healthy

Order Service Success Rate
├── SLO Target: 99.95%
├── Current: 99.98%
├── Error Budget Used: 40%
└── Status: ✅ Healthy
```
## Troubleshooting Observability

### Missing Metrics
| Issue | Check | Fix |
|---|---|---|
| No Cloud Run metrics | Service deployed? | Deploy service |
| No custom metrics | Permissions? | Grant monitoring.metricWriter |
| Delayed metrics | Propagation lag | Wait 2-3 minutes |
### Missing Logs
| Issue | Check | Fix |
|---|---|---|
| No logs appearing | Container running? | Check deployment |
| Logs not searchable | Index delay | Wait 1-2 minutes |
| Logs missing fields | JSON parsing | Fix log format |
### Alert Not Firing
| Issue | Check | Fix |
|---|---|---|
| Condition not met | View metric | Adjust threshold |
| Notification failed | Channel config | Test channel |
| Silenced | Snooze active? | Unsnooze policy |
## Related Documentation
- Incident Response - Using monitoring during incidents
- On-Call Guide - Alert triage for on-call
- Scaling - Scaling based on metrics