Olympus Cloud - Cockpit Operations User Guide

Quick Summary (for RAG)

The Cockpit is the operational command center for platform engineering, DevOps, and SRE teams. Features include: AI Agent Monitoring Dashboard for agent status and performance, Human-in-the-Loop (HITL) approval queue for AI actions, ACP Agent Registry and tool permissions management, AI Safety Controls and incident management, AI Cost Analytics with model usage tracking (ACP AI Router tiers T1-T6), gating and feature flag management, canary deployments, system health monitoring with SLO/SLI dashboards, release management, on-call management, and runbooks. Target users: SRE, DevOps, Platform Engineers, Release Managers.

Version: 1.1
Last Updated: January 2026
Application: Olympus Cloud Cockpit - Operations Center
Access URL: https://dev.cockpit.olympuscloud.ai (development) | https://cockpit.olympuscloud.ai (production)


Table of Contents

  1. Overview
  2. Getting Started
  3. Operations Dashboard
  4. Gating & Feature Management
  5. Canary Deployments
  6. Incident Management
  7. System Health Monitoring
  8. Release Management
  9. Configuration Management
  10. Runbooks & Automation
  11. On-Call Management
  12. Troubleshooting Guide

Overview

The Olympus Cloud Cockpit is the operational command center for the platform engineering and DevOps teams. It provides real-time visibility into system health, deployment management, incident response, and feature rollout control.

Target Users

  • Platform Engineers: System operations and maintenance
  • DevOps Engineers: Deployments and infrastructure
  • SRE Team: Reliability and incident response
  • Release Managers: Feature rollouts and canaries
  • On-Call Engineers: Incident triage and resolution

Key Features

  • Real-time system health monitoring
  • Canary deployment wizard
  • Feature flag management
  • Incident management and response
  • Automated runbooks
  • On-call scheduling and escalation
  • Performance analytics

Getting Started

Access Requirements

Role | Access Level | Capabilities
Viewer | Read-only | View dashboards, logs
Operator | Standard | Manage features, run runbooks
Admin | Full | All operations, configurations
Super Admin | Elevated | Emergency controls, secrets

Logging In

  1. Navigate to the Cockpit URL
  2. Authenticate with SSO
  3. Complete MFA verification
  4. Select operational context

First-Time Setup

  1. Configure notification preferences
  2. Join on-call rotation (if applicable)
  3. Review runbook library
  4. Set up personal dashboards

Operations Dashboard

Overview Panel

The main dashboard displays:

System Status

┌──────────────────────────────────────────┐
│ SYSTEM STATUS: HEALTHY                   │
├──────────────────────────────────────────┤
│ ● API Gateway     : Healthy (p99: 45ms)  │
│ ● Auth Service    : Healthy (p99: 23ms)  │
│ ● Order Service   : Healthy (p99: 67ms)  │
│ ● Payment Service : Healthy (p99: 89ms)  │
│ ● Database        : Healthy (CPU: 34%)   │
│ ● Cache           : Healthy (Hit: 94%)   │
└──────────────────────────────────────────┘

Active Incidents

  • Critical (P1): Immediate attention
  • High (P2): Within 1 hour
  • Medium (P3): Within 4 hours
  • Low (P4): Best effort

Deployment Activity

  • Active canaries
  • Recent deployments
  • Pending releases
  • Rollback history

Key Metrics

Metric | Current | Target | Status
Uptime | 99.97% | 99.9% |
Error Rate | 0.02% | < 0.1% |
P99 Latency | 234ms | < 500ms |
Active Users | 12,453 | N/A |
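As a worked example of how the Uptime figures above relate to each other, the sketch below converts an availability percentage into a downtime allowance. The 30-day window and the function names are assumptions made for illustration; the Cockpit may calculate its SLO budgets differently.

```python
def downtime_minutes(availability_pct: float, window_days: int = 30) -> float:
    """Minutes of downtime implied by an availability percentage over a window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (100.0 - availability_pct) / 100.0


def budget_consumed(current_pct: float, target_pct: float) -> float:
    """Fraction of the error budget used (1.0 means the budget is exhausted)."""
    allowed = 100.0 - target_pct
    if allowed <= 0:
        raise ValueError("a 100% target leaves no error budget")
    return (100.0 - current_pct) / allowed
```

With the values shown, a 99.9% target allows about 43.2 minutes of downtime per 30 days, and the current 99.97% corresponds to roughly 13 minutes, or about 30% of that budget.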

Quick Actions

  • Create Incident: Start incident workflow
  • Kill Switch: Emergency feature disable
  • Deploy Canary: Start canary deployment
  • Run Runbook: Execute automation

Gating & Feature Management

Gating Overview

The gating system controls feature availability:

Gating Hierarchy

Global Settings
└── Environment (prod, staging, dev)
    └── Tenant Groups
        └── Individual Tenants
            └── User Segments
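One way to read the hierarchy is that a setting at a more specific level overrides the levels above it. The sketch below illustrates that reading only; the precedence rule, the level names, and `resolve_flag` are assumptions, not confirmed Cockpit behavior.

```python
from typing import Optional

# Levels ordered from least to most specific, mirroring the hierarchy above.
LEVELS = ["global", "environment", "tenant_group", "tenant", "user_segment"]


def resolve_flag(overrides: dict[str, Optional[bool]], default: bool = False) -> bool:
    """Return the value set at the most specific level that defines one."""
    for level in reversed(LEVELS):
        value = overrides.get(level)
        if value is not None:
            return value
    # No level sets the flag explicitly, so fall back to the default.
    return default
```

For example, `resolve_flag({"global": False, "tenant": True})` yields `True` because the tenant-level setting is more specific than the global one.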

Feature Flags

Viewing Flags

  1. Navigate to Gating > Feature Flags
  2. Filter by:
    • Status (enabled, disabled, canary)
    • Environment
    • Owner
    • Age

Flag Details

  • Flag key and description
  • Current state per environment
  • Targeting rules
  • Audit history
  • Related incidents

Managing Flags

Enable/Disable

  1. Select flag
  2. Choose environment
  3. Toggle state
  4. Add change note
  5. Confirm

Targeting Rules

{
  "flag": "new_checkout_v2",
  "rules": [
    {
      "name": "Enterprise Tenants",
      "condition": {
        "attribute": "tenant.plan",
        "operator": "equals",
        "value": "enterprise"
      },
      "variation": true,
      "percentage": 100
    },
    {
      "name": "Gradual Rollout",
      "condition": {
        "attribute": "user.id",
        "operator": "percentage"
      },
      "variation": true,
      "percentage": 25
    }
  ],
  "defaultVariation": false
}
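To make the rule semantics concrete, here is a minimal sketch of how a configuration like the one above could be evaluated. It assumes first-match-wins ordering and a stable hash of the flag key and user ID for percentage bucketing; both are illustrative assumptions, and `bucket`, `lookup`, `rule_matches`, and `evaluate` are hypothetical helpers rather than Cockpit APIs.

```python
import hashlib


def bucket(flag: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def lookup(context: dict, dotted: str):
    """Resolve an attribute path such as 'tenant.plan' in a nested dict."""
    value = context
    for part in dotted.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


def rule_matches(condition: dict, context: dict) -> bool:
    """Check a rule's condition; unknown operators never match in this sketch."""
    actual = lookup(context, condition["attribute"])
    if actual is None:
        return False
    if condition["operator"] == "equals":
        return actual == condition["value"]
    if condition["operator"] == "percentage":
        return True  # everyone is eligible; the percentage gate decides
    return False


def evaluate(config: dict, context: dict) -> bool:
    """Return the variation of the first matching rule, else the default."""
    for rule in config["rules"]:
        if not rule_matches(rule["condition"], context):
            continue
        pct = rule.get("percentage", 100)
        user_id = lookup(context, "user.id")
        in_rollout = pct >= 100 or (
            user_id is not None and bucket(config["flag"], str(user_id)) < pct
        )
        if in_rollout:
            return rule["variation"]
    return config["defaultVariation"]
```

Hashing the flag key together with the user ID keeps each user's assignment stable across requests while giving different flags independent rollout populations.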

Kill Switches

Danger

Kill switches immediately disable a feature across all tenants. Always document the reason in an incident report, and monitor downstream impact after activation. Only use kill switches for genuine emergencies.

Emergency controls for critical issues:

  1. Go to Gating > Kill Switches
  2. Locate feature
  3. Click Kill button
  4. Confirm with reason
  5. Monitor impact
  6. Document in incident

Kill Switch Dashboard

  • Active kills
  • Recent activations
  • Auto-recovery settings
  • Impact metrics

Canary Deployments

Canary Wizard

The Canary Wizard guides controlled rollouts:

Best Practice

Start canary deployments at 1% traffic and allow at least 30 minutes of observation before promoting to the next stage. Define clear success criteria upfront so rollback decisions are data-driven, not subjective.

Step 1: Configuration

  • Select feature flag
  • Choose target environment
  • Set initial percentage (1-5% recommended)
  • Define success metrics

Step 2: Health Criteria

success_criteria:
  error_rate:
    threshold: 0.1%
    comparison: less_than
  latency_p99:
    threshold: 500ms
    comparison: less_than
  success_rate:
    threshold: 99.5%
    comparison: greater_than
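The criteria above amount to a set of threshold comparisons. The following sketch shows one way to check observed canary metrics against them; the dictionary layout, the unit handling (percent values and milliseconds as plain numbers), and `failed_criteria` are assumptions for illustration.

```python
import operator

# Thresholds copied from the health criteria above, expressed as plain numbers
# (percentages as percent values, latency in milliseconds).
SUCCESS_CRITERIA = {
    "error_rate": {"threshold": 0.1, "comparison": "less_than"},
    "latency_p99": {"threshold": 500.0, "comparison": "less_than"},
    "success_rate": {"threshold": 99.5, "comparison": "greater_than"},
}

COMPARATORS = {"less_than": operator.lt, "greater_than": operator.gt}


def failed_criteria(metrics: dict[str, float]) -> list[str]:
    """Return the names of criteria the observed metrics do not satisfy."""
    failures = []
    for name, rule in SUCCESS_CRITERIA.items():
        observed = metrics.get(name)
        compare = COMPARATORS[rule["comparison"]]
        # A missing metric counts as a failure so gaps in data never pass silently.
        if observed is None or not compare(observed, rule["threshold"]):
            failures.append(name)
    return failures
```

An empty result means every criterion passed; any returned name is a candidate rollback trigger.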

Step 3: Rollout Plan

Stage | Percentage | Duration | Auto-Promote
1 | 1% | 30 min | Yes
2 | 5% | 1 hour | Yes
3 | 25% | 2 hours | No
4 | 50% | 4 hours | No
5 | 100% | Manual
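The rollout plan can be read as a small state machine. The sketch below encodes the table and decides the next action for a canary at a given stage. It assumes the final 100% stage has no timed window and is completed manually; `Stage`, `PLAN`, `next_action`, and the `await_manual_promotion` label are hypothetical names, not Cockpit identifiers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Stage:
    percentage: int
    duration_min: Optional[int]  # None: no timed observation window
    auto_promote: bool


# Rollout plan from the table above; the final stage is treated as manual.
PLAN = [
    Stage(1, 30, True),
    Stage(5, 60, True),
    Stage(25, 120, False),
    Stage(50, 240, False),
    Stage(100, None, False),
]


def next_action(stage_index: int, elapsed_min: int, healthy: bool) -> str:
    """Decide what to do with a canary sitting at PLAN[stage_index]."""
    stage = PLAN[stage_index]
    if not healthy:
        return "rollback"
    if stage_index == len(PLAN) - 1:
        return "complete"
    if stage.duration_min is not None and elapsed_min < stage.duration_min:
        return "hold"
    return "promote" if stage.auto_promote else "await_manual_promotion"
```

Summing the stage durations in the table gives a minimum of 450 minutes (7.5 hours) of observation before traffic reaches 100%.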

Step 4: Monitoring

  • Real-time metrics dashboard
  • Comparison with baseline
  • Alert thresholds
  • Rollback triggers

Managing Canaries

Active Canaries

View all in-progress canaries:

  • Feature name
  • Current stage
  • Health status
  • Time remaining
  • Actions

Canary Actions

Action | Description
Promote | Move to next stage
Hold | Pause advancement
Rollback | Revert to baseline
Complete | Mark as successful

Rollback Procedures

Automatic Rollback

Triggered when:

  • Error rate exceeds threshold
  • Latency spike detected
  • Success rate drops
  • Manual trigger

Manual Rollback

  1. Go to active canary
  2. Click Rollback
  3. Confirm action
  4. Monitor recovery
  5. Create incident report

Incident Management

Creating Incidents

From Alert

  1. Click alert notification
  2. Review alert details
  3. Click Create Incident
  4. Assign severity
  5. Begin investigation

Manual Creation

  1. Go to Incidents > New
  2. Fill in details:
    • Title
    • Severity (P1-P4)
    • Affected services
    • Description
  3. Assign responders
  4. Start incident

Incident Workflow

Detection → Triage → Investigation → Mitigation → Resolution → Post-mortem

Status Updates

Status | Description
Investigating | Identifying root cause
Identified | Root cause known
Mitigating | Fix in progress
Resolved | Issue fixed
Closed | Post-mortem complete
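If the statuses are treated as an ordered progression, a status change can be validated before it is recorded. The sketch below assumes that only forward moves are allowed (skipping a status is permitted, reopening is not); this policy and `can_transition` are illustrative assumptions, not documented Cockpit rules.

```python
# Statuses in the order listed in the table above.
STATUSES = ["Investigating", "Identified", "Mitigating", "Resolved", "Closed"]


def can_transition(current: str, new: str) -> bool:
    """Allow only forward moves through the status list (assumed policy).

    Raises ValueError if either status name is not in STATUSES.
    """
    return STATUSES.index(new) > STATUSES.index(current)
```

For example, moving from "Investigating" straight to "Mitigating" is accepted, while moving from "Resolved" back to "Investigating" is rejected.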

Incident Response

During Incident

  • Update status regularly
  • Add timeline entries
  • Communicate with stakeholders
  • Execute runbooks
  • Document actions

Communication

  • Slack channel created automatically
  • Status page updates
  • Customer notifications
  • Executive briefings (P1)

Post-Mortems

After resolution:

  1. Schedule post-mortem meeting
  2. Complete incident template:
    • Timeline of events
    • Root cause analysis
    • Impact assessment
    • Action items
  3. Review with team
  4. Track action items

System Health Monitoring

Service Health

Status Indicators

  • 🟢 Healthy: All metrics normal
  • 🟡 Degraded: Minor issues
  • 🔴 Unhealthy: Major issues
  • Unknown: No data

Per-Service Metrics

  • Request rate
  • Error rate
  • Latency (p50, p95, p99)
  • Saturation
  • Active connections

Infrastructure Metrics

Compute

  • CPU utilization
  • Memory usage
  • Network I/O
  • Disk I/O

Database

  • Query latency
  • Connection pool
  • Replication lag
  • Storage usage

Cache

  • Hit rate
  • Memory usage
  • Evictions
  • Connections

Alerting

Alert Severity

Severity | Response | Notification
Critical | Immediate | Page on-call
Warning | Within 15m | Slack + email
Info | Awareness | Slack only

Alert Rules

- name: high_error_rate
  condition: error_rate > 1%
  duration: 5m
  severity: critical
  notification: pagerduty
  runbook: /runbooks/high-error-rate

- name: elevated_latency
  condition: p99_latency > 500ms
  duration: 10m
  severity: warning
  notification: slack
  runbook: /runbooks/latency-investigation
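The `duration` field implies that a rule fires only when its condition has held continuously for that long. A minimal sketch of that check follows; the sample format, the strictly-greater comparison, and `is_firing` are assumptions for illustration, and the real evaluator may sample or window differently.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    timestamp: float  # seconds since an arbitrary epoch
    value: float


def is_firing(samples: list[Sample], threshold: float, duration_s: float) -> bool:
    """True if the most recent unbroken run of samples above the threshold
    spans at least duration_s seconds."""
    if not samples:
        return False
    ordered = sorted(samples, key=lambda s: s.timestamp)
    latest = ordered[-1].timestamp
    breach_start = None
    # Walk backwards from the newest sample to find where the breach began.
    for sample in reversed(ordered):
        if sample.value > threshold:
            breach_start = sample.timestamp
        else:
            break
    return breach_start is not None and latest - breach_start >= duration_s
```

For `high_error_rate`, this would be called with `threshold=1.0` (percent) and `duration_s=300`; a single sample at or below the threshold inside the window resets the run.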

Dashboards

Pre-built Dashboards

  • System Overview
  • Service Deep-dive
  • Database Performance
  • Cache Analytics
  • Edge/CDN Metrics

Custom Dashboards

  1. Click New Dashboard
  2. Add panels
  3. Configure queries
  4. Set refresh rate
  5. Save and share

Release Management

Release Pipeline

Pipeline Stages

Build → Test → Stage → Canary → Production

Release Tracking

Version | Stage | Status | Deployed
v2.4.1 | Production | ✅ Live | Dec 1
v2.4.2 | Canary | 🔄 25% | Dec 1
v2.4.3 | Staging | ✅ Pass | Dec 1
v2.4.4 | Testing | 🔄 Running | Dec 1

Deployment Actions

Deploy to Stage

  1. Select version
  2. Choose environment
  3. Review changes
  4. Confirm deployment
  5. Monitor

Promote to Production

  1. Verify staging success
  2. Create canary
  3. Monitor metrics
  4. Complete rollout

Rollback

Quick Rollback

  1. Go to Releases > Active
  2. Find problematic release
  3. Click Rollback
  4. Select target version
  5. Confirm
  6. Monitor recovery

Configuration Management

Environment Configuration

Viewing Config

  1. Go to Config > Environments
  2. Select environment
  3. View all configuration values

Editing Config

  1. Find configuration key
  2. Click Edit
  3. Enter new value
  4. Add change note
  5. Submit for approval
  6. Deploy change

Secrets Management

Secret Categories

  • API keys
  • Database credentials
  • Third-party tokens
  • Encryption keys

Secret Operations

  • View (masked)
  • Rotate
  • Add new
  • Delete
  • Audit access

Configuration Drift

Detection

  • Compare environments
  • Highlight differences
  • Alert on unexpected changes

Resolution

  • Sync configurations
  • Document exceptions
  • Track compliance
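The detection steps above (compare environments, highlight differences) and the documented-exceptions step can be sketched as a dictionary comparison. The flat key/value representation, the `allowed` exception set, and `config_drift` are assumptions for illustration.

```python
MISSING = "<missing>"  # marker for a key that is absent in one environment


def config_drift(baseline: dict, candidate: dict, allowed: frozenset = frozenset()) -> dict:
    """Return keys whose values differ between two environments.

    Keys listed in `allowed` are documented exceptions and are skipped.
    Each result maps a key to a (baseline_value, candidate_value) pair.
    """
    drift = {}
    for key in sorted(set(baseline) | set(candidate)):
        if key in allowed:
            continue
        left = baseline.get(key, MISSING)
        right = candidate.get(key, MISSING)
        if left != right:
            drift[key] = (left, right)
    return drift
```

Passing intentional differences through `allowed` keeps them out of the report, so any remaining entry is an unexpected change worth alerting on.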

Runbooks & Automation

Runbook Library

Categories

  • Incident Response
  • Deployment
  • Scaling
  • Database Operations
  • Cache Management
  • Security Response

Running Runbooks

Manual Execution

  1. Go to Runbooks
  2. Select runbook
  3. Review steps
  4. Enter parameters
  5. Execute
  6. Monitor progress

Automated Triggers

  • Alert-based
  • Schedule-based
  • Event-based
  • Manual

Creating Runbooks

Runbook Template

name: Restart Service
description: Safely restart a service instance
parameters:
  - name: service_name
    type: string
    required: true
  - name: instance_id
    type: string
    required: true
steps:
  - name: Verify service exists
    action: verify_service
    params:
      service: "{{ service_name }}"
  - name: Drain connections
    action: drain_service
    params:
      instance: "{{ instance_id }}"
      timeout: 30s
  - name: Restart instance
    action: restart_instance
    params:
      instance: "{{ instance_id }}"
  - name: Health check
    action: health_check
    params:
      service: "{{ service_name }}"
      timeout: 60s
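The template uses `{{ name }}` placeholders that are filled from the declared parameters at run time. The sketch below shows one way such substitution and required-parameter checking could work on a runbook that has already been parsed into Python dictionaries; `validate_params` and `render` are hypothetical helpers, and the Cockpit's actual templating engine is not specified here.

```python
import re

# Matches placeholders such as {{ service_name }} with optional inner spaces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def validate_params(declared: list[dict], supplied: dict) -> None:
    """Raise ValueError if any parameter marked required is missing."""
    missing = [
        p["name"] for p in declared
        if p.get("required") and p["name"] not in supplied
    ]
    if missing:
        raise ValueError(f"missing required parameters: {', '.join(missing)}")


def render(value, supplied: dict):
    """Recursively substitute {{ name }} placeholders in strings, lists, and dicts.

    An undeclared placeholder name raises KeyError.
    """
    if isinstance(value, str):
        return PLACEHOLDER.sub(lambda m: str(supplied[m.group(1)]), value)
    if isinstance(value, list):
        return [render(v, supplied) for v in value]
    if isinstance(value, dict):
        return {k: render(v, supplied) for k, v in value.items()}
    return value
```

Validating parameters before rendering means a run fails up front with a clear message instead of partway through a drain or restart step.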

On-Call Management

Schedule Overview

Current On-Call

Role | Engineer | Until
Primary | Jane Smith | Dec 2, 9am
Secondary | John Doe | Dec 2, 9am
Manager | Alice Wong | Dec 4, 9am

Managing Schedules

View Schedule

  • Weekly calendar view
  • Coverage gaps
  • Upcoming shifts
  • Swap requests

Swap Shifts

  1. Find your shift
  2. Click Request Swap
  3. Select replacement
  4. Submit request
  5. Await approval

Escalation Policies

Default Policy

1. Primary On-Call (immediate)
   └── 15 min no response
2. Secondary On-Call
   └── 15 min no response
3. Engineering Manager
   └── 15 min no response
4. VP Engineering
On-Call Tools

During Shift

  • Acknowledge alerts
  • Run diagnostics
  • Execute runbooks
  • Update incidents
  • Escalate issues

Handoff

  • Review open incidents
  • Document ongoing issues
  • Brief incoming engineer

Troubleshooting Guide

Common Issues

High Error Rate

  1. Check recent deployments
  2. Review error logs
  3. Check downstream services
  4. Verify database health
  5. Check for traffic spike

Latency Spike

  1. Identify slow endpoints
  2. Check database queries
  3. Review cache hit rate
  4. Check resource utilization
  5. Analyze traffic patterns

Service Unavailable

  1. Verify service is running
  2. Check load balancer
  3. Review health checks
  4. Check dependencies
  5. Review resource limits

Diagnostic Tools

Log Search

  • Full-text search
  • Filter by service
  • Filter by severity
  • Time range selection

Trace Analysis

  • Request tracing
  • Latency breakdown
  • Error attribution
  • Dependency mapping

Metric Explorer

  • Ad-hoc queries
  • Custom visualizations
  • Anomaly detection
  • Correlation analysis

Emergency Procedures

Critical Incident

  1. Acknowledge alert
  2. Join incident channel
  3. Assess impact
  4. Execute mitigation
  5. Communicate status
  6. Document actions

System-Wide Outage

  1. Activate war room
  2. Assess scope
  3. Coordinate response
  4. Execute recovery
  5. External communication
  6. Full post-mortem

Appendix

Keyboard Shortcuts

Shortcut | Action
Cmd/Ctrl + K | Global search
G + D | Dashboard
G + G | Gating
G + C | Canaries
G + I | Incidents
G + R | Runbooks
Shift + ? | Help

Severity Definitions

Level | Name | Response | Example
P1 | Critical | Immediate | Full outage
P2 | High | 1 hour | Partial outage
P3 | Medium | 4 hours | Degraded service
P4 | Low | 24 hours | Minor issue

Contact


Olympus Cloud Business OS - Operations Cockpit