Olympus Cloud - Cockpit Operations User Guide

Quick Summary (for RAG)

The Cockpit is the operational command center for platform engineering, DevOps, and SRE teams. Features include: AI Agent Monitoring Dashboard for agent status and performance, Human-in-the-Loop (HITL) approval queue for AI actions, ACP Agent Registry and tool permissions management, AI Safety Controls and incident management, AI Cost Analytics with model usage tracking (ACP AI Router tiers T1-T6), gating and feature flag management, canary deployments, system health monitoring with SLO/SLI dashboards, release management, on-call management, and runbooks. Target users: SRE, DevOps, Platform Engineers, Release Managers.

Version: 1.1
Last Updated: January 2026
Application: Olympus Cloud Cockpit - Operations Center
Access URL: https://dev.cockpit.olympuscloud.ai (development) | https://cockpit.olympuscloud.ai (production)

Table of Contents

Overview
Getting Started
Operations Dashboard
Gating & Feature Management
Canary Deployments
Incident Management
System Health Monitoring
Release Management
Configuration Management
Runbooks & Automation
On-Call Management
Troubleshooting Guide

Overview

The Olympus Cloud Cockpit is the operational command center for the platform engineering and DevOps teams. It provides real-time visibility into system health, deployment management, incident response, and feature rollout control.

Target Users

Platform Engineers: System operations and maintenance
DevOps Engineers: Deployments and infrastructure
SRE Team: Reliability and incident response
Release Managers: Feature rollouts and canaries
On-Call Engineers: Incident triage and resolution

Key Features

Real-time system health monitoring
Canary deployment wizard
Feature flag management
Incident management and response
Automated runbooks
On-call scheduling and escalation
Performance analytics

Getting Started

Access Requirements

Role	Access Level	Capabilities
Viewer	Read-only	View dashboards, logs
Operator	Standard	Manage features, run runbooks
Admin	Full	All operations, configurations
Super Admin	Elevated	Emergency controls, secrets

Logging In

Navigate to the Cockpit URL
Authenticate with SSO
Complete MFA verification
Select operational context

First-Time Setup

Configure notification preferences
Join on-call rotation (if applicable)
Review runbook library
Set up personal dashboards

Operations Dashboard

Overview Panel

The main dashboard displays:

System Status

┌────────────────────────────────────────────┐
│  SYSTEM STATUS: HEALTHY                    │
│  ────────────────────────────────────────  │
│  ● API Gateway      : Healthy (p99: 45ms)  │
│  ● Auth Service     : Healthy (p99: 23ms)  │
│  ● Order Service    : Healthy (p99: 67ms)  │
│  ● Payment Service  : Healthy (p99: 89ms)  │
│  ● Database         : Healthy (CPU: 34%)   │
│  ● Cache            : Healthy (Hit: 94%)   │
└────────────────────────────────────────────┘

Active Incidents

Critical (P1): Immediate attention
High (P2): Within 1 hour
Medium (P3): Within 4 hours
Low (P4): Best effort, within 24 hours

Deployment Activity

Active canaries
Recent deployments
Pending releases
Rollback history

Key Metrics

Metric	Current	Target	Status
Uptime	99.97%	99.9%	✅
Error Rate	0.02%	less than 0.1%	✅
P99 Latency	234ms	less than 500ms	✅
Active Users	12,453	N/A	—
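
The Uptime row can also be read as an error budget. The following is a minimal sketch of that arithmetic, assuming a 30-day measurement window; the window length is an assumption, since this guide does not state one.

```python
# Assumed 30-day window (not specified in this guide): 43,200 minutes.
WINDOW_MINUTES = 30 * 24 * 60

def downtime_minutes(uptime_percent: float) -> float:
    """Minutes of downtime implied by an uptime percentage over the window."""
    return (100.0 - uptime_percent) / 100.0 * WINDOW_MINUTES

allowed = downtime_minutes(99.9)    # budget implied by the 99.9% target
used = downtime_minutes(99.97)      # downtime implied by the current 99.97%
budget_consumed = used / allowed    # fraction of the budget already spent

print(f"allowed {allowed:.1f} min, used {used:.1f} min, {budget_consumed:.0%} of budget")
```

Under that assumption, the 99.9% target allows about 43.2 minutes of downtime, and the current 99.97% corresponds to roughly 13 minutes, or about 30% of the budget.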

Quick Actions

Create Incident: Start incident workflow
Kill Switch: Emergency feature disable
Deploy Canary: Start canary deployment
Run Runbook: Execute automation

Gating & Feature Management

Gating Overview

The gating system controls feature availability:

Gating Hierarchy

Global Settings
└── Environment (prod, staging, dev)
    └── Tenant Groups
        └── Individual Tenants
            └── User Segments
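
One way to read the hierarchy is that a setting at a more specific level overrides the levels above it. The sketch below illustrates that reading only; the precedence rule, the scope names, and the `resolve` function are assumptions for illustration, not the Cockpit's documented behavior.

```python
# Scopes ordered from most specific to least specific (hypothetical names).
SCOPES = ["user_segment", "tenant", "tenant_group", "environment", "global"]

def resolve(overrides: dict, default: bool = False) -> bool:
    """Return the value set at the most specific scope, else the default."""
    for scope in SCOPES:
        if scope in overrides:
            return overrides[scope]
    return default

# A feature enabled globally but disabled for one tenant resolves to disabled.
print(resolve({"global": True, "tenant": False}))
```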

Feature Flags

Viewing Flags

Navigate to Gating > Feature Flags
Filter by:
- Status (enabled, disabled, canary)
- Environment
- Owner
- Age

Flag Details

Flag key and description
Current state per environment
Targeting rules
Audit history
Related incidents

Managing Flags

Enable/Disable

Select flag
Choose environment
Toggle state
Add change note
Confirm

Targeting Rules

{
  "flag": "new_checkout_v2",
  "rules": [
    {
      "name": "Enterprise Tenants",
      "condition": {
        "attribute": "tenant.plan",
        "operator": "equals",
        "value": "enterprise"
      },
      "variation": true,
      "percentage": 100
    },
    {
      "name": "Gradual Rollout",
      "condition": {
        "attribute": "user.id",
        "operator": "percentage"
      },
      "variation": true,
      "percentage": 25
    }
  ],
  "defaultVariation": false
}
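
To make the rule semantics concrete, here is a minimal sketch of how a client could evaluate the rules above. It assumes first-match-wins ordering and a SHA-256 based bucketing scheme for the `percentage` operator; both are assumptions, and `bucket` and `evaluate` are hypothetical helpers rather than a Cockpit API.

```python
import hashlib

FLAG = {
    "flag": "new_checkout_v2",
    "rules": [
        {"name": "Enterprise Tenants",
         "condition": {"attribute": "tenant.plan", "operator": "equals",
                       "value": "enterprise"},
         "variation": True, "percentage": 100},
        {"name": "Gradual Rollout",
         "condition": {"attribute": "user.id", "operator": "percentage"},
         "variation": True, "percentage": 25},
    ],
    "defaultVariation": False,
}

def bucket(flag_key: str, value: str) -> int:
    """Map a value to a stable bucket in 0..99 (assumed hashing scheme)."""
    digest = hashlib.sha256(f"{flag_key}:{value}".encode()).hexdigest()
    return int(digest, 16) % 100

def evaluate(flag: dict, context: dict) -> bool:
    """Return the variation of the first matching rule, else the default."""
    for rule in flag["rules"]:
        cond = rule["condition"]
        attr = context.get(cond["attribute"])
        if attr is None:
            continue  # attribute missing from the context: rule cannot match
        if cond["operator"] == "equals":
            # The example rule uses percentage 100, so no sub-sampling here.
            matched = attr == cond["value"]
        elif cond["operator"] == "percentage":
            matched = bucket(flag["flag"], str(attr)) < rule["percentage"]
        else:
            matched = False  # unknown operators never match in this sketch
        if matched:
            return rule["variation"]
    return flag["defaultVariation"]

print(evaluate(FLAG, {"tenant.plan": "enterprise", "user.id": "u-1"}))
```

Hashing the flag key together with the user ID keeps each user in the same bucket across requests, so raising the percentage from 25 to 50 only adds users and never removes any.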

Kill Switches

Danger

Kill switches immediately disable a feature across all tenants. Always document the reason in an incident report, and monitor downstream impact after activation. Only use kill switches for genuine emergencies.

Emergency controls for critical issues:

Go to Gating > Kill Switches
Locate feature
Click Kill button
Confirm with reason
Monitor impact
Document in incident

Kill Switch Dashboard

Active kills
Recent activations
Auto-recovery settings
Impact metrics

Canary Deployments

Canary Wizard

The Canary Wizard guides controlled rollouts:

Best Practice

Start canary deployments at 1% traffic and allow at least 30 minutes of observation before promoting to the next stage. Define clear success criteria upfront so rollback decisions are data-driven, not subjective.

Step 1: Configuration

Select feature flag
Choose target environment
Set initial percentage (1-5% recommended)
Define success metrics

Step 2: Health Criteria

success_criteria:
  error_rate:
    threshold: 0.1%
    comparison: less_than
  latency_p99:
    threshold: 500ms
    comparison: less_than
  success_rate:
    threshold: 99.5%
    comparison: greater_than

Step 3: Rollout Plan

Stage	Percentage	Duration	Auto-Promote
1	1%	30 min	Yes
2	5%	1 hour	Yes
3	25%	2 hours	No
4	50%	4 hours	No
5	100%	—	Manual
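
The rollout plan can be read as a small state machine. The sketch below is one possible interpretation, assuming an unhealthy canary always rolls back and that stages without auto-promote wait for an operator; `next_action` and its return strings are hypothetical, not Cockpit identifiers.

```python
# (traffic percentage, observation minutes, auto-promote) per stage, from the table above.
STAGES = [
    (1, 30, True),
    (5, 60, True),
    (25, 120, False),
    (50, 240, False),
    (100, None, False),  # final stage: no duration, completed manually
]

def next_action(stage: int, minutes_elapsed: int, healthy: bool) -> str:
    """Decide the next step for a 1-based stage number."""
    if not healthy:
        return "rollback"
    _, duration, auto_promote = STAGES[stage - 1]
    if duration is None:
        return "await manual completion"
    if minutes_elapsed < duration:
        return "wait"
    return "promote" if auto_promote else "await manual promotion"

def minimum_minutes_to_full_rollout() -> int:
    """Sum of observation windows before traffic can reach 100%."""
    return sum(duration for _, duration, _ in STAGES if duration is not None)

print(minimum_minutes_to_full_rollout())
```

With these durations, a rollout needs at least 450 minutes (7.5 hours) of observation before reaching 100%, not counting time spent waiting for manual approvals.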

Step 4: Monitoring

Real-time metrics dashboard
Comparison with baseline
Alert thresholds
Rollback triggers

Managing Canaries

Active Canaries

View all in-progress canaries:

Feature name
Current stage
Health status
Time remaining
Actions

Canary Actions

Action	Description
Promote	Move to next stage
Hold	Pause advancement
Rollback	Revert to baseline
Complete	Mark as successful

Rollback Procedures

Automatic Rollback

Triggered when:

Error rate exceeds threshold
Latency spike detected
Success rate drops
Manual trigger

Manual Rollback

Go to active canary
Click Rollback
Confirm action
Monitor recovery
Create incident report

Incident Management

Creating Incidents

From Alert

Click alert notification
Review alert details
Click Create Incident
Assign severity
Begin investigation

Manual Creation

Go to Incidents > New
Fill in details:
- Title
- Severity (P1-P4)
- Affected services
- Description
Assign responders
Start incident

Incident Workflow

Detection → Triage → Investigation → Mitigation → Resolution → Post-mortem

Status Updates

Status	Description
Investigating	Identifying root cause
Identified	Root cause known
Mitigating	Fix in progress
Resolved	Issue fixed
Closed	Post-mortem complete

Incident Response

During Incident

Update status regularly
Add timeline entries
Communicate with stakeholders
Execute runbooks
Document actions

Communication

Slack channel created automatically
Status page updates
Customer notifications
Executive briefings (P1)

Post-Mortems

After resolution:

Schedule post-mortem meeting
Complete incident template:
- Timeline of events
- Root cause analysis
- Impact assessment
- Action items
Review with team
Track action items

System Health Monitoring

Service Health

Status Indicators

🟢 Healthy: All metrics normal
🟡 Degraded: Minor issues
🔴 Unhealthy: Major issues
⚫ Unknown: No data

Per-Service Metrics

Request rate
Error rate
Latency (p50, p95, p99)
Saturation
Active connections

Infrastructure Metrics

Compute

CPU utilization
Memory usage
Network I/O
Disk I/O

Database

Query latency
Connection pool
Replication lag
Storage usage

Cache

Hit rate
Memory usage
Evictions
Connections

Alerting

Alert Severity

Severity	Response	Notification
Critical	Immediate	Page on-call
Warning	Within 15m	Slack + email
Info	Awareness	Slack only

Alert Rules

- name: high_error_rate
  condition: error_rate > 1%
  duration: 5m
  severity: critical
  notification: pagerduty
  runbook: /runbooks/high-error-rate

- name: elevated_latency
  condition: p99_latency > 500ms
  duration: 10m
  severity: warning
  notification: slack
  runbook: /runbooks/latency-investigation
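
The `duration` field implies that a condition must hold continuously before the alert fires. The sketch below shows that idea, assuming one metric sample per minute; the `AlertRule` class and `should_fire` function are hypothetical and do not describe the Cockpit's actual evaluator.

```python
from dataclasses import dataclass

@dataclass
class AlertRule:
    name: str
    threshold: float        # fire when the metric is above this value
    duration_minutes: int   # how long the condition must hold
    severity: str

def should_fire(rule: AlertRule, samples: list) -> bool:
    """samples holds one value per minute, oldest first (assumed cadence)."""
    window = samples[-rule.duration_minutes:]
    if len(window) < rule.duration_minutes:
        return False  # not enough history to cover the full duration
    return all(value > rule.threshold for value in window)

high_error_rate = AlertRule("high_error_rate", 1.0, 5, "critical")

# Error rate (percent) above 1% for the last five minutes: the rule fires.
print(should_fire(high_error_rate, [0.2, 1.5, 1.6, 1.4, 1.8, 1.7]))
```

Requiring every sample in the window to breach the threshold avoids paging on a single transient spike.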

Dashboards

Pre-built Dashboards

System Overview
Service Deep-dive
Database Performance
Cache Analytics
Edge/CDN Metrics

Custom Dashboards

Click New Dashboard
Add panels
Configure queries
Set refresh rate
Save and share

Release Management

Release Pipeline

Pipeline Stages

Build → Test → Stage → Canary → Production

Release Tracking

Version	Stage	Status	Deployed
v2.4.1	Production	✅ Live	Dec 1
v2.4.2	Canary	🔄 25%	Dec 1
v2.4.3	Staging	✅ Pass	Dec 1
v2.4.4	Testing	🔄 Running	Dec 1

Deployment Actions

Deploy to Stage

Select version
Choose environment
Review changes
Confirm deployment
Monitor

Promote to Production

Verify staging success
Create canary
Monitor metrics
Complete rollout

Rollback

Quick Rollback

Go to Releases > Active
Find problematic release
Click Rollback
Select target version
Confirm
Monitor recovery

Configuration Management

Environment Configuration

Viewing Config

Go to Config > Environments
Select environment
View all configuration values

Editing Config

Find configuration key
Click Edit
Enter new value
Add change note
Submit for approval
Deploy change

Secrets Management

Secret Categories

API keys
Database credentials
Third-party tokens
Encryption keys

Secret Operations

View (masked)
Rotate
Add new
Delete
Audit access

Configuration Drift

Detection

Compare environments
Highlight differences
Alert on unexpected changes
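
Drift detection amounts to a keyed comparison between two environments. Here is a minimal sketch, assuming configuration is available as flat key-value dictionaries; the `config_drift` function, the `allowed` exception set, and the sample keys are hypothetical.

```python
MISSING = "<missing>"  # marker for a key that exists in only one environment

def config_drift(reference: dict, candidate: dict, allowed=frozenset()) -> dict:
    """Return {key: (reference_value, candidate_value)} for every difference.

    Keys listed in `allowed` are documented exceptions and are skipped.
    """
    drift = {}
    for key in sorted(reference.keys() | candidate.keys()):
        if key in allowed:
            continue
        left = reference.get(key, MISSING)
        right = candidate.get(key, MISSING)
        if left != right:
            drift[key] = (left, right)
    return drift

staging = {"pool_size": 20, "log_level": "info"}
production = {"pool_size": 50, "log_level": "info", "debug": True}

print(config_drift(staging, production, allowed={"pool_size"}))
```

Passing documented exceptions through `allowed` keeps intentional differences, such as a larger connection pool in production, from being reported as drift.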

Resolution

Sync configurations
Document exceptions
Track compliance

Runbooks & Automation

Runbook Library

Categories

Incident Response
Deployment
Scaling
Database Operations
Cache Management
Security Response

Running Runbooks

Manual Execution

Go to Runbooks
Select runbook
Review steps
Enter parameters
Execute
Monitor progress

Automated Triggers

Alert-based
Schedule-based
Event-based
Manual

Creating Runbooks

Runbook Template

name: Restart Service
description: Safely restart a service instance
parameters:
  - name: service_name
    type: string
    required: true
  - name: instance_id
    type: string
    required: true
steps:
  - name: Verify service exists
    action: verify_service
    params:
      service: "{{ service_name }}"
  - name: Drain connections
    action: drain_service
    params:
      instance: "{{ instance_id }}"
      timeout: 30s
  - name: Restart instance
    action: restart_instance
    params:
      instance: "{{ instance_id }}"
  - name: Health check
    action: health_check
    params:
      service: "{{ service_name }}"
      timeout: 60s

On-Call Management

Schedule Overview

Current On-Call

Role	Engineer	Until
Primary	Jane Smith	Dec 2, 9am
Secondary	John Doe	Dec 2, 9am
Manager	Alice Wong	Dec 4, 9am

Managing Schedules

View Schedule

Weekly calendar view
Coverage gaps
Upcoming shifts
Swap requests

Swap Shifts

Find your shift
Click Request Swap
Select replacement
Submit request
Await approval

Escalation Policies

Default Policy

1. Primary On-Call (immediate)
   └── 15 min no response
2. Secondary On-Call
   └── 15 min no response
3. Engineering Manager
   └── 15 min no response
4. VP Engineering
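
The default policy reduces to simple arithmetic on the time an alert has gone unacknowledged. The following is a minimal sketch, assuming each level is paged in addition to the previous ones; the `paged_roles` function is hypothetical.

```python
# Escalation chain and step length taken from the default policy above.
ESCALATION = [
    "Primary On-Call",
    "Secondary On-Call",
    "Engineering Manager",
    "VP Engineering",
]
STEP_MINUTES = 15

def paged_roles(minutes_unacknowledged: int) -> list:
    """Roles that have been paged after the given time without a response."""
    level = min(minutes_unacknowledged // STEP_MINUTES, len(ESCALATION) - 1)
    return ESCALATION[: level + 1]

print(paged_roles(20))  # primary and secondary have been paged
```

An alert that nobody acknowledges reaches the VP Engineering level after 45 minutes.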

On-Call Tools

During Shift

Acknowledge alerts
Run diagnostics
Execute runbooks
Update incidents
Escalate issues

Handoff

Review open incidents
Document ongoing issues
Brief incoming engineer

Troubleshooting Guide

Common Issues

High Error Rate

Check recent deployments
Review error logs
Check downstream services
Verify database health
Check for traffic spike

Latency Spike

Identify slow endpoints
Check database queries
Review cache hit rate
Check resource utilization
Analyze traffic patterns

Service Unavailable

Verify service is running
Check load balancer
Review health checks
Check dependencies
Review resource limits

Diagnostic Tools

Log Search

Full-text search
Filter by service
Filter by severity
Time range selection

Trace Analysis

Request tracing
Latency breakdown
Error attribution
Dependency mapping

Metric Explorer

Ad-hoc queries
Custom visualizations
Anomaly detection
Correlation analysis

Emergency Procedures

Critical Incident

Acknowledge alert
Join incident channel
Assess impact
Execute mitigation
Communicate status
Document actions

System-Wide Outage

Activate war room
Assess scope
Coordinate response
Execute recovery
External communication
Full post-mortem

Appendix

Keyboard Shortcuts

Shortcut	Action
Cmd/Ctrl + K	Global search
G + D	Dashboard
G + G	Gating
G + C	Canaries
G + I	Incidents
G + R	Runbooks
Shift + ?	Help

Severity Definitions

Level	Name	Response	Example
P1	Critical	Immediate	Full outage
P2	High	1 hour	Partial outage
P3	Medium	4 hours	Degraded service
P4	Low	24 hours	Minor issue

Contact

Platform Team: platform@nebusai.com
On-Call: oncall@nebusai.com
Security: security@nebusai.com

Olympus Cloud Business OS - Operations Cockpit
