Olympus Cloud - Cockpit Operations User Guide
Quick Summary (for RAG)
The Cockpit is the operational command center for platform engineering, DevOps, and SRE teams. Features include: AI Agent Monitoring Dashboard for agent status and performance, Human-in-the-Loop (HITL) approval queue for AI actions, ACP Agent Registry and tool permissions management, AI Safety Controls and incident management, AI Cost Analytics with model usage tracking (ACP AI Router tiers T1-T6), gating and feature flag management, canary deployments, system health monitoring with SLO/SLI dashboards, release management, on-call management, and runbooks. Target users: SRE, DevOps, Platform Engineers, Release Managers.
Version: 1.1
Last Updated: January 2026
Application: Olympus Cloud Cockpit - Operations Center
Access URL: https://dev.cockpit.olympuscloud.ai (development) | https://cockpit.olympuscloud.ai (production)
Table of Contents
- Overview
- Getting Started
- Operations Dashboard
- Gating & Feature Management
- Canary Deployments
- Incident Management
- System Health Monitoring
- Release Management
- Configuration Management
- Runbooks & Automation
- On-Call Management
- Troubleshooting Guide
Overview
The Olympus Cloud Cockpit is the operational command center for the platform engineering and DevOps teams. It provides real-time visibility into system health, deployment management, incident response, and feature rollout control.
Target Users
- Platform Engineers: System operations and maintenance
- DevOps Engineers: Deployments and infrastructure
- SRE Team: Reliability and incident response
- Release Managers: Feature rollouts and canaries
- On-Call Engineers: Incident triage and resolution
Key Features
- Real-time system health monitoring
- Canary deployment wizard
- Feature flag management
- Incident management and response
- Automated runbooks
- On-call scheduling and escalation
- Performance analytics
Getting Started
Access Requirements
| Role | Access Level | Capabilities |
|---|---|---|
| Viewer | Read-only | View dashboards, logs |
| Operator | Standard | Manage features, run runbooks |
| Admin | Full | All operations, configurations |
| Super Admin | Elevated | Emergency controls, secrets |
Logging In
- Navigate to the Cockpit URL
- Authenticate with SSO
- Complete MFA verification
- Select operational context
First-Time Setup
- Configure notification preferences
- Join on-call rotation (if applicable)
- Review runbook library
- Set up personal dashboards
Operations Dashboard
Overview Panel
The main dashboard displays:
System Status
┌────────────────────────────────────────┐
│ SYSTEM STATUS: HEALTHY                 │
│ ────────────────────────────────────── │
│ ● API Gateway     : Healthy (p99: 45ms)│
│ ● Auth Service    : Healthy (p99: 23ms)│
│ ● Order Service   : Healthy (p99: 67ms)│
│ ● Payment Service : Healthy (p99: 89ms)│
│ ● Database        : Healthy (CPU: 34%) │
│ ● Cache           : Healthy (Hit: 94%) │
└────────────────────────────────────────┘
Active Incidents
- Critical (P1): Immediate attention
- High (P2): Within 1 hour
- Medium (P3): Within 4 hours
- Low (P4): Best effort, within 24 hours
Deployment Activity
- Active canaries
- Recent deployments
- Pending releases
- Rollback history
Key Metrics
| Metric | Current | Target | Status |
|---|---|---|---|
| Uptime | 99.97% | 99.9% | ✅ |
| Error Rate | 0.02% | less than 0.1% | ✅ |
| P99 Latency | 234ms | less than 500ms | ✅ |
| Active Users | 12,453 | N/A | — |
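The uptime row can be read as an error budget. The sketch below converts the target (99.9%) and the current value (99.97%) into minutes of downtime. The 30-day measurement window is an assumption; the guide does not state which window the Cockpit uses.

```python
# Assumption: uptime is measured over a rolling 30-day window.
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes


def downtime_minutes(uptime_percent: float,
                     window_minutes: int = MINUTES_PER_30_DAYS) -> float:
    """Minutes of downtime implied by an uptime percentage over the window."""
    return window_minutes * (100.0 - uptime_percent) / 100.0


budget = downtime_minutes(99.9)   # downtime the target allows
used = downtime_minutes(99.97)    # downtime implied by the current value
remaining = budget - used

print(f"budget={budget:.2f} min, used={used:.2f} min, remaining={remaining:.2f} min")
```

Under that assumption the target allows about 43.2 minutes of downtime, of which about 13 minutes are consumed at 99.97% uptime.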
Quick Actions
- Create Incident: Start incident workflow
- Kill Switch: Emergency feature disable
- Deploy Canary: Start canary deployment
- Run Runbook: Execute automation
Gating & Feature Management
Gating Overview
The gating system controls feature availability:
Gating Hierarchy
Global Settings
└── Environment (prod, staging, dev)
    └── Tenant Groups
        └── Individual Tenants
            └── User Segments
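One way to read the hierarchy is that a setting at a more specific level overrides a broader one. That precedence rule is an assumption, not documented Cockpit behavior, and the level names and `resolve_flag` function below are hypothetical. A minimal sketch:

```python
from typing import Optional

# Levels ordered from broadest to most specific, mirroring the hierarchy above.
LEVELS = ["global", "environment", "tenant_group", "tenant", "user_segment"]


def resolve_flag(overrides: dict[str, Optional[bool]], default: bool = False) -> bool:
    """Return the value set at the most specific level that defines one.

    Assumption: more specific levels win over broader ones.
    """
    for level in reversed(LEVELS):
        value = overrides.get(level)
        if value is not None:
            return value
    return default


# Enabled globally but switched off for one tenant: the tenant setting wins.
print(resolve_flag({"global": True, "tenant": False}))
# Only a global setting exists, so it applies everywhere.
print(resolve_flag({"global": True}))
```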
Feature Flags
Viewing Flags
- Navigate to Gating > Feature Flags
- Filter by:
  - Status (enabled, disabled, canary)
  - Environment
  - Owner
  - Age
Flag Details
- Flag key and description
- Current state per environment
- Targeting rules
- Audit history
- Related incidents
Managing Flags
Enable/Disable
- Select flag
- Choose environment
- Toggle state
- Add change note
- Confirm
Targeting Rules
{
  "flag": "new_checkout_v2",
  "rules": [
    {
      "name": "Enterprise Tenants",
      "condition": {
        "attribute": "tenant.plan",
        "operator": "equals",
        "value": "enterprise"
      },
      "variation": true,
      "percentage": 100
    },
    {
      "name": "Gradual Rollout",
      "condition": {
        "attribute": "user.id",
        "operator": "percentage"
      },
      "variation": true,
      "percentage": 25
    }
  ],
  "defaultVariation": false
}
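The guide does not describe how these rules are evaluated, so the following is only a sketch under stated assumptions: rules are checked in order, the first match wins, and the `percentage` operator assigns each user to a stable bucket by hashing the flag key and user ID. The `bucket`, `matches`, and `evaluate` functions are hypothetical, and the rule-level `percentage` is applied only to the `percentage` operator.

```python
import hashlib

FLAG = {
    "flag": "new_checkout_v2",
    "rules": [
        {"name": "Enterprise Tenants",
         "condition": {"attribute": "tenant.plan", "operator": "equals",
                       "value": "enterprise"},
         "variation": True, "percentage": 100},
        {"name": "Gradual Rollout",
         "condition": {"attribute": "user.id", "operator": "percentage"},
         "variation": True, "percentage": 25},
    ],
    "defaultVariation": False,
}


def bucket(flag_key: str, value: str) -> int:
    """Map a value to a stable bucket in [0, 100) using SHA-256."""
    digest = hashlib.sha256(f"{flag_key}:{value}".encode()).hexdigest()
    return int(digest, 16) % 100


def matches(flag_key: str, rule: dict, context: dict) -> bool:
    """Check one rule against a flat context such as {"user.id": "u-1"}."""
    cond = rule["condition"]
    actual = context.get(cond["attribute"])
    if actual is None:
        return False
    if cond["operator"] == "equals":
        return actual == cond["value"]
    if cond["operator"] == "percentage":
        return bucket(flag_key, str(actual)) < rule["percentage"]
    return False  # unknown operators never match


def evaluate(flag: dict, context: dict) -> bool:
    """Return the variation of the first matching rule, else the default."""
    for rule in flag["rules"]:
        if matches(flag["flag"], rule, context):
            return rule["variation"]
    return flag["defaultVariation"]


print(evaluate(FLAG, {"tenant.plan": "enterprise", "user.id": "u-1"}))
```

Hashing the flag key together with the user ID keeps each user's assignment stable across requests while giving different flags independent rollout populations.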
Kill Switches
Kill switches immediately disable a feature across all tenants. Always document the reason in an incident report, and monitor downstream impact after activation. Only use kill switches for genuine emergencies.
Emergency controls for critical issues:
- Go to Gating > Kill Switches
- Locate feature
- Click Kill button
- Confirm with reason
- Monitor impact
- Document in incident
Kill Switch Dashboard
- Active kills
- Recent activations
- Auto-recovery settings
- Impact metrics
Canary Deployments
Canary Wizard
The Canary Wizard guides controlled rollouts:
Start canary deployments at 1% traffic and allow at least 30 minutes of observation before promoting to the next stage. Define clear success criteria upfront so rollback decisions are data-driven, not subjective.
Step 1: Configuration
- Select feature flag
- Choose target environment
- Set initial percentage (1-5% recommended)
- Define success metrics
Step 2: Health Criteria
success_criteria:
  error_rate:
    threshold: 0.1%
    comparison: less_than
  latency_p99:
    threshold: 500ms
    comparison: less_than
  success_rate:
    threshold: 99.5%
    comparison: greater_than
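To illustrate how criteria like these could be checked against observed metrics, here is a minimal sketch. It restates the thresholds as plain numbers (percent for rates, milliseconds for latency) and treats a missing metric as a failure; both choices, and the `failed_criteria` function itself, are assumptions rather than documented Cockpit behavior.

```python
import operator

# The criteria above with units stripped: percent for rates, ms for latency.
SUCCESS_CRITERIA = {
    "error_rate": {"threshold": 0.1, "comparison": "less_than"},
    "latency_p99": {"threshold": 500, "comparison": "less_than"},
    "success_rate": {"threshold": 99.5, "comparison": "greater_than"},
}

COMPARATORS = {"less_than": operator.lt, "greater_than": operator.gt}


def failed_criteria(metrics: dict[str, float]) -> list[str]:
    """Return the names of criteria that the observed metrics do not satisfy."""
    failures = []
    for name, rule in SUCCESS_CRITERIA.items():
        observed = metrics.get(name)
        compare = COMPARATORS[rule["comparison"]]
        # A missing metric counts as a failure rather than a silent pass.
        if observed is None or not compare(observed, rule["threshold"]):
            failures.append(name)
    return failures


canary_metrics = {"error_rate": 0.02, "latency_p99": 650, "success_rate": 99.9}
print(failed_criteria(canary_metrics))
```

An empty list means the canary meets every criterion; any entry is a candidate rollback trigger.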
Step 3: Rollout Plan
| Stage | Percentage | Duration | Auto-Promote |
|---|---|---|---|
| 1 | 1% | 30 min | Yes |
| 2 | 5% | 1 hour | Yes |
| 3 | 25% | 2 hours | No |
| 4 | 50% | 4 hours | No |
| 5 | 100% | — | Manual |
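The rollout plan can be read as a small state machine. The sketch below encodes the table and decides the next step for a stage, assuming that an unhealthy canary always rolls back and that the final stage waits for a manual Complete. The `Stage` class, `next_action` function, and returned action names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Stage:
    percentage: int
    duration_min: Optional[int]  # None for the final stage, which has no timer
    auto_promote: bool


# The five stages from the rollout plan table.
PLAN = [
    Stage(1, 30, True),
    Stage(5, 60, True),
    Stage(25, 120, False),
    Stage(50, 240, False),
    Stage(100, None, False),
]


def next_action(stage_index: int, elapsed_min: int, healthy: bool) -> str:
    """Decide what should happen to a canary at the given stage."""
    stage = PLAN[stage_index]
    if not healthy:
        return "rollback"
    if stage.duration_min is None:
        return "await_manual_complete"
    if elapsed_min < stage.duration_min:
        return "wait"
    return "promote" if stage.auto_promote else "await_manual_promote"


print(next_action(0, 30, healthy=True))
print(next_action(2, 120, healthy=True))
```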
Step 4: Monitoring
- Real-time metrics dashboard
- Comparison with baseline
- Alert thresholds
- Rollback triggers
Managing Canaries
Active Canaries
View all in-progress canaries:
- Feature name
- Current stage
- Health status
- Time remaining
- Actions
Canary Actions
| Action | Description |
|---|---|
| Promote | Move to next stage |
| Hold | Pause advancement |
| Rollback | Revert to baseline |
| Complete | Mark as successful |
Rollback Procedures
Automatic Rollback
Triggered when:
- Error rate exceeds threshold
- Latency spike detected
- Success rate drops
- Manual trigger
Manual Rollback
- Go to active canary
- Click Rollback
- Confirm action
- Monitor recovery
- Create incident report
Incident Management
Creating Incidents
From Alert
- Click alert notification
- Review alert details
- Click Create Incident
- Assign severity
- Begin investigation
Manual Creation
- Go to Incidents > New
- Fill in details:
  - Title
  - Severity (P1-P4)
  - Affected services
  - Description
- Assign responders
- Start incident
Incident Workflow
Detection → Triage → Investigation → Mitigation → Resolution → Post-mortem
Status Updates
| Status | Description |
|---|---|
| Investigating | Identifying root cause |
| Identified | Root cause known |
| Mitigating | Fix in progress |
| Resolved | Issue fixed |
| Closed | Post-mortem complete |
Incident Response
During Incident
- Update status regularly
- Add timeline entries
- Communicate with stakeholders
- Execute runbooks
- Document actions
Communication
- Slack channel created automatically
- Status page updates
- Customer notifications
- Executive briefings (P1)
Post-Mortems
After resolution:
- Schedule post-mortem meeting
- Complete incident template:
  - Timeline of events
  - Root cause analysis
  - Impact assessment
  - Action items
- Review with team
- Track action items
System Health Monitoring
Service Health
Status Indicators
- 🟢 Healthy: All metrics normal
- 🟡 Degraded: Minor issues
- 🔴 Unhealthy: Major issues
- ⚫ Unknown: No data
Per-Service Metrics
- Request rate
- Error rate
- Latency (p50, p95, p99)
- Saturation
- Active connections
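For readers unfamiliar with the latency percentiles listed above, the sketch below computes p50, p95, and p99 from raw samples using the nearest-rank method. The Cockpit's own aggregation method is not documented here, so treat this as an illustration of the concept only; the sample values are made up.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds.
latencies_ms = [12, 15, 18, 20, 22, 25, 31, 40, 95, 480]

for pct in (50, 95, 99):
    print(f"p{pct}: {percentile(latencies_ms, pct)} ms")
```

In this made-up sample the median stays low while p95 and p99 are dominated by the single slow request, which is why tail percentiles are tracked alongside the median.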
Infrastructure Metrics
Compute
- CPU utilization
- Memory usage
- Network I/O
- Disk I/O
Database
- Query latency
- Connection pool
- Replication lag
- Storage usage
Cache
- Hit rate
- Memory usage
- Evictions
- Connections
Alerting
Alert Severity
| Severity | Response | Notification |
|---|---|---|
| Critical | Immediate | Page on-call |
| Warning | Within 15m | Slack + email |
| Info | Awareness | Slack only |
Alert Rules
- name: high_error_rate
  condition: error_rate > 1%
  duration: 5m
  severity: critical
  notification: pagerduty
  runbook: /runbooks/high-error-rate
- name: elevated_latency
  condition: p99_latency > 500ms
  duration: 10m
  severity: warning
  notification: slack
  runbook: /runbooks/latency-investigation
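A plausible reading of `duration` is that the condition must hold continuously for that long before the alert fires. That interpretation is an assumption, and the `AlertRule` class and `is_firing` function below are hypothetical. A minimal sketch:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    name: str
    threshold: float   # fires when the metric is above this value
    duration_s: int    # how long the breach must persist
    severity: str


def is_firing(rule: AlertRule, samples: list[tuple[int, float]], now: int) -> bool:
    """True when every sample in the last `duration_s` seconds breaches the
    threshold and data has been collected for at least that long.

    `samples` is a list of (unix_timestamp, value) pairs.
    """
    window_start = now - rule.duration_s
    window = [value for ts, value in samples if window_start <= ts <= now]
    # Without a sample at or before the window start, a brief spike right
    # after startup could be mistaken for a sustained breach.
    covers_window = any(ts <= window_start for ts, _ in samples)
    return covers_window and bool(window) and all(v > rule.threshold for v in window)


high_error_rate = AlertRule("high_error_rate", threshold=1.0, duration_s=300,
                            severity="critical")

now = 1_000
sustained = [(now - 60 * i, 2.0) for i in range(11)]  # 10 minutes at 2% errors
print(is_firing(high_error_rate, sustained, now))
```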
Dashboards
Pre-built Dashboards
- System Overview
- Service Deep-dive
- Database Performance
- Cache Analytics
- Edge/CDN Metrics
Custom Dashboards
- Click New Dashboard
- Add panels
- Configure queries
- Set refresh rate
- Save and share
Release Management
Release Pipeline
Pipeline Stages
Build → Test → Stage → Canary → Production
Release Tracking
| Version | Stage | Status | Deployed |
|---|---|---|---|
| v2.4.1 | Production | ✅ Live | Dec 1 |
| v2.4.2 | Canary | 🔄 25% | Dec 1 |
| v2.4.3 | Staging | ✅ Pass | Dec 1 |
| v2.4.4 | Testing | 🔄 Running | Dec 1 |
Deployment Actions
Deploy to Stage
- Select version
- Choose environment
- Review changes
- Confirm deployment
- Monitor
Promote to Production
- Verify staging success
- Create canary
- Monitor metrics
- Complete rollout
Rollback
Quick Rollback
- Go to Releases > Active
- Find problematic release
- Click Rollback
- Select target version
- Confirm
- Monitor recovery
Configuration Management
Environment Configuration
Viewing Config
- Go to Config > Environments
- Select environment
- View all configuration values
Editing Config
- Find configuration key
- Click Edit
- Enter new value
- Add change note
- Submit for approval
- Deploy change
Secrets Management
Secret Categories
- API keys
- Database credentials
- Third-party tokens
- Encryption keys
Secret Operations
- View (masked)
- Rotate
- Add new
- Delete
- Audit access
Configuration Drift
Detection
- Compare environments
- Highlight differences
- Alert on unexpected changes
Resolution
- Sync configurations
- Document exceptions
- Track compliance
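Drift detection amounts to a key-by-key comparison of two environments. The sketch below shows the idea on plain dictionaries; the configuration keys, values, and the `config_drift` function are hypothetical and do not reflect the Cockpit's actual data model.

```python
MISSING = "<missing>"  # marker for a key that exists in only one environment


def config_drift(baseline: dict, candidate: dict) -> dict[str, tuple]:
    """Return {key: (baseline_value, candidate_value)} for every key that differs."""
    drift = {}
    for key in sorted(baseline.keys() | candidate.keys()):
        left = baseline.get(key, MISSING)
        right = candidate.get(key, MISSING)
        if left != right:
            drift[key] = (left, right)
    return drift


# Hypothetical configuration values for two environments.
staging = {"db.pool_size": 20, "cache.ttl_s": 300, "feature.beta": True}
prod = {"db.pool_size": 50, "cache.ttl_s": 300}

for key, (left, right) in config_drift(staging, prod).items():
    print(f"{key}: staging={left} prod={right}")
```

Differences that are intentional, such as a larger connection pool in production, are the ones to record as documented exceptions.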
Runbooks & Automation
Runbook Library
Categories
- Incident Response
- Deployment
- Scaling
- Database Operations
- Cache Management
- Security Response
Running Runbooks
Manual Execution
- Go to Runbooks
- Select runbook
- Review steps
- Enter parameters
- Execute
- Monitor progress
Automated Triggers
- Alert-based
- Schedule-based
- Event-based
- Manual
Creating Runbooks
Runbook Template
name: Restart Service
description: Safely restart a service instance
parameters:
  - name: service_name
    type: string
    required: true
  - name: instance_id
    type: string
    required: true
steps:
  - name: Verify service exists
    action: verify_service
    params:
      service: "{{ service_name }}"
  - name: Drain connections
    action: drain_service
    params:
      instance: "{{ instance_id }}"
      timeout: 30s
  - name: Restart instance
    action: restart_instance
    params:
      instance: "{{ instance_id }}"
  - name: Health check
    action: health_check
    params:
      service: "{{ service_name }}"
      timeout: 60s
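Before a runbook like this executes, its required parameters need to be present and the `{{ name }}` placeholders filled in. The sketch below shows one way that could work on an already-parsed runbook (abbreviated to two steps). The `missing_parameters` and `render` functions and the example values are hypothetical, not the Cockpit's engine.

```python
import re

PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")

# The template above as parsed data, abbreviated to two of its steps.
RUNBOOK = {
    "name": "Restart Service",
    "parameters": [
        {"name": "service_name", "type": "string", "required": True},
        {"name": "instance_id", "type": "string", "required": True},
    ],
    "steps": [
        {"name": "Drain connections", "action": "drain_service",
         "params": {"instance": "{{ instance_id }}", "timeout": "30s"}},
        {"name": "Health check", "action": "health_check",
         "params": {"service": "{{ service_name }}", "timeout": "60s"}},
    ],
}


def missing_parameters(runbook: dict, supplied: dict) -> list[str]:
    """Names of required parameters that were not supplied."""
    return [p["name"] for p in runbook["parameters"]
            if p.get("required") and p["name"] not in supplied]


def render(value, supplied: dict):
    """Recursively substitute {{ name }} placeholders in strings, dicts, and lists.

    Raises KeyError if a placeholder names a parameter that was not supplied.
    """
    if isinstance(value, str):
        return PLACEHOLDER.sub(lambda m: str(supplied[m.group(1)]), value)
    if isinstance(value, dict):
        return {k: render(v, supplied) for k, v in value.items()}
    if isinstance(value, list):
        return [render(v, supplied) for v in value]
    return value


# Hypothetical parameter values.
args = {"service_name": "order-service", "instance_id": "i-123"}
if not missing_parameters(RUNBOOK, args):
    for step in render(RUNBOOK["steps"], args):
        print(step["action"], step["params"])
```

Validating parameters before rendering means a run fails fast instead of halfway through a drain or restart.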
On-Call Management
Schedule Overview
Current On-Call
| Role | Engineer | Until |
|---|---|---|
| Primary | Jane Smith | Dec 2, 9am |
| Secondary | John Doe | Dec 2, 9am |
| Manager | Alice Wong | Dec 4, 9am |
Managing Schedules
View Schedule
- Weekly calendar view
- Coverage gaps
- Upcoming shifts
- Swap requests
Swap Shifts
- Find your shift
- Click Request Swap
- Select replacement
- Submit request
- Await approval
Escalation Policies
Default Policy
1. Primary On-Call (immediate)
   └── 15 min no response
2. Secondary On-Call
   └── 15 min no response
3. Engineering Manager
   └── 15 min no response
4. VP Engineering
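The default policy can be expressed as a function of how long an alert has gone unacknowledged. This is a sketch of the policy as written above, with each level notified 15 minutes after the previous one; the `notified_roles` function is hypothetical.

```python
ESCALATION_CHAIN = [
    "Primary On-Call",
    "Secondary On-Call",
    "Engineering Manager",
    "VP Engineering",
]
STEP_MINUTES = 15  # time without a response before the next level is notified


def notified_roles(minutes_unacknowledged: int) -> list[str]:
    """Roles that have been notified after the given number of minutes.

    Expects a non-negative number of minutes.
    """
    steps = minutes_unacknowledged // STEP_MINUTES
    last_index = min(steps, len(ESCALATION_CHAIN) - 1)
    return ESCALATION_CHAIN[: last_index + 1]


for minutes in (0, 15, 30, 45):
    print(minutes, notified_roles(minutes)[-1])
```

Under this policy an alert that nobody acknowledges reaches the VP of Engineering 45 minutes after it first pages the primary.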
On-Call Tools
During Shift
- Acknowledge alerts
- Run diagnostics
- Execute runbooks
- Update incidents
- Escalate issues
Handoff
- Review open incidents
- Document ongoing issues
- Brief incoming engineer
Troubleshooting Guide
Common Issues
High Error Rate
- Check recent deployments
- Review error logs
- Check downstream services
- Verify database health
- Check for traffic spike
Latency Spike
- Identify slow endpoints
- Check database queries
- Review cache hit rate
- Check resource utilization
- Analyze traffic patterns
Service Unavailable
- Verify service is running
- Check load balancer
- Review health checks
- Check dependencies
- Review resource limits
Diagnostic Tools
Log Search
- Full-text search
- Filter by service
- Filter by severity
- Time range selection
Trace Analysis
- Request tracing
- Latency breakdown
- Error attribution
- Dependency mapping
Metric Explorer
- Ad-hoc queries
- Custom visualizations
- Anomaly detection
- Correlation analysis
Emergency Procedures
Critical Incident
- Acknowledge alert
- Join incident channel
- Assess impact
- Execute mitigation
- Communicate status
- Document actions
System-Wide Outage
- Activate war room
- Assess scope
- Coordinate response
- Execute recovery
- External communication
- Full post-mortem
Appendix
Keyboard Shortcuts
| Shortcut | Action |
|---|---|
| Cmd/Ctrl + K | Global search |
| G + D | Dashboard |
| G + G | Gating |
| G + C | Canaries |
| G + I | Incidents |
| G + R | Runbooks |
| Shift + ? | Help |
Severity Definitions
| Level | Name | Response | Example |
|---|---|---|---|
| P1 | Critical | Immediate | Full outage |
| P2 | High | 1 hour | Partial outage |
| P3 | Medium | 4 hours | Degraded service |
| P4 | Low | 24 hours | Minor issue |
Contact
- Platform Team: platform@nebusai.com
- On-Call: oncall@nebusai.com
- Security: security@nebusai.com
Olympus Cloud Business OS - Operations Cockpit