Cockpit Operations Center

The Cockpit is NebusAI's internal operations center for monitoring AI agents, approving human-in-the-loop (HITL) actions, managing ACP infrastructure, and ensuring AI safety.

Overview

Cockpit provides a unified interface for platform operators to:

Monitor AI agent health and performance
Approve or reject AI-initiated actions requiring human oversight
Manage tenant operations and configurations
Track AI costs and model usage
Respond to incidents and safety alerts

Access

Cockpit is available at https://cockpit.olympuscloud.ai and requires NebusAI employee credentials with the cockpit_operator role.

Dashboard Tabs

1. Dashboard

The main overview showing system health at a glance.

Widget	Description
System Health	Service status across all regions
Active Agents	AI agents currently running
Pending Approvals	HITL items awaiting action
Recent Incidents	Last 24h safety/performance issues
Cost Tracker	Real-time AI spend vs budget

2. Gating Engine

Feature flag and policy management for the platform.

Capability	Description
Feature Flags	Enable/disable features by tenant
Canary Deployments	Gradual rollout controls
A/B Tests	Experiment configuration
Kill Switches	Emergency feature disable
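The sketch below illustrates one common way a per-tenant gradual rollout with a kill switch can be evaluated. It is not a description of the Gating Engine's internals: the hash-based bucketing technique and the names `rollout_bucket` and `is_enabled` are assumptions made for illustration.

```python
import hashlib


def rollout_bucket(flag: str, tenant_id: str) -> int:
    """Map a (flag, tenant) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag}:{tenant_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag: str, tenant_id: str, rollout_percent: int,
               killed: bool = False) -> bool:
    """Return True when the flag is on for this tenant.

    A kill switch overrides everything. Otherwise the tenant is included
    when its bucket falls below the rollout percentage, so raising the
    percentage only ever adds tenants and never removes them.
    """
    if killed:
        return False
    return rollout_bucket(flag, tenant_id) < rollout_percent
```

Because the bucket depends only on the flag name and tenant ID, a tenant that receives a feature at 10% keeps it as the rollout widens to 25% or 50%.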

3. AI Agent Monitoring

Real-time visibility into all AI agents across the platform.

┌─────────────────────────────────────────────────────────────────┐
│  AI AGENT MONITORING                              Live Status   │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  AGENT STATUS                                                   │
│  ───────────────────────────────────────────────────────────    │
│  🟢 Maximus (Customer AI)      Healthy    2,847 req/min        │
│  🟢 Support Agent              Healthy    428 req/min          │
│  🟢 Scheduling Agent           Healthy    156 req/min          │
│  🟡 Marketing Agent            Degraded   42 req/min   ⚠       │
│  🟢 Voice Agent                Healthy    89 req/min           │
│                                                                  │
│  PERFORMANCE (Last Hour)                                        │
│  ───────────────────────────────────────────────────────────    │
│  Total Requests:    218,420     Avg Latency:    124ms          │
│  Success Rate:      99.8%       Cache Hit:      78.2%          │
│  Tokens Used:       12.4M       Est. Cost:      $142.80        │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Agent Details View:

Metric	Description
Health Status	Green/Yellow/Red based on error rate and latency
Request Rate	Requests per minute
Latency	P50, P95, P99 response times
Error Rate	Percentage of failed requests
Token Usage	Input/output tokens consumed
Model Distribution	Which model tiers are being used
Active Sessions	Concurrent user sessions
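As a rough illustration of how the latency percentiles and the Green/Yellow/Red status in the table above could be derived, here is a minimal sketch. The threshold values and the function names are hypothetical; Cockpit's actual thresholds are not stated in this document.

```python
from statistics import quantiles


def latency_percentiles(samples_ms):
    """Return P50, P95, and P99 for latency samples (needs at least 2 samples)."""
    cuts = quantiles(samples_ms, n=100, method="inclusive")  # 99 cut points
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def health_status(error_rate, p95_ms,
                  warn=(0.01, 500.0), crit=(0.05, 2000.0)):
    """Classify an agent as green, yellow, or red.

    `warn` and `crit` are (error_rate, p95_ms) pairs. They are placeholder
    thresholds chosen for this example, not Cockpit's real values.
    """
    if error_rate >= crit[0] or p95_ms >= crit[1]:
        return "red"
    if error_rate >= warn[0] or p95_ms >= warn[1]:
        return "yellow"
    return "green"
```

Either signal alone is enough to degrade the status, which matches the table's description of health as depending on both error rate and latency.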

Agent Configuration:

Setting	Purpose
Max Concurrency	Limit simultaneous requests
Timeout	Request timeout threshold
Fallback Tier	Model to use on failure
Rate Limits	Per-tenant request limits
Kill Switch	Emergency disable

4. AI Metrics

Comprehensive AI performance monitoring.

Metric Category	Details
Request Volume	Queries per second by tier
Latency	P50, P95, P99 by model
Cost	Real-time and projected spend
Cache Performance	Hit rate, savings
Error Rate	Failures by model/tier

Model Tier Performance:

Tier	Model	Avg Latency	Error Rate	Cache Hit
T1	Llama 4 Scout	45ms	0.1%	85%
T2	Gemini 2.0 Flash	180ms	0.2%	72%
T3	Gemini 3 Flash	320ms	0.3%	68%
T4	Claude Haiku 4.5	450ms	0.2%	65%
T5	Claude Sonnet 4.5	1,200ms	0.4%	52%
T6	Claude Opus 4.5	2,400ms	0.5%	45%

5. AI Cost Analytics

Financial tracking for AI operations with 95%+ cost optimization.

┌─────────────────────────────────────────────────────────────────┐
│  AI COST ANALYTICS                                   This Month │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  SPEND OVERVIEW                                                 │
│  ───────────────────────────────────────────────────────────    │
│  Budget:         $25,000                                        │
│  Actual:         $18,420        73.7% of budget                │
│  Projected:      $22,850        91.4% of budget                │
│  Savings:        $142,580       (vs no optimization)           │
│                                                                  │
│  ████████████████████████████████░░░░░░░░░░░░░░░░░░            │
│  $0           $12.5K          $18.4K        $25K Budget        │
│                                                                  │
│  COST BY MODEL TIER                                             │
│  ───────────────────────────────────────────────────────────    │
│  T1 Llama Scout   $0          0%    (62% of requests)          │
│  T2 Gemini Flash  $1,842      10%   (24% of requests)          │
│  T3 Gemini 3      $2,763      15%   (8% of requests)           │
│  T4 Claude Haiku  $4,605      25%   (4% of requests)           │
│  T5 Claude Sonnet $7,368      40%   (1.8% of requests)         │
│  T6 Claude Opus   $1,842      10%   (0.2% of requests)         │
│                                                                  │
│  COST SAVINGS                                                   │
│  ───────────────────────────────────────────────────────────    │
│  Smart Routing:    -$98,420   (T1 handles 62% of requests)     │
│  Caching:          -$32,840   (78% cache hit rate)             │
│  Batching:         -$11,320   (request batching)               │
│  ────────────────────────────────                              │
│  Total Savings:    -$142,580  (88.5% cost reduction)           │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
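The figures in the panel above are internally consistent, and the arithmetic can be reproduced in a few lines. The variable names are chosen for this example only; the cost reduction is computed here as savings divided by the unoptimized spend (actual plus savings), which is an assumption about how the panel defines it.

```python
budget = 25_000
actual = 18_420
projected = 22_850

savings = {"smart_routing": 98_420, "caching": 32_840, "batching": 11_320}
tier_costs = {"T1": 0, "T2": 1_842, "T3": 2_763, "T4": 4_605, "T5": 7_368, "T6": 1_842}

# Budget utilization: actual and projected spend as a share of the budget.
actual_share = actual / budget        # about 73.7%
projected_share = projected / budget  # about 91.4%

# Savings are measured against what the spend would have been unoptimized.
total_savings = sum(savings.values())
unoptimized_spend = actual + total_savings
cost_reduction = total_savings / unoptimized_spend  # roughly 88.5%

# Each tier's share of the actual spend.
tier_share = {tier: cost / actual for tier, cost in tier_costs.items()}

print(f"actual {actual_share:.1%}, projected {projected_share:.1%}")
print(f"total savings ${total_savings:,}, reduction {cost_reduction:.2%}")
```

The tier costs sum to the actual spend, so T5 accounts for 40% of cost while serving only 1.8% of requests.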

Cost Attribution:

Dimension	View
By Tenant	Top spending tenants
By Agent	Cost per AI agent
By Model	Spend per model tier
By Feature	Cost by feature area
By Time	Hourly/daily/weekly trends

Budget Management:

Feature	Description
Budget Allocation	Set monthly/quarterly budgets
Threshold Alerts	Notify at 50%, 75%, 90%
Tenant Limits	Per-tenant cost caps
Auto-Throttling	Downgrade model tier when the budget is exceeded
Forecasting	ML-based spend projections
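A minimal sketch of the threshold-alert and auto-throttling checks described above, assuming alerts fire once per threshold and throttling starts when spend reaches the budget. The function names and the "already notified" bookkeeping are hypothetical.

```python
ALERT_THRESHOLDS = (0.50, 0.75, 0.90)  # from the Threshold Alerts row


def crossed_thresholds(spend, budget, already_notified=frozenset()):
    """Return the thresholds reached by `spend` that have not been notified yet."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    used = spend / budget
    return [t for t in ALERT_THRESHOLDS if used >= t and t not in already_notified]


def should_throttle(spend, budget):
    """Assumption: auto-throttling begins once spend reaches the budget."""
    return spend >= budget
```

With the example month from the cost panel, `crossed_thresholds(18_420, 25_000)` reports only the 50% threshold, since 73.7% is still below 75%.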

6. Incidents

Safety and operational incident management.

Feature	Description
Active Incidents	Current issues requiring attention
Incident History	Past incidents and resolutions
Safety Violations	AI content/behavior issues
Escalation Queue	Items needing manager review

7. Tenant Operations

Multi-tenant management and support.

Feature	Description
Tenant List	All tenants with health status
Tenant Details	Configuration, usage, billing
Support Actions	Impersonation, config changes
Onboarding	New tenant setup wizard

Human-in-the-Loop (HITL)

Approval Queue

AI agents can request human approval for high-stakes actions:

Agent	Approval Triggers
Maximus	Transactions >$500, refunds >$100
Support Agent	Account changes, data deletion
Scheduling Agent	Overtime approval, shift cancellations
Marketing Agent	Campaign budget >$1000, content publish
Voice Agent	Complex complaints, legal requests
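The trigger table above can be read as a small rule set: some actions always need approval, others only above an amount. The sketch below encodes it that way. The agent and action identifiers are made up for the example, and failing closed when an amount is missing is a design assumption, not documented behavior.

```python
from decimal import Decimal

# Amount-based triggers: approval is required strictly above the limit.
AMOUNT_TRIGGERS = {
    ("maximus", "transaction"): Decimal("500"),
    ("maximus", "refund"): Decimal("100"),
    ("marketing", "campaign_budget"): Decimal("1000"),
}

# Actions that always require approval, regardless of amount.
ALWAYS_TRIGGERS = {
    ("support", "account_change"),
    ("support", "data_deletion"),
    ("scheduling", "overtime_approval"),
    ("scheduling", "shift_cancellation"),
    ("marketing", "content_publish"),
    ("voice", "complex_complaint"),
    ("voice", "legal_request"),
}


def requires_approval(agent, action, amount=None):
    """Return True when the action should be routed to the approval queue."""
    key = (agent, action)
    if key in ALWAYS_TRIGGERS:
        return True
    limit = AMOUNT_TRIGGERS.get(key)
    if limit is None:
        return False  # no trigger defined for this agent/action pair
    if amount is None:
        return True   # fail closed when the amount is unknown (assumption)
    return Decimal(str(amount)) > limit
```

For example, a Maximus refund of exactly $100 passes without approval, while $100.01 is queued.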

Approval Workflow

Approval Actions

Action	Description
Approve	Execute the action as requested
Approve with Note	Execute with comment for audit
Modify & Approve	Change parameters before execution
Reject	Deny the action with reason
Escalate	Send to senior operator

ACP Agent Registry

Manage AI agent definitions, tool permissions, and safety constraints.

Agent Registry

┌─────────────────────────────────────────────────────────────────┐
│  ACP AGENT REGISTRY                                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  REGISTERED AGENTS                                              │
│  ───────────────────────────────────────────────────────────    │
│                                                                  │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ 🤖 MAXIMUS (Customer AI)                    Status: Active  │
│  │    ID: agent-maximus-v3                                  │   │
│  │    Tools: 12 enabled    HITL Triggers: 4                │   │
│  │    Safety: Level 2      Max Tier: T5                    │   │
│  │    [Configure] [View Logs] [Disable]                    │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                  │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ 🎧 SUPPORT AGENT                            Status: Active  │
│  │    ID: agent-support-v2                                  │   │
│  │    Tools: 8 enabled     HITL Triggers: 6                │   │
│  │    Safety: Level 3      Max Tier: T4                    │   │
│  │    [Configure] [View Logs] [Disable]                    │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Tool Permissions

Define which tools each agent can use:

Tool Category	Example Tools	Risk Level
Read-Only	Query database, view orders	Low
Write	Update records, create orders	Medium
Financial	Process refunds, adjust pricing	High
Administrative	Modify settings, manage users	Critical
External	Send emails, make API calls	Medium

Permission Matrix:

Tool	Maximus	Support	Scheduling	Marketing
Query orders	✓	✓	✓	✓
Update orders	✓	✓	-	-
Process refunds	HITL	HITL	-	-
View schedule	✓	✓	✓	✓
Modify schedule	-	-	HITL	-
Send marketing	-	-	-	HITL
Access PII	Limited	✓	Limited	-
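For illustration, the permission matrix above can be represented as a lookup table with a deny-by-default rule. The tool and agent keys, the decision labels, and the `decide` function are hypothetical names for this sketch.

```python
# Decisions mirror the matrix: "allow" (✓), "hitl", "limited", and "deny" (-).
PERMISSIONS = {
    "query_orders":    {"maximus": "allow", "support": "allow", "scheduling": "allow", "marketing": "allow"},
    "update_orders":   {"maximus": "allow", "support": "allow"},
    "process_refunds": {"maximus": "hitl", "support": "hitl"},
    "view_schedule":   {"maximus": "allow", "support": "allow", "scheduling": "allow", "marketing": "allow"},
    "modify_schedule": {"scheduling": "hitl"},
    "send_marketing":  {"marketing": "hitl"},
    "access_pii":      {"maximus": "limited", "support": "allow", "scheduling": "limited"},
}


def decide(agent, tool):
    """Return the decision for an agent/tool pair; anything unlisted is denied."""
    return PERMISSIONS.get(tool, {}).get(agent, "deny")
```

Denying anything not explicitly listed means a newly registered tool is unavailable to every agent until someone grants it.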

Agent Configuration

Setting	Description	Default
Max Model Tier	Highest tier agent can use	T4
Safety Level	Content filtering strictness	Level 2
HITL Threshold	When to require human approval	Configurable
Rate Limit	Max requests per minute	1000
Context Window	Max conversation length	100K tokens
Timeout	Request timeout	30s
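The defaults in the table above could be modeled as a configuration object, with the Max Model Tier setting acting as a ceiling on requested tiers. This is a sketch under that assumption; the class, field names, and `clamp_tier` helper are illustrative and not part of any documented interface.

```python
from dataclasses import dataclass

TIERS = ("T1", "T2", "T3", "T4", "T5", "T6")  # lowest to highest


@dataclass(frozen=True)
class AgentConfig:
    """Defaults taken from the Agent Configuration table."""
    max_model_tier: str = "T4"
    safety_level: int = 2
    rate_limit_per_minute: int = 1000
    context_window_tokens: int = 100_000
    timeout_seconds: float = 30.0


def clamp_tier(requested: str, config: AgentConfig) -> str:
    """Return the requested tier, capped at the agent's maximum tier."""
    index = min(TIERS.index(requested), TIERS.index(config.max_model_tier))
    return TIERS[index]
```

With the defaults, a request for T6 is served by T4, while an agent configured with `max_model_tier="T5"` (as Maximus is in the registry panel) can go one tier higher.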

AI Safety Controls

Monitoring

Control	Description
Content Filter	Detect harmful/inappropriate content
Jailbreak Detection	Identify prompt injection attempts
PII Scanner	Flag personal data exposure
Bias Monitor	Track response fairness metrics
Hallucination Detection	Identify factually incorrect responses
Tone Analysis	Monitor response appropriateness

Safety Dashboard

┌─────────────────────────────────────────────────────────────────┐
│  AI SAFETY DASHBOARD                               Last 24hrs   │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  SAFETY METRICS                                                 │
│  ───────────────────────────────────────────────────────────    │
│  Content Filter Triggers:    42        (0.02% of requests)     │
│  Jailbreak Attempts:         8         (blocked)               │
│  PII Detections:             156       (masked)                │
│  Bias Alerts:                3         (under review)          │
│  Hallucination Flags:        28        (reviewed)              │
│                                                                  │
│  RECENT INCIDENTS                                               │
│  ───────────────────────────────────────────────────────────    │
│  🔴 14:32 - Jailbreak attempt detected (blocked)               │
│  🟡 12:45 - Potential bias in response (under review)          │
│  🟢 10:15 - PII masked in Support response (auto-handled)      │
│                                                                  │
│  SAFETY SCORE: 98.5%                                           │
│  ████████████████████████████████████████████████░░            │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Response Actions

Severity	Response Time	Action
Critical	Immediate	Auto-block, page on-call
High	15 minutes	Alert operators, queue for review
Medium	1 hour	Flag for batch review
Low	24 hours	Log for analysis
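The response times above translate directly into deadlines. The following sketch computes when an incident becomes overdue; treating "Immediate" as a zero-length window is an assumption, as are the function names.

```python
from datetime import datetime, timedelta, timezone

RESPONSE_WINDOWS = {
    "critical": timedelta(0),        # "Immediate", modeled as no delay
    "high": timedelta(minutes=15),
    "medium": timedelta(hours=1),
    "low": timedelta(hours=24),
}


def response_deadline(severity: str, detected_at: datetime) -> datetime:
    """Return the time by which an incident of this severity needs a response."""
    return detected_at + RESPONSE_WINDOWS[severity.lower()]


def is_overdue(severity: str, detected_at: datetime, now: datetime) -> bool:
    """Return True once `now` is past the response deadline."""
    return now > response_deadline(severity, detected_at)


detected = datetime(2025, 1, 1, 14, 32, tzinfo=timezone.utc)
print(response_deadline("High", detected).isoformat())
```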

Kill Switch

Danger: Activating the kill switch immediately disables the agent across all tenants and automatically creates an incident. This is a last-resort action for safety-critical situations. Use "Safe Mode" or "Tenant-Specific" disable for less severe issues.

Emergency agent disable:

1. Go to Cockpit > Agents > Select Agent
2. Click Kill Switch (red button)
3. Confirm action
4. Agent immediately disabled across all tenants
5. Fallback behavior activated
6. Incident automatically created

Kill Switch Options:

Option	Effect
Full Disable	Agent completely offline
Safe Mode	Agent responds with limited capability
Fallback	Redirect to backup agent
Tenant-Specific	Disable for specific tenants only

Keyboard Shortcuts

Shortcut	Action
`G D`	Go to Dashboard
`G A`	Go to Approvals
`G I`	Go to Incidents
`A`	Approve selected item
`R`	Reject selected item
`E`	Escalate selected item
`/`	Search
`?`	Show all shortcuts

API Access

Cockpit data is available via the internal API:

# Get pending approvals
GET /api/internal/cockpit/approvals?status=pending

# Approve an item
POST /api/internal/cockpit/approvals/{id}/approve
{
  "note": "Approved per policy",
  "operator_id": "uuid"
}

# Get AI metrics
GET /api/internal/cockpit/metrics/ai?period=24h
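As a sketch of how a script might call these endpoints, the helpers below build the three requests with Python's standard library. They only construct the requests and do not send them. `BASE_URL` is a placeholder, the function names are hypothetical, and authentication is omitted because this document does not describe it.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request

BASE_URL = "https://cockpit.example.internal"  # placeholder host (assumption)
PREFIX = f"{BASE_URL}/api/internal/cockpit"


def pending_approvals_request() -> Request:
    """Build the GET request for pending approvals."""
    return Request(f"{PREFIX}/approvals?{urlencode({'status': 'pending'})}", method="GET")


def approve_request(approval_id: str, operator_id: str, note: str) -> Request:
    """Build the POST request that approves one item with an audit note."""
    body = json.dumps({"note": note, "operator_id": operator_id}).encode("utf-8")
    return Request(
        f"{PREFIX}/approvals/{quote(approval_id, safe='')}/approve",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ai_metrics_request(period: str = "24h") -> Request:
    """Build the GET request for AI metrics over the given period."""
    return Request(f"{PREFIX}/metrics/ai?{urlencode({'period': period})}", method="GET")
```

A built request can be passed to `urllib.request.urlopen` once the real host and whatever credentials the internal API requires have been added.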

Best Practices

Tip: Aim for a response time under 15 minutes on HITL approval requests. Stale approvals block AI agent workflows and degrade the experience for tenants waiting on agent actions.

Check approvals frequently - Aim for under 15 min response time
Document rejections - Always provide clear reasons
Monitor AI costs daily - Catch anomalies early
Review incidents promptly - Prioritize safety issues
Use keyboard shortcuts - Improve efficiency

Troubleshooting

Approval Queue Stuck

Symptom: Items not appearing in queue

Solution:

Check agent health in AI Agent Monitoring
Verify the WebSocket connection
Refresh the page
Check whether a filter is hiding items in the queue

Cost Data Delayed

Symptom: Cost metrics showing old data

Solution:

Wait for the next refresh (costs update every 5 minutes)
Check the ClickHouse pipeline status
Verify tenant attribution
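Since costs update every 5 minutes, a simple staleness check can tell normal refresh lag apart from a stalled pipeline. This is a sketch only; the grace factor of two missed intervals and the function name are assumptions.

```python
from datetime import datetime, timedelta, timezone

UPDATE_INTERVAL = timedelta(minutes=5)  # stated refresh interval


def cost_data_is_stale(last_updated: datetime, now: datetime,
                       missed_intervals: int = 2) -> bool:
    """Return True when more than `missed_intervals` refreshes have been missed."""
    return now - last_updated > UPDATE_INTERVAL * missed_intervals


now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
print(cost_data_is_stale(now - timedelta(minutes=4), now))
print(cost_data_is_stale(now - timedelta(minutes=25), now))
```

Data that is only a few minutes old is expected behavior; data older than the grace window points toward the pipeline or attribution checks listed above.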
