Cockpit Operations Center

The Cockpit is NebusAI's internal operations center for monitoring AI agents, approving human-in-the-loop (HITL) actions, managing ACP infrastructure, and ensuring AI safety.

Overview

Cockpit provides a unified interface for platform operators to:

  • Monitor AI agent health and performance
  • Approve or reject AI-initiated actions requiring human oversight
  • Manage tenant operations and configurations
  • Track AI costs and model usage
  • Respond to incidents and safety alerts

Access

Cockpit is available at https://cockpit.olympuscloud.ai and requires NebusAI employee credentials with the cockpit_operator role.

Dashboard Tabs

1. Dashboard

The main overview showing system health at a glance.

| Widget | Description |
|---|---|
| System Health | Service status across all regions |
| Active Agents | AI agents currently running |
| Pending Approvals | HITL items awaiting action |
| Recent Incidents | Last 24h safety/performance issues |
| Cost Tracker | Real-time AI spend vs budget |

2. Gating Engine

Feature flag and policy management for the platform.

| Capability | Description |
|---|---|
| Feature Flags | Enable/disable features by tenant |
| Canary Deployments | Gradual rollout controls |
| A/B Tests | Experiment configuration |
| Kill Switches | Emergency feature disable |

3. AI Agent Monitoring

Real-time visibility into all AI agents across the platform.

┌─────────────────────────────────────────────────────────────────┐
│  AI AGENT MONITORING                               Live Status  │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  AGENT STATUS                                                   │
│  ───────────────────────────────────────────────────────────    │
│  🟢 Maximus (Customer AI)       Healthy     2,847 req/min       │
│  🟢 Support Agent               Healthy       428 req/min       │
│  🟢 Scheduling Agent            Healthy       156 req/min       │
│  🟡 Marketing Agent             Degraded       42 req/min ⚠     │
│  🟢 Voice Agent                 Healthy        89 req/min       │
│                                                                 │
│  PERFORMANCE (Last Hour)                                        │
│  ───────────────────────────────────────────────────────────    │
│  Total Requests:  218,420          Avg Latency:  124ms          │
│  Success Rate:    99.8%            Cache Hit:    78.2%          │
│  Tokens Used:     12.4M            Est. Cost:    $142.80        │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Agent Details View:

| Metric | Description |
|---|---|
| Health Status | Green/Yellow/Red based on error rate and latency |
| Request Rate | Requests per minute |
| Latency | P50, P95, P99 response times |
| Error Rate | Percentage of failed requests |
| Token Usage | Input/output tokens consumed |
| Model Distribution | Which model tiers are being used |
| Active Sessions | Concurrent user sessions |
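
The Health Status row says the color is derived from error rate and latency but does not give the cut-offs. The following is a minimal sketch of how such a classification could work; every threshold value and name in it is a hypothetical placeholder, not Cockpit's actual rule.

```python
from dataclasses import dataclass

# Hypothetical thresholds for illustration only; the real cut-offs are not documented here.
YELLOW_ERROR_RATE = 0.01   # 1% of requests failing
RED_ERROR_RATE = 0.05      # 5% of requests failing
YELLOW_P95_MS = 1_000
RED_P95_MS = 5_000


@dataclass(frozen=True)
class AgentSample:
    error_rate: float        # fraction of failed requests, 0.0 to 1.0
    p95_latency_ms: float    # 95th percentile response time in milliseconds


def health_status(sample: AgentSample) -> str:
    """Return 'red', 'yellow', or 'green'; the worse of the two signals wins."""
    if sample.error_rate >= RED_ERROR_RATE or sample.p95_latency_ms >= RED_P95_MS:
        return "red"
    if sample.error_rate >= YELLOW_ERROR_RATE or sample.p95_latency_ms >= YELLOW_P95_MS:
        return "yellow"
    return "green"
```

Taking the worse of the two signals means a fast agent with a high error rate still shows as degraded, which matches the intent of a status light that operators scan quickly.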

Agent Configuration:

| Setting | Purpose |
|---|---|
| Max Concurrency | Limit simultaneous requests |
| Timeout | Request timeout threshold |
| Fallback Tier | Model to use on failure |
| Rate Limits | Per-tenant request limits |
| Kill Switch | Emergency disable |

4. AI Metrics

Comprehensive AI performance monitoring.

| Metric Category | Details |
|---|---|
| Request Volume | Queries per second by tier |
| Latency | P50, P95, P99 by model |
| Cost | Real-time and projected spend |
| Cache Performance | Hit rate, savings |
| Error Rate | Failures by model/tier |

Model Tier Performance:

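| Tier | Model | Avg Latency | Error Rate | Cache Hit |
|---|---|---|---|---|
| T1 | Llama 4 Scout | 45ms | 0.1% | 85% |
| T2 | Gemini 2.0 Flash | 180ms | 0.2% | 72% |
| T3 | Gemini 3 Flash | 320ms | 0.3% | 68% |
| T4 | Claude Haiku 4.5 | 450ms | 0.2% | 65% |
| T5 | Claude Sonnet 4.5 | 1,200ms | 0.4% | 52% |
| T6 | Claude Opus 4.5 | 2,400ms | 0.5% | 45% |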

5. AI Cost Analytics

Financial tracking for AI operations, with cost optimization of up to 95%+ through smart routing, caching, and batching (88.5% in the sample month below).

┌─────────────────────────────────────────────────────────────────┐
│  AI COST ANALYTICS                                  This Month  │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  SPEND OVERVIEW                                                 │
│  ───────────────────────────────────────────────────────────    │
│  Budget:     $25,000                                            │
│  Actual:     $18,420    73.7% of budget                         │
│  Projected:  $22,850    91.4% of budget                         │
│  Savings:    $142,580   (vs no optimization)                    │
│                                                                 │
│  ████████████████████████████████░░░░░░░░░░░░░░░░░░             │
│  $0                    $12.5K      $18.4K      $25K Budget      │
│                                                                 │
│  COST BY MODEL TIER                                             │
│  ───────────────────────────────────────────────────────────    │
│  T1 Llama Scout        $0    0%   (62% of requests)             │
│  T2 Gemini Flash   $1,842   10%   (24% of requests)             │
│  T3 Gemini 3       $2,763   15%   (8% of requests)              │
│  T4 Claude Haiku   $4,605   25%   (4% of requests)              │
│  T5 Claude Sonnet  $7,368   40%   (1.8% of requests)            │
│  T6 Claude Opus    $1,842   10%   (0.2% of requests)            │
│                                                                 │
│  COST SAVINGS                                                   │
│  ───────────────────────────────────────────────────────────    │
│  Smart Routing:   -$98,420   (T1 handles 62% of requests)       │
│  Caching:         -$32,840   (78% cache hit rate)               │
│  Batching:        -$11,320   (request batching)                 │
│  ────────────────────────────────                               │
│  Total Savings:  -$142,580   (88.5% cost reduction)             │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
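
The figures in the sample dashboard can be cross-checked with a few lines of arithmetic. The sketch below assumes that "cost reduction" means savings divided by the unoptimized spend (actual spend plus savings); that definition is an inference from the numbers, not something the dashboard states.

```python
# Values copied from the sample dashboard above.
budget = 25_000
actual_spend = 18_420
savings = {"smart_routing": 98_420, "caching": 32_840, "batching": 11_320}

total_savings = sum(savings.values())        # 142,580, matching "Total Savings"
unoptimized = actual_spend + total_savings   # 161,000: assumed spend with no optimization

# Assumed definition: reduction = savings / unoptimized spend.
reduction_pct = 100 * total_savings / unoptimized
budget_used_pct = 100 * actual_spend / budget

print(f"cost reduction: {reduction_pct:.2f}%")
print(f"budget used:    {budget_used_pct:.2f}%")
```

Under that assumption the reduction works out to roughly 88.56%, which the dashboard displays as 88.5%, and budget use to 73.68%, displayed as 73.7%.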

Cost Attribution:

| Dimension | View |
|---|---|
| By Tenant | Top spending tenants |
| By Agent | Cost per AI agent |
| By Model | Spend per model tier |
| By Feature | Cost by feature area |
| By Time | Hourly/daily/weekly trends |

Budget Management:

| Feature | Description |
|---|---|
| Budget Allocation | Set monthly/quarterly budgets |
| Threshold Alerts | Notify at 50%, 75%, and 90% of budget |
| Tenant Limits | Per-tenant cost caps |
| Auto-Throttling | Reduce model tier when the budget is exceeded |
| Forecasting | ML-based spend projections |
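
As an illustration of the Threshold Alerts and Auto-Throttling rows, the sketch below reports which alert thresholds were crossed between two spend readings. The function names are hypothetical, and firing each threshold once on the reading that crosses it is an assumed design rather than documented Cockpit behavior.

```python
ALERT_THRESHOLDS = (0.50, 0.75, 0.90)  # from the Threshold Alerts row


def crossed_thresholds(previous_spend: float, current_spend: float, budget: float) -> list[float]:
    """Return the thresholds crossed between two readings, so each alert fires once."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    prev_ratio = previous_spend / budget
    curr_ratio = current_spend / budget
    return [t for t in ALERT_THRESHOLDS if prev_ratio < t <= curr_ratio]


def should_throttle(current_spend: float, budget: float) -> bool:
    """Auto-throttling is described as applying once the budget is exceeded."""
    return current_spend > budget
```

With the sample month's figures, moving from $18,420 of actual spend to the projected $22,850 against a $25,000 budget would cross both the 75% and 90% thresholds.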

6. Incidents

Safety and operational incident management.

| Feature | Description |
|---|---|
| Active Incidents | Current issues requiring attention |
| Incident History | Past incidents and resolutions |
| Safety Violations | AI content/behavior issues |
| Escalation Queue | Items needing manager review |

7. Tenant Operations

Multi-tenant management and support.

| Feature | Description |
|---|---|
| Tenant List | All tenants with health status |
| Tenant Details | Configuration, usage, billing |
| Support Actions | Impersonation, config changes |
| Onboarding | New tenant setup wizard |

Human-in-the-Loop (HITL)

Approval Queue

AI agents can request human approval for high-stakes actions:

| Agent | Approval Triggers |
|---|---|
| Maximus | Transactions >$500, refunds >$100 |
| Support Agent | Account changes, data deletion |
| Scheduling Agent | Overtime approval, shift cancellations |
| Marketing Agent | Campaign budget >$1,000, content publishing |
| Voice Agent | Complex complaints, legal requests |
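
A trigger table like this can be read as a small rule set. The sketch below encodes a few of the rows; the agent and action identifiers are hypothetical names chosen for illustration, and only the dollar limits come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProposedAction:
    agent: str               # hypothetical identifier, e.g. "maximus"
    kind: str                # hypothetical identifier, e.g. "refund"
    amount_usd: float = 0.0


# Dollar limits taken from the table; an action above the limit needs approval.
AMOUNT_LIMITS = {
    ("maximus", "transaction"): 500,
    ("maximus", "refund"): 100,
    ("marketing", "campaign_budget"): 1000,
}

# Triggers the table lists without a dollar threshold: always sent to a human.
ALWAYS_REVIEW = {
    ("support", "account_change"),
    ("support", "data_deletion"),
    ("marketing", "content_publish"),
}


def requires_approval(action: ProposedAction) -> bool:
    """Return True when the action should be placed in the HITL approval queue."""
    key = (action.agent, action.kind)
    if key in ALWAYS_REVIEW:
        return True
    limit = AMOUNT_LIMITS.get(key)
    return limit is not None and action.amount_usd > limit
```

The comparison is strictly greater-than, following the ">" in the table, so a refund of exactly $100 would not be queued under this reading.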

Approval Workflow

Approval Actions

| Action | Description |
|---|---|
| Approve | Execute the action as requested |
| Approve with Note | Execute with comment for audit |
| Modify & Approve | Change parameters before execution |
| Reject | Deny the action with reason |
| Escalate | Send to senior operator |

ACP Agent Registry

Manage AI agent definitions, tool permissions, and safety constraints.

Agent Registry

┌─────────────────────────────────────────────────────────────────┐
│  ACP AGENT REGISTRY                                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  REGISTERED AGENTS                                              │
│  ───────────────────────────────────────────────────────────    │
│                                                                 │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │ 🤖 MAXIMUS (Customer AI)                 Status: Active │   │
│   │ ID: agent-maximus-v3                                    │   │
│   │ Tools: 12 enabled            HITL Triggers: 4           │   │
│   │ Safety: Level 2              Max Tier: T5               │   │
│   │ [Configure]  [View Logs]  [Disable]                     │   │
│   └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │ 🎧 SUPPORT AGENT                         Status: Active │   │
│   │ ID: agent-support-v2                                    │   │
│   │ Tools: 8 enabled             HITL Triggers: 6           │   │
│   │ Safety: Level 3              Max Tier: T4               │   │
│   │ [Configure]  [View Logs]  [Disable]                     │   │
│   └─────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Tool Permissions

Define which tools each agent can use:

| Tool Category | Example Tools | Risk Level |
|---|---|---|
| Read-Only | Query database, view orders | Low |
| Write | Update records, create orders | Medium |
| Financial | Process refunds, adjust pricing | High |
| Administrative | Modify settings, manage users | Critical |
| External | Send emails, make API calls | Medium |

Permission Matrix:

ToolMaximusSupportSchedulingMarketing
Query orders
Update orders--
Process refundsHITLHITL--
View schedule
Modify schedule--HITL-
Send marketing---HITL
Access PIILimitedLimited-

Agent Configuration

| Setting | Description | Default |
|---|---|---|
| Max Model Tier | Highest tier agent can use | T4 |
| Safety Level | Content filtering strictness | Level 2 |
| HITL Threshold | When to require human approval | Configurable |
| Rate Limit | Max requests per minute | 1000 |
| Context Window | Max conversation length | 100K tokens |
| Timeout | Request timeout | 30s |
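
The defaults in this table could be modeled as a small configuration object. This is a sketch under assumptions: the class, field, and function names are hypothetical, and the HITL threshold is left out because the table gives it no default value.

```python
from dataclasses import dataclass

TIERS = ("T1", "T2", "T3", "T4", "T5", "T6")  # ordered lowest to highest tier


@dataclass(frozen=True)
class AgentConfig:
    # Defaults copied from the table; field names are illustrative only.
    max_model_tier: str = "T4"
    safety_level: int = 2
    rate_limit_per_min: int = 1000
    context_window_tokens: int = 100_000
    timeout_seconds: float = 30.0

    def __post_init__(self) -> None:
        if self.max_model_tier not in TIERS:
            raise ValueError(f"unknown tier: {self.max_model_tier}")


def tier_allowed(config: AgentConfig, requested_tier: str) -> bool:
    """Return True when the requested tier does not exceed the agent's maximum."""
    return TIERS.index(requested_tier) <= TIERS.index(config.max_model_tier)
```

For example, an agent on the defaults could use T1 through T4, while an agent registered with a Max Tier of T5 (as Maximus is in the registry view) could also use T5.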

AI Safety Controls

Monitoring

| Control | Description |
|---|---|
| Content Filter | Detect harmful/inappropriate content |
| Jailbreak Detection | Identify prompt injection attempts |
| PII Scanner | Flag personal data exposure |
| Bias Monitor | Track response fairness metrics |
| Hallucination Detection | Identify factually incorrect responses |
| Tone Analysis | Monitor response appropriateness |

Safety Dashboard

┌─────────────────────────────────────────────────────────────────┐
│  AI SAFETY DASHBOARD                                Last 24hrs  │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  SAFETY METRICS                                                 │
│  ───────────────────────────────────────────────────────────    │
│  Content Filter Triggers:   42   (0.02% of requests)            │
│  Jailbreak Attempts:         8   (blocked)                      │
│  PII Detections:           156   (masked)                       │
│  Bias Alerts:                3   (under review)                 │
│  Hallucination Flags:       28   (reviewed)                     │
│                                                                 │
│  RECENT INCIDENTS                                               │
│  ───────────────────────────────────────────────────────────    │
│  🔴 14:32 - Jailbreak attempt detected (blocked)                │
│  🟡 12:45 - Potential bias in response (under review)           │
│  🟢 10:15 - PII masked in Support response (auto-handled)       │
│                                                                 │
│  SAFETY SCORE: 98.5%                                            │
│  ████████████████████████████████████████████████░░             │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Response Actions

| Severity | Response Time | Action |
|---|---|---|
| Critical | Immediate | Auto-block, page on-call |
| High | 15 minutes | Alert operators, queue for review |
| Medium | 1 hour | Flag for batch review |
| Low | 24 hours | Log for analysis |
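
The severity table maps naturally onto a lookup that yields a response deadline for each alert. The sketch below is illustrative only: the function and dictionary names are hypothetical, and treating "Immediate" as a zero-length window is an assumption.

```python
from datetime import datetime, timedelta

# Response windows and actions copied from the table above.
RESPONSE_POLICY = {
    "critical": (timedelta(0), "Auto-block, page on-call"),
    "high": (timedelta(minutes=15), "Alert operators, queue for review"),
    "medium": (timedelta(hours=1), "Flag for batch review"),
    "low": (timedelta(hours=24), "Log for analysis"),
}


def response_deadline(severity: str, detected_at: datetime) -> datetime:
    """Return the time by which an alert of this severity should be handled."""
    window, _action = RESPONSE_POLICY[severity.lower()]
    return detected_at + window


def response_action(severity: str) -> str:
    """Return the documented action for a severity level."""
    return RESPONSE_POLICY[severity.lower()][1]
```

An unknown severity raises `KeyError` here, which seems preferable to silently treating an unclassified alert as low priority.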

Kill Switch

Danger: Activating the kill switch immediately disables the agent across all tenants and automatically creates an incident. This is a last-resort action for safety-critical situations. Use "Safe Mode" or "Tenant-Specific" disable for less severe issues.

Emergency agent disable:

  1. Go to Cockpit > Agents > Select Agent
  2. Click Kill Switch (red button)
  3. Confirm action
  4. Agent immediately disabled across all tenants
  5. Fallback behavior activated
  6. Incident automatically created

Kill Switch Options:

| Option | Effect |
|---|---|
| Full Disable | Agent completely offline |
| Safe Mode | Agent responds with limited capability |
| Fallback | Redirect to backup agent |
| Tenant-Specific | Disable for specific tenants only |

Keyboard Shortcuts

| Shortcut | Action |
|---|---|
| G D | Go to Dashboard |
| G A | Go to Approvals |
| G I | Go to Incidents |
| A | Approve selected item |
| R | Reject selected item |
| E | Escalate selected item |
| / | Search |
| ? | Show all shortcuts |

API Access

Cockpit data is available via the internal API:

# Get pending approvals
GET /api/internal/cockpit/approvals?status=pending

# Approve an item
POST /api/internal/cockpit/approvals/{id}/approve
{
  "note": "Approved per policy",
  "operator_id": "uuid"
}

# Get AI metrics
GET /api/internal/cockpit/metrics/ai?period=24h
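
As a rough sketch of how these endpoints might be called from a script, the following builds the requests with Python's standard library. The base URL is an assumption (the page gives only the paths), the helper names are hypothetical, and authentication is omitted because it is not described here.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request

# Assumption: the internal API is served from the Cockpit host named under Access.
BASE_URL = "https://cockpit.olympuscloud.ai"


def pending_approvals_request() -> Request:
    """Build the GET request for pending approvals."""
    query = urlencode({"status": "pending"})
    return Request(f"{BASE_URL}/api/internal/cockpit/approvals?{query}", method="GET")


def approve_request(approval_id: str, operator_id: str, note: str) -> Request:
    """Build the POST request that approves one item, with the documented JSON body."""
    body = json.dumps({"note": note, "operator_id": operator_id}).encode("utf-8")
    path = f"/api/internal/cockpit/approvals/{quote(approval_id, safe='')}/approve"
    return Request(
        BASE_URL + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ai_metrics_request(period: str = "24h") -> Request:
    """Build the GET request for AI metrics over a period."""
    query = urlencode({"period": period})
    return Request(f"{BASE_URL}/api/internal/cockpit/metrics/ai?{query}", method="GET")
```

Each helper only constructs a `Request`; passing it to `urllib.request.urlopen` would send it. Keeping construction separate from sending makes the URLs and payloads easy to inspect before anything reaches the API.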

Best Practices

Tip: Aim for less than 15 minutes response time on HITL approval requests. Stale approvals block AI agent workflows and degrade the user experience for tenants waiting on agent actions.

  1. Check approvals frequently - Aim for under 15 min response time
  2. Document rejections - Always provide clear reasons
  3. Monitor AI costs daily - Catch anomalies early
  4. Review incidents promptly - Prioritize safety issues
  5. Use keyboard shortcuts - Improve efficiency

Troubleshooting

Approval Queue Stuck

Symptom: Items not appearing in queue

Solution:

  1. Check agent health in AI Agent Monitoring
  2. Verify WebSocket connection
  3. Refresh the page
  4. Check whether any filters are applied to the queue

Cost Data Delayed

Symptom: Cost metrics showing old data

Solution:

  1. Wait for the next refresh; costs update every 5 minutes
  2. Check ClickHouse pipeline status
  3. Verify tenant attribution