Cockpit Operations Center
The Cockpit is NebusAI's internal operations center for monitoring AI agents, approving human-in-the-loop (HITL) actions, managing ACP infrastructure, and ensuring AI safety.
Overview
Cockpit provides a unified interface for platform operators to:
- Monitor AI agent health and performance
- Approve or reject AI-initiated actions requiring human oversight
- Manage tenant operations and configurations
- Track AI costs and model usage
- Respond to incidents and safety alerts
Access
Cockpit is available at https://cockpit.olympuscloud.ai and requires NebusAI employee credentials with the cockpit_operator role.
Dashboard Tabs
1. Dashboard
The main overview showing system health at a glance.
| Widget | Description |
|---|---|
| System Health | Service status across all regions |
| Active Agents | AI agents currently running |
| Pending Approvals | HITL items awaiting action |
| Recent Incidents | Last 24h safety/performance issues |
| Cost Tracker | Real-time AI spend vs budget |
2. Gating Engine
Feature flag and policy management for the platform.
| Capability | Description |
|---|---|
| Feature Flags | Enable/disable features by tenant |
| Canary Deployments | Gradual rollout controls |
| A/B Tests | Experiment configuration |
| Kill Switches | Emergency feature disable |
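
A gradual rollout is often implemented by hashing each tenant into a stable bucket and enabling the feature for buckets below the rollout percentage. The following is a minimal sketch of that general technique, not the Gating Engine's actual implementation; the function names, the flag name, and the hashing scheme are all assumptions made for illustration.

```python
import hashlib


def in_canary(tenant_id: str, flag: str, rollout_percent: int) -> bool:
    """Deterministically place a tenant in or out of a gradual rollout."""
    # Hash flag and tenant together so each flag gets an independent bucketing.
    digest = hashlib.sha256(f"{flag}:{tenant_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in 0..99
    return bucket < rollout_percent


def feature_enabled(tenant_id: str, flag: str, rollout_percent: int,
                    killed: bool = False) -> bool:
    """A kill switch (hypothetical `killed` flag) overrides the rollout."""
    return not killed and in_canary(tenant_id, flag, rollout_percent)


# The same tenant always gets the same answer for a given flag and percentage.
print(feature_enabled("tenant-123", "new-checkout", 25))
```

Because the bucket depends only on the flag and tenant ID, raising the percentage adds tenants to the rollout without removing any that were already included.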
3. AI Agent Monitoring
Real-time visibility into all AI agents across the platform.
┌─────────────────────────────────────────────────────────────────┐
│ AI AGENT MONITORING Live Status │
├─────────────────────────────────────────────────────────────────┤
│ │
│ AGENT STATUS │
│ ─────────────────────────────────────────────────────────── │
│ 🟢 Maximus (Customer AI) Healthy 2,847 req/min │
│ 🟢 Support Agent Healthy 428 req/min │
│ 🟢 Scheduling Agent Healthy 156 req/min │
│ 🟡 Marketing Agent Degraded 42 req/min ⚠ │
│ 🟢 Voice Agent Healthy 89 req/min │
│ │
│ PERFORMANCE (Last Hour) │
│ ─────────────────────────────────────────────────────────── │
│ Total Requests: 218,420 Avg Latency: 124ms │
│ Success Rate: 99.8% Cache Hit: 78.2% │
│ Tokens Used: 12.4M Est. Cost: $142.80 │
│ │
└─────────────────────────────────────────────────────────────────┘
Agent Details View:
| Metric | Description |
|---|---|
| Health Status | Green/Yellow/Red based on error rate and latency |
| Request Rate | Requests per minute |
| Latency | P50, P95, P99 response times |
| Error Rate | Percentage of failed requests |
| Token Usage | Input/output tokens consumed |
| Model Distribution | Which model tiers are being used |
| Active Sessions | Concurrent user sessions |
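
The Health Status row says the color is derived from error rate and latency, but this page does not give the cutoffs. The sketch below shows one way such a classification could work; every threshold value and name in it is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Hypothetical thresholds; the real Cockpit values are not documented here.
YELLOW_ERROR_RATE = 0.01   # 1% failed requests
RED_ERROR_RATE = 0.05      # 5% failed requests
YELLOW_P95_MS = 1_000
RED_P95_MS = 5_000


@dataclass
class AgentSample:
    error_rate: float      # fraction of failed requests, 0.0 to 1.0
    p95_latency_ms: float  # P95 response time in milliseconds


def health_status(sample: AgentSample) -> str:
    """Return "red", "yellow", or "green" from error rate and latency."""
    if sample.error_rate >= RED_ERROR_RATE or sample.p95_latency_ms >= RED_P95_MS:
        return "red"
    if sample.error_rate >= YELLOW_ERROR_RATE or sample.p95_latency_ms >= YELLOW_P95_MS:
        return "yellow"
    return "green"


print(health_status(AgentSample(error_rate=0.002, p95_latency_ms=320)))
print(health_status(AgentSample(error_rate=0.02, p95_latency_ms=320)))
```

The worse of the two signals decides the color, so an agent with low latency but an elevated error rate is still reported as degraded.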
Agent Configuration:
| Setting | Purpose |
|---|---|
| Max Concurrency | Limit simultaneous requests |
| Timeout | Request timeout threshold |
| Fallback Tier | Model to use on failure |
| Rate Limits | Per-tenant request limits |
| Kill Switch | Emergency disable |
4. AI Metrics
Comprehensive AI performance monitoring.
| Metric Category | Details |
|---|---|
| Request Volume | Queries per second by tier |
| Latency | P50, P95, P99 by model |
| Cost | Real-time and projected spend |
| Cache Performance | Hit rate, savings |
| Error Rate | Failures by model/tier |
Model Tier Performance:
| Tier | Model | Avg Latency | Error Rate | Cache Hit |
|---|---|---|---|---|
| T1 | Llama 4 Scout | 45ms | 0.1% | 85% |
| T2 | Gemini 2.0 Flash | 180ms | 0.2% | 72% |
| T3 | Gemini 3 Flash | 320ms | 0.3% | 68% |
| T4 | Claude Haiku 4.5 | 450ms | 0.2% | 65% |
| T5 | Claude Sonnet 4.5 | 1,200ms | 0.4% | 52% |
| T6 | Claude Opus 4.5 | 2,400ms | 0.5% | 45% |
5. AI Cost Analytics
Financial tracking for AI operations, including the savings achieved through cost optimization (smart routing, caching, and batching).
┌─────────────────────────────────────────────────────────────────┐
│ AI COST ANALYTICS This Month │
├─────────────────────────────────────────────────────────────────┤
│ │
│ SPEND OVERVIEW │
│ ─────────────────────────────────────────────────────────── │
│ Budget: $25,000 │
│ Actual: $18,420 73.7% of budget │
│ Projected: $22,850 91.4% of budget │
│ Savings: $142,580 (vs no optimization) │
│ │
│ ████████████████████████████████░░░░░░░░░░░░░░░░░░ │
│ $0 $12.5K $18.4K $25K Budget │
│ │
│ COST BY MODEL TIER │
│ ─────────────────────────────────────────────────────────── │
│ T1 Llama Scout $0 0% (62% of requests) │
│ T2 Gemini Flash $1,842 10% (24% of requests) │
│ T3 Gemini 3 $2,763 15% (8% of requests) │
│ T4 Claude Haiku $4,605 25% (4% of requests) │
│ T5 Claude Sonnet $7,368 40% (1.8% of requests) │
│ T6 Claude Opus $1,842 10% (0.2% of requests) │
│ │
│ COST SAVINGS │
│ ─────────────────────────────────────────────────────────── │
│ Smart Routing: -$98,420 (T1 handles 62% of requests) │
│ Caching: -$32,840 (78% cache hit rate) │
│ Batching: -$11,320 (request batching) │
│ ──────────────────────────────── │
│ Total Savings: -$142,580 (88.5% cost reduction) │
│ │
└─────────────────────────────────────────────────────────────────┘
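
The percentages in the sample dashboard can be reproduced from its dollar figures. The short calculation below is only a worked check of that arithmetic; it assumes the "no optimization" baseline equals actual spend plus total savings, which is consistent with the 88.5% shown (the exact value is about 88.56%).

```python
# Figures copied from the sample dashboard above.
budget = 25_000
actual = 18_420
projected = 22_850
savings = {"smart_routing": 98_420, "caching": 32_840, "batching": 11_320}

total_savings = sum(savings.values())
# Assumed baseline: what the month would have cost with no optimization.
unoptimized = actual + total_savings

budget_used_pct = 100 * actual / budget
projected_pct = 100 * projected / budget
reduction_pct = 100 * total_savings / unoptimized

print(f"Total savings:  ${total_savings:,}")
print(f"Budget used:    {budget_used_pct:.1f}%")
print(f"Projected:      {projected_pct:.1f}%")
print(f"Cost reduction: {reduction_pct:.2f}%")
```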
Cost Attribution:
| Dimension | View |
|---|---|
| By Tenant | Top spending tenants |
| By Agent | Cost per AI agent |
| By Model | Spend per model tier |
| By Feature | Cost by feature area |
| By Time | Hourly/daily/weekly trends |
Budget Management:
| Feature | Description |
|---|---|
| Budget Allocation | Set monthly/quarterly budgets |
| Threshold Alerts | Notify at 50%, 75%, 90% |
| Tenant Limits | Per-tenant cost caps |
| Auto-Throttling | Downgrade the model tier when the budget is exceeded |
| Forecasting | ML-based spend projections |
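
As a rough illustration of how threshold alerts at 50%, 75%, and 90% might be evaluated, the sketch below reports only the thresholds crossed between two spend readings, so that each alert fires once. The function names and the auto-throttling condition are assumptions, not Cockpit's documented behavior.

```python
ALERT_THRESHOLDS = (0.50, 0.75, 0.90)  # from the Threshold Alerts row


def crossed_thresholds(previous_spend: float, current_spend: float,
                       budget: float) -> list[float]:
    """Return thresholds crossed between two spend readings."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    before = previous_spend / budget
    after = current_spend / budget
    # A threshold is crossed if it lies in the half-open interval (before, after].
    return [t for t in ALERT_THRESHOLDS if before < t <= after]


def should_throttle(current_spend: float, budget: float) -> bool:
    """Hypothetical auto-throttling check: true once spend reaches the budget."""
    return current_spend >= budget


# Moving from $12,000 to $18,420 on a $25,000 budget crosses only the 50% mark.
print(crossed_thresholds(12_000, 18_420, 25_000))
```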
6. Incidents
Safety and operational incident management.
| Feature | Description |
|---|---|
| Active Incidents | Current issues requiring attention |
| Incident History | Past incidents and resolutions |
| Safety Violations | AI content/behavior issues |
| Escalation Queue | Items needing manager review |
7. Tenant Operations
Multi-tenant management and support.
| Feature | Description |
|---|---|
| Tenant List | All tenants with health status |
| Tenant Details | Configuration, usage, billing |
| Support Actions | Impersonation, config changes |
| Onboarding | New tenant setup wizard |
Human-in-the-Loop (HITL)
Approval Queue
AI agents can request human approval for high-stakes actions:
| Agent | Approval Triggers |
|---|---|
| Maximus | Transactions >$500, refunds >$100 |
| Support Agent | Account changes, data deletion |
| Scheduling Agent | Overtime approval, shift cancellations |
| Marketing Agent | Campaign budget >$1000, content publish |
| Voice Agent | Complex complaints, legal requests |
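
The dollar-based triggers in this table can be expressed as a simple limit lookup. This is a minimal sketch under stated assumptions: the agent and action identifiers are hypothetical, and triggers without a dollar amount (such as account changes or legal requests) are not modeled.

```python
from dataclasses import dataclass


@dataclass
class ProposedAction:
    agent: str     # hypothetical identifier, e.g. "maximus" or "marketing"
    kind: str      # hypothetical identifier, e.g. "transaction" or "refund"
    amount: float  # US dollars


# Limits taken from the table above; the key names are hypothetical.
AMOUNT_LIMITS = {
    ("maximus", "transaction"): 500.0,
    ("maximus", "refund"): 100.0,
    ("marketing", "campaign"): 1000.0,
}


def requires_approval(action: ProposedAction) -> bool:
    """True when the amount is strictly above the configured limit."""
    limit = AMOUNT_LIMITS.get((action.agent, action.kind))
    return limit is not None and action.amount > limit


print(requires_approval(ProposedAction("maximus", "refund", 150.0)))
print(requires_approval(ProposedAction("maximus", "refund", 100.0)))
```

The comparison is strict because the table specifies ">$500" and ">$100", so an action exactly at the limit does not trigger approval in this sketch.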
Approval Workflow
Approval Actions
| Action | Description |
|---|---|
| Approve | Execute the action as requested |
| Approve with Note | Execute with comment for audit |
| Modify & Approve | Change parameters before execution |
| Reject | Deny the action with reason |
| Escalate | Send to senior operator |
ACP Agent Registry
Manage AI agent definitions, tool permissions, and safety constraints.
Agent Registry
┌─────────────────────────────────────────────────────────────────┐
│ ACP AGENT REGISTRY │
├─────────────────────────────────────────────────────────────────┤
│ │
│ REGISTERED AGENTS │
│ ─────────────────────────────────────────────────────────── │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ 🤖 MAXIMUS (Customer AI) Status: Active │ │
│ │ ID: agent-maximus-v3 │ │
│ │ Tools: 12 enabled HITL Triggers: 4 │ │
│ │ Safety: Level 2 Max Tier: T5 │ │
│ │ [Configure] [View Logs] [Disable] │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ 🎧 SUPPORT AGENT Status: Active │ │
│ │ ID: agent-support-v2 │ │
│ │ Tools: 8 enabled HITL Triggers: 6 │ │
│ │ Safety: Level 3 Max Tier: T4 │ │
│ │ [Configure] [View Logs] [Disable] │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Tool Permissions
Define which tools each agent can use:
| Tool Category | Example Tools | Risk Level |
|---|---|---|
| Read-Only | Query database, view orders | Low |
| Write | Update records, create orders | Medium |
| Financial | Process refunds, adjust pricing | High |
| Administrative | Modify settings, manage users | Critical |
| External | Send emails, make API calls | Medium |
Permission Matrix:
| Tool | Maximus | Support | Scheduling | Marketing |
|---|---|---|---|---|
| Query orders | ✓ | ✓ | ✓ | ✓ |
| Update orders | ✓ | ✓ | - | - |
| Process refunds | HITL | HITL | - | - |
| View schedule | ✓ | ✓ | ✓ | ✓ |
| Modify schedule | - | - | HITL | - |
| Send marketing | - | - | - | HITL |
| Access PII | Limited | ✓ | Limited | - |
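
One way to model the matrix in code is a lookup that returns an access level and denies anything not listed. The sketch below covers a subset of the rows; the enum, the identifiers, and the deny-by-default rule are illustrative assumptions rather than the ACP registry's actual schema.

```python
from enum import Enum


class Access(Enum):
    ALLOW = "allow"
    HITL = "hitl"        # allowed only after human approval
    LIMITED = "limited"  # restricted access, as in the "Access PII" row
    DENY = "deny"


# A subset of the matrix above; tool and agent identifiers are hypothetical.
PERMISSIONS = {
    "update_orders": {"maximus": Access.ALLOW, "support": Access.ALLOW},
    "process_refunds": {"maximus": Access.HITL, "support": Access.HITL},
    "modify_schedule": {"scheduling": Access.HITL},
    "access_pii": {
        "maximus": Access.LIMITED,
        "support": Access.ALLOW,
        "scheduling": Access.LIMITED,
    },
}


def access_for(agent: str, tool: str) -> Access:
    """Default to DENY for any agent/tool pair not listed."""
    return PERMISSIONS.get(tool, {}).get(agent, Access.DENY)


print(access_for("maximus", "process_refunds"))
print(access_for("marketing", "process_refunds"))
```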
Agent Configuration
| Setting | Description | Default |
|---|---|---|
| Max Model Tier | Highest tier agent can use | T4 |
| Safety Level | Content filtering strictness | Level 2 |
| HITL Threshold | When to require human approval | Configurable |
| Rate Limit | Max requests per minute | 1000 |
| Context Window | Max conversation length | 100K tokens |
| Timeout | Request timeout | 30s |
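
The defaults above could be captured in a small configuration object with per-agent overrides. This is a sketch only: the field names are hypothetical, the HITL threshold is omitted because it is listed as configurable, and the tier-capping helper is an assumed interpretation of "Max Model Tier".

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AgentConfig:
    # Default values come from the table above; field names are hypothetical.
    max_model_tier: str = "T4"
    safety_level: int = 2
    rate_limit_per_min: int = 1000
    context_window_tokens: int = 100_000
    timeout_seconds: int = 30


TIER_ORDER = ["T1", "T2", "T3", "T4", "T5", "T6"]


def clamp_tier(requested: str, config: AgentConfig) -> str:
    """Return the requested tier, capped at the agent's maximum tier."""
    cap = TIER_ORDER.index(config.max_model_tier)
    return TIER_ORDER[min(TIER_ORDER.index(requested), cap)]


default = AgentConfig()
# The registry mockup shows Max Tier: T5 for Maximus.
maximus = replace(default, max_model_tier="T5")

print(clamp_tier("T6", default))
print(clamp_tier("T6", maximus))
```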
AI Safety Controls
Monitoring
| Control | Description |
|---|---|
| Content Filter | Detect harmful/inappropriate content |
| Jailbreak Detection | Identify prompt injection attempts |
| PII Scanner | Flag personal data exposure |
| Bias Monitor | Track response fairness metrics |
| Hallucination Detection | Identify factually incorrect responses |
| Tone Analysis | Monitor response appropriateness |
Safety Dashboard
┌─────────────────────────────────────────────────────────────────┐
│ AI SAFETY DASHBOARD Last 24hrs │
├─────────────────────────────────────────────────────────────────┤
│ │
│ SAFETY METRICS │
│ ─────────────────────────────────────────────────────────── │
│ Content Filter Triggers: 42 (0.02% of requests) │
│ Jailbreak Attempts: 8 (blocked) │
│ PII Detections: 156 (masked) │
│ Bias Alerts: 3 (under review) │
│ Hallucination Flags: 28 (reviewed) │
│ │
│ RECENT INCIDENTS │
│ ─────────────────────────────────────────────────────────── │
│ 🔴 14:32 - Jailbreak attempt detected (blocked) │
│ 🟡 12:45 - Potential bias in response (under review) │
│ 🟢 10:15 - PII masked in Support response (auto-handled) │
│ │
│ SAFETY SCORE: 98.5% │
│ ████████████████████████████████████████████████░░ │
│ │
└─────────────────────────────────────────────────────────────────┘
Response Actions
| Severity | Response Time | Action |
|---|---|---|
| Critical | Immediate | Auto-block, page on-call |
| High | 15 minutes | Alert operators, queue for review |
| Medium | 1 hour | Flag for batch review |
| Low | 24 hours | Log for analysis |
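
The response times in this table translate directly into deadlines. The sketch below computes a deadline from a detection time and checks whether it has passed; the function names are hypothetical, and "Immediate" is modeled as a zero-length window.

```python
from datetime import datetime, timedelta

# Response windows from the table above; "Immediate" is modeled as zero delay.
RESPONSE_WINDOWS = {
    "critical": timedelta(0),
    "high": timedelta(minutes=15),
    "medium": timedelta(hours=1),
    "low": timedelta(hours=24),
}


def response_deadline(severity: str, detected_at: datetime) -> datetime:
    """Latest time by which an incident of this severity should be handled."""
    return detected_at + RESPONSE_WINDOWS[severity.lower()]


def is_overdue(severity: str, detected_at: datetime, now: datetime) -> bool:
    """True once the current time is past the response deadline."""
    return now > response_deadline(severity, detected_at)


detected = datetime(2025, 1, 1, 14, 32)
print(response_deadline("High", detected))
print(is_overdue("High", detected, datetime(2025, 1, 1, 14, 50)))
```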
Kill Switch
Activating the kill switch immediately disables the agent across all tenants and automatically creates an incident. This is a last-resort action for safety-critical situations. For less severe issues, use the "Safe Mode" or "Tenant-Specific" option instead.
To disable an agent in an emergency:
- Go to Cockpit > Agents and select the agent
- Click Kill Switch (red button)
- Confirm the action
Once confirmed:
- The agent is immediately disabled across all tenants
- Fallback behavior is activated
- An incident is created automatically
Kill Switch Options:
| Option | Effect |
|---|---|
| Full Disable | Agent completely offline |
| Safe Mode | Agent responds with limited capability |
| Fallback | Redirect to backup agent |
| Tenant-Specific | Disable for specific tenants only |
Keyboard Shortcuts
| Shortcut | Action |
|---|---|
| G D | Go to Dashboard |
| G A | Go to Approvals |
| G I | Go to Incidents |
| A | Approve selected item |
| R | Reject selected item |
| E | Escalate selected item |
| / | Search |
| ? | Show all shortcuts |
API Access
Cockpit data is available via the internal API:
# Get pending approvals
GET /api/internal/cockpit/approvals?status=pending
# Approve an item
POST /api/internal/cockpit/approvals/{id}/approve
{
"note": "Approved per policy",
"operator_id": "uuid"
}
# Get AI metrics
GET /api/internal/cockpit/metrics/ai?period=24h
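
The endpoints above could be called from a script along the following lines. This sketch only builds the requests with Python's standard library and does not send them. It assumes the API is served from the Cockpit host and omits authentication, neither of which is specified on this page; the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: the internal API shares the Cockpit host. Authentication is omitted.
BASE_URL = "https://cockpit.olympuscloud.ai"


def pending_approvals_request() -> Request:
    """Build the request that lists pending approvals."""
    query = urlencode({"status": "pending"})
    return Request(f"{BASE_URL}/api/internal/cockpit/approvals?{query}", method="GET")


def approve_request(approval_id: str, operator_id: str, note: str) -> Request:
    """Build the request that approves one item, with the JSON body shown above."""
    body = json.dumps({"note": note, "operator_id": operator_id}).encode("utf-8")
    return Request(
        f"{BASE_URL}/api/internal/cockpit/approvals/{approval_id}/approve",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ai_metrics_request(period: str = "24h") -> Request:
    """Build the request that fetches AI metrics for a period."""
    query = urlencode({"period": period})
    return Request(f"{BASE_URL}/api/internal/cockpit/metrics/ai?{query}", method="GET")


req = approve_request("example-id", "example-operator", "Approved per policy")
print(req.get_method(), req.full_url)
```

To actually send a request, pass it to `urllib.request.urlopen` after adding whatever credentials the internal API requires.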
Best Practices
Aim to respond to HITL approval requests in under 15 minutes. Stale approvals block AI agent workflows and degrade the experience for tenants waiting on agent actions.
- Check approvals frequently - Aim for under 15 min response time
- Document rejections - Always provide clear reasons
- Monitor AI costs daily - Catch anomalies early
- Review incidents promptly - Prioritize safety issues
- Use keyboard shortcuts - Improve efficiency
Troubleshooting
Approval Queue Stuck
Symptom: Items not appearing in queue
Solution:
- Check agent health in AI Agent Monitoring
- Verify that the WebSocket connection is active
- Refresh the page
- Check whether a filter is hiding items in the queue
Cost Data Delayed
Symptom: Cost metrics showing old data
Solution:
- Wait for the next refresh; cost data updates every 5 minutes
- Check ClickHouse pipeline status
- Verify tenant attribution