Operations Team Handbook
Comprehensive guide for NebusAI Operations team members.
Team Mission
The Operations team ensures Olympus Cloud platform reliability, security, and performance. We maintain 99.99% uptime and sub-second response times, and we enable engineering teams to deploy with confidence.
Core Responsibilities
| Area | Responsibility |
|---|---|
| On-Call | 24/7 incident response and resolution |
| Infrastructure | GCP, Cloudflare, and edge server management |
| Deployment | CI/CD pipeline and release management |
| Monitoring | Observability, alerting, and AIOps |
| Security | Security operations and compliance |
| SRE | Reliability engineering and capacity planning |
Team Structure
Roles
| Role | Focus | On-Call |
|---|---|---|
| VP Operations | Strategy, escalation | P1 only |
| Operations Manager | Team lead, scheduling | Weekly backup |
| Senior SRE | Complex incidents, architecture | Primary rotation |
| SRE | Incident response, automation | Primary rotation |
| DevOps Engineer | CI/CD, tooling | Secondary rotation |
| NOC Analyst | L1 monitoring, triage | 24/7 coverage |
Team Distribution
Operations Team (12 members)
├── US West (SF) - 4
│ └── Primary: Mon-Fri 6AM-6PM PT
├── US East (NYC) - 3
│ └── Primary: Mon-Fri 9AM-9PM ET
├── EU (London) - 3
│ └── Primary: Mon-Fri 9AM-9PM GMT
└── APAC (Singapore) - 2
  └── Primary: Mon-Fri 9AM-9PM SGT
On-Call Operations
Rotation Schedule
| Schedule | Duration | Coverage |
|---|---|---|
| Primary | 1 week | All incidents |
| Secondary | 1 week | Escalation backup |
| NOC | 8-hour shifts | L1 triage, monitoring |
On-Call Expectations
When on-call:
- Response Time
  - P1: Acknowledge within 5 minutes
  - P2: Acknowledge within 15 minutes
  - P3: Acknowledge within 1 hour
- Availability
  - Phone accessible 24/7
  - Laptop within 15 minutes
  - VPN/network access confirmed
  - Runbook access verified
- Handoff Requirements
  - Document all open issues
  - Update incident notes
  - Briefing call with next on-call
  - Clear escalation status
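The acknowledgement targets above can be checked mechanically. The following is a minimal sketch, not the team's tooling: the names `ACK_TARGETS`, `ack_deadline`, and `ack_breached` are hypothetical, and it assumes the clock starts when the page is sent.

```python
from datetime import datetime, timedelta

# Acknowledgement targets copied from the Response Time list.
ACK_TARGETS = {
    "P1": timedelta(minutes=5),
    "P2": timedelta(minutes=15),
    "P3": timedelta(hours=1),
}


def ack_deadline(severity: str, paged_at: datetime) -> datetime:
    """Return the latest time by which the page should be acknowledged."""
    # Assumption: the target is measured from the moment the page is sent.
    return paged_at + ACK_TARGETS[severity]


def ack_breached(severity: str, paged_at: datetime, acked_at: datetime) -> bool:
    """True when the acknowledgement arrived after the target."""
    return acked_at > ack_deadline(severity, paged_at)
```

For example, a P1 page sent at 12:00 would have a deadline of 12:05 under these assumptions.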
Override Requests
| Request | Approval | Notice |
|---|---|---|
| PTO coverage | Self-serve | 72 hours |
| Shift swap | Peer + Manager | 48 hours |
| Emergency | Manager | Immediate |
| Holiday | Auto-scheduled | 30 days |
Incident Management
Incident Severity
| Severity | Definition | Response |
|---|---|---|
| P1-Critical | Platform outage, data at risk | All hands, bridge |
| P2-High | Major feature down | Primary + secondary |
| P3-Medium | Degraded performance | Primary |
| P4-Low | Minor issue | Next business day |
Incident Lifecycle
┌──────────────────────────────────────────────────────────────────┐
│ INCIDENT LIFECYCLE │
├──────────────────────────────────────────────────────────────────┤
│ │
│ 1. DETECTION │
│ ──────────────────────────────────────────────────────────── │
│ • Automated alert fires │
│ • AIOps Engine performs initial triage │
│ • On-call paged if AI cannot resolve │
│ │
│ 2. TRIAGE (First 5 minutes) │
│ ──────────────────────────────────────────────────────────── │
│ • Acknowledge alert │
│ • Assess impact and scope │
│ • Determine severity level │
│ • Start incident channel (if P1/P2) │
│ │
│ 3. INVESTIGATION │
│ ──────────────────────────────────────────────────────────── │
│ • Query logs and metrics │
│ • Check related alerts │
│ • Review recent changes │
│ • Consult runbooks │
│ │
│ 4. MITIGATION │
│ ──────────────────────────────────────────────────────────── │
│ • Execute runbook actions │
│ • Apply temporary fixes │
│ • Communicate status │
│ • Monitor for improvement │
│ │
│ 5. RESOLUTION │
│ ──────────────────────────────────────────────────────────── │
│ • Confirm issue resolved │
│ • Close incident │
│ • Schedule postmortem (P1/P2) │
│ • Update documentation │
│ │
│ 6. POSTMORTEM (Within 72 hours) │
│ ──────────────────────────────────────────────────────────── │
│ • Document timeline │
│ • Identify root cause │
│ • Create action items │
│ • Share learnings │
│ │
└──────────────────────────────────────────────────────────────────┘
P1 Incident Procedure
For P1 Critical incidents:
- Immediate (0-5 min)
  - Acknowledge alert
  - Start Slack incident channel: #incident-YYYY-MM-DD-{brief}
  - Page secondary and manager
  - Post initial status in channel
- Triage (5-15 min)
  - Assign Incident Commander (IC)
  - IC posts situation report
  - Begin investigation
  - Draft customer communication if needed
- Bridge Call (if needed)
  - Start Google Meet: "Incident Bridge"
  - IC runs the call
  - 15-minute status updates
  - All actions logged in Slack
- Resolution
  - Confirm metrics return to normal
  - IC declares incident resolved
  - Post final update
  - Schedule postmortem
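The incident channel naming pattern in the first step can be generated rather than typed by hand. This is a minimal sketch: `incident_channel` is a hypothetical helper, and the slug rules (lowercase, hyphen-separated alphanumerics) are an assumption, since the handbook only gives the `#incident-YYYY-MM-DD-{brief}` pattern.

```python
import re
from datetime import date


def incident_channel(day: date, brief: str) -> str:
    """Build a channel name following #incident-YYYY-MM-DD-{brief}."""
    # Assumption: the brief is lowercased and every run of other characters
    # becomes a single hyphen; the handbook does not define slug rules.
    slug = re.sub(r"[^a-z0-9]+", "-", brief.lower()).strip("-")
    return f"#incident-{day.isoformat()}-{slug}"
```

Generating the name keeps channel names consistent and searchable across incidents.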
Communication Templates
Initial Status (Internal):
INCIDENT: [Brief description]
SEVERITY: P1/P2
STATUS: Investigating/Identified/Mitigating
IMPACT: [User impact description]
IC: [Name]
NEXT UPDATE: [Time]
Customer Communication (via Status Page):
Title: [Service] - [Brief issue]
Status: Investigating
We are currently investigating [issue description].
Some users may experience [impact].
We will provide an update within [timeframe].
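To keep internal updates uniform, the internal status template can be filled from structured fields. The sketch below is illustrative only: the `InternalStatus` class and its field names are hypothetical, and it simply reproduces the field order of the internal template.

```python
from dataclasses import dataclass


@dataclass
class InternalStatus:
    incident: str     # brief description
    severity: str     # "P1" or "P2"
    status: str       # "Investigating", "Identified", or "Mitigating"
    impact: str       # user impact description
    ic: str           # Incident Commander name
    next_update: str  # time of the next update

    def render(self) -> str:
        """Format the update in the field order of the internal template."""
        return "\n".join([
            f"INCIDENT: {self.incident}",
            f"SEVERITY: {self.severity}",
            f"STATUS: {self.status}",
            f"IMPACT: {self.impact}",
            f"IC: {self.ic}",
            f"NEXT UPDATE: {self.next_update}",
        ])
```

The rendered text can be pasted into the incident channel as the initial status post.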
Infrastructure Management
Platform Overview
| Component | Provider | Region | Purpose |
|---|---|---|---|
| Cloud Run | GCP | us-central1 | API services |
| Spanner | GCP | multi-region | Database |
| Workers | Cloudflare | Global | Edge compute |
| R2 | Cloudflare | Global | Object storage |
| Edge Servers | On-premise | Per-location | OlympusEdge |
Access Management
| System | Access Method | Approval |
|---|---|---|
| GCP Console | SSO + MFA | Role-based |
| Cloudflare | SSO + MFA | Role-based |
| Production DB | Breakglass | Manager approval |
| Customer Data | Audit-logged | Per-incident |
Infrastructure Runbooks
| Runbook | When to Use |
|---|---|
| gcp-service-restart | Cloud Run service unresponsive |
| spanner-hotspot | Database hotspot detected |
| worker-redeploy | Edge worker issues |
| cache-flush | Cache corruption suspected |
| dns-failover | Regional DNS issues |
| edge-server-recovery | OlympusEdge server offline |
Monitoring & Observability
Dashboards
| Dashboard | URL | Purpose |
|---|---|---|
| Platform Overview | cockpit.olympuscloud.ai | Health summary |
| Service Health | /dashboards/services | Per-service metrics |
| Edge Status | /dashboards/edge | Edge server health |
| Database | /dashboards/spanner | Spanner metrics |
| Cost | /dashboards/costs | Cloud spending |
Key Metrics
| Metric | SLO | Alert Threshold |
|---|---|---|
| API Latency (p99) | Under 500ms | Over 800ms |
| Error Rate | Under 0.1% | Over 0.5% |
| Availability | 99.99% | Any outage |
| Edge Sync Lag | Under 30s | Over 2min |
| Database Latency | Under 50ms | Over 100ms |
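The gap between the SLO and the alert threshold in the table above means a metric can be out of SLO without paging anyone. A minimal sketch of that three-way classification follows; the metric keys and the `classify` function are hypothetical, and Availability is left out because its alert condition ("any outage") is not numeric.

```python
# (SLO limit, alert threshold) copied from the Key Metrics table.
# Units are encoded in the hypothetical key names.
KEY_METRICS = {
    "api_latency_p99_ms": (500, 800),
    "error_rate_pct": (0.1, 0.5),
    "edge_sync_lag_s": (30, 120),
    "db_latency_ms": (50, 100),
}


def classify(metric: str, value: float) -> str:
    """Return "ok", "slo_breach", or "alert" for a single reading."""
    slo, alert = KEY_METRICS[metric]
    if value > alert:       # "Over <threshold>" fires the alert
        return "alert"
    if value >= slo:        # the SLO is "Under <limit>", so equality breaches it
        return "slo_breach"
    return "ok"
```

Readings in the "slo_breach" band are useful for trend review even though they do not page.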
AIOps Oversight
The AIOps Engine handles L1 incidents automatically. Your responsibilities:
| Responsibility | Action |
|---|---|
| Review AI decisions | Check daily AI resolution report |
| Tune thresholds | Adjust based on false positive rate |
| Update runbooks | AI uses runbooks for remediation |
| Approve high-risk actions | AI requests approval for risky actions |
| Train models | Provide feedback on AI decisions |
Deployment Operations
Release Process
┌──────────────────────────────────────────────────────────────────┐
│ RELEASE PIPELINE │
├──────────────────────────────────────────────────────────────────┤
│ │
│ 1. ENGINEERING │
│ └── PR merged to develop │
│ │
│ 2. STAGING DEPLOY (Automatic) │
│ ├── All tests pass │
│ ├── Deploy to staging.olympuscloud.ai │
│ └── Smoke tests run │
│ │
│ 3. OPS VALIDATION (Manual Gate) │
│ ├── Review deployment metrics │
│ ├── Check error rates │
│ └── Approve for production │
│ │
│ 4. PRODUCTION DEPLOY (Gradual) │
│ ├── 10% canary (5 min wait) │
│ ├── 25% rollout (5 min wait) │
│ ├── 50% rollout (5 min wait) │
│ └── 100% rollout │
│ │
│ 5. POST-DEPLOY │
│ ├── Monitor for 30 minutes │
│ ├── Auto-rollback if errors spike │
│ └── Close deployment ticket │
│ │
└──────────────────────────────────────────────────────────────────┘
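The gradual production deploy in step 4 can be modeled as a list of stages. This is a sketch under stated assumptions, not the actual pipeline: `rollout_plan` and `run_rollout` are hypothetical names, and `healthy` stands in for whatever health signal gates each stage.

```python
# Stages from the PRODUCTION DEPLOY step: (traffic percent, wait in minutes).
STAGES = [(10, 5), (25, 5), (50, 5), (100, 0)]


def rollout_plan(stages=STAGES):
    """Yield (minutes since start, traffic percent) for each stage."""
    elapsed = 0
    for percent, wait in stages:
        yield elapsed, percent
        elapsed += wait


def run_rollout(healthy, stages=STAGES):
    """Walk the stages and stop at the first one the health check rejects.

    `healthy` is a hypothetical callable that takes the traffic percent and
    returns a bool. Returns the last percent that passed (0 if the canary failed).
    """
    reached = 0
    for percent, _wait in stages:
        if not healthy(percent):
            return reached
        reached = percent
    return reached
```

With the waits listed in the pipeline, full rollout is reached 15 minutes after the canary starts, before the 30-minute post-deploy monitoring period begins.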
Deployment Windows
| Window | Time (PT) | Use |
|---|---|---|
| Regular | Tue-Thu 10AM-2PM | Standard deploys |
| Emergency | Any time | P1 fixes |
| Off-peak | Tue-Thu 2AM-4AM | Database migrations |
| Frozen | Fri 2PM - Mon 10AM | No deploys |
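A pre-deploy check can classify a timestamp against the windows in the table above. The sketch below assumes the caller passes a time already expressed in PT; `deploy_window` is a hypothetical name, and the Emergency window is not modeled because P1 fixes may deploy at any time.

```python
from datetime import datetime


def deploy_window(now_pt: datetime) -> str:
    """Classify a Pacific-time timestamp against the Deployment Windows table.

    Assumption: `now_pt` is a naive datetime already expressed in PT.
    """
    day, hour = now_pt.weekday(), now_pt.hour  # Monday == 0
    # Frozen: Fri 2PM through Mon 10AM.
    if (day == 4 and hour >= 14) or day in (5, 6) or (day == 0 and hour < 10):
        return "frozen"
    if day in (1, 2, 3):  # Tue-Thu
        if 10 <= hour < 14:
            return "regular"
        if 2 <= hour < 4:
            return "off-peak"
    # Times the table does not assign to any window.
    return "none"
```

A result of "none" means the table does not cover that time, so the sketch leaves the decision to the operator rather than guessing.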
Rollback Procedure
- Automatic: If error rate >1% in first 5 minutes
- Manual: Run ./scripts/rollback.sh <version>
- Database: Use ./scripts/db-rollback.sh (requires approval)
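The automatic rollback rule reduces to a single comparison. This is a minimal sketch of the stated rule only; `should_auto_rollback` and its parameters are hypothetical, and how the real pipeline counts errors and requests is not described in this handbook.

```python
def should_auto_rollback(errors: int, requests: int, minutes_since_deploy: float) -> bool:
    """Apply the automatic rule: error rate above 1% in the first 5 minutes."""
    # Assumption: with no traffic yet there is nothing to judge, so do not roll back.
    if requests == 0 or minutes_since_deploy > 5:
        return False
    return errors / requests > 0.01
```

After the 5-minute window, rollback falls to the manual procedure.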
Edge Server Operations
OlympusEdge Fleet
| Region | Servers | Status Dashboard |
|---|---|---|
| US West | 450 | /edge/us-west |
| US East | 380 | /edge/us-east |
| EU | 120 | /edge/eu |
| APAC | 80 | /edge/apac |
Edge Health Checks
| Check | Frequency | Alert |
|---|---|---|
| Heartbeat | 30s | 3 missed = offline |
| Sync Status | 1min | over 5min lag = warning |
| Disk Space | 5min | over 90% = critical |
| Memory | 1min | over 95% = warning |
| Temperature | 5min | over 80C = critical |
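The health check thresholds above translate directly into comparisons. The following sketch is illustrative: `is_offline` and `edge_alerts` are hypothetical names, and it assumes a heartbeat counts as missed once a full 30-second interval passes without one.

```python
HEARTBEAT_INTERVAL_S = 30   # heartbeat frequency from the table
MISSED_BEATS_OFFLINE = 3    # "3 missed = offline"


def is_offline(seconds_since_last_heartbeat: float) -> bool:
    """True once three full heartbeat intervals have passed without a beat."""
    # Assumption: each elapsed 30s interval with no heartbeat is one miss.
    missed = int(seconds_since_last_heartbeat // HEARTBEAT_INTERVAL_S)
    return missed >= MISSED_BEATS_OFFLINE


def edge_alerts(sync_lag_s: float, disk_pct: float, memory_pct: float, temp_c: float) -> dict:
    """Map each breached check to the alert level listed in the table."""
    alerts = {}
    if sync_lag_s > 5 * 60:
        alerts["sync"] = "warning"
    if disk_pct > 90:
        alerts["disk"] = "critical"
    if memory_pct > 95:
        alerts["memory"] = "warning"
    if temp_c > 80:
        alerts["temperature"] = "critical"
    return alerts
```

Under this reading, a server is flagged offline 90 seconds after its last heartbeat.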
Common Edge Issues
| Issue | Runbook | Escalation |
|---|---|---|
| Offline | edge-offline-recovery | Location contact |
| Sync Failed | edge-sync-repair | SRE |
| High Load | edge-load-balance | None |
| Network Issues | edge-network-diag | Location IT |
Security Operations
Security Responsibilities
| Area | Ops Team Role |
|---|---|
| Access Reviews | Monthly review of all access |
| Secret Rotation | Quarterly secret rotation |
| Security Alerts | Triage and respond to SIEM alerts |
| Compliance | Support SOC 2 audits |
| Pen Testing | Coordinate annual pen tests |
Security Incident Response
For security incidents:
- Do NOT discuss in public channels
- Page Security On-Call immediately
- Use encrypted channel: #sec-incident-{date}
- Preserve evidence (no cleanup without approval)
- Follow Security Incident Runbook
Access Request Process
| Access Type | Approver | Duration |
|---|---|---|
| Read-only production | Manager | Permanent |
| Write production | Manager + Security | 24 hours |
| Customer data | VP Ops + Legal | Per-incident |
| Database admin | CTO | 4 hours max |
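Time-boxed grants in the table above imply an expiry that can be computed when access is approved. This is a minimal sketch with hypothetical names (`MAX_DURATION`, `access_expiry`); "Permanent" and "Per-incident" have no fixed duration, so they are represented as `None`.

```python
from datetime import datetime, timedelta
from typing import Optional

# Durations copied from the Access Request Process table.
MAX_DURATION = {
    "read-only production": None,             # Permanent
    "write production": timedelta(hours=24),
    "customer data": None,                    # Per-incident, no fixed duration
    "database admin": timedelta(hours=4),     # 4 hours max
}


def access_expiry(access_type: str, granted_at: datetime) -> Optional[datetime]:
    """Return when a grant should be revoked, or None if it has no fixed limit."""
    limit = MAX_DURATION[access_type]
    return None if limit is None else granted_at + limit
```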
Capacity Planning
Capacity Reviews
| Review | Frequency | Attendees |
|---|---|---|
| Weekly Capacity | Every Monday | Ops team |
| Monthly Planning | First week | Ops + Eng leads |
| Quarterly Forecast | Start of quarter | Ops + Finance |
Scaling Triggers
| Metric | Threshold | Action |
|---|---|---|
| CPU >70% | Sustained 10min | Auto-scale up |
| Memory >80% | Sustained 5min | Alert + scale |
| Disk >80% | Any | Alert + expand |
| Connections >80% | Sustained 5min | Scale + alert |
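The "sustained" qualifier in the table above means a single spike should not trigger scaling. A minimal sketch of that check follows, assuming one reading per minute; `sustained_above` is a hypothetical helper, not part of any autoscaler API.

```python
def sustained_above(samples, threshold, minutes):
    """True when the most recent `minutes` readings all exceed `threshold`.

    Assumption: `samples` holds one reading per minute, oldest first.
    """
    if minutes < 1:
        raise ValueError("minutes must be at least 1")
    if len(samples) < minutes:
        return False  # not enough history to call it sustained
    return all(value > threshold for value in samples[-minutes:])
```

For the CPU row this would be called as `sustained_above(cpu_samples, 70, 10)`; for the Disk row, where any reading counts, `minutes` would be 1.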
Cost Management
| Budget Category | Monthly Budget | Owner |
|---|---|---|
| GCP Compute | $45,000 | Ops Manager |
| GCP Database | $25,000 | Ops Manager |
| Cloudflare | $15,000 | Ops Manager |
| Monitoring | $5,000 | SRE Lead |
Tools & Access
Required Tools
| Tool | Purpose | Setup |
|---|---|---|
| Cockpit | Primary ops console | SSO |
| GCP Console | Infrastructure | SSO + MFA |
| Cloudflare Dashboard | Edge & DNS | SSO + MFA |
| PagerDuty | Legacy (migrating) | SSO |
| Slack | Communication | SSO |
| 1Password | Secrets | Team vault |
CLI Tools
# Required CLI tools
gcloud # GCP CLI
wrangler # Cloudflare Workers
kubectl # Kubernetes (edge clusters)
terraform # Infrastructure as code
olympus-cli # Internal ops CLI
# Setup
./scripts/ops-setup.sh
Useful Commands
# Check service health
olympus-cli health all
# Get on-call info
olympus-cli oncall who platform
# View active incidents
olympus-cli incidents active
# Deploy status
olympus-cli deploy status
# Edge server status
olympus-cli edge status --region us-west
# Database metrics
olympus-cli db metrics orders-db
Performance Expectations
SLOs
| Metric | Target | Measurement |
|---|---|---|
| Uptime | 99.99% | Monthly |
| MTTA | Under 5 min | P1 incidents |
| MTTR | Under 30 min | P1 incidents |
| Change Success Rate | Above 99% | Per quarter |
| Alert Accuracy | Above 95% | True positives |
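An uptime target is easier to reason about as a downtime budget. The sketch below converts a percentage into allowed minutes of downtime; `downtime_budget_minutes` is a hypothetical name and the 30-day month is an assumption.

```python
def downtime_budget_minutes(slo_pct: float, days: int = 30) -> float:
    """Return the minutes of downtime allowed by an uptime target over `days`."""
    # Assumption: the measurement period is a 30-day month unless overridden.
    total_minutes = days * 24 * 60
    return total_minutes * (100 - slo_pct) / 100
```

At 99.99% over 30 days, the budget is roughly 4.3 minutes per month, which is shorter than the 5-minute P1 acknowledgement target.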
Individual Metrics
| Metric | Expectation |
|---|---|
| Response Time | Under 5 min for P1/P2 |
| Incident Documentation | 100% complete |
| Runbook Updates | Within 48h of incident |
| On-Call Handoffs | Zero dropped issues |
| Postmortem Participation | All assigned incidents |
Career Development
Skills Matrix
| Level | Technical | Leadership |
|---|---|---|
| NOC Analyst | L1 triage, monitoring | None |
| DevOps Engineer | CI/CD, automation | Project lead |
| SRE | Incident response, architecture | Team mentor |
| Senior SRE | Complex systems, design | Tech lead |
| Ops Manager | Strategy, planning | Team management |
Training Requirements
| Training | Frequency | Provider |
|---|---|---|
| GCP Professional | Annual cert | |
| Incident Commander | Quarterly drill | Internal |
| Security Awareness | Annual | Security team |
| On-Call Training | Before first shift | Buddy system |
Runbook Index
Most Used Runbooks
| Runbook | Category | Link |
|---|---|---|
| Incident Response | Process | /runbooks/incident-response |
| Service Restart | GCP | /runbooks/gcp-service-restart |
| Database Failover | Database | /runbooks/db-failover |
| Edge Recovery | Edge | /runbooks/edge-recovery |
| SSL Certificate | Security | /runbooks/ssl-renewal |
| Capacity Scale | Scaling | /runbooks/capacity-scale |
Creating Runbooks
Every runbook must include:
- Title and ID
- When to use
- Prerequisites
- Step-by-step procedure
- Verification steps
- Rollback procedure
- Escalation path
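The required sections above lend themselves to an automated check before a runbook is published. This is a sketch under an assumed file layout: it treats each section name as appearing on its own line (optionally as a markdown heading), which the handbook does not specify, and `missing_sections` is a hypothetical helper.

```python
REQUIRED_SECTIONS = [
    "Title and ID",
    "When to use",
    "Prerequisites",
    "Step-by-step procedure",
    "Verification steps",
    "Rollback procedure",
    "Escalation path",
]


def missing_sections(runbook_text: str) -> list:
    """Return the required section names that do not appear in the runbook."""
    # Assumption: a section is present when its name is the whole text of a
    # line, ignoring case, surrounding whitespace, and leading "#" characters.
    present = {
        line.strip().lstrip("#").strip().lower()
        for line in runbook_text.splitlines()
    }
    return [name for name in REQUIRED_SECTIONS if name.lower() not in present]
```

An empty result means every required section heading was found; it says nothing about the quality of the content under each heading.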
Contacts & Escalation
Internal Contacts
| Role | Primary | Backup |
|---|---|---|
| VP Operations | Alex Thompson | CTO |
| Ops Manager | Jordan Lee | VP Ops |
| Security Lead | Sam Martinez | Ops Manager |
| Database Expert | Chris Wong | Senior SRE |
External Contacts
| Service | Support | Account Manager |
|---|---|---|
| GCP | Priority support | gcp-am@nebusai.com |
| Cloudflare | Enterprise support | cf-am@nebusai.com |
| Twilio | Support ticket | twilio-am@nebusai.com |
Related Documentation
- Incident Response Runbook - IR procedures
- Deployment Guide - Deploy process
- Edge Infrastructure - Edge servers