Operations Team Handbook

Comprehensive guide for NebusAI Operations team members.

Team Mission

The Operations team ensures Olympus Cloud platform reliability, security, and performance. We maintain 99.99% uptime and sub-second response times, and we enable engineering teams to deploy with confidence.

Core Responsibilities

Area	Responsibility
On-Call	24/7 incident response and resolution
Infrastructure	GCP, Cloudflare, and edge server management
Deployment	CI/CD pipeline and release management
Monitoring	Observability, alerting, and AIOps
Security	Security operations and compliance
SRE	Reliability engineering and capacity planning

Team Structure

Roles

Role	Focus	On-Call
VP Operations	Strategy, escalation	P1 only
Operations Manager	Team lead, scheduling	Weekly backup
Senior SRE	Complex incidents, architecture	Primary rotation
SRE	Incident response, automation	Primary rotation
DevOps Engineer	CI/CD, tooling	Secondary rotation
NOC Analyst	L1 monitoring, triage	24/7 coverage

Team Distribution

Operations Team (12 members)
├── US West (SF) - 4
│   └── Primary: Mon-Fri 6AM-6PM PT
├── US East (NYC) - 3
│   └── Primary: Mon-Fri 9AM-9PM ET
├── EU (London) - 3
│   └── Primary: Mon-Fri 9AM-9PM GMT
└── APAC (Singapore) - 2
    └── Primary: Mon-Fri 9AM-9PM SGT

On-Call Operations

Rotation Schedule

Schedule	Duration	Coverage
Primary	1 week	All incidents
Secondary	1 week	Escalation backup
NOC	8-hour shifts	L1 triage, monitoring

On-Call Expectations

When on-call:

Response Time
- P1: Acknowledge within 5 minutes
- P2: Acknowledge within 15 minutes
- P3: Acknowledge within 1 hour
Availability
- Phone accessible 24/7
- Laptop reachable within 15 minutes
- VPN/network access confirmed
- Runbook access verified
Handoff Requirements
- Document all open issues
- Update incident notes
- Hold a briefing call with the next on-call engineer
- Make the escalation status clear
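
The acknowledgement targets above can be turned into concrete deadlines. The following is a minimal sketch, not part of any existing tooling: the names `ACK_TARGETS`, `ack_deadline`, and `ack_breached` are hypothetical, and only the three durations come from this handbook.

```python
from datetime import datetime, timedelta

# Acknowledgement targets taken from the Response Time list.
ACK_TARGETS = {
    "P1": timedelta(minutes=5),
    "P2": timedelta(minutes=15),
    "P3": timedelta(hours=1),
}


def ack_deadline(severity: str, paged_at: datetime) -> datetime:
    """Return the latest time by which the page should be acknowledged."""
    return paged_at + ACK_TARGETS[severity]


def ack_breached(severity: str, paged_at: datetime, acked_at: datetime) -> bool:
    """True if the acknowledgement arrived after the target for that severity."""
    return acked_at > ack_deadline(severity, paged_at)
```

An acknowledgement exactly on the deadline counts as met in this sketch; that boundary choice is an assumption.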

Override Requests

Request	Approval	Notice
PTO coverage	Self-serve	72 hours
Shift swap	Peer + Manager	48 hours
Emergency	Manager	Immediate
Holiday	Auto-scheduled	30 days

Incident Management

Incident Severity

Severity	Definition	Response
P1-Critical	Platform outage, data at risk	All hands, bridge
P2-High	Major feature down	Primary + secondary
P3-Medium	Degraded performance	Primary
P4-Low	Minor issue	Next business day

Incident Lifecycle

┌──────────────────────────────────────────────────────────────────┐
│                     INCIDENT LIFECYCLE                            │
├──────────────────────────────────────────────────────────────────┤
│                                                                    │
│  1. DETECTION                                                     │
│  ────────────────────────────────────────────────────────────     │
│  • Automated alert fires                                          │
│  • AIOps Engine performs initial triage                           │
│  • On-call paged if AI cannot resolve                             │
│                                                                    │
│  2. TRIAGE (First 5 minutes)                                      │
│  ────────────────────────────────────────────────────────────     │
│  • Acknowledge alert                                              │
│  • Assess impact and scope                                        │
│  • Determine severity level                                       │
│  • Start incident channel (if P1/P2)                              │
│                                                                    │
│  3. INVESTIGATION                                                 │
│  ────────────────────────────────────────────────────────────     │
│  • Query logs and metrics                                         │
│  • Check related alerts                                           │
│  • Review recent changes                                          │
│  • Consult runbooks                                               │
│                                                                    │
│  4. MITIGATION                                                    │
│  ────────────────────────────────────────────────────────────     │
│  • Execute runbook actions                                        │
│  • Apply temporary fixes                                          │
│  • Communicate status                                             │
│  • Monitor for improvement                                        │
│                                                                    │
│  5. RESOLUTION                                                    │
│  ────────────────────────────────────────────────────────────     │
│  • Confirm issue resolved                                         │
│  • Close incident                                                 │
│  • Schedule postmortem (P1/P2)                                    │
│  • Update documentation                                           │
│                                                                    │
│  6. POSTMORTEM (Within 72 hours)                                  │
│  ────────────────────────────────────────────────────────────     │
│  • Document timeline                                              │
│  • Identify root cause                                            │
│  • Create action items                                            │
│  • Share learnings                                                │
│                                                                    │
└──────────────────────────────────────────────────────────────────┘

P1 Incident Procedure

For P1 Critical incidents:

Immediate (0-5 min)
- Acknowledge alert
- Start Slack incident channel: #incident-YYYY-MM-DD-{brief}
- Page secondary and manager
- Post initial status in channel
Triage (5-15 min)
- Assign Incident Commander (IC)
- IC posts situation report
- Begin investigation
- Draft customer communication if needed
Bridge Call (if needed)
- Start Google Meet: "Incident Bridge"
- IC runs the call
- 15-minute status updates
- All actions logged in Slack
Resolution
- Confirm metrics return to normal
- IC declares incident resolved
- Post final update
- Schedule postmortem
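
The channel naming pattern `#incident-YYYY-MM-DD-{brief}` is easy to get inconsistent under pressure. A small helper along these lines could generate it. This is a sketch only: the function name and the slug rule (lowercase, non-alphanumerics collapsed to hyphens) are assumptions, not an existing convention.

```python
import re
from datetime import date


def incident_channel(day: date, brief: str) -> str:
    """Build a channel name of the form #incident-YYYY-MM-DD-{brief}."""
    # Assumed slug rule: lowercase, then collapse every run of characters
    # that are not letters or digits into a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", brief.lower()).strip("-")
    return f"#incident-{day.isoformat()}-{slug}"
```

For example, `incident_channel(date(2024, 3, 5), "API Latency Spike!")` yields `#incident-2024-03-05-api-latency-spike`.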

Communication Templates

Initial Status (Internal):

INCIDENT: [Brief description]
SEVERITY: P1/P2
STATUS: Investigating/Identified/Mitigating
IMPACT: [User impact description]
IC: [Name]
NEXT UPDATE: [Time]
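
If the internal template is filled in by a script rather than by hand, something like the sketch below could keep the fields consistent. The class name and the validation rules are hypothetical; the field labels and the allowed severity and status values are the ones shown in the template.

```python
from dataclasses import dataclass

# Values the internal template lists for SEVERITY and STATUS.
ALLOWED_SEVERITIES = {"P1", "P2"}
ALLOWED_STATUSES = {"Investigating", "Identified", "Mitigating"}


@dataclass
class InternalStatus:
    description: str
    severity: str
    status: str
    impact: str
    commander: str
    next_update: str

    def render(self) -> str:
        """Fill the template, rejecting values the template does not list."""
        if self.severity not in ALLOWED_SEVERITIES:
            raise ValueError(f"unexpected severity: {self.severity}")
        if self.status not in ALLOWED_STATUSES:
            raise ValueError(f"unexpected status: {self.status}")
        return "\n".join([
            f"INCIDENT: {self.description}",
            f"SEVERITY: {self.severity}",
            f"STATUS: {self.status}",
            f"IMPACT: {self.impact}",
            f"IC: {self.commander}",
            f"NEXT UPDATE: {self.next_update}",
        ])
```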

Customer Communication (via Status Page):

Title: [Service] - [Brief issue]
Status: Investigating

We are currently investigating [issue description].
Some users may experience [impact].
We will provide an update within [timeframe].

Infrastructure Management

Platform Overview

Component	Provider	Region	Purpose
Cloud Run	GCP	us-central1	API services
Spanner	GCP	multi-region	Database
Workers	Cloudflare	Global	Edge compute
R2	Cloudflare	Global	Object storage
Edge Servers	On-premise	Per-location	OlympusEdge

Access Management

System	Access Method	Approval
GCP Console	SSO + MFA	Role-based
Cloudflare	SSO + MFA	Role-based
Production DB	Breakglass	Manager approval
Customer Data	Audit-logged	Per-incident

Infrastructure Runbooks

Runbook	When to Use
gcp-service-restart	Cloud Run service unresponsive
spanner-hotspot	Database hotspot detected
worker-redeploy	Edge worker issues
cache-flush	Cache corruption suspected
dns-failover	Regional DNS issues
edge-server-recovery	OlympusEdge server offline

Monitoring & Observability

Dashboards

Dashboard	URL	Purpose
Platform Overview	cockpit.olympuscloud.ai	Health summary
Service Health	/dashboards/services	Per-service metrics
Edge Status	/dashboards/edge	Edge server health
Database	/dashboards/spanner	Spanner metrics
Cost	/dashboards/costs	Cloud spending

Key Metrics

Metric	SLO	Alert Threshold
API Latency (p99)	Under 500ms	Over 800ms
Error Rate	Under 0.1%	Over 0.5%
Availability	99.99%	Any outage
Edge Sync Lag	Under 30s	Over 2min
Database Latency	Under 50ms	Over 100ms
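
Each numeric row in the table has two limits: the SLO and a looser alert threshold. The sketch below shows one way to classify a reading against both. The metric keys and the three result labels are hypothetical; Availability is left out because "Any outage" is not a numeric threshold.

```python
# (SLO limit, alert threshold) per metric, from the Key Metrics table.
# Units: milliseconds, percent, seconds, milliseconds.
THRESHOLDS = {
    "api_latency_p99_ms": (500, 800),
    "error_rate_pct": (0.1, 0.5),
    "edge_sync_lag_s": (30, 120),
    "db_latency_ms": (50, 100),
}


def classify(metric: str, value: float) -> str:
    """Return 'ok', 'slo_breach', or 'alert' for a single reading."""
    slo_limit, alert_threshold = THRESHOLDS[metric]
    if value > alert_threshold:   # "Over X" in the table
        return "alert"
    if value >= slo_limit:        # not "Under X", but not yet alerting
        return "slo_breach"
    return "ok"
```

A reading between the two limits is outside the SLO but does not page anyone, which is why the sketch reports it separately.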

AIOps Oversight

The AIOps Engine handles L1 incidents automatically. Your responsibilities:

Responsibility	Action
Review AI decisions	Check daily AI resolution report
Tune thresholds	Adjust based on false positive rate
Update runbooks	AI uses runbooks for remediation
Approve high-risk actions	AI requests approval before taking risky actions
Train models	Provide feedback on AI decisions

Deployment Operations

Release Process

┌──────────────────────────────────────────────────────────────────┐
│                     RELEASE PIPELINE                              │
├──────────────────────────────────────────────────────────────────┤
│                                                                    │
│  1. ENGINEERING                                                   │
│     └── PR merged to develop                                      │
│                                                                    │
│  2. STAGING DEPLOY (Automatic)                                    │
│     ├── All tests pass                                            │
│     ├── Deploy to staging.olympuscloud.ai                         │
│     └── Smoke tests run                                           │
│                                                                    │
│  3. OPS VALIDATION (Manual Gate)                                  │
│     ├── Review deployment metrics                                 │
│     ├── Check error rates                                         │
│     └── Approve for production                                    │
│                                                                    │
│  4. PRODUCTION DEPLOY (Gradual)                                   │
│     ├── 10% canary (5 min wait)                                   │
│     ├── 25% rollout (5 min wait)                                  │
│     ├── 50% rollout (5 min wait)                                  │
│     └── 100% rollout                                              │
│                                                                    │
│  5. POST-DEPLOY                                                   │
│     ├── Monitor for 30 minutes                                    │
│     ├── Auto-rollback if errors spike                             │
│     └── Close deployment ticket                                   │
│                                                                    │
└──────────────────────────────────────────────────────────────────┘
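
The gradual production rollout in step 4 can be read as a simple loop over traffic percentages with a health check at each stage. The sketch below simulates that logic; it is not the actual pipeline. The function name, the `error_rate_at` callback, and the use of a 1% error rate as the rollback trigger are assumptions made for illustration.

```python
from typing import Callable, List

STAGES = [10, 25, 50, 100]   # percent of traffic, from the pipeline above
ROLLBACK_ERROR_RATE = 1.0    # percent; assumed trigger for "errors spike"


def run_rollout(error_rate_at: Callable[[int], float]) -> List[str]:
    """Walk the stages in order; stop and roll back if errors spike."""
    log = []
    for pct in STAGES:
        log.append(f"deploy {pct}%")
        # A real pipeline would wait 5 minutes here (except after 100%)
        # before sampling the error rate; this sketch samples immediately.
        if error_rate_at(pct) > ROLLBACK_ERROR_RATE:
            log.append(f"rollback at {pct}%")
            return log
    log.append("complete")
    return log
```

Passing a callback such as `lambda pct: 2.0 if pct >= 25 else 0.1` shows the rollout stopping at the 25% stage.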

Deployment Windows

Window	Time (PT)	Use
Regular	Tue-Thu 10AM-2PM	Standard deploys
Emergency	Any time	P1 fixes
Off-peak	Tue-Thu 2AM-4AM	Database migrations
Frozen	Fri 2PM - Mon 10AM	No deploys (emergency P1 fixes excepted)
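
A pre-deploy check can encode the windows above. The following is a sketch under stated assumptions: the input is a naive `datetime` already expressed in Pacific Time, window starts are inclusive and ends exclusive, and times the table does not cover (for example Monday afternoon) return `"none"`. The function name is hypothetical, and Emergency deploys are not modelled because they are allowed at any time.

```python
from datetime import datetime, time


def deployment_window(now_pt: datetime) -> str:
    """Classify a Pacific Time timestamp as frozen, regular, off-peak, or none."""
    day, t = now_pt.weekday(), now_pt.time()   # Monday is 0, Sunday is 6

    # Frozen: Friday 2PM through Monday 10AM.
    frozen = (
        (day == 4 and t >= time(14, 0))
        or day in (5, 6)
        or (day == 0 and t < time(10, 0))
    )
    if frozen:
        return "frozen"

    if day in (1, 2, 3):                       # Tuesday to Thursday
        if time(10, 0) <= t < time(14, 0):
            return "regular"
        if time(2, 0) <= t < time(4, 0):
            return "off-peak"
    return "none"
```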

Rollback Procedure

- Automatic: triggered if the error rate exceeds 1% in the first 5 minutes after a deploy
- Manual: run ./scripts/rollback.sh <version>
- Database: run ./scripts/db-rollback.sh (requires approval)

Edge Server Operations

OlympusEdge Fleet

Region	Servers	Status Dashboard
US West	450	/edge/us-west
US East	380	/edge/us-east
EU	120	/edge/eu
APAC	80	/edge/apac

Edge Health Checks

Check	Frequency	Alert
Heartbeat	30s	3 missed = offline
Sync Status	1min	Over 5min lag = warning
Disk Space	5min	Over 90% = critical
Memory	1min	Over 95% = warning
Temperature	5min	Over 80°C = critical
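
The health check rules above map directly onto a small evaluation function. This sketch is illustrative only: the function and parameter names are hypothetical, and it assumes "3 missed" means that at least three full 30-second heartbeat intervals have passed since the last heartbeat.

```python
from typing import List, Tuple

HEARTBEAT_INTERVAL_S = 30
MAX_MISSED_HEARTBEATS = 3


def edge_alerts(
    seconds_since_heartbeat: float,
    sync_lag_s: float,
    disk_pct: float,
    memory_pct: float,
    temp_c: float,
) -> List[Tuple[str, str]]:
    """Return (check, level) pairs for every rule the server violates."""
    alerts = []
    if seconds_since_heartbeat // HEARTBEAT_INTERVAL_S >= MAX_MISSED_HEARTBEATS:
        alerts.append(("heartbeat", "offline"))
    if sync_lag_s > 5 * 60:
        alerts.append(("sync", "warning"))
    if disk_pct > 90:
        alerts.append(("disk", "critical"))
    if memory_pct > 95:
        alerts.append(("memory", "warning"))
    if temp_c > 80:
        alerts.append(("temperature", "critical"))
    return alerts
```

Values exactly at a limit (for example 90% disk) do not alert here, matching the "over" wording in the table.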

Common Edge Issues

Issue	Runbook	Escalation
Offline	edge-offline-recovery	Location contact
Sync Failed	edge-sync-repair	SRE
High Load	edge-load-balance	None
Network Issues	edge-network-diag	Location IT

Security Operations

Security Responsibilities

Area	Ops Team Role
Access Reviews	Monthly review of all access
Secret Rotation	Quarterly secret rotation
Security Alerts	Triage and respond to SIEM alerts
Compliance	Support SOC 2 audits
Pen Testing	Coordinate annual pen tests

Security Incident Response

For security incidents:

1. Do NOT discuss in public channels
2. Page Security On-Call immediately
3. Use encrypted channel: #sec-incident-{date}
4. Preserve evidence (no cleanup without approval)
5. Follow Security Incident Runbook

Access Request Process

Access Type	Approver	Duration
Read-only production	Manager	Permanent
Write production	Manager + Security	24 hours
Customer data	VP Ops + Legal	Per-incident
Database admin	CTO	4 hours max

Capacity Planning

Capacity Reviews

Review	Frequency	Attendees
Weekly Capacity	Every Monday	Ops team
Monthly Planning	First week	Ops + Eng leads
Quarterly Forecast	Start of quarter	Ops + Finance

Scaling Triggers

Condition	Duration	Action
CPU >70%	Sustained 10min	Auto-scale up
Memory >80%	Sustained 5min	Alert + scale
Disk >80%	Any	Alert + expand
Connections >80%	Sustained 5min	Scale + alert
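
Most triggers in the table fire only when a condition is sustained over a period. One way to express "sustained" is to require that every sample in the most recent window exceeds the limit. The sketch below assumes one sample per minute; the function name and that sampling interval are assumptions, not a description of the autoscaler.

```python
from typing import Sequence


def sustained_above(samples: Sequence[float], threshold: float, window: int) -> bool:
    """True if the most recent `window` samples all exceed `threshold`."""
    if window <= 0 or len(samples) < window:
        return False   # not enough history to call it sustained
    return all(value > threshold for value in samples[-window:])


# With one sample per minute, the CPU rule (over 70% for 10 minutes) becomes:
cpu_history = [72, 75, 71, 80, 78, 74, 73, 77, 79, 76]
should_scale_up = sustained_above(cpu_history, threshold=70, window=10)
```

The disk rule, which has no duration, corresponds to `window=1` in this scheme.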

Cost Management

Budget Category	Monthly Budget	Owner
GCP Compute	$45,000	Ops Manager
GCP Database	$25,000	Ops Manager
Cloudflare	$15,000	Ops Manager
Monitoring	$5,000	SRE Lead

Tools & Access

Required Tools

Tool	Purpose	Setup
Cockpit	Primary ops console	SSO
GCP Console	Infrastructure	SSO + MFA
Cloudflare Dashboard	Edge & DNS	SSO + MFA
PagerDuty	Paging (legacy, migration in progress)	SSO
Slack	Communication	SSO
1Password	Secrets	Team vault

CLI Tools

# Required CLI tools
gcloud      # GCP CLI
wrangler    # Cloudflare Workers
kubectl     # Kubernetes (edge clusters)
terraform   # Infrastructure as code
olympus-cli # Internal ops CLI

# Setup
./scripts/ops-setup.sh

Useful Commands

# Check service health
olympus-cli health all

# Get on-call info
olympus-cli oncall who platform

# View active incidents
olympus-cli incidents active

# Deploy status
olympus-cli deploy status

# Edge server status
olympus-cli edge status --region us-west

# Database metrics
olympus-cli db metrics orders-db

Performance Expectations

SLOs

Metric	Target	Measurement
Uptime	99.99%	Monthly
MTTA	Under 5 min	P1 incidents
MTTR	Under 30 min	P1 incidents
Change Success Rate	Above 99%	Per quarter
Alert Accuracy	Above 95%	True positives
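
It can help to translate the uptime target into an allowed-downtime figure. The arithmetic below is a sketch that assumes a 30-day month and counts any downtime equally; the function name is hypothetical.

```python
def downtime_budget_minutes(target_pct: float, days: int = 30) -> float:
    """Minutes of downtime allowed in the period at the given uptime target."""
    total_minutes = days * 24 * 60          # 43,200 minutes in a 30-day month
    return total_minutes * (100.0 - target_pct) / 100.0


budget = round(downtime_budget_minutes(99.99), 2)
print(budget)  # 4.32
```

Under these assumptions, 99.99% monthly uptime leaves roughly 4.3 minutes of downtime per month, which is less than the 30-minute MTTR target for a single P1 if that incident is a full outage.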

Individual Metrics

Metric	Expectation
Response Time	Under 5 min for P1/P2
Incident Documentation	100% complete
Runbook Updates	Within 48h of incident
On-Call Handoffs	Zero dropped issues
Postmortem Participation	All assigned incidents

Career Development

Skills Matrix

Level	Technical	Leadership
NOC Analyst	L1 triage, monitoring	None
DevOps Engineer	CI/CD, automation	Project lead
SRE	Incident response, architecture	Team mentor
Senior SRE	Complex systems, design	Tech lead
Ops Manager	Strategy, planning	Team management

Training Requirements

Training	Frequency	Provider
GCP Professional	Annual cert	Google
Incident Commander	Quarterly drill	Internal
Security Awareness	Annual	Security team
On-Call Training	Before first shift	Buddy system

Runbook Index

Most Used Runbooks

Runbook	Category	Link
Incident Response	Process	/runbooks/incident-response
Service Restart	GCP	/runbooks/gcp-service-restart
Database Failover	Database	/runbooks/db-failover
Edge Recovery	Edge	/runbooks/edge-recovery
SSL Certificate	Security	/runbooks/ssl-renewal
Capacity Scale	Scaling	/runbooks/capacity-scale

Creating Runbooks

Every runbook must include:

- Title and ID
- When to use
- Prerequisites
- Step-by-step procedure
- Verification steps
- Rollback procedure
- Escalation path
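
The required-section list lends itself to an automated check before a runbook is published. The sketch below assumes a runbook is represented as a dictionary mapping section names to text; that representation and the function name are hypothetical.

```python
from typing import Dict, List

# Required sections, in the order listed in this handbook.
REQUIRED_SECTIONS = [
    "Title and ID",
    "When to use",
    "Prerequisites",
    "Step-by-step procedure",
    "Verification steps",
    "Rollback procedure",
    "Escalation path",
]


def missing_sections(runbook: Dict[str, str]) -> List[str]:
    """Return the required sections that are absent or empty."""
    return [
        name for name in REQUIRED_SECTIONS
        if not str(runbook.get(name, "")).strip()
    ]
```

An empty result means the runbook has text under every required section; it does not say anything about the quality of that text.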

Contacts & Escalation

Internal Contacts

Role	Primary	Backup
VP Operations	Alex Thompson	CTO
Ops Manager	Jordan Lee	VP Ops
Security Lead	Sam Martinez	Ops Manager
Database Expert	Chris Wong	Senior SRE

External Contacts

Service	Support	Account Manager
GCP	Priority support	gcp-am@nebusai.com
Cloudflare	Enterprise support	cf-am@nebusai.com
Twilio	Support ticket	twilio-am@nebusai.com

Related Documentation

- Incident Response Runbook - IR procedures
- Deployment Guide - Deploy process
- Edge Infrastructure - Edge servers
