Operations Team Handbook

Comprehensive guide for NebusAI Operations team members.

Team Mission

The Operations team ensures Olympus Cloud platform reliability, security, and performance. We maintain 99.99% uptime and sub-second response times, and we enable engineering teams to deploy with confidence.

Core Responsibilities

| Area | Responsibility |
|------|----------------|
| On-Call | 24/7 incident response and resolution |
| Infrastructure | GCP, Cloudflare, and edge server management |
| Deployment | CI/CD pipeline and release management |
| Monitoring | Observability, alerting, and AIOps |
| Security | Security operations and compliance |
| SRE | Reliability engineering and capacity planning |

Team Structure

Roles

| Role | Focus | On-Call |
|------|-------|---------|
| VP Operations | Strategy, escalation | P1 only |
| Operations Manager | Team lead, scheduling | Weekly backup |
| Senior SRE | Complex incidents, architecture | Primary rotation |
| SRE | Incident response, automation | Primary rotation |
| DevOps Engineer | CI/CD, tooling | Secondary rotation |
| NOC Analyst | L1 monitoring, triage | 24/7 coverage |

Team Distribution

Operations Team (12 members)
├── US West (SF) - 4
│   └── Primary: Mon-Fri 6AM-6PM PT
├── US East (NYC) - 3
│   └── Primary: Mon-Fri 9AM-9PM ET
├── EU (London) - 3
│   └── Primary: Mon-Fri 9AM-9PM GMT
└── APAC (Singapore) - 2
    └── Primary: Mon-Fri 9AM-9PM SGT

On-Call Operations

Rotation Schedule

| Schedule | Duration | Coverage |
|----------|----------|----------|
| Primary | 1 week | All incidents |
| Secondary | 1 week | Escalation backup |
| NOC | 8-hour shifts | L1 triage, monitoring |

On-Call Expectations

When on-call:

  1. Response Time

    • P1: Acknowledge within 5 minutes
    • P2: Acknowledge within 15 minutes
    • P3: Acknowledge within 1 hour
  2. Availability

    • Phone accessible 24/7
    • Laptop reachable within 15 minutes
    • VPN/network access confirmed
    • Runbook access verified
  3. Handoff Requirements

    • Document all open issues
    • Update incident notes
    • Briefing call with next on-call
    • Clear escalation status
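
The acknowledgement targets in the Response Time list can be encoded as a small lookup for paging or reporting scripts. The following is a minimal sketch; the function name `ack_deadline_minutes` is hypothetical and is not part of `olympus-cli` or any existing tool.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map a priority to its acknowledgement target in minutes.
# Values come from the Response Time list; no target is stated for P4.
ack_deadline_minutes() {
  case "$1" in
    P1) echo 5 ;;
    P2) echo 15 ;;
    P3) echo 60 ;;
    *)  echo "unknown priority: $1" >&2; return 1 ;;
  esac
}

# Example: a P2 page acknowledged after 12 minutes.
elapsed=12
if [ "$elapsed" -le "$(ack_deadline_minutes P2)" ]; then
  echo "within target"
else
  echo "missed target"
fi
```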

Override Requests

| Request | Approval | Notice |
|---------|----------|--------|
| PTO coverage | Self-serve | 72 hours |
| Shift swap | Peer + Manager | 48 hours |
| Emergency | Manager | Immediate |
| Holiday | Auto-scheduled | 30 days |

Incident Management

Incident Severity

| Severity | Definition | Response |
|----------|------------|----------|
| P1-Critical | Platform outage, data at risk | All hands, bridge |
| P2-High | Major feature down | Primary + secondary |
| P3-Medium | Degraded performance | Primary |
| P4-Low | Minor issue | Next business day |

Incident Lifecycle

┌──────────────────────────────────────────────────────────────────┐
│                        INCIDENT LIFECYCLE                        │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  1. DETECTION                                                    │
│  ────────────────────────────────────────────────────────────    │
│  • Automated alert fires                                         │
│  • AIOps Engine performs initial triage                          │
│  • On-call paged if AI cannot resolve                            │
│                                                                  │
│  2. TRIAGE (First 5 minutes)                                     │
│  ────────────────────────────────────────────────────────────    │
│  • Acknowledge alert                                             │
│  • Assess impact and scope                                       │
│  • Determine severity level                                      │
│  • Start incident channel (if P1/P2)                             │
│                                                                  │
│  3. INVESTIGATION                                                │
│  ────────────────────────────────────────────────────────────    │
│  • Query logs and metrics                                        │
│  • Check related alerts                                          │
│  • Review recent changes                                         │
│  • Consult runbooks                                              │
│                                                                  │
│  4. MITIGATION                                                   │
│  ────────────────────────────────────────────────────────────    │
│  • Execute runbook actions                                       │
│  • Apply temporary fixes                                         │
│  • Communicate status                                            │
│  • Monitor for improvement                                       │
│                                                                  │
│  5. RESOLUTION                                                   │
│  ────────────────────────────────────────────────────────────    │
│  • Confirm issue resolved                                        │
│  • Close incident                                                │
│  • Schedule postmortem (P1/P2)                                   │
│  • Update documentation                                          │
│                                                                  │
│  6. POSTMORTEM (Within 72 hours)                                 │
│  ────────────────────────────────────────────────────────────    │
│  • Document timeline                                             │
│  • Identify root cause                                           │
│  • Create action items                                           │
│  • Share learnings                                               │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘

P1 Incident Procedure

For P1 Critical incidents:

  1. Immediate (0-5 min)

    • Acknowledge alert
    • Start Slack incident channel: #incident-YYYY-MM-DD-{brief}
    • Page secondary and manager
    • Post initial status in channel
  2. Triage (5-15 min)

    • Assign Incident Commander (IC)
    • IC posts situation report
    • Begin investigation
    • Draft customer communication if needed
  3. Bridge Call (if needed)

    • Start Google Meet: "Incident Bridge"
    • IC runs the call
    • 15-minute status updates
    • All actions logged in Slack
  4. Resolution

    • Confirm metrics return to normal
    • IC declares incident resolved
    • Post final update
    • Schedule postmortem
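
The channel naming pattern `#incident-YYYY-MM-DD-{brief}` from step 1 can be generated consistently with a small helper. This is a sketch only: the function name `incident_channel` and the slug rules (lowercase, hyphen-separated) are assumptions, not a documented convention.

```bash
#!/usr/bin/env bash
# Hypothetical helper: build a channel name in the #incident-YYYY-MM-DD-{brief} format.
# Assumed slug rules: lowercase, runs of non-alphanumerics become one hyphen.
incident_channel() {
  local day="$1" brief="$2" slug
  slug=$(printf '%s' "$brief" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -E -e 's/[^a-z0-9]+/-/g' -e 's/^-//' -e 's/-$//')
  printf '#incident-%s-%s\n' "$day" "$slug"
}

# Example: prints #incident-2025-01-15-api-latency-spike
incident_channel "2025-01-15" "API latency spike"
```

Passing the date explicitly keeps the helper deterministic; a caller could supply `$(date +%F)` for the current day.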

Communication Templates

Initial Status (Internal):

INCIDENT: [Brief description]
SEVERITY: P1/P2
STATUS: Investigating/Identified/Mitigating
IMPACT: [User impact description]
IC: [Name]
NEXT UPDATE: [Time]
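
To avoid retyping the internal template during an incident, it can be wrapped in a shell function. The sketch below is illustrative; `internal_status` is a hypothetical name and the argument order is an assumption.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the internal status template with its fields filled in.
# Arguments: description, severity, status, impact, IC name, next update time.
internal_status() {
  cat <<EOF
INCIDENT: $1
SEVERITY: $2
STATUS: $3
IMPACT: $4
IC: $5
NEXT UPDATE: $6
EOF
}

internal_status "Elevated API errors" "P1" "Investigating" \
  "Some requests are failing" "On-call SRE" "14:30 PT"
```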

Customer Communication (via Status Page):

Title: [Service] - [Brief issue]
Status: Investigating

We are currently investigating [issue description].
Some users may experience [impact].
We will provide an update within [timeframe].

Infrastructure Management

Platform Overview

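| Component | Provider | Region | Purpose |
|-----------|----------|--------|---------|
| Cloud Run | GCP | us-central1 | API services |
| Spanner | GCP | multi-region | Database |
| Workers | Cloudflare | Global | Edge compute |
| R2 | Cloudflare | Global | Object storage |
| Edge Servers | On-premise | Per-location | OlympusEdge |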

Access Management

| System | Access Method | Approval |
|--------|---------------|----------|
| GCP Console | SSO + MFA | Role-based |
| Cloudflare | SSO + MFA | Role-based |
| Production DB | Breakglass | Manager approval |
| Customer Data | Audit-logged | Per-incident |

Infrastructure Runbooks

| Runbook | When to Use |
|---------|-------------|
| gcp-service-restart | Cloud Run service unresponsive |
| spanner-hotspot | Database hotspot detected |
| worker-redeploy | Edge worker issues |
| cache-flush | Cache corruption suspected |
| dns-failover | Regional DNS issues |
| edge-server-recovery | OlympusEdge server offline |

Monitoring & Observability

Dashboards

| Dashboard | URL | Purpose |
|-----------|-----|---------|
| Platform Overview | cockpit.olympuscloud.ai | Health summary |
| Service Health | /dashboards/services | Per-service metrics |
| Edge Status | /dashboards/edge | Edge server health |
| Database | /dashboards/spanner | Spanner metrics |
| Cost | /dashboards/costs | Cloud spending |

Key Metrics

| Metric | SLO | Alert Threshold |
|--------|-----|-----------------|
| API Latency (p99) | Under 500ms | Over 800ms |
| Error Rate | Under 0.1% | Over 0.5% |
| Availability | 99.99% | Any outage |
| Edge Sync Lag | Under 30s | Over 2min |
| Database Latency | Under 50ms | Over 100ms |
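
Each metric has two boundaries: the SLO and a looser alert threshold. The sketch below classifies an API latency sample against both. The function name and the state labels (`ok`, `slo-breach`, `alert`) are assumptions for illustration.

```bash
#!/usr/bin/env bash
# Classify a p99 API latency sample (whole milliseconds) against the
# SLO (under 500 ms) and the alert threshold (over 800 ms).
# State names are hypothetical labels, not values used by any existing tool.
classify_api_latency() {
  local p99_ms="$1"
  if [ "$p99_ms" -gt 800 ]; then
    echo "alert"        # past the alert threshold
  elif [ "$p99_ms" -ge 500 ]; then
    echo "slo-breach"   # not under 500 ms, but no alert yet
  else
    echo "ok"
  fi
}

classify_api_latency 620   # prints slo-breach
```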

AIOps Oversight

The AIOps Engine handles L1 incidents automatically. Your responsibilities:

| Responsibility | Action |
|----------------|--------|
| Review AI decisions | Check daily AI resolution report |
| Tune thresholds | Adjust based on false positive rate |
| Update runbooks | AI uses runbooks for remediation |
| Approve high-risk | AI requests approval for risky actions |
| Train models | Provide feedback on AI decisions |

Deployment Operations

Release Process

┌──────────────────────────────────────────────────────────────────┐
│                         RELEASE PIPELINE                         │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  1. ENGINEERING                                                  │
│     └── PR merged to develop                                     │
│                                                                  │
│  2. STAGING DEPLOY (Automatic)                                   │
│     ├── All tests pass                                           │
│     ├── Deploy to staging.olympuscloud.ai                        │
│     └── Smoke tests run                                          │
│                                                                  │
│  3. OPS VALIDATION (Manual Gate)                                 │
│     ├── Review deployment metrics                                │
│     ├── Check error rates                                        │
│     └── Approve for production                                   │
│                                                                  │
│  4. PRODUCTION DEPLOY (Gradual)                                  │
│     ├── 10% canary (5 min wait)                                  │
│     ├── 25% rollout (5 min wait)                                 │
│     ├── 50% rollout (5 min wait)                                 │
│     └── 100% rollout                                             │
│                                                                  │
│  5. POST-DEPLOY                                                  │
│     ├── Monitor for 30 minutes                                   │
│     ├── Auto-rollback if errors spike                            │
│     └── Close deployment ticket                                  │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
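
The gradual production rollout in step 4 is a fixed sequence of traffic percentages with a wait between steps. The sketch below only prints that plan; it does not shift traffic, and `rollout_plan` is a hypothetical name rather than part of the real pipeline.

```bash
#!/usr/bin/env bash
# Print the gradual rollout plan: each traffic percentage and what follows it.
# Stage values and waits are taken from the release pipeline diagram.
rollout_plan() {
  local stages=(10 25 50 100) wait_min=5 i last
  last=$(( ${#stages[@]} - 1 ))
  for i in "${!stages[@]}"; do
    if [ "$i" -lt "$last" ]; then
      echo "${stages[$i]}% -> wait ${wait_min} min"
    else
      echo "${stages[$i]}% -> monitor 30 min"
    fi
  done
}

rollout_plan
```

With three 5-minute waits, a release reaches 100% of traffic about 15 minutes after the canary starts, assuming no step is held or rolled back.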

Deployment Windows

| Window | Time (PT) | Use |
|--------|-----------|-----|
| Regular | Tue-Thu 10AM-2PM | Standard deploys |
| Emergency | Any time | P1 fixes |
| Off-peak | Tue-Thu 2AM-4AM | Database migrations |
| Frozen | Fri 2PM - Mon 10AM | No deploys |
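
A pre-deploy check can classify a given time against these windows. The sketch below takes the day of week and hour as arguments, both already converted to Pacific Time, so it has no timezone logic of its own. The function name and the `none` label for times outside every listed window are assumptions. Emergency deploys are not modelled because they are allowed at any time.

```bash
#!/usr/bin/env bash
# Classify a Pacific Time moment against the deployment windows.
# dow: 1=Mon ... 7=Sun (the numbering used by date +%u); hour: 0-23.
deploy_window() {
  local dow="$1" hour="$2"
  if { [ "$dow" -eq 5 ] && [ "$hour" -ge 14 ]; } \
     || [ "$dow" -ge 6 ] \
     || { [ "$dow" -eq 1 ] && [ "$hour" -lt 10 ]; }; then
    echo "frozen"                      # Fri 2PM through Mon 10AM
  elif [ "$dow" -ge 2 ] && [ "$dow" -le 4 ]; then
    if [ "$hour" -ge 10 ] && [ "$hour" -lt 14 ]; then
      echo "regular"                   # Tue-Thu 10AM-2PM
    elif [ "$hour" -ge 2 ] && [ "$hour" -lt 4 ]; then
      echo "off-peak"                  # Tue-Thu 2AM-4AM
    else
      echo "none"
    fi
  else
    echo "none"
  fi
}

deploy_window 3 11   # Wednesday 11AM: prints regular
```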

Rollback Procedure

  1. Automatic: Triggered if the error rate exceeds 1% in the first 5 minutes after a deploy
  2. Manual: Run ./scripts/rollback.sh <version>
  3. Database: Use ./scripts/db-rollback.sh (requires approval)
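
The automatic rollback condition can be expressed with integer arithmetic, which avoids floating point in shell. This is a sketch under stated assumptions: `should_auto_rollback` is a hypothetical name, and the real trigger lives in the deployment pipeline, whose inputs may differ.

```bash
#!/usr/bin/env bash
# Succeed (exit 0) when an automatic rollback would be warranted:
# error rate above 1% within the first 5 minutes after a deploy.
# Arguments: error count, total request count, minutes since deploy.
should_auto_rollback() {
  local errors="$1" requests="$2" minutes_since_deploy="$3"
  [ "$requests" -gt 0 ] || return 1   # no traffic, nothing to judge
  # errors / requests > 1%  is equivalent to  errors * 100 > requests
  [ "$minutes_since_deploy" -le 5 ] && [ $(( errors * 100 )) -gt "$requests" ]
}

if should_auto_rollback 20 1000 3; then
  echo "rollback"
else
  echo "keep"
fi
```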

Edge Server Operations

OlympusEdge Fleet

| Region | Servers | Status Dashboard |
|--------|---------|------------------|
| US West | 450 | /edge/us-west |
| US East | 380 | /edge/us-east |
| EU | 120 | /edge/eu |
| APAC | 80 | /edge/apac |

Edge Health Checks

| Check | Frequency | Alert |
|-------|-----------|-------|
| Heartbeat | 30s | 3 missed = offline |
| Sync Status | 1min | over 5min lag = warning |
| Disk Space | 5min | over 90% = critical |
| Memory | 1min | over 95% = warning |
| Temperature | 5min | over 80°C = critical |
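
With a 30-second heartbeat and an offline threshold of 3 missed beats, a server is marked offline once 90 seconds pass without a heartbeat. The sketch below shows that arithmetic. The function name is hypothetical, and counting missed beats as elapsed seconds divided by the interval is an assumption about how the monitor counts.

```bash
#!/usr/bin/env bash
# Derive an edge server state from the seconds elapsed since its last heartbeat.
# Interval (30s) and limit (3 missed) come from the Edge Health Checks table.
edge_heartbeat_state() {
  local seconds_since_last="$1" interval=30 max_missed=3
  local missed=$(( seconds_since_last / interval ))
  if [ "$missed" -ge "$max_missed" ]; then
    echo "offline"
  else
    echo "online"
  fi
}

edge_heartbeat_state 95   # prints offline
```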

Common Edge Issues

| Issue | Runbook | Escalation |
|-------|---------|------------|
| Offline | edge-offline-recovery | Location contact |
| Sync Failed | edge-sync-repair | SRE |
| High Load | edge-load-balance | None |
| Network Issues | edge-network-diag | Location IT |

Security Operations

Security Responsibilities

| Area | Ops Team Role |
|------|---------------|
| Access Reviews | Monthly review of all access |
| Secret Rotation | Quarterly secret rotation |
| Security Alerts | Triage and respond to SIEM alerts |
| Compliance | Support SOC 2 audits |
| Pen Testing | Coordinate annual pen tests |

Security Incident Response

For security incidents:

  1. Do NOT discuss in public channels
  2. Page Security On-Call immediately
  3. Use encrypted channel: #sec-incident-{date}
  4. Preserve evidence (no cleanup without approval)
  5. Follow Security Incident Runbook

Access Request Process

| Access Type | Approver | Duration |
|-------------|----------|----------|
| Read-only production | Manager | Permanent |
| Write production | Manager + Security | 24 hours |
| Customer data | VP Ops + Legal | Per-incident |
| Database admin | CTO | 4 hours max |

Capacity Planning

Capacity Reviews

| Review | Frequency | Attendees |
|--------|-----------|-----------|
| Weekly Capacity | Every Monday | Ops team |
| Monthly Planning | First week | Ops + Eng leads |
| Quarterly Forecast | Start of quarter | Ops + Finance |

Scaling Triggers

| Metric | Threshold | Action |
|--------|-----------|--------|
| CPU >70% | Sustained 10min | Auto-scale up |
| Memory >80% | Sustained 5min | Alert + scale |
| Disk >80% | Any | Alert + expand |
| Connections >80% | Sustained 5min | Scale + alert |
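
A "sustained" trigger fires only when every sample in the window exceeds the threshold, not when a single spike does. The sketch below checks the most recent N samples. It assumes one integer percentage sample per minute, and `sustained_above` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Succeed when the last <window> samples are all above <threshold>.
# Usage: sustained_above <threshold> <window> <sample>...
sustained_above() {
  local threshold="$1" window="$2"
  shift 2
  [ "$#" -ge "$window" ] || return 1      # not enough data for a full window
  local sample
  for sample in "${@: -$window}"; do      # the most recent <window> samples
    [ "$sample" -gt "$threshold" ] || return 1
  done
  return 0
}

# Example: CPU over 70% for the last 3 samples (a shortened window for brevity).
if sustained_above 70 3 50 80 75 90; then
  echo "scale up"
fi
```

For the CPU row, the call would use a window of 10 with per-minute samples; for the memory and connections rows, a threshold of 80 and a window of 5.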

Cost Management

| Budget Category | Monthly Budget | Owner |
|-----------------|----------------|-------|
| GCP Compute | $45,000 | Ops Manager |
| GCP Database | $25,000 | Ops Manager |
| Cloudflare | $15,000 | Ops Manager |
| Monitoring | $5,000 | SRE Lead |
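
The four categories sum to $90,000 per month. For a quick spend check during capacity reviews, the budgets can be kept in a lookup and compared with month-to-date spend. The sketch below is illustrative only; the category keys and function names are assumptions, and the spend figure in the example is made up.

```bash
#!/usr/bin/env bash
# Monthly budget per category in whole dollars, from the Cost Management table.
# Category keys are hypothetical identifiers chosen for this sketch.
monthly_budget() {
  case "$1" in
    gcp-compute)  echo 45000 ;;
    gcp-database) echo 25000 ;;
    cloudflare)   echo 15000 ;;
    monitoring)   echo 5000 ;;
    *)            return 1 ;;
  esac
}

# Print the share of budget used as a whole percentage (rounded down).
budget_used_percent() {
  local budget
  budget=$(monthly_budget "$1") || return 1
  echo $(( $2 * 100 / budget ))
}

budget_used_percent cloudflare 12000   # prints 80
```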

Tools & Access

Required Tools

| Tool | Purpose | Setup |
|------|---------|-------|
| Cockpit | Primary ops console | SSO |
| GCP Console | Infrastructure | SSO + MFA |
| Cloudflare Dashboard | Edge & DNS | SSO + MFA |
| PagerDuty | Legacy (migrating) | SSO |
| Slack | Communication | SSO |
| 1Password | Secrets | Team vault |

CLI Tools

# Required CLI tools
gcloud       # GCP CLI
wrangler     # Cloudflare Workers
kubectl      # Kubernetes (edge clusters)
terraform    # Infrastructure as code
olympus-cli  # Internal ops CLI

# Setup
./scripts/ops-setup.sh

Useful Commands

# Check service health
olympus-cli health all

# Get on-call info
olympus-cli oncall who platform

# View active incidents
olympus-cli incidents active

# Deploy status
olympus-cli deploy status

# Edge server status
olympus-cli edge status --region us-west

# Database metrics
olympus-cli db metrics orders-db

Performance Expectations

SLOs

| Metric | Target | Measurement |
|--------|--------|-------------|
| Uptime | 99.99% | Monthly |
| MTTA | Under 5 min | P1 incidents |
| MTTR | Under 30 min | P1 incidents |
| Change Success Rate | Above 99% | Per quarter |
| Alert Accuracy | Above 95% | True positives |
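
The 99.99% monthly uptime target translates into a concrete downtime allowance. The sketch below computes it in whole seconds. The function name is hypothetical, and treating every second of the month as in scope (no maintenance exclusions) is an assumption.

```bash
#!/usr/bin/env bash
# Allowed downtime, in whole seconds, for a 99.99% availability target
# over a month of the given length in days.
allowed_downtime_seconds() {
  local days="$1"
  local total_seconds=$(( days * 24 * 3600 ))
  # 99.99% availability leaves 1 part in 10,000 as the error budget.
  echo $(( total_seconds / 10000 ))
}

allowed_downtime_seconds 30   # prints 259
```

A 30-day month therefore allows about 4 minutes 19 seconds of downtime in total.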

Individual Metrics

| Metric | Expectation |
|--------|-------------|
| Response Time | Under 5 min for P1/P2 |
| Incident Documentation | 100% complete |
| Runbook Updates | Within 48h of incident |
| On-Call Handoffs | Zero dropped issues |
| Postmortem Participation | All assigned incidents |

Career Development

Skills Matrix

| Level | Technical | Leadership |
|-------|-----------|------------|
| NOC Analyst | L1 triage, monitoring | None |
| DevOps Engineer | CI/CD, automation | Project lead |
| SRE | Incident response, architecture | Team mentor |
| Senior SRE | Complex systems, design | Tech lead |
| Ops Manager | Strategy, planning | Team management |

Training Requirements

| Training | Frequency | Provider |
|----------|-----------|----------|
| GCP Professional | Annual cert | Google |
| Incident Commander | Quarterly drill | Internal |
| Security Awareness | Annual | Security team |
| On-Call Training | Before first shift | Buddy system |

Runbook Index

Most Used Runbooks

| Runbook | Category | Link |
|---------|----------|------|
| Incident Response | Process | /runbooks/incident-response |
| Service Restart | GCP | /runbooks/gcp-service-restart |
| Database Failover | Database | /runbooks/db-failover |
| Edge Recovery | Edge | /runbooks/edge-recovery |
| SSL Certificate | Security | /runbooks/ssl-renewal |
| Capacity Scale | Scaling | /runbooks/capacity-scale |

Creating Runbooks

Every runbook must include:

  1. Title and ID
  2. When to use
  3. Prerequisites
  4. Step-by-step procedure
  5. Verification steps
  6. Rollback procedure
  7. Escalation path
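
The required sections can be checked mechanically before a runbook is merged. The sketch below assumes each runbook is a text file containing the section names exactly as listed above (matched case-insensitively). `check_runbook` is a hypothetical name, and "Title and ID" is left to manual review because it has no fixed wording to search for.

```bash
#!/usr/bin/env bash
# Report required runbook sections that are missing from a file.
# Exits 0 when all searched sections are present, 1 otherwise.
check_runbook() {
  local file="$1" missing=0 section
  for section in "When to use" "Prerequisites" "Step-by-step procedure" \
                 "Verification steps" "Rollback procedure" "Escalation path"; do
    # -F: fixed string, -i: ignore case, -q: quiet
    if ! grep -qiF -- "$section" "$file"; then
      echo "missing: $section"
      missing=1
    fi
  done
  return "$missing"
}

# Usage (path is illustrative):
#   check_runbook runbooks/gcp-service-restart.md
```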

Contacts & Escalation

Internal Contacts

| Role | Primary | Backup |
|------|---------|--------|
| VP Operations | Alex Thompson | CTO |
| Ops Manager | Jordan Lee | VP Ops |
| Security Lead | Sam Martinez | Ops Manager |
| Database Expert | Chris Wong | Senior SRE |

External Contacts

| Service | Support | Account Manager |
|---------|---------|-----------------|
| GCP | Priority support | gcp-am@nebusai.com |
| Cloudflare | Enterprise support | cf-am@nebusai.com |
| Twilio | Support ticket | twilio-am@nebusai.com |