On-Call Guide

Complete guide for on-call engineers at Olympus Cloud.

Overview

On-call engineers are the first line of defense for production issues. This guide covers expectations, tools, and procedures for effective on-call shifts.

On-Call Philosophy

  • Alert fatigue is real - We tune alerts to be actionable
  • Blameless culture - Focus on learning, not blame
  • Sustainable on-call - Reasonable workload and compensation
  • Knowledge sharing - Document everything for future on-call engineers

On-Call Rotation

Schedule Structure

| Rotation | Coverage | Duration |
|----------|----------|----------|
| Primary | 24/7 | 1 week |
| Secondary | Backup | 1 week |
| Incident Commander | SEV-1/SEV-2 | 1 week |

Current Rotations

All rotations are managed in PagerDuty (olympus.pagerduty.com).

Handoff Procedure

Beginning of Shift

  1. Review open incidents and recent alerts
  2. Check for any ongoing issues
  3. Read handoff notes from previous on-call
  4. Verify PagerDuty mobile app is working
  5. Confirm contact information is current

End of Shift

  1. Document any ongoing issues
  2. Write handoff notes in #on-call-handoff
  3. Update any runbooks with new learnings
  4. Thank your relief :)

Handoff Template

## On-Call Handoff: [Date]

### Ongoing Issues
- [Issue 1]: Status, next steps

### Recent Incidents
- [Incident 1]: Brief summary, resolution

### Notable Alerts
- [Alert pattern]: Noise vs signal assessment

### Action Items
- [ ] Item needing follow-up

### Notes for Next On-Call
- Any tips, warnings, or context

Tools and Access

Required Tools

| Tool | Purpose | Access |
|------|---------|--------|
| PagerDuty | Alerting | olympus.pagerduty.com |
| Slack | Communication | #incident-response, #on-call |
| GCP Console | Cloud infrastructure | console.cloud.google.com |
| Cloud Monitoring | Metrics, dashboards | Monitoring |
| Cloud Logging | Log analysis | Logging |
| Cloudflare | Edge, WAF | dash.cloudflare.com |
| GitHub | Code, deployments | github.com/olympuscloud |

Mobile Setup

  1. PagerDuty Mobile App

    • Install on personal phone
    • Enable push notifications
    • Test alert delivery
  2. Slack Mobile App

    • Enable notifications for #incident-response
    • Star #on-call channel
  3. Google Authenticator

    • Required for GCP access
    • Backup codes stored securely

Access Verification

Before your on-call shift, verify:

# GCP access
gcloud config get-value project
gcloud run services list --region=us-central1

# GitHub access
gh auth status

# Can view Cloudflare
curl -H "Authorization: Bearer $CF_TOKEN" \
  https://api.cloudflare.com/client/v4/user/tokens/verify
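
The checks above can be bundled into a small script that prints a PASS/FAIL line per credential, so a broken token is obvious before the shift starts rather than mid-incident. This is a sketch: the `check` helper is ours, and the command list simply mirrors the verifications above.

```shell
#!/usr/bin/env bash
# Pre-shift access check (sketch). Runs each verification command,
# suppresses its output, and reports PASS/FAIL per check.

check() {
  local name="$1"; shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS  $name"
  else
    echo "FAIL  $name"
  fi
}

check "GCP project set"  gcloud config get-value project
check "Cloud Run list"   gcloud run services list --region=us-central1
check "GitHub auth"      gh auth status
check "Cloudflare token" curl -fsS -H "Authorization: Bearer ${CF_TOKEN:-}" \
  https://api.cloudflare.com/client/v4/user/tokens/verify
```

Any FAIL line means fixing access now, while it is still a quiet afternoon.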

Alert Triage

Alert Classification

┌──────────────────────────────────────────────┐
│ Alert Arrives                                │
├──────────────────────────────────────────────┤
│ 1. Read alert details                        │
│    └── Service, metric, threshold, duration  │
│ 2. Check dashboard                           │
│    └── Is this real? False positive?         │
│ 3. Assess impact                             │
│    └── Users affected? Data at risk?         │
│ 4. Determine action                          │
│    └── Acknowledge, snooze, or escalate      │
└──────────────────────────────────────────────┘

Response Expectations

**Danger:** P1 alerts require acknowledgment within 5 minutes, 24/7. If you cannot respond in time, immediately escalate to secondary on-call via PagerDuty. Unacknowledged P1 alerts auto-escalate after 5 minutes and trigger management notification after 15 minutes.

| Priority | Response Time | Action |
|----------|---------------|--------|
| P1 (Critical) | < 5 minutes | Acknowledge immediately, begin investigation |
| P2 (High) | < 15 minutes | Acknowledge, triage within business hours |
| P3 (Medium) | < 1 hour | Acknowledge, may defer to next day |
| P4 (Low) | < 4 hours | Acknowledge, queue for review |
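
The response-time table can be encoded as a tiny helper for scripts that annotate or route alerts. This is purely illustrative; `ack_deadline` is our own name, not part of any official tooling.

```shell
# Map an alert priority to its acknowledgment deadline, mirroring the
# response expectations table. (Illustrative sketch only.)
ack_deadline() {
  case "$1" in
    P1) echo "5 minutes"  ;;
    P2) echo "15 minutes" ;;
    P3) echo "1 hour"     ;;
    P4) echo "4 hours"    ;;
    *)  echo "unknown priority: $1" >&2; return 1 ;;
  esac
}
```

For example, `ack_deadline P1` prints `5 minutes`; an unknown priority returns a non-zero status so callers can fail loudly.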

Common Alert Patterns

High Error Rate

  1. Check recent deployments
  2. View error logs
  3. Consider rollback
  4. See: Incident Response
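
The first two steps can be scripted as functions to source into your shell. These are sketches: the function names are ours, and the service/region arguments are placeholders for whatever the alert names.

```shell
# High-error-rate first pass (sketch; source these into your shell).

# 1. Did a deploy land just before the alert? List recent revisions.
recent_revisions() {
  gcloud run revisions list --service="$1" --region="${2:-us-central1}" --limit=5
}

# 2. Sample recent ERROR-severity logs for the service.
recent_errors() {
  gcloud logging read \
    "resource.type=cloud_run_revision AND resource.labels.service_name=$1 AND severity>=ERROR" \
    --limit=20 --freshness=30m
}
```

Usage: `recent_revisions api-gateway` then `recent_errors api-gateway`; if the errors start right after a new revision, rollback is the likely next move.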

High Latency

  1. Check database metrics
  2. Review external dependencies
  3. Check for traffic spike
  4. See: Database Operations
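
For the database check, two quick commands cover the common causes. A sketch, wrapped as functions; the instance name matches the one in the Key Commands section.

```shell
# Database-side latency checks (sketch; source into your shell).

# Instance state -- RUNNABLE is the healthy state.
db_state() {
  gcloud sql instances describe olympus-pg-prod --format="value(state)"
}

# Recent operations (maintenance, failover) that could explain latency.
db_recent_ops() {
  gcloud sql operations list --instance=olympus-pg-prod --limit=5
}
```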

Resource Exhaustion

  1. Check CPU/memory utilization
  2. Scale up if needed
  3. Investigate cause
  4. See: Scaling Runbook
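
"Scale up if needed" for a Cloud Run service usually means raising resource limits or instance counts. A hedged sketch; the limit values below are illustrative, not policy, so pick numbers that match the capacity plan in the Scaling Runbook.

```shell
# Emergency scale-up for a Cloud Run service (sketch; limits are
# illustrative examples, not recommended values).
scale_up() {
  local svc="$1"
  gcloud run services update "$svc" --region=us-central1 \
    --memory=1Gi --cpu=2 --max-instances=20
}
```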

Certificate Expiry

  1. Check cert expiration dates
  2. Renew if within 30 days
  3. Usually auto-renewed
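
Step 1 can be done from any shell with openssl: connect to the edge and print the certificate's `notAfter` date. A sketch; the function name is ours, and the domain is whatever the alert names.

```shell
# Print the expiry date of the certificate served at $1:443 (sketch).
cert_enddate() {
  echo | openssl s_client -servername "$1" -connect "$1:443" 2>/dev/null \
    | openssl x509 -noout -enddate
}
```

Checking from the client side confirms what users actually see, which matters when a renewed certificate has not yet been deployed to the edge.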

Daily Responsibilities

Morning Check (< 15 minutes)

  1. Review overnight alerts

    • Any P1/P2 that were resolved?
    • Noisy alerts to tune?
  2. Check system health

    • Main dashboard green?
    • Any degraded services?
  3. Review deployments

    • Any deployments happening today?
    • Potential for issues?
  4. Post status

    Good morning! On-call status:
    - Systems: All green ✅
    - Overnight: 2 P4 alerts (acknowledged)
    - Deployments today: api-gateway v1.25.0 @ 2pm

Ongoing Vigilance

  • Keep Slack and PagerDuty visible
  • Be reachable within 5 minutes
  • If stepping away, notify in #on-call
  • Coordinate coverage for meals, appointments

End of Day Check

  1. Review all alerts from the day
  2. Note any patterns for handoff
  3. Document any unresolved items
  4. Update #on-call-handoff if needed

Escalation Procedures

When to Escalate

Escalate to Secondary On-Call:

  • You need a second opinion
  • Issue spans multiple systems
  • You need to step away
  • Investigation exceeds 30 minutes

Escalate to Incident Commander:

  • SEV-1 or SEV-2 confirmed
  • Customer-impacting outage
  • Data loss or security concern
  • You're uncertain about decisions

Escalate to Management:

  • Extended outage (> 2 hours)
  • Financial impact
  • Legal/compliance issues
  • External communication needed

How to Escalate

In PagerDuty:

  1. Open the incident
  2. Click "Escalate"
  3. Select target escalation policy
  4. Add escalation note

In Slack:

@secondary-oncall I need assistance with [issue]
Current status: [brief description]
Help needed: [specific ask]

Escalation Contacts

| Role | Method | When |
|------|--------|------|
| Secondary On-Call | PagerDuty | First escalation |
| Incident Commander | PagerDuty | SEV-1/SEV-2 |
| Engineering Manager | Slack/Phone | Extended outage |
| VP Engineering | Phone | Major incident |
| CEO | Phone | Company-wide impact |

Self-Care

Sustainable On-Call

  • Sleep: Don't sacrifice sleep for non-urgent alerts
  • Coverage: Ask for coverage if you need it
  • Breaks: Take them. Someone can cover for 30 min
  • Debrief: Talk about stressful incidents

When to Ask for Coverage

  • Medical appointments
  • Important personal events
  • Feeling unwell
  • After stressful incidents

Requesting Coverage

  1. Post in #on-call with dates/times needed
  2. Find someone willing to cover
  3. Update PagerDuty override
  4. Confirm handoff

Example request in #on-call:

Looking for coverage:
- Date: January 20, 2026
- Time: 6pm - 10pm
- Reason: Family event
Can anyone cover? I can swap for your shift.

After Your Shift

Reflect and Improve

  1. What alerts were noisy?

    • File tickets to tune thresholds
    • Add to noise reduction backlog
  2. What documentation was missing?

    • Update runbooks
    • Add to tribal knowledge
  3. What tools were frustrating?

    • File improvements
    • Discuss with team
  4. What went well?

    • Share wins in #on-call
    • Document good patterns

Incident Follow-Up

If you handled incidents during your shift:

  • Ensure a post-incident review (PIR) is scheduled (SEV-1/SEV-2)
  • Document any quick wins
  • Share learnings in team meeting

Quick Reference

Key Dashboards

| Dashboard | Link |
|-----------|------|
| Main Health | dashboard/main |
| Services | dashboard/services |
| Database | dashboard/database |
| Edge | cloudflare/analytics |

Key Commands

# Check service status
gcloud run services describe SERVICE --region=us-central1

# View logs
gcloud logging read "resource.type=cloud_run_revision" --limit=50

# Quick rollback
gcloud run services update-traffic SERVICE --to-revisions=REVISION=100

# Check database
gcloud sql instances describe olympus-pg-prod
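
The quick-rollback command above needs a revision name. A sketch that looks one up and shifts traffic in one step; it assumes the second-newest revision in the list is the last good one, which is worth eyeballing before running it for real.

```shell
# Roll traffic back to the previous revision of a service (sketch).
# Assumes `gcloud run revisions list` returns newest first and that
# the second entry is the last good revision -- verify before use.
rollback_prev() {
  local svc="$1" region="${2:-us-central1}" prev
  prev=$(gcloud run revisions list --service="$svc" --region="$region" \
           --format="value(metadata.name)" --limit=2 | tail -n 1)
  echo "Rolling $svc back to $prev"
  gcloud run services update-traffic "$svc" --region="$region" \
    --to-revisions="$prev=100"
}
```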

Emergency Contacts

| Contact | Method |
|---------|--------|
| GCP Support | 1-855-831-3592 |
| Cloudflare | Enterprise Portal |
| Engineering Manager | [In PagerDuty contacts] |