# On-Call Guide
Complete guide for on-call engineers at Olympus Cloud.
## Overview
On-call engineers are the first line of defense for production issues. This guide covers expectations, tools, and procedures for effective on-call shifts.
## On-Call Philosophy
- **Alert fatigue is real** - We tune alerts to be actionable
- **Blameless culture** - Focus on learning, not blame
- **Sustainable on-call** - Reasonable workload and compensation
- **Knowledge sharing** - Document everything for future on-call engineers
## On-Call Rotation
### Schedule Structure
| Rotation | Coverage | Duration |
|---|---|---|
| Primary | 24/7 | 1 week |
| Secondary | Backup | 1 week |
| Incident Commander | SEV-1/SEV-2 | 1 week |
### Current Rotations
All rotations are managed in PagerDuty.
## Handoff Procedure
### Beginning of Shift
- Review open incidents and recent alerts
- Check for any ongoing issues
- Read handoff notes from previous on-call
- Verify PagerDuty mobile app is working
- Confirm contact information is current
### End of Shift
- Document any ongoing issues
- Write handoff notes in #on-call-handoff
- Update any runbooks with new learnings
- Thank your relief :)
### Handoff Template

```markdown
## On-Call Handoff: [Date]

### Ongoing Issues
- [Issue 1]: Status, next steps

### Recent Incidents
- [Incident 1]: Brief summary, resolution

### Notable Alerts
- [Alert pattern]: Noise vs signal assessment

### Action Items
- [ ] Item needing follow-up

### Notes for Next On-Call
- Any tips, warnings, or context
```
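If you write handoffs often, a small helper can stamp the template with the current date so it is ready to paste into #on-call-handoff. This is a sketch, not tooling the team ships; adjust the sections to match the template in use.

```shell
#!/bin/sh
# Print a dated copy of the handoff template, ready to paste into
# #on-call-handoff. Sketch only; edit sections to taste.
handoff() {
  cat <<EOF
## On-Call Handoff: $(date +%Y-%m-%d)

### Ongoing Issues
- [Issue 1]: Status, next steps

### Recent Incidents
- [Incident 1]: Brief summary, resolution

### Notable Alerts
- [Alert pattern]: Noise vs signal assessment

### Action Items
- [ ] Item needing follow-up

### Notes for Next On-Call
- Any tips, warnings, or context
EOF
}

handoff
```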
## Tools and Access
### Required Tools
| Tool | Purpose | Access |
|---|---|---|
| PagerDuty | Alerting | olympus.pagerduty.com |
| Slack | Communication | #incident-response, #on-call |
| GCP Console | Cloud infrastructure | console.cloud.google.com |
| Cloud Monitoring | Metrics, dashboards | Monitoring |
| Cloud Logging | Log analysis | Logging |
| Cloudflare | Edge, WAF | dash.cloudflare.com |
| GitHub | Code, deployments | github.com/olympuscloud |
### Mobile Setup
- **PagerDuty Mobile App**
  - Install on personal phone
  - Enable push notifications
  - Test alert delivery
- **Slack Mobile App**
  - Enable notifications for #incident-response
  - Star the #on-call channel
- **Google Authenticator**
  - Required for GCP access
  - Backup codes stored securely
### Access Verification
Before your on-call shift, verify:

```shell
# GCP access
gcloud config get-value project
gcloud run services list --region=us-central1

# GitHub access
gh auth status

# Cloudflare token is valid
curl -H "Authorization: Bearer $CF_TOKEN" \
  https://api.cloudflare.com/client/v4/user/tokens/verify
```
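A small wrapper can run all of these checks in one pass and report PASS/FAIL for each instead of stopping at the first failure. This is a sketch that assumes only the commands listed above; the `check` helper is hypothetical, not shared tooling.

```shell
#!/bin/sh
# Pre-shift access check: run each verification command and report
# PASS/FAIL rather than exiting on the first failure.
check() {
  label=$1; shift                 # $1 = label, rest = command to run
  if "$@" >/dev/null 2>&1; then
    echo "PASS - $label"
  else
    echo "FAIL - $label"
  fi
}

check "GCP project set"  gcloud config get-value project
check "Cloud Run list"   gcloud run services list --region=us-central1
check "GitHub auth"      gh auth status
check "Cloudflare token" sh -c \
  'curl -sf -H "Authorization: Bearer $CF_TOKEN" https://api.cloudflare.com/client/v4/user/tokens/verify'
```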
## Alert Triage
### Alert Classification

```text
Alert Arrives
  1. Read alert details
     └── Service, metric, threshold, duration
  2. Check dashboard
     └── Is this real? False positive?
  3. Assess impact
     └── Users affected? Data at risk?
  4. Determine action
     └── Acknowledge, snooze, or escalate
```
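The decision in steps 2-4 can be sketched as a tiny helper. The inputs and suggested actions below are illustrative only, not policy; real triage still needs judgment.

```shell
#!/bin/sh
# Triage sketch: map "is it real?" and "are users affected?" to a
# suggested action (acknowledge / snooze / escalate). Illustrative only.
triage() {
  real=$1        # yes|no - confirmed on a dashboard?
  impact=$2      # yes|no - users affected or data at risk?
  if [ "$real" = "no" ]; then
    echo "snooze and file a tuning ticket"
  elif [ "$impact" = "yes" ]; then
    echo "acknowledge and escalate"
  else
    echo "acknowledge and investigate"
  fi
}

triage no no     # false positive
triage yes yes   # real and user-impacting
```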
### Response Expectations
P1 alerts require acknowledgment within 5 minutes, 24/7. If you cannot respond in time, immediately escalate to secondary on-call via PagerDuty. Unacknowledged P1 alerts auto-escalate after 5 minutes and trigger management notification after 15 minutes.
| Priority | Response Time | Action |
|---|---|---|
| P1 (Critical) | < 5 minutes | Acknowledge immediately, begin investigation |
| P2 (High) | < 15 minutes | Acknowledge, triage within business hours |
| P3 (Medium) | < 1 hour | Acknowledge, may defer to next day |
| P4 (Low) | < 4 hours | Acknowledge, queue for review |
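If you script reminders against the table above, the deadlines can be encoded in a small lookup. A sketch, with minutes mirroring the table; the `response_minutes` helper is hypothetical.

```shell
#!/bin/sh
# Maximum response time in minutes per alert priority (from the table).
response_minutes() {
  case "$1" in
    P1) echo 5   ;;
    P2) echo 15  ;;
    P3) echo 60  ;;
    P4) echo 240 ;;
    *)  echo "unknown priority: $1" >&2; return 1 ;;
  esac
}

response_minutes P2   # 15
```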
### Common Alert Patterns
#### High Error Rate
- Check recent deployments
- View error logs
- Consider rollback
- See: Incident Response
#### High Latency
- Check database metrics
- Review external dependencies
- Check for traffic spike
- See: Database Operations
#### Resource Exhaustion
- Check CPU/memory utilization
- Scale up if needed
- Investigate cause
- See: Scaling Runbook
#### Certificate Expiry
- Check cert expiration dates
- Renew if within 30 days
- Usually auto-renewed
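Days-to-expiry can be computed with plain epoch arithmetic; fetching the live certificate is shown commented out, since it needs network access (`HOST` is a placeholder, and the `date -d` conversion assumes GNU date).

```shell
#!/bin/sh
# Days until a certificate expires, from epoch timestamps.
days_until() {
  echo $(( ($1 - $2) / 86400 ))   # $1 = expiry epoch, $2 = now epoch
}

# Fetch the live cert's notAfter date (HOST is a placeholder):
# expiry=$(echo | openssl s_client -connect HOST:443 2>/dev/null \
#   | openssl x509 -noout -enddate | cut -d= -f2)
# days_until "$(date -d "$expiry" +%s)" "$(date +%s)"

days_until 1767225600 1764633600   # 30
```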
## Daily Responsibilities
### Morning Check (< 15 minutes)
- **Review overnight alerts**
  - Any P1/P2 that were resolved?
  - Noisy alerts to tune?
- **Check system health**
  - Main dashboard green?
  - Any degraded services?
- **Review deployments**
  - Any deployments happening today?
  - Potential for issues?
- **Post status** in #on-call, for example:

```text
Good morning! On-call status:
- Systems: All green ✅
- Overnight: 2 P4 alerts (acknowledged)
- Deployments today: api-gateway v1.25.0 @ 2pm
```
### Ongoing Vigilance
- Keep Slack and PagerDuty visible
- Be reachable within 5 minutes
- If stepping away, notify in #on-call
- Coordinate coverage for meals, appointments
### End of Day Check
- Review all alerts from the day
- Note any patterns for handoff
- Document any unresolved items
- Update #on-call-handoff if needed
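When reviewing the day's alerts, a one-line awk summary by priority makes patterns easy to spot for the handoff. The log format here is hypothetical (one alert per line, priority in the first column); adapt it to whatever export you actually have.

```shell
#!/bin/sh
# Summarize a day's alerts by priority. Assumes a simple export with
# the priority in the first column of each line (format hypothetical).
summarize() {
  awk '{ count[$1]++ } END { for (p in count) print p, count[p] }' "$@" | sort
}

# Example with sample data:
cat > /tmp/alerts.txt <<'EOF'
P4 disk-usage-warn api-gateway
P2 high-error-rate checkout
P4 disk-usage-warn api-gateway
EOF

summarize /tmp/alerts.txt
# P2 1
# P4 2
```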
## Escalation Procedures
### When to Escalate
**Escalate to Secondary On-Call:**
- You need a second opinion
- Issue spans multiple systems
- You need to step away
- Investigation exceeds 30 minutes
**Escalate to Incident Commander:**
- SEV-1 or SEV-2 confirmed
- Customer-impacting outage
- Data loss or security concern
- You're uncertain about decisions
**Escalate to Management:**
- Extended outage (> 2 hours)
- Financial impact
- Legal/compliance issues
- External communication needed
### How to Escalate
**In PagerDuty:**
1. Open the incident
2. Click "Escalate"
3. Select the target escalation policy
4. Add an escalation note
**In Slack:**

```text
@secondary-oncall I need assistance with [issue]
Current status: [brief description]
Help needed: [specific ask]
```
### Escalation Contacts
| Role | Method | When |
|---|---|---|
| Secondary On-Call | PagerDuty | First escalation |
| Incident Commander | PagerDuty | SEV-1/SEV-2 |
| Engineering Manager | Slack/Phone | Extended outage |
| VP Engineering | Phone | Major incident |
| CEO | Phone | Company-wide impact |
## Self-Care
### Sustainable On-Call
- **Sleep**: Don't sacrifice sleep for non-urgent alerts
- **Coverage**: Ask for coverage if you need it
- **Breaks**: Take them. Someone can cover for 30 min
- **Debrief**: Talk about stressful incidents
### When to Ask for Coverage
- Medical appointments
- Important personal events
- Feeling unwell
- After stressful incidents
### Requesting Coverage
- Post in #on-call with dates/times needed
- Find someone willing to cover
- Update PagerDuty override
- Confirm handoff
Example post:

```text
Looking for coverage:
- Date: January 20, 2026
- Time: 6pm - 10pm
- Reason: Family event
Can anyone cover? I can swap for your shift.
```
## After Your Shift
### Reflect and Improve
- **What alerts were noisy?**
  - File tickets to tune thresholds
  - Add to the noise-reduction backlog
- **What documentation was missing?**
  - Update runbooks
  - Write down tribal knowledge
- **What tools were frustrating?**
  - File improvement tickets
  - Discuss with the team
- **What went well?**
  - Share wins in #on-call
  - Document good patterns
### Incident Follow-Up
If you handled incidents during your shift:
- Ensure a PIR (post-incident review) is scheduled (SEV-1/SEV-2)
- Document any quick wins
- Share learnings in team meeting
## Quick Reference
### Key Dashboards
| Dashboard | Link |
|---|---|
| Main Health | dashboard/main |
| Services | dashboard/services |
| Database | dashboard/database |
| Edge | cloudflare/analytics |
### Key Commands

```shell
# Check service status
gcloud run services describe SERVICE --region=us-central1

# View logs
gcloud logging read "resource.type=cloud_run_revision" --limit=50

# Quick rollback: send 100% of traffic to a known-good revision
gcloud run services update-traffic SERVICE --to-revisions=REVISION=100

# Check database
gcloud sql instances describe olympus-pg-prod
```
### Emergency Contacts
| Contact | Phone |
|---|---|
| GCP Support | 1-855-831-3592 |
| Cloudflare | Enterprise Portal |
| Engineering Manager | [In PagerDuty contacts] |
## Related Documentation
- Incident Response - Full incident procedures
- Database Operations - Database troubleshooting
- Deployment - Rollback procedures
- Scaling - Scaling operations