# On-Call Guide
Complete guide for on-call engineers at Olympus Cloud.
## Overview
On-call engineers are the first line of defense for production issues. This guide covers expectations, tools, and procedures for effective on-call shifts.
## On-Call Philosophy
- **Alert fatigue is real** - We tune alerts to be actionable
- **Blameless culture** - Focus on learning, not blame
- **Sustainable on-call** - Reasonable workload and compensation
- **Knowledge sharing** - Document everything for future on-call engineers
## On-Call Rotation
### Schedule Structure
| Rotation | Coverage | Duration |
|---|---|---|
| Primary | 24/7 | 1 week |
| Secondary | Backup | 1 week |
| Incident Commander | SEV-1/SEV-2 | 1 week |
### Current Rotations
All rotations are managed in PagerDuty.
## Handoff Procedure
### Beginning of Shift
- Review open incidents and recent alerts
- Check for any ongoing issues
- Read handoff notes from previous on-call
- Verify PagerDuty mobile app is working
- Confirm contact information is current
### End of Shift
- Document any ongoing issues
- Write handoff notes in #on-call-handoff
- Update any runbooks with new learnings
- Thank your relief :)
### Handoff Template

```markdown
## On-Call Handoff: [Date]

### Ongoing Issues
- [Issue 1]: Status, next steps

### Recent Incidents
- [Incident 1]: Brief summary, resolution

### Notable Alerts
- [Alert pattern]: Noise vs signal assessment

### Action Items
- [ ] Item needing follow-up

### Notes for Next On-Call
- Any tips, warnings, or context
```
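If you write handoffs often, a small helper can stamp the template with the current date so it is ready to paste into #on-call-handoff. This is a sketch, not tooling the team ships; adjust the sections to match the template in use.

```shell
#!/bin/sh
# Print a dated copy of the handoff template, ready to paste into
# #on-call-handoff. Sketch only; edit sections to taste.
handoff() {
  cat <<EOF
## On-Call Handoff: $(date +%Y-%m-%d)

### Ongoing Issues
- [Issue 1]: Status, next steps

### Recent Incidents
- [Incident 1]: Brief summary, resolution

### Notable Alerts
- [Alert pattern]: Noise vs signal assessment

### Action Items
- [ ] Item needing follow-up

### Notes for Next On-Call
- Any tips, warnings, or context
EOF
}

handoff
```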
## Tools and Access
### Required Tools
| Tool | Purpose | Access |
|---|---|---|
| PagerDuty | Alerting | olympus.pagerduty.com |
| Slack | Communication | #incident-response, #on-call |
| GCP Console | Cloud infrastructure | console.cloud.google.com |
| Cloud Monitoring | Metrics, dashboards | Monitoring |
| Cloud Logging | Log analysis | Logging |
| Cloudflare | Edge, WAF | dash.cloudflare.com |
| GitHub | Code, deployments | github.com/olympuscloud |
### Mobile Setup
- **PagerDuty Mobile App**
  - Install on personal phone
  - Enable push notifications
  - Test alert delivery
- **Slack Mobile App**
  - Enable notifications for #incident-response
  - Star the #on-call channel
- **Google Authenticator**
  - Required for GCP access
  - Backup codes stored securely
### Access Verification
Before your on-call shift, verify:

```shell
# GCP access
gcloud config get-value project
gcloud run services list --region=us-central1

# GitHub access
gh auth status

# Cloudflare token is valid
curl -H "Authorization: Bearer $CF_TOKEN" \
  https://api.cloudflare.com/client/v4/user/tokens/verify
```
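A small wrapper can run all of these checks in one pass and report PASS/FAIL for each instead of stopping at the first failure. This is a sketch that assumes only the commands listed above; the `check` helper is hypothetical, not shared tooling.

```shell
#!/bin/sh
# Pre-shift access check: run each verification command and report
# PASS/FAIL rather than exiting on the first failure.
check() {
  label=$1; shift                 # $1 = label, rest = command to run
  if "$@" >/dev/null 2>&1; then
    echo "PASS - $label"
  else
    echo "FAIL - $label"
  fi
}

check "GCP project set"  gcloud config get-value project
check "Cloud Run list"   gcloud run services list --region=us-central1
check "GitHub auth"      gh auth status
check "Cloudflare token" sh -c \
  'curl -sf -H "Authorization: Bearer $CF_TOKEN" https://api.cloudflare.com/client/v4/user/tokens/verify'
```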
## Alert Triage
### Alert Classification

```text
Alert Arrives
  1. Read alert details
     └── Service, metric, threshold, duration
  2. Check dashboard
     └── Is this real? False positive?
  3. Assess impact
     └── Users affected? Data at risk?
  4. Determine action
     └── Acknowledge, snooze, or escalate
```
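The decision in steps 2-4 can be sketched as a tiny helper. The inputs and suggested actions below are illustrative only, not policy; real triage still needs judgment.

```shell
#!/bin/sh
# Triage sketch: map "is it real?" and "are users affected?" to a
# suggested action (acknowledge / snooze / escalate). Illustrative only.
triage() {
  real=$1        # yes|no - confirmed on a dashboard?
  impact=$2      # yes|no - users affected or data at risk?
  if [ "$real" = "no" ]; then
    echo "snooze and file a tuning ticket"
  elif [ "$impact" = "yes" ]; then
    echo "acknowledge and escalate"
  else
    echo "acknowledge and investigate"
  fi
}

triage no no     # false positive
triage yes yes   # real and user-impacting
```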
### Response Expectations
P1 alerts require acknowledgment within 5 minutes, 24/7. If you cannot respond in time, immediately escalate to secondary on-call via PagerDuty. Unacknowledged P1 alerts auto-escalate after 5 minutes and trigger management notification after 15 minutes.
| Priority | Response Time | Action |
|---|---|---|
| P1 (Critical) | < 5 minutes | Acknowledge immediately, begin investigation |
| P2 (High) | < 15 minutes | Acknowledge, triage within business hours |
| P3 (Medium) | < 1 hour | Acknowledge, may defer to next day |
| P4 (Low) | < 4 hours | Acknowledge, queue for review |
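If you script reminders against the table above, the deadlines can be encoded in a small lookup. A sketch, with minutes mirroring the table; the `response_minutes` helper is hypothetical.

```shell
#!/bin/sh
# Maximum response time in minutes per alert priority (from the table).
response_minutes() {
  case "$1" in
    P1) echo 5   ;;
    P2) echo 15  ;;
    P3) echo 60  ;;
    P4) echo 240 ;;
    *)  echo "unknown priority: $1" >&2; return 1 ;;
  esac
}

response_minutes P2   # 15
```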
### Common Alert Patterns
#### High Error Rate
- Check recent deployments
- View error logs
- Consider rollback
- See: Incident Response
#### High Latency
- Check database metrics
- Review external dependencies
- Check for traffic spike
- See: Database Operations
#### Resource Exhaustion
- Check CPU/memory utilization
- Scale up if needed
- Investigate cause
- See: Scaling Runbook
#### Certificate Expiry
- Check cert expiration dates
- Renew if within 30 days
- Usually auto-renewed
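Days-to-expiry can be computed with plain epoch arithmetic; fetching the live certificate is shown commented out, since it needs network access (`HOST` is a placeholder, and the `date -d` conversion assumes GNU date).

```shell
#!/bin/sh
# Days until a certificate expires, from epoch timestamps.
days_until() {
  echo $(( ($1 - $2) / 86400 ))   # $1 = expiry epoch, $2 = now epoch
}

# Fetch the live cert's notAfter date (HOST is a placeholder):
# expiry=$(echo | openssl s_client -connect HOST:443 2>/dev/null \
#   | openssl x509 -noout -enddate | cut -d= -f2)
# days_until "$(date -d "$expiry" +%s)" "$(date +%s)"

days_until 1767225600 1764633600   # 30
```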
## Daily Responsibilities
### Morning Check (< 15 minutes)
- **Review overnight alerts**
  - Any P1/P2 that were resolved?
  - Noisy alerts to tune?
- **Check system health**
  - Main dashboard green?
  - Any degraded services?
- **Review deployments**
  - Any deployments happening today?
  - Potential for issues?
- **Post status** in #on-call, for example:

```text
Good morning! On-call status:
- Systems: All green ✅
- Overnight: 2 P4 alerts (acknowledged)
- Deployments today: api-gateway v1.25.0 @ 2pm
```
### Ongoing Vigilance
- Keep Slack and PagerDuty visible
- Be reachable within 5 minutes
- If stepping away, notify in #on-call
- Coordinate coverage for meals, appointments
### End of Day Check
- Review all alerts from the day
- Note any patterns for handoff
- Document any unresolved items
- Update #on-call-handoff if needed
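When reviewing the day's alerts, a one-line awk summary by priority makes patterns easy to spot for the handoff. The log format here is hypothetical (one alert per line, priority in the first column); adapt it to whatever export you actually have.

```shell
#!/bin/sh
# Summarize a day's alerts by priority. Assumes a simple export with
# the priority in the first column of each line (format hypothetical).
summarize() {
  awk '{ count[$1]++ } END { for (p in count) print p, count[p] }' "$@" | sort
}

# Example with sample data:
cat > /tmp/alerts.txt <<'EOF'
P4 disk-usage-warn api-gateway
P2 high-error-rate checkout
P4 disk-usage-warn api-gateway
EOF

summarize /tmp/alerts.txt
# P2 1
# P4 2
```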
## Escalation Procedures
### When to Escalate
**Escalate to Secondary On-Call:**
- You need a second opinion
- Issue spans multiple systems
- You need to step away
- Investigation exceeds 30 minutes
**Escalate to Incident Commander:**
- SEV-1 or SEV-2 confirmed
- Customer-impacting outage
- Data loss or security concern
- You're uncertain about decisions
**Escalate to Management:**
- Extended outage (> 2 hours)
- Financial impact
- Legal/compliance issues
- External communication needed
### How to Escalate
**In PagerDuty:**
1. Open the incident
2. Click "Escalate"
3. Select the target escalation policy
4. Add an escalation note
**In Slack:**

```text
@secondary-oncall I need assistance with [issue]
Current status: [brief description]
Help needed: [specific ask]
```
### Escalation Contacts
| Role | Method | When |
|---|---|---|
| Secondary On-Call | PagerDuty | First escalation |
| Incident Commander | PagerDuty | SEV-1/SEV-2 |
| Engineering Manager | Slack/Phone | Extended outage |
| VP Engineering | Phone | Major incident |
| CEO | Phone | Company-wide impact |
## Self-Care
### Sustainable On-Call
- **Sleep**: Don't sacrifice sleep for non-urgent alerts
- **Coverage**: Ask for coverage if you need it
- **Breaks**: Take them. Someone can cover for 30 min
- **Debrief**: Talk about stressful incidents
### When to Ask for Coverage
- Medical appointments
- Important personal events
- Feeling unwell
- After stressful incidents
### Requesting Coverage
- Post in #on-call with dates/times needed
- Find someone willing to cover
- Update PagerDuty override
- Confirm handoff
Example post:

```text
Looking for coverage:
- Date: January 20, 2026
- Time: 6pm - 10pm
- Reason: Family event
Can anyone cover? I can swap for your shift.
```
## After Your Shift
### Reflect and Improve
- **What alerts were noisy?**
  - File tickets to tune thresholds
  - Add to the noise-reduction backlog
- **What documentation was missing?**
  - Update runbooks
  - Write down tribal knowledge
- **What tools were frustrating?**
  - File improvement tickets
  - Discuss with the team
- **What went well?**
  - Share wins in #on-call
  - Document good patterns
### Incident Follow-Up
If you handled incidents during your shift:
- Ensure a PIR (post-incident review) is scheduled (SEV-1/SEV-2)
- Document any quick wins
- Share learnings in team meeting
## Quick Reference
### Key Dashboards
| Dashboard | Link |
|---|---|
| Main Health | dashboard/main |
| Services | dashboard/services |
| Database | dashboard/database |
| Edge | cloudflare/analytics |
### Key Commands

```shell
# Check service status
gcloud run services describe SERVICE --region=us-central1

# View logs
gcloud logging read "resource.type=cloud_run_revision" --limit=50

# Quick rollback: send 100% of traffic to a known-good revision
gcloud run services update-traffic SERVICE --to-revisions=REVISION=100

# Check database
gcloud sql instances describe olympus-pg-prod
```
### Emergency Contacts
| Contact | Phone |
|---|---|
| GCP Support | 1-855-831-3592 |
| Cloudflare | Enterprise Portal |
| Engineering Manager | [In PagerDuty contacts] |
## Related Documentation
- Incident Response - Full incident procedures
- Database Operations - Database troubleshooting
- Deployment - Rollback procedures
- Scaling - Scaling operations