Production Security Runbook
Security procedures and incident response for Olympus Cloud production environments.
Quick Reference
| Issue | Severity | Action |
|---|---|---|
| Suspected breach | P0 | Isolate, preserve evidence, escalate immediately |
| Unauthorized access attempt | P1 | Block source, investigate, report |
| Vulnerability discovered | P1-P2 | Assess impact, set patch timeline by CVSS score |
| Certificate expiring | P2 | Renew 30 days before expiry |
| Failed security scan | P2 | Review findings, remediate within SLA |
| Compliance audit finding | P2-P3 | Document remediation plan |
Security Architecture
Network Security Layers
┌─────────────────────────────────────────────────────────────────────┐
│ Internet Traffic │
└────────────────────────────────┬────────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────────────────┐
│ Cloudflare WAF & DDoS Protection │
│ - Rate limiting (1000 req/min per IP) │
│ - Bot detection & challenge │
│ - Geo-blocking (configurable) │
│ - OWASP rule sets enabled │
└────────────────────────────────┬────────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────────────────┐
│ Cloud Armor (GCP) │
│ - Additional WAF rules │
│ - Adaptive protection │
│ - Pre-configured rules for common attacks │
└────────────────────────────────┬────────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────────────────┐
│ VPC Network │
│ - Private subnets for services │
│ - Firewall rules (deny by default) │
│ - VPC Service Controls │
└─────────────────────────────────────────────────────────────────────┘
Service Authentication
| Service | Auth Method | Token Lifetime |
|---|---|---|
| API Gateway | JWT with RS256 | 1 hour |
| Service-to-Service | Workload Identity | 1 hour |
| Admin Access | OAuth 2.0 + MFA | 8 hours |
| CI/CD | Service Account | Per-run |
Access Control Procedures
Granting Production Access
Prerequisites:
- Completed security training
- Signed NDA and acceptable use policy
- Manager approval documented
- Role justification
Process:
# 1. Confirm the approved access request, then check the user's current membership
gcloud identity groups memberships search-transitive-memberships \
--group-email="prod-access@olympuscloud.ai" \
--member-email="user@olympuscloud.ai"
# 2. Add to appropriate group (requires admin)
gcloud identity groups memberships add \
--group-email="sre-prod@olympuscloud.ai" \
--member-email="user@olympuscloud.ai"
# 3. Verify access
gcloud projects get-iam-policy olympuscloud-prod \
--flatten="bindings[].members" \
--filter="bindings.members:user@olympuscloud.ai"
Emergency Access (Break Glass)
For emergency situations when normal access is unavailable:
# 1. Document the emergency
echo "Emergency access request: $(date)" >> /var/log/emergency-access.log
# 2. Use break-glass account
gcloud auth activate-service-account \
--key-file=/secure/break-glass-key.json
# 3. Perform necessary actions
# ... (all commands logged)
# 4. Revoke and report
gcloud auth revoke break-glass@olympuscloud-prod.iam.gserviceaccount.com
Post-Emergency:
- Create incident report within 24 hours
- Review all actions taken
- Rotate break-glass credentials
- Update access control if needed
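Step 1 of the break-glass procedure records only a timestamp, which leaves little to work from when writing the incident report. The sketch below appends a slightly richer entry. It is an illustration only: the function name `log_emergency_access`, the field layout, and the argument order are assumptions, not part of any existing tooling.

```shell
# Hypothetical helper: append one structured line per break-glass use.
# Usage: log_emergency_access LOGFILE OPERATOR REASON
log_emergency_access() {
  logfile=$1
  operator=$2
  reason=$3
  # UTC ISO-8601 timestamp keeps entries sortable across time zones.
  ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)
  printf '%s operator=%s reason="%s"\n' "$ts" "$operator" "$reason" >> "$logfile"
}
```

A call such as `log_emergency_access /var/log/emergency-access.log "$USER" "IAM outage"` would then replace the plain `echo` in step 1.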
Revoking Access
# Immediate revocation (repeat for each group the user was added to, such as sre-prod@olympuscloud.ai)
gcloud identity groups memberships delete \
--group-email="prod-access@olympuscloud.ai" \
--member-email="user@olympuscloud.ai"
# Verify removal
gcloud projects get-iam-policy olympuscloud-prod \
--flatten="bindings[].members" \
--filter="bindings.members:user@olympuscloud.ai"
# Check for service account keys
gcloud iam service-accounts keys list \
--iam-account=user-sa@olympuscloud-prod.iam.gserviceaccount.com
Security Monitoring
Key Security Metrics
| Metric | Warning | Critical | Where to Check |
|---|---|---|---|
| Failed logins | over 10/min | over 50/min | See Cloud Logging |
| WAF blocks | over 100/min | over 1000/min | Cloudflare dashboard |
| Privilege escalation | Any | Any | IAM audit logs |
| Unusual API patterns | 2σ deviation | 3σ deviation | Cloud Monitoring |
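The count-based thresholds in the table can be applied mechanically once a per-minute value is in hand. The following is a minimal sketch, assuming "over N/min" means strictly greater than N; `classify_rate` is a hypothetical name and is not wired to any monitoring system.

```shell
# Hypothetical helper: compare a per-minute count against warning and
# critical thresholds. "over N/min" is read as strictly greater than N.
# Usage: classify_rate VALUE WARN CRIT
classify_rate() {
  if [ "$1" -gt "$3" ]; then
    echo CRITICAL
  elif [ "$1" -gt "$2" ]; then
    echo WARNING
  else
    echo OK
  fi
}

classify_rate 12 10 50       # failed logins at 12/min, prints WARNING
classify_rate 1500 100 1000  # WAF blocks at 1500/min, prints CRITICAL
```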
Checking Security Logs
# Failed authentication attempts (last hour)
gcloud logging read \
'resource.type="cloud_run_revision" AND
jsonPayload.severity="WARNING" AND
jsonPayload.message=~"authentication failed"' \
--limit=100 \
--freshness=1h \
--format="table(timestamp,jsonPayload.sourceIP,jsonPayload.message)"
# IAM changes (last 24 hours)
gcloud logging read \
'protoPayload.serviceName="iam.googleapis.com" AND
protoPayload.methodName=~"SetIamPolicy"' \
--limit=50 \
--freshness=1d
# Firewall rule changes
gcloud logging read \
'resource.type="gce_firewall_rule" AND
protoPayload.methodName=~"compute.firewalls"' \
--limit=20
Alert Response
High-Volume Failed Logins:
- Identify source IPs
gcloud logging read \
'jsonPayload.message=~"authentication failed"' \
--limit=1000 | grep -oP 'sourceIP:\s*\K[\d.]+' | sort | uniq -c | sort -rn | head -
- Check whether the IPs belong to legitimate users or attackers
- If attack: Add to Cloudflare block list
- If legitimate: Check for service misconfiguration
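The `grep -oP` call in the first step relies on GNU grep's PCRE support, which is not available everywhere (for example, the stock grep on macOS). A portable variant of the same counting pipeline might look like the sketch below. It assumes each log entry carries a `sourceIP: a.b.c.d` field, as the grep pattern does; `top_source_ips` is a hypothetical name.

```shell
# Hypothetical helper: read log text on stdin and print "COUNT IP" for the
# busiest IPv4 sources, highest count first.
# Usage: top_source_ips [N] < logfile   (N defaults to 10)
top_source_ips() {
  limit=${1:-10}
  # Keep only lines with a sourceIP field and reduce them to the address.
  sed -n 's/.*sourceIP:[[:space:]]*\([0-9][0-9.]*\).*/\1/p' |
    sort | uniq -c | sort -rn | head -n "$limit" |
    awk '{ print $1, $2 }'
}
```

It can be fed from the same `gcloud logging read` command by piping its output into `top_source_ips 20`.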
Vulnerability Management
Severity SLAs
| CVSS Score | Severity | Remediation SLA |
|---|---|---|
| 9.0 - 10.0 | Critical | 24 hours |
| 7.0 - 8.9 | High | 7 days |
| 4.0 - 6.9 | Medium | 30 days |
| 0.1 - 3.9 | Low | 90 days |
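For triage scripts, the table can be expressed as a small lookup. This is a sketch under stated assumptions: `cvss_sla` is a hypothetical helper, the output strings are invented for illustration, and a score of 0 is reported as `None` because the table starts at 0.1.

```shell
# Hypothetical helper: map a CVSS base score to the severity and remediation
# SLA from the table above. awk handles the decimal comparison.
# Usage: cvss_sla SCORE
cvss_sla() {
  awk -v s="$1" 'BEGIN {
    # Reject anything that is not a plain decimal in the range 0 to 10.
    if (s !~ /^[0-9]+(\.[0-9]+)?$/ || s + 0 > 10) { print "invalid"; exit 1 }
    s += 0
    if      (s >= 9.0) print "Critical 24h"
    else if (s >= 7.0) print "High 7d"
    else if (s >= 4.0) print "Medium 30d"
    else if (s >  0)   print "Low 90d"
    else               print "None"
  }'
}
```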
Running Security Scans
# Container vulnerability scan (on-demand); keep the scan name for the next step
SCAN_NAME=$(gcloud artifacts docker images scan \
gcr.io/olympuscloud-prod/api-gateway:latest \
--remote \
--format="value(response.scan)")
# Check scan results (list-vulnerabilities takes the scan name, not the image URI)
gcloud artifacts docker images list-vulnerabilities "$SCAN_NAME" \
--format="table(vulnerability.severity,vulnerability.cve,vulnerability.description)"
# Infrastructure security scan (tfsec reads Terraform source, not plan output)
tfsec . --tfvars-file=environments/prod/terraform.tfvars --format=json
Patching Procedures
Automated Patching (Low/Medium):
- Dependabot PRs auto-merged after tests pass
- Base images updated weekly
Manual Patching (High/Critical):
- Assess impact and create incident
- Develop and test patch
- Deploy to staging
- Verify fix with security team
- Deploy to production (expedited approval)
- Monitor for regressions
Secrets Management
Accessing Secrets
# List available secrets
gcloud secrets list --project=olympuscloud-prod
# View secret value (authorized users only)
gcloud secrets versions access latest \
--secret="api-signing-key" \
--project=olympuscloud-prod
# Check who has access to a secret
gcloud secrets get-iam-policy api-signing-key \
--project=olympuscloud-prod
Rotating Secrets
# 1. Create new secret version
gcloud secrets versions add api-signing-key \
--data-file=/secure/new-key.txt \
--project=olympuscloud-prod
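# 2. Deploy a new revision so services pick up the new version
#    (update needs at least one change flag; SECRET_ROTATED_AT is a placeholder marker variable)
gcloud run services update api-gateway \
--update-env-vars=SECRET_ROTATED_AT=$(date +%s) \
--region=us-central1 \
--project=olympuscloud-prod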
# 3. Verify new secret is being used
# Check logs for successful auth with new key
# 4. Disable old version (after verification)
gcloud secrets versions disable OLD_VERSION_ID \
--secret=api-signing-key \
--project=olympuscloud-prod
Secret Rotation Schedule
| Secret Type | Rotation Frequency | Automated |
|---|---|---|
| API Keys | 90 days | Yes |
| Database passwords | 90 days | Yes |
| Service account keys | 365 days | No |
| TLS certificates | Auto (Let's Encrypt) | Yes |
| Encryption keys | 365 days | No |
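The manual rows in the schedule are the ones most likely to slip, so a periodic age check can help. The sketch below compares a secret's age with the interval for its type. The type keywords (`api-key`, `db-password`, `sa-key`, `encryption-key`) and the function name `rotation_due` are assumptions made for this example; timestamps are passed as epoch seconds to keep it free of date parsing.

```shell
# Hypothetical helper: decide whether a secret is due for rotation.
# Intervals come from the schedule above; the type keywords are made up here.
# Usage: rotation_due TYPE CREATED_EPOCH NOW_EPOCH
rotation_due() {
  case $1 in
    api-key|db-password)   max_days=90 ;;
    sa-key|encryption-key) max_days=365 ;;
    *) echo "unknown type: $1" >&2; return 2 ;;
  esac
  age_days=$(( ($3 - $2) / 86400 ))
  if [ "$age_days" -ge "$max_days" ]; then
    echo "DUE (age ${age_days}d, limit ${max_days}d)"
  else
    echo "OK (age ${age_days}d, limit ${max_days}d)"
  fi
}
```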
Security Incident Response
Incident Classification
| Class | Description | Response |
|---|---|---|
| SEV-1 | Active breach, data exposure | All hands, 24/7 |
| SEV-2 | Vulnerability being exploited | Security team + on-call |
| SEV-3 | Suspicious activity | Security team |
| SEV-4 | Policy violation, audit finding | Standard process |
SEV-1 Response Checklist
- Contain (First 15 minutes)
  - Isolate affected systems
  - Preserve evidence (don't delete logs)
  - Block attacker IPs/accounts
  - Notify security lead
- Assess (First hour)
  - Identify scope of breach
  - Determine data affected
  - Identify attack vector
  - Document timeline
- Eradicate (As needed)
  - Remove attacker access
  - Patch vulnerabilities
  - Reset compromised credentials
  - Verify no persistence
- Recover
  - Restore from clean backups if needed
  - Gradually restore services
  - Monitor for re-compromise
- Post-Incident
  - Write incident report
  - Conduct blameless postmortem
  - Implement preventive measures
  - Notify affected parties if required
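Two phases of the checklist are time-boxed (15 minutes to contain, one hour to assess). When an incident is declared it can help to print those deadlines once so nobody has to do clock arithmetic under pressure. A minimal sketch, with the hypothetical name `sev1_deadlines` and epoch seconds as the assumed time format:

```shell
# Hypothetical helper: given the incident start time as epoch seconds, print
# the epoch deadlines for the two time-boxed phases in the checklist.
# Usage: sev1_deadlines START_EPOCH
sev1_deadlines() {
  start=$1
  echo "contain_by=$(( start + 15 * 60 ))"
  echo "assess_by=$(( start + 60 * 60 ))"
}
```

Calling `sev1_deadlines "$(date +%s)"` at declaration time gives values that can be pasted into the incident channel or timeline.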
Isolating a Compromised Service
# 1. Block external traffic to the service
#    (--no-traffic only keeps a new revision from receiving traffic)
gcloud run services update COMPROMISED_SERVICE \
--ingress=internal \
--region=us-central1
# 2. Revoke service account permissions
gcloud projects remove-iam-policy-binding olympuscloud-prod \
--member="serviceAccount:COMPROMISED_SA@olympuscloud-prod.iam.gserviceaccount.com" \
--role="roles/spanner.databaseUser"
# 3. Capture logs before they rotate
gcloud logging read \
'resource.labels.service_name="COMPROMISED_SERVICE"' \
--limit=10000 \
--format=json > /secure/incident-logs-$(date +%Y%m%d).json
# 4. Take database snapshot (if DB access suspected)
gcloud sql backups create \
--instance=olympus-pg-prod \
--description="Security incident $(date)"
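Captured logs are evidence, so it is worth recording a checksum at capture time to show later that the file was not altered. The sketch below is one way to do that. `seal_evidence` is a hypothetical helper, and the `.sha256` sidecar naming is an assumption.

```shell
# Hypothetical helper: write FILE.sha256 next to a captured evidence file.
# Uses sha256sum when present and falls back to shasum (common on macOS).
# Usage: seal_evidence FILE
seal_evidence() {
  file=$1
  [ -f "$file" ] || { echo "no such file: $file" >&2; return 1; }
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$file" > "$file.sha256"
  else
    shasum -a 256 "$file" > "$file.sha256"
  fi
}
```

The recorded hash can be checked later with `sha256sum -c FILE.sha256` (or `shasum -a 256 -c FILE.sha256`).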
Compliance Verification
SOC 2 Controls Check
# Access control verification
echo "=== Access Control Audit ==="
gcloud projects get-iam-policy olympuscloud-prod --format=json > iam-audit.json
# Encryption verification
echo "=== Encryption Audit ==="
# kmsKeyName is empty when the instance uses default Google-managed encryption
gcloud sql instances describe olympus-pg-prod \
--format="value(settings.ipConfiguration.requireSsl,diskEncryptionConfiguration.kmsKeyName)"
# Logging verification
echo "=== Logging Audit ==="
gcloud logging sinks list --project=olympuscloud-prod
Compliance Checklist (Monthly)
- Review IAM permissions (remove unused)
- Verify MFA enforcement
- Check certificate expiration dates
- Review security group memberships
- Audit API key usage
- Verify backup encryption
- Review WAF rule effectiveness
- Check vulnerability scan results
Generating Compliance Reports
# Generate access report
gcloud asset search-all-iam-policies \
--scope=projects/olympuscloud-prod \
--query="policy:*@olympuscloud.ai" \
--format=json > compliance/access-report-$(date +%Y%m).json
# Generate resource inventory
gcloud asset search-all-resources \
--scope=projects/olympuscloud-prod \
--format=json > compliance/resource-inventory-$(date +%Y%m).json
Certificate Management
Checking Certificate Expiration
# Check all managed certificates
gcloud compute ssl-certificates list \
--format="table(name,type,expireTime,managed.status)"
# Check specific domain
echo | openssl s_client -servername api.olympuscloud.ai \
-connect api.olympuscloud.ai:443 2>/dev/null | \
openssl x509 -noout -dates
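The dates printed by `openssl x509 -noout -dates` still have to be compared with the 30-day renewal window from the Quick Reference table. A minimal sketch of that comparison follows, assuming the expiry has already been converted to epoch seconds; `cert_renewal_status` is a hypothetical name.

```shell
# Hypothetical helper: classify a certificate by time left before expiry.
# The 30-day threshold is the renewal window from the Quick Reference table.
# Usage: cert_renewal_status NOT_AFTER_EPOCH NOW_EPOCH
cert_renewal_status() {
  remaining=$(( $1 - $2 ))
  if [ "$remaining" -le 0 ]; then
    echo "EXPIRED"
  elif [ "$remaining" -le $(( 30 * 86400 )) ]; then
    echo "RENEW ($(( remaining / 86400 ))d left)"
  else
    echo "OK ($(( remaining / 86400 ))d left)"
  fi
}
```

For a quick yes/no answer straight from a certificate, `openssl x509 -checkend 2592000 -noout` exits non-zero when the certificate expires within the next 30 days.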
Renewing Certificates
Managed Certificates (Auto-renewed):
- Cloud Run renews its managed certificates automatically
- Cloudflare Universal SSL certificates renew automatically
Manual Certificates:
# Generate new certificate request
openssl req -new -key private.key -out request.csr
# Upload new certificate
gcloud compute ssl-certificates create api-cert-$(date +%Y%m) \
--certificate=new-cert.pem \
--private-key=private.key
# Update load balancer
gcloud compute target-https-proxies update api-proxy \
--ssl-certificates=api-cert-$(date +%Y%m)
Escalation Matrix
| Issue Type | First Response | Escalate To | Executive |
|---|---|---|---|
| Active breach | Security on-call | Security Lead | CTO within 1 hour |
| Data exposure | Security on-call | Security Lead + Legal | CEO within 2 hours |
| Vulnerability (Critical) | Security team | Engineering Lead | CTO within 24 hours |
| Compliance finding | Security team | Compliance Officer | CFO within 1 week |
| Access request | On-call SRE | Security team | N/A |