Skip to main content

Production Security Runbook

Security procedures and incident response for Olympus Cloud production environments.

Quick Reference

IssueSeverityAction
Suspected breachP0Isolate, preserve evidence, escalate immediately
Unauthorized access attemptP1Block source, investigate, report
Vulnerability discoveredP1-P2Assess impact, patch timeline based on CVSS
Certificate expiringP2Renew 30 days before expiry
Failed security scanP2Review findings, remediate within SLA
Compliance audit findingP2-P3Document remediation plan

Security Architecture

Network Security Layers

┌─────────────────────────────────────────────────────────────────────┐
│ Internet Traffic │
└────────────────────────────────┬────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│ Cloudflare WAF & DDoS Protection │
│ - Rate limiting (1000 req/min per IP) │
│ - Bot detection & challenge │
│ - Geo-blocking (configurable) │
│ - OWASP rule sets enabled │
└────────────────────────────────┬────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│ Cloud Armor (GCP) │
│ - Additional WAF rules │
│ - Adaptive protection │
│ - Pre-configured rules for common attacks │
└────────────────────────────────┬────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│ VPC Network │
│ - Private subnets for services │
│ - Firewall rules (deny by default) │
│ - VPC Service Controls │
└─────────────────────────────────────────────────────────────────────┘

Service Authentication

ServiceAuth MethodToken Lifetime
API GatewayJWT with RS2561 hour
Service-to-ServiceWorkload Identity1 hour
Admin AccessOAuth 2.0 + MFA8 hours
CI/CDService AccountPer-run

Access Control Procedures

Granting Production Access

Prerequisites:

  • Completed security training
  • Signed NDA and acceptable use policy
  • Manager approval documented
  • Role justification

Process:

# 1. Verify user in access request system
gcloud identity groups memberships search-transitive-memberships \
--group-email="prod-access@olympuscloud.ai" \
--member-email="user@olympuscloud.ai"

# 2. Add to appropriate group (requires admin)
gcloud identity groups memberships add \
--group-email="sre-prod@olympuscloud.ai" \
--member-email="user@olympuscloud.ai"

# 3. Verify access
gcloud projects get-iam-policy olympuscloud-prod \
--flatten="bindings[].members" \
--filter="bindings.members:user@olympuscloud.ai"

Emergency Access (Break Glass)

For emergency situations when normal access is unavailable:

# 1. Document the emergency
echo "Emergency access request: $(date)" >> /var/log/emergency-access.log

# 2. Use break-glass account
gcloud auth activate-service-account \
--key-file=/secure/break-glass-key.json

# 3. Perform necessary actions
# ... (all commands logged)

# 4. Revoke and report
gcloud auth revoke break-glass@olympuscloud-prod.iam.gserviceaccount.com

Post-Emergency:

  1. Create incident report within 24 hours
  2. Review all actions taken
  3. Rotate break-glass credentials
  4. Update access control if needed

Revoking Access

# Immediate revocation
gcloud identity groups memberships delete \
--group-email="prod-access@olympuscloud.ai" \
--member-email="user@olympuscloud.ai"

# Verify removal
gcloud projects get-iam-policy olympuscloud-prod \
--flatten="bindings[].members" \
--filter="bindings.members:user@olympuscloud.ai"

# Check for service account keys
gcloud iam service-accounts keys list \
--iam-account=user-sa@olympuscloud-prod.iam.gserviceaccount.com

Security Monitoring

Key Security Metrics

MetricWarningCriticalCheck Command
Failed loginsover 10/minover 50/minSee Cloud Logging
WAF blocksover 100/minover 1000/minCloudflare dashboard
Privilege escalationAnyAnyIAM audit logs
Unusual API patterns2σ deviation3σ deviationCloud Monitoring

Checking Security Logs

# Failed authentication attempts (last hour)
gcloud logging read \
'resource.type="cloud_run_revision" AND
jsonPayload.severity="WARNING" AND
jsonPayload.message=~"authentication failed"' \
--limit=100 \
--format="table(timestamp,jsonPayload.sourceIP,jsonPayload.message)"

# IAM changes (last 24 hours)
gcloud logging read \
'protoPayload.serviceName="iam.googleapis.com" AND
protoPayload.methodName=~"SetIamPolicy"' \
--limit=50 \
--freshness=1d

# Firewall rule changes
gcloud logging read \
'resource.type="gce_firewall_rule" AND
protoPayload.methodName=~"compute.firewalls"' \
--limit=20

Alert Response

High-Volume Failed Logins:

  1. Identify source IPs

    gcloud logging read \
    'jsonPayload.message=~"authentication failed"' \
    --limit=1000 | grep -oP 'sourceIP:\s*\K[\d.]+' | sort | uniq -c | sort -rn | head
  2. Check if IPs are legitimate users or attackers

  3. If attack: Add to Cloudflare block list

  4. If legitimate: Check for service misconfiguration


Vulnerability Management

Severity SLAs

CVSS ScoreSeverityRemediation SLA
9.0 - 10.0Critical24 hours
7.0 - 8.9High7 days
4.0 - 6.9Medium30 days
0.1 - 3.9Low90 days

Running Security Scans

# Container vulnerability scan
gcloud artifacts docker images scan \
gcr.io/olympuscloud-prod/api-gateway:latest \
--remote

# Check scan results
gcloud artifacts docker images list-vulnerabilities \
gcr.io/olympuscloud-prod/api-gateway:latest \
--format="table(vulnerability.severity,vulnerability.cve,vulnerability.description)"

# Infrastructure security scan
terraform plan -var-file=environments/prod/terraform.tfvars | \
tfsec --format=json

Patching Procedures

Automated Patching (Low/Medium):

  • Dependabot PRs auto-merged after tests pass
  • Base images updated weekly

Manual Patching (High/Critical):

  1. Assess impact and create incident
  2. Develop and test patch
  3. Deploy to staging
  4. Verify fix with security team
  5. Deploy to production (expedited approval)
  6. Monitor for regressions

Secrets Management

Accessing Secrets

# List available secrets
gcloud secrets list --project=olympuscloud-prod

# View secret value (authorized users only)
gcloud secrets versions access latest \
--secret="api-signing-key" \
--project=olympuscloud-prod

# Check who has access to a secret
gcloud secrets get-iam-policy api-signing-key \
--project=olympuscloud-prod

Rotating Secrets

# 1. Create new secret version
gcloud secrets versions add api-signing-key \
--data-file=/secure/new-key.txt \
--project=olympuscloud-prod

# 2. Deploy services to pick up new version
gcloud run services update api-gateway \
--region=us-central1 \
--project=olympuscloud-prod

# 3. Verify new secret is being used
# Check logs for successful auth with new key

# 4. Disable old version (after verification)
gcloud secrets versions disable OLD_VERSION_ID \
--secret=api-signing-key \
--project=olympuscloud-prod

Secret Rotation Schedule

Secret TypeRotation FrequencyAutomated
API Keys90 daysYes
Database passwords90 daysYes
Service account keys365 daysNo
TLS certificatesAuto (Let's Encrypt)Yes
Encryption keys365 daysNo

Security Incident Response

Incident Classification

ClassDescriptionResponse
SEV-1Active breach, data exposureAll hands, 24/7
SEV-2Vulnerability being exploitedSecurity team + on-call
SEV-3Suspicious activitySecurity team
SEV-4Policy violation, audit findingStandard process

SEV-1 Response Checklist

  1. Contain (First 15 minutes)

    • Isolate affected systems
    • Preserve evidence (don't delete logs)
    • Block attacker IPs/accounts
    • Notify security lead
  2. Assess (First hour)

    • Identify scope of breach
    • Determine data affected
    • Identify attack vector
    • Document timeline
  3. Eradicate (As needed)

    • Remove attacker access
    • Patch vulnerabilities
    • Reset compromised credentials
    • Verify no persistence
  4. Recover

    • Restore from clean backups if needed
    • Gradually restore services
    • Monitor for re-compromise
  5. Post-Incident

    • Write incident report
    • Conduct blameless postmortem
    • Implement preventive measures
    • Notify affected parties if required

Isolating a Compromised Service

# 1. Block all traffic to service
gcloud run services update COMPROMISED_SERVICE \
--no-traffic \
--region=us-central1

# 2. Revoke service account permissions
gcloud projects remove-iam-policy-binding olympuscloud-prod \
--member="serviceAccount:COMPROMISED_SA@olympuscloud-prod.iam.gserviceaccount.com" \
--role="roles/spanner.databaseUser"

# 3. Capture logs before they rotate
gcloud logging read \
'resource.labels.service_name="COMPROMISED_SERVICE"' \
--limit=10000 \
--format=json > /secure/incident-logs-$(date +%Y%m%d).json

# 4. Take database snapshot (if DB access suspected)
gcloud sql backups create \
--instance=olympus-pg-prod \
--description="Security incident $(date)"

Compliance Verification

SOC 2 Controls Check

# Access control verification
echo "=== Access Control Audit ==="
gcloud projects get-iam-policy olympuscloud-prod --format=json > iam-audit.json

# Encryption verification
echo "=== Encryption Audit ==="
gcloud sql instances describe olympus-pg-prod \
--format="value(settings.ipConfiguration.requireSsl,settings.dataDiskEncryptionConfig)"

# Logging verification
echo "=== Logging Audit ==="
gcloud logging sinks list --project=olympuscloud-prod

Compliance Checklist (Monthly)

  • Review IAM permissions (remove unused)
  • Verify MFA enforcement
  • Check certificate expiration dates
  • Review security group memberships
  • Audit API key usage
  • Verify backup encryption
  • Review WAF rule effectiveness
  • Check vulnerability scan results

Generating Compliance Reports

# Generate access report
gcloud asset search-all-iam-policies \
--scope=projects/olympuscloud-prod \
--query="policy:*@olympuscloud.ai" \
--format=json > compliance/access-report-$(date +%Y%m).json

# Generate resource inventory
gcloud asset search-all-resources \
--scope=projects/olympuscloud-prod \
--format=json > compliance/resource-inventory-$(date +%Y%m).json

Certificate Management

Checking Certificate Expiration

# Check all managed certificates
gcloud compute ssl-certificates list \
--format="table(name,type,expireTime,managed.status)"

# Check specific domain
echo | openssl s_client -servername api.olympuscloud.ai \
-connect api.olympuscloud.ai:443 2>/dev/null | \
openssl x509 -noout -dates

Renewing Certificates

Managed Certificates (Auto-renewed):

  • Cloud Run automatically renews
  • Cloudflare Universal SSL auto-renews

Manual Certificates:

# Generate new certificate request
openssl req -new -key private.key -out request.csr

# Upload new certificate
gcloud compute ssl-certificates create api-cert-$(date +%Y%m) \
--certificate=new-cert.pem \
--private-key=private.key

# Update load balancer
gcloud compute target-https-proxies update api-proxy \
--ssl-certificates=api-cert-$(date +%Y%m)

Escalation Matrix

Issue TypeFirst ResponseEscalate ToExecutive
Active breachSecurity on-callSecurity LeadCTO within 1 hour
Data exposureSecurity on-callSecurity Lead + LegalCEO within 2 hours
Vulnerability (Critical)Security teamEngineering LeadCTO within 24 hours
Compliance findingSecurity teamCompliance OfficerCFO within 1 week
Access requestOn-call SRESecurity teamN/A