
Data Sync & IoT Troubleshooting Runbook

Operational procedures for diagnosing and resolving data synchronization and IoT issues.

Quick Reference

| Issue | Severity | On-Call Action |
| --- | --- | --- |
| Location offline >15 min | P1 | Check connectivity; escalate if hardware fault suspected |
| Sync queue >1000 items | P2 | Check cloud service health |
| Temperature sensor alert | P1 | Verify reading, contact location |
| Device offline | P3 | Check device health, remote restart |
| Conflict rate >1% | P2 | Review sync strategy settings |

Sync Issues

Location Not Syncing

Symptoms:

  • Location shows "Offline" in dashboard
  • Pending sync count growing
  • No recent sync timestamp

Diagnosis:

# Check location sync status
curl -H "Authorization: Bearer $TOKEN" \
"https://api.olympuscloud.ai/v1/sync/status/{location_id}"

# Check edge agent logs (if accessible)
ssh edge@{location_ip} "journalctl -u olympus-edge-agent -n 100"

# Check cloud sync service health
gcloud run services describe sync-service --region us-central1 --format json \
| jq '.status.conditions'

Resolution Steps:

  1. Verify network connectivity

    # From edge device
    ping -c 5 api.olympuscloud.ai
    curl -I https://api.olympuscloud.ai/health
  2. Check edge agent status

    systemctl status olympus-edge-agent
    # If stopped, restart
    systemctl restart olympus-edge-agent
  3. Check local database

    sqlite3 /var/lib/olympus/edge.db "SELECT COUNT(*) FROM sync_queue"
  4. Force sync attempt

    curl -X POST -H "Authorization: Bearer $TOKEN" \
    -H "Content-Type: application/json" \
    "https://api.olympuscloud.ai/v1/sync/trigger" \
    -d '{"location_id": "{location_id}", "priority": "high"}'
  5. If persistent, check firewall rules

    • Outbound 443 (HTTPS) must be open
    • WebSocket upgrade must be allowed
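
The checks above can be folded into a small triage helper for on-call use. This is a sketch only: the function name, thresholds, and action labels are assumptions, not part of the edge agent.

```shell
#!/usr/bin/env bash
# Sketch: map the resolution steps above to a single next action.
# Args: ping reachable? (yes/no), agent active? (yes/no), sync queue depth.
triage_sync() {
  local ping_ok="$1" agent_active="$2" queue_depth="$3"
  if [ "$ping_ok" != "yes" ]; then
    echo "check-firewall"    # step 5: outbound 443 / WebSocket upgrade
  elif [ "$agent_active" != "yes" ]; then
    echo "restart-agent"     # step 2: systemctl restart olympus-edge-agent
  elif [ "$queue_depth" -gt 0 ]; then
    echo "force-sync"        # step 4: POST /v1/sync/trigger
  else
    echo "healthy"
  fi
}
```

Feed it the results of steps 1-3 (ping, `systemctl is-active`, the `sync_queue` count) to get a consistent decision regardless of who is on call.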

High Sync Latency

Symptoms:

  • Sync taking >30 seconds
  • Orders appearing delayed at other locations
  • Dashboard showing stale data

Diagnosis:

# Check sync latency metrics
gcloud monitoring metrics list --filter="metric.type:custom.googleapis.com/sync/latency"

# Check database performance
# Use the appropriate instance for your environment:
# dev-olympus-spanner | staging-olympus-spanner | prod-olympus-spanner
gcloud spanner databases execute-sql olympus-db \
--instance=${SPANNER_INSTANCE} \
--sql="SELECT AVG(sync_duration_ms) FROM sync_audit WHERE timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)"

Resolution:

  1. Check batch size configuration

    • Default: 100 records per batch
    • If queue is large, consider temporary increase
  2. Review conflict rate

    SELECT conflict_type, COUNT(*)
    FROM sync_conflicts
    WHERE resolved_at > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
    GROUP BY conflict_type
  3. Check for large payloads

    • Menu updates with images can be large
    • Consider splitting into smaller batches
  4. Verify Cloud Run scaling

    gcloud run services describe sync-service --region us-central1 \
    --format="value(spec.template.spec.containerConcurrency)"
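
To sanity-check the Spanner averages against raw samples, a rough nearest-rank p99 can be computed locally. A sketch only, assuming one sync duration in milliseconds per input line; the function name is hypothetical.

```shell
# Sketch: nearest-rank p99 over a stream of per-batch durations (ms),
# one value per line, for comparison against the 30 s threshold above.
p99_ms() {
  sort -n | awk '{ v[NR] = $1 } END {
    idx = int(NR * 0.99)          # nearest-rank index (floor)
    if (idx < 1) idx = 1
    print v[idx]
  }'
}
```

For example, `sqlite3 edge.db "SELECT duration_ms FROM sync_audit" | p99_ms` would give a quick local estimate; note the floor-based rank slightly underestimates on small samples.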

Sync Conflicts

Symptoms:

  • Data inconsistencies between locations
  • Order items missing or duplicated
  • Time punch discrepancies

Diagnosis:

-- Recent conflicts
SELECT * FROM sync_conflicts
WHERE created_at > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
ORDER BY created_at DESC
LIMIT 50;

-- Conflict rate by table
SELECT table_name, COUNT(*) as conflicts,
COUNT(*) * 100.0 / SUM(COUNT(*)) OVER() as percentage
FROM sync_conflicts
WHERE created_at > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
GROUP BY table_name;

Resolution:

  1. Review conflict strategy for affected table

    {
      "orders": {"strategy": "server_wins"},
      "menu_items": {"strategy": "server_wins"},
      "time_punches": {"strategy": "merge"}
    }
  2. For orders: Server-wins is correct (cloud is authoritative)

  3. For time punches: Merge strategy may need adjustment if conflicts persist

  4. For menu items: Always server-wins (menu managed centrally)

  5. If custom resolution needed, escalate to engineering
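
The strategy table above amounts to a per-table dispatch. A sketch of that dispatch, assuming string-valued records for illustration; the `merge:` marker stands in for the table-specific merge logic, which the runbook says to escalate to engineering.

```shell
# Sketch: pick a winning record given the per-table strategies above.
# Args: table name, local value, server value.
resolve_conflict() {
  local table="$1" local_val="$2" server_val="$3"
  case "$table" in
    orders|menu_items) echo "$server_val" ;;                  # server_wins
    time_punches)      echo "merge:$local_val+$server_val" ;; # merge (delegated)
    *)                 echo "$server_val" ;;                  # assumption: default to server_wins
  esac
}
```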


IoT Issues

Temperature Alert

Severity: P1 - Immediate Response Required

Symptoms:

  • Alert notification received
  • Temperature out of range on dashboard
  • Possible food safety concern

Immediate Actions:

  1. Verify the reading is accurate

    # Get recent readings
    curl -H "Authorization: Bearer $TOKEN" \
    "https://api.olympuscloud.ai/v1/iot/devices/{device_id}/telemetry?period=1h"
  2. Check for sensor malfunction

    • Single spike vs sustained reading
    • Compare with other sensors in same area
  3. Contact location manager

    • Verify physical equipment status
    • Check if door was left open
    • Check if power interruption occurred
  4. Document in incident log

    # Log the incident
    curl -X POST -H "Authorization: Bearer $TOKEN" \
    -H "Content-Type: application/json" \
    "https://api.olympuscloud.ai/v1/incidents" \
    -d '{
      "type": "temperature_alert",
      "device_id": "{device_id}",
      "location_id": "{location_id}",
      "severity": "high",
      "description": "Walk-in cooler temperature exceeded threshold"
    }'

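Step 2's spike-vs-sustained judgment can be made mechanical. A sketch, assuming integer readings and treating three or more consecutive out-of-range readings as sustained; both the function name and the run-length threshold are assumptions to tune per sensor.

```shell
# Sketch: classify a series of readings against a threshold (step 2).
# Args: threshold, then readings, oldest first. Integer values only.
classify_readings() {
  local threshold="$1"; shift
  local run=0 max_run=0 r
  for r in "$@"; do
    if [ "$r" -gt "$threshold" ]; then
      run=$((run + 1))
      if [ "$run" -gt "$max_run" ]; then max_run=$run; fi
    else
      run=0
    fi
  done
  if [ "$max_run" -ge 3 ]; then echo "sustained"
  elif [ "$max_run" -ge 1 ]; then echo "spike"
  else echo "normal"; fi
}
```

A lone 45 in a stream of 38s reads as "spike" (likely a door opening or sensor glitch); several 41+ readings in a row read as "sustained" and warrant the P1 response.
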
Escalation:

  • If food safety is compromised, follow food safety protocols
  • If equipment failure suspected, dispatch maintenance

Device Offline

Symptoms:

  • Device shows "Offline" in dashboard
  • No telemetry data received
  • Last seen timestamp is stale

Diagnosis:

# Check device status
curl -H "Authorization: Bearer $TOKEN" \
"https://api.olympuscloud.ai/v1/mdm/devices/{device_id}"

# Check MQTT broker connection
mosquitto_sub -h mqtt.olympuscloud.ai -t "olympus/+/{location_id}/+/status" -v

Resolution:

  1. Attempt remote restart

    curl -X POST -H "Authorization: Bearer $TOKEN" \
    "https://api.olympuscloud.ai/v1/mdm/devices/{device_id}/restart"
  2. Check network connectivity at location

    • WiFi signal strength
    • MQTT port 8883 accessibility
  3. Check device power

    • Battery level (if applicable)
    • Power supply status
  4. If device remains offline after 30 minutes

    • Create maintenance ticket
    • Consider replacement if hardware failure
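
Between the first restart attempt (step 1) and the 30-minute escalation (step 4), repeated restarts should back off rather than hammer the device. A sketch of the delay schedule, assuming a 60 s initial delay doubled per attempt and capped at the 30-minute window; these values are assumptions, not a documented policy.

```shell
# Sketch: exponential backoff delays (seconds) for repeated restart attempts,
# capped at 1800 s to line up with the 30-minute escalation above.
backoff_schedule() {
  local attempts="$1" delay=60 i
  for i in $(seq 1 "$attempts"); do
    echo "$delay"
    delay=$((delay * 2))
    if [ "$delay" -gt 1800 ]; then delay=1800; fi
  done
}
```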

MQTT Connection Issues

Symptoms:

  • Devices intermittently disconnecting
  • Telemetry gaps in data
  • High message latency

Diagnosis:

# Check MQTT broker health
gcloud monitoring dashboards list --filter="displayName:MQTT"

# Check connection count
curl -s "https://mqtt-admin.olympuscloud.ai/api/connections" \
| jq '.total_connections'

# Check message throughput
curl -s "https://mqtt-admin.olympuscloud.ai/api/stats" \
| jq '.messages_per_second'

Resolution:

  1. Check client certificate expiration

    # List expiring certificates
    curl -H "Authorization: Bearer $TOKEN" \
    "https://api.olympuscloud.ai/v1/mdm/certificates?expiring_within=30d"
  2. Verify topic ACLs

    • Devices should only publish to their assigned topics
    • Check for permission denied errors in logs
  3. Review QoS settings

    • Critical alerts: QoS 2 (exactly once)
    • Telemetry: QoS 1 (at least once)
    • Status updates: QoS 0 (best effort)
  4. Check for message queue buildup

    • If broker is overloaded, scale up resources
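
The QoS policy in step 3 is easiest to enforce as a single lookup, e.g. to supply `-q` to `mosquitto_pub`. A sketch; the message-type names and the default are assumptions.

```shell
# Sketch: the QoS policy above as a lookup table.
qos_for() {
  case "$1" in
    alert)     echo 2 ;;  # critical alerts: exactly once
    telemetry) echo 1 ;;  # telemetry: at least once
    status)    echo 0 ;;  # status updates: best effort
    *)         echo 1 ;;  # assumption: default unknown types to at-least-once
  esac
}
```

Usage would look like `mosquitto_pub -q "$(qos_for telemetry)" -t "olympus/..." -m "$payload"`, keeping clients from hard-coding divergent QoS levels.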

Edge Agent Issues

Edge Agent Crash Loop

Symptoms:

  • Agent repeatedly restarting
  • systemd showing failed status
  • Partial data in sync queue

Diagnosis:

# Check service status
systemctl status olympus-edge-agent

# Check logs for errors
journalctl -u olympus-edge-agent -n 500 --no-pager | grep -i error

# Check disk space
df -h /var/lib/olympus

# Check memory
free -m

Resolution:

  1. Check for database corruption

    sqlite3 /var/lib/olympus/edge.db "PRAGMA integrity_check"
  2. If disk full, clean up old logs

    journalctl --vacuum-size=100M
  3. Check for config issues

    cat /etc/olympus/edge-agent.yaml
  4. Reset to last known good config

    cp /etc/olympus/edge-agent.yaml.backup /etc/olympus/edge-agent.yaml
    systemctl restart olympus-edge-agent

SQLite Database Issues

Symptoms:

  • Edge agent unable to read/write local data
  • "database is locked" errors
  • Corrupted data warnings

Diagnosis:

# Check database integrity
sqlite3 /var/lib/olympus/edge.db "PRAGMA integrity_check"

# Check for lock issues
fuser /var/lib/olympus/edge.db

# Check file permissions
ls -la /var/lib/olympus/edge.db

Resolution:

  1. For lock issues

    # Stop the agent
    systemctl stop olympus-edge-agent
    # Confirm nothing still holds the database open
    fuser /var/lib/olympus/edge.db
    # Remove stale WAL/SHM files (note: discards any un-checkpointed writes)
    rm -f /var/lib/olympus/edge.db-wal /var/lib/olympus/edge.db-shm
    # Restart
    systemctl start olympus-edge-agent
  2. For corruption

    # Backup current database
    cp /var/lib/olympus/edge.db /var/lib/olympus/edge.db.corrupt

    # Attempt recovery
    sqlite3 /var/lib/olympus/edge.db.corrupt ".dump" | sqlite3 /var/lib/olympus/edge.db.new

    # Verify new database
    sqlite3 /var/lib/olympus/edge.db.new "PRAGMA integrity_check"

    # If OK, replace
    mv /var/lib/olympus/edge.db.new /var/lib/olympus/edge.db
    systemctl restart olympus-edge-agent
  3. If unrecoverable

    • Data will need to resync from cloud
    • Create fresh database and trigger full sync

MDM Issues

Device Enrollment Failure

Symptoms:

  • Device stuck in "pending_activation"
  • Enrollment token expired
  • Certificate provisioning failed

Resolution:

  1. Generate new enrollment token

    curl -X POST -H "Authorization: Bearer $TOKEN" \
    "https://api.olympuscloud.ai/v1/mdm/devices/{device_id}/regenerate-token"
  2. Verify device serial number matches

    curl -H "Authorization: Bearer $TOKEN" \
    "https://api.olympuscloud.ai/v1/mdm/devices/{device_id}" \
    | jq '.serial_number'
  3. Check certificate chain

    • Device must trust our CA
    • Intermediate certificates must be present

OTA Update Failure

Symptoms:

  • Update stuck at percentage
  • Rollback triggered
  • Device in unknown state

Resolution:

  1. Check update status

    curl -H "Authorization: Bearer $TOKEN" \
    "https://api.olympuscloud.ai/v1/mdm/devices/{device_id}/updates/status"
  2. If stuck, cancel and retry

    curl -X POST -H "Authorization: Bearer $TOKEN" \
    "https://api.olympuscloud.ai/v1/mdm/devices/{device_id}/updates/cancel"
  3. If rollback occurred, check logs

    • Verify firmware image integrity
    • Check for compatibility issues
    • Consider staged rollout for problematic updates
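
The staged rollout suggested in step 3 needs concrete wave sizes. A sketch, assuming 1% / 5% / 25% / 100% waves; the percentages are an assumption to adjust per fleet size and update risk.

```shell
# Sketch: staged rollout wave sizes for a problematic update (step 3).
# Arg: total device count. Integer division rounds each wave down.
rollout_waves() {
  local total="$1" pct
  for pct in 1 5 25 100; do
    echo "$pct% -> $((total * pct / 100)) devices"
  done
}
```

Each wave should bake long enough to catch the rollback condition before the next wave ships.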

Monitoring & Alerts

Key Metrics to Monitor

| Metric | Warning | Critical | Action |
| --- | --- | --- | --- |
| sync_queue_depth | >500 | >1000 | Check sync service |
| sync_latency_p99 | >10s | >30s | Scale sync service |
| iot_message_rate | <50% of baseline | <25% of baseline | Check MQTT broker |
| device_offline_count | >5% | >10% | Investigate connectivity |
| conflict_rate | >0.5% | >1% | Review sync strategies |

Alert Queries

# Sync queue growing
increase(sync_queue_depth[5m]) > 100

# High sync latency (aggregate buckets by le before taking the quantile)
histogram_quantile(0.99, sum(rate(sync_latency_bucket[5m])) by (le)) > 30

# Device offline rate
sum(device_status{status="offline"}) / sum(device_status) > 0.05

# IoT message rate drop
rate(iot_messages_received_total[5m]) < rate(iot_messages_received_total[1h] offset 1d) * 0.5
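
The device-offline alert above can also be evaluated outside Prometheus, e.g. in a cron health check fed by raw counts. A sketch using integer math to avoid floating point in shell; the function name and output labels are assumptions.

```shell
# Sketch: the >5% device-offline check, from raw counts.
# Args: offline count, total count, threshold percentage.
offline_rate_breached() {
  local offline="$1" total="$2" threshold_pct="$3"
  # Compare offline*100 > total*threshold_pct, i.e. offline/total > threshold%
  if [ $((offline * 100)) -gt $((total * threshold_pct)) ]; then
    echo "breach"
  else
    echo "ok"
  fi
}
```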

Escalation Matrix

| Issue Type | L1 Support | L2 Engineering | P0 On-Call |
| --- | --- | --- | --- |
| Single device offline | Handles | — | — |
| Location offline | Handles | After 30 min | After 2 hours |
| Temperature alert | Handles | If food safety risk | — |
| Sync service down | — | Immediately | — |
| Data corruption | — | If affects orders | — |
| MQTT broker issues | — | If >10% devices affected | — |