# Data Sync & IoT Troubleshooting Runbook

Operational procedures for diagnosing and resolving data synchronization and IoT issues.
## Quick Reference

| Issue | Severity | On-Call Action |
|---|---|---|
| Location offline >15 min | P1 | Check connectivity, escalate if hardware |
| Sync queue >1000 items | P2 | Check cloud service health |
| Temperature sensor alert | P1 | Verify reading, contact location |
| Device offline | P3 | Check device health, remote restart |
| Conflict rate >1% | P2 | Review sync strategy settings |
## Sync Issues

### Location Not Syncing
**Symptoms:**
- Location shows "Offline" in dashboard
- Pending sync count growing
- No recent sync timestamp
**Diagnosis:**

```bash
# Check location sync status
curl -H "Authorization: Bearer $TOKEN" \
  "https://api.olympuscloud.ai/v1/sync/status/{location_id}"

# Check edge agent logs (if accessible)
ssh edge@{location_ip} "journalctl -u olympus-edge-agent -n 100"

# Check cloud sync service health
gcloud run services describe sync-service --region us-central1 --format json \
  | jq '.status.conditions'
```
**Resolution Steps:**

1. Verify network connectivity

   ```bash
   # From edge device
   ping -c 5 api.olympuscloud.ai
   curl -I https://api.olympuscloud.ai/health
   ```

2. Check edge agent status

   ```bash
   systemctl status olympus-edge-agent
   # If stopped, restart
   systemctl restart olympus-edge-agent
   ```

3. Check the local database queue depth

   ```bash
   sqlite3 /var/lib/olympus/edge.db "SELECT COUNT(*) FROM sync_queue"
   ```

4. Force a sync attempt

   ```bash
   curl -X POST -H "Authorization: Bearer $TOKEN" \
     "https://api.olympuscloud.ai/v1/sync/trigger" \
     -d '{"location_id": "{location_id}", "priority": "high"}'
   ```

5. If the problem persists, check firewall rules
   - Outbound 443 (HTTPS) must be open
   - WebSocket upgrade must be allowed
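The firewall requirements above (outbound 443, WebSocket upgrade) can be exercised by hand from the edge device. A minimal sketch; the `/v1/sync/ws` path in the commented commands is an assumption, not a documented endpoint:

```shell
# A WebSocket upgrade needs a random 16-byte key, base64-encoded (24 chars):
gen_ws_key() { head -c 16 /dev/urandom | base64; }

# Confirm each layer in turn (run by hand; expect "101 Switching Protocols"
# on the last check if the upgrade is allowed -- the ws path is an assumption):
#   nc -zv -w 5 api.olympuscloud.ai 443
#   curl -sS -o /dev/null -w '%{http_code}\n' https://api.olympuscloud.ai/health
#   curl -sS -i -N -H "Connection: Upgrade" -H "Upgrade: websocket" \
#     -H "Sec-WebSocket-Version: 13" -H "Sec-WebSocket-Key: $(gen_ws_key)" \
#     https://api.olympuscloud.ai/v1/sync/ws | head -n 1

gen_ws_key
```

If the plain HTTPS check passes but the upgrade check does not return 101, a middlebox is stripping the `Upgrade` header.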
### High Sync Latency
**Symptoms:**
- Sync taking >30 seconds
- Orders appearing delayed at other locations
- Dashboard showing stale data
**Diagnosis:**

```bash
# Check sync latency metrics
gcloud monitoring metrics list --filter="metric.type:custom.googleapis.com/sync/latency"

# Check database performance
# Use the appropriate instance for your environment:
#   dev-olympus-spanner | staging-olympus-spanner | prod-olympus-spanner
gcloud spanner databases execute-sql olympus-db \
  --instance=${SPANNER_INSTANCE} \
  --sql="SELECT AVG(sync_duration_ms) FROM sync_audit WHERE timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)"
```
**Resolution:**

1. Check batch size configuration
   - Default: 100 records per batch
   - If the queue is large, consider a temporary increase

2. Review the conflict rate

   ```sql
   SELECT conflict_type, COUNT(*)
   FROM sync_conflicts
   WHERE resolved_at > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
   GROUP BY conflict_type;
   ```

3. Check for large payloads
   - Menu updates with images can be large
   - Consider splitting them into smaller batches

4. Verify Cloud Run scaling

   ```bash
   gcloud run services describe sync-service --region us-central1 \
     --format="value(spec.template.spec.containerConcurrency)"
   ```
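When temporarily raising the batch size, a rough sizing heuristic helps avoid overshooting. A sketch only: the queue/10 ratio, the clamp values, and the `sync/config` endpoint in the comment are all illustrative assumptions, not the service's documented behavior:

```shell
# Pick a temporary batch size from current queue depth:
# roughly queue/10, clamped between the default (100) and a cap (500).
suggest_batch_size() {
  queue_depth=$1
  default=100
  max=500
  suggested=$((queue_depth / 10))
  if [ "$suggested" -lt "$default" ]; then suggested=$default; fi
  if [ "$suggested" -gt "$max" ]; then suggested=$max; fi
  echo "$suggested"
}

suggest_batch_size 3000   # prints 300

# Apply it (hypothetical config endpoint):
#   curl -X PATCH -H "Authorization: Bearer $TOKEN" \
#     "https://api.olympuscloud.ai/v1/sync/config/{location_id}" \
#     -d "{\"batch_size\": $(suggest_batch_size 3000)}"
```

Remember to revert to the default once the queue drains.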
### Sync Conflicts
**Symptoms:**
- Data inconsistencies between locations
- Order items missing or duplicated
- Time punch discrepancies
**Diagnosis:**

```sql
-- Recent conflicts
SELECT * FROM sync_conflicts
WHERE created_at > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
ORDER BY created_at DESC
LIMIT 50;

-- Conflict rate by table
SELECT table_name, COUNT(*) AS conflicts,
       COUNT(*) * 100.0 / SUM(COUNT(*)) OVER() AS percentage
FROM sync_conflicts
WHERE created_at > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
GROUP BY table_name;
```
**Resolution:**

1. Review the conflict strategy for the affected table

   ```json
   {
     "orders": {"strategy": "server_wins"},
     "menu_items": {"strategy": "server_wins"},
     "time_punches": {"strategy": "merge"}
   }
   ```

2. For orders: server-wins is correct (the cloud is authoritative)

3. For time punches: the merge strategy may need adjustment if conflicts persist

4. For menu items: always server-wins (the menu is managed centrally)

5. If custom resolution is needed, escalate to engineering
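As an illustration of what the merge strategy means for time punches: the real resolution runs inside the sync service, but the effect is roughly a union of both sides by record id, with the newest `updated_at` winning per record. A local sketch using sample rows:

```shell
# Rows: punch_id,updated_at,source. Union by punch_id; the newest
# updated_at wins (ISO-8601 timestamps sort lexicographically).
merge_punches() {
  sort -t, -k1,1 -k2,2r | awk -F, '!seen[$1]++'
}

merged=$(printf '%s\n' \
  'p1,2024-06-01T09:00,edge' \
  'p2,2024-06-01T08:30,cloud' \
  'p1,2024-06-01T08:00,cloud' | merge_punches)
echo "$merged"
# p1 keeps the newer 09:00 edge version; p2 passes through unchanged
```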
## IoT Issues

### Temperature Alert

**Severity: P1 - Immediate Response Required**
**Symptoms:**
- Alert notification received
- Temperature out of range on dashboard
- Possible food safety concern
**Immediate Actions:**

1. Verify the reading is accurate

   ```bash
   # Get recent readings
   curl -H "Authorization: Bearer $TOKEN" \
     "https://api.olympuscloud.ai/v1/iot/devices/{device_id}/telemetry?period=1h"
   ```

2. Check for sensor malfunction
   - Single spike vs. sustained reading
   - Compare with other sensors in the same area

3. Contact the location manager
   - Verify physical equipment status
   - Check whether a door was left open
   - Check whether a power interruption occurred

4. Document in the incident log

   ```bash
   # Log the incident
   curl -X POST -H "Authorization: Bearer $TOKEN" \
     "https://api.olympuscloud.ai/v1/incidents" \
     -d '{
       "type": "temperature_alert",
       "device_id": "{device_id}",
       "location_id": "{location_id}",
       "severity": "high",
       "description": "Walk-in cooler temperature exceeded threshold"
     }'
   ```
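Distinguishing a single spike from a sustained excursion can be done directly on the telemetry values. A minimal sketch with inline sample readings; in practice, feed it values from the telemetry call, and note the 45 °F threshold is only an example:

```shell
# Sample readings in °F, oldest to newest (replace with real telemetry)
readings="38 39 41 47 48 49"
threshold=45

above=0
total=0
for r in $readings; do
  total=$((total + 1))
  if [ "$r" -gt "$threshold" ]; then above=$((above + 1)); fi
done
echo "$above of $total readings above ${threshold}F"

# A run of recent over-threshold readings suggests a real excursion;
# one isolated spike in an otherwise normal series suggests sensor noise.
```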
**Escalation:**
- If food safety is compromised, follow food safety protocols
- If equipment failure suspected, dispatch maintenance
### Device Offline
**Symptoms:**
- Device shows "Offline" in dashboard
- No telemetry data received
- Last seen timestamp is stale
**Diagnosis:**

```bash
# Check device status
curl -H "Authorization: Bearer $TOKEN" \
  "https://api.olympuscloud.ai/v1/mdm/devices/{device_id}"

# Check MQTT broker connection
mosquitto_sub -h mqtt.olympuscloud.ai -t "olympus/+/{location_id}/+/status" -v
```
**Resolution:**

1. Attempt a remote restart

   ```bash
   curl -X POST -H "Authorization: Bearer $TOKEN" \
     "https://api.olympuscloud.ai/v1/mdm/devices/{device_id}/restart"
   ```

2. Check network connectivity at the location
   - WiFi signal strength
   - MQTT port 8883 accessibility

3. Check device power
   - Battery level (if applicable)
   - Power supply status

4. If the device remains offline after 30 minutes
   - Create a maintenance ticket
   - Consider replacement if hardware failure is suspected
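After a remote restart it can take a couple of minutes for a device to reconnect, so poll rather than re-checking by hand. A small bounded-polling helper; the commented status check and its `"status":"online"` response field are assumptions about the MDM response shape:

```shell
# Retry a check up to N times, one second apart.
poll_until_online() {
  attempts=$1
  shift
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      echo "online"
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "still offline after $attempts checks"
  return 1
}

# Example (the "status":"online" field is an assumption):
#   poll_until_online 30 sh -c 'curl -sf -H "Authorization: Bearer $TOKEN" \
#     "https://api.olympuscloud.ai/v1/mdm/devices/{device_id}" \
#     | grep -q "\"status\": *\"online\""'
```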
### MQTT Connection Issues
**Symptoms:**
- Devices intermittently disconnecting
- Telemetry gaps in data
- High message latency
**Diagnosis:**

```bash
# Check MQTT broker health
gcloud monitoring dashboards list --filter="displayName:MQTT"

# Check connection count
curl -s "https://mqtt-admin.olympuscloud.ai/api/connections" \
  | jq '.total_connections'

# Check message throughput
curl -s "https://mqtt-admin.olympuscloud.ai/api/stats" \
  | jq '.messages_per_second'
```
**Resolution:**

1. Check client certificate expiration

   ```bash
   # List expiring certificates
   curl -H "Authorization: Bearer $TOKEN" \
     "https://api.olympuscloud.ai/v1/mdm/certificates?expiring_within=30d"
   ```

2. Verify topic ACLs
   - Devices should only publish to their assigned topics
   - Check for permission-denied errors in the logs

3. Review QoS settings
   - Critical alerts: QoS 2 (exactly once)
   - Telemetry: QoS 1 (at least once)
   - Status updates: QoS 0 (best effort)

4. Check for message queue buildup
   - If the broker is overloaded, scale up its resources
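The QoS policy above can be encoded once and reused when publishing test messages by hand with `mosquitto_pub`. A sketch; the topic layout in the commented example is an assumption:

```shell
# Map a message class to its QoS level per the policy above.
qos_for() {
  case "$1" in
    alert)     echo 2 ;;  # exactly once
    telemetry) echo 1 ;;  # at least once
    status)    echo 0 ;;  # best effort
    *)         echo 1 ;;  # default to at-least-once
  esac
}

qos_for alert   # prints 2

# Example publish over TLS (topic layout is an assumption):
#   mosquitto_pub -h mqtt.olympuscloud.ai -p 8883 --cafile ca.pem \
#     -t "olympus/{tenant}/{location_id}/{device_id}/telemetry" \
#     -q "$(qos_for telemetry)" -m '{"temp_f": 38.2}'
```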
## Edge Agent Issues

### Edge Agent Crash Loop
**Symptoms:**
- Agent repeatedly restarting
- systemd showing failed status
- Partial data in sync queue
**Diagnosis:**

```bash
# Check service status
systemctl status olympus-edge-agent

# Check logs for errors
journalctl -u olympus-edge-agent -n 500 --no-pager | grep -i error

# Check disk space
df -h /var/lib/olympus

# Check memory
free -m
```
**Resolution:**

1. Check for database corruption

   ```bash
   sqlite3 /var/lib/olympus/edge.db "PRAGMA integrity_check"
   ```

2. If the disk is full, clean up old logs

   ```bash
   journalctl --vacuum-size=100M
   ```

3. Check for configuration issues

   ```bash
   cat /etc/olympus/edge-agent.yaml
   ```

4. Reset to the last known good configuration

   ```bash
   cp /etc/olympus/edge-agent.yaml.backup /etc/olympus/edge-agent.yaml
   systemctl restart olympus-edge-agent
   ```
### SQLite Database Issues
**Symptoms:**
- Edge agent unable to read/write local data
- "database is locked" errors
- Corrupted data warnings
**Diagnosis:**

```bash
# Check database integrity
sqlite3 /var/lib/olympus/edge.db "PRAGMA integrity_check"

# Check for lock issues
fuser /var/lib/olympus/edge.db

# Check file permissions
ls -la /var/lib/olympus/edge.db
```
**Resolution:**

1. For lock issues

   ```bash
   # Stop the agent
   systemctl stop olympus-edge-agent
   # Remove stale WAL/shared-memory files. Caution: an un-checkpointed WAL
   # holds committed writes not yet in the main file; removing it discards them.
   rm -f /var/lib/olympus/edge.db-wal /var/lib/olympus/edge.db-shm
   # Restart
   systemctl start olympus-edge-agent
   ```

2. For corruption

   ```bash
   # Back up the current database
   cp /var/lib/olympus/edge.db /var/lib/olympus/edge.db.corrupt
   # Attempt recovery
   sqlite3 /var/lib/olympus/edge.db.corrupt ".dump" | sqlite3 /var/lib/olympus/edge.db.new
   # Verify the new database
   sqlite3 /var/lib/olympus/edge.db.new "PRAGMA integrity_check"
   # If OK, replace
   mv /var/lib/olympus/edge.db.new /var/lib/olympus/edge.db
   systemctl restart olympus-edge-agent
   ```

3. If unrecoverable
   - Data will need to resync from the cloud
   - Create a fresh database and trigger a full sync
## MDM Issues

### Device Enrollment Failure
**Symptoms:**
- Device stuck in "pending_activation"
- Enrollment token expired
- Certificate provisioning failed
**Resolution:**

1. Generate a new enrollment token

   ```bash
   curl -X POST -H "Authorization: Bearer $TOKEN" \
     "https://api.olympuscloud.ai/v1/mdm/devices/{device_id}/regenerate-token"
   ```

2. Verify the device serial number matches

   ```bash
   curl -H "Authorization: Bearer $TOKEN" \
     "https://api.olympuscloud.ai/v1/mdm/devices/{device_id}" \
     | jq '.serial_number'
   ```

3. Check the certificate chain
   - The device must trust our CA
   - Intermediate certificates must be present
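The certificate chain can be checked with `openssl verify` once the device certificate and chain material are on hand. A sketch; the file names below are placeholders:

```shell
# Verify a device certificate against our root CA, supplying intermediates
# as untrusted chain material. "OK" on stdout means the chain is complete.
verify_chain() {
  # $1 = root CA bundle, $2 = intermediate bundle, $3 = device certificate
  openssl verify -CAfile "$1" -untrusted "$2" "$3"
}

# verify_chain olympus-root-ca.pem intermediates.pem device-cert.pem
```

A failure here (e.g. "unable to get local issuer certificate") usually means the intermediate bundle is missing or stale.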
### OTA Update Failure
**Symptoms:**
- Update progress stuck at a fixed percentage
- Rollback triggered
- Device in unknown state
**Resolution:**

1. Check the update status

   ```bash
   curl -H "Authorization: Bearer $TOKEN" \
     "https://api.olympuscloud.ai/v1/mdm/devices/{device_id}/updates/status"
   ```

2. If stuck, cancel and retry

   ```bash
   curl -X POST -H "Authorization: Bearer $TOKEN" \
     "https://api.olympuscloud.ai/v1/mdm/devices/{device_id}/updates/cancel"
   ```

3. If a rollback occurred, check the logs
   - Verify firmware image integrity
   - Check for compatibility issues
   - Consider a staged rollout for problematic updates
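Firmware image integrity is usually a checksum comparison against the release manifest. A minimal sketch, assuming the manifest stores a SHA-256 digest as its first field; file names are placeholders:

```shell
# Compare an image's SHA-256 against the expected digest file.
verify_image() {
  # $1 = image file, $2 = file whose first field is the expected sha256
  expected=$(cut -d' ' -f1 "$2")
  actual=$(sha256sum "$1" | cut -d' ' -f1)
  if [ "$expected" = "$actual" ]; then
    echo "image OK"
  else
    echo "image corrupt (expected $expected, got $actual)"
  fi
}

# verify_image firmware.img firmware.img.sha256
```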
## Monitoring & Alerts

### Key Metrics to Monitor
| Metric | Warning | Critical | Action |
|---|---|---|---|
| sync_queue_depth | > 500 | > 1000 | Check sync service |
| sync_latency_p99 | > 10s | > 30s | Scale sync service |
| iot_message_rate | < 50% of baseline | < 25% of baseline | Check MQTT broker |
| device_offline_count | > 5% | > 10% | Investigate connectivity |
| conflict_rate | > 0.5% | > 1% | Review sync strategies |
### Alert Queries

```promql
# Sync queue growing
increase(sync_queue_depth[5m]) > 100

# High sync latency (p99 over 30s)
histogram_quantile(0.99, sum by (le) (rate(sync_latency_bucket[5m]))) > 30

# Device offline rate above 5%
sum(device_status{status="offline"}) / sum(device_status) > 0.05

# IoT message rate below 50% of the same hour yesterday
rate(iot_messages_received_total[5m]) < rate(iot_messages_received_total[1h] offset 1d) * 0.5
```
## Escalation Matrix

| Issue Type | L1 Support | L2 Engineering | P0 On-Call |
|---|---|---|---|
| Single device offline | ✓ | | |
| Location offline | ✓ | After 30 min | After 2 hours |
| Temperature alert | ✓ | If food safety risk | |
| Sync service down | ✓ | Immediately | |
| Data corruption | ✓ | If affects orders | |
| MQTT broker issues | ✓ | If >10% of devices affected | |