# Data Sync & IoT Troubleshooting Runbook

Operational procedures for diagnosing and resolving data synchronization and IoT issues.
## Quick Reference

| Issue | Severity | On-Call Action |
|---|---|---|
| Location offline >15 min | P1 | Check connectivity, escalate if hardware |
| Sync queue >1000 items | P2 | Check cloud service health |
| Temperature sensor alert | P1 | Verify reading, contact location |
| Device offline | P3 | Check device health, remote restart |
| Conflict rate >1% | P2 | Review sync strategy settings |
## Sync Issues

### Location Not Syncing
**Symptoms:**
- Location shows "Offline" in dashboard
- Pending sync count growing
- No recent sync timestamp
**Diagnosis:**

```bash
# Check location sync status
curl -H "Authorization: Bearer $TOKEN" \
  "https://api.olympuscloud.ai/v1/sync/status/{location_id}"

# Check edge agent logs (if accessible)
ssh edge@{location_ip} "journalctl -u olympus-edge-agent -n 100"

# Check cloud sync service health
gcloud run services describe sync-service --region us-central1 --format json \
  | jq '.status.conditions'
```
**Resolution Steps:**

1. Verify network connectivity

   ```bash
   # From edge device
   ping -c 5 api.olympuscloud.ai
   curl -I https://api.olympuscloud.ai/health
   ```

2. Check edge agent status

   ```bash
   systemctl status olympus-edge-agent
   # If stopped, restart
   systemctl restart olympus-edge-agent
   ```

3. Check the local database queue depth

   ```bash
   sqlite3 /var/lib/olympus/edge.db "SELECT COUNT(*) FROM sync_queue"
   ```

4. Force a sync attempt

   ```bash
   curl -X POST -H "Authorization: Bearer $TOKEN" \
     "https://api.olympuscloud.ai/v1/sync/trigger" \
     -d '{"location_id": "{location_id}", "priority": "high"}'
   ```

5. If the problem persists, check firewall rules
   - Outbound 443 (HTTPS) must be open
   - WebSocket upgrade must be allowed
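The firewall requirements above (outbound 443, WebSocket upgrade) can be exercised by hand from the edge device. A minimal sketch; the `/v1/sync/ws` path in the commented commands is an assumption, not a documented endpoint:

```shell
# A WebSocket upgrade needs a random 16-byte key, base64-encoded (24 chars):
gen_ws_key() { head -c 16 /dev/urandom | base64; }

# Confirm each layer in turn (run by hand; expect "101 Switching Protocols"
# on the last check if the upgrade is allowed -- the ws path is an assumption):
#   nc -zv -w 5 api.olympuscloud.ai 443
#   curl -sS -o /dev/null -w '%{http_code}\n' https://api.olympuscloud.ai/health
#   curl -sS -i -N -H "Connection: Upgrade" -H "Upgrade: websocket" \
#     -H "Sec-WebSocket-Version: 13" -H "Sec-WebSocket-Key: $(gen_ws_key)" \
#     https://api.olympuscloud.ai/v1/sync/ws | head -n 1

gen_ws_key
```

If the plain HTTPS check passes but the upgrade check does not return 101, a middlebox is stripping the `Upgrade` header.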
### High Sync Latency
**Symptoms:**
- Sync taking >30 seconds
- Orders appearing delayed at other locations
- Dashboard showing stale data
**Diagnosis:**

```bash
# Check sync latency metrics
gcloud monitoring metrics list --filter="metric.type:custom.googleapis.com/sync/latency"

# Check database performance
# Use the appropriate instance for your environment:
#   dev-olympus-spanner | staging-olympus-spanner | prod-olympus-spanner
gcloud spanner databases execute-sql olympus-db \
  --instance=${SPANNER_INSTANCE} \
  --sql="SELECT AVG(sync_duration_ms) FROM sync_audit WHERE timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)"
```
**Resolution:**

1. Check batch size configuration
   - Default: 100 records per batch
   - If the queue is large, consider a temporary increase

2. Review the conflict rate

   ```sql
   SELECT conflict_type, COUNT(*)
   FROM sync_conflicts
   WHERE resolved_at > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
   GROUP BY conflict_type;
   ```

3. Check for large payloads
   - Menu updates with images can be large
   - Consider splitting them into smaller batches

4. Verify Cloud Run scaling

   ```bash
   gcloud run services describe sync-service --region us-central1 \
     --format="value(spec.template.spec.containerConcurrency)"
   ```
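When temporarily raising the batch size, a rough sizing heuristic helps avoid overshooting. A sketch only: the queue/10 ratio, the clamp values, and the `sync/config` endpoint in the comment are all illustrative assumptions, not the service's documented behavior:

```shell
# Pick a temporary batch size from current queue depth:
# roughly queue/10, clamped between the default (100) and a cap (500).
suggest_batch_size() {
  queue_depth=$1
  default=100
  max=500
  suggested=$((queue_depth / 10))
  if [ "$suggested" -lt "$default" ]; then suggested=$default; fi
  if [ "$suggested" -gt "$max" ]; then suggested=$max; fi
  echo "$suggested"
}

suggest_batch_size 3000   # prints 300

# Apply it (hypothetical config endpoint):
#   curl -X PATCH -H "Authorization: Bearer $TOKEN" \
#     "https://api.olympuscloud.ai/v1/sync/config/{location_id}" \
#     -d "{\"batch_size\": $(suggest_batch_size 3000)}"
```

Remember to revert to the default once the queue drains.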
### Sync Conflicts
**Symptoms:**
- Data inconsistencies between locations
- Order items missing or duplicated
- Time punch discrepancies
**Diagnosis:**

```sql
-- Recent conflicts
SELECT * FROM sync_conflicts
WHERE created_at > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
ORDER BY created_at DESC
LIMIT 50;

-- Conflict rate by table
SELECT table_name, COUNT(*) AS conflicts,
       COUNT(*) * 100.0 / SUM(COUNT(*)) OVER() AS percentage
FROM sync_conflicts
WHERE created_at > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
GROUP BY table_name;
```
**Resolution:**

1. Review the conflict strategy for the affected table

   ```json
   {
     "orders": {"strategy": "server_wins"},
     "menu_items": {"strategy": "server_wins"},
     "time_punches": {"strategy": "merge"}
   }
   ```

2. For orders: server-wins is correct (the cloud is authoritative)

3. For time punches: the merge strategy may need adjustment if conflicts persist

4. For menu items: always server-wins (the menu is managed centrally)

5. If custom resolution is needed, escalate to engineering
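As an illustration of what the merge strategy means for time punches: the real resolution runs inside the sync service, but the effect is roughly a union of both sides by record id, with the newest `updated_at` winning per record. A local sketch using sample rows:

```shell
# Rows: punch_id,updated_at,source. Union by punch_id; the newest
# updated_at wins (ISO-8601 timestamps sort lexicographically).
merge_punches() {
  sort -t, -k1,1 -k2,2r | awk -F, '!seen[$1]++'
}

merged=$(printf '%s\n' \
  'p1,2024-06-01T09:00,edge' \
  'p2,2024-06-01T08:30,cloud' \
  'p1,2024-06-01T08:00,cloud' | merge_punches)
echo "$merged"
# p1 keeps the newer 09:00 edge version; p2 passes through unchanged
```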
## IoT Issues

### Temperature Alert

**Severity: P1 - Immediate Response Required**
**Symptoms:**
- Alert notification received
- Temperature out of range on dashboard
- Possible food safety concern
**Immediate Actions:**

1. Verify the reading is accurate

   ```bash
   # Get recent readings
   curl -H "Authorization: Bearer $TOKEN" \
     "https://api.olympuscloud.ai/v1/iot/devices/{device_id}/telemetry?period=1h"
   ```

2. Check for sensor malfunction
   - Single spike vs. sustained reading
   - Compare with other sensors in the same area

3. Contact the location manager
   - Verify physical equipment status
   - Check whether a door was left open
   - Check whether a power interruption occurred

4. Document in the incident log

   ```bash
   # Log the incident
   curl -X POST -H "Authorization: Bearer $TOKEN" \
     "https://api.olympuscloud.ai/v1/incidents" \
     -d '{
       "type": "temperature_alert",
       "device_id": "{device_id}",
       "location_id": "{location_id}",
       "severity": "high",
       "description": "Walk-in cooler temperature exceeded threshold"
     }'
   ```
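Distinguishing a single spike from a sustained excursion can be done directly on the telemetry values. A minimal sketch with inline sample readings; in practice, feed it values from the telemetry call, and note the 45 °F threshold is only an example:

```shell
# Sample readings in °F, oldest to newest (replace with real telemetry)
readings="38 39 41 47 48 49"
threshold=45

above=0
total=0
for r in $readings; do
  total=$((total + 1))
  if [ "$r" -gt "$threshold" ]; then above=$((above + 1)); fi
done
echo "$above of $total readings above ${threshold}F"

# A run of recent over-threshold readings suggests a real excursion;
# one isolated spike in an otherwise normal series suggests sensor noise.
```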
**Escalation:**
- If food safety is compromised, follow food safety protocols
- If equipment failure suspected, dispatch maintenance
### Device Offline
**Symptoms:**
- Device shows "Offline" in dashboard
- No telemetry data received
- Last seen timestamp is stale
**Diagnosis:**

```bash
# Check device status
curl -H "Authorization: Bearer $TOKEN" \
  "https://api.olympuscloud.ai/v1/mdm/devices/{device_id}"

# Check MQTT broker connection
mosquitto_sub -h mqtt.olympuscloud.ai -t "olympus/+/{location_id}/+/status" -v
```
**Resolution:**

1. Attempt a remote restart

   ```bash
   curl -X POST -H "Authorization: Bearer $TOKEN" \
     "https://api.olympuscloud.ai/v1/mdm/devices/{device_id}/restart"
   ```

2. Check network connectivity at the location
   - WiFi signal strength
   - MQTT port 8883 accessibility

3. Check device power
   - Battery level (if applicable)
   - Power supply status

4. If the device remains offline after 30 minutes
   - Create a maintenance ticket
   - Consider replacement if hardware failure is suspected
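After a remote restart it can take a couple of minutes for a device to reconnect, so poll rather than re-checking by hand. A small bounded-polling helper; the commented status check and its `"status":"online"` response field are assumptions about the MDM response shape:

```shell
# Retry a check up to N times, one second apart.
poll_until_online() {
  attempts=$1
  shift
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      echo "online"
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "still offline after $attempts checks"
  return 1
}

# Example (the "status":"online" field is an assumption):
#   poll_until_online 30 sh -c 'curl -sf -H "Authorization: Bearer $TOKEN" \
#     "https://api.olympuscloud.ai/v1/mdm/devices/{device_id}" \
#     | grep -q "\"status\": *\"online\""'
```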
### MQTT Connection Issues
**Symptoms:**
- Devices intermittently disconnecting
- Telemetry gaps in data
- High message latency
**Diagnosis:**

```bash
# Check MQTT broker health
gcloud monitoring dashboards list --filter="displayName:MQTT"

# Check connection count
curl -s "https://mqtt-admin.olympuscloud.ai/api/connections" \
  | jq '.total_connections'

# Check message throughput
curl -s "https://mqtt-admin.olympuscloud.ai/api/stats" \
  | jq '.messages_per_second'
```
**Resolution:**

1. Check client certificate expiration

   ```bash
   # List expiring certificates
   curl -H "Authorization: Bearer $TOKEN" \
     "https://api.olympuscloud.ai/v1/mdm/certificates?expiring_within=30d"
   ```

2. Verify topic ACLs
   - Devices should only publish to their assigned topics
   - Check for permission-denied errors in the logs

3. Review QoS settings
   - Critical alerts: QoS 2 (exactly once)
   - Telemetry: QoS 1 (at least once)
   - Status updates: QoS 0 (best effort)

4. Check for message queue buildup
   - If the broker is overloaded, scale up its resources
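The QoS policy above can be encoded once and reused when publishing test messages by hand with `mosquitto_pub`. A sketch; the topic layout in the commented example is an assumption:

```shell
# Map a message class to its QoS level per the policy above.
qos_for() {
  case "$1" in
    alert)     echo 2 ;;  # exactly once
    telemetry) echo 1 ;;  # at least once
    status)    echo 0 ;;  # best effort
    *)         echo 1 ;;  # default to at-least-once
  esac
}

qos_for alert   # prints 2

# Example publish over TLS (topic layout is an assumption):
#   mosquitto_pub -h mqtt.olympuscloud.ai -p 8883 --cafile ca.pem \
#     -t "olympus/{tenant}/{location_id}/{device_id}/telemetry" \
#     -q "$(qos_for telemetry)" -m '{"temp_f": 38.2}'
```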
## Edge Agent Issues

### Edge Agent Crash Loop
**Symptoms:**
- Agent repeatedly restarting
- systemd showing failed status
- Partial data in sync queue
**Diagnosis:**

```bash
# Check service status
systemctl status olympus-edge-agent

# Check logs for errors
journalctl -u olympus-edge-agent -n 500 --no-pager | grep -i error

# Check disk space
df -h /var/lib/olympus

# Check memory
free -m
```
**Resolution:**

1. Check for database corruption

   ```bash
   sqlite3 /var/lib/olympus/edge.db "PRAGMA integrity_check"
   ```

2. If the disk is full, clean up old logs

   ```bash
   journalctl --vacuum-size=100M
   ```

3. Check for configuration issues

   ```bash
   cat /etc/olympus/edge-agent.yaml
   ```

4. Reset to the last known good configuration

   ```bash
   cp /etc/olympus/edge-agent.yaml.backup /etc/olympus/edge-agent.yaml
   systemctl restart olympus-edge-agent
   ```
### SQLite Database Issues
**Symptoms:**
- Edge agent unable to read/write local data
- "database is locked" errors
- Corrupted data warnings
**Diagnosis:**

```bash
# Check database integrity
sqlite3 /var/lib/olympus/edge.db "PRAGMA integrity_check"

# Check for lock issues
fuser /var/lib/olympus/edge.db

# Check file permissions
ls -la /var/lib/olympus/edge.db
```
**Resolution:**

1. For lock issues

   ```bash
   # Stop the agent
   systemctl stop olympus-edge-agent
   # Remove stale WAL/shared-memory files. Caution: an un-checkpointed WAL
   # holds committed writes not yet in the main file; removing it discards them.
   rm -f /var/lib/olympus/edge.db-wal /var/lib/olympus/edge.db-shm
   # Restart
   systemctl start olympus-edge-agent
   ```

2. For corruption

   ```bash
   # Back up the current database
   cp /var/lib/olympus/edge.db /var/lib/olympus/edge.db.corrupt
   # Attempt recovery
   sqlite3 /var/lib/olympus/edge.db.corrupt ".dump" | sqlite3 /var/lib/olympus/edge.db.new
   # Verify the new database
   sqlite3 /var/lib/olympus/edge.db.new "PRAGMA integrity_check"
   # If OK, replace
   mv /var/lib/olympus/edge.db.new /var/lib/olympus/edge.db
   systemctl restart olympus-edge-agent
   ```

3. If unrecoverable
   - Data will need to resync from the cloud
   - Create a fresh database and trigger a full sync
## MDM Issues

### Device Enrollment Failure
**Symptoms:**
- Device stuck in "pending_activation"
- Enrollment token expired
- Certificate provisioning failed
**Resolution:**

1. Generate a new enrollment token

   ```bash
   curl -X POST -H "Authorization: Bearer $TOKEN" \
     "https://api.olympuscloud.ai/v1/mdm/devices/{device_id}/regenerate-token"
   ```

2. Verify the device serial number matches

   ```bash
   curl -H "Authorization: Bearer $TOKEN" \
     "https://api.olympuscloud.ai/v1/mdm/devices/{device_id}" \
     | jq '.serial_number'
   ```

3. Check the certificate chain
   - The device must trust our CA
   - Intermediate certificates must be present
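The certificate chain can be checked with `openssl verify` once the device certificate and chain material are on hand. A sketch; the file names below are placeholders:

```shell
# Verify a device certificate against our root CA, supplying intermediates
# as untrusted chain material. "OK" on stdout means the chain is complete.
verify_chain() {
  # $1 = root CA bundle, $2 = intermediate bundle, $3 = device certificate
  openssl verify -CAfile "$1" -untrusted "$2" "$3"
}

# verify_chain olympus-root-ca.pem intermediates.pem device-cert.pem
```

A failure here (e.g. "unable to get local issuer certificate") usually means the intermediate bundle is missing or stale.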
### OTA Update Failure
**Symptoms:**
- Update progress stuck at a fixed percentage
- Rollback triggered
- Device in unknown state
**Resolution:**

1. Check the update status

   ```bash
   curl -H "Authorization: Bearer $TOKEN" \
     "https://api.olympuscloud.ai/v1/mdm/devices/{device_id}/updates/status"
   ```

2. If stuck, cancel and retry

   ```bash
   curl -X POST -H "Authorization: Bearer $TOKEN" \
     "https://api.olympuscloud.ai/v1/mdm/devices/{device_id}/updates/cancel"
   ```

3. If a rollback occurred, check the logs
   - Verify firmware image integrity
   - Check for compatibility issues
   - Consider a staged rollout for problematic updates
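Firmware image integrity is usually a checksum comparison against the release manifest. A minimal sketch, assuming the manifest stores a SHA-256 digest as its first field; file names are placeholders:

```shell
# Compare an image's SHA-256 against the expected digest file.
verify_image() {
  # $1 = image file, $2 = file whose first field is the expected sha256
  expected=$(cut -d' ' -f1 "$2")
  actual=$(sha256sum "$1" | cut -d' ' -f1)
  if [ "$expected" = "$actual" ]; then
    echo "image OK"
  else
    echo "image corrupt (expected $expected, got $actual)"
  fi
}

# verify_image firmware.img firmware.img.sha256
```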
## Monitoring & Alerts

### Key Metrics to Monitor
| Metric | Warning | Critical | Action |
|---|---|---|---|
| sync_queue_depth | > 500 | > 1000 | Check sync service |
| sync_latency_p99 | > 10s | > 30s | Scale sync service |
| iot_message_rate | < 50% of baseline | < 25% of baseline | Check MQTT broker |
| device_offline_count | > 5% | > 10% | Investigate connectivity |
| conflict_rate | > 0.5% | > 1% | Review sync strategies |
### Alert Queries

```promql
# Sync queue growing
increase(sync_queue_depth[5m]) > 100

# High sync latency (p99 over 30s)
histogram_quantile(0.99, sum by (le) (rate(sync_latency_bucket[5m]))) > 30

# Device offline rate above 5%
sum(device_status{status="offline"}) / sum(device_status) > 0.05

# IoT message rate below 50% of the same hour yesterday
rate(iot_messages_received_total[5m]) < rate(iot_messages_received_total[1h] offset 1d) * 0.5
```
## Escalation Matrix

| Issue Type | L1 Support | L2 Engineering | P0 On-Call |
|---|---|---|---|
| Single device offline | ✓ | | |
| Location offline | ✓ | After 30 min | After 2 hours |
| Temperature alert | ✓ | If food safety risk | |
| Sync service down | ✓ | Immediately | |
| Data corruption | ✓ | If affects orders | |
| MQTT broker issues | ✓ | If >10% of devices affected | |