Vision AI Troubleshooting Runbook
Operational procedures for diagnosing and resolving Vision AI platform issues.
Quick Reference
| Issue | Severity | On-Call Action |
|---|---|---|
| All cameras offline | P1 | Check network, restart edge |
| Low detection accuracy | P2 | Verify lighting, recalibrate |
| Edge device unresponsive | P2 | Remote restart, check resources |
| Model inference slow | P3 | Check GPU utilization |
| False positive alerts | P3 | Adjust thresholds |
Camera Issues
Camera Not Streaming
Symptoms:
- Camera shows "Offline" in dashboard
- No live feed available
- Last frame timestamp is stale
Diagnosis:
# Check camera connectivity from edge device
ssh edge@{edge_ip} "ping -c 5 {camera_ip}"
# Test RTSP stream directly
ssh edge@{edge_ip} "ffprobe -v error rtsp://{camera_ip}:554/stream1"
# Check Vision AI camera status
curl -H "Authorization: Bearer $TOKEN" \
"https://api.olympuscloud.ai/v1/vision/cameras/{camera_id}"
Resolution Steps:
-
Verify network connectivity
- Ensure camera and edge device are on the same network
- Check for VLAN isolation issues
- Verify firewall allows RTSP (port 554)
-
Test camera directly
# Access camera web interface
curl -I http://{camera_ip}
# Try VLC to view stream
vlc rtsp://{camera_ip}:554/stream1
-
Restart camera connection
# Restart camera module on edge
ssh edge@{edge_ip} "sudo systemctl restart vision-rtsp"
-
Power cycle camera
- If PoE: Check switch port
- If standalone: Verify power adapter
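The two connectivity checks in the diagnosis above narrow down which resolution step to start with. A minimal sketch of that decision, assuming you capture the exit codes of the `ping` and `ffprobe` commands yourself; the function name `classify_camera_fault` is hypothetical and not part of any Vision AI tooling:

```shell
# Hypothetical helper: map the exit codes of the ping and ffprobe checks
# (0 = success) to the resolution step to start with.
# Example capture:
#   ssh edge@{edge_ip} "ping -c 5 {camera_ip}"; ping_rc=$?
#   ssh edge@{edge_ip} "ffprobe -v error rtsp://{camera_ip}:554/stream1"; rtsp_rc=$?
classify_camera_fault() {
  ping_rc=$1
  rtsp_rc=$2
  if [ "$ping_rc" -ne 0 ]; then
    echo "network"   # camera unreachable: start with network connectivity
  elif [ "$rtsp_rc" -ne 0 ]; then
    echo "stream"    # reachable but no RTSP: test the camera, restart vision-rtsp
  else
    echo "ok"        # both checks pass: inspect the Vision AI camera status
  fi
}
```

Note that a failing `ssh` also produces a non-zero exit code, so an unreachable edge device would be reported as "network" here.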
Multiple Cameras Down
Symptoms:
- Several or all cameras offline simultaneously
- Edge device shows degraded status
Diagnosis:
# Check edge device network
ssh edge@{edge_ip} "ip addr show"
# Check camera manager service
ssh edge@{edge_ip} "systemctl status vision-camera-manager"
# Check network switch
ping {switch_ip}
Resolution:
-
Check network infrastructure
- Verify switch is powered and operational
- Check for network loop or broadcast storm
- Verify DHCP server if cameras use dynamic IPs
-
Restart camera manager
ssh edge@{edge_ip} "sudo systemctl restart vision-camera-manager"
-
Check edge device resources
ssh edge@{edge_ip} "free -m && df -h"
Low Video Quality
Symptoms:
- Blurry or pixelated video
- Detection accuracy reduced
- Artifacts in stream
Resolution:
-
Clean camera lens - Check for dust, condensation, or obstruction
-
Check camera settings
# View current camera settings
curl -H "Authorization: Bearer $TOKEN" \
"https://api.olympuscloud.ai/v1/vision/cameras/{camera_id}/settings"
# Adjust quality settings
curl -X PUT -H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
"https://api.olympuscloud.ai/v1/vision/cameras/{camera_id}/settings" \
-d '{"resolution": "1080p", "bitrate": 4000, "fps": 30}'
-
Check network bandwidth
# Requires an iperf3 server (iperf3 -s) listening on the target host
ssh edge@{edge_ip} "iperf3 -c {camera_ip} -t 10"
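To interpret a bandwidth measurement, it helps to know how much throughput the camera streams need in total. The sketch below does that arithmetic. It assumes the `bitrate` setting is in kbps, which this runbook does not state, and the function names are hypothetical:

```shell
# Aggregate stream bandwidth in Mbps for N cameras at a given bitrate.
# Assumption: bitrate is expressed in kbps.
required_mbps() {
  awk -v n="$1" -v kbps="$2" 'BEGIN { printf "%.1f\n", n * kbps / 1000 }'
}

# Succeeds (exit 0) when the measured link throughput in Mbps exceeds the
# aggregate requirement for N cameras at the given bitrate.
has_headroom() {
  awk -v m="$1" -v n="$2" -v kbps="$3" 'BEGIN { exit !(m > n * kbps / 1000) }'
}
```

For example, `required_mbps 8 4000` prints `32.0`, so eight cameras at that bitrate need a link that sustains well over 32 Mbps.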
Detection Issues
Low Detection Accuracy
Symptoms:
- High miss rate on food items
- False negatives in order verification
- Confidence scores below threshold
Diagnosis:
# Check model metrics
curl -H "Authorization: Bearer $TOKEN" \
"https://api.olympuscloud.ai/v1/vision/models/{model_id}/metrics?period=24h"
# View recent detections
curl -H "Authorization: Bearer $TOKEN" \
"https://api.olympuscloud.ai/v1/vision/cameras/{camera_id}/detections?limit=50"
Resolution:
-
Check environmental factors
- Lighting: Ensure consistent, sufficient lighting
- Glare: Reposition camera to avoid reflective surfaces
- Obstructions: Clear any objects blocking camera view
-
Verify camera alignment
- Ensure camera hasn't moved from calibrated position
- Check field of view covers expected area
-
Recalibrate model
# Trigger recalibration
curl -X POST -H "Authorization: Bearer $TOKEN" \
"https://api.olympuscloud.ai/v1/vision/cameras/{camera_id}/calibrate"
-
Adjust confidence threshold
# Lower threshold if missing valid detections
curl -X PUT -H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
"https://api.olympuscloud.ai/v1/vision/cameras/{camera_id}/settings" \
-d '{"min_confidence": 0.7}' # Default is 0.8
-
Consider model retraining
- Collect new training data if menu changed
- Contact support for custom model training
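Before lowering `min_confidence`, it can be useful to see how many recent detections sit just below the current threshold. The sketch below counts scores in a band. It assumes you have already extracted one confidence score per line from the detections response; that response schema is not documented here, so the extraction step is left out. `count_in_band` is a hypothetical name.

```shell
# Reads one confidence score per line on stdin and prints how many fall in
# [low, high): detections that a threshold of `high` drops but a threshold
# of `low` would keep.
count_in_band() {
  awk -v low="$1" -v high="$2" \
    'NF && $1 >= low && $1 < high { n++ } END { print n + 0 }'
}
```

A large count for `count_in_band 0.7 0.8` suggests the lower threshold would recover many detections. A count near zero points to environmental or alignment causes instead.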
High False Positive Rate
Symptoms:
- Detecting items that aren't present
- Misidentifying items
- Spurious alerts
Resolution:
-
Increase confidence threshold
curl -X PUT -H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
"https://api.olympuscloud.ai/v1/vision/cameras/{camera_id}/settings" \
-d '{"min_confidence": 0.85}' # Increase from default 0.8
-
Review detection zones
- Ensure zones don't include irrelevant areas
- Remove zones covering clutter or similar-looking objects
-
Check for confusing items
- Identify items being confused
- Add negative examples to training set
-
Adjust IOU threshold
# Increase IOU for stricter matching
curl -X PUT -H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
"https://api.olympuscloud.ai/v1/vision/cameras/{camera_id}/settings" \
-d '{"iou_threshold": 0.6}' # Default is 0.5
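For reference, IoU (Intersection over Union) is the overlap area of two boxes divided by the area of their union. The sketch below computes it for two axis-aligned boxes so you can see what a change from 0.5 to 0.6 means in practice. How the platform applies `iou_threshold` internally is not documented here, and the `iou` helper is hypothetical.

```shell
# IoU of two axis-aligned boxes, each given as x_min y_min x_max y_max.
iou() {
  awk -v ax1="$1" -v ay1="$2" -v ax2="$3" -v ay2="$4" \
      -v bx1="$5" -v by1="$6" -v bx2="$7" -v by2="$8" 'BEGIN {
    iw = (ax2 < bx2 ? ax2 : bx2) - (ax1 > bx1 ? ax1 : bx1)  # overlap width
    ih = (ay2 < by2 ? ay2 : by2) - (ay1 > by1 ? ay1 : by1)  # overlap height
    if (iw < 0) iw = 0
    if (ih < 0) ih = 0
    inter = iw * ih
    uni = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    printf "%.2f\n", (uni > 0 ? inter / uni : 0)
  }'
}
```

`iou 0 0 10 10 3 0 13 10` prints `0.54`: two 10x10 boxes offset by 3 pixels pass a 0.5 threshold but fail at 0.6.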
Drive-Thru Detection Issues
Symptoms:
- Incorrect car count
- Wait time predictions inaccurate
- Cars not being tracked
Diagnosis:
# Check drive-thru status
curl -H "Authorization: Bearer $TOKEN" \
"https://api.olympuscloud.ai/v1/vision/drive-thru/status"
# View tracking data
curl -H "Authorization: Bearer $TOKEN" \
"https://api.olympuscloud.ai/v1/vision/drive-thru/tracks?period=1h"
Resolution:
-
Verify lane zones
- Ensure detection zones cover entire lane
- Check for zone overlap causing double-counting
-
Check camera angle
- Overhead angle works best
- Avoid side-angle views with occlusion
-
Adjust tracker settings
curl -X PUT -H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
"https://api.olympuscloud.ai/v1/vision/drive-thru/settings" \
-d '{
"track_timeout_seconds": 30,
"min_track_length": 5,
"vehicle_min_size": 100
}'
-
Weather considerations
- Rain/snow can affect detection
- Bright sunlight may cause glare
- Check IR mode at night
Edge Device Issues
Edge Device Unresponsive
Symptoms:
- Cannot SSH to edge device
- No metrics being reported
- Dashboard shows device offline
Resolution:
-
Check network connectivity
ping {edge_ip}
-
Remote power cycle (if IPMI/PDU available)
# Via PDU
curl -X POST "https://pdu.local/outlets/3/reboot"
-
Physical intervention
- Power cycle device manually
- Check power LED and ethernet link lights
-
After recovery, check logs
ssh edge@{edge_ip} "journalctl -u vision-ai --since '1 hour ago'"
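After a power cycle the device takes some time to come back, so the log check usually has to wait. A minimal retry sketch follows; `wait_until` is a hypothetical helper, and the attempt count and delay in the example are illustrative values rather than documented boot times.

```shell
# Run a probe command up to `attempts` times, sleeping `delay` seconds
# between tries. Returns 0 as soon as the probe succeeds, 1 otherwise.
# Example: wait_until 10 30 ping -c 1 {edge_ip}
wait_until() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}
```

Once `wait_until` succeeds, run the `journalctl` command above to see what happened before the outage.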
High GPU Utilization
Symptoms:
- Inference latency increasing
- Dropped frames
- Edge device fan at maximum
Diagnosis:
ssh edge@{edge_ip} "nvidia-smi" # or tegrastats for Jetson
Resolution:
-
Reduce concurrent streams
# Lower max concurrent cameras
curl -X PUT -H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
"https://api.olympuscloud.ai/v1/vision/edge/{edge_id}/settings" \
-d '{"max_concurrent_streams": 4}' # Default is 8
-
Reduce frame rate
curl -X PUT -H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
"https://api.olympuscloud.ai/v1/vision/cameras/{camera_id}/settings" \
-d '{"inference_fps": 10}' # Default is 15
-
Use lighter model
# Switch to smaller model variant
curl -X PUT -H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
"https://api.olympuscloud.ai/v1/vision/cameras/{camera_id}/settings" \
-d '{"model_variant": "yolov8m"}' # Instead of yolov8l
-
Check for thermal throttling
ssh edge@{edge_ip} "cat /sys/class/thermal/thermal_zone*/temp"
- Values are reported in millidegrees Celsius (80000 = 80°C). If >80°C, improve cooling or reduce load
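The sysfs thermal files report millidegrees Celsius, one value per zone, which is easy to misread at a glance. A small sketch for reducing that output to one number and comparing it against the 80°C guideline above; the function names are hypothetical:

```shell
# Reads raw sysfs thermal readings (millidegrees Celsius, one per line) on
# stdin and prints the hottest zone in whole degrees Celsius.
hottest_zone_c() {
  awk 'NF && (max == "" || $1 > max) { max = $1 }
       END { printf "%d\n", max / 1000 }'
}

# Succeeds (exit 0) when the hottest zone on stdin is above the limit in
# degrees Celsius (default 80).
is_throttle_risk() {
  [ "$(hottest_zone_c)" -gt "${1:-80}" ]
}
```

Usage: `ssh edge@{edge_ip} "cat /sys/class/thermal/thermal_zone*/temp" | hottest_zone_c`.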
Disk Full on Edge Device
Symptoms:
- Video recording stopped
- Edge services failing
- Cannot write logs
Diagnosis:
ssh edge@{edge_ip} "df -h"
Resolution:
-
Clear old video
ssh edge@{edge_ip} "find /var/lib/vision/video -type f -mtime +7 -delete"
-
Clear journal logs
ssh edge@{edge_ip} "sudo journalctl --vacuum-size=500M"
-
Check retention settings
curl -H "Authorization: Bearer $TOKEN" \
"https://api.olympuscloud.ai/v1/vision/edge/{edge_id}/settings"
# Reduce retention
curl -X PUT -H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
"https://api.olympuscloud.ai/v1/vision/edge/{edge_id}/settings" \
-d '{"video_retention_days": 7}' # Reduce from 14
-
Check which directories are consuming space
ssh edge@{edge_ip} "du -sh /var/lib/vision/*"
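On a device with several mounts, the full filesystem is not always the obvious one. The sketch below filters `df` output down to filesystems above a usage percentage. It assumes POSIX-format output (`df -P`) and mount points without spaces; `filesystems_over` is a hypothetical name.

```shell
# Reads `df -P` output on stdin and prints "<mount> <use%>" for each
# filesystem whose usage is above the given percentage.
filesystems_over() {
  awk -v limit="$1" 'NR > 1 {
    use = $5
    sub(/%/, "", use)
    if (use + 0 > limit) print $6, use
  }'
}
```

Usage: `ssh edge@{edge_ip} "df -P" | filesystems_over 80`.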
Model Issues
Model Not Loading
Symptoms:
- Inference returning errors
- "Model not found" in logs
- Zero detections
Diagnosis:
ssh edge@{edge_ip} "ls -la /var/lib/vision/models/"
ssh edge@{edge_ip} "cat /var/log/vision/model-loader.log"
Resolution:
-
Re-download model
curl -X POST -H "Authorization: Bearer $TOKEN" \
"https://api.olympuscloud.ai/v1/vision/edge/{edge_id}/models/sync"
-
Check model file integrity
ssh edge@{edge_ip} "md5sum /var/lib/vision/models/*.onnx"
# Compare with expected checksums
-
Verify GPU drivers
ssh edge@{edge_ip} "nvidia-smi" # Should show GPU info
-
Restart inference service
ssh edge@{edge_ip} "sudo systemctl restart vision-inference"
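Comparing checksums by eye is error-prone, and `md5sum -c` can do the comparison if the expected values are stored in a manifest. The sketch below assumes such a manifest exists in the `checksum  filename` format that `md5sum` itself emits. Where the expected checksums come from is not specified in this runbook, and both `verify_models` and the manifest name in the usage note are hypothetical.

```shell
# Verify model files in directory $1 against manifest $2 (path relative
# to $1). Exit status is non-zero if any file is missing or mismatched.
# The body runs in a subshell so the caller's working directory is kept.
verify_models() (
  cd "$1" || exit 2
  md5sum -c "$2"
)
```

On the edge device this could be run as `verify_models /var/lib/vision/models expected.md5`. Any line not ending in `OK` identifies a file to re-sync.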
Model Inference Slow
Symptoms:
- Latency >500ms per frame
- Detection lag visible
- FPS dropping
Resolution:
-
Check batch size
# Reduce batch size for lower latency
curl -X PUT -H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
"https://api.olympuscloud.ai/v1/vision/edge/{edge_id}/settings" \
-d '{"inference_batch_size": 1}' # Default is 4
-
Enable TensorRT optimization (Jetson)
ssh edge@{edge_ip} "sudo /opt/vision/tools/optimize-model.sh"
-
Reduce input resolution
curl -X PUT -H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
"https://api.olympuscloud.ai/v1/vision/cameras/{camera_id}/settings" \
-d '{"inference_resolution": "640x480"}' # Default is 1280x720
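To gauge what a resolution change buys, compare pixel counts. The sketch below does that arithmetic. Treating per-frame inference cost as roughly proportional to pixel count is an assumption, and the real speedup depends on the model and runtime. `pixel_ratio` is a hypothetical helper.

```shell
# Ratio of pixel counts between two WIDTHxHEIGHT resolutions.
pixel_ratio() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    split(a, p, "x")
    split(b, q, "x")
    printf "%.2f\n", (p[1] * p[2]) / (q[1] * q[2])
  }'
}
```

`pixel_ratio 1280x720 640x480` prints `3.00`, so the smaller setting processes a third of the pixels per frame. Note that 1280x720 is 16:9 while 640x480 is 4:3, so the frame is either cropped or rescaled non-uniformly; check detection accuracy after the change.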
Monitoring & Alerts
Key Metrics to Monitor
| Metric | Warning | Critical | Action |
|---|---|---|---|
| camera_fps | under 10 | under 5 | Check camera/network |
| inference_latency_p99 | over 500ms | over 1000ms | Optimize model |
| detection_accuracy | under 90% | under 80% | Recalibrate/retrain |
| edge_gpu_util | over 90% | over 95% | Reduce load |
| edge_disk_used | over 80% | over 90% | Clear old data |
| edge_temp | over 75°C | over 85°C | Check cooling |
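The thresholds in the table can be encoded directly for use in ad hoc checks or scripts. A minimal sketch, assuming values are passed as plain numbers in the table's units (fps, ms, percent, degrees Celsius); `classify_metric` is a hypothetical name:

```shell
# classify_metric NAME VALUE -> prints ok, warning, or critical using the
# thresholds from the Key Metrics table.
classify_metric() {
  case "$1" in
    camera_fps)            dir=below; warn=10;  crit=5    ;;
    inference_latency_p99) dir=above; warn=500; crit=1000 ;;
    detection_accuracy)    dir=below; warn=90;  crit=80   ;;
    edge_gpu_util)         dir=above; warn=90;  crit=95   ;;
    edge_disk_used)        dir=above; warn=80;  crit=90   ;;
    edge_temp)             dir=above; warn=75;  crit=85   ;;
    *) echo "unknown"; return 1 ;;
  esac
  awk -v v="$2" -v w="$warn" -v c="$crit" -v d="$dir" 'BEGIN {
    # For "below" metrics, negate everything so that higher is always worse.
    if (d == "below") { v = -v; w = -w; c = -c }
    print (v > c ? "critical" : (v > w ? "warning" : "ok"))
  }'
}
```

For example, `classify_metric edge_temp 80` prints `warning` and `classify_metric camera_fps 4` prints `critical`.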
Alert Queries
# Camera offline
vision_camera_status{status="offline"} == 1
# High inference latency
histogram_quantile(0.99, rate(vision_inference_latency_bucket[5m])) > 0.5
# Low detection accuracy
vision_detection_accuracy < 0.9
# Edge device high temperature
vision_edge_temperature_celsius > 75
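The expressions above read as PromQL. Assuming they are evaluated by a Prometheus-compatible server, which this runbook does not state, an on-call engineer could run one ad hoc through the standard `/api/v1/query` HTTP endpoint. `PROM_URL` and `prom_query` are hypothetical names, and the default URL is a placeholder.

```shell
# Evaluate a PromQL expression via the Prometheus HTTP API.
# Assumption: PROM_URL points at your Prometheus server.
# -G sends the data as a GET query string; --data-urlencode escapes the
# braces, quotes, and comparison operators in the expression.
prom_query() {
  curl -sG "${PROM_URL:-http://localhost:9090}/api/v1/query" \
    --data-urlencode "query=$1"
}
```

Usage: `prom_query 'vision_edge_temperature_celsius > 75'` returns a JSON document whose `data.result` array lists the series currently matching the condition.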
Escalation Matrix
| Issue Type | L1 Support | L2 Engineering | P0 On-Call |
|---|---|---|---|
| Single camera offline | ✓ | | |
| All cameras offline | ✓ | After 15 min | After 30 min |
| Low detection accuracy | ✓ | After investigation | If food safety impact |
| Edge device down | ✓ | ✓ | If affects orders |
| Model accuracy degraded | ✓ | | |
| Security false alerts | ✓ | | |
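The time-based row for "All cameras offline" can be expressed as a simple lookup, which is handy in paging scripts. A sketch, assuming "after N min" means N or more minutes since the outage began; `escalation_tier` is a hypothetical name:

```shell
# Escalation tier for "All cameras offline" given minutes since the outage
# began: L1 immediately, L2 Engineering after 15 min, P0 On-Call after 30.
escalation_tier() {
  minutes=$1
  if [ "$minutes" -ge 30 ]; then
    echo "P0 On-Call"
  elif [ "$minutes" -ge 15 ]; then
    echo "L2 Engineering"
  else
    echo "L1 Support"
  fi
}
```

The other rows depend on judgment calls (food safety impact, effect on orders) and are not covered by this helper.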
Related Documentation
- Vision AI Platform - Technical overview
- Vision AI Privacy & Compliance - Privacy requirements
- Edge Infrastructure - OlympusEdge
- Monitoring & Alerts - Monitoring setup