Skip to main content

Vision AI Troubleshooting Runbook

Operational procedures for diagnosing and resolving Vision AI platform issues.

Quick Reference

IssueSeverityOn-Call Action
All cameras offlineP1Check network, restart edge
Low detection accuracyP2Verify lighting, recalibrate
Edge device unresponsiveP2Remote restart, check resources
Model inference slowP3Check GPU utilization
False positive alertsP3Adjust thresholds

Camera Issues

Camera Not Streaming

Symptoms:

  • Camera shows "Offline" in dashboard
  • No live feed available
  • Last frame timestamp is stale

Diagnosis:

# Check camera connectivity from edge device
ssh edge@{edge_ip} "ping -c 5 {camera_ip}"

# Test RTSP stream directly
ssh edge@{edge_ip} "ffprobe -v error rtsp://{camera_ip}:554/stream1"

# Check Vision AI camera status
curl -H "Authorization: Bearer $TOKEN" \
"https://api.olympuscloud.ai/v1/vision/cameras/{camera_id}"

Resolution Steps:

  1. Verify network connectivity

    • Ensure camera and edge device on same network
    • Check for VLAN isolation issues
    • Verify firewall allows RTSP (port 554)
  2. Test camera directly

    # Access camera web interface
    curl -I http://{camera_ip}

    # Try VLC to view stream
    vlc rtsp://{camera_ip}:554/stream1
  3. Restart camera connection

    # Restart camera module on edge
    ssh edge@{edge_ip} "sudo systemctl restart vision-rtsp"
  4. Power cycle camera

    • If POE: Check switch port
    • If standalone: Verify power adapter

Multiple Cameras Down

Symptoms:

  • Several or all cameras offline simultaneously
  • Edge device shows degraded status

Diagnosis:

# Check edge device network
ssh edge@{edge_ip} "ip addr show"

# Check camera manager service
ssh edge@{edge_ip} "systemctl status vision-camera-manager"

# Check network switch
ping {switch_ip}

Resolution:

  1. Check network infrastructure

    • Verify switch is powered and operational
    • Check for network loop or broadcast storm
    • Verify DHCP server if cameras use dynamic IPs
  2. Restart camera manager

    ssh edge@{edge_ip} "sudo systemctl restart vision-camera-manager"
  3. Check edge device resources

    ssh edge@{edge_ip} "free -m && df -h"

Low Video Quality

Symptoms:

  • Blurry or pixelated video
  • Detection accuracy reduced
  • Artifacts in stream

Resolution:

  1. Clean camera lens - Check for dust, condensation, or obstruction

  2. Check camera settings

    # View current camera settings
    curl -H "Authorization: Bearer $TOKEN" \
    "https://api.olympuscloud.ai/v1/vision/cameras/{camera_id}/settings"

    # Adjust quality settings
    curl -X PUT -H "Authorization: Bearer $TOKEN" \
    "https://api.olympuscloud.ai/v1/vision/cameras/{camera_id}/settings" \
    -d '{"resolution": "1080p", "bitrate": 4000, "fps": 30}'
  3. Check network bandwidth

    ssh edge@{edge_ip} "iperf3 -c {camera_ip} -t 10"

Detection Issues

Low Detection Accuracy

Symptoms:

  • High miss rate on food items
  • False negatives in order verification
  • Confidence scores below threshold

Diagnosis:

# Check model metrics
curl -H "Authorization: Bearer $TOKEN" \
"https://api.olympuscloud.ai/v1/vision/models/{model_id}/metrics?period=24h"

# View recent detections
curl -H "Authorization: Bearer $TOKEN" \
"https://api.olympuscloud.ai/v1/vision/cameras/{camera_id}/detections?limit=50"

Resolution:

  1. Check environmental factors

    • Lighting: Ensure consistent, sufficient lighting
    • Glare: Reposition camera to avoid reflective surfaces
    • Obstructions: Clear any objects blocking camera view
  2. Verify camera alignment

    • Ensure camera hasn't moved from calibrated position
    • Check field of view covers expected area
  3. Recalibrate model

    # Trigger recalibration
    curl -X POST -H "Authorization: Bearer $TOKEN" \
    "https://api.olympuscloud.ai/v1/vision/cameras/{camera_id}/calibrate"
  4. Adjust confidence threshold

    # Lower threshold if missing valid detections
    curl -X PUT -H "Authorization: Bearer $TOKEN" \
    "https://api.olympuscloud.ai/v1/vision/cameras/{camera_id}/settings" \
    -d '{"min_confidence": 0.7}' # Default is 0.8
  5. Consider model retraining

    • Collect new training data if menu changed
    • Contact support for custom model training

High False Positive Rate

Symptoms:

  • Detecting items that aren't present
  • Misidentifying items
  • Spurious alerts

Resolution:

  1. Increase confidence threshold

    curl -X PUT -H "Authorization: Bearer $TOKEN" \
    "https://api.olympuscloud.ai/v1/vision/cameras/{camera_id}/settings" \
    -d '{"min_confidence": 0.85}' # Increase from default 0.8
  2. Review detection zones

    • Ensure zones don't include irrelevant areas
    • Remove zones covering clutter or similar-looking objects
  3. Check for confusing items

    • Identify items being confused
    • Add negative examples to training set
  4. Adjust IOU threshold

    # Increase IOU for stricter matching
    curl -X PUT -H "Authorization: Bearer $TOKEN" \
    "https://api.olympuscloud.ai/v1/vision/cameras/{camera_id}/settings" \
    -d '{"iou_threshold": 0.6}' # Default is 0.5

Drive-Thru Detection Issues

Symptoms:

  • Incorrect car count
  • Wait time predictions inaccurate
  • Cars not being tracked

Diagnosis:

# Check drive-thru status
curl -H "Authorization: Bearer $TOKEN" \
"https://api.olympuscloud.ai/v1/vision/drive-thru/status"

# View tracking data
curl -H "Authorization: Bearer $TOKEN" \
"https://api.olympuscloud.ai/v1/vision/drive-thru/tracks?period=1h"

Resolution:

  1. Verify lane zones

    • Ensure detection zones cover entire lane
    • Check for zone overlap causing double-counting
  2. Check camera angle

    • Overhead angle works best
    • Avoid side-angle views with occlusion
  3. Adjust tracker settings

    curl -X PUT -H "Authorization: Bearer $TOKEN" \
    "https://api.olympuscloud.ai/v1/vision/drive-thru/settings" \
    -d '{
    "track_timeout_seconds": 30,
    "min_track_length": 5,
    "vehicle_min_size": 100
    }'
  4. Weather considerations

    • Rain/snow can affect detection
    • Bright sunlight may cause glare
    • Check IR mode at night

Edge Device Issues

Edge Device Unresponsive

Symptoms:

  • Cannot SSH to edge device
  • No metrics being reported
  • Dashboard shows device offline

Resolution:

  1. Check network connectivity

    ping {edge_ip}
  2. Remote power cycle (if IPMI/PDU available)

    # Via PDU
    curl -X POST "https://pdu.local/outlets/3/reboot"
  3. Physical intervention

    • Power cycle device manually
    • Check power LED and ethernet link lights
  4. After recovery, check logs

    ssh edge@{edge_ip} "journalctl -u vision-ai --since '1 hour ago'"

High GPU Utilization

Symptoms:

  • Inference latency increasing
  • Dropped frames
  • Edge device fan at maximum

Diagnosis:

ssh edge@{edge_ip} "nvidia-smi"  # or tegrastats for Jetson

Resolution:

  1. Reduce concurrent streams

    # Lower max concurrent cameras
    curl -X PUT -H "Authorization: Bearer $TOKEN" \
    "https://api.olympuscloud.ai/v1/vision/edge/{edge_id}/settings" \
    -d '{"max_concurrent_streams": 4}' # Default is 8
  2. Reduce frame rate

    curl -X PUT -H "Authorization: Bearer $TOKEN" \
    "https://api.olympuscloud.ai/v1/vision/cameras/{camera_id}/settings" \
    -d '{"inference_fps": 10}' # Default is 15
  3. Use lighter model

    # Switch to smaller model variant
    curl -X PUT -H "Authorization: Bearer $TOKEN" \
    "https://api.olympuscloud.ai/v1/vision/cameras/{camera_id}/settings" \
    -d '{"model_variant": "yolov8m"}' # Instead of yolov8l
  4. Check for thermal throttling

    ssh edge@{edge_ip} "cat /sys/class/thermal/thermal_zone*/temp"
    • If >80°C, improve cooling or reduce load

Disk Full on Edge Device

Symptoms:

  • Video recording stopped
  • Edge services failing
  • Cannot write logs

Diagnosis:

ssh edge@{edge_ip} "df -h"

Resolution:

  1. Clear old video

    ssh edge@{edge_ip} "find /var/lib/vision/video -mtime +7 -delete"
  2. Clear journal logs

    ssh edge@{edge_ip} "sudo journalctl --vacuum-size=500M"
  3. Check retention settings

    curl -H "Authorization: Bearer $TOKEN" \
    "https://api.olympuscloud.ai/v1/vision/edge/{edge_id}/settings"

    # Reduce retention
    curl -X PUT -H "Authorization: Bearer $TOKEN" \
    "https://api.olympuscloud.ai/v1/vision/edge/{edge_id}/settings" \
    -d '{"video_retention_days": 7}' # Reduce from 14
  4. Check for runaway processes

    ssh edge@{edge_ip} "du -sh /var/lib/vision/*"

Model Issues

Model Not Loading

Symptoms:

  • Inference returning errors
  • "Model not found" in logs
  • Zero detections

Diagnosis:

ssh edge@{edge_ip} "ls -la /var/lib/vision/models/"

ssh edge@{edge_ip} "cat /var/log/vision/model-loader.log"

Resolution:

  1. Re-download model

    curl -X POST -H "Authorization: Bearer $TOKEN" \
    "https://api.olympuscloud.ai/v1/vision/edge/{edge_id}/models/sync"
  2. Check model file integrity

    ssh edge@{edge_ip} "md5sum /var/lib/vision/models/*.onnx"
    # Compare with expected checksums
  3. Verify GPU drivers

    ssh edge@{edge_ip} "nvidia-smi"  # Should show GPU info
  4. Restart inference service

    ssh edge@{edge_ip} "sudo systemctl restart vision-inference"

Model Inference Slow

Symptoms:

  • Latency >500ms per frame
  • Detection lag visible
  • FPS dropping

Resolution:

  1. Check batch size

    # Reduce batch size for lower latency
    curl -X PUT -H "Authorization: Bearer $TOKEN" \
    "https://api.olympuscloud.ai/v1/vision/edge/{edge_id}/settings" \
    -d '{"inference_batch_size": 1}' # Default is 4
  2. Enable TensorRT optimization (Jetson)

    ssh edge@{edge_ip} "sudo /opt/vision/tools/optimize-model.sh"
  3. Reduce input resolution

    curl -X PUT -H "Authorization: Bearer $TOKEN" \
    "https://api.olympuscloud.ai/v1/vision/cameras/{camera_id}/settings" \
    -d '{"inference_resolution": "640x480"}' # Default is 1280x720

Monitoring & Alerts

Key Metrics to Monitor

MetricWarningCriticalAction
camera_fpsunder 10under 5Check camera/network
inference_latency_p99over 500msover 1000msOptimize model
detection_accuracyunder 90%under 80%Recalibrate/retrain
edge_gpu_utilover 90%over 95%Reduce load
edge_disk_usedover 80%over 90%Clear old data
edge_tempover 75°Cover 85°CCheck cooling

Alert Queries

# Camera offline
vision_camera_status{status="offline"} == 1

# High inference latency
histogram_quantile(0.99, rate(vision_inference_latency_bucket[5m])) > 0.5

# Low detection accuracy
vision_detection_accuracy < 0.9

# Edge device high temperature
vision_edge_temperature_celsius > 75

Escalation Matrix

Issue TypeL1 SupportL2 EngineeringP0 On-Call
Single camera offline
All cameras offlineAfter 15 minAfter 30 min
Low detection accuracyAfter investigationIf food safety impact
Edge device downIf affects orders
Model accuracy degraded
Security false alerts