ACP AI Router

Smart model routing for cost-optimized AI inference: up to 95% cost reduction by sending each query to the cheapest capable model.

Overview

The ACP AI Router is the central intelligence layer for all AI inference in Olympus Cloud. It analyzes incoming queries and routes them to the most cost-effective model that can handle the request, dramatically reducing AI costs while maintaining response quality.

Cost Impact

| Metric | Before | After | Savings |
|---|---|---|---|
| Monthly AI Cost | $1,000-2,000 | $45-85 | 95%+ |

Cost savings verified in production deployment.

Model Tiers

The router uses a 6-tier model hierarchy, selecting the cheapest capable model for each query:

| Tier | Model | Cost per M tokens (input / output) | Use Cases |
|---|---|---|---|
| T1 | Llama 4 Scout (Workers AI) | FREE | Simple queries, greetings, FAQ |
| T2 | Gemini 2.0 Flash | $0.10 / $0.40 | Real-time conversation |
| T3 | Gemini 3 Flash | $0.50 / $3.00 | Complex conversation |
| T4 | Claude Haiku 4.5 | $1.00 / $5.00 | Fast reasoning |
| T5 | Claude Sonnet 4.5 | $3.00 / $15.00 | High-quality analysis |
| T6 | Claude Opus 4.5 | $5.00 / $25.00 | Strategic planning |
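The pricing table above maps directly to a per-request cost calculation. A minimal sketch in Python (the lookup table mirrors the table above; the function name is illustrative, not part of the gateway API):

```python
# Per-million-token pricing (input, output) for each tier, from the table above.
TIER_PRICING = {
    "T1": (0.00, 0.00),   # Llama 4 Scout (Workers AI) - free
    "T2": (0.10, 0.40),   # Gemini 2.0 Flash
    "T3": (0.50, 3.00),   # Gemini 3 Flash
    "T4": (1.00, 5.00),   # Claude Haiku 4.5
    "T5": (3.00, 15.00),  # Claude Sonnet 4.5
    "T6": (5.00, 25.00),  # Claude Opus 4.5
}

def request_cost_usd(tier: str, tokens_in: int, tokens_out: int) -> float:
    """Compute the USD cost of a single request at a given tier."""
    price_in, price_out = TIER_PRICING[tier]
    return (tokens_in * price_in + tokens_out * price_out) / 1_000_000
```

For example, a T5 request with 2,000 input and 1,000 output tokens costs $0.021, while the same request on T1 is free — which is exactly the spread the router exploits.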

Supported Providers

The AI Gateway integrates with 6 major AI providers:

| Provider | Models | Use Cases |
|---|---|---|
| Workers AI | Llama 4, Mistral 3.1, Gemma 3, QwQ-32B, Whisper, FLUX.2, BGE | FREE tier, embeddings, transcription |
| Anthropic | Claude 4.5 Haiku/Sonnet/Opus | High-quality reasoning, analysis |
| Google | Gemini 2.0/2.5/3 Flash/Pro | Cost-effective conversation |
| OpenAI | GPT-4o, GPT-4o-mini | General purpose |
| Grok (xAI) | Grok models | Real-time data, vision |
| ElevenLabs | Voice TTS models | Text-to-speech synthesis |

Architecture

Components

| Component | Location | Purpose |
|---|---|---|
| AI Gateway | Cloudflare Worker | Request routing, caching, rate limiting |
| Complexity Analyzer | Edge | Query classification for tier selection |
| Model Clients | Edge + GCP | Provider-specific API clients |
| Response Cache | Cloudflare KV | Cache common queries to reduce costs |
| Vectorize RAG | Cloudflare | Semantic search across 4 indexes |
| LangGraph Agent | Python | Multi-step workflow orchestration |

Routing Logic

Complexity Classification

The router analyzes queries across multiple dimensions:

| Factor | Description |
|---|---|
| Token Count | Shorter queries route to cheaper tiers |
| Query Type | Greetings/FAQ vs analysis/planning |
| Context Needed | Simple vs multi-turn with RAG |
| Domain Complexity | General vs technical/financial |
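These factors can be combined into a single heuristic score. A sketch of what such a classifier might look like (purely illustrative — the production analyzer's features and weights are not documented here):

```python
def complexity_score(query: str, has_rag_context: bool = False,
                     domain: str = "general") -> int:
    """Heuristic complexity score (0 = trivial, higher = harder).
    Illustrative only; not the production classifier."""
    score = 0
    # Query type: greetings and FAQ-style openers stay at zero
    if query.lower().strip(" !?.").startswith(("hi", "hello", "hey", "thanks")):
        return 0
    # Token count: longer queries suggest harder work (word count as a proxy)
    words = len(query.split())
    if words > 50:
        score += 2
    elif words > 15:
        score += 1
    # Context needed: multi-turn with RAG context implies more work
    if has_rag_context:
        score += 1
    # Domain complexity: technical/financial domains score higher
    if domain in ("technical", "financial"):
        score += 2
    return score
```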

Routing Rules

The router selects tiers based on complexity analysis:

| Complexity | Tier | Model |
|---|---|---|
| Simple (greetings, FAQ) | T1 | Workers AI (free) |
| Basic conversation | T2 | Gemini 2.0 Flash |
| Complex conversation | T3 | Gemini 3 Flash |
| Reasoning required | T4 | Claude Haiku 4.5 |
| Analysis/generation | T5 | Claude Sonnet 4.5 |
| Strategic/planning | T6 | Claude Opus 4.5 |
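Once complexity is reduced to a small integer score, tier selection is a lookup. A sketch (thresholds are illustrative, not the production values):

```python
def select_tier(complexity: int) -> str:
    """Map a complexity score to the cheapest capable tier.
    Thresholds are illustrative, not the production values."""
    thresholds = [
        (0, "T1"),  # simple: greetings, FAQ
        (1, "T2"),  # basic conversation
        (2, "T3"),  # complex conversation
        (3, "T4"),  # reasoning required
        (4, "T5"),  # analysis/generation
    ]
    for limit, tier in thresholds:
        if complexity <= limit:
            return tier
    return "T6"     # strategic/planning
```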

Configuration

Environment Variables

| Variable | Description |
|---|---|
| AI_GATEWAY_URL | Gateway endpoint URL |
| AI_DEFAULT_TIER | Fallback tier when routing fails |
| AI_CACHE_TTL | Response cache TTL in seconds |
| AI_MAX_RETRIES | Maximum retry attempts |

See deployment configuration for actual values.

Per-Tenant Overrides

Tenants can configure tier restrictions in the platform settings:

```json
{
  "ai_config": {
    "max_tier": "T4",
    "min_tier": "T1",
    "budget_limit_monthly": 100.00,
    "allowed_models": ["workers-ai", "gemini"]
  }
}
```
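Enforcing `min_tier`/`max_tier` amounts to clamping the routed tier into the tenant's allowed window. A sketch of how that clamp might work (function name is illustrative):

```python
TIER_ORDER = ["T1", "T2", "T3", "T4", "T5", "T6"]

def clamp_tier(routed: str, ai_config: dict) -> str:
    """Clamp a routed tier into a tenant's [min_tier, max_tier] window."""
    lo = TIER_ORDER.index(ai_config.get("min_tier", "T1"))
    hi = TIER_ORDER.index(ai_config.get("max_tier", "T6"))
    idx = min(max(TIER_ORDER.index(routed), lo), hi)
    return TIER_ORDER[idx]
```

With the example config above (`max_tier: "T4"`), a query the router scores as T6 would still be served by Claude Haiku on T4.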

API Reference

Chat Completion (Tier-Routed)

```
POST /api/ai/chat
Content-Type: application/json
Authorization: Bearer YOUR_TOKEN

{
  "messages": [
    {"role": "user", "content": "Hello, how are you?"}
  ],
  "context": {
    "tenant_id": "uuid",
    "shell": "staff"
  }
}

# Response
{
  "response": "I'm doing well, thank you! How can I help you today?",
  "metadata": {
    "model": "llama-4-scout",
    "tier": "T1",
    "tokens_in": 12,
    "tokens_out": 18,
    "cost_usd": 0.0,
    "latency_ms": 45,
    "cached": false
  }
}
```
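A client can assemble the request body and read the cost back from the response metadata. A minimal Python sketch (helper names are illustrative; transport and auth are left to the caller):

```python
import json

def build_chat_request(message: str, tenant_id: str, shell: str = "staff") -> bytes:
    """Assemble the JSON body for POST /api/ai/chat."""
    return json.dumps({
        "messages": [{"role": "user", "content": message}],
        "context": {"tenant_id": tenant_id, "shell": shell},
    }).encode("utf-8")

def response_cost(response: dict) -> float:
    """Pull the per-request cost out of the response metadata."""
    return response.get("metadata", {}).get("cost_usd", 0.0)
```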

Direct Model Access

```
POST /api/ai/chat/direct

{
  "model": "claude-haiku-4.5",
  "messages": [...],
  "max_tokens": 1000
}
```

List Available Tiers

```
GET /api/ai/tiers

# Response
{
  "tiers": [
    {"id": "T1", "model": "llama-4-scout", "cost_input": 0, "cost_output": 0},
    {"id": "T2", "model": "gemini-2.0-flash", "cost_input": 0.10, "cost_output": 0.40},
    ...
  ]
}
```

List Available Models

```
GET /api/ai/models

# Response
{
  "providers": {
    "workers-ai": ["llama-4-scout", "mistral-3.1", "whisper"],
    "anthropic": ["claude-haiku-4.5", "claude-sonnet-4.5", "claude-opus-4.5"],
    "google": ["gemini-2.0-flash", "gemini-3-flash"],
    ...
  }
}
```

Provider Health

```
GET /api/ai/health

# Response (example)
{
  "status": "healthy",
  "providers": {
    "workers-ai": {"status": "up"},
    "anthropic": {"status": "up"},
    "google": {"status": "up"}
  }
}
```

Usage Analytics

```
GET /api/ai/usage/{tenant_id}?start=2026-01-01&end=2026-01-31

# Response (example structure)
{
  "tenant_id": "uuid",
  "period": {"start": "...", "end": "..."},
  "total_requests": 0,
  "total_cost_usd": 0.00,
  "by_tier": {
    "T1": {"requests": 0, "cost": 0},
    "T2": {"requests": 0, "cost": 0},
    ...
  }
}
```

Vectorize RAG Service

The AI Router includes semantic search over 4 Vectorize indexes.

Available Indexes

| Index | Dimensions | Content |
|---|---|---|
| menu-rag | 384 | Menu items, ingredients, allergens |
| policy-rag | 384 | HR policies, procedures |
| support-rag | 384 | FAQ, tutorials, troubleshooting |
| training-rag | 384 | Onboarding guides, role training |

RAG API Endpoints

```
# Single-index semantic search
POST /rag/query
{
  "query": "vegetarian options",
  "index": "menu-rag",
  "limit": 5
}

# Multi-index search with score fusion
POST /rag/multi-query
{
  "query": "new employee onboarding",
  "indexes": ["policy-rag", "training-rag"],
  "limit": 10
}

# Document indexing (admin)
POST /rag/index
{
  "index": "support-rag",
  "documents": [...]
}

# Health check
GET /rag/health
# Returns: {"status":"healthy","bindings":{"ai":true,"menu_rag":true,...}}
```
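Multi-index queries merge per-index match lists into one ranked result set. The exact fusion method is not documented here; a simple score-based merge would look like this (illustrative sketch, not the production algorithm):

```python
def fuse_results(per_index: dict, limit: int = 10) -> list:
    """Merge per-index match lists into one list ranked by score.
    per_index maps index name -> list of {"id": ..., "score": ...} matches."""
    merged = []
    for index_name, matches in per_index.items():
        for m in matches:
            # Tag each match with its source index before merging
            merged.append({**m, "index": index_name})
    merged.sort(key=lambda m: m["score"], reverse=True)
    return merged[:limit]
```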

LangGraph Agent Orchestrator

Multi-step workflow execution with human-in-the-loop capabilities.

Features

| Feature | Description |
|---|---|
| Multi-Step Workflows | Chain complex operations across services |
| Human Approval | Pause workflows for human review |
| State Persistence | Resume interrupted workflows |
| Intent Classification | Route to appropriate workflow |
| Graceful Degradation | Falls back when dependencies unavailable |
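The human-approval pause can be modeled as a small state machine: a workflow runs, suspends into an awaiting-approval state, and resumes as done or rejected when the approval call arrives. A minimal sketch (class and field names are illustrative, not the agent's actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class Workflow:
    """Minimal approval-gated workflow state (illustrative)."""
    workflow_id: str
    state: str = "running"          # running -> awaiting_approval -> done/rejected
    notes: list = field(default_factory=list)

    def request_approval(self) -> None:
        """Pause the workflow until a human responds."""
        self.state = "awaiting_approval"

    def approve(self, approved: bool, notes: str = "") -> None:
        """Resume a paused workflow with the human's decision."""
        if self.state != "awaiting_approval":
            raise ValueError("no approval pending")
        if notes:
            self.notes.append(notes)
        self.state = "done" if approved else "rejected"
```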

Agent API Endpoints

```
# Start agent conversation/workflow
POST /api/agent/chat
{
  "message": "Generate weekly sales report",
  "context": {"tenant_id": "uuid"}
}

# Approve pending action
POST /api/agent/approve
{
  "workflow_id": "uuid",
  "approved": true,
  "notes": "Looks good, proceed"
}

# Check agent status
GET /api/agent/status
# Returns: {"available": true, "pending_approvals": 2}
```

Workflow Types

| Type | Trigger | Actions |
|---|---|---|
| Report Generation | "Generate report" | Query data, format, email |
| Menu Updates | "Update menu" | Validate, sync, notify |
| Scheduling | "Schedule staff" | Analyze, propose, await approval |
| Support Escalation | High urgency | Create ticket, assign, alert |

Text-to-Speech

ElevenLabs integration for voice synthesis.

```
POST /api/ai/tts
{
  "text": "Your order is ready for pickup",
  "voice": "rachel",
  "model": "eleven_multilingual_v2"
}

# Returns audio stream (MP3)
```

Monitoring

Metrics

| Metric | Description |
|---|---|
| ai_requests_total | Total requests by tier |
| ai_cost_usd | Cost by tier/tenant |
| ai_latency_p99 | 99th percentile latency |
| ai_cache_hit_rate | Cache effectiveness |
| ai_tier_distribution | Requests per tier |

Alert thresholds are configurable per deployment.

Cockpit Dashboard

The AI Router metrics are displayed in the Cockpit AI Cost Analytics tab:

  • Real-time cost tracking by tenant
  • Tier distribution visualization
  • Model performance comparison
  • Budget alerts and forecasting

Best Practices

  1. Use RAG for context - Reduces token count by providing relevant context
  2. Implement caching - Cache common queries to reduce costs
  3. Set budget limits - Configure per-tenant spending caps
  4. Monitor tier distribution - Alert on excessive T5/T6 usage
  5. Batch similar requests - Group related queries for efficiency

Troubleshooting

High T5/T6 Usage

Symptom: Excessive queries routing to expensive tiers

Causes:

  • Complex prompts without context
  • Missing RAG integration
  • Incorrect complexity classification

Solution:

  1. Review query patterns in Cockpit
  2. Add RAG context for domain-specific queries
  3. Tune complexity thresholds

High Cache Miss Rate

Symptom: Low cache hit rate, driving up costs

Solution:

  1. Increase TTL for stable responses
  2. Normalize query formatting
  3. Implement semantic caching
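Step 2 above matters because trivially different phrasings of the same question otherwise generate distinct cache keys. A sketch of query normalization for exact-match caching (key format and helper name are illustrative):

```python
import hashlib
import re

def cache_key(query: str, tenant_id: str) -> str:
    """Normalize a query and derive a stable cache key (illustrative).
    Lowercasing, whitespace collapsing, and trailing-punctuation stripping
    let superficially different phrasings share one cache entry."""
    normalized = re.sub(r"\s+", " ", query.strip().lower()).rstrip("?!. ")
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]
    return f"ai:cache:{tenant_id}:{digest}"
```

Semantic caching (step 3) goes further by matching on embedding similarity rather than exact normalized text, so paraphrases can also hit the cache.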