ACP AI Router
Smart model routing for cost-optimized AI inference, cutting AI costs by up to 95% by sending each query to the cheapest model capable of handling it.
Overview
The ACP AI Router is the central intelligence layer for all AI inference in Olympus Cloud. It analyzes incoming queries and routes them to the most cost-effective model that can handle the request, dramatically reducing AI costs while maintaining response quality.
Cost Impact
| Metric | Before | After | Savings |
|---|---|---|---|
| Monthly AI Cost | $1,000-2,000 | $45-85 | 95%+ |
Cost savings verified in production deployment.
Model Tiers
The router uses a 6-tier model hierarchy, selecting the cheapest capable model for each query:
| Tier | Model | Cost (per M tokens) | Use Cases |
|---|---|---|---|
| T1 | Llama 4 Scout (Workers AI) | FREE | Simple queries, greetings, FAQ |
| T2 | Gemini 2.0 Flash | $0.10 / $0.40 | Real-time conversation |
| T3 | Gemini 3 Flash | $0.50 / $3.00 | Complex conversation |
| T4 | Claude Haiku 4.5 | $1.00 / $5.00 | Fast reasoning |
| T5 | Claude Sonnet 4.5 | $3.00 / $15.00 | High-quality analysis |
| T6 | Claude Opus 4.5 | $5.00 / $25.00 | Strategic planning |
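The per-request cost implied by the tier table can be computed directly from token counts. A minimal sketch, using the prices listed above (the function name and structure are illustrative, not the router's actual implementation):

```python
# Per-million-token prices (input, output) from the tier table above.
TIER_PRICES = {
    "T1": (0.00, 0.00),    # Llama 4 Scout (Workers AI, free)
    "T2": (0.10, 0.40),    # Gemini 2.0 Flash
    "T3": (0.50, 3.00),    # Gemini 3 Flash
    "T4": (1.00, 5.00),    # Claude Haiku 4.5
    "T5": (3.00, 15.00),   # Claude Sonnet 4.5
    "T6": (5.00, 25.00),   # Claude Opus 4.5
}

def estimate_cost_usd(tier: str, tokens_in: int, tokens_out: int) -> float:
    """Estimate request cost in USD for a given tier."""
    price_in, price_out = TIER_PRICES[tier]
    return (tokens_in * price_in + tokens_out * price_out) / 1_000_000
```

This makes the savings concrete: a 1,000-in / 500-out request costs $0.0105 on T5 but nothing on T1, which is why routing simple queries down-tier dominates the cost reduction.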
Supported Providers
The AI Gateway integrates with 6 major AI providers:
| Provider | Models | Use Cases |
|---|---|---|
| Workers AI | Llama 4, Mistral 3.1, Gemma 3, QwQ-32B, Whisper, FLUX.2, BGE | FREE tier, embeddings, transcription |
| Anthropic | Claude 4.5 Haiku/Sonnet/Opus | High-quality reasoning, analysis |
| Google | Gemini 2.0/2.5/3 Flash/Pro | Cost-effective conversation |
| OpenAI | GPT-4o, GPT-4o-mini | General purpose |
| Grok (xAI) | Grok models | Real-time data, vision |
| ElevenLabs | Voice TTS models | Text-to-speech synthesis |
Architecture
Components
| Component | Location | Purpose |
|---|---|---|
| AI Gateway | Cloudflare Worker | Request routing, caching, rate limiting |
| Complexity Analyzer | Edge | Query classification for tier selection |
| Model Clients | Edge + GCP | Provider-specific API clients |
| Response Cache | Cloudflare KV | Cache common queries to reduce costs |
| Vectorize RAG | Cloudflare | Semantic search across 4 indexes |
| LangGraph Agent | Python | Multi-step workflow orchestration |
Routing Logic
Complexity Classification
The router analyzes queries across multiple dimensions:
| Factor | Description |
|---|---|
| Token Count | Shorter queries route to cheaper tiers |
| Query Type | Greetings/FAQ vs analysis/planning |
| Context Needed | Simple vs multi-turn with RAG |
| Domain Complexity | General vs technical/financial |
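The factors above can be combined into a lightweight heuristic classifier. A minimal sketch, assuming illustrative thresholds and labels (the production analyzer's exact rules are not documented here):

```python
GREETINGS = {"hi", "hello", "hey", "thanks"}

def classify_complexity(query: str, needs_rag: bool = False,
                        domain_technical: bool = False) -> str:
    """Heuristic query classification; thresholds are illustrative."""
    tokens = [t.strip("?!.,") for t in query.lower().split()]
    if not needs_rag and len(tokens) <= 6 and GREETINGS & set(tokens):
        return "simple"          # greetings/FAQ -> T1
    if domain_technical:
        return "reasoning"       # technical/financial -> T4
    if needs_rag or len(tokens) > 50:
        return "complex"         # multi-turn with RAG -> T3
    return "basic"               # everyday conversation -> T2
```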
Routing Rules
The router selects tiers based on complexity analysis:
| Complexity | Tier | Model |
|---|---|---|
| Simple (greetings, FAQ) | T1 | Workers AI (free) |
| Basic conversation | T2 | Gemini Flash |
| Complex conversation | T3 | Gemini 3 Flash |
| Reasoning required | T4 | Claude Haiku |
| Analysis/generation | T5 | Claude Sonnet |
| Strategic/planning | T6 | Claude Opus |
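The routing rules reduce to a lookup from complexity label to (tier, model), with a cheap fallback for unknown labels. A sketch using the model IDs from the tier list (the fallback-to-T2 choice is an assumption, mirroring AI_DEFAULT_TIER below):

```python
ROUTING_TABLE = {
    "simple": ("T1", "llama-4-scout"),
    "basic": ("T2", "gemini-2.0-flash"),
    "complex": ("T3", "gemini-3-flash"),
    "reasoning": ("T4", "claude-haiku-4.5"),
    "analysis": ("T5", "claude-sonnet-4.5"),
    "strategic": ("T6", "claude-opus-4.5"),
}

def route(complexity: str) -> tuple:
    """Map a complexity label to (tier, model); unknown labels fall back to T2."""
    return ROUTING_TABLE.get(complexity, ROUTING_TABLE["basic"])
```

Falling back to a cheap tier on classification failure keeps a misbehaving classifier from silently inflating costs.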
Configuration
Environment Variables
| Variable | Description |
|---|---|
| AI_GATEWAY_URL | Gateway endpoint URL |
| AI_DEFAULT_TIER | Fallback tier when routing fails |
| AI_CACHE_TTL | Response cache TTL in seconds |
| AI_MAX_RETRIES | Maximum retry attempts |
See deployment configuration for actual values.
Per-Tenant Overrides
Tenants can configure tier restrictions in the platform settings:
{
  "ai_config": {
    "max_tier": "T4",
    "min_tier": "T1",
    "budget_limit_monthly": 100.00,
    "allowed_models": ["workers-ai", "gemini"]
  }
}
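Enforcing the min_tier/max_tier override amounts to clamping the routed tier into the tenant's allowed range. A minimal sketch (field names follow the config above; the function itself is illustrative):

```python
TIER_ORDER = ["T1", "T2", "T3", "T4", "T5", "T6"]

def clamp_tier(requested: str, ai_config: dict) -> str:
    """Clamp a routed tier into the tenant's [min_tier, max_tier] range."""
    lo = TIER_ORDER.index(ai_config.get("min_tier", "T1"))
    hi = TIER_ORDER.index(ai_config.get("max_tier", "T6"))
    idx = min(max(TIER_ORDER.index(requested), lo), hi)
    return TIER_ORDER[idx]
```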
API Reference
Chat Completion (Tier-Routed)
POST /api/ai/chat
Content-Type: application/json
Authorization: Bearer YOUR_TOKEN
{
  "messages": [
    {"role": "user", "content": "Hello, how are you?"}
  ],
  "context": {
    "tenant_id": "uuid",
    "shell": "staff"
  }
}

# Response
{
  "response": "I'm doing well, thank you! How can I help you today?",
  "metadata": {
    "model": "llama-4-scout",
    "tier": "T1",
    "tokens_in": 12,
    "tokens_out": 18,
    "cost_usd": 0.0,
    "latency_ms": 45,
    "cached": false
  }
}
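A client call to this endpoint can be built with nothing but the standard library. A sketch of constructing the request (the gateway URL and token are placeholders; the payload shape follows the example above):

```python
import json
import urllib.request

def build_chat_request(gateway_url: str, token: str, message: str,
                       tenant_id: str, shell: str = "staff") -> urllib.request.Request:
    """Build a tier-routed chat request for POST /api/ai/chat."""
    body = {
        "messages": [{"role": "user", "content": message}],
        "context": {"tenant_id": tenant_id, "shell": shell},
    }
    return urllib.request.Request(
        f"{gateway_url}/api/ai/chat",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {token}"},
        method="POST",
    )
```

Sending it is then `urllib.request.urlopen(req)`; the metadata block in the response tells you which tier actually served the query.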
Direct Model Access
POST /api/ai/chat/direct
{
  "model": "claude-haiku-4.5",
  "messages": [...],
  "max_tokens": 1000
}
List Available Tiers
GET /api/ai/tiers
# Response
{
  "tiers": [
    {"id": "T1", "model": "llama-4-scout", "cost_input": 0, "cost_output": 0},
    {"id": "T2", "model": "gemini-2.0-flash", "cost_input": 0.10, "cost_output": 0.40},
    ...
  ]
}
List Available Models
GET /api/ai/models
# Response
{
  "providers": {
    "workers-ai": ["llama-4-scout", "mistral-3.1", "whisper"],
    "anthropic": ["claude-haiku-4.5", "claude-sonnet-4.5", "claude-opus-4.5"],
    "google": ["gemini-2.0-flash", "gemini-3-flash"],
    ...
  }
}
Provider Health
GET /api/ai/health
# Response (example)
{
  "status": "healthy",
  "providers": {
    "workers-ai": {"status": "up"},
    "anthropic": {"status": "up"},
    "google": {"status": "up"}
  }
}
Usage Analytics
GET /api/ai/usage/{tenant_id}?start=2026-01-01&end=2026-01-31
# Response (example structure)
{
  "tenant_id": "uuid",
  "period": {"start": "...", "end": "..."},
  "total_requests": 0,
  "total_cost_usd": 0.00,
  "by_tier": {
    "T1": {"requests": 0, "cost": 0},
    "T2": {"requests": 0, "cost": 0},
    ...
  }
}
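Rolling the per-tier breakdown up into totals is a simple aggregation. A sketch assuming the response shape above (field names follow the example; the helper itself is illustrative):

```python
def summarize_usage(usage: dict) -> dict:
    """Roll up per-tier usage into totals (field names follow the response above)."""
    tiers = usage.get("by_tier", {}).values()
    return {
        "total_requests": sum(t["requests"] for t in tiers),
        "total_cost_usd": round(sum(t["cost"] for t in tiers), 6),
    }
```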
Vectorize RAG Service
The AI Router includes semantic search over 4 Vectorize indexes.
Available Indexes
| Index | Dimensions | Content |
|---|---|---|
| menu-rag | 384 | Menu items, ingredients, allergens |
| policy-rag | 384 | HR policies, procedures |
| support-rag | 384 | FAQ, tutorials, troubleshooting |
| training-rag | 384 | Onboarding guides, role training |
RAG API Endpoints
# Single-index semantic search
POST /rag/query
{
  "query": "vegetarian options",
  "index": "menu-rag",
  "limit": 5
}

# Multi-index search with score fusion
POST /rag/multi-query
{
  "query": "new employee onboarding",
  "indexes": ["policy-rag", "training-rag"],
  "limit": 10
}

# Document indexing (admin)
POST /rag/index
{
  "index": "support-rag",
  "documents": [...]
}
# Health check
GET /rag/health
# Returns: {"status":"healthy","bindings":{"ai":true,"menu_rag":true,...}}
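The multi-query endpoint's score fusion can be sketched as a simple max-score merge: take each match's best score across indexes, deduplicate by document ID, and re-rank. This is an illustrative fusion strategy, not necessarily the one the service implements:

```python
def fuse_results(result_sets: dict, limit: int = 10) -> list:
    """Merge ranked matches from several indexes by score (max-score fusion).

    result_sets maps index name -> list of {"id": ..., "score": ...} matches.
    """
    best = {}
    for index, matches in result_sets.items():
        for m in matches:
            # Keep only the highest-scoring occurrence of each document.
            if m["id"] not in best or m["score"] > best[m["id"]]["score"]:
                best[m["id"]] = {"id": m["id"], "score": m["score"], "index": index}
    return sorted(best.values(), key=lambda m: m["score"], reverse=True)[:limit]
```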
LangGraph Agent Orchestrator
Multi-step workflow execution with human-in-the-loop capabilities.
Features
| Feature | Description |
|---|---|
| Multi-Step Workflows | Chain complex operations across services |
| Human Approval | Pause workflows for human review |
| State Persistence | Resume interrupted workflows |
| Intent Classification | Route to appropriate workflow |
| Graceful Degradation | Falls back when dependencies unavailable |
Agent API Endpoints
# Start agent conversation/workflow
POST /api/agent/chat
{
  "message": "Generate weekly sales report",
  "context": {"tenant_id": "uuid"}
}

# Approve pending action
POST /api/agent/approve
{
  "workflow_id": "uuid",
  "approved": true,
  "notes": "Looks good, proceed"
}
# Check agent status
GET /api/agent/status
# Returns: {"available": true, "pending_approvals": 2}
Workflow Types
| Type | Trigger | Actions |
|---|---|---|
| Report Generation | "Generate report" | Query data, format, email |
| Menu Updates | "Update menu" | Validate, sync, notify |
| Scheduling | "Schedule staff" | Analyze, propose, await approval |
| Support Escalation | High urgency | Create ticket, assign, alert |
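The human-approval pause/resume behavior behind /api/agent/approve can be modeled as a small state machine. A minimal sketch (state names are illustrative; the agent's actual persisted state schema is not documented here):

```python
from enum import Enum

class WorkflowState(str, Enum):
    RUNNING = "running"
    AWAITING_APPROVAL = "awaiting_approval"
    APPROVED = "approved"
    REJECTED = "rejected"

def apply_approval(state: WorkflowState, approved: bool) -> WorkflowState:
    """Resume a paused workflow based on a human approval decision."""
    if state is not WorkflowState.AWAITING_APPROVAL:
        raise ValueError(f"workflow is not awaiting approval (state={state.value})")
    return WorkflowState.APPROVED if approved else WorkflowState.REJECTED
```

Rejecting a transition from any state other than AWAITING_APPROVAL is what makes the approval endpoint safe to retry.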
Text-to-Speech
ElevenLabs integration for voice synthesis.
POST /api/ai/tts
{
  "text": "Your order is ready for pickup",
  "voice": "rachel",
  "model": "eleven_multilingual_v2"
}
# Returns audio stream (MP3)
Monitoring
Metrics
| Metric | Description |
|---|---|
| ai_requests_total | Total requests by tier |
| ai_cost_usd | Cost by tier/tenant |
| ai_latency_p99 | 99th percentile latency |
| ai_cache_hit_rate | Cache effectiveness |
| ai_tier_distribution | Requests per tier |
Alert thresholds configurable per deployment.
Cockpit Dashboard
The AI Router metrics are displayed in the Cockpit AI Cost Analytics tab:
- Real-time cost tracking by tenant
- Tier distribution visualization
- Model performance comparison
- Budget alerts and forecasting
Best Practices
- Use RAG for context - Reduces token count by providing relevant context
- Implement caching - Cache common queries to reduce costs
- Set budget limits - Configure per-tenant spending caps
- Monitor tier distribution - Alert on excessive T5/T6 usage
- Batch similar requests - Group related queries for efficiency
Troubleshooting
High T5/T6 Usage
Symptom: Excessive queries routing to expensive tiers
Causes:
- Complex prompts without context
- Missing RAG integration
- Incorrect complexity classification
Solution:
- Review query patterns in Cockpit
- Add RAG context for domain-specific queries
- Tune complexity thresholds
Cache Miss Rate High
Symptom: Low cache hit rate increasing costs
Solution:
- Increase TTL for stable responses
- Normalize query formatting
- Implement semantic caching
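Normalizing query formatting, as suggested above, means trivially different phrasings of the same query should hash to the same cache key. A minimal sketch (the key format is an assumption, not the gateway's actual scheme):

```python
import hashlib

def cache_key(query: str, tier: str) -> str:
    """Derive a stable cache key by normalizing whitespace, case, and trailing punctuation."""
    normalized = " ".join(query.lower().strip().rstrip("?!.").split())
    return f"{tier}:{hashlib.sha256(normalized.encode()).hexdigest()}"
```

Keying by tier as well as query text keeps a cached T1 answer from being served where a T5-quality response was routed.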