AI Gateway
All endpoints require a valid JWT Bearer token and are reached through the API gateway at /v1/ai/*.
The AI Gateway provides unified access to multiple AI providers with cost-optimized tier-based routing, response caching, text-to-speech, and per-tenant usage tracking.
Overview
| Attribute | Value |
|---|---|
| Base Path | /api/v1/ai |
| Authentication | Bearer Token |
| Required Roles | analytics_admin, manager, restaurant_manager, tenant_admin, platform_admin, system_admin, super_admin |
| Proxy | Go gateway forwards all /v1/ai/* requests to Python Analytics service at /api/ai/* |
The Go API gateway proxies all requests under /v1/ai/* to the Python Analytics service. The Python service routes requests to the appropriate AI provider via the AIGatewayClient.
Architecture
```
Client Applications (Flutter Shells, API Consumers)
                      |
           Go API Gateway (/v1/ai/*)
                      |
      Python Analytics Service (/api/ai/*)
                      |
  AIGatewayClient (app.services.ai.gateway_client)
                      |
       Cloudflare AI Gateway (unified proxy)
                      |
       +--------------+------------+------------+
       |              |            |            |
  Workers AI      Anthropic      Google     OpenAI/Grok
    (FREE)         (Claude)     (Gemini)
```
Model Tiers
The ACP AI Router uses a tiered model system to optimize costs while maintaining quality. Each tier targets specific use cases.
| Tier | Model | Cost (input / output per M tokens) | Use Case |
|---|---|---|---|
| T1 | Llama 4 Scout (Workers AI) | FREE | Simple queries, greetings, basic classification |
| T2 | Gemini 2.0 Flash | $0.10 / $0.40 | Fast inference, real-time conversation |
| T3 | Gemini 3 Flash | $0.50 / $3.00 | Complex conversation, multi-step reasoning |
| T4 | Claude Haiku 4.5 | $1.00 / $5.00 | High-quality fast reasoning |
| T5 | Claude Sonnet 4.5 | $3.00 / $15.00 | Premium quality analysis |
| T6 | Claude Opus 4.5 | $5.00 / $25.00 | Enterprise strategic planning |
Cost Optimization
By routing 70%+ of queries to T1-T2 models, the platform achieves up to 95% cost reduction compared to always using premium models.
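A rough sanity check of that claim, using only the prices from the tier table; the traffic mix (70% T1, 20% T2, 10% T5) and the 500 input / 200 output tokens per request are illustrative assumptions:

```python
# Illustrative cost comparison. Only the per-million-token prices come from the
# tier table above; the routing mix and token counts are assumptions.
PRICES = {  # (input_cost_per_m, output_cost_per_m) in USD
    "t1": (0.00, 0.00),   # Llama 4 Scout (Workers AI)
    "t2": (0.10, 0.40),   # Gemini 2.0 Flash
    "t5": (3.00, 15.00),  # Claude Sonnet 4.5
}

def request_cost(tier: str, prompt_tokens: int, completion_tokens: int) -> float:
    inp, out = PRICES[tier]
    return prompt_tokens / 1e6 * inp + completion_tokens / 1e6 * out

mix = {"t1": 0.70, "t2": 0.20, "t5": 0.10}         # assumed routing mix
blended = sum(share * request_cost(t, 500, 200) for t, share in mix.items())
premium = request_cost("t5", 500, 200)              # everything on T5
print(f"blended: ${blended:.6f}/req, premium: ${premium:.6f}/req")
print(f"reduction: {1 - blended / premium:.0%}")    # ~89% with these assumptions
```

Pushing more traffic onto T1 (which is free) is what moves the reduction toward the quoted 95% figure.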
Chat Completion
Tier-Based Chat
Request a chat completion with cost-optimized tier routing. The specified tier determines which model handles the request.
POST /api/v1/ai/chat
Request Body
```json
{
  "messages": [
    {"role": "system", "content": "You are a helpful restaurant assistant."},
    {"role": "user", "content": "What are today's specials?"}
  ],
  "tenant_id": "restaurant-1",
  "tier": "t1",
  "temperature": 0.7,
  "max_tokens": 256,
  "stream": false,
  "cache_enabled": true
}
```
| Field | Type | Required | Description |
|---|---|---|---|
| messages | array | Yes | Conversation history (objects with role and content) |
| tenant_id | string | Yes | Tenant identifier for billing |
| tier | string | No | Model tier: t1 through t6 (default: t1) |
| temperature | float | No | Sampling temperature 0-2 (default: 0.7) |
| max_tokens | integer | No | Maximum tokens in response (max: 4096) |
| stream | boolean | No | Stream response (default: false) |
| cache_enabled | boolean | No | Enable response caching (default: true) |
Response
```json
{
  "content": "Today's specials include our pan-seared salmon with lemon butter sauce...",
  "model": "@cf/meta/llama-4-scout-17b-16e-instruct",
  "provider": "workers-ai",
  "tier": "t1",
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 45,
    "total_tokens": 70
  },
  "cached": false,
  "latency_ms": 320,
  "estimated_cost": 0.0
}
```
| Field | Type | Description |
|---|---|---|
| content | string | Generated response text |
| model | string | Model identifier used |
| provider | string | AI provider (workers-ai, anthropic, google, openai, grok) |
| tier | string | Tier used for the request |
| usage | object | Token usage breakdown |
| cached | boolean | Whether the response was served from cache |
| latency_ms | integer | Request latency in milliseconds |
| estimated_cost | float | Estimated cost in USD |
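A minimal request sketch using Python's requests library. The base URL and the ACP_TOKEN environment variable are placeholders for your deployment; the payload fields are those documented above:

```python
import os
import requests

BASE_URL = "https://api.example.com"  # placeholder for your gateway host
headers = {"Authorization": f"Bearer {os.environ['ACP_TOKEN']}"}

payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful restaurant assistant."},
        {"role": "user", "content": "What are today's specials?"},
    ],
    "tenant_id": "restaurant-1",
    "tier": "t1",            # start on the free tier
    "max_tokens": 256,
    "cache_enabled": True,
}

resp = requests.post(f"{BASE_URL}/api/v1/ai/chat", json=payload,
                     headers=headers, timeout=30)
resp.raise_for_status()
data = resp.json()
print(data["content"])
print(f'{data["tier"]} via {data["provider"]}, cached={data["cached"]}, '
      f'cost=${data["estimated_cost"]:.4f}')
```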
Direct Model Access
Access a specific model directly without tier-based routing. Use this when you need specific provider capabilities.
POST /api/v1/ai/chat/direct
Request Body
```json
{
  "messages": [
    {"role": "user", "content": "Analyze this menu item description for appeal"}
  ],
  "tenant_id": "restaurant-1",
  "provider": "anthropic",
  "model": "claude-haiku-4.5-latest",
  "temperature": 0.7,
  "max_tokens": 256,
  "stream": false
}
```
| Field | Type | Required | Description |
|---|---|---|---|
| messages | array | Yes | Conversation history |
| tenant_id | string | Yes | Tenant identifier for billing |
| provider | string | Yes | AI provider: workers-ai, anthropic, google, openai, grok |
| model | string | Yes | Specific model ID (e.g., claude-haiku-4.5-latest, gpt-4o, gemini-2.0-flash) |
| temperature | float | No | Sampling temperature 0-2 (default: 0.7) |
| max_tokens | integer | No | Maximum tokens (max: 4096) |
| stream | boolean | No | Stream response (default: false) |
Response
Same format as the tier-based chat response.
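The call shape mirrors the tiered endpoint. A sketch under the same placeholder assumptions (host and token):

```python
import os
import requests

BASE_URL = "https://api.example.com"  # placeholder for your gateway host
headers = {"Authorization": f"Bearer {os.environ['ACP_TOKEN']}"}

payload = {
    "messages": [{"role": "user",
                  "content": "Analyze this menu item description for appeal"}],
    "tenant_id": "restaurant-1",
    "provider": "anthropic",             # pin the provider...
    "model": "claude-haiku-4.5-latest",  # ...and the exact model, bypassing tier routing
    "max_tokens": 256,
}
resp = requests.post(f"{BASE_URL}/api/v1/ai/chat/direct", json=payload,
                     headers=headers, timeout=30)
resp.raise_for_status()
print(resp.json()["model"])  # echoes the model that served the request
```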
Text-to-Speech
Synthesize Speech
Convert text to speech using ElevenLabs. Returns audio data as a streaming response.
POST /api/v1/ai/tts
Request Body
```json
{
  "text": "Welcome to our restaurant! Today's special is pan-seared salmon.",
  "tenant_id": "restaurant-1",
  "voice_id": "21m00Tcm4TlvDq8ikWAM",
  "tts_model_id": "eleven_turbo_v2_5",
  "output_format": "mp3_44100_128"
}
```
| Field | Type | Required | Description |
|---|---|---|---|
| text | string | Yes | Text to synthesize (max 5000 characters) |
| tenant_id | string | Yes | Tenant identifier for billing |
| voice_id | string | No | ElevenLabs voice ID (default: 21m00Tcm4TlvDq8ikWAM, Rachel) |
| tts_model_id | string | No | TTS model (default: eleven_turbo_v2_5) |
| output_format | string | No | Audio format (default: mp3_44100_128) |
Response
Returns audio data as a streaming binary response with appropriate content type:
- audio/mpeg for MP3 formats
- audio/wav for WAV formats
- audio/pcm for PCM formats
Content-Disposition header: attachment; filename=speech.mp3
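Because the body is streamed binary audio, write it to disk in chunks rather than buffering it. A sketch with the same placeholder host and token as above:

```python
import os
import requests

BASE_URL = "https://api.example.com"  # placeholder for your gateway host
headers = {"Authorization": f"Bearer {os.environ['ACP_TOKEN']}"}

payload = {
    "text": "Welcome to our restaurant! Today's special is pan-seared salmon.",
    "tenant_id": "restaurant-1",
    # voice_id / tts_model_id / output_format fall back to the defaults above
}

with requests.post(f"{BASE_URL}/api/v1/ai/tts", json=payload, headers=headers,
                   stream=True, timeout=60) as resp:
    resp.raise_for_status()
    with open("speech.mp3", "wb") as f:
        for chunk in resp.iter_content(chunk_size=8192):  # stream audio to disk
            f.write(chunk)
```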
Models & Tiers
List Available Models
List all available AI models grouped by provider with pricing information.
GET /api/v1/ai/models
Response
```json
{
  "workers-ai": [
    {
      "model_id": "@cf/meta/llama-4-scout-17b-16e-instruct",
      "provider": "workers-ai",
      "tier": null,
      "input_cost_per_m": 0.0,
      "output_cost_per_m": 0.0,
      "max_tokens": 4096,
      "supports_streaming": true,
      "supports_vision": false
    }
  ],
  "anthropic": [
    {
      "model_id": "claude-haiku-4.5-latest",
      "provider": "anthropic",
      "tier": null,
      "input_cost_per_m": 1.0,
      "output_cost_per_m": 5.0,
      "max_tokens": 4096,
      "supports_streaming": true,
      "supports_vision": true
    }
  ]
}
```
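One use of this catalog is capability-based selection, e.g. finding the cheapest vision-capable model for a direct call. A sketch, with the same placeholder host and token:

```python
import os
import requests

BASE_URL = "https://api.example.com"  # placeholder for your gateway host
headers = {"Authorization": f"Bearer {os.environ['ACP_TOKEN']}"}

catalog = requests.get(f"{BASE_URL}/api/v1/ai/models", headers=headers, timeout=30)
catalog.raise_for_status()

# Flatten the provider-keyed mapping and keep only vision-capable models.
candidates = [m for models in catalog.json().values()
              for m in models if m["supports_vision"]]
cheapest = min(candidates,
               key=lambda m: m["input_cost_per_m"] + m["output_cost_per_m"])
print(cheapest["provider"], cheapest["model_id"])
```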
Get Tier Information
Get detailed information about each model tier including associated models, pricing, and use cases.
GET /api/v1/ai/tiers
Response
```json
{
  "tiers": [
    {
      "tier": "t1",
      "name": "FREE",
      "model": "Llama 4 Scout",
      "provider": "workers-ai",
      "input_cost_per_m": 0.0,
      "output_cost_per_m": 0.0,
      "use_cases": ["Simple queries", "Greetings", "Basic FAQ"]
    },
    {
      "tier": "t2",
      "name": "BUDGET",
      "model": "Gemini 2.0 Flash",
      "provider": "google",
      "input_cost_per_m": 0.10,
      "output_cost_per_m": 0.40,
      "use_cases": ["Real-time conversation", "Summaries"]
    },
    {
      "tier": "t3",
      "name": "STANDARD",
      "model": "Gemini 3 Flash",
      "provider": "google",
      "input_cost_per_m": 0.50,
      "output_cost_per_m": 3.00,
      "use_cases": ["Complex conversation", "Analysis"]
    },
    {
      "tier": "t4",
      "name": "QUALITY",
      "model": "Claude Haiku 4.5",
      "provider": "anthropic",
      "input_cost_per_m": 1.00,
      "output_cost_per_m": 5.00,
      "use_cases": ["Fast reasoning", "Detailed analysis"]
    },
    {
      "tier": "t5",
      "name": "PREMIUM",
      "model": "Claude Sonnet 4.5",
      "provider": "anthropic",
      "input_cost_per_m": 3.00,
      "output_cost_per_m": 15.00,
      "use_cases": ["High-quality analysis", "Report generation"]
    },
    {
      "tier": "t6",
      "name": "ENTERPRISE",
      "model": "Claude Opus 4.5",
      "provider": "anthropic",
      "input_cost_per_m": 5.00,
      "output_cost_per_m": 25.00,
      "use_cases": ["Strategic planning", "Complex reasoning"]
    }
  ],
  "routing_strategy": "Cost-optimized with fallback. Uses lowest cost model capable of handling the request complexity."
}
```
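The estimated_cost field in chat responses follows directly from these per-million-token prices. A worked check, using a hypothetical helper and the t4 pricing above together with the usage block from the earlier chat response:

```python
def estimate_cost(usage: dict, input_cost_per_m: float,
                  output_cost_per_m: float) -> float:
    """Reproduce estimated_cost from a usage block and tier pricing."""
    return (usage["prompt_tokens"] / 1e6 * input_cost_per_m
            + usage["completion_tokens"] / 1e6 * output_cost_per_m)

usage = {"prompt_tokens": 25, "completion_tokens": 45, "total_tokens": 70}
print(f"${estimate_cost(usage, 1.00, 5.00):.6f}")  # t4 pricing -> $0.000250
```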
Usage & Monitoring
Get Usage Statistics
Get AI usage statistics for a tenant including token counts, costs, and cache efficiency.
GET /api/v1/ai/usage/{tenant_id}
Path Parameters
| Parameter | Type | Description |
|---|---|---|
| tenant_id | string | Tenant identifier |
Query Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| period | string | No | Time period: hour, day, week, month (default: day) |
Response
```json
{
  "tenant_id": "restaurant-1",
  "period": "day",
  "total_requests": 1450,
  "total_tokens": 125000,
  "tokens_by_tier": {
    "t1": 85000,
    "t2": 25000,
    "t4": 15000
  },
  "estimated_cost": 12.50,
  "cache_hit_rate": 0.35
}
```
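A small monitoring sketch built on the fields above, flagging a low cache hit rate or a daily spend over budget. The host, token, tenant, and thresholds are all placeholder assumptions:

```python
import os
import requests

BASE_URL = "https://api.example.com"  # placeholder for your gateway host
headers = {"Authorization": f"Bearer {os.environ['ACP_TOKEN']}"}

resp = requests.get(f"{BASE_URL}/api/v1/ai/usage/restaurant-1",
                    params={"period": "day"}, headers=headers, timeout=30)
resp.raise_for_status()
stats = resp.json()

if stats["cache_hit_rate"] < 0.25:   # assumed threshold
    print("warning: cache hit rate below 25%; review cache_enabled usage")
if stats["estimated_cost"] > 20.00:  # assumed daily budget in USD
    print("warning: daily AI spend over budget")
print(stats["tokens_by_tier"])       # spot-check the tier mix
```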
Health Check
Check AI Gateway health and provider availability.
GET /api/v1/ai/health
Response
```json
{
  "status": "healthy",
  "gateway_url": "https://gateway.ai.cloudflare.com/v1/...",
  "providers": {
    "workers-ai": "healthy",
    "anthropic": "healthy",
    "google": "healthy"
  },
  "cache_enabled": true,
  "checked_at": "2026-02-19T14:30:00Z"
}
```
When the health check fails, the endpoint returns:
```json
{
  "status": "unhealthy",
  "error": "Connection timeout",
  "checked_at": "2026-02-19T14:30:00Z"
}
```
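One way to use this endpoint is to gate direct-model calls on per-provider health; the tiered endpoint already falls back server-side per the routing strategy. A sketch under the same placeholder assumptions:

```python
import os
import requests

BASE_URL = "https://api.example.com"  # placeholder for your gateway host
headers = {"Authorization": f"Bearer {os.environ['ACP_TOKEN']}"}

health = requests.get(f"{BASE_URL}/api/v1/ai/health",
                      headers=headers, timeout=10).json()

if health["status"] != "healthy":
    raise RuntimeError(f"AI Gateway unhealthy: {health.get('error', 'unknown')}")

# Skip providers the gateway reports as degraded before issuing direct calls.
usable = [p for p, state in health["providers"].items() if state == "healthy"]
print("providers available for /chat/direct:", usable)
```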
Best Practices
Model Selection
Start with T1/T2 tiers for most workloads. The ACP Router can achieve up to 95% cost reduction by routing 70%+ of queries to lower-cost models without sacrificing quality.
- Start with T1 for simple tasks (greetings, basic queries)
- Use T2 for most production workloads (real-time conversation)
- Reserve T4-T6 for complex reasoning and strategic planning
Caching Strategy
- Enable caching for identical queries (menu lookups, FAQ)
- Set appropriate TTL based on content freshness needs
- Use cache-busting for time-sensitive queries
Cost Optimization
- Batch similar requests when possible
- Use embeddings for semantic search instead of LLM queries
- Implement client-side caching for repeat queries (see the sketch after this list)
- Monitor usage stats to identify optimization opportunities
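A minimal client-side cache for the bullet above, keyed on a hash of the request payload. This is illustrative only; the gateway's own server-side cache is controlled separately via cache_enabled:

```python
import hashlib
import json

_cache: dict[str, dict] = {}

def cached_chat(payload: dict, do_request) -> dict:
    """Memoize identical chat payloads client-side.

    do_request is whatever callable performs the real POST /api/v1/ai/chat.
    """
    key = hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()
    if key not in _cache:
        _cache[key] = do_request(payload)
    return _cache[key]
```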
Error Responses
| Status | Code | Description |
|---|---|---|
| 401 | Unauthorized | Invalid or missing JWT token |
| 500 | Internal Error | AI provider request failed |
Related Documentation
- Voice AI - Voice AI assistant
- LangGraph Agents - Agent orchestration
- Recommendations - AI recommendations
- Forecasting - ML forecasting