Authenticated API

These endpoints require a valid JWT Bearer token and are accessible via the API gateway at /v1/ai/*.

AI Gateway

The AI Gateway provides unified access to multiple AI providers with cost-optimized tier-based routing, response caching, text-to-speech, and per-tenant usage tracking.

Overview

| Attribute | Value |
| --- | --- |
| Base Path | `/api/v1/ai` |
| Authentication | Bearer Token |
| Required Roles | `analytics_admin`, `manager`, `restaurant_manager`, `tenant_admin`, `platform_admin`, `system_admin`, `super_admin` |
| Proxy | Go gateway forwards all `/v1/ai/*` requests to the Python Analytics service at `/api/ai/*` |

The Go API gateway proxies all requests under /v1/ai/* to the Python Analytics service. The Python service routes requests to the appropriate AI provider via the AIGatewayClient.


Architecture

```
Client Applications (Flutter Shells, API Consumers)
                     |
          Go API Gateway (/v1/ai/*)
                     |
      Python Analytics Service (/api/ai/*)
                     |
  AIGatewayClient (app.services.ai.gateway_client)
                     |
       Cloudflare AI Gateway (unified proxy)
                     |
    +------------+---+--------+------------+
    |            |            |            |
Workers AI   Anthropic     Google     OpenAI/Grok
  (FREE)     (Claude)     (Gemini)
```

Model Tiers

The ACP AI Router uses a tiered model system to optimize costs while maintaining quality. Each tier targets specific use cases.

| Tier | Model | Cost per M tokens (input / output) | Use Case |
| --- | --- | --- | --- |
| T1 | Llama 4 Scout (Workers AI) | FREE | Simple queries, greetings, basic classification |
| T2 | Gemini 2.0 Flash | $0.10 / $0.40 | Fast inference, real-time conversation |
| T3 | Gemini 3 Flash | $0.50 / $3.00 | Complex conversation, multi-step reasoning |
| T4 | Claude Haiku 4.5 | $1.00 / $5.00 | High-quality fast reasoning |
| T5 | Claude Sonnet 4.5 | $3.00 / $15.00 | Premium quality analysis |
| T6 | Claude Opus 4.5 | $5.00 / $25.00 | Enterprise strategic planning |

Cost Optimization

By routing 70%+ of queries to T1-T2 models, the platform achieves up to 95% cost reduction compared to always using premium models.
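As a back-of-envelope illustration, the savings follow directly from a weighted average of the tier rates. The routing split below is a hypothetical example, not a measured production figure; the rates are the input costs from the tier table:

```python
# Illustrative blended-cost calculation using input rates from the tier table.
RATES = {"t1": 0.00, "t2": 0.10, "t4": 1.00, "t5": 3.00}  # USD per M input tokens

def blended_cost(mix: dict[str, float], rates: dict[str, float]) -> float:
    """Weighted-average cost per M input tokens for a given traffic mix."""
    return sum(share * rates[tier] for tier, share in mix.items())

mix = {"t1": 0.70, "t2": 0.20, "t4": 0.10}  # hypothetical routing split
cost = blended_cost(mix, RATES)             # 0.70*0 + 0.20*0.10 + 0.10*1.00 = 0.12
saving = 1 - cost / RATES["t5"]             # vs. sending everything to T5
print(f"${cost:.2f}/M tokens, {saving:.0%} cheaper than all-T5")  # 96% cheaper
```

With an even heavier skew toward T1 the saving approaches the quoted 95%+ figure.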


Chat Completion

Tier-Based Chat

Get chat completion with cost-optimized tier routing. The specified tier determines which model handles the request.

POST /api/v1/ai/chat

Request Body

```json
{
  "messages": [
    {"role": "system", "content": "You are a helpful restaurant assistant."},
    {"role": "user", "content": "What are today's specials?"}
  ],
  "tenant_id": "restaurant-1",
  "tier": "t1",
  "temperature": 0.7,
  "max_tokens": 256,
  "stream": false,
  "cache_enabled": true
}
```
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| messages | array | Yes | Conversation history (objects with `role` and `content`) |
| tenant_id | string | Yes | Tenant identifier for billing |
| tier | string | No | Model tier: `t1` through `t6` (default: `t1`) |
| temperature | float | No | Sampling temperature 0-2 (default: 0.7) |
| max_tokens | integer | No | Maximum tokens in response (max: 4096) |
| stream | boolean | No | Stream response (default: false) |
| cache_enabled | boolean | No | Enable response caching (default: true) |

Response

```json
{
  "content": "Today's specials include our pan-seared salmon with lemon butter sauce...",
  "model": "@cf/meta/llama-4-scout-17b-16e-instruct",
  "provider": "workers-ai",
  "tier": "t1",
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 45,
    "total_tokens": 70
  },
  "cached": false,
  "latency_ms": 320,
  "estimated_cost": 0.0
}
```
| Field | Type | Description |
| --- | --- | --- |
| content | string | Generated response text |
| model | string | Model identifier used |
| provider | string | AI provider (`workers-ai`, `anthropic`, `google`, `openai`, `grok`) |
| tier | string | Tier used for the request |
| usage | object | Token usage breakdown |
| cached | boolean | Whether the response was served from cache |
| latency_ms | integer | Request latency in milliseconds |
| estimated_cost | float | Estimated cost in USD |
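A minimal Python client sketch for this endpoint, using only the standard library. `BASE_URL` and `TOKEN` are placeholders for your deployment's host and a valid JWT; the helper names are ours, not part of the API:

```python
import json
import urllib.request

BASE_URL = "https://api.example.com"  # placeholder: your gateway host
TOKEN = "YOUR_JWT"                    # placeholder: a valid Bearer token

def build_chat_payload(messages, tenant_id, tier="t1", **options):
    """Assemble the request body for POST /api/v1/ai/chat."""
    payload = {"messages": messages, "tenant_id": tenant_id, "tier": tier}
    payload.update(options)  # temperature, max_tokens, stream, cache_enabled
    return payload

def send_chat(payload):
    """POST the payload to the gateway and return the decoded JSON response."""
    req = urllib.request.Request(
        f"{BASE_URL}/api/v1/ai/chat",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {TOKEN}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)

payload = build_chat_payload(
    [{"role": "user", "content": "What are today's specials?"}],
    tenant_id="restaurant-1",
    max_tokens=256,
)
# reply = send_chat(payload)  # uncomment with real credentials
# print(reply["content"], reply["estimated_cost"])
```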

Direct Model Access

Access a specific model directly without tier-based routing. Use this when you need specific provider capabilities.

POST /api/v1/ai/chat/direct

Request Body

```json
{
  "messages": [
    {"role": "user", "content": "Analyze this menu item description for appeal"}
  ],
  "tenant_id": "restaurant-1",
  "provider": "anthropic",
  "model": "claude-haiku-4.5-latest",
  "temperature": 0.7,
  "max_tokens": 256,
  "stream": false
}
```
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| messages | array | Yes | Conversation history |
| tenant_id | string | Yes | Tenant identifier for billing |
| provider | string | Yes | AI provider: `workers-ai`, `anthropic`, `google`, `openai`, `grok` |
| model | string | Yes | Specific model ID (e.g., `claude-haiku-4.5-latest`, `gpt-4o`, `gemini-2.0-flash`) |
| temperature | float | No | Sampling temperature 0-2 (default: 0.7) |
| max_tokens | integer | No | Maximum tokens (max: 4096) |
| stream | boolean | No | Stream response (default: false) |

Response

Same format as the tier-based chat response.
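The direct body swaps `tier` for explicit `provider` and `model` fields, both required. A small payload-builder sketch (field names from the table above; helper name is ours):

```python
def build_direct_payload(messages, tenant_id, provider, model, **options):
    """Assemble the request body for POST /api/v1/ai/chat/direct.

    Unlike tier-based chat, provider and model are both required here.
    """
    payload = {
        "messages": messages,
        "tenant_id": tenant_id,
        "provider": provider,
        "model": model,
    }
    payload.update(options)  # temperature, max_tokens, stream
    return payload

payload = build_direct_payload(
    [{"role": "user", "content": "Analyze this menu item description for appeal"}],
    tenant_id="restaurant-1",
    provider="anthropic",
    model="claude-haiku-4.5-latest",
)
```

The payload can be sent exactly like a tier-based request, just to the `/chat/direct` path.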


Text-to-Speech

Synthesize Speech

Convert text to speech using ElevenLabs. Returns audio data as a streaming response.

POST /api/v1/ai/tts

Request Body

```json
{
  "text": "Welcome to our restaurant! Today's special is pan-seared salmon.",
  "tenant_id": "restaurant-1",
  "voice_id": "21m00Tcm4TlvDq8ikWAM",
  "tts_model_id": "eleven_turbo_v2_5",
  "output_format": "mp3_44100_128"
}
```
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| text | string | Yes | Text to synthesize (max 5000 characters) |
| tenant_id | string | Yes | Tenant identifier for billing |
| voice_id | string | No | ElevenLabs voice ID (default: `21m00Tcm4TlvDq8ikWAM` - Rachel) |
| tts_model_id | string | No | TTS model (default: `eleven_turbo_v2_5`) |
| output_format | string | No | Audio format (default: `mp3_44100_128`) |

Response

Returns audio data as a streaming binary response with appropriate content type:

  • audio/mpeg for MP3 formats
  • audio/wav for WAV formats
  • audio/pcm for PCM formats

Content-Disposition header: attachment; filename=speech.mp3
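A sketch of saving the synthesized audio to disk with the standard library. The extension helper mirrors the content-type mapping listed above; function names and the credential handling are our assumptions:

```python
import json
import urllib.request

def audio_extension(output_format: str) -> str:
    """Map an output_format prefix (mp3/wav/pcm) to a file extension."""
    for prefix, ext in (("mp3", ".mp3"), ("wav", ".wav"), ("pcm", ".pcm")):
        if output_format.startswith(prefix):
            return ext
    raise ValueError(f"unrecognized output format: {output_format}")

def synthesize(base_url, token, text, tenant_id, output_format="mp3_44100_128"):
    """POST to /api/v1/ai/tts and write the binary response to a file."""
    body = {"text": text, "tenant_id": tenant_id, "output_format": output_format}
    req = urllib.request.Request(
        f"{base_url}/api/v1/ai/tts",
        data=json.dumps(body).encode(),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )
    filename = "speech" + audio_extension(output_format)
    with urllib.request.urlopen(req, timeout=60) as resp, open(filename, "wb") as f:
        f.write(resp.read())  # fine for short clips; read in chunks for long text
    return filename

# synthesize("https://api.example.com", "YOUR_JWT", "Welcome!", "restaurant-1")
```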


Models & Tiers

List Available Models

List all available AI models grouped by provider with pricing information.

GET /api/v1/ai/models

Response

```json
{
  "workers-ai": [
    {
      "model_id": "@cf/meta/llama-4-scout-17b-16e-instruct",
      "provider": "workers-ai",
      "tier": null,
      "input_cost_per_m": 0.0,
      "output_cost_per_m": 0.0,
      "max_tokens": 4096,
      "supports_streaming": true,
      "supports_vision": false
    }
  ],
  "anthropic": [
    {
      "model_id": "claude-haiku-4.5-latest",
      "provider": "anthropic",
      "tier": null,
      "input_cost_per_m": 1.0,
      "output_cost_per_m": 5.0,
      "max_tokens": 4096,
      "supports_streaming": true,
      "supports_vision": true
    }
  ]
}
```

Get Tier Information

Get detailed information about each model tier including associated models, pricing, and use cases.

GET /api/v1/ai/tiers

Response

```json
{
  "tiers": [
    {
      "tier": "t1",
      "name": "FREE",
      "model": "Llama 4 Scout",
      "provider": "workers-ai",
      "input_cost_per_m": 0.0,
      "output_cost_per_m": 0.0,
      "use_cases": ["Simple queries", "Greetings", "Basic FAQ"]
    },
    {
      "tier": "t2",
      "name": "BUDGET",
      "model": "Gemini 2.0 Flash",
      "provider": "google",
      "input_cost_per_m": 0.10,
      "output_cost_per_m": 0.40,
      "use_cases": ["Real-time conversation", "Summaries"]
    },
    {
      "tier": "t3",
      "name": "STANDARD",
      "model": "Gemini 3 Flash",
      "provider": "google",
      "input_cost_per_m": 0.50,
      "output_cost_per_m": 3.00,
      "use_cases": ["Complex conversation", "Analysis"]
    },
    {
      "tier": "t4",
      "name": "QUALITY",
      "model": "Claude Haiku 4.5",
      "provider": "anthropic",
      "input_cost_per_m": 1.00,
      "output_cost_per_m": 5.00,
      "use_cases": ["Fast reasoning", "Detailed analysis"]
    },
    {
      "tier": "t5",
      "name": "PREMIUM",
      "model": "Claude Sonnet 4.5",
      "provider": "anthropic",
      "input_cost_per_m": 3.00,
      "output_cost_per_m": 15.00,
      "use_cases": ["High-quality analysis", "Report generation"]
    },
    {
      "tier": "t6",
      "name": "ENTERPRISE",
      "model": "Claude Opus 4.5",
      "provider": "anthropic",
      "input_cost_per_m": 5.00,
      "output_cost_per_m": 25.00,
      "use_cases": ["Strategic planning", "Complex reasoning"]
    }
  ],
  "routing_strategy": "Cost-optimized with fallback. Uses lowest cost model capable of handling the request complexity."
}
```

Usage & Monitoring

Get Usage Statistics

Get AI usage statistics for a tenant including token counts, costs, and cache efficiency.

GET /api/v1/ai/usage/{tenant_id}

Path Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| tenant_id | string | Tenant identifier |

Query Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| period | string | No | Time period: `hour`, `day`, `week`, `month` (default: `day`) |

Response

```json
{
  "tenant_id": "restaurant-1",
  "period": "day",
  "total_requests": 1450,
  "total_tokens": 125000,
  "tokens_by_tier": {
    "t1": 85000,
    "t2": 25000,
    "t4": 15000
  },
  "estimated_cost": 12.50,
  "cache_hit_rate": 0.35
}
```
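The usage response is easy to post-process for dashboards or alerting. A small sketch deriving per-request cost and per-tier token shares from the sample above (helper name is ours):

```python
def usage_summary(stats: dict) -> dict:
    """Derive secondary metrics from a /usage/{tenant_id} response."""
    total_tokens = stats["total_tokens"]
    return {
        "cost_per_request": stats["estimated_cost"] / stats["total_requests"],
        "tier_share": {
            tier: tokens / total_tokens
            for tier, tokens in stats["tokens_by_tier"].items()
        },
        "cache_hit_rate": stats["cache_hit_rate"],
    }

sample = {
    "total_requests": 1450,
    "total_tokens": 125000,
    "tokens_by_tier": {"t1": 85000, "t2": 25000, "t4": 15000},
    "estimated_cost": 12.50,
    "cache_hit_rate": 0.35,
}
summary = usage_summary(sample)
print(f"{summary['tier_share']['t1']:.0%} of tokens on the free tier")  # 68%
```

A tenant with a low T1 share or a falling cache hit rate is a candidate for the cost-optimization steps under Best Practices.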

Health Check

Check AI Gateway health and provider availability.

GET /api/v1/ai/health

Response

```json
{
  "status": "healthy",
  "gateway_url": "https://gateway.ai.cloudflare.com/v1/...",
  "providers": {
    "workers-ai": "healthy",
    "anthropic": "healthy",
    "google": "healthy"
  },
  "cache_enabled": true,
  "checked_at": "2026-02-19T14:30:00Z"
}
```

When the health check fails, returns:

```json
{
  "status": "unhealthy",
  "error": "Connection timeout",
  "checked_at": "2026-02-19T14:30:00Z"
}
```

Best Practices

Model Selection

Cost Optimization

Start with T1/T2 tiers for most workloads. The ACP Router can achieve up to 95% cost reduction by routing 70%+ of queries to lower-cost models without sacrificing quality.

  1. Start with T1 for simple tasks (greetings, basic queries)
  2. Use T2 for most production workloads (real-time conversation)
  3. Reserve T4-T6 for complex reasoning and strategic planning
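The ACP Router applies this policy server-side, but a client that pre-selects tiers can follow the same three steps. A deliberately naive heuristic sketch (the word-count threshold and flag are illustrative assumptions, not part of the API):

```python
def pick_tier(message: str, needs_deep_reasoning: bool = False) -> str:
    """Naive client-side tier choice mirroring the three steps above."""
    if needs_deep_reasoning:       # step 3: complex reasoning, strategic planning
        return "t4"
    if len(message.split()) <= 8:  # step 1: greetings, short lookups
        return "t1"
    return "t2"                    # step 2: default production workloads

print(pick_tier("Hi there!"))  # t1
print(pick_tier("Walk me through how last weekend's sales compared with the prior month"))  # t2
print(pick_tier("Plan next quarter's menu strategy", needs_deep_reasoning=True))  # t4
```

A production client would likely classify intent rather than count words, but the escalation order is the same.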

Caching Strategy

  • Enable caching for identical queries (menu lookups, FAQ)
  • Set appropriate TTL based on content freshness needs
  • Use cache-busting for time-sensitive queries

Cost Optimization

  1. Batch similar requests when possible
  2. Use embeddings for semantic search instead of LLM queries
  3. Implement client-side caching for repeat queries
  4. Monitor usage stats to identify optimization opportunities
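Point 3 can be as simple as memoizing on a stable hash of the request payload. A sketch (the cache policy and helper names are ours, not part of the API; a real client would also bound the cache size and expire entries):

```python
import hashlib
import json

_cache: dict[str, dict] = {}

def cache_key(payload: dict) -> str:
    """Stable key: identical payloads hash identically regardless of key order."""
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()

def cached_chat(payload: dict, send) -> dict:
    """Serve repeat queries locally; `send` performs the real gateway call."""
    key = cache_key(payload)
    if key not in _cache:
        _cache[key] = send(payload)
    return _cache[key]

# Demonstration with a stand-in for the network call:
calls = []
fake_send = lambda p: (calls.append(p) or {"content": "ok"})
payload = {"messages": [{"role": "user", "content": "menu?"}], "tenant_id": "r1"}
cached_chat(payload, fake_send)
cached_chat(payload, fake_send)  # served from the local cache
print(len(calls))                # 1: the gateway was only called once
```

For time-sensitive queries, include a coarse timestamp in the payload to bust the key (the cache-busting point above).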

Error Responses

| Status | Code | Description |
| --- | --- | --- |
| 401 | Unauthorized | Invalid or missing JWT token |
| 500 | Internal Error | AI provider request failed |