AI Gateway
All endpoints require a valid JWT Bearer token and are reached through the API gateway at /v1/ai/*.
The AI Gateway provides unified access to multiple AI providers with cost-optimized tier-based routing, response caching, text-to-speech, and per-tenant usage tracking.
Overview
| Attribute | Value |
|---|---|
| Base Path | /api/v1/ai |
| Authentication | Bearer Token |
| Required Roles | analytics_admin, manager, restaurant_manager, tenant_admin, platform_admin, system_admin, super_admin |
| Proxy | Go gateway forwards all /v1/ai/* requests to Python Analytics service at /api/ai/* |
The Go API gateway proxies all requests under /v1/ai/* to the Python Analytics service. The Python service routes requests to the appropriate AI provider via the AIGatewayClient.
Architecture
```
Client Applications (Flutter Shells, API Consumers)
                      |
           Go API Gateway (/v1/ai/*)
                      |
      Python Analytics Service (/api/ai/*)
                      |
  AIGatewayClient (app.services.ai.gateway_client)
                      |
       Cloudflare AI Gateway (unified proxy)
                      |
       +--------------+------------+------------+
       |              |            |            |
  Workers AI      Anthropic      Google     OpenAI/Grok
    (FREE)         (Claude)     (Gemini)
```
Model Tiers
The ACP AI Router uses a tiered model system to optimize costs while maintaining quality. Each tier targets specific use cases.
| Tier | Model | Cost (input / output per M tokens) | Use Case |
|---|---|---|---|
| T1 | Llama 4 Scout (Workers AI) | FREE | Simple queries, greetings, basic classification |
| T2 | Gemini 2.0 Flash | $0.10 / $0.40 | Fast inference, real-time conversation |
| T3 | Gemini 3 Flash | $0.50 / $3.00 | Complex conversation, multi-step reasoning |
| T4 | Claude Haiku 4.5 | $1.00 / $5.00 | High-quality fast reasoning |
| T5 | Claude Sonnet 4.5 | $3.00 / $15.00 | Premium quality analysis |
| T6 | Claude Opus 4.5 | $5.00 / $25.00 | Enterprise strategic planning |
Cost Optimization
By routing 70%+ of queries to T1-T2 models, the platform achieves up to 95% cost reduction compared to always using premium models.
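A rough sanity check of that claim, using only the prices from the tier table; the traffic mix (70% T1, 20% T2, 10% T5) and the 500 input / 200 output tokens per request are illustrative assumptions:

```python
# Illustrative cost comparison. Only the per-million-token prices come from the
# tier table above; the routing mix and token counts are assumptions.
PRICES = {  # (input_cost_per_m, output_cost_per_m) in USD
    "t1": (0.00, 0.00),   # Llama 4 Scout (Workers AI)
    "t2": (0.10, 0.40),   # Gemini 2.0 Flash
    "t5": (3.00, 15.00),  # Claude Sonnet 4.5
}

def request_cost(tier: str, prompt_tokens: int, completion_tokens: int) -> float:
    inp, out = PRICES[tier]
    return prompt_tokens / 1e6 * inp + completion_tokens / 1e6 * out

mix = {"t1": 0.70, "t2": 0.20, "t5": 0.10}         # assumed routing mix
blended = sum(share * request_cost(t, 500, 200) for t, share in mix.items())
premium = request_cost("t5", 500, 200)              # everything on T5
print(f"blended: ${blended:.6f}/req, premium: ${premium:.6f}/req")
print(f"reduction: {1 - blended / premium:.0%}")    # ~89% with these assumptions
```

Pushing more traffic onto T1 (which is free) is what moves the reduction toward the quoted 95% figure.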
Chat Completion
Tier-Based Chat
Request a chat completion with cost-optimized tier routing. The specified tier determines which model handles the request.
POST /api/v1/ai/chat
Request Body
```json
{
  "messages": [
    {"role": "system", "content": "You are a helpful restaurant assistant."},
    {"role": "user", "content": "What are today's specials?"}
  ],
  "tenant_id": "restaurant-1",
  "tier": "t1",
  "temperature": 0.7,
  "max_tokens": 256,
  "stream": false,
  "cache_enabled": true
}
```
| Field | Type | Required | Description |
|---|---|---|---|
| messages | array | Yes | Conversation history (objects with role and content) |
| tenant_id | string | Yes | Tenant identifier for billing |
| tier | string | No | Model tier: t1 through t6 (default: t1) |
| temperature | float | No | Sampling temperature 0-2 (default: 0.7) |
| max_tokens | integer | No | Maximum tokens in response (max: 4096) |
| stream | boolean | No | Stream response (default: false) |
| cache_enabled | boolean | No | Enable response caching (default: true) |
Response
```json
{
  "content": "Today's specials include our pan-seared salmon with lemon butter sauce...",
  "model": "@cf/meta/llama-4-scout-17b-16e-instruct",
  "provider": "workers-ai",
  "tier": "t1",
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 45,
    "total_tokens": 70
  },
  "cached": false,
  "latency_ms": 320,
  "estimated_cost": 0.0
}
```
| Field | Type | Description |
|---|---|---|
| content | string | Generated response text |
| model | string | Model identifier used |
| provider | string | AI provider (workers-ai, anthropic, google, openai, grok) |
| tier | string | Tier used for the request |
| usage | object | Token usage breakdown |
| cached | boolean | Whether the response was served from cache |
| latency_ms | integer | Request latency in milliseconds |
| estimated_cost | float | Estimated cost in USD |
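A minimal request sketch using Python's requests library. The base URL and the ACP_TOKEN environment variable are placeholders for your deployment; the payload fields are those documented above:

```python
import os
import requests

BASE_URL = "https://api.example.com"  # placeholder for your gateway host
headers = {"Authorization": f"Bearer {os.environ['ACP_TOKEN']}"}

payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful restaurant assistant."},
        {"role": "user", "content": "What are today's specials?"},
    ],
    "tenant_id": "restaurant-1",
    "tier": "t1",            # start on the free tier
    "max_tokens": 256,
    "cache_enabled": True,
}

resp = requests.post(f"{BASE_URL}/api/v1/ai/chat", json=payload,
                     headers=headers, timeout=30)
resp.raise_for_status()
data = resp.json()
print(data["content"])
print(f'{data["tier"]} via {data["provider"]}, cached={data["cached"]}, '
      f'cost=${data["estimated_cost"]:.4f}')
```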
Direct Model Access
Access a specific model directly without tier-based routing. Use this when you need specific provider capabilities.
POST /api/v1/ai/chat/direct
Request Body
```json
{
  "messages": [
    {"role": "user", "content": "Analyze this menu item description for appeal"}
  ],
  "tenant_id": "restaurant-1",
  "provider": "anthropic",
  "model": "claude-haiku-4.5-latest",
  "temperature": 0.7,
  "max_tokens": 256,
  "stream": false
}
```
| Field | Type | Required | Description |
|---|---|---|---|
| messages | array | Yes | Conversation history |
| tenant_id | string | Yes | Tenant identifier for billing |
| provider | string | Yes | AI provider: workers-ai, anthropic, google, openai, grok |
| model | string | Yes | Specific model ID (e.g., claude-haiku-4.5-latest, gpt-4o, gemini-2.0-flash) |
| temperature | float | No | Sampling temperature 0-2 (default: 0.7) |
| max_tokens | integer | No | Maximum tokens (max: 4096) |
| stream | boolean | No | Stream response (default: false) |
Response
Same format as the tier-based chat response.
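The call shape mirrors the tiered endpoint. A sketch under the same placeholder assumptions (host and token):

```python
import os
import requests

BASE_URL = "https://api.example.com"  # placeholder for your gateway host
headers = {"Authorization": f"Bearer {os.environ['ACP_TOKEN']}"}

payload = {
    "messages": [{"role": "user",
                  "content": "Analyze this menu item description for appeal"}],
    "tenant_id": "restaurant-1",
    "provider": "anthropic",             # pin the provider...
    "model": "claude-haiku-4.5-latest",  # ...and the exact model, bypassing tier routing
    "max_tokens": 256,
}
resp = requests.post(f"{BASE_URL}/api/v1/ai/chat/direct", json=payload,
                     headers=headers, timeout=30)
resp.raise_for_status()
print(resp.json()["model"])  # echoes the model that served the request
```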
Text-to-Speech
Synthesize Speech
Convert text to speech using ElevenLabs. Returns audio data as a streaming response.
POST /api/v1/ai/tts
Request Body
```json
{
  "text": "Welcome to our restaurant! Today's special is pan-seared salmon.",
  "tenant_id": "restaurant-1",
  "voice_id": "21m00Tcm4TlvDq8ikWAM",
  "tts_model_id": "eleven_turbo_v2_5",
  "output_format": "mp3_44100_128"
}
```
| Field | Type | Required | Description |
|---|---|---|---|
| text | string | Yes | Text to synthesize (max 5000 characters) |
| tenant_id | string | Yes | Tenant identifier for billing |
| voice_id | string | No | ElevenLabs voice ID (default: 21m00Tcm4TlvDq8ikWAM, Rachel) |
| tts_model_id | string | No | TTS model (default: eleven_turbo_v2_5) |
| output_format | string | No | Audio format (default: mp3_44100_128) |
Response
Returns audio data as a streaming binary response with appropriate content type:
- audio/mpeg for MP3 formats
- audio/wav for WAV formats
- audio/pcm for PCM formats
Content-Disposition header: attachment; filename=speech.mp3
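Because the body is streamed binary audio, write it to disk in chunks rather than buffering it. A sketch with the same placeholder host and token as above:

```python
import os
import requests

BASE_URL = "https://api.example.com"  # placeholder for your gateway host
headers = {"Authorization": f"Bearer {os.environ['ACP_TOKEN']}"}

payload = {
    "text": "Welcome to our restaurant! Today's special is pan-seared salmon.",
    "tenant_id": "restaurant-1",
    # voice_id / tts_model_id / output_format fall back to the defaults above
}

with requests.post(f"{BASE_URL}/api/v1/ai/tts", json=payload, headers=headers,
                   stream=True, timeout=60) as resp:
    resp.raise_for_status()
    with open("speech.mp3", "wb") as f:
        for chunk in resp.iter_content(chunk_size=8192):  # stream audio to disk
            f.write(chunk)
```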
Models & Tiers
List Available Models
List all available AI models grouped by provider with pricing information.
GET /api/v1/ai/models
Response
```json
{
  "workers-ai": [
    {
      "model_id": "@cf/meta/llama-4-scout-17b-16e-instruct",
      "provider": "workers-ai",
      "tier": null,
      "input_cost_per_m": 0.0,
      "output_cost_per_m": 0.0,
      "max_tokens": 4096,
      "supports_streaming": true,
      "supports_vision": false
    }
  ],
  "anthropic": [
    {
      "model_id": "claude-haiku-4.5-latest",
      "provider": "anthropic",
      "tier": null,
      "input_cost_per_m": 1.0,
      "output_cost_per_m": 5.0,
      "max_tokens": 4096,
      "supports_streaming": true,
      "supports_vision": true
    }
  ]
}
```
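One use of this catalog is capability-based selection, e.g. finding the cheapest vision-capable model for a direct call. A sketch, with the same placeholder host and token:

```python
import os
import requests

BASE_URL = "https://api.example.com"  # placeholder for your gateway host
headers = {"Authorization": f"Bearer {os.environ['ACP_TOKEN']}"}

catalog = requests.get(f"{BASE_URL}/api/v1/ai/models", headers=headers, timeout=30)
catalog.raise_for_status()

# Flatten the provider-keyed mapping and keep only vision-capable models.
candidates = [m for models in catalog.json().values()
              for m in models if m["supports_vision"]]
cheapest = min(candidates,
               key=lambda m: m["input_cost_per_m"] + m["output_cost_per_m"])
print(cheapest["provider"], cheapest["model_id"])
```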
Get Tier Information
Get detailed information about each model tier including associated models, pricing, and use cases.
GET /api/v1/ai/tiers
Response
```json
{
  "tiers": [
    {
      "tier": "t1",
      "name": "FREE",
      "model": "Llama 4 Scout",
      "provider": "workers-ai",
      "input_cost_per_m": 0.0,
      "output_cost_per_m": 0.0,
      "use_cases": ["Simple queries", "Greetings", "Basic FAQ"]
    },
    {
      "tier": "t2",
      "name": "BUDGET",
      "model": "Gemini 2.0 Flash",
      "provider": "google",
      "input_cost_per_m": 0.10,
      "output_cost_per_m": 0.40,
      "use_cases": ["Real-time conversation", "Summaries"]
    },
    {
      "tier": "t3",
      "name": "STANDARD",
      "model": "Gemini 3 Flash",
      "provider": "google",
      "input_cost_per_m": 0.50,
      "output_cost_per_m": 3.00,
      "use_cases": ["Complex conversation", "Analysis"]
    },
    {
      "tier": "t4",
      "name": "QUALITY",
      "model": "Claude Haiku 4.5",
      "provider": "anthropic",
      "input_cost_per_m": 1.00,
      "output_cost_per_m": 5.00,
      "use_cases": ["Fast reasoning", "Detailed analysis"]
    },
    {
      "tier": "t5",
      "name": "PREMIUM",
      "model": "Claude Sonnet 4.5",
      "provider": "anthropic",
      "input_cost_per_m": 3.00,
      "output_cost_per_m": 15.00,
      "use_cases": ["High-quality analysis", "Report generation"]
    },
    {
      "tier": "t6",
      "name": "ENTERPRISE",
      "model": "Claude Opus 4.5",
      "provider": "anthropic",
      "input_cost_per_m": 5.00,
      "output_cost_per_m": 25.00,
      "use_cases": ["Strategic planning", "Complex reasoning"]
    }
  ],
  "routing_strategy": "Cost-optimized with fallback. Uses lowest cost model capable of handling the request complexity."
}
```
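The estimated_cost field in chat responses follows directly from these per-million-token prices. A worked check, using a hypothetical helper and the t4 pricing above together with the usage block from the earlier chat response:

```python
def estimate_cost(usage: dict, input_cost_per_m: float,
                  output_cost_per_m: float) -> float:
    """Reproduce estimated_cost from a usage block and tier pricing."""
    return (usage["prompt_tokens"] / 1e6 * input_cost_per_m
            + usage["completion_tokens"] / 1e6 * output_cost_per_m)

usage = {"prompt_tokens": 25, "completion_tokens": 45, "total_tokens": 70}
print(f"${estimate_cost(usage, 1.00, 5.00):.6f}")  # t4 pricing -> $0.000250
```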
Usage & Monitoring
Get Usage Statistics
Get AI usage statistics for a tenant including token counts, costs, and cache efficiency.
GET /api/v1/ai/usage/{tenant_id}
Path Parameters
| Parameter | Type | Description |
|---|---|---|
| tenant_id | string | Tenant identifier |
Query Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| period | string | No | Time period: hour, day, week, month (default: day) |
Response
```json
{
  "tenant_id": "restaurant-1",
  "period": "day",
  "total_requests": 1450,
  "total_tokens": 125000,
  "tokens_by_tier": {
    "t1": 85000,
    "t2": 25000,
    "t4": 15000
  },
  "estimated_cost": 12.50,
  "cache_hit_rate": 0.35
}
```
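A small monitoring sketch built on the fields above, flagging a low cache hit rate or a daily spend over budget. The host, token, tenant, and thresholds are all placeholder assumptions:

```python
import os
import requests

BASE_URL = "https://api.example.com"  # placeholder for your gateway host
headers = {"Authorization": f"Bearer {os.environ['ACP_TOKEN']}"}

resp = requests.get(f"{BASE_URL}/api/v1/ai/usage/restaurant-1",
                    params={"period": "day"}, headers=headers, timeout=30)
resp.raise_for_status()
stats = resp.json()

if stats["cache_hit_rate"] < 0.25:   # assumed threshold
    print("warning: cache hit rate below 25%; review cache_enabled usage")
if stats["estimated_cost"] > 20.00:  # assumed daily budget in USD
    print("warning: daily AI spend over budget")
print(stats["tokens_by_tier"])       # spot-check the tier mix
```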
Health Check
Check AI Gateway health and provider availability.
GET /api/v1/ai/health
Response
```json
{
  "status": "healthy",
  "gateway_url": "https://gateway.ai.cloudflare.com/v1/...",
  "providers": {
    "workers-ai": "healthy",
    "anthropic": "healthy",
    "google": "healthy"
  },
  "cache_enabled": true,
  "checked_at": "2026-02-19T14:30:00Z"
}
```
When the health check fails, the endpoint returns:
```json
{
  "status": "unhealthy",
  "error": "Connection timeout",
  "checked_at": "2026-02-19T14:30:00Z"
}
```
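One way to use this endpoint is to gate direct-model calls on per-provider health; the tiered endpoint already falls back server-side per the routing strategy. A sketch under the same placeholder assumptions:

```python
import os
import requests

BASE_URL = "https://api.example.com"  # placeholder for your gateway host
headers = {"Authorization": f"Bearer {os.environ['ACP_TOKEN']}"}

health = requests.get(f"{BASE_URL}/api/v1/ai/health",
                      headers=headers, timeout=10).json()

if health["status"] != "healthy":
    raise RuntimeError(f"AI Gateway unhealthy: {health.get('error', 'unknown')}")

# Skip providers the gateway reports as degraded before issuing direct calls.
usable = [p for p, state in health["providers"].items() if state == "healthy"]
print("providers available for /chat/direct:", usable)
```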
Best Practices
Model Selection
Start with T1/T2 tiers for most workloads. The ACP Router can achieve up to 95% cost reduction by routing 70%+ of queries to lower-cost models without sacrificing quality.
- Start with T1 for simple tasks (greetings, basic queries)
- Use T2 for most production workloads (real-time conversation)
- Reserve T4-T6 for complex reasoning and strategic planning
Caching Strategy
- Enable caching for identical queries (menu lookups, FAQ)
- Set appropriate TTL based on content freshness needs
- Use cache-busting for time-sensitive queries
Cost Optimization
- Batch similar requests when possible
- Use embeddings for semantic search instead of LLM queries
- Implement client-side caching for repeat queries (see the sketch after this list)
- Monitor usage stats to identify optimization opportunities
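A minimal client-side cache for the bullet above, keyed on a hash of the request payload. This is illustrative only; the gateway's own server-side cache is controlled separately via cache_enabled:

```python
import hashlib
import json

_cache: dict[str, dict] = {}

def cached_chat(payload: dict, do_request) -> dict:
    """Memoize identical chat payloads client-side.

    do_request is whatever callable performs the real POST /api/v1/ai/chat.
    """
    key = hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()
    if key not in _cache:
        _cache[key] = do_request(payload)
    return _cache[key]
```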
Error Responses
| Status | Code | Description |
|---|---|---|
| 401 | Unauthorized | Invalid or missing JWT token |
| 500 | Internal Error | AI provider request failed |
Related Documentation
- Voice AI - Voice AI assistant
- LangGraph Agents - Agent orchestration
- Recommendations - AI recommendations
- Forecasting - ML forecasting